Data processing with multi-database, off database, log, txt and excel: Method of Computing Link Relative Ratio and Year-on-year Comparison in R Language

Cross-row and cross-group computation often involves the link relative ratio and the year-on-year comparison. The link relative ratio compares the current period's data with that of the previous period. The interval is usually a month: comparing April's sales amount with March's gives a growth rate, which is April's link relative ratio. Hour, day, week and quarter can also serve as the interval. The year-on-year comparison compares the current period with the corresponding period of the previous year: for example, comparing the sales amount of April 2014 with that of April 2013 gives April's year-on-year growth rate. In practice, several periods are usually compared in order to see the trend.
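As a quick illustration of the two definitions, here is a minimal Python sketch. The function name growth_rate and the sales figures are made up for the example; they do not come from the orders table.

```python
def growth_rate(current, previous):
    """Return (current - previous) / previous, or None when previous is 0."""
    if previous == 0:
        return None
    return (current - previous) / previous

# Hypothetical monthly sales amounts keyed by (year, month).
sales = {(2013, 3): 900.0, (2013, 4): 1000.0,
         (2014, 3): 1100.0, (2014, 4): 1200.0}

# Link relative ratio of April 2014: compare with March 2014.
lrr_april = growth_rate(sales[(2014, 4)], sales[(2014, 3)])

# Year-on-year comparison of April 2014: compare with April 2013.
yoy_april = growth_rate(sales[(2014, 4)], sales[(2013, 4)])

print(round(lrr_april, 4), round(yoy_april, 4))
```

Both figures use the same formula; only the reference period changes.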

Now let's look at the method of computing link relative ratio and year-on-year comparison in R language through an example.

Case description:

Compute the link relative ratio and the year-on-year comparison of each month's sales amount during a specified period of time. The data come from the orders table sales, in which the Amount column holds the order amount and the OrderDate column holds the order date. Some of the data are as follows:

Code:
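# Read the tab-separated orders file (columns include OrderDate and Amount)
sales<-read.table("E:\\salesGroup.txt",sep="\t", header=TRUE)
# Keep orders inside the reporting period (fixed here; could be read with scan)
filtered<-subset(sales,as.POSIXlt(OrderDate)>=as.POSIXlt('2011-01-01 00:00:00') & as.POSIXlt(OrderDate)<=as.POSIXlt('2014-08-29 00:00:00'))
# Derive year and month columns
filtered$y<-format(as.POSIXlt(filtered$OrderDate),'%Y')
filtered$m<-format(as.POSIXlt(filtered$OrderDate),'%m')
# Sum Amount per month; listing m first yields rows ordered by year, then month
agged<-aggregate(filtered$Amount, filtered[,c("m","y")],sum)
# Link relative ratio: (current - previous)/previous, with 0 for the first month
agged$lrr<- c(0, (agged$x[-1]-agged$x[-length(agged$x)])/agged$x[-length(agged$x)])
# Sort by month so the same month of consecutive years sits on adjacent rows
result<-agged[order(agged$m),]
result$yoy<-NA
for(i in seq_len(nrow(result))){
  if(i>1 && result[i,]$m==result[i-1,]$m){
    result[i,]$yoy<-(result[i,]$x-result[i-1,]$x)/result[i-1,]$x
  }
}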
Code interpretation:
1. The first four statements are straightforward: read.table reads the data from the file, subset filters it by date, and two format calls generate the year and the month. Note that the start and end times should really be read in dynamically from the console with the scan function; here they are simplified to fixed constants.
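For readers who think in Python, the read, filter and derive steps can be sketched with the standard library alone. This is only an illustration: the sample rows are made up, the function name filter_orders is hypothetical, and the date format is assumed to be the one used in the Pandas code later in this article.

```python
import csv
import io
from datetime import datetime

# Made-up rows in a tab-separated layout with OrderDate and Amount columns.
SAMPLE = (
    "OrderDate\tAmount\n"
    "2010-12-31 10:00:00\t50\n"
    "2011-01-05 09:30:00\t100\n"
    "2014-08-29 00:00:00\t70\n"
    "2014-09-01 08:00:00\t30\n"
)
FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def filter_orders(text, start, end):
    """Keep rows with start <= OrderDate <= end and add year/month fields."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        when = datetime.strptime(rec["OrderDate"], FMT)
        if start <= when <= end:
            rows.append({"y": when.year, "m": when.month,
                         "Amount": float(rec["Amount"])})
    return rows

# The period bounds are parameters here, so they could come from user input.
filtered = filter_orders(SAMPLE, datetime(2011, 1, 1), datetime(2014, 8, 29))
print(filtered)
```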

After computing, some of the values of data frame filtered are:

2. agged<-aggregate(filtered$Amount, filtered[,c("m","y")],sum) sums the order amounts for each month of each year. Note that the month must be written before the year in the code, even though the business logic groups by year and then by month. aggregate lets the first grouping column vary fastest, so c("m","y") produces rows ordered by year and then month; with the year first, the rows would come out ordered by month and then year, which is inconsistent with the business logic and makes the data inconvenient to view.
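The grouping step itself is simple to picture. Below is a minimal Python sketch of summing amounts per (year, month) and listing the totals chronologically; the rows are invented for the example and the variable names are only illustrative.

```python
from collections import defaultdict

# Hypothetical filtered rows: (year, month, amount).
rows = [(2011, 2, 40.0), (2011, 1, 100.0), (2012, 1, 80.0), (2011, 1, 20.0)]

# Accumulate one total per (year, month) group.
totals = defaultdict(float)
for year, month, amount in rows:
    totals[(year, month)] += amount

# Sorting the (year, month) keys gives chronological order:
# year first, then month within each year.
agged = [(y, m, totals[(y, m)]) for (y, m) in sorted(totals)]
print(agged)
```

Chronological order matters here because the link relative ratio in the next step compares each row with the row directly before it.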

After computing, some of the values of data frame agged are:

3. agged$lrr<- c(0, (agged$x[-1]-agged$x[-length(agged$x)])/agged$x[-length(agged$x)]) computes the link relative ratio and stores it in the new column lrr. The business logic is (order amount of the current month - order amount of the previous month) / order amount of the previous month.

Note: a negative index [-N] removes the Nth element. So agged$x[-1] is the column without its first value, and agged$x[-length(agged$x)] is the column without its last value. Lining the two vectors up pairs each month with the month before it, so subtracting and dividing them element by element yields the link relative ratio indirectly. The result has no entry for the first month (January 2011), so a zero is prepended. The code logic resembles the business logic only loosely, which makes the code hard to understand.
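The offset trick may be easier to follow with concrete numbers. Here is a small Python sketch of the same idea using list slices; the monthly totals are made up, and padding the first month with 0 simply mirrors the R code (None or NaN would also be defensible).

```python
# Hypothetical monthly totals in chronological order.
x = [100.0, 120.0, 90.0, 135.0]

current = x[1:]     # like agged$x[-1]: everything except the first value
previous = x[:-1]   # like agged$x[-length(agged$x)]: everything except the last

# Pair each month with the month before it and apply
# (current - previous) / previous; the first month gets a placeholder 0.
lrr = [0.0] + [(c - p) / p for c, p in zip(current, previous)]
print([round(v, 4) for v in lrr])
```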

At this point, some of the values of data frame agged are:

4. result<-agged[order(agged$m),] sorts the data by month and then year. Since the rows are already ordered by year, sorting by the month alone is enough. result$yoy<-NA initializes a new column that will hold the year-on-year comparison of the sales amount.

Now the value of result is:

5. The loop at the end of the code computes the year-on-year comparison. Business logic: (order amount of the current month - order amount of the same month of the previous year) / order amount of the same month of the previous year. Code logic: starting from the second row, if the month in the current row is the same as in the previous row, the previous row holds the same month of the preceding year, so the code computes the year-on-year comparison from the two rows. The statement is result[i,]$yoy<-(result[i,]$x-result[i-1,]$x)/result[i-1,]$x. Code written this way is easy to understand, and its logic is close to the business logic.

The only weakness of this piece of code is that it uses an explicit loop statement rather than one of R's loop functions, which makes it a little lengthy. But compared with the hard-to-read link relative ratio computation, longer but simpler code may be the better choice.
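The year-on-year step can also be sketched without sorting at all. The Python fragment below is a different technique from the article's sort-and-compare loop: it looks up the same month one year earlier in a dictionary, so a missing year yields no value instead of a comparison across a gap. The totals and the function name year_on_year are hypothetical.

```python
# Hypothetical monthly totals keyed by (year, month).
totals = {(2011, 1): 100.0, (2011, 2): 50.0,
          (2012, 1): 150.0, (2013, 2): 60.0}

def year_on_year(totals):
    """Map each (year, month) to its growth over the same month a year earlier."""
    result = {}
    for (year, month), amount in sorted(totals.items()):
        base = totals.get((year - 1, month))
        # No value when the earlier month is absent or its total is 0.
        result[(year, month)] = None if not base else (amount - base) / base
    return result

yoy = year_on_year(totals)
print(yoy)
```

In this sample, February 2013 gets no value because February 2012 is absent, whereas comparing adjacent sorted rows would have paired it with February 2011.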

The final results are as follows:

Summary: R can compute the link relative ratio and the year-on-year comparison, but the link relative ratio code is hard to understand and the year-on-year code is a little lengthy. Neither is easy to learn.

Third-party solutions

Python, esProc and Perl, all of which can perform structured data computation, can be used to handle this case. In the following, we'll briefly introduce esProc and Python's solutions.

esProc

esProc is good at expressing business logic freely with agile syntax. Its code is concise and easy, as shown below:

In the above code, the groups function groups and summarizes the data by year and month. The derive functions in A4 and A6 generate the link relative ratio and the year-on-year comparison respectively.

As can be seen from the code, esProc also uses [-N]. Unlike [-N] in R, it does not mean removing the Nth element; it means the Nth row before the current one. For example, [-1] is the previous row. The link relative ratio can therefore be written simply as (x-x[-1])/x[-1]. R has no expression for relative positions, which makes its code harder to understand.

In the year-on-year computation, esProc calls the if function inside a loop function, which avoids an explicit loop statement and keeps the code short. The R code above uses an if statement inside a for loop instead, which is why it is longer. (R does have a vectorized ifelse function, but using it here would again require the offset-vector trick from step 3.)

Finally, these are the computed results:

Python (Pandas)
Pandas is a third-party package for Python. Its basic data type is modeled on R's data frame, with considerable improvements. At the time of writing, its latest version is 0.14. Its code for handling this case is as follows:

import numpy
import pandas

sales = pandas.read_csv('E:\\salesGroup.txt',sep='\t')
sales['OrderDate']=pandas.to_datetime(sales.OrderDate,format='%Y-%m-%d %H:%M:%S')
# .copy() avoids adding columns to a view of sales
filtered=sales[(sales.OrderDate>='2011-01-01 00:00:00') & (sales.OrderDate<='2014-08-29 00:00:00')].copy()
filtered['y']=filtered.OrderDate.apply(lambda x: x.year)
filtered['m']=filtered.OrderDate.apply(lambda x: x.month)
grouped=filtered.groupby(['y','m'],as_index=False)
# 'sum' (not [sum]) keeps a flat Amount column
agged=grouped.agg({'Amount':'sum'})
agged['lrr']=agged['Amount'].pct_change()
# Pandas 0.14 API; later versions spell this sort_values(by=['m','y'])
result=agged.sort_index(by=['m','y'])
result.reset_index(drop=True,inplace=True)
result['yoy']=numpy.nan
for row_index, row in result.iterrows():
    # .ix is the 0.14 indexer; later versions removed it in favor of .loc
    if row_index>0 and result.ix[row_index,'m']==result.ix[row_index-1,'m']:
        result.ix[row_index,'yoy']=(result.ix[row_index,'Amount']-result.ix[row_index-1,'Amount'])/result.ix[row_index-1,'Amount']

In the code, the pct_change() function computes the link relative ratio directly, which is more convenient than the approach used in R and esProc. But this function covers only this particular case and is not a general mechanism. When a link relative ratio or year-on-year comparison does not fit what pct_change offers, Pandas has to combine the div and shift functions, which makes the code harder to understand than R's.
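To make the div and shift combination concrete, here is a minimal sketch. It assumes a recent Pandas release rather than the 0.14 version the article was written against, and the monthly totals are invented. The year-on-year part relies on a grouped shift, which is one possible approach and not the article's method.

```python
import pandas as pd

# Hypothetical monthly totals, already in chronological order.
agged = pd.DataFrame({
    "y": [2011, 2011, 2012, 2012],
    "m": [1, 2, 1, 2],
    "Amount": [100.0, 120.0, 150.0, 90.0],
})

# Link relative ratio: shift(1) lines each row up with the previous row.
prev = agged["Amount"].shift(1)
agged["lrr"] = agged["Amount"].sub(prev).div(prev)

# Year-on-year: previous row within the same month; this is the preceding
# year only because the rows are ordered by year and no year is missing.
prev_year = agged.groupby("m")["Amount"].shift(1)
agged["yoy"] = agged["Amount"].sub(prev_year).div(prev_year)

print(agged)
```

Rows without a predecessor come out as NaN rather than 0, which differs from the zero padding in the R code.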

In computing the year-on-year comparison, Pandas' code is as lengthy as R's, because here too the if test sits in an explicit loop rather than inside a loop function. Writing it more compactly would probably require combining the apply function with lambda syntax.

Finally, let's look at the computed results:

Note the following details, which are easy to get wrong:

1. In the Pandas version used here, sorting by the month and the year must be written as sort_index(by=['m','y']); the simple single-key form used in R and esProc is not allowed. Later Pandas versions replace this call with sort_values(by=['m','y']).

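2. Pandas offers the assignment syntax result.loc[row_index,'yoy']=value, but the code above, written for version 0.14, assigns a single element with result.ix[row_index,'yoy']=value. Later Pandas versions deprecated and then removed .ix, so on a current installation .loc (or .at for a single cell) is the form to use.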

3. When iterrows() is used for looping, its loop variable row_index is the index label rather than the row number. To make the index match the row number, reset_index() should be called first.
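The third point is easy to check on a tiny frame. The sketch below assumes a recent Pandas release, so it sorts with sort_values instead of the article's sort_index(by=...); the data are made up.

```python
import pandas as pd

# Hypothetical year/month rows in their original order.
df = pd.DataFrame({"m": [2, 1, 2, 1], "y": [2011, 2011, 2012, 2012]})

# Sorting reorders the rows but keeps their original index labels.
result = df.sort_values(["m", "y"])
labels_before = [idx for idx, _ in result.iterrows()]

# After reset_index the labels are 0..n-1 and match the row positions.
result = result.reset_index(drop=True)
labels_after = [idx for idx, _ in result.iterrows()]

print(labels_before, labels_after)
```

Without the reset, an expression such as row_index-1 would point at a label that is not the row directly above.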

Tuesday, September 9, 2014
