Cross-row
and –group computation often involves computing link relative ratio and
year-on-year comparison. Link relative ratio refers to comparison between the
current data and data of the previous period. Generally, it takes month as the
time interval. For example, compare the sales amount of April with that of March,
and the growth rate we get is the link relative ratio of April. Hour, day, week
and quarter can also be used as the time interval. Year-on-year comparison is
the comparison between the current data and data of the corresponding period of
the previous year. For example, compare the sales amount of April 2014 with
that of April 2013 and compute the growth rate which is April's year-on-year
comparison. Data of multiple periods are usually compared to find the variation
trend in practical business.
Now let's look at the method of computing link relative ratio and
year-on-year comparison in R language through an example.
Case description:
Compute the link relative ratio and year-on-year comparison of each
month's sales amount during a specified period of time. The data come from orders
table sales, in which column Amount
contains order amount and column OrderDate contains order dates. Some of the
data are as follows:
Code:
sales<-read.table("E:\\ salesGroup.txt",sep="\t", header=TRUE)
filtered<-subset(sales,as.POSIXlt(OrderDate)>=as.POSIXlt('2011-01-01 00:00:00') &as.POSIXlt(OrderDate)<=as.POSIXlt('2014-08-29 00:00:00'))
filtered$y<-format(as.POSIXlt(filtered$OrderDate),'%Y')
filtered$m<-format(as.POSIXlt(filtered$OrderDate),'%m')
agged<-aggregate(filtered$Amount, filtered[,c("m","y")],sum)
agged$lrr<- c(0, (agged$x[-1]-agged$x[-length(agged$x)])/agged$x[-length(agged$x)])
result<-agged[order(agged$m),]
result$yoy<-NA
for(i in 1:nrow(result)){
if(i>1 && result[i,]$m==result[i-1,]$m){
result[i,]$yoy<-(result[i,]$x-result[i-1,]$x)/result[i-1,]$x
}
}
Code interpretation:
1. The first four lines of code are easy to understand. read.table is used to read data from the table and subset to filter data, and two format functions are used to generate year and month respectively. Note that the beginning and ending time should be output dynamically from the console using scan function; here they are simplified as fixed constants.
After computing, some of the values of database frame filtered are:
2. agged<-aggregate(filtered$Amount,
filtered[,c("m","y")],sum), this
line of code summates the order amount of each month of each year. Note that in
the code, the month must be written before the year though data are grouped by
the year and the month according to business logic. Otherwise R language will
perform grouping first by the month, then by the year, which will get result
inconsistent with business logic and make data viewing inconvenient.
3. agged$lrr<- c(0,
(agged$x[-1]-agged$x[-length(agged$x)])/agged$x[-length(agged$x)]),this line of code computes
link relative ratio. The result will be stored in the new column Irr. Business logic is (order amount of
the current month – order amount of the previous month)\order amount of the
previous month.
Note: [-N] in the code represents that the Nth row of data
is removed. So agged$x[-1]means
the first row of data is removed and agged$x[-length(agged$x)]means the last row of data is removed. By
performing certain operation between the two, link relative ratio can be
obtained indirectly. But the result won’t include the link relative ratio of
the first month (i.e. January 2011), so a zero should be added to the code. We
can see that the code logic and the business logic share some similarities but
are quite different. The code is difficult to understand.
4. result<-agged[order(agged$m),],
this line of code sorts data by the month and the year. Since the data of the
year are ordered, we just need to perform sorting by the month. result$yoy<-NA initializes
a new column which will be used to store the year-on-year comparison of sales
amount.
5. The loop judgment in the last four lines of code is to compute the
year-on-year comparison. Business logic: (order amount of the current month –
order amount of the previous month)\order amount of the previous month. Code
logic: from the second line, if the month in the current line is the same as
that in the previous line, the code will compute year-on-year comparison.
Detailed code is result[i,]$yoy<-(result[i,]$x-result[i-1,]$x)/result[i-1,]$x. We can see that the code written in
this way is easy to understand and its logic is quite similar to the business
logic.
The only weakness of
this piece of code is that it cannot use the loop function of R language, which makes it a little lengthy. But
compared with the difficult operation of link relative ratio, maybe a longer
but simple code is better.
Summary:R language can compute link relative ratio
and year-on-year comparison, but the operation of link relative ratio is
difficult to understand and the code of year-on-year comparison is a little lengthy.
The codes of both operations are not easy to learn.
The third-party solution
Python,
esProc and Perl, all of which can perform structured data computation, can be
used to handle this case. In the following, we'll briefly introduce esProc and
Python's solutions.
esProc
esProc is good at expressing business logic
freely with agile syntax. Its code is concise and easy, as shown below:
In the above code, groups function is used to group and summarize data by the year and
the month. The derive functions in A4
and A6 generate link relative ratio and year-on-year comparison respectively.
As can be seen from the code, esProc also
uses[-N]. Different from [-N] in R language, it doesn't represent removing the
Nth row; it represents the Nth row counted from the current line. For example,
[-1] is the previous line. In this way, the operation of link relative ratio
can be simply expressed as (x-x[-1])/x[-1].But R language hasn't expressions
for relative positions, which makes its code difficult to understand.
In the year-on-year comparison operation,
esProc uses judgment function if in loop function, making it avoid the
lengthy loop statement and its code simpler. While R language only has the
judgment statement but hasn't the judgment function. This is the reason why its
code is lengthy.
Finally, these are the computed results:Python(Pandas)
Pandas is Python's third-party package. Its basic data type is created by imitating R’s dataframe but gets improved greatly. At present, its latest version is 0.14. Its code for handling this case is as follows:
sales = pandas.read_csv('E:\\salesGroup.txt',sep='\t')
sales['OrderDate']=pandas.to_datetime(sales.OrderDate,format='%Y-%m-%d %H:%M:%S')
filtered=sales[(sales.OrderDate>='2011-01-01 00:00:00') & (sales.OrderDate<='2014-08-29 00:00:00')]
filtered['y']=filtered.OrderDate.apply(lambda x: x.year)
filtered['m']=filtered.OrderDate.apply(lambda x: x.month)
grouped=filtered.groupby(['y','m'],as_index=False)
agged=grouped.agg({'Amount':[sum]})
agged['lrr']=agged['Amount'].pct_change()
result=agged.sort_index(by=['m','y'])
result.reset_index(drop=True,inplace=True)
result['yoy']=result.apply(lambda _:numpy.nan, axis=1)
for row_index, row in result.iterrows():
if(row_index>0 and result.ix[row_index,'m']==result.ix[row_index-1,'m']):
result.ix[row_index,'yoy']=(result.ix[row_index,'Amount']-result.ix[row_index-1,'Amount'])/result.ix[row_index-1,'Amount']
In the code, pct_change() function is used to directly compute the link relative ratio, which is more convenient than the method used by R language and esProc. But this kind of function is not universal and can only deal with isolated cases. When it is required to compute link relative ratio or year-on-year comparison, Pandas can only complete the task by combining div function and shift function, which makes its code more difficult to understand than R’s.
In computing year-on-year comparison, Pandas' code is as lengthy as R’s. This is because Pandas also cannot use if function in loop function. I’m afraid cooperation of apply function and lambda syntax is needed if we want to write simpler code.
Finally, let's look at the computed results:
Please pay attention to the following
easy-to-get-wrong details:
1. The code must be sort_index(by=['m','y'])when
we perform sorting by the month and the year. The simple form sort(m), which used in R
language and esProc, is not allowed.
2. Pandas has the assignment syntax as result.loc[row_index,‘yoy’]=value.
But when assigning value to a certain element in data frame, we should write
the code as result.ix[row_index,'yoy']=value.
3. When iterrows() is used to perform loop, its loop number
row_index is index
instead of row number. To make the row number conform to the index, reset_index() should be used
to reset the indexes.
没有评论:
发表评论