Wednesday, August 20, 2014

Comparison Between esProc’s Sequence Table Object and R’s Data Frame, Part I

Both esProc and R are typical data processing and analysis languages with two-dimensional structured data objects, and both are good at multi-step complex computations. However, their two-dimensional structured data objects differ considerably in their underlying mechanisms. As a result, esProc is better at computation on structured data and is especially suitable for developers doing business computing, while R is better at matrix computation and more suitable for scientists doing scientific or engineering computation.

esProc's two-dimensional structured data type is the sequence table object (TSeq). A sequence table is based on records: multiple records form a row-oriented two-dimensional table, which together with the column names makes up a complete data structure. R's data frame is based on vectors: multiple vectors form a column-oriented two-dimensional table, which together with the column names makes up a complete data structure.
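The column-oriented nature of the data frame can be checked directly in R. The following is a minimal sketch with hypothetical sample values (not taken from sales.txt); it only illustrates that a data frame behaves as a list of column vectors and that a row is assembled on demand.

```r
# Hypothetical sample data, used only for illustration
df <- data.frame(
  OrderID = c(1, 2, 3),
  Client  = c("AA", "BB", "CC"),
  Amount  = c(100.5, 200, 300.25),
  stringsAsFactors = FALSE
)

# A data frame is a list with one vector per column
is_list <- is.list(df)
n_cols  <- length(df)        # length() counts columns, not rows

# A single column is a plain vector
col <- df[["Amount"]]

# A row is not stored as a unit; indexing builds a one-row data frame
row <- df[1, ]
```

Because each column is stored as a separate vector, column-wise operations are natural in R, while row-wise operations have to gather one element from every column.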

These underlying mechanisms affect the actual user experience. In the following sections we compare the sequence table object and the data frame in practical use, in terms of basic functions, advanced features, actual use cases and test results.

Note: only the native functions of each language are used in the following comparisons; third-party extension packages are not involved.

Basic functions

Example 1: retrieve two-dimensional structured data from a file, and access the value of the second column in the first row by coordinates.
Data frame:
         data<-read.table("e:/sales.txt",header=TRUE,sep="\t")
         result<-data[1,2]         
Sequence table:
         =data=file("e:/sales.txt").import@t()
         =data(1).#2
Comparison: there is no significant difference in the most basic functions.
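For readers without the original file, the data frame side of Example 1 can be tried on inline text. This is a sketch only: the two data rows are hypothetical, the column names are the ones used in the later examples of this post, and `stringsAsFactors = FALSE` is an added assumption so that text columns stay as character strings.

```r
# Hypothetical tab-separated content standing in for e:/sales.txt
txt <- paste(
  "OrderID\tClient\tSellerId\tAmount\tOrderDate",
  "1\tAA\t5\t100.50\t2009-01-01 0:00:00",
  "2\tBB\t4\t200.00\t2009-01-02 0:00:00",
  sep = "\n"
)

# Same call as in Example 1, reading from text instead of a file
data <- read.table(text = txt, header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE)

# First row, second column (the Client field of the first record)
result <- data[1, 2]
```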

Note: sales.txt is tab-separated structured data with a header row; its columns are OrderID, Client, SellerId, Amount and OrderDate.

Example 2: access the value of the second column in the first row, by row number and by field name.
Data frame:
         Result1<-data$Client[1]
         Result2<-data[1,]$Client
Sequence table:
         =data(1).(Client)
         =data.(Client)(1)
Comparison: there is no significant difference between the two.

Example 3: access column data. There are four cases: access by column number or by column name, retrieving either the second column alone or the second and fourth columns together.
Data frame:
         Result1<-data[2]
         Result2<-data[,c(2,4)]
         Result3<-data$Client
         Result4<-data[,c("Client","Amount")]
Sequence table:
         =data.(#2)
         =data.new(#2,#4)
         =data.(Client)
         =data.new(Client,Amount)
Comparison: both can access column data. The only difference is in the syntax for retrieving multiple columns. The data frame takes the column numbers or names directly, while the sequence table builds a new sequence table with the new function. Although the syntax differs, the underlying method is the same: both copy the two columns from the original object into a new object.
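On the R side, two details of these calls are easy to miss: `data[2]` returns a one-column data frame while `data$Client` returns a plain vector, and a multi-column selection behaves as an independent copy. The sketch below uses hypothetical values to show both points.

```r
# Hypothetical sample data with the column names used in this post
df <- data.frame(
  OrderID  = c(1, 2),
  Client   = c("AA", "BB"),
  SellerId = c(5, 4),
  Amount   = c(10, 20),
  stringsAsFactors = FALSE
)

one_col  <- df[2]                        # one-column data frame
vec      <- df$Client                    # character vector
two_cols <- df[, c("Client", "Amount")]  # new two-column data frame

# Changing the new object leaves the original untouched
two_cols$Amount[1] <- 999
original_amount <- df$Amount[1]          # still the original value
```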

Example 4: record manipulation, including retrieving the first two records, appending a record, inserting a record at the second row, and deleting the record at the second row.
Data frame:
         # retrieve the first two records
         Record1<-data[c(1,2),]
         # append a record at the end
         append<-data.frame(OrderID=152, Client="CA", SellerId=5, Amount=2961.40, OrderDate="2010-12-5 0:00:00")
         data<-rbind(data, append)
         # insert a record at the second row: split, then merge again
         insert<-data.frame(OrderID=153, Client="RA", SellerId=4, Amount=1931.20, OrderDate="2009-11-5 0:00:00")
         data<-rbind(data[1,], insert, data[2:nrow(data),])
         # delete the record at the second row
         data<-data[-2,]
Sequence table:
         =data([1,2])
         =data.insert(0,152:OrderID,"CA":Client,5:SellerId,2961.40:Amount,"2010-12-5 0:00:00":OrderDate)
         =data.insert(2,153:OrderID,"RA":Client,4:SellerId,1931.20:Amount,"2009-11-5 0:00:00":OrderDate)
         =data.delete(2)

Comparison: record manipulation is possible in both. esProc is somewhat more convenient: its insert function can append or insert records directly into a sequence table, while in R the data frame has to be split and then merged again to achieve the same result indirectly.
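The split-and-merge step in R can be wrapped in a small helper so that the row count does not need to be written out by hand. `insert_row` is a hypothetical function written for this illustration, not part of base R, and the sample values are made up.

```r
# Hypothetical helper: insert new_row so that it becomes row `pos` of df
insert_row <- function(df, new_row, pos) {
  n <- nrow(df)
  stopifnot(pos >= 1, pos <= n + 1)
  # rows before the insert position (may be empty)
  head_part <- df[seq_len(pos - 1), , drop = FALSE]
  # rows from the insert position onward (empty when appending)
  tail_part <- if (pos <= n) df[pos:n, , drop = FALSE] else df[0, , drop = FALSE]
  out <- rbind(head_part, new_row, tail_part)
  rownames(out) <- NULL   # renumber rows after the merge
  out
}

# Hypothetical sample data
df <- data.frame(OrderID = c(1, 2, 3), Client = c("AA", "BB", "CC"),
                 stringsAsFactors = FALSE)
new_row <- data.frame(OrderID = 99, Client = "ZZ", stringsAsFactors = FALSE)

inserted <- insert_row(df, new_row, 2)            # insert at the second row
appended <- insert_row(df, new_row, nrow(df) + 1) # append at the end
deleted  <- inserted[-2, ]                        # delete the second row again
```

The helper still copies the data frame on every call, so it is a convenience for the syntax rather than a change to the underlying mechanism described above.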

Summary:

As both the sequence table and the data frame are structured, two-dimensional data objects, there is no significant difference between them in the basic functions of data reading/writing, data access and maintenance.
