Data processing with multi-database, off database, log, txt and excel: Comparison Between esProc’s Sequence Table Object and R’s Data Frame part(II)

Advanced features

Example 5: modifying the association. A1, A2 are two-dimensional structured data object with the same field ID. We now need to add the bonus field values of A2 to the salary field values in A1 according to ID.

Sequence table:

A1=db.query("select id,name,salary from salary order by id")

A2=db.query("select id,bonus from bonus order by id")

A1.modify(1:A2,salary+bonus:salary)

Data frame has no functions to modify the association. We need to do manual coding for this, which is omitted here.

Example 6: merging associations. A1, A2, A3 are two-dimensional structured data objects with the same field sequence number. Please associate them with left join. As the data is sorted by sequence number, please leverage merging methods to improve the speed for association.

Sequence Table：join@m1(A1:salary,id; A2:bonus,id; A3,attendance,id)

Data frame supports association of two tables, such as: merge(A1,A2,by.x="id",by.y="id",all=TRUE).

In this case three tables are associated, which can be achieved indirectly through two two-table associations.

n addition, the data frame does not support merging of association, and therefore no speed improvement is possible. In other words, data frame cannot use ordered sequence data to improve performance, not only with association, but also with other operations.

Example 7: Record lookup. Four scenarios: retrieving records with the Amount greater than 1000; retrieving the sequence number or records with the Amount greater than 1000;return records with primary key value of “v”, return the sequence number for records with primary key value of “v”.

Sequence table:

=data.select(Amount>1000)

= data.pselect(Amount>1000)

= data.find(v)

= data.pfind(v)

Data frame：only the first two scenarios can be achieved, which is done with following code:

newdata<- data [data $ Amount>1000,]

which(data $ Amount >1000)

Data frame hasn't the concept of major key, so we need to do manual coding for other 2 scenarios as indirect methods, or employ a third party package (i.e. data.table). The codes are omitted here.

Example 8: Group sum. The data is grouped by Client and SellerId. Then the other two fields are aggregated: do a sum for Amount field, and do a count for OrderID field.

Sequence table：

=data.groups(Client,SellerId;sum(Amount),count(OrderID))

Data frame：only support single field aggregation, such as the sum of Amount. As following：

result<aggregate(data[,4],data[c(2,3)],sum)

To do aggregation of two fields at the same time with data frame, we can only use two separate aggregate statements and then merge the results. Codes are omitted here.

Example 9: Reuse grouping. Group data by Client. Complete multiple subsequent computations on group result. Including: aggregation by amount, and count after grouping by SellerId.

Sequence table:

A2=data.group(Client)

=A2.(~.sum(Amount))

=A2.(~.groups(SellerId;count(OrderID)))

Data frame does not support reuse of grouping directly. Grouping and aggregation usually need to be done in one step. This means we need to do two identical grouping operations to accomplish the same purpose. As following:

result<-aggregate(data[,2],sum)

result<-aggregate(data[,2],data[,3],count)

If we want to reuse grouping, we must use split function and loop to achieve this. The code is both lengthy and with low performance.

Summary:Sequence tables and data frame are quite different in terms of advanced features. This is mainly demonstrated in the following five ways:

1. Richness of features. Sequence table has rich functions, and is very convenient to do structured data computation. Data frame originates from matrix, with less support for structured data and lack of many features. Use of the third party packages can in some degree supplement the functions data frame lacks, but these packages are no match for R’s primitive library function in muturity and stability.

2. Difficulty in syntax. The function names of sequence table are more intuitive.For example, select means to find; pselect is to find the location (position). With data frame the syntax is relatively obscure. For example, “find by field” is data [data $ Amount> 1000,], and retrieve value by field is data[,"Amount"]. These two are confusing and difficult for the programmer to understand. One must have some knowledge on vector to grasp it.

3. Memory consumption. Basically sequence table function only returns a reference, with very little memory occupation. Data frame must copied record from the original object. If we need to do multiple search, association and grouping operations on large amounts of data, data frame’s memory consumption will be very large. It will impact the whole system.

4. Code workload and code performance. The functions supported by data frame are not rich enough. We need to do hand-coding to achieve this indirectly. This means more workload. The R interpreter is known to be very slow. With hand-coding the performance is much lower than library functions.

5. Library function performance. Sequence table has many functions to improve computing performance, such as merging association, grouping functions, binary search, hash lookup. Although data frame supports association, aggregation and search, it’s hard to improve the performance.

Actual case

In this part we use a real case for comprehensive comparison o fdata frame and sequence table.
Computation target: according to daily transactions, selecting stocks from blue-chip stocks whose prices rises in 5 days in a row.
Ideas: Importing data; filtering out previous month's data; grouped them according to the ticker; sort the data by dates; compute the growth amount for closing price over previous day; compute the number of days for continuous positive growth; filtering out the stocks which rise in 5 or more days in a row.

Sequence Table:

Data frame:

Comparison：

1. Data frame function is not rich enough, and is lack of professionalism. We need to use nested loops to meet the requirement in this case. It’s of low computational efficiency. Sequence table has rich and diverse functions. Without the use of loop statement we can achieve the same purpose. The code is shorter and simpler, and the performance is higher.

2. When programming for data frame, the code is obscure and hard to write. With sequence table, the code is clear and easy to understand. The cost of learning is lower.

3. When large amount of data is involved in this scenario, the memory consumption will be huge. Sequence table is computation by reference, which consumes less memory. Data frame is computation by value pass. The memory consumption is several times more than sequence table. It easy to result into memory overflow in this scenario.

4. To import Excel data into data frame, R requires third-party software packages. However they seem to have difficulty working together. Data import needs ten minutes to complete. With sequence table this only needs tens of seconds.

Test Performance

Test 1: Generating 10 million records in memory, each consists of three fields. All values are random numbers. Records are filtered, and each field is summed.

Sequence table:

Data frame：

Comparison: sequence table needs 50.534 seconds, while data frame needs 91.999 seconds. The gap is obvious.

Test 2: Retrieving 1.2G txt file. Do filtering and sum on two fields

Sequence Table

Data frame

Comparison: sequence table takes 87.122 seconds, while data frame takes 1.1347 hours. The performance difference is tens of times. The reason for this is mainly due to the extremely low speed for file reading.

From the above comparison, we can see that sequence table are better than data frame in terms of rich features, easy syntax, memory consumption, development effort, library function performance and coding performance, etc.. Of course, data frame is not the full strength of R language. R has a powerful vector matrix and the associated mass functions, which make it more professional than esProc in scientific and engineering computation.

Data processing with multi-database, off database, log, txt and excel

2014年8月20日星期三

Comparison Between esProc’s Sequence Table Object and R’s Data Frame part(II)

Advanced features

Actual case

Test Performance

没有评论:

发表评论