Tuesday, July 29, 2014

A Rich Class Library Is the Basis for Speeding Up Big Data Computing and Development

In big data application development, the framework plays a very important role: it hides the complex parallel logic and guarantees system reliability and stability, so programmers can focus on the business algorithms.

Important as the framework is, it does not take much time to develop, while implementing the basic algorithms costs programmers great effort. For example, to complete the simplest algorithm, grouping and summing, in MapReduce, the code of the Reduce part alone is shown below:
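As a rough Python sketch of the reduce-side logic (not Hadoop code; the sample keys and values are invented), a group-and-sum reducer collects every (key, value) pair and accumulates a total per key:

```python
from collections import defaultdict

def reduce_group_sum(mapped_pairs):
    """Simulate the Reduce step of group-and-sum:
    collect (key, value) pairs and sum the values per key."""
    totals = defaultdict(float)
    for key, value in mapped_pairs:
        totals[key] += value
    return dict(totals)

pairs = [("east", 100.0), ("west", 50.0), ("east", 25.0)]
print(reduce_group_sum(pairs))  # {'east': 125.0, 'west': 50.0}
```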

Even the simplest algorithm requires this much code to implement; ordinary algorithms inevitably require far lengthier code. Take the common problem of finding the top N as an example. Part of the MapReduce code is as follows:
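In Python terms, the core of what each top-N reducer keeps can be sketched with a bounded selection over records (the field names and sample data here are invented):

```python
import heapq

def top_n(records, key, n=10):
    """Keep the n largest records by a given field --
    the core of what a reducer does in a MapReduce top-N job."""
    return heapq.nlargest(n, records, key=lambda r: r[key])

counts = [{"keyword": "a", "hits": 5},
          {"keyword": "b", "hits": 9},
          {"keyword": "c", "hits": 2}]
print(top_n(counts, "hits", 2))
# [{'keyword': 'b', 'hits': 9}, {'keyword': 'a', 'hits': 5}]
```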


There are also algorithms, the join operation for example, which are frequently used but extremely difficult to implement in MapReduce. To implement a join in MapReduce, users usually have to inherit or implement the Partitioner, WritableComparator, and WritableComparable, in addition to implementing Map and Reduce. The code is too lengthy to list here.
In addition to grouping and summing, top N, and join, there are many other basic algorithms, for example filtering, distinct, intersection, sorting, ranking, link relative ratio, year-on-year comparison, relative position computing, and interval computing. As can be imagined, to implement real business logic we must combine these basic algorithms in actual use, resulting in a project of great workload.
Hive and other SQL-like tools have tried to package these basic algorithms on Hadoop. However, the functions available in Hive are far from rich, and users usually have to resort to MapReduce or custom classes, so development efficiency cannot be improved significantly. For example, to compute the top 10 best sellers for each department with Hive, the Java code below needs to be implemented:
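The pattern behind this example, group first and then take the top N within each group, can be sketched in Python (the field names and sample rows are invented, not taken from the original listing):

```python
import heapq
from collections import defaultdict

def top_per_group(rows, group_field, rank_field, n=10):
    """Group rows by one field, then keep the n rows with the
    largest rank_field in each group -- the 'top 10 best sellers
    per department' pattern."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row)
    return {g: heapq.nlargest(n, members, key=lambda r: r[rank_field])
            for g, members in groups.items()}

products = [
    {"department": "toys", "name": "ball", "quantity": 40},
    {"department": "toys", "name": "kite", "quantity": 90},
    {"department": "food", "name": "tea",  "quantity": 70},
]
print(top_per_group(products, "department", "quantity", n=1))
```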


In addition, Hive lacks overall flow control and statements for branching and looping. It is weak at data traversal, let alone business algorithms involving multiple steps or complex logic. Such algorithms must be implemented in combination with MapReduce.
MapReduce plays a vital role in Hadoop development. However, MapReduce lacks functions for structured data, making it hard to speed up the development of big data applications. To reduce the workload, we must package these basic algorithms into functions. A rich function library will definitely boost the efficiency of big data computing.

esProc is a parallel computing architecture specially optimized for small and medium-sized clusters, featuring rich library functions. For the above three examples, the esProc solutions are respectively shown below:
Grouping and summarizing: sales.groups(empID;sum(amount))
Top N: counts.top(keyword;10)
Top 10 best sellers in each department: products.group(department).(~.top(quantity;10))

As can be seen, in place of the big chunks of MapReduce code, esProc users only need a few library functions such as groups, sum, top, and group, and development efficiency increases remarkably.

To meet the challenge of big data computing, esProc meticulously designed two kinds of complete and practical library functions: cursor functions for fault-tolerant computing in external storage, and TSeq functions for high-performance computing in memory. These two types of functions can be used together to dramatically reduce the workload of big data development.

Cursor function
The cursor is a data type specific to big data computation. Facing big data in a database table or in files in external storage, users can use a cursor to retrieve a small batch of data at a time and complete the computation over that batch in memory. Once the in-memory computation is finished, the loop proceeds to retrieve the next batch, until the computation over all the data is complete. Typical cursor functions include: cursor for creating, fetch for fetching, skip for skipping, mergex for merging, joinx for joining, groupx for grouping, and sortx for sorting.

For example, when summarizing by the field amount on the big data file sales.dat, the computing job is allocated to multiple node machines. Each node machine computes one piece of the file and returns its result to the summary machine. The data allocated to a node machine may still be too large to load into memory all at once. In this case, cursors can be used to retrieve the data in batches, say 100,000 records at a time. The code for the node machine is shown below:
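The batch-retrieval idea can be sketched in Python (this is a simulation over an in-memory row list, not esProc code; the tab-separated layout and column index are assumptions):

```python
import itertools

def sum_field_in_batches(lines, field_index, batch_size=100_000):
    """Cursor-style summarization: consume rows in fixed-size
    batches, summing one column, so the whole data set never
    has to sit in memory at once."""
    it = iter(lines)
    total = 0.0
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        total += sum(float(line.split("\t")[field_index]) for line in batch)
    return total

rows = ["e1\t100.0", "e2\t250.0", "e1\t50.0"]
print(sum_field_in_batches(rows, 1, batch_size=2))  # 400.0
```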
 
Multiple cursors can be consolidated into a single cursor: multiple sorted cursors can be merged with mergex, joined with joinx, and unioned with conj@x. The code for merging cursors is shown below:
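mergex itself is esProc-specific, but the underlying idea, lazily interleaving already-sorted streams into one sorted stream, can be sketched in Python with heapq.merge:

```python
import heapq

# Two already-sorted streams, standing in for sorted cursors.
cursor_a = iter([1, 4, 7])
cursor_b = iter([2, 3, 9])

# heapq.merge consumes both streams lazily and yields a single
# sorted sequence without materializing either input.
merged = list(heapq.merge(cursor_a, cursor_b))
print(merged)  # [1, 2, 3, 4, 7, 9]
```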
 
Join computation can also be conducted between multiple cursors, in which case function joinx is used. For big data grouping and summarizing, the summarized result itself might be too large to load into memory; in this case a cursor is required to present the result, and function groupx performs such grouping. Likewise, big data must be sorted in external storage, and function sortx returns the sorted result as a cursor directly.
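The groupx idea, streaming the data in key order so each group can be aggregated and emitted one at a time without holding the whole result set in memory, can be sketched in Python (the field names are invented):

```python
import itertools

def group_sum_streaming(sorted_rows, key, value):
    """groupx-style grouping: if rows arrive sorted by the grouping
    key, each group can be aggregated and yielded as soon as it
    ends, so the full result never has to fit in memory."""
    for k, members in itertools.groupby(sorted_rows, key=lambda r: r[key]):
        yield k, sum(r[value] for r in members)

rows = [{"dept": "food", "amount": 3},
        {"dept": "food", "amount": 4},
        {"dept": "toys", "amount": 5}]
print(list(group_sum_streaming(rows, "dept", "amount")))
# [('food', 7), ('toys', 5)]
```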

TSeq function
A cursor retrieves data in batches and then computes, so that low-efficiency computing in external storage is converted into high-performance computing in memory. The structured data type in memory is the TSeq.
The TSeq is generic and ordered, especially fit for complex order-related computations. Moreover, the TSeq inherits the table concept from databases: users can access data by fields and records, which is ideal and convenient for computing over structured data. In the above examples, sales, counts, and products are all TSeqs, while groups, top, and group are TSeq functions. In addition, when summarizing sales.dat, the sum function is also a TSeq function. Let's demonstrate with more examples:

To find the goods with the lowest total price, we can use minp like this: A1.minp(price*quantity).
To find the teams whose goal difference is greater than 30, we can use select like this: A1.select(F-A>30).

First, select the teams whose goal difference is greater than 0; then, among these teams, find the greatest points. In this case we can use maxif, which is equivalent to A.select().max(): A1.maxif(W*3+D;F>A).

In addition to filtering records directly, we can also filter only the sequence numbers of records, using pos, pmin, pmax, pselect, and other similar functions. For example, to find the sequence numbers of male employees whose names begin with C, just write A1.pselect@a(Gender:"M",left(Name,1):"C").
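These TSeq operations can be approximated in Python over a list of records (the team data and field names are invented; positions are kept 1-based to match esProc):

```python
# Sample rows standing in for a TSeq (records with named fields):
# W = wins, D = draws, F = goals for, A = goals against.
teams = [
    {"Name": "Ajax",  "W": 20, "D": 5, "F": 60, "A": 20},
    {"Name": "Barca", "W": 25, "D": 3, "F": 90, "A": 25},
    {"Name": "Celta", "W": 8,  "D": 6, "F": 30, "A": 45},
]

# A1.select(F-A>30): filter records by goal difference.
strong = [t for t in teams if t["F"] - t["A"] > 30]

# A1.maxif(W*3+D;F>A): among teams with F>A, the greatest points.
points = max(t["W"] * 3 + t["D"] for t in teams if t["F"] > t["A"])

# A1.minp(...): the whole record minimizing an expression.
worst_attack = min(teams, key=lambda t: t["F"])

# A1.pselect@a(...): 1-based positions of all matching records.
positions = [i + 1 for i, t in enumerate(teams) if t["F"] > t["A"]]

print(len(strong), points, worst_attack["Name"], positions)
```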

esProc library functions are rich and diversified. Besides the cursor functions and the TSeq functions, there are also functions for databases, remote files, parallel dispatching, mathematics, strings, date and time, set operations, aggregation, loop computation, positioning, filtering, and association. With various combined uses of these library functions, the development efficiency of big data computing is boosted dramatically.
