Tuesday, April 1, 2014

esProc Parallel Computing Improves ETL Performance for Data Warehouse

ETL performance is a hot topic nowadays. To address the problem, we can choose either an ETL tool featuring parallel computing or an MPP database/data warehouse, but either way we may find ourselves caught in a dilemma: the former is not powerful enough in computing, while the latter does not support cost-effective scale-out.

As for me, I strongly recommend esProc, which not only supports cost-effective parallel computing on PCs but also offers a complete computing architecture. Let's look at the example below.

There is an enterprise whose production database generates a huge number of orders every day. These order records are imported into various target tables of the data warehouse through ETL procedures, and both the ETL procedure and the structure of the target table vary with the query and analysis jobs. One of the ETL jobs is to compute the total amount of historical orders for each sales person.

Note: An intermediate table is introduced to store the standardized order data, since the order data is used in multiple ETL procedures and the original order records are not standardized.

The following rules apply when moving data from the source order table to the intermediate order table (a minimal sketch of the same logic follows the list):
1. A record is invalid if salesID or cfirstName is null or an empty string.
2. salesID shall hold only digits; a record containing any letter is invalidated.
3. Remove leading and trailing spaces from the data, if any.
4. Convert the initial letter of cfirstName from lower case to upper case.
5. The fullName field of the intermediate table equals cfirstName + ". " + clastName; but if clastName is null or an empty string, fullName equals cfirstName.
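To make these rules concrete, here is a minimal Python sketch of the same cleaning logic (the record layout, a dict keyed by the field names above, is an assumption for illustration):

def clean_order(record):
    """Apply the five rules above to one raw order record (a dict).
    Returns the cleaned record, or None if the record is invalid."""
    sales_id = (record.get("salesID") or "").strip()     # rule 3: trim spaces
    first = (record.get("cfirstName") or "").strip()
    last = (record.get("clastName") or "").strip()
    if not sales_id or not first:                        # rule 1: required fields
        return None
    if not sales_id.isdigit():                           # rule 2: digits only
        return None
    first = first[0].upper() + first[1:]                 # rule 4: capitalize initial
    full_name = first + ". " + last if last else first   # rule 5: build fullName
    record.update(salesID=sales_id, cfirstName=first,
                  clastName=last, fullName=full_name)
    return record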
The esProc code for handling these ETL jobs through parallel computing is shown below.
Code for summary machine:

(The esProc grid scripts for the summary machine and the node machines are not reproduced here; the cell references in the steps below, such as A1-A3 and A16-A19, refer to these grids.)
The parallel computing procedure for ETL jobs with esProc is as follows:

1. Summary machine A1-A3: Find the latest time stamp in the intermediate order table (an external file), and retrieve the incremental data from the source order table accordingly.

This snippet demonstrates that esProc supports cross-database and cross-data-source computing, which is common in ETL. Supported data sources include binary files, text files, Excel files, and any database with an ODBC or JDBC interface. esProc can also retrieve files or database result sets with cursors, to handle data sets too large to load into memory all at once. For example, a file cursor is employed in A2-C2.
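As a rough Python analogue of this step (the table name orders, the timestamp column orderTime, and the tab-separated layout of the intermediate file are assumptions for illustration):

import sqlite3   # stands in for any source database with an ODBC/JDBC driver

def last_timestamp(intermediate_path):
    """Scan the intermediate order file for the latest time stamp,
    assumed to be the last tab-separated field of each line
    (ISO-formatted timestamps compare correctly as strings)."""
    latest = ""
    with open(intermediate_path, encoding="utf-8") as f:
        for line in f:                       # stream line by line, not all at once
            latest = max(latest, line.rstrip("\n").split("\t")[-1])
    return latest

def fetch_increment(conn, since):
    """Retrieve only the new source records, cursor-style, in batches."""
    cur = conn.cursor()
    cur.execute("SELECT * FROM orders WHERE orderTime > ?", (since,))
    while True:
        rows = cur.fetchmany(10000)          # avoid loading everything into memory
        if not rows:
            break
        yield from rows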

2. Summary machine A4-A15: Clean the incremental data and append it to the intermediate order table.

In this case, we store the intermediate table in the file system rather than the database, for three reasons: first, it saves database space; second, the intermediate table must go through further ETL procedures before use; third, file access is much faster than database access and lends itself more easily to parallel processing.
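Continuing the Python sketch, appending the cleaned increment to a tab-separated intermediate file might look like this (clean_order is the function sketched earlier; the column list is an assumption):

import csv

def append_increment(rows, columns, intermediate_path):
    """Clean each incremental row and append the valid ones to the
    intermediate order file; invalid rows are simply dropped."""
    out_fields = columns + ["fullName"]
    with open(intermediate_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for row in rows:
            record = clean_order(dict(zip(columns, row)))
            if record is not None:
                writer.writerow(record.get(c, "") for c in out_fields)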

With support for looping and branching, esProc is well suited to implementing intricate, hard-coded ETL algorithms. Generally speaking, a flow-control statement requires a clearly delimited working scope; in an esProc grid, that scope is expressed by the indentation of cells.

For frequently used variables, or those to be reused later, we can define them explicitly and then reference them, like Ifile in A1. For temporary variables used within a certain section, or transitional variables between steps, esProc supports direct reference by cell name without prior definition; B6-D6 are such temporary variables within a loop. This convenient referencing is a feature of esProc's grid-style code.

Note: Since the above two steps are also used in other ETL jobs, they can be split out and run independently.

3. Summary machine A16-A19: Based on the byte size of the intermediate order table, the summary machine divides the job into 40 shares and distributes them to 4 node machines for execution.

esProc supports cost-effective scale-out: a big ETL job can be divided into several small jobs, and performance rises as those jobs run in parallel across multiple nodes. Moreover, a parallel node does not need a high-performance server; an ordinary PC does the job well.
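The division-by-bytes idea can be sketched in Python as follows; the four node names are hypothetical, and a real reader must still snap each byte range to a record boundary, as the node-side sketch further below does:

import os

def make_tasks(path, n_tasks=40, nodes=("node1", "node2", "node3", "node4")):
    """Split the intermediate file into n_tasks byte ranges and assign
    them to the node machines round-robin."""
    size = os.path.getsize(path)
    share = size // n_tasks
    tasks = []
    for i in range(n_tasks):
        begin = i * share
        end = size if i == n_tasks - 1 else (i + 1) * share
        tasks.append({"node": nodes[i % len(nodes)],
                      "path": path, "begin": begin, "end": end})
    return tasks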

4. Node machine A1-A3: Each node machine returns its result to the summary machine once its own grouping and summarizing job is done.

As can be seen, the syntax and function library of esProc are well suited to computing on massive structured data, and structured data in files and databases is exactly what ETL typically deals with.
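A Python stand-in for one node's task, grouping its byte segment by sales person (the field positions of salesID and the order amount are assumptions):

from collections import defaultdict

def node_task(path, begin, end):
    """Sum order amounts per salesID over one byte segment of the
    intermediate file; returns {salesID: total} for this segment."""
    totals = defaultdict(float)
    with open(path, "rb") as f:
        f.seek(begin)
        if begin > 0:
            f.readline()           # skip a partial line; the previous task owns it
        while f.tell() <= end:     # lines starting inside our range are ours
            line = f.readline()
            if not line:
                break              # end of file
            fields = line.decode("utf-8").rstrip("\n").split("\t")
            totals[fields[0]] += float(fields[1])   # salesID, amount (assumed positions)
    return dict(totals)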

5. Summary machine A20-A21: The summary machine consolidates the results from the node machines, then groups and summarizes them once more; the outcome is the data required by the target table.

This step and the previous decomposition and distribution of jobs can be viewed as a whole, and the resulting computing procedure is quite intuitive: the summary machine decomposes the job and distributes the pieces to each node in the form of parameters, then ultimately consolidates and summarizes the results from the node machines; each node machine takes the segment of the whole data set specified by its parameters, then groups and summarizes the data of that segment.
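Consolidation is then a simple reduce over the per-segment dictionaries. Run locally, with a process pool standing in for the 4 node machines, steps 3-5 of the sketch tie together like this (the file name is hypothetical):

from collections import defaultdict
from multiprocessing import Pool

def consolidate(partials):
    """Merge the per-segment {salesID: total} results into final totals."""
    final = defaultdict(float)
    for partial in partials:
        for sales_id, amount in partial.items():
            final[sales_id] += amount
    return dict(final)

# A local stand-in for the 4 node machines:
# tasks = make_tasks("orders_intermediate.tsv")
# with Pool(4) as pool:
#     partials = pool.starmap(node_task,
#                             [(t["path"], t["begin"], t["end"]) for t in tasks])
# totals = consolidate(partials)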

6. Summary machine A22-A25: Refresh the target table salesAmount in the data warehouse.

The update function is another feature designed for the massive structured-data computing of ETL: with it, a whole batch of data can be written to the database directly, without any loop statement.
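With a plain DB-API driver in Python, the same batch write can be sketched with executemany; the upsert syntax shown is SQLite's, standing in for whatever the target data warehouse supports, and assumes salesID is the table's primary key:

import sqlite3

def refresh_sales_amount(db_path, totals):
    """Write the consolidated totals to the target table in one batch:
    update existing sales persons, insert new ones. No loop over
    individual rows is needed."""
    conn = sqlite3.connect(db_path)
    with conn:                     # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO salesAmount (salesID, amount) VALUES (?, ?) "
            "ON CONFLICT(salesID) DO UPDATE SET amount = excluded.amount",
            totals.items(),
        )
    conn.close()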

In short, parallel computing with esProc can easily boost data warehouse ETL performance.

