ETL performance is a hot topic nowadays. To address this problem, we can choose either an ETL tool featuring parallel computing or an MPP database/data warehouse. But we may find ourselves caught in a dilemma: the former is often not powerful enough in computing, while the latter does not support cost-effective scale-out.
As for me, I strongly recommend esProc, which not only supports cost-effective parallel computing on ordinary PCs but also offers a complete computing architecture. Let's check out the example below:
Consider an enterprise whose production database generates a huge number of orders every day. These order records are imported into various target tables of the data warehouse through ETL procedures. Both the ETL procedure and the structure of the target table vary with the query and analysis jobs. One of the ETL jobs is to compute the total amount of historical orders for each salesperson.
Note: An intermediate table is introduced to store the standardized order data, since the order data is used in multiple ETL procedures and the original order data is not standardized.
The following rules apply when moving data from the source order table to the intermediate order table (a sketch of the cleaning logic follows the list):
1. salesID shall hold numbers only; any record whose salesID contains a letter is invalidated.
2. Remove the leading and trailing spaces of the data, if any.
3. Convert the initial of cfirstName from lower case to upper case.
4. The fullName field of the intermediate table equals cfirstName + ". " + clastName; but if clastName is null or an empty string, then fullName equals cfirstName.
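To make the rules concrete, here is a minimal Python sketch of the cleaning logic (the actual job is written in esProc; the field names salesID, cfirstName and clastName come from the rules above, while the dict-based record format is an assumption for illustration):

    def clean_order(record):
        """Apply rules 1-4 above; return None for an invalidated record."""
        # Rule 2: strip leading and trailing spaces from every string field.
        record = {k: v.strip() if isinstance(v, str) else v
                  for k, v in record.items()}

        # Rule 1: invalidate the record if salesID contains any letter.
        if any(ch.isalpha() for ch in str(record["salesID"])):
            return None

        # Rule 3: convert the initial of cfirstName to upper case.
        first = record["cfirstName"]
        record["cfirstName"] = (first[:1].upper() + first[1:]) if first else first

        # Rule 4: fullName = cfirstName + ". " + clastName, falling back to
        # cfirstName alone when clastName is null or an empty string.
        last = record.get("clastName")
        record["fullName"] = (record["cfirstName"] + ". " + last) if last \
            else record["cfirstName"]
        return record

For example, a record whose salesID is "12a" is dropped, while {"salesID": "1001", "cfirstName": " john ", "clastName": "smith"} yields fullName "John. smith".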
The esProc code for handling the ETL job through parallel computing is shown below.
Code for the summary machine:
The parallel computing procedure for the ETL job with esProc is as follows:
1. Summary machine A1-A3: Find the latest timestamp in the intermediate order table (an external file), and retrieve the incremental data from the source order table accordingly.
This code snippet demonstrates that esProc supports cross-database and cross-data-source computing, which is common in ETL. Supported data source types include binary files, text files, Excel files, and databases with ODBC or JDBC interfaces. esProc can also retrieve files or database result sets through cursors, to handle data too large to load into memory all at once. For example, a file cursor is employed in A2-C2.
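The logic of A1-A3 can be sketched in Python roughly as follows (the real code uses esProc cursors; the file name, database name and orderTime column are stand-ins, and timestamps are assumed to be ISO strings so they compare correctly as text):

    import csv
    import sqlite3

    INTERMEDIATE_FILE = "orders_intermediate.csv"   # assumed file name

    # Find the latest timestamp in the intermediate order table, streaming
    # the file row by row instead of loading it into memory all at once.
    last_ts = ""
    with open(INTERMEDIATE_FILE, newline="") as f:
        for row in csv.DictReader(f):
            if row["orderTime"] > last_ts:
                last_ts = row["orderTime"]

    # Retrieve only the incremental rows from the source order table.
    conn = sqlite3.connect("production.db")         # stand-in source database
    incremental = conn.execute(
        "SELECT * FROM orders WHERE orderTime > ?", (last_ts,)
    ).fetchall()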
2. Summary machine A4-A15: Clean the incremental data and append it to the intermediate order table.
In this case, we store the intermediate table in the file system rather than in the database, for three reasons: first, it saves database space; second, the intermediate table must go through further ETL procedures before use; third, file access is much faster than database access and is much easier to run in parallel.
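Continuing the sketch, appending the cleaned increment to the file-based intermediate table needs nothing more than opening the file in append mode (clean_order and incremental come from the earlier sketches; the column list is an assumed schema):

    import csv

    COLUMNS = ["salesID", "cfirstName", "clastName", "orderTime", "amount"]  # assumed

    with open("orders_intermediate.csv", "a", newline="") as f:
        writer = csv.writer(f)
        for raw in incremental:
            cleaned = clean_order(dict(zip(COLUMNS, raw)))
            if cleaned is not None:                  # drop invalidated records
                writer.writerow(cleaned[c] for c in COLUMNS + ["fullName"])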
With its support for looping and branching, esProc is well suited to implementing intricate, hard-coded ETL algorithms. Generally speaking, a flow-control statement requires a clearly defined working scope; in the esProc grid, such a scope is represented by the indentation of cells.
Frequently used variables, or those to be reused later, can be defined explicitly and then referenced, like Ifile in A1. For temporary variables used within a certain section, or transitional variables between steps, esProc supports direct reference by cell name without defining them in advance; B6-D6 are the temporary variables within such a loop. This convenient referencing is a feature of esProc's grid-style code.
Note: Since the above two steps are also used in other ETL jobs, they can be split out to run independently.
3. Summary machine A16-A19: Based on the number of bytes in the intermediate order table, the summary machine divides the job into 40 shares and distributes them to 4 node machines for execution.
esProc supports cost-effective scale-out: a big ETL job can be divided into several small jobs, and performance rises by running those jobs in parallel on multiple nodes. In addition, a parallel node does not need to be a high-performance server; an ordinary PC can do the job well.
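A byte-based split like the one in A16-A19 can be sketched as follows (esProc's own splitting also aligns shares to record boundaries; here only the byte arithmetic is shown, and the file name is the assumed one from the earlier sketches):

    import os

    SHARES, NODES = 40, 4
    size = os.path.getsize("orders_intermediate.csv")
    step = size // SHARES

    # One task per share; tasks are dealt round-robin to the 4 nodes and
    # each node receives its (begin, end) byte range as parameters.
    tasks = [
        {"node": i % NODES,
         "begin": i * step,
         "end": size if i == SHARES - 1 else (i + 1) * step}
        for i in range(SHARES)
    ]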
4. Node machine A1-A3: Each node machine returns its result to the summary machine once its own grouping and summarizing job is done.
As can be seen, the syntax and function library of esProc are well suited to massive structured-data computing, and structured data in files and databases is exactly what ETL commonly deals with.
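A node machine's task might look like this in Python (a sketch only: one record per line is assumed, with salesID in the first column and the order amount in the last; a record belongs to the share in which it starts, so no record is counted twice):

    from collections import defaultdict

    def node_job(path, begin, end):
        """Group and sum the order amounts in one byte segment."""
        totals = defaultdict(float)
        with open(path, "rb") as f:
            f.seek(begin)
            if begin > 0:
                f.readline()              # skip the partial line at the boundary
            while f.tell() < end:         # lines starting in [begin, end) only
                line = f.readline()
                if not line:
                    break
                fields = line.decode().rstrip("\n").split(",")
                totals[fields[0]] += float(fields[-1])
        return dict(totals)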
5. Summary machine A20-A21: The summary machine consolidates the results from the node machines, then groups and summarizes them again; the outcome is the data required by the target table.
This step and the earlier decomposition and distribution of jobs can be viewed as a whole, and the esProc computing procedure is quite intuitive: the summary machine decomposes the job, distributes the pieces to the nodes in the form of parameters, and ultimately consolidates and summarizes the results returned from those nodes; each node machine reads the segment of the whole data set specified by its parameters, then groups and summarizes that segment.
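The consolidation step is then a second, much smaller group-and-sum over the partial results (a sketch; node_job and tasks come from the previous sketches):

    from collections import defaultdict

    def consolidate(partials):
        """Merge per-node partial totals into final per-salesperson sums."""
        final = defaultdict(float)
        for part in partials:                     # one dict per node machine
            for sales_id, amount in part.items():
                final[sales_id] += amount
        return dict(final)

    # e.g. final = consolidate(
    #     node_job("orders_intermediate.csv", t["begin"], t["end"]) for t in tasks)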
6. Summary machine A22-A25: Refresh the target table salesAmount in the data warehouse.
The update function is another capability tailored to the massive structured-data computing of ETL: with it, a whole batch of data can be written into the database directly, without any loop statement.
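In the Python sketch the batch refresh could be expressed like this (sqlite3 stands in for the data warehouse; the salesAmount table name comes from the text, while the column names and the delete-then-insert refresh strategy are assumptions):

    import sqlite3

    def refresh_sales_amount(final, db_path="warehouse.db"):
        """Write the whole result batch into the target table at once,
        using executemany instead of a row-by-row loop."""
        dw = sqlite3.connect(db_path)
        dw.execute("DELETE FROM salesAmount")     # simple full refresh
        dw.executemany(
            "INSERT INTO salesAmount (salesID, totalAmount) VALUES (?, ?)",
            list(final.items()),
        )
        dw.commit()
        dw.close()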
In a word, esProc parallel computing can easily boost data warehouse ETL performance.