Wednesday, July 16, 2014

ETL or ELT?

An ETL process usually involves a great deal of business logic that is beyond the reach of visual ETL tools, so most of the time it has to be implemented through hand-written code. For example, from a store's production database we need to ETL consumers' purchasing records into the membership score table of the query database. The business logic is: for every 10 dollars of purchases in the supermarket or fashion shops, the consumer gets 1 point on his/her scorecard; for electronic products the rate is 1 point per 20 dollars, and for jewelry 1 point per 50 dollars. If the consumer makes purchases in the store on 5 consecutive days, the points are doubled. Likewise, if the purchase amount exceeds 1,000 dollars 3 times within a single month, the points are also doubled. Similar business logic appears elsewhere: computing mortgage balances for bank customers as interest rates change, or computing telecom customers' monthly phone bills under different promotion policies.
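To give a sense of how much computation such a rule involves, here is a minimal Python sketch of the scoring logic described above. It is only an illustration: the record layout, the field names and the way the two doubling rules combine are assumptions, not part of any real system.

from collections import defaultdict
from datetime import timedelta

# Points bands per dollar, by category (category names are assumed).
RATE = {"supermarket": 10, "fashion": 10, "electronics": 20, "jewelry": 50}

def monthly_scores(purchases):
    """purchases: list of dicts with 'member', 'date' (datetime.date), 'category', 'amount'."""
    scores = defaultdict(int)
    by_member = defaultdict(list)
    for p in purchases:
        scores[p["member"]] += int(p["amount"] // RATE[p["category"]])
        by_member[p["member"]].append(p)
    for member, recs in by_member.items():
        days = sorted({r["date"] for r in recs})
        # Double if the member purchased on 5 consecutive days.
        if any(days[i + 4] - days[i] == timedelta(days=4) for i in range(len(days) - 4)):
            scores[member] *= 2
        # Double again if a purchase exceeded 1,000 dollars 3 times in the month.
        if sum(1 for r in recs if r["amount"] > 1000) >= 3:
            scores[member] *= 2
    return scores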

The above business logic can be implemented either with hand-coded ETL scripts or with SQL/SP. ETL scripts can be written in any mainstream language such as Tcl, Perl or Python, or in the script languages that come with some ETL tools. These languages are good at logic and flow control and at simple data-cleansing tasks such as data type validation, filling in missing fields, eliminating duplicate records, data internationalization and data standardization. However, they lack the syntax and functions for batch computation over structured data (aggregation, grouping, querying, sorting, ranking, etc.), which makes it difficult to implement ordinary business logic such as the membership scores, mortgage balances and monthly phone bills mentioned above. To implement such logic, programmers first have to build the underlying functions for batch processing of structured data themselves, which is a mammoth job. ETL scripts are therefore limited to use cases where the business logic is relatively simple.

The difficulty of implementing business logic with ETL scripts leads to the frequent use of hard-coded SQL/SP instead, because SQL/SP has built-in functions for batch computation over structured data as well as the conditional and flow-control syntax needed for complex business logic. Under such circumstances the ETL process becomes an ELT process (actually ELTL), which generally follows these steps:

1. Extract data from data sources such as business databases, text files, logs and Excel files.
2. Load the data into an intermediate database, usually a straight copy without any computation.
3. Transform the data inside the intermediate database through business logic hard-coded in SQL/SP. This is the core of the ELT process (a minimal sketch of the whole flow follows this list).
4. Load the results into the target database, usually a data warehouse or a non-production query database.
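The following is a minimal Python sketch of this E-L-T-L flow, using SQLite to stand in for the intermediate and target databases; the file names, table layouts and the SQL are illustrative assumptions, not a prescription.

import csv, sqlite3

# 1. Extract: read raw purchase records from a source file (assumed layout).
with open("purchases.csv", newline="") as f:
    rows = [(r["member"], r["category"], float(r["amount"])) for r in csv.DictReader(f)]

# 2. Load the raw data into an intermediate (staging) database, as-is.
staging = sqlite3.connect("staging.db")
staging.execute("CREATE TABLE IF NOT EXISTS purchases(member TEXT, category TEXT, amount REAL)")
staging.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)

# 3. Transform: the business logic lives in SQL inside the database.
result = staging.execute(
    "SELECT member, SUM(amount) FROM purchases GROUP BY member").fetchall()

# 4. Load the computed results into the target query database.
target = sqlite3.connect("warehouse.db")
target.execute("CREATE TABLE IF NOT EXISTS member_totals(member TEXT, total REAL)")
target.executemany("INSERT INTO member_totals VALUES (?, ?)", result)
staging.commit(); target.commit()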

The benefit of using SQL/SP for the data transformation is that business logic is easy to express. The disadvantages, however, are also obvious: poor performance and high cost.

Poor performance.
To use SQL/SP for the ELT business logic, the raw data must first be loaded into an intermediate database. This is a plain data copy with no cleansing or conversion; in essence it is nothing more than writing the original data to disk, yet it is extremely time-consuming for a database. Unlike writing files to disk, inserting data into a database is a complex process that involves data consistency, logging, constraints and indexes, data block continuity, partitioning and so on, so database I/O performance is very low. Even after optimization, copying data into a database usually runs at only a fraction of the speed of writing it to files.
The raw data for ELT is generally large in volume. In the mortgage example, tens of millions of customers might be involved; telecom companies generate hundreds of millions of call records every day; a large retail chain produces terabytes of consumer purchasing records every month. It is easy to imagine that loading such raw data into a database causes a sharp decline in performance, which seriously affects the timeliness of reports and business analysis.

High cost.
Another obstacle to ELT is its high cost.
To improve the performance of SQL/SP, users must upgrade the database. Traditional non-MPP databases are upgraded by scaling up, i.e. adding hardware to a single machine, which quickly hits a performance ceiling and rarely improves performance enough to solve the problem. MPP databases support parallel computing and scale-out and can run the ELT process efficiently, but cost is a concern: so far, production-grade MPP databases are proprietary and very expensive, so adopting one means heavy up-front and upgrade costs for the enterprise.
In addition, ELT requires a database to store the raw data temporarily, which means purchasing additional storage devices and licenses; in many cases this drives costs up even further.
The lack of syntax and functions for batch computation over structured data in ETL scripts, together with the poor performance and high cost of SQL/SP, makes it extremely difficult for companies to implement business logic in ETL. They are caught in a dilemma between the heavy workload of scripting and the low performance and high cost of ELT. As a result, ELT is often criticized for long development cycles, analysis conclusions that arrive too late to be useful, and budget overruns.

How can ETL with business logic be implemented while limiting cost, reducing development difficulty and achieving high performance? Using esProc is one option.
esProc is a programming language for structured data computation. It provides a rich library of built-in objects and functions to implement complex business logic, and it lowers the barrier of turning business logic into code. Its ordered sets, for example, handle problems that are awkward in SQL/SP, such as relative-position access, cross-row computation over multi-level groups, and complex ranking. In the Extract and Load stages, esProc supports heterogeneous data sources and computation across them, including relational databases, NoSQL databases, semi-structured data, local and LAN files, and HDFS files.
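To make "relative-position access" and cross-row computation concrete, here is a tiny plain-Python illustration of the kind of calculation meant (month-over-month growth, which needs the previous row relative to the current one); it only illustrates the concept and is not esProc syntax.

# Monthly sales figures, ordered by month (made-up numbers).
sales = [120, 150, 135, 180, 210]

# Cross-row computation: each month's growth rate needs the previous month's value.
growth = [None] + [(cur - prev) / prev for prev, cur in zip(sales, sales[1:])]
print(growth)   # [None, 0.25, -0.1, 0.333..., 0.1666...]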
esProc has built-in functions for batch computation over structured data, which makes implementing business logic much easier and offsets the weakness of ETL scripts. esProc can read the source data directly, or copy/export it to multiple nodes or HDFS for computation, without first loading it into a database. This avoids inefficient database I/O and greatly improves performance. esProc also supports scale-out through its parallel computing framework to handle big data, which keeps performance high while reducing cost. It uses the file system and cheap hard disks for temporary storage (in many cases temporary storage is not needed at all), eliminating the need to buy additional database storage devices and licenses.
The following "high out-of-stock risk commodity" example demonstrates how esProc can be used.
A retail chain has 49 malls, each with its own database. After the stores close at night, the data needs to be aggregated and a number of business computations carried out. One of them is to find, across all stores, the goods whose out-of-stock-risk index is higher than 80%. The index is computed as 1 - (current_stock_for_the_commodity / total_sales_in_last_month): the higher the value, the greater the risk, and a value of 1 means the goods are completely sold out.
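A quick worked example of the formula (the numbers are made up):

def out_of_stock_risk(current_stock, last_month_sales):
    # 1 means completely sold out; anything above 0.8 counts as high risk here.
    return 1 - current_stock / last_month_sales

print(out_of_stock_risk(50, 1000))   # 0.95 -> high risk
print(out_of_stock_risk(600, 1000))  # 0.40 -> low risk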

1. The summary machine assigns tasks to the node machines. Each node machine aggregates the current day's sales data for one store and returns the result to the summary machine, which aggregates over all stores to obtain the chain's total sales for that day.
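The original esProc snippet for this step is not reproduced here; the following plain-Python sketch only mirrors the described logic, and the store identifiers and the per-store query function are assumptions.

from concurrent.futures import ThreadPoolExecutor
from collections import Counter

STORES = range(1, 50)   # the 49 stores, with assumed identifiers 1..49

def store_day_sales(store_id):
    """Node-machine task: query the store's own database and return
    {commodity: quantity_sold_today}. The real query is omitted here."""
    return {}

# The "summary machine" hands one store to each worker and merges the results.
total = Counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    for partial in pool.map(store_day_sales, STORES):
        total.update(partial)   # chain-wide sales per commodity for the day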

2. Computing the sales value for a single store on a node machine.
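Again in place of the original esProc code, here is a rough Python sketch of what this step does according to the note below: today's raw sales records are first dumped from the store database into a local file, then read back in batches and summed per commodity. The file name, record layout and batch size are assumptions.

import csv
from collections import Counter

CHUNK = 10_000  # batch size for reading (assumed)

def dump_day_records(records, path="today_sales.csv"):
    """Copy today's raw records out of the store database into a file, so later
    computations for other targets do not have to query the database again."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["commodity", "quantity"])
        writer.writerows(records)

def aggregate_day_sales(path="today_sales.csv"):
    """Read the file back in batches and sum the quantity sold per commodity."""
    totals = Counter()
    with open(path, newline="") as f:
        batch = []
        for row in csv.DictReader(f):
            batch.append(row)
            if len(batch) == CHUNK:          # process a full batch
                for r in batch:
                    totals[r["commodity"]] += int(r["quantity"])
                batch = []
        for r in batch:                      # process the final partial batch
            totals[r["commodity"]] += int(r["quantity"])
    return totals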

Note that the code first stores the data into a file and only then does the computation. This is not a necessary step; it is done only to save time on concurrent database retrieval when other targets are computed later. The code also uses batched reading/writing, big-data grouping techniques, and so on.
3. On the summary machine: join the original stock value, last month's sales volume and the current day's sales volume.

4. On the summary machine: compute the current day's actual stock and the out-of-stock-risk index, write the actual stock back to the inventory table, and write the products with a high risk index to the risk table.
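A combined plain-Python sketch of steps 3 and 4 (again illustrative rather than the original esProc code; the dictionary-shaped inputs and table names are assumptions): join stock, last-month sales and today's sales by commodity, compute the new stock and the risk index, and collect the rows to write back.

RISK_THRESHOLD = 0.8

def update_inventory_and_risk(stock, last_month_sales, today_sales):
    """stock / last_month_sales / today_sales: {commodity: value} dicts,
    already aggregated over all 49 stores."""
    new_stock, risk_rows = {}, []
    for commodity, qty in stock.items():
        remaining = qty - today_sales.get(commodity, 0)    # today's actual stock
        new_stock[commodity] = remaining
        month = last_month_sales.get(commodity)
        if month:   # skip commodities with no sales last month to avoid dividing by zero
            risk = 1 - remaining / month
            if risk > RISK_THRESHOLD:
                risk_rows.append((commodity, risk))
    # new_stock goes back to the inventory table, risk_rows to the risk table.
    return new_stock, risk_rows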


