ETL
process usually includes lot of business logics, which is beyond the reach of
visual ETL tools. Most of times they need to be implemented through hard
coding. For example, from the production database of a store, we need to ETL
the consumer’s purchasing record into the membership score table in the query
database. The business logic is: for every 10 Dollar of purchase in the
supermarket or fashion shops, the consumer will get 1 point in his/her scorecard.
For electronic products, the rate is 1 point every 20 Dollars, and jewelries 1
point every 50 Dollars. If the consumer purchased in the store for 5
consecutive days, the score will be doubled. Meanwhile, if the purchased amount
exceeds 1000 Dollar 3 times in a single month, the score will also be doubled.
Also, there might be such business logic: computing the mortgage balance for bank
customers based on changes of interest rates, or computing the current month
phone bill for telecom customers against different promotion policy.
The above
business logics can be implemented through hard coded ETL scripts tool, or
SQL/SP. ETL scripts can be written in any mainstream languages such as: TCL,
Perl, python, and script languages come with some ETL tools. These languages are
normally good at logic and flow control, or some simple data sorting
functionalities, such as: data type validation, adding the missing fields,
eliminating duplicate data, data internationalization and data standardization.
However, the lack of computing syntax or function (such as aggregation,
grouping, query, sorting, ranking, etc.) for batch structured data makes it difficult
to implement normal business logics, such as the above-mentioned membership
scores, mortgage balance, the monthly phone bills. To implement the above
business logics, programmers have to first realize the underlying functions for batch
process of structured data, which is a mammoth job. Thus ETL script is limited
to some use cases where the business logic is relatively simple.
The
difficulty for implementing business logic with ETL scripts leads to frequent
use of hard-coded SQL/SP. This is because SQL/SP have built-in computing functions
for batch structured data, and also support logic judgments and flow control
logic syntax for complex business logics. Under such circumstances ETL process
becomes ELT process (actually ELTL), and follows these steps generally:
1.Extract data from data sources
include business database/Txt/logs/Excel files.
2.Simply load the data into an intermediate
database, most of times direct loading without computation.
3. Carry out data Transform in the
intermediate database, through hard coded business logics in SQL/SP. This is
the core of the ELT process.
4. Loading the results into target
database, usually a data warehouse or non-production query database.
The benefits for using SQL/SP to implement
the data transformation lie in the easy realization of business logics. However,
the disadvantages are also obvious: poor performance and high cost.
Poor performance.
To
use SQL/SP for ELT business logic, you must first load raw data into intermediate
database. This is a simple data replication process without cleaning or
conversion. By nature it’s just writing the original data to hard disk files,
but it is extremely time-consuming for the database. Different from writing
files to the hard disk, inserting data into the database is a complex process,which
involves factors such as data consistency, log, constraint index, data block
continuity and partition, etc..Thus the I/O performance is extremely low for
database. Even after optimization, the speed of copying data to database is
usually a fraction of that writing to files.
The
raw data for ELT is generally big in volume. In the case of mortgage customers,
tens of millions of customers might be involved. For Telco companies, hundreds
of millions of call records are generated every day. In large chained stores
generate TB level of data every month in terms of consumer purchasing record.
It’s easy to imagine that loading the original data into database will result in a
sharp decline in performance, which would seriously affect the availability of
reports and business analysis.
High cost.
Another
obstacle for ELT is the high cost.
To
improve SQL/SP’ performance, users must upgrade the database. Traditional
non-MPP database upgrade is done by scale-up, which means adding the hardware
configuration to improve the performance. The disadvantage of this is that it’s
easy to reach the performance bottlenecks, and the performance improvement is not
substantial enough to solve the issues completely. MPP database supports
parallel computing and scale-out, which can realize efficient ELT process, but with
a cost concern. So far, all product-level MPP databases are proprietary and very
expensive. The use of MPP database means heavy up-front and upgrade project
cost for the enterprise.
In
addition, ELT requires a database to store the raw data temporarily, which
requires purchase of additional storage devices and licenses, in many cases
this will result in greater costs.
The
lack of syntax and functions for batch structured data computation in ETL
script, plus the poor performance and high cost of SQL/SP, makes it extremely
difficult for companies to implement business logic computation in ETL. They
are caught in helpless dilemma between heavy workload and low performance high cost
ELT. Therefore, ELT is often criticized for its long development cycle , late-come
and expired analysis conclusion , as well as budget overflow.
How
to implement ETL with business logic? How to limit the cost, reduce the
difficulty in development and achieve high performance? The use of esProc is one
of the options.
esProc
is a programming language for structured data computation. It provides a rich
library of built-in objects and functions, to implement complex business logic,
and reduce the threshold to transform businesses logic into code lines. Among
them, the ordered sets can resolve some typical SQL / SP problems, such as
relative position access, multi-level grouping cross-row computation, complex
ranking computation, etc.. In the Extract and Load stage, esProc supports both heterogeneous
data sources and heterogeneous data source computation, such as relational
databases, NoSQL databases and semi-structured data, local and LAN file data,
HDFS files.
esProc
has built-in batch process functions for structured data computation, which
makes the implementation of business logic computation easier, and offset the
weakness of ETL scripts. esProc can either read the source data directly, or copy
/ export them to multiple nodes and HDFS for computation, without prior data loading to
databases. This avoids inefficient database IO, thus greatly improving
performance. esProc also supports scale-out with parallel computing framework to
support big data computation, which ensures high performance and reduces
cost effectively. esProc leverages file system and cheap hard disk to store data
temporarily(in many cases temporary storage is not needed), which eliminated
the need for buying additional database storage device and licenses.
The
following example uses the “high-risk-for-out-of-stock-commodity” code snippets
to demonstrate how to use esProc.
There
is a chain store with 49 malls,of them each has its own databases.After the
store is closed at night, they need to do data aggregation and complete a
number of business computation. One of the goals is to compute for goods within
all the stores for which the out-of-stock-risk-index is higher than 80%. The algorithm
for out-of-stock rate is as: 1 - (current_stock_for_the_commodity / total_sales_in_last_month).
The higher the number the greater the risk is. The number “1” means the goods
is completely sold out.
1.Summary machine assigns tasks
to node machines.Every time each node machine aggregates the current day sales
data for a specific store, and then return the results to summary machine for aggregation of
all stores. The result is the total sales volume of the chain store for that
day.
2.Calculating the sales value for
a single store on node machine.
Note that in
the codes we first stored the data into file and then do the computation, which
is not a necessary step. It’s only for time saving during concurrent database
retrieving when computing other targets. Also we used batch reading/writing, and
big data grouping skills, etc..
3.
On summary machine: associate
original stock value, last month sales volume and current day sales volume.
4.
On summary machine: compute
current day actual stock, out-of-stock-risk-index, and write the actual stock
back to inventory table. Also write the products with higher
out-of-stock-risk-index back to risk table.
没有评论:
发表评论