Wednesday, July 16, 2014

Diverse Computing Ability is Key to Data Integration

Data integration is the process of combining distributed, hard-to-use, heterogeneous data sources into a single, centralized source that is easy to consume. Data integration demands diverse computing capabilities, including:

1. Cross-database computing. For example, the head office of a large department store chain wants to summarize and compute statistics over the data held in each outlet's own database.

2. Computing on structured data from non-database sources. For example, consolidating the contract data from the business department and the payment data from the finance department into project progress-payment information, where some of the data comes from Excel files.

3. Computing on semi-structured data. For example, weather conditions, bus schedules, flight information, and other related feeds; some of this data comes from EDI and some from web services (see the sketch after this list).
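As a minimal sketch of combining these three kinds of sources, the hypothetical Python script below (file names, table names, and columns are all assumptions, and Python with pandas stands in purely as an illustration) pulls rows from two outlet databases, reads an Excel file, and flattens a JSON feed into a table:

    import json
    import sqlite3                     # stand-in for the per-outlet databases
    import pandas as pd                # assumed available for Excel and joins

    # 1. Cross-database: pull sales rows from each outlet's own database
    sales = pd.concat(
        pd.read_sql("SELECT product_id, amount FROM sales", sqlite3.connect(db))
        for db in ("outlet_east.db", "outlet_west.db")   # hypothetical files
    )

    # 2. Non-database structured data: contract records kept in Excel
    contracts = pd.read_excel("contracts.xlsx")          # hypothetical file

    # 3. Semi-structured data: a JSON feed dumped from a web service
    with open("flights.json") as f:                      # hypothetical feed
        flights = pd.json_normalize(json.load(f))

    # All three now behave as ordinary tables for joins and aggregation
    summary = sales.groupby("product_id", as_index=False)["amount"].sum()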

The typical purpose of data integration is to implement specific business logic. Sometimes that logic runs while the source data is being retrieved, sometimes before heterogeneous sources are combined, and sometimes after consolidation. Business logic specific to a particular industry can hardly ever be implemented with the stock ETL functions, so hand-written computation becomes necessary, which is why diverse data computing capability matters so much. For example, the department store headquarters may need to know which products expire today or within three days, which products are sold out or likely to sell out soon, and which members should be awarded bonus points or membership extensions. Another example, this time over heterogeneous sources: find the projects whose payments were overdue three times in a row, excluding the last payment.
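For instance, the "overdue three times in a row" rule is exactly the kind of custom logic that stock ETL functions rarely cover. A rough Python sketch, assuming a hypothetical payments table with one row per installment and a boolean overdue flag, might look like this:

    import pandas as pd

    # Hypothetical schema: project_id, due_date, overdue (boolean) per installment
    payments = pd.read_csv("payments.csv", parse_dates=["due_date"])

    def overdue_three_in_a_row(flags):
        # flags: chronologically ordered overdue flags; drop the last payment
        run = 0
        for f in flags[:-1]:
            run = run + 1 if f else 0
            if run >= 3:
                return True
        return False

    late_projects = [
        pid for pid, g in payments.groupby("project_id")
        if overdue_three_in_a_row(list(g.sort_values("due_date")["overdue"]))
    ]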

SQL is the database query language; its computing capability is powerful enough to handle structured data, it can express hand-written business logic, and it acts as the data integration tool in most cases. However, SQL is weak at diverse computation and does not support data integration well. This shows up mainly in the following aspects:

1. SQL cross-database computation is impractical. SQL does demonstrate a certain degree of cross-database computing capability, but it is seldom used for cross-database computation in practice, because users run into low performance, instability, and the troubles of complex configuration and data conversion.

2. Structured data outside the database cannot be computed directly. Text files, Excel sheets, and other structured data outside the database cannot be used by SQL unless they are first cleaned, converted, and loaded into a database. To do so, users often need to deploy an additional database for computing and buy dedicated server and storage equipment, which also adds to the database administrator's workload.

Besides the increased cost, the additional time loss is another critical issue. Because SQL cannot compute data outside the database directly, the data must first be inserted into the database. Unlike writing to a plain file on disk, inserting into a database is a complicated procedure involving data consistency, logging, constraints and indexes, data block consistency, and partitioning, so its I/O performance is very low. Even when optimized, loading data into the database is often an order of magnitude slower than copying it directly to a file.

3. The pre-processing workload is huge for semi-structured data. SQL does not offer good support for website logs, XML, e-mails, HTML, or other semi-structured data. SQL users have to perform complex string parsing and convert the result into structured data for use in computation, which further adds to the SQL development workload.
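To give a feel for that pre-processing burden, the hedged Python sketch below turns an Apache-style access log into a structured table before any real computation can start; the regular expression and field names are assumptions:

    import re
    import pandas as pd

    # Assumed Apache-style log layout; adjust the pattern to the real log format
    LOG_RE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    )

    def parse_log(path):
        with open(path) as f:
            rows = [m.groupdict() for line in f if (m := LOG_RE.match(line))]
        return pd.DataFrame(rows)

    # Only after this parsing step does the log become computable as a table
    hits = parse_log("access.log")
    top_pages = hits.groupby("request").size().sort_values(ascending=False)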

As can be seen from the above, SQL is not powerful enough to handle diverse computation, and data integration built on it therefore suffers from poor performance, high cost, serious extra time loss, and a heavy development workload. Although ETL scripts support various data sources, they offer few functions for computing over massive structured data; as a result, ETL scripts can hardly implement ordinary business algorithms or play a key role in data integration.

To implement diverse computation in data integration, esProc is a good choice. Equipped with esProc, performance is good, cost is low, no additional time loss is incurred, and the computing functions over massive structured data are abundant.

Self-contained engine supporting heterogeneous data sources. esProc users can directly access various relational and NoSQL databases, so they can easily implement cross-database computing across heterogeneous or homogeneous databases. esProc can also retrieve structured data from text files and Excel sheets to support combined computation between heterogeneous sources. Its built-in text analyzer supports string parsing and regular expressions, so semi-structured data can be processed directly, and the processed data can be fed straight into combined computation.

Structured data computing functions. esProc is a programming language for structured data computing. With its abundant built-in objects and library functions, complex business logic can be implemented while lowering the effort of translating business logic into program code. In particular, its ordered sets solve problems that are notoriously hard in SQL and stored procedures, such as relative-position access, inter-row computation over multi-level groupings, and complex ranking.
esProc can be used for combined computation over heterogeneous data sources directly. Therefore, you need neither a database as a transit or temporary store nor a dedicated computing database deployment, and no additional time loss is incurred. esProc users get much higher performance while keeping the cost low, and esProc also supports inexpensive scale-out to improve performance further. Because esProc has abundant structured data computing functions, the business logic of data integration is easy to implement, which effectively improves development efficiency. A relative-position computation of the kind mentioned above is sketched below.
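As an illustration of relative-position access (a Python/pandas stand-in with made-up monthly figures, not esProc syntax), computing month-over-month growth requires each row to reference the previous row, which is awkward in plain SQL but natural over an ordered set:

    import pandas as pd

    # Made-up monthly sales; "growth" references the previous row's value
    sales = pd.DataFrame({
        "month": pd.period_range("2014-01", periods=6, freq="M"),
        "amount": [120, 150, 130, 170, 160, 200],
    })
    sales["growth"] = sales["amount"] / sales["amount"].shift(1) - 1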

Below is a code example of the general usage of esProc: a “cross-database salary computation”.

An enterprise needs to compute the actual salary of employees in different types of jobs based on employee performance and their respective base salaries. The base salary data is stored in the MSSQL database of the finance management software, and the performance scores are stored in the Oracle database of the performance assessment system. The typical business logic here is the procedure of converting performance scores into performance salary.

1. Retrieve the data from the heterogeneous databases, Oracle and MSSQL.

2. Implement the cross-database join computation.


3. Implement the business logic: compute the performance salary for employees in the sales department according to the performance score ranges.


The full esProc script implements these three steps.
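Since the esProc script itself is not reproduced here, a rough Python equivalent of the three steps, with hypothetical connection strings, table and column names, and made-up score-to-salary ranges, might look like this:

    import pandas as pd
    import pyodbc        # assumed MSSQL driver
    import cx_Oracle     # assumed Oracle driver

    # Step 1: retrieve data from the two heterogeneous databases
    # (connection strings, tables, and columns are hypothetical)
    ms = pyodbc.connect("DSN=finance;UID=user;PWD=pwd")
    orcl = cx_Oracle.connect("user/pwd@perf")
    salary = pd.read_sql("SELECT empID, name, baseSalary FROM salary", ms)
    scores = pd.read_sql("SELECT empID, score FROM performance", orcl)

    # Step 2: cross-database join on the employee ID
    emp = salary.merge(scores, on="empID")

    # Step 3: business logic - map each score range to a performance salary
    # (ranges and amounts are made up for illustration)
    def perf_salary(score):
        if score >= 90:
            return 1000
        if score >= 75:
            return 500
        if score >= 60:
            return 200
        return 0

    emp["actualSalary"] = emp["baseSalary"] + emp["score"].map(perf_salary)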
