Wednesday, July 16, 2014

Empowering Files with Computing Capability

If files are empowered with computing capability, you will benefit greatly: database I/O bottlenecks are reduced, database costs are cut, the pressure to expand the database is relieved, and data becomes easier to manage.
      
There are many files outside the database, each containing structured data, such as text files, Excel files, log files and binary files. To use this structured data in an application, the common practice is to import it into a database, compute it via the database interface, and eventually return the results.
      
However, performing computation only after importing files into the database has many disadvantages:
      
Additional workload. Importing files into the database usually requires extra ETL development. In many cases you also have to cope with a series of side tasks, such as incremental data processing, scheduler implementation, stored procedure development and database authorization configuration, which increases the workload further.
      
Increased database I/O load. Some files, such as web logs, contain a large amount of data. Once imported, this data occupies expensive database storage space and consumes database licenses, CPU resources and network I/O, all of which add to the database's I/O load.
      
Accelerated database expansion. As growing data pushes up the database I/O load, it drives the database toward expansion. More often than not, users have to upgrade by purchasing more expensive dedicated storage devices, database licenses and server CPUs. Sometimes expansion alone is not enough, and more upscale servers dedicated to the database must be bought as well.
      
Poor data manageability. Files can be managed in multi-level directories, organized chronologically, by line of business or by module relationship, which gives them high manageability. The database, by contrast, is a flat structure: it cannot manage data in multi-level directories and is only fit for managing a small number of tables. Since imported data is diversified, massive and often named arbitrarily, the database easily accumulates a large number of tables with obscure names. As time goes by, more and more meaningless data piles up that nobody understands but nobody dares to delete, and manageability gets worse and worse.
      
Since importing files into the database has so many disadvantages, why import them at all? Because the file itself lacks computing capability; to obtain that capability, you have to put the data into the database, and the defects above are the price you willingly pay for it. Conversely, if a file had computing capability, those defects would be easily avoided. Even more, you could export the redundant data of the database out as files, bringing about greater benefits!
      
There are two types of data in the database: core business data, and the application-specific redundant data derived from it (such as data summarized in advance to improve performance and speed up queries). At the initial stage of construction, the database is mainly filled with core data. However, as the business develops, the redundant data comes to exceed the core data in both order of magnitude and growth rate. Core data requires high security but is small in quantity, so it deserves to be stored in the expensive database. The redundant data, derived from core data, has lower security requirements and can be regenerated from core data if damaged, so it does not need to be stored in the database. Redundant data sits in the database not for the sake of security, but only for the computing capability.
      
Most of the storage space in the database is consumed by redundant data, owing to its large size and the many applications associated with it. This aggravates the database's I/O bottleneck, leads to relatively higher expansion costs, and worsens data manageability.

In summary, if we give files computing capability, we gain the following benefits:
      
No additional workload. Files outside the database can be loaded and computed directly, without being imported into the database, so the additional ETL workload is reduced to zero.

Easier to manage. By nature, files support multi-level directories, and they are much simpler and more efficient to copy, move and split than a database. This enables users to classify and manage data by specific rules such as business module or date. When an application goes offline, its data can be deleted by directory. Data management thus becomes simple and clear, with obviously reduced workload.

Lower cost. Since they are files, you can store them on inexpensive hard disks instead of purchasing more software and hardware dedicated to the database.

Reduced database I/O. If the redundant data, which has the largest volume and is used most intensively, is migrated out of the database, the database's I/O load diminishes and its performance naturally improves significantly.

Reduced database expansion pressure. With the database I/O load reduced, the critical point of expansion arrives much later. The database can continue to serve well, and a lot of expansion costs are saved.

Higher resource utilization. Storing data in files is by no means equivalent to discarding the database. On the contrary, files should store only the redundant data, whose security requirements are lower and volume is larger, while the database should still store the core data. In this way, files and the database each serve the storage purpose suited to their own characteristics, and resource utilization increases significantly.

Significantly improved overall performance. The database uses a proprietary interface with a congested channel, while the file system is open and accessible to any application, so its performance ceiling is much higher. More importantly, files can easily be copied and distributed to multiple cheap PCs. Users can thus take advantage of multi-node parallel computing to break through the throughput bottleneck, improving performance further and dramatically. Databases do offer many parallel solutions, but the sophisticated ones such as Teradata and Greenplum are quite expensive, while the free Hive and Impala are not yet mature enough in computing power to be widely adopted today.
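To make the idea concrete, here is a minimal Python sketch (not esProc) that spreads per-file computation across CPU cores using the standard multiprocessing module. The orders/*.txt layout and the line-counting task are purely hypothetical; the same pattern extends to giving each of several cheap PCs its own subset of files.

# Hypothetical sketch: process a directory of data files in parallel
# on one machine; across machines, each node would take its own files.
import glob
from multiprocessing import Pool

def count_orders(path):
    # Stand-in for any per-file computation (filtering, aggregation, ...)
    with open(path) as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    files = glob.glob("orders/*.txt")   # hypothetical file layout
    with Pool() as pool:                # one worker per CPU core
        counts = pool.map(count_orders, files)
    print("total orders:", sum(counts))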

File computation has many advantages. However, its critical obstacle is precisely that the file itself lacks computing capability, so a dedicated tool is needed to implement computation over files outside the database. R, MapReduce, high-level languages and esProc are all such tools; among them, esProc integrates the most comprehensive functionality with a number of advantages. Let’s introduce it in detail.

Supporting file computing with its own engine. esProc can read structured data directly from text, Excel and binary files, which makes it easy to support file computation outside the database. esProc also enables direct access to various relational databases and semi-structured data, so mixed computation over multiple data sources is easy to implement.
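As a rough illustration of what computing directly over files looks like, the following Python/pandas sketch joins a text file with an Excel file, with no database involved; all file and column names are hypothetical.

# Hypothetical sketch: compute directly over structured files of
# different formats, analogous to esProc's native file access.
import pandas as pd

logs = pd.read_csv("access_log.txt", sep="\t")   # text file, hypothetical
products = pd.read_excel("products.xlsx")        # Excel file, hypothetical

# Join the two file sources and aggregate, entirely outside the database
joined = logs.merge(products, on="product_id")
print(joined.groupby("category")["hits"].sum())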
      
Rich built-in computing functions. esProc is a programming language aimed at structured data computing, with rich built-in objects and library functions. It can express complex business logic while lowering the threshold of converting business logic into code. For example, its ordered sets can solve the typical puzzles of SQL and stored procedures, including relative-position access, inter-row computation within multi-level groups, and complex ranking operations.
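For instance, relative-position access means a row referring to its neighbors, which is awkward in plain SQL but natural over an ordered set. Here is a minimal pandas sketch of the idea (the data is hypothetical), computing month-over-month growth by referencing each previous row:

# Hypothetical sketch: inter-row (relative-position) computation,
# the kind of ordered operation the text says esProc handles natively.
import pandas as pd

sales = pd.DataFrame({
    "month": ["2014-01", "2014-02", "2014-03"],
    "amount": [100, 120, 90],
})
# Month-over-month growth: each row refers to the previous row's value
sales["growth"] = sales["amount"] - sales["amount"].shift(1)
print(sales)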
      
Low-cost scale-out. esProc supports multi-node parallel computing for large-file computation outside the database, effectively reducing costs while ensuring high performance. esProc itself costs almost nothing, and there is no need to purchase dedicated storage devices or licenses. It runs on Windows/Linux/Unix, on midrange and high-end servers as well as low-cost PCs, with better scalability than a database.
      
Let’s take an example to explain how esProc implements file computation. An enterprise uses files to store historical and redundant data, including historical orders, while the database stores core and current data, including the current month's orders and complete customer information. Given a list of commodities passed in from outside, find how many customers have purchased all the listed items since the beginning of last year; after the computation completes, the customers' IDs and names must be returned to the reporting tool. The esProc code is shown as follows:
(esProc code grid, cells A1-A13; each step is explained below)
A1-A3: Read the history data from the file, retrieve the current month's data from the database, then merge the two.

A4-A5: Filter the orders by date and by the item list from outside (usually from the reporting tool). Note that this example assumes the data volume is not very large, so we filter after merging. If the actual data volume is large, filter before merging; if it grows further, read and compute in segments via a file cursor; for truly huge volumes, use the multi-node parallel computing mechanism.
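As an aside, cursor-style segmented reading can be sketched in Python/pandas, assuming a tab-separated file with an ISO-formatted order_date column (both hypothetical):

# Hypothetical sketch of cursor-style segmented reading: process a
# large order file in fixed-size chunks instead of loading it whole.
import pandas as pd

total = 0
for chunk in pd.read_csv("orders_history.txt", sep="\t", chunksize=100_000):
    recent = chunk[chunk["order_date"] >= "2013-01-01"]  # filter per chunk
    total += len(recent)
print("orders since last year:", total)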

A6-A10: Count the customers who have purchased all the products. Since a customer may purchase the same item more than once, the orders are grouped by customer, and within each group we find the distinct items purchased (equivalent to DISTINCT in SQL). Counting these distinct items shows how many kinds of items each customer has bought; if the count matches the length of the item list, that customer has purchased every item on the list.

A11-A12: Retrieve the complete customer IDs and names from the database, then find the names corresponding to the customer IDs obtained in A10.

A13: The computed result is returned to the reporting tool.
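For readers who want to trace the logic in a general-purpose language, here is a hedged Python/pandas analogue of steps A1-A13; every file, table and column name below is a hypothetical stand-in, not the enterprise's actual schema or the esProc script itself.

# Hypothetical pandas sketch mirroring steps A1-A13 described above.
import sqlite3
import pandas as pd

item_list = ["apple", "pear", "plum"]            # passed in from outside

# A1-A3: read history orders from the file, current orders from the DB, merge
history = pd.read_csv("orders_history.txt", sep="\t")
conn = sqlite3.connect("business.db")
current = pd.read_sql("SELECT * FROM orders_current", conn)
orders = pd.concat([history, current], ignore_index=True)

# A4-A5: filter by date and by the item list (filter-after-merge variant)
orders = orders[(orders["order_date"] >= "2013-01-01")
                & (orders["item"].isin(item_list))]

# A6-A10: per customer, count distinct items; keep those who bought them all
distinct_counts = orders.groupby("customer_id")["item"].nunique()
buyers = distinct_counts[distinct_counts == len(item_list)].index

# A11-A12: fetch customer names from the DB and match them to the IDs
customers = pd.read_sql("SELECT customer_id, name FROM customers", conn)
result = customers[customers["customer_id"].isin(buyers)]

# A13: hand the result back to the reporting tool (here, just print it)
print(result)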

