Redundant data is often the main driver of database capacity expansion. Data in a database falls into two types: core business data and redundant data. The latter is derived from the former and serves specific applications, for example to boost performance or to speed up queries against pre-summarized data. In the initial stage of building the database, the core data plays the primary role. As the business grows, the redundant data outstrips the core data in both volume and growth rate, becoming the main cause of database capacity expansion.
Redundant data can be stored outside the database. Core data demands high security; it is small in quantity and deserves expensive database storage. Redundant data, being derived from the core data, has much lower security requirements: if it is damaged, it can always be regenerated from the core data. It can therefore safely be stored outside the database. More often than not, redundant data is put into the database not for security, but to borrow the database's computing power to process it for applications. On the other hand, keeping redundant data in the database worsens data manageability. Database tables are flatly structured, which suits a relatively small set of core tables but offers nothing like a multi-level directory. Redundant data comes in many types, in great quantity, and is often named haphazardly, so the database gradually fills with tables whose obscure names nobody can decipher. As time goes by, more and more meaningless data accumulates that no one dares to clear because no one knows what it actually represents, and expensive database storage is tied up for years. To meet the needs of capacity expansion, users have to purchase more upscale dedicated servers, larger dedicated storage equipment, and more licenses.
Redundant data causes performance bottlenecks. The database access interface is private: when the data volume is relatively small, this private interface both protects the manufacturer's interests and acts as an effective security measure. But once the data volume grows huge, a large volume of redundant data and highly concurrent access flood into this single, closed interface all at once, producing an unpredictable throughput bottleneck.
As can be seen, using the database to hold redundant data brings many disadvantages when expanding capacity: high cost, poor manageability, limited relief of expansion pressure, and serious waste of resources. By comparison, file expansion is a better means to lower cost, simplify management, improve resource utilization, and deliver obvious results. File expansion stores the continually growing redundant data in files and accesses them directly through the open interface of the operating system, ultimately enabling parallel access, data computing, and data management, as shown in the figure below:
File expansion has four advantages: low cost, convenient management, high resource utilization, and a remarkable performance boost.
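To make the pattern concrete, here is a minimal Python sketch of file expansion, assuming a hypothetical pre-summarized sales table exported as CSV text: the data is read through the ordinary file interface and aggregated entirely outside the database.

```python
import csv
import io

# Hypothetical pre-summarized sales data that would otherwise occupy a
# redundant database table; here it is held as plain CSV text.
data = """region,month,amount
East,2014-01,120
East,2014-02,150
West,2014-01,90
West,2014-02,110
"""

# Read the file-style data through the open interface and aggregate it
# outside the database -- no SQL engine involved.
totals = {}
for row in csv.DictReader(io.StringIO(data)):
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])

print(totals)  # {'East': 270, 'West': 200}
```

In a real deployment the `io.StringIO` wrapper would simply be `open()` on a data file sitting on an inexpensive disk.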
Low cost. Since the data is stored in files, expansion only requires adding inexpensive hard drives. There is no need to purchase expensive software or hardware such as dedicated servers, dedicated storage equipment, or database licenses.
Convenient management. Files support multi-level directories and are by nature much simpler and more efficient than a database to copy, transmit, and split. This lets users classify and manage data by rules such as business module or time period. When an application goes offline, its data can be deleted by directory. Data management thus becomes simple and clear, and the workload drops noticeably.
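A small sketch of the directory-based management described above, using a hypothetical `<root>/<module>/<month>` layout; the module names and file contents are placeholders for illustration only.

```python
import os
import shutil
import tempfile

# Hypothetical multi-level layout for redundant data:
#   <root>/<business module>/<year-month>/data.csv
root = tempfile.mkdtemp()
for module in ("orders", "logs"):
    for month in ("2014-01", "2014-02"):
        d = os.path.join(root, module, month)
        os.makedirs(d)
        with open(os.path.join(d, "data.csv"), "w") as f:
            f.write("placeholder\n")

# When the hypothetical "logs" application goes offline, all of its
# redundant data disappears with a single directory deletion -- no
# table-by-table cleanup inside a database.
shutil.rmtree(os.path.join(root, "logs"))

remaining = sorted(os.listdir(root))
print(remaining)  # ['orders']

shutil.rmtree(root)  # remove the sketch's scratch directory
```

Compare this one-line cleanup with hunting down a set of obscurely named redundant tables inside a flat database namespace.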
High resource utilization. Storing data in files by no means amounts to discarding the database. On the contrary, files should hold the large volume of redundant data with its low security requirements, while the database should continue to hold the core data. In this way, file storage and database storage each serve the purpose suited to their characteristics, and resource utilization rises significantly.
Obvious effect on alleviating pressure. Any program can access files through the open interface of the operating system, so the congested channel is relieved greatly and the performance ceiling rises accordingly. More importantly, files can be conveniently copied and distributed to multiple machines, letting users exploit multi-node parallel computing to remove the throughput bottleneck and improve performance further still. Databases do offer parallel solutions, but sophisticated ones such as Teradata and Greenplum are quite expensive, while the computing power and maturity of the free Hive and Impala are not yet great enough for wide adoption.
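The multi-node parallel pattern can be sketched in a few lines of Python. This is an illustrative stand-in, not esProc's actual mechanism: the "nodes" are threads, and the "file partitions" are lists, but the split-compute-merge shape is the same.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dataset split into 4 partitions, one per computing node,
# the way a large data file would be copied out to 4 machines.
partitions = [list(range(i, 1000, 4)) for i in range(4)]

def node_sum(partition):
    # Each node computes a partial result over its own partition.
    return sum(partition)

# Run the partial computations in parallel, then merge the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(node_sum, partitions))

total = sum(partials)
print(total)  # 499500, the same as summing the whole dataset at once
```

Because each node reads only its own file, no single channel has to carry all of the traffic.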
File expansion has many advantages. However, a file itself lacks computing capability, so a dedicated tool is needed to perform the computation outside the database. Tools such as R, MapReduce, high-level languages, and esProc can all implement computing outside the database, each with its own advantages and disadvantages.
R is a computing tool for scientists, with a rich set of extension packages and quite strong computing capability. However, its syntax and functions are far too specialized for ordinary programmers to grasp. R is mainly used for desktop computing, so it is hard to integrate with reporting tools or with Java, C#, and other applications. Moreover, R's parallel computing ability is so weak that users have to combine it with third-party tools.
MapReduce's main advantage is inexpensive scale-out. It is not only a quite powerful computing tool but also a programmer-friendly one, given its seamless integration with Java. However, MapReduce provides no underlying computing functions: every basic computation must be hand-coded by programmers, and the development workload is huge. In addition, it is an undeniable fact that MapReduce sacrifices performance to ensure high fault tolerance.
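The hand-coding burden is easy to see in a sketch. A group-and-sum that is a single GROUP BY in SQL must be written out as explicit map, shuffle, and reduce phases in the MapReduce style; the Python below mimics those phases on toy data (a real job would additionally need Java mapper/reducer classes and job configuration).

```python
from collections import defaultdict

# Toy input: (region, amount) records that SQL would aggregate with
# one "SELECT region, SUM(amount) ... GROUP BY region".
records = [("East", 120), ("West", 90), ("East", 150), ("West", 110)]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(region, amount) for region, amount in records]

# Shuffle phase: collect all values belonging to the same key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each key's values into the final result.
reduced = {key: sum(values) for key, values in grouped.items()}

print(reduced)  # {'East': 270, 'West': 200}
```

Every one of these phases is the programmer's responsibility in MapReduce, which is why even simple aggregations carry a real development cost.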
Java, VB, Perl, and other high-level languages can also implement computing outside the database, but doing so is far more difficult: even the parallel computing framework has to be implemented manually by programmers.
On a standalone machine, esProc offers performance close or superior to that of a database, and it also supports inexpensive scale-out and parallel computing, so its overall performance is outstanding. It supports the JDBC interface for convenient integration with reporting tools and Java, and it comes with a great number of built-in library functions for structured data computing, which keeps development difficulty relatively low. Its weakness is that esProc does not support large-scale clusters well; in this respect, it is inferior to MapReduce.
Let’s take esProc as an example to explain how computing outside the database can alleviate expansion pressure and remove the throughput bottleneck for the database.
A company has its database configured as follows:
Server: Dedicated database server with an 8-core CPU and 32 GB of memory (supports up to 8 CPUs and 64 GB of memory).
Storage equipment: Dedicated disk cabinet + dedicated server disks with 2 TB of space.
Upgrading the database alone:
Server: Purchase more upscale servers with a 16-core CPU and 128 GB of memory.
Storage equipment: Keep the original disk cabinets and hard disks, add 8 TB of space, bringing the total to 10 TB.
License: Increase to 20.
File expansion with esProc:
Server: Keep it unchanged.
Storage equipment: Keep the original disk cabinet and migrate the 1 TB of redundant data to the file computing nodes; the remaining 1 TB of space is enough for the next 3 years (if not migrated, it is enough to meet the needs of the next 5 years).
License: Unchanged (it could actually be reduced).
File computing nodes: 4 ordinary PCs, each configured with 4 CPUs, 8 GB of memory, and a 2 TB ordinary hard disk.
With esProc for file expansion, the database server can serve for at least another 2-5 years without upgrading to more upscale hardware, and the 4 newly added file computing nodes are enough for the next 2 years. Thanks to parallel computing, the network load and CPU pressure are alleviated greatly. Because the nodes are ordinary PCs, their hard disks, CPUs, and memory cost far less than those of a database-dedicated server. If an upgrade is needed 2 years later, only some new file computing nodes have to be added, at quite low cost.