If files are given computing capability, you benefit greatly: database output bottlenecks shrink, database costs fall, expansion pressure eases, and data becomes easier to manage.
Many files outside the database contain structured data: text files, Excel files, log files, and binary files. To use this structured data in an application, the common practice is to import it into a database, compute it through the database interface, and finally return the results.
However, importing files into the database before computing has many disadvantages:
Additional workload. Importing files into the database usually requires extra ETL development. In many cases you must also handle incremental data processing, scheduler implementation, stored procedure development, and database authorization configuration, all of which further increase the workload.
Increased database output load. Some files, such as web logs, contain large volumes of data. Once imported, this data consumes expensive database storage, license capacity, CPU, and network I/O, all of which add to the database's output load.
Accelerated database expansion. The heavier output load pushes the database toward expansion. More often than not, users have to upgrade by purchasing more expensive dedicated storage devices, database licenses, and server CPUs; sometimes expansion alone is not enough, and more upscale servers dedicated to the database must be bought as well.
Poor data manageability. Files can be organized in multi-level directories, in chronological order, by line of business, or by module relationship, and are therefore highly manageable. A database, by contrast, has a flat structure: it cannot organize data in multi-level directories and is only fit for managing a small number of tables. Because imported data is diverse, massive, and often named arbitrarily, the database easily accumulates many tables with obscure names. As time goes by, more and more meaningless data piles up that nobody understands but nobody dares to delete, and manageability gets worse and worse.
Since importing files into the database has so many disadvantages, why do it at all? Because a file itself lacks computing capability; to obtain that capability, you have to put the data into the database, and these drawbacks are the price you willingly pay for it. If a file had computing capability, these defects would disappear. Better still, you could export the database's redundant data back out as files, bringing even greater benefits!
A database holds two types of data: core business data and the application-specific redundant data derived from it (such as data summarized in advance to improve performance and speed up queries). At the initial stage of construction, the database mainly holds core data. However, as the business develops, the redundant data comes to exceed the core data by orders of magnitude and grows faster. Core data demands high security but is small in volume, so it belongs in the expensive database. Redundant data, being derived from core data, has a lower security requirement and can be regenerated from the core data if damaged, so it does not need to be stored in the database; it is kept there not for security but only for the computing capability.
Because of its large size and the number of applications that use it, redundant data consumes most of the database's storage, which worsens the output bottleneck, raises the cost of expansion, and degrades data manageability.
In summary, giving files computing capability brings the following benefits:
No additional workload. Files outside the database can be loaded and computed directly, with no need to import them into the database, so the extra ETL workload drops to zero.
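As a minimal sketch of the idea (in plain Python rather than esProc, with hypothetical in-memory order data standing in for a file on disk), computing directly on a structured file without any import step might look like:

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated order data; in practice this would be a file on disk.
data = io.StringIO(
    "orderID\tcustomer\tamount\n"
    "1\tA\t100\n"
    "2\tB\t250\n"
    "3\tA\t50\n"
)

# Compute directly on the file content: no database import, no ETL step.
totals = defaultdict(float)
for row in csv.DictReader(data, delimiter="\t"):
    totals[row["customer"]] += float(row["amount"])

print(dict(totals))  # per-customer order totals
```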
Easier management. Files natively support multi-level directories, and they are much simpler and more efficient to copy, move, and split than a database. Users can classify and manage data by rules such as business module or date. When an application goes offline, its data can be deleted by directory. Data management becomes simple and clear, with noticeably less work.
Lower cost. Since they are files, you can store them on inexpensive hard disks instead of purchasing more software and hardware dedicated to the database.
Reduced database output. Once the redundant data, the largest in volume and the most intensively used, is migrated out of the database, the output load shrinks and the database's performance improves significantly.
Reduced database expansion pressure. A lighter output load significantly delays the point at which expansion becomes necessary. The database can continue to serve well, and you save substantial expansion costs.
Higher resource utilization. Storing data in files by no means discards the database. On the contrary, files should store only the redundant data whose security requirement is lower and whose volume is larger, while the database continues to store the core data. Files and database then each serve the storage purpose suited to their own characteristics, and resource utilization rises significantly.
Significantly improved overall performance. A database exposes a proprietary interface over a congested channel, whereas a file system is open and accessible to any application, so its performance ceiling is much higher. More importantly, files can easily be copied and distributed across multiple cheap PCs, so a multi-node parallel computing mechanism can break through the throughput bottleneck and improve performance further still. Parallel solutions do exist inside databases, but mature ones such as Teradata and Greenplum are quite expensive, while the free Hive and Impala are not yet mature enough in computing power to be widely adopted today.
File computation has many advantages, but its critical drawback is that a file itself lacks computing capability, so a dedicated tool is needed to compute over files outside the database. R, MapReduce, high-level languages, and esProc are all such tools; among them, esProc integrates the most comprehensive functionality and has a number of advantages, so let's introduce it in detail.
File computing with its own engine. esProc reads structured data directly from text, Excel, and binary files, making file computation outside the database easy to support. It also accesses various relational databases and semi-structured data directly, so mixed computation over multiple data sources is straightforward.
Rich built-in computing functions. As a programming language aimed at structured data computing, esProc offers rich built-in objects and library functions capable of expressing complex business logic, while lowering the threshold of turning that logic into code. For example, its ordered sets solve typical puzzles of SQL and stored procedures, including relative position access, inter-row computation within multi-level groups, and complex ranking operations.
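To make "inter-row computation within groups" concrete, here is a small sketch in plain Python (not esProc; the order data is hypothetical): for each customer, compute how each order's amount changed relative to the previous order, a relative-position operation that is awkward to express in plain SQL.

```python
from itertools import groupby

# Hypothetical (customer, amount) orders, already sorted by customer and order time.
orders = [("A", 100), ("A", 150), ("A", 120), ("B", 200), ("B", 260)]

# Within each customer group, compute each order's delta versus the previous order.
deltas = {}
for customer, rows in groupby(orders, key=lambda r: r[0]):
    amounts = [amount for _, amount in rows]
    deltas[customer] = [b - a for a, b in zip(amounts, amounts[1:])]

print(deltas)  # {'A': [50, -30], 'B': [60]}
```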
Low-cost scale-out. esProc supports multi-node parallel computing for large files outside the database, effectively reducing cost while ensuring high performance. esProc is almost free: there is no need to purchase dedicated storage devices or licenses, and it runs on Windows/Linux/Unix, on midrange and high-end servers as well as low-cost PCs, with better scalability than a database.
Let's take an example of how esProc implements file computation. An enterprise stores historical and redundant data, including historical orders, in files; the database stores core and current data, including the current month's orders and complete customer information. Given a list of products supplied from outside, find which customers have purchased all the listed items since the beginning of last year; the customers' IDs and names are then returned to the reporting tool. The esProc code is as follows:
A1-A3: Read the historical data from the file, retrieve the current month's data from the database, and merge the two.
A4-A5: Filter the orders by date and by the item list passed in from outside (usually by the reporting tool). Note that in this example the data volume is assumed to be small, so we filter after merging. If the actual volume is large, filter before merging; if it grows further, read and compute in segments via a file cursor; for truly huge volumes, use the multi-node parallel computing mechanism.
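The filter-before-merging and cursor strategies amount to streaming rows through a filter instead of loading everything first. A rough Python sketch (file contents and field names are hypothetical, standing in for a large history file):

```python
import csv
import io

def filtered_orders(lines, wanted_items, since):
    """Stream-filter order rows one at a time, without loading the whole
    file into memory: the same idea as filtering before merging, or
    reading in segments via a file cursor."""
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["item"] in wanted_items and row["date"] >= since:
            yield row

# Hypothetical order data standing in for a large history file.
history = io.StringIO(
    "orderID\tcustomer\titem\tdate\n"
    "1\tA\tpen\t2013-01-05\n"
    "2\tB\tink\t2012-11-02\n"
    "3\tA\tink\t2013-02-10\n"
)

kept = list(filtered_orders(history, {"pen", "ink"}, "2013-01-01"))
print([r["orderID"] for r in kept])  # only rows on or after the cutoff date
```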
A6-A10: Count the distinct products each customer has purchased. Since a customer may buy the same item more than once, group the orders by customer and, within each group, find the non-duplicated items purchased (equivalent to distinct in SQL). Counting these distinct items tells how many kinds of items the customer has bought; if that count equals the length of the item list, the customer has purchased every item in the list.
A11-A12: Retrieve the complete customer IDs and names from the database, and look up the names corresponding to the customer IDs found in A10.
A13: Return the computed result to the reporting tool.
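Since the esProc script itself is not reproduced above, a rough Python equivalent of the whole A1-A13 flow may help; all data here is hypothetical and in-memory, standing in for the file and database sources:

```python
from collections import defaultdict

# A1-A3: historical orders "from the file" and current-month orders "from the
# database", merged. Each record is (customer_id, item, date).
history = [(1, "pen", "2013-03-01"), (1, "ink", "2013-04-02"), (2, "pen", "2013-05-03")]
current = [(2, "ink", "2014-06-01"), (3, "pen", "2014-06-02")]
orders = history + current

# A4-A5: filter by date and by the externally supplied item list.
item_list = {"pen", "ink"}
since = "2013-01-01"
filtered = [(c, i) for c, i, d in orders if d >= since and i in item_list]

# A6-A10: group by customer and count distinct items; a customer who bought as
# many distinct listed items as the list contains has bought them all.
bought = defaultdict(set)
for customer, item in filtered:
    bought[customer].add(item)
qualified_ids = [c for c, items in bought.items() if len(items) == len(item_list)]

# A11-A12: join with complete customer names "from the database".
customers = {1: "Ann", 2: "Bob", 3: "Cid"}
result = [(c, customers[c]) for c in qualified_ids]

# A13: hand the result to the reporting tool (printed here).
print(result)  # customers who purchased every listed item
```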