In Java, computing over database data via SQL is a well-established practice. However, structured data is not stored only in databases; it also lives in text, Excel, and XML files. How, then, should we compute over structured data that comes from non-database files? This article offers three solutions for your reference: implement the computation with the Java API, convert it to database computation, or adopt a common data computation layer.
Implement via the Java API. This is the most straightforward method. With the Java API, programmers can control every computational step precisely, inspect the intermediate result of each step, and debug conveniently. Needless to say, zero learning cost is another advantage of the Java API.
Thanks to the mature APIs for reading data from, and writing data back to, TXT, Excel, and XML files, Java is technically capable of fully supporting such computations, especially the simpler ones.
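For instance, a plain text file can be read with nothing more than the standard I/O classes. Below is a minimal sketch; the file layout (tab-separated, with a header line) and the column handling are assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TxtReader {
    // Read a tab-separated file whose first line holds the column names
    // (the layout is an assumption for illustration).
    public static List<Map<String, String>> read(String path) throws Exception {
        List<Map<String, String>> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String[] headers = in.readLine().split("\t");
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t");
                Map<String, String> row = new HashMap<>();
                for (int i = 0; i < headers.length; i++) {
                    row.put(headers[i], i < fields.length ? fields[i] : "");
                }
                rows.add(row);
            }
        }
        return rows;
    }
}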
However, this method involves a heavy workload and is quite inconvenient.
For example, since the common data algorithms are not built into Java, programmers have to spend considerable time and effort implementing aggregation, filtering, grouping, sorting, and other common operations from scratch.
For another example, when storing data and retrieving detail data through the Java API, programmers have to represent every record and two-dimensional table with List, Map, and other objects, and then compute in multi-level nested loops. Moreover, such computation usually involves set operations and relational computations on large volumes of data, as well as computations between objects and object properties. Implementing the underlying logic takes great effort, and handling complex ordered computation takes even more.
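To make the workload concrete, here is a small sketch of a hand-written group-and-sum over the rows loaded above, the kind of logic SQL expresses in a single GROUP BY clause; the field names "region" and "amount" are assumptions for illustration:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ManualGroupBy {
    // Sum the "amount" field per "region" with an explicit loop
    // (field names are assumptions for illustration).
    public static Map<String, Double> sumByRegion(List<Map<String, String>> rows) {
        Map<String, Double> totals = new HashMap<>();
        for (Map<String, String> row : rows) {
            String region = row.get("region");
            double amount = Double.parseDouble(row.get("amount"));
            totals.merge(region, amount, Double::sum);
        }
        return totals;
    }
}

Every filter, join, or sort requires a similar hand-rolled loop, which is where the workload accumulates.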
To reduce the programming workload, programmers generally prefer leveraging existing algorithms to implementing every detail themselves. In view of this, the second method below may be a better choice:
Convert to database computation. This is the most conservative method. Concretely, it means importing the non-database data into a database via common ETL tools such as DataStage, DTS, Informatica, and Kettle. The advantages of this practice include high computational efficiency, stable operation, and less workload for Java programmers. It fits scenarios with large data volumes, high performance demands, and medium computational complexity. These advantages are especially evident for mixed computation over database and non-database files.
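When a full ETL tool is not available, the import step itself can be hand-rolled in Java. The sketch below loads the rows read earlier into a database with a JDBC batch insert; the connection URL, credentials, table, and column names are assumptions for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.Map;

public class TxtToDb {
    // Batch-insert text rows into a sales table
    // (URL, credentials, table and column names are assumptions).
    public static void load(List<Map<String, String>> rows) throws Exception {
        String url = "jdbc:mysql://localhost:3306/demo";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO sales (region, amount) VALUES (?, ?)")) {
            for (Map<String, String> row : rows) {
                ps.setString(1, row.get("region"));
                ps.setDouble(2, Double.parseDouble(row.get("amount")));
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}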
The main drawbacks of this method are the heavy workload in the early ETL stage and the difficulty of maintenance. First, since non-database data cannot be used directly without field splitting, merging, and validation, programmers have to write a great many Perl/JS scripts to clean and reorganize the data. Second, the data is usually updated over time, so the scripts must also handle incremental updates. Data from various sources can hardly be normalized into a consistent form, so it is often unusable before a second- or even third-level ETL pass. Third, scheduling becomes a problem when there are many tables: which table must be loaded first, which one second, and at what interval? In fact, the workload of ETL is always beyond expectation, and the project risk is hard to avoid. In addition, the real-time performance of ETL is poor because the data must regularly transit through the database.
In some operating environments there may be no database service at all, for reasons of security or performance. In other cases, most data is stored in TXT/XML/Excel files and no database is involved, which makes ETL pointless. What can we do then? Let's try the third method:
Adopt a common data computation layer. This method is typified by esProc and R. A data computation layer sits between the data persistence layer and the application layer. It is responsible for computing the data from the persistence layer in a uniform way and returning the result to the application layer. In Java, a data computation layer is mainly used to reduce the coupling between the application layer and the data persistence layer, and to relieve the computational pressure on both.
A common data computation layer offers direct support for various data sources, not only databases but also non-database sources. Taking advantage of this, programmers can access those sources directly, without worrying about issues such as real-time availability. In addition, they can conveniently implement computations that mix several data sources, for example between DB2 and Oracle, or between MySQL and Excel. In the past, such cross-source access was by no means easy to implement.
Such versatile computation layers are usually more specialized in structured data; for example, they support generic types, explicit sets, and ordered arrays. Consequently, complex computational goals that are tough jobs for ETL/SQL and other conventional tools can be solved easily with this layer.
The drawback of this method mainly lies in performance. A common data computation layer typically computes entirely in memory, so the size of memory sets the upper limit on the data volume it can handle. However, both esProc and R support Hadoop directly, so their users can handle big data in a distributed environment.
The main difference between esProc and R is that esProc supports direct JDBC output and integrates with Java code conveniently. In addition, the esProc IDE is much easier to use, with support for true debugging, scripts laid out in a grid, and cell names for directly referencing intermediate results. R does not offer these advantages, nor JDBC support, so it is a bit more complex for R users to integrate. On the other hand, R supports correlation analysis and other model analyses, so R programmers do not have to implement every detail to obtain the result. R also supports TXT/Excel/XML files and many more non-database data sources, whereas esProc supports only two of them. Last but not least, the free edition of R is fully open source.
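As an illustration of the JDBC-style integration, the sketch below shows how a Java program might call a computation-layer script through a JDBC connection. The driver class name, connection URL, and script name are assumptions for illustration and should be checked against the product documentation:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ComputeLayerCall {
    public static void main(String[] args) throws Exception {
        // Driver class, URL, and script name below are assumptions for illustration.
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = con.createStatement();
             // Call a pre-written script that, for example, groups a text file by region
             ResultSet rs = st.executeQuery("call groupSales()")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}

To the Java application, the result comes back as an ordinary ResultSet, which is what keeps the integration cost low.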
The above is a comparison of the three methods; you can choose the right one based on the characteristics of your project.