Data processing with multi-database, off database, log, txt and excel: 七月 2014

2014年7月31日星期四

Code Examples of Foreign Key Function in esProc

If field (or combination of fields) k is table M’s primary key, and k, at the same time, exists in table B, then k is regarded as the foreign key of B. The foreign key maintains a relationship between two tables and is one of the most important concepts in structured data computing. Through object references, esProc makes foreign key easy to function. The following examples aim to illustrate the operation.

Example 1 Create relationship between a referenced table and a referencing table
order is the referencing table, emp is the referenced table, connect emp to order. Display emp's Name field, Gender field and Salary field and order's OrderID field and Amount field.
Note: Only tables emp and order are used here. Later, table dep is to be used in other examples. The relationship between the three tables over foreign key is shown as follows:

Data may originate from databases or texts. For example:

order=esProc.query("select OrderID,SellerId,Amount,OrderDate from sales")
emp=esProc.query("select EId,Name,Gender,Birthday,Dept,Salary from emp")
dep=esProc.query("select * from department")
esProc code:
A3=order.switch(SellerId, emp:EId)
A4=order.new(OrderID,Amount,SellerId.Name,SellerId.Gender,SellerId.Salary)
Computed results:

Code interpretation:
A3: Replace records of SellerID in order with their counterparts in emp, create foreign key relationship between the two tables.
A4: Get OrderID field and Amount field in order, and get Name, Gender and Salary field in emp through related references. We can see that, with object references, fields in emp can be accessed to directly from order. Thus complex and difficult join statements can be skipped.

Example 2: Query referencing table according to condition of referenced table
Find orders signed by female sellers whose salary is greater than 10,000.
esProc code:
A3=order.switch(SellerId, emp:EId) / the same as above example
A5=order.select(SellerId.Salary>10000 && SellerId.Gender=="F")
Computed results:

Click the above blue hyperlinks and corresponding employee information will be shown:

Example 3: Group according to referenced table
Compute sales amount of each department.
esProc code:
A3=order.switch(SellerId, emp:EId) / the same as above example
A5=order.groups(SellerId.Dept;sum(Amount))
Computed results:

Field names can be changed, like order.groups(SellerId.Dept:dt;sum(Amount):amt). The effect of name change is shown below:

Example 4: Complex association between multiple tables
For departments whose sales amount is greater than 50,000, find their managers’ names.
esProc code:
A3=order.switch(SellerId, emp:EId)
A4=dep.switch(Manager,emp:EId)
A5=emp.switch(Dept,dep:DeptNo)
A6=order.groups(SellerId.Dept:dt;sum(Amount):amt)
A7=A6.select(amt<=50000).(dt).(Manager).(Name)
Computed results:

Code interpretation:
A3, A4, A5：Create complete foreign key relationship.
A6: Compute sales amount of each department (See the above example). See below:

A7：Use object references to solve the question intuitionally. Expression A6.select(amt<=50000).(dt).(Manager).(Name) can be divided into four steps according to full stops. See below:
1. Find records whose sales amount is greater than 50,000 in A6.
2. Get records corresponding to dt field (in table dep)
3. Get records corresponding to Manager field (in table emp)
4. Get Name field.
Details are shown below:
A6.select(amt<=50000)

.(dt)

.(Manager)

.(Name)

A Code Example of Computing Link Relative Ratio and Year-on-year Comparison with esProc

Link relative ratio refers to comparison between the current data and data of the previous period. The interval is usually one month. For example, divide sales amount of April by that of March, and you get the link relative ratio of April. Hour, day, week and quarter can also be used as the time interval. Year-on-year comparison is the comparison between the current data and data of the corresponding period of the previous year. For example, divide sales amount of April 2014 by that of April 2013. In business, data of multiple periods is usually computed to find the variation trend.
Seeking link relative ratio and year-on-year comparison is common inter-row and inter-group computations, which are easy to be performed with esProc. The following example is used to illustrate the computations.
Case description:
Compute link relative ratio and year-on-year comparison of each month’s sales amount within the designated period. The data comes from table order. Some of the data is shown below:

esProc code:

A1=esProc.query("select * from sales3 where OrderDate>=? and OrderDate<=?",begin,end)
A2=A1.groups(year(OrderDate):y,month(OrderDate):m;sum(Amount):mAmount)
A3=A2.derive(mAmount/mAmount[-1]:lrr)
A4=A3.sort(m)
A5=A4.derive(if(m==m[-1],mAmount/mAmount[-1],null):yoy)

Code interpretation:
A1: Query in the database according to periods. begin and end are external parameters. Such as, begin="2011-01-01 00:00:00", end="2014-07-08 00:00:00"(i.e. the date of today which can be obtained through now() function). Some of the query results are as follows:

A2: Group orders by year and month, then summarize and seek each month's sales amount. Some of the computed results are as follows:

A3: Add a new field Irr, i.e, the link relative ratio on a month-on-month basis. The code is mAmount/mAmount[-1], in which mAmount represents sales amount of the current month, and mAmount[-1] represents that of the previous month. Note that the initial month’s link relative ratio is empty (i.e. January 2011). Computed results are:

A4: Sort A3 by month and year to compute year-on-year comparison. Complete code should be: =A3.sort(m,y). Since A3 is originally sorted by the year, so we just need to sort by the month, the code is: A3.sort(m), which has a higher performance. Some of the computed results are:

A5: A5: Add a new field yoy, i.e., theyear-on-year comparison of monthly sales amount. The code is: if(m==m[-1],mAmount/mAmount[-1],null), meaning that the computation of year-on-year comparison is only performed over the corresponding months. Note that the year-on-year comparison for months of the initial year (i.e. the year 2011) is always. Some of the computed results are:

A row of code, A6=A5.sort(y:-1,m), can be added to make observation easier. That is, sort A5 in descending year order and ascending month order. Note that the data comes to an end in July 2014. Results are shown below:

Examples of Database Transaction Management with esProc

esProc can write to databases and manage database transactions. Here we'll look at the programming method of rollbacks and controlling transaction submission, etc.
A．Submit transactions automatically
esProc can conveniently execute operations like insert, delete and update. The simplest code is:

In the above figure, insert, update and delete are respectively executed from A2 to A4. Execution of each SQL statement will be submitted automatically. Note that:
1. That three SQL statements are submitted three times is too frequent operations for a database.
2. There exists no transaction relation between the three SQL statements. So, if the execution of one SQL statement fails, the previous SQL statement will remain unaffected.
The following examples are to introduce how to submit transactions in batches and how to compose a transaction with multiple SQL statements in a table sequence.

B．Submit transactions in batches
Import students’ information from students.txt to update table students1 in the database. Since there are a lot of records to be modified, using method of submitting transactions in batches is more reasonable.

A1：Define a file object in which students’ information is stored.
A2：Import file content.
A3：Use students' information in A2 to update table students1 in batches. Submitting SQL in batches can avoid accessing the database too frequently. Meanwhile, this can ensure consistency of the data for the submission could succeed or fail simultaneously.

C．Program control transactions
Now we’ll add a new student. The student’s id should be modified to 9 after data are inserted. In order to ensure consistency of the data, submission must be executed after the insertion and modification are proved to be successful. Otherwise rollback should be executed.

A1：Connect to the database. Note that connect function uses option @e and the subsequent code will return error message when something wrong happens. If the option is not used, the database will terminate esProc program directly when errors occur.
A2：Execute the insert SQL statement. Note that execute function uses option @k, meaning the transaction will not automatically submitted after it is executed. If the option is not used, the insert SQL statement will be submitted immediately.
A3：Get the result of last operation in the database, i.e., the insert statement. If err variable is zero, the execution is successful; otherwise, err is the error code.
A4：Judging whether err variable, the execution result, is zero. If the answer is yes, the last operation of the insert statement is successful and modification in B4 can be executed.
C4：Get execution result of the update SQL.
A5：Make judgment over err variable. If it is zero, submit the database; otherwise execute rollback.
A6：Close database connection.

2014年7月30日星期三

Example & Comments for SQL Computation Disadvantage (II)

l Set-lization is not complete.

It is beyond any doubt that set is the basis of mass data computation. Although SQL has the concept of set, it is limited to describing simple result set, and it does not take the set as a basic data type to enlarge its application scope.

Task5 Employees in the company whose birthday are the same as those of others

1	select from employee*
2	where to_char (birthday, ‘MMDD’) in
3	( select to_char(birthday, ‘MMDD’) from employee
4	group by to_char(birthday, ‘MMDD’)
5	having count()>1 )*

The original intention of grouping is to split the source set into several subsets, and its returned values are also these sub-sets. But SQL cannot describe this kind of "set consisting of sets", so it forcibly conducts the next step aggregating computation on these sub-sets and forms conventional result set.

But sometimes what we want is not the summary value on sub-sets, but rather the subsets themselves. At this time, it is necessary to use from the source set the condition obtained from grouping to query again, so sub-query appears again unavoidably.

Task6 Find out students whose scores ranks in top 10 for all subjects

1	select name
2	from (select name
3	from (select name,
4	rank() over(partition by subject order by score DESC) ranking
5	from score)
6	where ranking<=10)
7	group by name
8	having count()=(select count(distinct subject) from score)*

Use set-lized train of thought, order and filter the sub-sets of subjects after grouping to select the top 10 of every subject, and then it is possible to complete the task by finding out the intersection set of these sub-sets. But SQL cannot describe the "set of set" and has not the intersection operation to cope with indefinite quantity set. At this time, it is necessary to change the train of thought and use the window function to find out the top 10 of every subject, and then find out, according to student sub-group, the students whose number of appearances is the same as the quantity of subjects, which causes difficulty in understanding.

l It lacks object reference.

In SQL, the reference relation between tables depends on equivalent foreign key for maintenance and it is impossible to directly use the record at which the foreign key point as the field of this record. In query, it is necessary to seek help of multi-table join or sub-query to complete the query, which causes not only trouble in writing but also low efficiency in operations.

Task7 Female manager’s male employees

Use multi-table join.

1	select A.*
2	from employee A, department B, employee C
3	where A. department=B. department and B. manager=C. name and
4	A.sex ='male' AND C. gender ='female'

Use sub-query.

1	select from employee*
2	where department in
3	(select department from department
4	where manager in
5	(select name from employee where gender ='female'))

If the department field in the employee table points at the record in the department table while the manager field in the department table points at the record in the employee table, then it is only necessary to write this query condition simply as this kind of intuitive high-efficiency form:

where department.manager.sex ='female' and sex ='male'

But in SQL, it is only possible to use multi-table join or sub-query to write out the two kinds of obviously obscure statements.

Task8 Companies with which employees have their first jobs

Use multi-table join.

1	select name, company, first_company
2	from (select employee.name name, resume.company company,
3	row_number() over(partition by resume. name
4	order by resume.start_date) work_seq
5	from employee, resume where employee.name = resume.name)
6	where work_seq=1

Use sub-query.

1	select name,
2	(select company from resume
3	where name=A. name and
4	start date=(select min(start_date) from resume
5	where name=A. name)) first_company
6	from employee A

Without object reference mechanism and the completely set-lized of SQL, it is naturally impossible to handle the sub-table as an attribute of the primary table (field value). Regarding the query of sub-table, there are two methods. The first is to use multi-table join, increase the complexity of the statement, and use filter or grouping to convert the result set into the situation having one-to-one correspondence with the primary table record (the joined record has one-to-one correspondence with the sub-table). The second is to adopt sub-query, and each time compute temporarily the sub-table relating to the primary table record to record sub-sets, and increase the overall computation workload (it is impossible to use with sub-statement in sub-query) and trouble in writing.

Example & Comments for SQL Computation Disadvantage (I)

The computing power of SQL for mass structured data is complete, that is to say, it is impossible to find anything that SQL cannot compute. But its support layer is too low, which can lead to over-elaborate operation in practical application.

The over-elaborate operation is specifically reflected in the following four aspects:

l Computation without sub-step: SQL requires computation to be written out in one statement, and it is necessary to adopt storage procedure to implement computation step by step. No sub-step not only causes difficulty in thinking, but also makes it difficult to use intermediate result.

l Set is unordered: SQL does not directly provide the mechanism of using position to refer to set members, and conversion is needed to implement computation relating to order and positioning.

l Set-lization is not complete: SQL set function is simple and is only used to indicate the query result set and cannot be explicitly applied as basic data type.

l It lacks object reference: SQL does not support record reference, the association between data tables adopts equivalent foreign key scheme, and in conducting multi-table joint computation, it is necessary to conduct join operation. So it is not only difficult to understand, but also low in efficiency.

Implementing data computation process based on a type of computation system is in fact the process of translating business problem into formalized computation syntax (which is similar to the case in which a primary-school student solves an application problem by translating the problem into formalized four arithmetic operations). Because of the above-mentioned four problems of SQL, in handling complex computation, its model system is inconsistent with people’s natural thinking habit. It causes a great barrier in translating problems, leading to the case that the difficulty to formalize the problem-solving method into computation syntax is much greater than to find the solution of the problem.

We give the following examples to describe respectively the problems in the four aspects.

To make the statement in the examples as simple as possible, here a large number of SQL2003 standard window functions are used. So we adopt the ORACLE database syntax that does a relatively good job in supporting SQL2003 standard as it will be generally more complex to adopt the syntax of other databases to program these SQLs.

l Computation without sub-step

Carrying out complex computation step by step can reduce the difficulty of the problem to a great extent, conversely, collecting a multi-step computation into one to be completed in just one step increases the complexity of the problem.

Task1 The number of persons of the sales department, where, the number of persons whose native place is NY, and where, the number of female employees?

The number of persons of the sales department

1	select count() from employee where department='sales'*

Where, the number of persons whose native place is Beijing

1	select count() from* employee where department=‘sales ’ and native_place='NY'

And where, the number of female employees

1	select count () from employee*
2	where department='sales' and native_place='NY' and gender ='female'

Conventional thought: Select the persons of the sales department for counting, and from it, find out the persons whose native place is NY for counting, and then further find out the number of female employees for counting. The query each time is based on the existing result last time, so it is not only simple in writing but also higher in efficiency.

But, the computation of SQL cannot be conducted in steps, and it is impossible to reuse the preceding result in answering the next question, and it is only possible to copy the query condition once more.

Task2 Each department selects a pair of male and female employees to form a game team.

1	with A as
2	(select name, department,
3	row_number() over (partition by department order by 1) seq
4	from employee where gender =‘female’),
5	B as
6	(select name, department,
7	row_number() over(partition by department order by 1) seq
8	from employee where sex =‘female’)
9	select name, department from A
10	where department in ( select distinct department from B ) and seq=1
11	union all
12	select name, department from B
13	where department in (select distinct department from A ) and seq=1

Computation without sub-step sometimes not only causes trouble in writing and low efficiency in computation, but even causes serious deformation in the train of thought.

The intuitive thought of this task: For each department cycle, if this department has male and female employees, then select one male employee and one female employee and add them to the result set. But SQL does not support this kind of writing with which the result set is completed step by step (to implement this kind of scheme, it is necessary to use the stored procedure). At this time, it is necessary to change the train of thought into: Select male employee from each department, select female employee from each department, select out, respectively from the two result sets, members whose departments appear in another result set, and finally seek the union of the sets.

Fortunately, there are still with sub-statement and window function over (SQL2003 standard begins to support); otherwise this SQL statement will be simply ugly.

l The set is unordered.

Ordered computation is very common in mass data computation (obtain the first 3 places/the third place, compare with the preceding period). But SQL adopts the mathematical concept of unordered set, so ordered computation cannot be conducted directly, and it is necessary to adjust the train of thought and change the method.

Task3 Company's employees whose ages are in the middle

1	select name, birthday
2	from (select name, birthday, row_number() over (order by birthday) ranking
3	from employee )
4	where ranking=(select floor((count()+1)/2) from employee)*

Median is a very common computation, and originally it is only necessary to simple get out, from the ordered set, the members whose positions are in the middle. But SQL unordered set mechanism does not provide the mechanism which directly uses position to access member. It is necessary to create a man-made sequence number field, and then use the condition query method to select it out, causing the case in which a sub-query is needed to complete the query.

Task4 For how many trading days has this stock gone up consecutively in the longest?

1	select max(consecutive_day)
2	from (select count() (consecutive_day*
3	from (select sum(rise_mark) over(order by trade_date) days_no_gain
4	from (select trade_date,
5	case when
6	closing_price >lag(closing_price) over(order by trade_date)
7	then 0 else 1 end rise_mark
8	from stock_price) )
9	group by days_no_gain)

Unordered set can also cause train of thought to deform.

The conventional train of thought for computing the number of consecutive days in which the stock rises: Set up a temporary variable whose initial value is 0 to record the consecutive dates in which the stock rises, and then compare it with the preceding day. If the stock does not rise, then clear the variable to 0; if it rises, add 1 to the variable, and see the maximum value appearing from the variable when the cycle is over.

In using SQL, it is impossible to describe this process, so it is necessary to change the train of thought. To compute the accumulate number of days in which stock does not rise from the initial date to the current date, and the one with the same number of days in which stock does not rise is the consecutive trading days in which the stock rises, and from its sub-group, it is possible to find out the interval in which the stock rises, and then seek its maximum count. It is already not so easy to read and understand this statement and it is more difficult to write it out.

2014年7月29日星期二

Computing outside Database to Alleviate Expansion Pressure and Remove Output Bottleneck for Database

The data explosion gives rise to the constant increase on the data volume in enterprises. With increasing applications for database, the concurrent access gets more and more intensive. The database needs unremitting expansion to meet these changes. However, quite often than not, enterprises which expand the database capacity only unavoidably find themselves stuck in the awkward situation of high cost and low efficient. To meet the challenge, they should implement computing outside the database to alleviate expansion pressure and remove bottleneck for databases.
Redundant data is often the main driver to database capacity expansion. Core business data and redundant data are two types of data in database. Derived from the former type, the latter type is aimed for the concrete application. For example, use the redundant data to boost performance and facilitate per-summarized data query. The core data plays a primary role in the initial stage of building the database. With the growing business, the redundant data will be greater than the core data over time on both the order of magnitude and the speed of increase, becoming the main cause to expand database capacity.

Redundant data may be stored outside the database. Core data requires more on security. They are low in quantity and need storing in expensive database storages. Since the redundant data is derived from the core data, it requires less on security requirements. With the core data, the damaged data can always be regenerated. So, the redundant data can be stored outside the database. More often than not, putting redundant data to the database is not for the sake of the security, but to obtain enough powerful computing capability to process the redundant data for use in applications. On the other hand, putting the redundant data to the database will worsen the manageability of data. The database table is flatly structured, making it only fit for managing the core table of relatively small quantity, while it is impossible to manage the data under multi-level directory. The redundant data is of many types, great in quantity, and named randomly. In the database, it is quite common that a large number of table names with obscure meanings would be formed. As time goes by, more and more meaningless obstinate data get accumulated that we dare not to clear them since we do not know what they are actually representing. Owing to this, many expensive database storages are consumed for a long time. Users have to purchase more upscale dedicated servers, more spacious dedicated storage equipment, and more licenses to meet the needs of capacity expansion.

Redundant data will cause the performance bottleneck. The private interface can not only guarantee the interests for the manufacturers, but also act as an effective security measure when the data volume is relatively small. The database access interface is private. But,once the data volume increase and becomes huge, a large volume of redundant data and the centralized concurrentism will flood into the only one secluded interface all at the same time, resulting in the unpredictable output bottleneck.

As can be seen, if using the database to hold the redundant data, many disadvantages would be incurred when expanding the database capacity: high cost, poor manageability, limited effect on alleviating the expansion pressure, and serious wasting of resources. Comparatively, the file expansion is a better means to lower the cost, facilitate the management, improve the data utilization, and achieve the obvious result. The file expansion is aimed to store the continually increasing redundant data to the file, using the direct file access through the open interface of the operating system. By this means, ultimately implement the parallel access, data computing, and data management. As shown in the below figure:

The file expansion has four advantages: Low cost, convenient management, high resource utilization, and remarkable performance boost.
Low cost. Since data is stored in files, we can simply add the inexpensive hard drives when expanding. There is no need to purchase the expansive software or hardware, such as the dedicated server and storage equipment, and database license.

Convenient management.The files support the multi-level directory and is much simpler and more efficient than database by nature in copying, transmitting, and splitting. This enables users to classify and manage the data by rules such as business module, timeline and schedule. When applications get off line, you can also delete the data corresponding to the application by directory. Data management thus becomes simple and clear. The workload is reduced obviously.

High resource utilization. Storing data in file is by no means equivalent to discard database. On the contrary, the file should be used to store the redundant data requiring less on security and relative small in quantity, while database still should be used to store the core data. By doing so, the file storage and database storage can serve their respective purpose according to their characteristics, and the resource utilization can be increased significantly.

Obvious effect on alleviating pressure. All programs can use the open interface of operating system to access file. The situation of congested channel will be improved greatly, and the performance ceiling will be lifted accordingly. More importantly, a file can be copied and distributed to multiple machines conveniently. By doing so, users can take advantage of the multi-node parallel computing to solve the throughput bottleneck, and the performance can be further improved dramatically. Although there are many parallel solutions available in the database, Teradata, Greenplum, and other sophisticated solutions are quite expensive. The computing power and maturity of the freeware Hive and Impala are not great yet, and they are hard to popularize.

There are many advantages for the file expansion. However, the file itself lacks the computing capability, so that we need a specific tool to implement such computing outside the database. Such tools as R, MapReduce, senior language, and esProc can all be used to implement the computing outside database with their respective advantages and disadvantages.

R language is the computing tool for scientists, with rich package for extension, and quite strong computing capability. However, their syntax and function is far too dedicated for the normal programmers to understand. R is mainly used for the desktop computing. So, it is hard to integrate to the reporting tools or Java, C#, and other applications. On the other hand, the low parallel computing ability of R is so week that users have to use R together with the third party tools.
MapReduce has the main advantage of the inexpensive scale-out. It is not only a quite powerful computing tool, but also a very programmer-friendly tool considering its support for the seamless integration with JAVA. But MapReduce does not provide the under layer computing function. All basic computing must be implemented through the hard coding by programmers, and the development workload is quite huge. In addition, it is an undeniable fact that MapReduce scarifies its performance to ensure the high fault tolerance.
Java, VB, Perl, and other senior languages can all implement the computing outside the database as well. However, in doing so, their development difficulty would be much greater. Even the parallel computing framework would have to be implemented manually by programmers.
esProc on the standalone machine offers a performance close or superior to that of database. In addition, esProc also supports the inexpensive scale-out and parallel computing. As a result, the overall performance of esProc is outstanding. It supports the JDBC interface for convenient integrating to the reporting tools and Java. In addition, esProc is also inbuilt with a great number of library functions for structured data computing, and the development difficulty is relatively low. But it is a pity that esProc does not support the large-scale cluster well. In this aspect, esProc is worse than MapReduce.
Let’s take esProc for example to explain on how the computing outside database can alleviate expansion pressure and remove output bottleneck for database.

A company has its database configured as follows:
Server: Dedicated database server, 8-core CPU, and 32 G memory. (Support up to 8 CPUs and 64 G Memory)
Storage equipment: Dedicated disk cabinet + dedicated server disk with 2 T space.
License: 10.
Separate database upgrading:
Server: Purchase more upscale servers, 16-core CPU, and 128 G memory.
Storage equipment: Keep the original disk cabinets and hard disks, adding 8 T space, and the total space will be amount to 10 T.
license: Increase to 20.
File expansion with esProc:
Server: Keep it unchanged.
Storage equipment: Keep the original disk cabinet, migrate the 1 T redundant data to the file computing node, and the remaining 1 T space is enough for the use in the future 3 years (if not migrated, then it is enough to meet the needs for the future 5 years)
License: Unchanged (Can be reduced actually)
File computing node: 4 normal PCs, and each of them is configured as follows: 4 CPUs, 8 G memory, 2 T normal hard disk.

With esProc for file expansion, the database server can at least serve for another 2-5 years, not requiring to upgrade to the upscale servers; The newly added 4 file computing nodes are enough in future 2 years. Thanks to the parallel computing, the network load, and CPU pressure can be alleviated greatly. Because they are normal PCs, the prices of hard disks, CPUs, and memories are far lower than we could have to pay for database-dedicated server. If upgraded 2 years later, we only need to add some new file computing nodes at quite low cost.

Several Methods for Structured Big Data Computation

All data can become valuable only by getting involved in computing and analytics. The big data makes no exception. The computational capability on structured big data determines the range of practical applications of big data. In this article, I'd like to introduce several computation methods: API, Script, SQL, and SQL-like languages.

API: The "API" here refers to a self-contained API access method without using JDBC or ODBC. Let's take MapReduce as an example. MapReduce is designed to handle the parallel computation cost-effectively from the very bottom layer. So, MapReduce offers superior scale-out, hot-swap, and cost efficiency. MapReduce is one of the Hadoop components with open-source code and abundant resources.
Sample code:
public void reduce(Text key, Iterator<Text> value,
OutputCollector<Text, Text> output, Reporter arg3)
throws IOException {
double avgX=0;
double avgY=0;
double sumX=0;
double sumY=0;
int count=0;
String [] strValue = null;
while(value.hasNext()){
count++;
strValue = value.next().toString().split("\t");
sumX = sumX + Integer.parseInt(strValue[1]);
sumY = sumY + Integer.parseInt(strValue[1]);
}

avgX = sumX/count;
avgY = sumY/count;
tKey.set("K"+key.toString().substring(1,2));
tValue.set(avgX + "\t" + avgY);
output.collect(tKey, tValue);
}
Since the universal programming language adopted is unsuitable for the specialized data computing, MapReduce is less capable than SQL and other specialized computation languages in computing. Plus, it is inefficient in developing. No wonder that the programmers generally complain it is "painful". In addition, the rigid framework of MapReduce results in the relatively poorer performance.
There are several products using API, and MapReduce is the most typical one among them.

Script: The "Script" here refers to the specialized script for computing. Take esProc as an example. esProc is designed to improve the computational capability of Hadoop. So, in addition to the inexpensive scale-out, it also offers the high performance, great computational capability, and convenient computation between heterogeneous data sources, especially ideal for achieving the complex computational goal. In addition, it is the grid-style script characterized with the high development efficiency and complete debug functions.
Sample code:

Java users can invoke the result from esProc via JDBC, but they are only allowed to invoke the result in the form of stored procedure instead of any SQL statement. Plus, esProc is not open source. These are two disadvantages of esProc.

The Script is widespread used in Mongo DB, Redis, and many other big data solutions, but they are not specialized enough in computing. For another example, the multi-table joining operation for Mongo DB is not only inefficient, but also involves the coding of one order of magnitude more complex than that of SQL or esProc.

SQL: The "SQL" here refers to the complete and whole SQL/SP, i.e. ANSI 2000 and its super set. Take Greenplum as an example, the major advantages of Greenplum SQL are the powerful computing, highly efficient developing, and great performance. Other advantages include the widespread use of its language, low learning cost, simple maintenance, and migration possibility -not to mention its trump-card of offering support for stored procedure to handle the complex computation. By this way, business value can be exploited from the big data conveniently.
Sample code:
CREATE OR REPLACE function view.merge_emp()
returns voidas$$
BEGIN
truncate view.updated_record;
insert into view.updated_record select y.* from view.emp_edw x right outer join emp_src y on x.empid=y.empid where x.empid is not null;
update view.emp_edwset deptno=y.deptno,sal=y.sal from view.updated_record y
where view.emp_edw.empid=y.empid;
insert into emp_edwselect y.* from emp_edw x right outer join emp_src y on
x.empid=y.empid where x.empid is null;
end;
$$ language 'plpgsql';

The other databases with the similar structure to MPP include Teradata, Vertical, Oracle, and IBM. Their syntax characteristics are mostly alike. The disadvantages are similar. The acquisition cost and the ongoing maintenance expenses are extremely high. Charging its users by data scale, the so-called inexpensive Greenplum is actually not a bargain at all - it is way more like making big money under cover of big data. Other disadvantages include awkward debugging, incompatible syntax, lengthy down-time if expansion, and awkward multi-data-source computation.

SQL-like language: It refers to the output interfaces like JDBC/ODBC and only limited to those scripting languages that are the subset of standard SQL. Take Hive QL as an example. The greatest advantage of Hive QL is its ability to scale out cost-effectively while still a convenient tool for users to develop. The SQL syntax feature is kept in Hive QL, so that the learning cost is low, development efficient, and maintenance simple. In addition, Hive is a component of Hadoop. The open-source is another advantage.
Sample code:

SELECT e.* FROM (
SELECT name, salary, deductions["Federal Taxes"] as ded,
salary * (1 – deductions["Federal Taxes"]) as salary_minus_fed_taxes
FROM employees
) e
WHERE round(e.salary_minus_fed_taxes) > 70000;

The weak point of Hive QL is its non-support for stored procedure. Due to this, it is difficult for HiveQL to undertake the complex computation, and thus difficult to provide the truly valuable result. The slightly more complex computation will rely on MapReduce. Needless to say, the development efficiency is low. The poor performance and the threshold time can be regarded as a bane, especially in task allocation, multi-table joining, inter-row computation, multi-level query, and ordered grouping, as well as implementing other algorithm alike. So, it is quite difficult for HiveQL to implement the real-time Hadoop application for big data.
There are also some other products with SQL-like languages - MongoDB as an example - they are still worse than Hive yet.

The big data computation methods currently available are no more than these 4 types of API, Script, SQL, and SQL-like languages. Wish they would develop steadfastly and there would be more and more cost-effective, powerful, practical and usable products for data computing.

订阅：博文 (Atom)