Data processing with multi-database, off database, log, txt and excel: Example & Comments for SQL Computation Disadvantage (I)

The computing power of SQL for mass structured data is complete, that is to say, it is impossible to find anything that SQL cannot compute. But its support layer is too low, which can lead to over-elaborate operation in practical application.

The over-elaborate operation is specifically reflected in the following four aspects:

l Computation without sub-step: SQL requires computation to be written out in one statement, and it is necessary to adopt storage procedure to implement computation step by step. No sub-step not only causes difficulty in thinking, but also makes it difficult to use intermediate result.

l Set is unordered: SQL does not directly provide the mechanism of using position to refer to set members, and conversion is needed to implement computation relating to order and positioning.

l Set-lization is not complete: SQL set function is simple and is only used to indicate the query result set and cannot be explicitly applied as basic data type.

l It lacks object reference: SQL does not support record reference, the association between data tables adopts equivalent foreign key scheme, and in conducting multi-table joint computation, it is necessary to conduct join operation. So it is not only difficult to understand, but also low in efficiency.

Implementing data computation process based on a type of computation system is in fact the process of translating business problem into formalized computation syntax (which is similar to the case in which a primary-school student solves an application problem by translating the problem into formalized four arithmetic operations). Because of the above-mentioned four problems of SQL, in handling complex computation, its model system is inconsistent with people’s natural thinking habit. It causes a great barrier in translating problems, leading to the case that the difficulty to formalize the problem-solving method into computation syntax is much greater than to find the solution of the problem.

We give the following examples to describe respectively the problems in the four aspects.

To make the statement in the examples as simple as possible, here a large number of SQL2003 standard window functions are used. So we adopt the ORACLE database syntax that does a relatively good job in supporting SQL2003 standard as it will be generally more complex to adopt the syntax of other databases to program these SQLs.

l Computation without sub-step

Carrying out complex computation step by step can reduce the difficulty of the problem to a great extent, conversely, collecting a multi-step computation into one to be completed in just one step increases the complexity of the problem.

Task1 The number of persons of the sales department, where, the number of persons whose native place is NY, and where, the number of female employees?

The number of persons of the sales department

1	select count() from employee where department='sales'*

Where, the number of persons whose native place is Beijing

1	select count() from* employee where department=‘sales ’ and native_place='NY'

And where, the number of female employees

1	select count () from employee*
2	where department='sales' and native_place='NY' and gender ='female'

Conventional thought: Select the persons of the sales department for counting, and from it, find out the persons whose native place is NY for counting, and then further find out the number of female employees for counting. The query each time is based on the existing result last time, so it is not only simple in writing but also higher in efficiency.

But, the computation of SQL cannot be conducted in steps, and it is impossible to reuse the preceding result in answering the next question, and it is only possible to copy the query condition once more.

Task2 Each department selects a pair of male and female employees to form a game team.

1	with A as
2	(select name, department,
3	row_number() over (partition by department order by 1) seq
4	from employee where gender =‘female’),
5	B as
6	(select name, department,
7	row_number() over(partition by department order by 1) seq
8	from employee where sex =‘female’)
9	select name, department from A
10	where department in ( select distinct department from B ) and seq=1
11	union all
12	select name, department from B
13	where department in (select distinct department from A ) and seq=1

Computation without sub-step sometimes not only causes trouble in writing and low efficiency in computation, but even causes serious deformation in the train of thought.

The intuitive thought of this task: For each department cycle, if this department has male and female employees, then select one male employee and one female employee and add them to the result set. But SQL does not support this kind of writing with which the result set is completed step by step (to implement this kind of scheme, it is necessary to use the stored procedure). At this time, it is necessary to change the train of thought into: Select male employee from each department, select female employee from each department, select out, respectively from the two result sets, members whose departments appear in another result set, and finally seek the union of the sets.

Fortunately, there are still with sub-statement and window function over (SQL2003 standard begins to support); otherwise this SQL statement will be simply ugly.

l The set is unordered.

Ordered computation is very common in mass data computation (obtain the first 3 places/the third place, compare with the preceding period). But SQL adopts the mathematical concept of unordered set, so ordered computation cannot be conducted directly, and it is necessary to adjust the train of thought and change the method.

Task3 Company's employees whose ages are in the middle

1	select name, birthday
2	from (select name, birthday, row_number() over (order by birthday) ranking
3	from employee )
4	where ranking=(select floor((count()+1)/2) from employee)*

Median is a very common computation, and originally it is only necessary to simple get out, from the ordered set, the members whose positions are in the middle. But SQL unordered set mechanism does not provide the mechanism which directly uses position to access member. It is necessary to create a man-made sequence number field, and then use the condition query method to select it out, causing the case in which a sub-query is needed to complete the query.

Task4 For how many trading days has this stock gone up consecutively in the longest?

1	select max(consecutive_day)
2	from (select count() (consecutive_day*
3	from (select sum(rise_mark) over(order by trade_date) days_no_gain
4	from (select trade_date,
5	case when
6	closing_price >lag(closing_price) over(order by trade_date)
7	then 0 else 1 end rise_mark
8	from stock_price) )
9	group by days_no_gain)

Unordered set can also cause train of thought to deform.

The conventional train of thought for computing the number of consecutive days in which the stock rises: Set up a temporary variable whose initial value is 0 to record the consecutive dates in which the stock rises, and then compare it with the preceding day. If the stock does not rise, then clear the variable to 0; if it rises, add 1 to the variable, and see the maximum value appearing from the variable when the cycle is over.

In using SQL, it is impossible to describe this process, so it is necessary to change the train of thought. To compute the accumulate number of days in which stock does not rise from the initial date to the current date, and the one with the same number of days in which stock does not rise is the consecutive trading days in which the stock rises, and from its sub-group, it is possible to find out the interval in which the stock rises, and then seek its maximum count. It is already not so easy to read and understand this statement and it is more difficult to write it out.

Data processing with multi-database, off database, log, txt and excel

2014年7月30日星期三

Example & Comments for SQL Computation Disadvantage (I)

没有评论:

发表评论