Data processing with multi-database, off database, log, txt and excel: August 2014

Sunday, August 31, 2014

Code Examples of Accessing HTTP Data in esProc

esProc can conveniently access data from an HTTP data source for processing. Let's look at the relevant functions through an example.

In this example, a servlet serves employee information queries in JSON format. The servlet accesses the employee table in the database, which stores employee information as follows:

EID  NAME      SURNAME  GENDER  STATE       BIRTHDAY    HIREDATE    DEPT       SALARY
1    Rebecca   Moore    F       California  1974-11-20  2005-03-11  R&D        7000
2    Ashley    Wilson   F       New York    1980-07-19  2008-03-16  Finance    11000
3    Rachel    Johnson  F       New Mexico  1970-12-17  2010-12-01  Sales      9000
4    Emily     Smith    F       Texas       1985-03-07  2006-08-15  HR         7000
5    Ashley    Smith    F       Texas       1975-05-13  2004-07-30  R&D        16000
6    Matthew   Johnson  M       California  1984-07-07  2005-07-07  Sales      11000
7    Alexis    Smith    F       Illinois    1972-08-16  2002-08-16  Sales      9000
8    Megan     Wilson   F       California  1979-04-19  1984-04-19  Marketing  11000
9    Victoria  Davis    F       Texas       1983-12-07  2009-12-07  HR         3000
…

The servlet's doGet function receives a string of employee ids in JSON format, queries the corresponding employee information in the database, generates an employee information list in JSON format and returns it. The code that reads the database and generates the employee information is omitted below:

protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    String inputString = req.getParameter("input");
    // Example input value: "[{EID:8},{EID:32},{EID:44}]"
    if (inputString == null) inputString = "";
    String outputString = "";
    {...} // code for querying the database through inputString and generating outputString is omitted here
    // Example of the generated outputString:
    // "[{EID:8,NAME:"Megan",SURNAME:"Wilson",GENDER:"F",STATE:\...";
    // Set the content type before the response body is written
    resp.setContentType("text/json");
    resp.getOutputStream().println(outputString);
}

esProc can access this HTTP servlet with the following code, described cell by cell:

A1: Define the input parameter to be submitted to the servlet, i.e. the employee id list in JSON format.

A2: Define an httpfile object; the URL is http://localhost:8080/demo/testServlet?input=[{EID:8},{EID:32},{EID:44}].

A3: Import A2, i.e. read the result returned by the httpfile object.

A4: Parse the JSON information of each employee row by row and create a sequence.

A5: Compute on the table sequence in A4 and combine SURNAME and NAME into FULLNAME.

A6: Export the results of A5 to a text file.
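For readers more at home in a general-purpose language, steps A1 to A6 can be approximated in Python. This is only a sketch of the same flow, not esProc syntax. The function names, the sample response body and the choice of NAME followed by SURNAME for FULLNAME are assumptions, and the response is assumed to be strict JSON with quoted keys. The HTTP call itself is left out so the block runs offline.

```python
import csv
import io
import json
from urllib.parse import urlencode

def build_url(base_url, employee_ids):
    """A1-A2: encode the employee id list as the `input` query parameter."""
    payload = json.dumps([{"EID": eid} for eid in employee_ids], separators=(",", ":"))
    return base_url + "?" + urlencode({"input": payload})

def add_full_name(rows):
    """A4-A5: rows are dicts parsed from JSON; add FULLNAME (assumed order: NAME, then SURNAME)."""
    return [{**row, "FULLNAME": f'{row["NAME"]} {row["SURNAME"]}'} for row in rows]

def to_text(rows):
    """A6: render the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

url = build_url("http://localhost:8080/demo/testServlet", [8, 32, 44])
# A3 would fetch `url`; a hypothetical response body stands in for it here.
body = '[{"EID": 8, "NAME": "Megan", "SURNAME": "Wilson"}]'
employees = add_full_name(json.loads(body))
report = to_text(employees)
```

Writing `report` to a file would complete the A6 step; it is kept in memory here so the sketch has no side effects.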

Thursday, August 28, 2014

Debugging Function Comparison: R Language vs. esProc

As is well known, finding and correcting errors during program development usually takes more time than writing the code.

A friendly debugging environment can therefore save a lot of time. In this respect, VB.NET and SQL are two extremes: the former provides an almost perfect debugging environment, while the latter provides nearly no debugging tools.

R and esProc, two development tools for computation and analysis, both offer some debugging capability. Here we study how they differ in this respect.

Let's start by getting familiar with the debugging environments of R (taking RStudio as an example) and esProc through their respective interfaces:

R Studio Debugging Environment

esProc's Debugging Environment

Let's compare the basic functions.

Breakpoint: In R, a breakpoint is set by inserting browser() into the code. Users have to remove these statements manually once debugging is done, which recalls the old days of coding in BASIC before Windows existed, when removing the stop statements was an important job before releasing code. By comparison, breakpoints in esProc work as in VB.NET and other modern programming languages: clicking a button or pressing a shortcut key sets a breakpoint on the cell where the cursor is located. This is nothing special.

Debug command: In keeping with its breakpoint style, R takes debug commands from the console, including c to resume running, n to run the next statement, and Q to exit debug mode. There are also functions such as trace, setBreakpoint, debug, undebug and stop. Note that it is best not to name any variable c, n or Q in the code; otherwise, accidental conflicts can occur.

For execution control, esProc is no different from VB.NET and similar programming languages: it only requires clicking buttons or pressing shortcut keys, and users do not need to memorize any commands.

Variable watch: R's variable watch window is on the right and lists all current variables. Clicking a variable opens a new window that displays its value. Alternatively, R users can enter fix(variable name) in the command-line window, as shown below. esProc has a similar variable list in the bottom right corner of its interface. esProc users seldom use this list, because esProc does not require users to define variable names explicitly. The cell name serves as the variable name by default, so users can simply click a cell to view its value.

One thing to note is that R displays data frame variables in a friendly way. It is much less friendly with irregularly structured variables, which can be practically unreadable, as with the typical list below:

esProc does a much better job in this respect. In esProc, the same data is displayed by drilling down through hyperlinks:

Next, let's compare some more advanced functions, starting with immediate running.

In esProc, a cell is calculated immediately and automatically once code is entered into it. Developers can therefore view the execution result right away and adjust the code for a re-run as needed. This style speeds up development and lowers the probability of errors, and it lets beginners become familiar with the tool quickly. RStudio provides a similar means that more closely resembles the "immediate window" of VB: users type code into the command-line window and run it immediately. If it runs correctly, they then copy the code into the formal code section. On the whole, R is less convenient than esProc in this respect.

Finally, let's discuss debugging functions separately.

R users can call debug(function name) to debug a function separately and directly, which supports modular development and large-scale testing. esProc users, by contrast, cannot debug a function separately, which is something of a pity. However, R's debug function does not implement a truly "separate" test. It actually works by adding a browser() command before the function to be debugged, so all the code before the function still has to run before the debugger enters it.

From another perspective, this kind of computational analysis software is rarely used for large-scale development and testing, so the ability to debug functions separately is of limited value.

The comparison above shows that both R and esProc provide some debugging functions, and that esProc performs better in terms of convenience and usability.

Wednesday, August 27, 2014

How to Clear Cell Values to Release Memory in esProc

In esProc, cell values are stored everywhere in a cellset as variables. Cell values are convenient to reference during computation, but they can occupy too much memory. Once cell data has served its purpose in a computation, it can be cleared to reduce the memory footprint. In particular, when intermediate data has been obtained and further complicated computations are needed, cell values that are no longer used should be deleted to reduce memory usage and avoid memory overflow.

Consider the following case. List the top 200 transaction records among all household appliance orders and food orders by total order amount, sorted by product name. The order records come from two text files: Order_Appliances.txt and Order_Foods.txt. First combine the data from the two files, then get the top 200 order records by total order amount, and finally sort them by product name.

Computed results of all cells are as follows:

The table sequence in A1 contains order records of Order_Appliances.txt:

The table sequence in A2 contains order records of Order_Foods.txt:

A3 combines records of the two table sequences simply for the use of filtering in the next step.

A4 selects the top 200 order records by total sales amount and picks the needed fields from them to generate a new table sequence. Because the records must be sorted by sales amount in descending order, the Amount in the top() function is preceded by a minus sign, and the results come out sorted by sales amount:

A5 sorts the top 200 order records by product name as required:

In fact, what we finally need is the data in A5. Once A4 has extracted all the necessary order information, the data in the original cells A1 and A2 becomes useless. Deleting such useless data after the intermediate data is obtained releases memory and makes the operation more stable.
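The minus-sign trick used in A4 is not specific to esProc. As a rough Python analogue (not esProc code), the sketch below picks the largest amounts by asking for the smallest negated amounts and then sorts by product name. The records and field names are made up for illustration, and 3 stands in for 200.

```python
import heapq

# Made-up order records; the field names are assumptions.
appliances = [{"Product": "Kettle", "Amount": 120.0}, {"Product": "Fridge", "Amount": 900.0}]
foods = [{"Product": "Bread", "Amount": 15.0}, {"Product": "Coffee", "Amount": 340.0}]

combined = appliances + foods                                   # like A3
# Negating the key makes "smallest first" mean "largest amount first".
top = heapq.nsmallest(3, combined, key=lambda r: -r["Amount"])  # like A4
by_name = sorted(top, key=lambda r: r["Product"])               # like A5
```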

Therefore, the cellset program can be reorganized as follows:

Setting a cell value to null deletes the data in the cell, as the statements in C3 and B4 show. After C3 deletes the cell value of A2 and B4 deletes the data in A3, which references the food order records, the original food orders are removed from memory.

The T.reset() function in B3 works a little differently: it deletes all records in the table sequence but retains its data structure. After B3 is executed, the value of A1 is as follows:

We can choose whichever method of deleting cell values suits the need. Setting the cell value to null is more common. T.reset() is used only when the table sequence's data structure really needs to be retained.

Note that although the statement in B5 sets the value of A4 to null, it cannot reduce the memory footprint. The result A5 returns is a record sequence whose records come from the table sequence in A4, so these records cannot be deleted and remain in use in A5 even after A4 is set to null. Therefore, before setting a cell value to null, we must check whether the data in the cell is still being used.

In addition, A5 sorts the records in A4 without producing new records. A5 stores only the references that result from sorting the records, which take little memory and do not noticeably increase memory usage.
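The same effect can be observed in any language with reference semantics. The Python sketch below is an analogue, not esProc code: the Order class and the data are hypothetical, and the behaviour shown relies on CPython releasing an object once nothing references it. Clearing the name that plays the role of A4 frees nothing while the sorted list that plays the role of A5 still references the same records.

```python
import gc
import weakref

class Order:
    def __init__(self, product, amount):
        self.product = product
        self.amount = amount

source = [Order("Kettle", 120.0), Order("Bread", 15.0), Order("Fridge", 900.0)]
# Like A4: new records built from the two largest orders.
top = [Order(o.product, o.amount) for o in sorted(source, key=lambda o: -o.amount)[:2]]
# Like A5: a new list, but it holds references to the same record objects.
by_name = sorted(top, key=lambda o: o.product)

probe = weakref.ref(top[0])   # lets us observe whether that record is still alive

top = None                    # like B5: the name is cleared...
gc.collect()
alive_after_clearing_top = probe() is not None   # ...but by_name still uses the records

by_name = None                # only now does nothing reference them
gc.collect()
alive_after_clearing_all = probe() is not None
```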

Tuesday, August 26, 2014

Thinking of Serial Number and Locating Computation in esProc

1. Accessing Members

Members of a set (sequence) in esProc are kept in order, so a member can be referenced by its serial number. The more flexibly serial numbers are used, the simpler and more efficient esProc operations become. In fact, some esProc functions require a serial number or a serial number ISeq, such as the delete() function for deleting records and the compose() function for re-sorting a TSeq.

The simplest application is to access members directly by their serial numbers, just as an array is accessed in an ordinary programming language.

You can use the m() function to get members backwards or in a loop manner.

In addition, esProc provides a series of functions whose names begin with the letter "p". These functions are used for searching for the serial numbers of members, as given below:

When running the pos function, if a specified member is not found in a sequence, 0 will be returned. This function can be used to judge whether a member is in a set or not.
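To make these behaviours concrete, here is a small Python sketch. The helpers are hypothetical and only mimic what is described above: serial numbers that start at 1, access from the end or in a loop, and a position search that returns 0 when nothing is found. They are not the esProc functions themselves.

```python
def member(seq, n):
    """1-based access; a negative n counts from the end (-1 is the last member)."""
    if n == 0:
        raise IndexError("serial numbers start at 1")
    return seq[n - 1] if n > 0 else seq[n]

def member_cyclic(seq, n):
    """1-based access that wraps around when n runs past the end."""
    return seq[(n - 1) % len(seq)]

def pos(seq, value):
    """1-based position of value, or 0 when it is absent."""
    return seq.index(value) + 1 if value in seq else 0

names = ["Ann", "Bob", "Cy", "Dee"]
is_member = pos(names, "Cy") > 0   # the 0 result doubles as a membership test
```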

2. Accessing Subsets

With serial numbers, you can access the subsets of a set.

In addition, you can also use the m() function to access a subset by specifying the corresponding serial number.

Similarly, you can use the option @a in a position search function to search for the serial numbers of all the members satisfying specified conditions.

If you want to get the positions of multiple members at once, you can use the pos function; the @i option may be required in certain cases.

The pos function returns null if any of the members is not found in the sequence. Because misplaced order and repeated members may also cause null to be returned, you cannot simply use this function to judge whether a specified subset is included; use an intersection operation instead.
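The difference between a positional search and a containment test can be sketched in Python. Both helpers are hypothetical: pos_increasing searches strictly from left to right, which is an assumption about how an ordered search behaves, and contains_all uses an intersection-style test that ignores order.

```python
def pos_increasing(seq, members):
    """1-based positions found left to right, or None if any member cannot be placed."""
    positions, start = [], 0
    for m in members:
        try:
            idx = seq.index(m, start)   # search only after the previous hit
        except ValueError:
            return None
        positions.append(idx + 1)
        start = idx + 1
    return positions

def contains_all(seq, members):
    """Containment test via set intersection, independent of order."""
    return set(members) & set(seq) == set(members)

seq = [3, 1, 4, 1, 5]
```

With this data, the members 4 and 3 are both present, yet the ordered search fails because 3 comes before 4 in the sequence.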

3. Locating by Using the Loop Function

Just as the symbol ~ refers to the current member in a loop function, the symbol # refers to the serial number of the current member.

In a loop function, you can use the symbol [] to access members in a relative mode.

In addition, you can use the symbol {} to access subsets in a relative mode.
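In Python terms, these three ideas correspond roughly to enumerate, indexing relative to the current position, and slicing around it. The sketch below uses made-up prices and is an analogue rather than esProc syntax.

```python
prices = [10, 12, 9, 15]

# Serial number of the current member, counted from 1.
numbered = [(i, p) for i, p in enumerate(prices, start=1)]

# Relative member access: change against the previous member (None for the first).
change = [p - prices[i - 1] if i > 0 else None for i, p in enumerate(prices)]

# Relative subset access: sum over the window from the previous member to the next one.
window_sum = [sum(prices[max(i - 1, 0): i + 2]) for i in range(len(prices))]
```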

4. Alignment Access

As we know, the symbol # in a loop function indicates the serial number of the current member. It is an ordinary number that can be used in computations like any other. In particular, it can serve as a serial number for accessing a member of another sequence, which is the basis of alignment access.

When independent sequences are arranged in the same order, you can use alignment access to generate records whose fields come from these sequences.
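A minimal Python sketch of the idea, with made-up data: while looping over one sequence, the current position is used to pick the matching member from the other sequences. Positions are 0-based here, unlike esProc serial numbers.

```python
names = ["Ann", "Bob", "Cy"]
ages = [31, 45, 28]
cities = ["Austin", "Boston", "Chicago"]

# The current position indexes the other sequences, which share the same order.
records = [{"Name": name, "Age": ages[i], "City": cities[i]} for i, name in enumerate(names)]
```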

5. Sequence Alignment

Alignment access requires that all the sequences are arranged in the same order. In practice, however, sequences are not always in the same order. In that case, use the align function to re-order the sequences according to the order of one standard sequence.

The alignment grouping function align@a also returns a sequence aligned with a standard sequence, but each member of its result is a set.

The align() function instead takes the first member of each grouped subset and returns a set consisting of these first members rather than a set of subsets. If each grouped subset has only one member, the function simply orders these members according to the standard sequence.
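The first-member behaviour can be sketched in Python. The align helper below is hypothetical and only follows the description above; using None for a value with no matching record is an assumption.

```python
def align(records, key, standard):
    """For each value in `standard`, the first record whose key matches, else None."""
    first = {}
    for r in records:
        first.setdefault(key(r), r)   # keep only the first record per key
    return [first.get(v) for v in standard]

staff = [
    {"Name": "Emily", "Dept": "HR"},
    {"Name": "Rachel", "Dept": "Sales"},
    {"Name": "Alexis", "Dept": "Sales"},
]
standard = ["Sales", "HR", "R&D"]
aligned = align(staff, key=lambda r: r["Dept"], standard=standard)
aligned_names = [r["Name"] if r else None for r in aligned]
```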

Similarly, the alignment access can be used in an enum group; here, enum@1 is not commonly used.

6. Interval Integer Sequence

An integer sequence is a special set to which all the set operations apply. In addition, it can be used as a sequence of serial numbers for accessing a subset of another sequence. Using integer sequences freely is vital to thinking in serial numbers.

You can process subsets through an integer sequence consisting of the subsets' positions in the original set.

7. ISeq Consisting of Serial Numbers

After a sequence is sorted, the previous order of its members is lost. In certain cases, however, this order information is required. For example, we may need to know the hiring order of the three oldest employees in the company, or the price increase of a share on the three trading days with the highest share prices.

This problem can be solved by using the psort function in esProc; the function returns the previous order of the ordered members.

In plain words, in the integer sequence returned by the psort function, the first number is the serial number, in the original sequence, of the member that belongs in first place after sorting; the second number is the serial number of the member that belongs in second place; and so on.

For a sequence re-ordered by a serial number ISeq, you can also use the inv() function to get the inverse ISeq and restore the original order.
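A short Python sketch may help here. The two helpers are hypothetical stand-ins named after the esProc functions; they implement the description above (original serial numbers listed in sorted order, and the inverse permutation), not the esProc library itself.

```python
def psort(seq):
    """1-based original positions, listed in ascending order of value."""
    return [i + 1 for i in sorted(range(len(seq)), key=lambda i: seq[i])]

def inv(perm):
    """Inverse of a 1-based permutation."""
    result = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm, start=1):
        result[old_pos - 1] = new_pos
    return result

values = [30, 10, 20]
order = psort(values)                        # original positions in sorted order
in_order = [values[p - 1] for p in order]    # the sorted sequence
back = inv(order)                            # where each original member ended up
restored = [in_order[p - 1] for p in back]   # the original order again
```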

You can use the psort function to solve the problems above, which require that the original serial numbers be kept.

Binary search is widely recognized for its high efficiency, but it requires the original sequence to be sorted by the search key, so the sequence must be sorted before a binary search is executed. This does not suit every case. If you only want to search for a member in a sequence, you can of course run the sort function before searching; but if you want to find the serial number of a member, sorting first would destroy the original order. In that case, use the psort() function.

Here psort creates a binary search index for the sequence; a single sequence can have one or more such indexes, depending on the keys.
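The idea of a search index that leaves the sequence untouched can be sketched in Python with the standard bisect module. The function names are hypothetical; the index is simply the list of original positions in sorted order, kept together with the sorted keys.

```python
from bisect import bisect_left

def build_index(seq):
    """Original positions (0-based) in ascending order of value, plus the sorted keys."""
    order = sorted(range(len(seq)), key=lambda i: seq[i])
    keys = [seq[i] for i in order]
    return order, keys

def find(index, value):
    """1-based position of value in the unsorted sequence, or 0 when it is absent."""
    order, keys = index
    k = bisect_left(keys, value)          # binary search over the sorted keys
    if k < len(keys) and keys[k] == value:
        return order[k] + 1
    return 0

ids = [44, 8, 32]
index = build_index(ids)   # ids itself keeps its original order
```

Building a second index on another key gives a second way to search the same sequence, again without re-ordering it.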

In addition, an align group function can also return an ISeq consisting of serial numbers, instead of the sequence aligned.

8. Locating Computation

After working out the serial numbers of the records needed, we can compute the required results with the locating computation A.calc(). Locating computation avoids unnecessary computation and increases efficiency.

In this case, the binary file Vote Record stores poll results sorted by votes in descending order. A4 holds the employee ID sequence of a specified state, and A5 computes the number of votes each of them needs in order to move up. For example, Ryan Williams, now ranking 3rd, needs another 69 votes to move up one place. A cross-row operation is needed for this computation, because it cannot be completed with the data of the selected employees alone.
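As a rough Python analogue of this kind of locating computation, the sketch below computes a value only at selected positions while still reading the neighbouring row in the full list. The poll data is made up, and treating "votes needed to move up" as the gap to the record ranked directly above is an assumption.

```python
# Made-up poll results, already sorted by votes in descending order.
votes = [("Ann", 500), ("Bob", 420), ("Cy", 362), ("Dee", 300)]

selected = [3, 4]   # 1-based positions of the records of interest

def gap_to_next_rank(rows, position):
    """Votes separating a record from the one ranked directly above; 0 for the leader."""
    if position == 1:
        return 0
    return rows[position - 2][1] - rows[position - 1][1]

# Only the selected positions are computed, but each uses the full list.
needed = [gap_to_next_rank(votes, p) for p in selected]
```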

Monday, August 25, 2014

Thinking of Set in esProc

1. Sequence and Set in esProc

Unlike traditional programming languages, esProc uses sets everywhere. In fact, the sequence in esProc is a kind of set, so a deep understanding of the set concept is important when using esProc. Like an integer or a string, a set is a basic data type; it can be the value of a variable, the result of an expression, or the return value of a function.

For this data type, esProc provides binary operators on two sets A and B, such as intersection, concatenation, union and difference: A^B, A|B, A&B, A\B.

Users are advised to understand and master these set operations thoroughly. Thinking in sets and making good use of them often leads to an easier solution.

The following is an illustration of using set operations to simplify code:

Unlike a set in the mathematical sense, a set in esProc is ordered, and a member may appear more than once, as in a sequence or a sequence table.

In mathematics, the intersection and union operations are both commutative; in other words, A∩B = B∩A and A∪B = B∪A. This commutative property does not hold in esProc, because the order of members in a sequence cannot be changed at will. In esProc, the result of an intersection or union operation is arranged according to the order of the left operand.

Because sequence members are ordered, we should use the function A.eq(B) to judge whether two sequences have the same members, rather than simply using the comparison operator ==:
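Python lists are ordered too, so they make a convenient analogue. The helpers below are hypothetical: they follow the rule that the result takes the order of the left operand, and same_members compares contents while ignoring order.

```python
from collections import Counter

def isect(a, b):
    """Members of a that also occur in b, in a's order."""
    return [x for x in a if x in b]

def union(a, b):
    """a followed by the members of b that a lacks."""
    return a + [x for x in b if x not in a]

def same_members(a, b):
    """True when a and b hold the same members, ignoring order."""
    return Counter(a) == Counter(b)

a, b = [1, 2, 3], [3, 2, 4]
```

Swapping the operands changes the order of the result, and a plain == on two lists is sensitive to order, which is why a separate member comparison is needed.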

2. Parameters in a Loop Function

With the set data type, you can perform an operation on all members of a set in a single statement, without writing loop code.

Sometimes a loop function processes not the set members themselves but values computed from them. In this case, you can pass an expression as a parameter of the function, in which "~" represents the current member.

In nested loop functions, "~" is interpreted as a member of the inner sequence; to reference a member of an outer sequence, prefix "~" with the name of the outer sequence.

The same rule applies to field references in which ~ is omitted: a field is first looked up in the inner RSeq, and if it cannot be found there, esProc searches for it in the outer RSeq.

3. Order of Loop

The arguments of a loop function are computed according to the order of members in the original sequence. This is very important.

In many cases, a single expression can implement what would otherwise require simple loop code.

4. Computation Sequence

Unlike loop functions such as sum and avg, which return a single aggregate value, the computation sequence function A.(x) returns a set. Besides the basic set operations such as union, intersection and difference, a computation sequence is often used to create a new set.

The execution of an aggregate function with arguments will be divided into two steps:

1) Use arguments to produce computed column;

2) Aggregate the result.

In other words, A.f(x) = A.(x).f()
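The identity can be checked with a Python analogue, where a list comprehension plays the role of the computed column. The bonus rule is made up for illustration.

```python
salaries = [7000, 11000, 9000]

def bonus(salary):
    return salary // 10   # hypothetical rule: one tenth of the salary

# One step: an aggregate with an argument expression.
total_direct = sum(bonus(s) for s in salaries)

# Two steps: first the computed column, then the aggregate.
computed_column = [bonus(s) for s in salaries]
total_two_step = sum(computed_column)
```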

The new function returns a sequence table built by a computation sequence.

In addition, there is a run function for computation over a sequence. It returns the original sequence itself instead of the computed results. Generally, this function is used to assign values to fields in a record sequence (sequence table).

5. Impure Sets

esProc places no restriction on the consistency of member types in a sequence; that is, a sequence may consist of numbers, character strings and records.

In many cases, however, putting data of various types into one sequence has little practical significance, so users need not be too concerned with it.

On the other hand, a record sequence (a sequence consisting of records) can hold records from different sequence tables, which is very convenient.

In esProc, the records in a record sequence need not originate from the same sequence table. As long as they have the same field names, the records can be processed uniformly. The benefits are simpler programs, higher efficiency and lower memory usage. In SQL, by contrast, two tables with different structures must first be combined into a new one with the UNION clause before any operation can be performed.

6. Sets Consisting of Subsets

Since esProc places few restrictions on set members, a set may be a member of another set. If A is a set consisting of other sets, the functions A.conj(), A.union(), A.diff() and A.isect() can be used to compute the concatenation, union, difference and intersection of the subsets of A.
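For comparison, the four operations can be sketched over a Python list of lists. The variable names mirror the esProc function names only for readability; taking the difference as the first subset minus all later ones is an assumption.

```python
from functools import reduce

groups = [[1, 2, 3], [2, 3, 4], [3, 5]]

# Concatenation keeps every member, duplicates included.
conj = [x for g in groups for x in g]
# Union keeps the first occurrence of each member.
union = reduce(lambda acc, g: acc + [x for x in g if x not in acc], groups, [])
# Difference: the first subset minus everything in the later ones (assumed meaning).
diff = reduce(lambda acc, g: [x for x in acc if x not in g], groups[1:], groups[0])
# Intersection: members present in every subset.
isect = reduce(lambda acc, g: [x for x in acc if x in g], groups[1:], groups[0])
```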

Also, a record sequence may be a member of a sequence.

7. Understanding Group

The group operation is commonly used in SQL, yet many people do not understand it in depth. In essence, a group operation splits a set into several subsets according to a certain rule. In other words, the return value of a group operation should be a set consisting of subsets. People often do not need such a set, though; they only want some aggregate values of its subsets. Group operations are therefore usually followed by summarization operations on the subsets.

This is what SQL does. The GROUP BY clause is always followed by a summarization operation. Because SQL has no explicit set data type, a set consisting of subsets cannot be returned directly, so summarization must follow grouping.

Over time, one comes to think that group operations are always accompanied by summarization and forgets that grouping and summarization are independent operations.

Sometimes, however, we are interested in the grouped subsets themselves rather than the summarized values. And even when the summarized values are what we want, we may still need to keep the subsets for reuse instead of discarding them once a summarization is completed.

This requires us to understand the original meaning of the group operation. Built on thinking in sets, esProc achieves this goal very well. In fact, the basic group function in esProc performs grouping only, independently of any summarization.
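Grouping as a step of its own is easy to picture in Python. The group helper below is hypothetical; it returns the subsets themselves, in order of first appearance (a choice made for this sketch), and summarization happens afterwards as a separate step. The records reuse a few rows of the employee table from the first post.

```python
def group(records, key):
    """Split records into subsets by key, keeping first-appearance order."""
    buckets = {}
    for r in records:
        buckets.setdefault(key(r), []).append(r)
    return list(buckets.values())

employees = [
    {"Name": "Rebecca", "Dept": "R&D", "Salary": 7000},
    {"Name": "Rachel", "Dept": "Sales", "Salary": 9000},
    {"Name": "Ashley", "Dept": "R&D", "Salary": 16000},
]

by_dept = group(employees, key=lambda e: e["Dept"])       # grouping only
totals = [sum(e["Salary"] for e in g) for g in by_dept]   # summarization, a separate step
largest = max(by_dept, key=len)                           # the subsets stay available for reuse
```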

The group result is a set consisting of several subsets, and each subset can be grouped again. The members of the group result are themselves sets, so they can be grouped too. Both operations produce a multilayer set.

Because these results are so deep in hierarchy, they are rarely used in practice. The example above only illustrates the thinking pattern of sets and the nature of the operation.

8. Non-Equi Group

Besides the ordinary group function, esProc provides the align@a() function for alignment grouping and the enum() function for enumeration grouping.

The group implemented by the group function is called an equi-group with the following features:

1) Any member in the original set must and can only be in one subset;

2) There is no empty subset.

The alignment group and the enumeration group do not have these two features.

Alignment grouping computes a grouping expression on the members of a set and matches the resulting values against specified values. It involves the following steps:

1) Specify a group of values;

2) In the set to be grouped, put the members whose property values equal a specified value into one subset;

3) Each resulting subset corresponds to one of the specified values.

It is possible that a member belongs to none of the subsets, that a subset is empty, or that a member belongs to more than one subset.
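The steps above can be sketched in Python. The align_group helper is hypothetical and the sample rows are taken loosely from the employee table in the first post; it is an analogue of the technique, not the align@a() function itself.

```python
def align_group(records, key, values):
    """One subset per specified value, in the order of `values`; unmatched records are dropped."""
    return [[r for r in records if key(r) == v] for v in values]

employees = [
    ("Rebecca", "California"),
    ("Ashley", "New York"),
    ("Emily", "Texas"),
    ("Matthew", "California"),
]
states = ["California", "Texas", "Florida"]

groups = align_group(employees, key=lambda e: e[1], values=states)
names = [[e[0] for e in g] for g in groups]
```

Here Ashley ends up in no subset because New York is not among the specified values, and the Florida subset is empty.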

The following case groups employees by a specified sequence of states:

Enum group: First, specify a group of conditions. Then evaluate each condition with the members of the set to be grouped as arguments; the members that satisfy a condition are grouped into one subset, so each subset corresponds to a pre-defined condition. A member may be in none of these subsets or in several of them at the same time, and a subset may be empty.
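A Python analogue makes the overlap visible. The enum_group helper, the people and the age conditions are all made up for illustration; the helper follows the description above rather than the enum() function's actual signature.

```python
def enum_group(records, conditions):
    """One subset per condition; a record may land in several subsets or in none."""
    return [[r for r in records if cond(r)] for cond in conditions]

people = [("Ann", 24), ("Bob", 37), ("Cy", 52), ("Dee", 16)]
conditions = [
    lambda p: 18 <= p[1] < 30,   # aged 18 to 29
    lambda p: p[1] >= 35,        # aged 35 and over
    lambda p: p[1] >= 50,        # aged 50 and over (overlaps the previous condition)
    lambda p: p[1] >= 70,        # aged 70 and over (matches nobody here)
]

groups = enum_group(people, conditions)
names = [[p[0] for p in g] for g in groups]
```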

The following case will group employees by specified age groups:

Although these two functions appear to differ greatly from the group function, all three share the same nature as group operations: they split a set into several subsets. The only difference is how they split it.
