Data processing with multi-database, off database, log, txt and excel: Computing the Online Time for Users with esProc (IV)

In the last article, we described how IT engineers at the Web Company used esProc to write a single-machine, multi-threaded program that could handle large data volumes and complex requirements, making full use of one multi-core, multi-CPU machine. Now the engineers have run into a new issue: as the number of users of the online application grows explosively, colleagues from the Operation Department complain that the online time computation program still runs too slowly.

The IT engineers used esProc's multi-machine parallel computing capability to split the task across multiple machines, which resolved the performance problem. Moving from single-machine to multi-machine parallel processing required relatively low hardware and software upgrade costs.

First, let's review how the Web Company records user behavior. Data is recorded in log files, and a separate log file is generated every day. For example, the following log file, “2014-01-07.log”, contains users' online actions on January 7, 2014. To compute users' online time for the week of 2014-01-05 to 2014-01-11, we need to retrieve data from 7 log files:

logtime userid action
2014-01-07 09:27:56 258872799 login
2014-01-07 09:27:57 264484116 login
2014-01-07 09:27:58 264484279 login
2014-01-07 09:27:58 264548231 login
2014-01-07 09:27:58 248900695 login
2014-01-07 09:28:00 263867071 login
2014-01-07 09:28:01 264548400 login
2014-01-07 09:28:02 264549535 login
2014-01-07 09:28:02 264483234 login
2014-01-07 09:28:03 264484643 login
2014-01-07 09:28:05 308343890 login
2014-01-07 09:28:08 1210636885 post
2014-01-07 09:28:09 263786154 login
2014-01-07 09:28:12 263340514 get
2014-01-07 09:28:13 312717032 login
2014-01-07 09:28:16 263210957 login
2014-01-07 09:28:19 116285288 login
2014-01-07 09:28:22 311560888 login
2014-01-07 09:28:25 652277973 login
2014-01-07 09:28:34 310100518 login
2014-01-07 09:28:38 1513040773 login
2014-01-07 09:28:41 1326724709 logout
2014-01-07 09:28:45 191382377 login
2014-01-07 09:28:46 241719423 login
2014-01-07 09:28:46 245054760 login
2014-01-07 09:28:46 1231483493 get
2014-01-07 09:28:48 266079580 get
2014-01-07 09:28:51 1081189909 post
2014-01-07 09:28:51 312718109 login
2014-01-07 09:29:00 1060091317 login
2014-01-07 09:29:02 1917203557 login
2014-01-07 09:29:16 271415361 login
2014-01-07 09:29:18 277849970 login

Log files record, in chronological order, the user's operation (action), the user ID (userid) and the time when the action took place (logtime) in the application. There are three types of user operations: login, logout and get/post actions.
The Operation Department provided the following requirements for computing users' online time:
1. Login should be considered the starting point of online time, and overnight sessions should be taken into consideration.

2. If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.

3. If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.

4. If there is only a login, without a logout, then the time of the last operation should be treated as the logout time.

5. For users who completed a post operation, his/her current online time will be tripled in the computation.
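The five rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the esProc program used by the Web Company: the function name `online_seconds` and the constants are made up for this example, "tripled" is read as applying to the session that contains the post, and operations seen before any login are ignored.

```python
from datetime import datetime

MIN_GAP = 3        # rule 2: gaps shorter than this (seconds) are not counted
MAX_GAP = 600      # rule 3: a longer gap means the user already logged out
POST_FACTOR = 3    # rule 5: a session containing a post counts three times


def online_seconds(events):
    """Total online seconds for ONE user (illustrative helper, not esProc).

    events: (timestamp, action) pairs in chronological order; they may span
    several days, which is how overnight sessions are covered (rule 1).
    """
    total = 0.0
    session = 0.0     # seconds accumulated in the open session
    posted = False    # whether the open session contains a post
    prev = None       # time of the previous operation; None means offline

    for ts, action in events:
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > MAX_GAP:
                # Rule 3: the session ended at the previous operation.
                total += session * (POST_FACTOR if posted else 1)
                session, posted, prev = 0.0, False, None
            elif gap >= MIN_GAP:
                session += gap
        if prev is None:
            # Rule 1: while offline, only a login opens a session.
            if action == "login":
                prev = ts
            continue
        if action == "post":
            posted = True
        if action == "logout":
            total += session * (POST_FACTOR if posted else 1)
            session, posted, prev = 0.0, False, None
        else:
            prev = ts

    # Rule 4: no logout seen, so the last operation closes the session.
    return total + session * (POST_FACTOR if posted else 1)
```

The function expects the records of a single user, already sorted by time, which is why the later steps take care to keep each user's records together.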

To improve performance, the Web Company increased the number of servers from 1 to 3. Accordingly, the following steps are needed to move from single-machine parallel to multi-machine parallel processing:

The first step: Modify the esProc program that processes the weekly log files. Divide the user ID by 3 and split the weekly log file into 3 files according to the remainder, so that every server processes one of them. This reduces the file size and shortens the file transfer time. The three files are then uploaded to the three servers, where multiple parallel programs do the computation. The actual program is as follows:

Note in the screenshot above that A6 uses the @g option of the export function to write the "log files for one week" into three binary files. During subsequent parallel processing, the content of the log files can be retrieved in blocks for different users. The @g option ensures that segmented data retrieval is aligned to group borders, so that data for the same user is never assigned to two blocks.
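As a rough illustration of the split described in this step, the following Python sketch assigns each log record to one of three buckets by the remainder of the user ID and keeps every user's records contiguous. It is not the esProc program from the article: the function name `split_by_user` is hypothetical, and the line format (date, time, userid, action separated by whitespace) is assumed from the sample log.

```python
def split_by_user(lines, parts=3):
    """Partition log lines into `parts` buckets by userid % parts.

    Each bucket is sorted by (userid, logtime), so all records of one user
    sit next to each other, in chronological order.
    """
    buckets = [[] for _ in range(parts)]
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        date, time, userid, action = line.split()
        uid = int(userid)
        buckets[uid % parts].append((uid, f"{date} {time}", action))
    for rows in buckets:
        rows.sort()  # ISO-style timestamps sort chronologically as strings
    return buckets


sample = [
    "2014-01-07 09:27:56 258872799 login",
    "2014-01-07 09:27:58 264484279 login",
    "2014-01-07 09:28:41 1326724709 logout",
    "2014-01-07 09:30:00 264484279 get",
]
for number, rows in enumerate(split_by_user(sample)):
    print(number, rows)
```

Because the bucket number depends only on the user ID, all records of a given user always land in the same file, and therefore on the same server.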

The second step: The single-machine multi-threaded program is unchanged. Let's review it.

Subroutine parameters are shown below. They are used to pass the log file name, the block number and the total number of blocks for the week when the subroutine is called by the main program. Here the log file name for the week, week file, is already one of the three segmented files corresponding to this machine.

The subroutine is as follows:

The above screenshot illustrates that:

1. Because we previously used export@g to output the file grouped by user ID, using the @z option of the cursor in A2 to handle a specific block (its value is the block number) out of the total (its value is the total number of blocks) retrieves the complete group for each user ID. Data for one user will not be split into two blocks.

2. The code line in the red box returns the resulting file as a cursor to the main program. Since multi-machine parallel processing is used here, this cursor is a remote cursor (see the esProc documentation for a detailed introduction to remote cursors).
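The idea of reading one block out of several while keeping each user's records together can be sketched in Python as below. This only illustrates the concept of aligning block boundaries to group borders; it makes no claim about how esProc implements export@g or the @z cursor option internally, and the names `block_bounds` and `read_block` are hypothetical.

```python
def block_bounds(rows, block, total, key=lambda row: row[0]):
    """Return (start, end) indexes of the 1-based `block` out of `total`.

    rows must already be grouped by key (here: the user ID in column 0).
    Nominal cut points are moved forward to the next group border, so a
    group is never divided between two blocks.
    """
    def align(pos):
        while 0 < pos < len(rows) and key(rows[pos]) == key(rows[pos - 1]):
            pos += 1
        return pos

    n = len(rows)
    return align(n * (block - 1) // total), align(n * block // total)


def read_block(rows, block, total):
    start, end = block_bounds(rows, block, total)
    return rows[start:end]


rows = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (3, "a"), (3, "b")]
blocks = [read_block(rows, b, 3) for b in (1, 2, 3)]
print(blocks)
```

The end of one block is computed the same way as the start of the next, so the blocks cover every row exactly once; a block may be empty when one group is very large.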

The third step: Write the main program for parallel computing, which calls the parallel computing subroutine. As illustrated below, the main program runs parallel tasks on three machines, which effectively improves computation performance.

The server list in the program could also be written into a configuration file, which makes any subsequent increase or decrease in the number of servers easy.

Note: for specific measurements of esProc's performance gain with parallel computing, please refer to the related esProc test reports.

Notes on the above screenshot:

1. The callx parameters specify 3 servers, A1 to A3, to handle the three log files B1 to B3.

2. The syntax of callx's input parameters specifies the three servers through A5, and 6 parallel computing tasks for each server in A6.

3. The server list, the number of servers and the number of tasks per server can be adjusted to the actual situation, to make full use of the servers' performance.
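The dispatch pattern in these notes (3 servers, one file per server, 6 tasks per server) can be sketched in Python as follows. This is a local simulation under stated assumptions, not the esProc callx call: the server and file names are hypothetical, threads stand in for remote machines, and `fake_worker` is a placeholder for the remote subroutine.

```python
from concurrent.futures import ThreadPoolExecutor


def build_tasks(servers, files, tasks_per_server):
    """One (server, file, block, total) tuple per parallel task."""
    return [
        (server, path, block, tasks_per_server)
        for server, path in zip(servers, files)
        for block in range(1, tasks_per_server + 1)
    ]


def run_all(tasks, worker):
    """Run worker(server, path, block, total) for every task concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        return list(pool.map(lambda task: worker(*task), tasks))


def fake_worker(server, path, block, total):
    # Stand-in for the remote subroutine call; it only echoes its arguments.
    return f"{server}:{path}:{block}/{total}"


servers = ["server-1", "server-2", "server-3"]       # hypothetical names
files = ["week_0.bin", "week_1.bin", "week_2.bin"]   # hypothetical names
results = run_all(build_tasks(servers, files, 6), fake_worker)
print(len(results))
```

Keeping the server list and the number of tasks per server as plain parameters mirrors the point made above: both can be changed without touching the task logic.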

The fourth step: Deploy the esProc server and upload the related program and data files. Refer to the esProc instructions for specific steps and methods.

After the move to multi-machine parallel computing, the Operation Department found a significant improvement in the speed of computing users' online time. The cost of this change is much lower than that of upgrading the application databases; on the hardware side in particular, only 2 additional PC servers were needed.

So far, the Web Company has finished implementing its esProc-based user behavior analysis and computation platform. Its main advantages are:

1. The platform is easy to adjust for more complex algorithms in the future, which shortens response time and saves engineers' labor costs.

2. It is easy to scale out for even larger data volumes in the future, with shorter project time and lower upgrade costs.

Using esProc to Compute the Online Time of Users (I)

Using esProc to Compute the Online Time of Users (II)

Computing the Online Time for Users with esProc (III)

Thursday, July 17, 2014
