2014年8月17日星期日

A Handy Method of Grouping and Summarizing Text File Data in Java

We often need to process text file data while programming. Here is an example for illustrating how to group and summarize text file data in Java: load employee information from text file employee.txt, group according to DEPT and seek COUNT, the number of employee, and total amount of SALARY of each group.

Text file employee.txt is in a format as follows:
EID   NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY
1       Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000
2       Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000
3       Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000
4       Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000
5       Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000
6       Matthew Johnson   M     California 1984-07-07       2005-07-07       Sales         11000
7       Alexis        Smith        F       Illinois       1972-08-16       2002-08-16       Sales         9000
8       Megan     Wilson      F       California 1979-04-19       1984-04-19       Marketing        11000
9       Victoria    Davis        F       Texas        1983-12-07       2009-12-07       HR    3000
10     Ryan         Johnson   M     Pennsylvania    1976-03-12       2006-03-12       R&D          13000
11     Jacob        Moore      M     Texas        1974-12-16       2004-12-16       Sales         12000
12     Jessica     Davis        F       New York 1980-09-11       2008-09-11       Sales         7000
13     Daniel       Davis        M     Florida      1982-05-14       2010-05-14       Finance    10000

Java's way of writing code for this task is:
1.Import data from the file by rows and save them in emp, the multiple Map objects of sourceList, the List object.
2.Traverse the object of sourceList, perform grouping according to DEPT and save the result in list, which contains different List objects of group, the Map object.
3.Traverse group and then traverse each DEPT’s list object, and sum up SALARY.
4.While traversing group, save the values of DEPT, COUNT and SALARY in result, the Map object, and the results of different departments in resultList, the List object.
5.Print out the data of resultList.

The code is as follows:
public static void myGroup() throws Exception{
           File file = new File("D:\\esProc\\employee.txt");
           FileInputStream fis = null;
           fis = new FileInputStream(file);
           InputStreamReader input = new InputStreamReader(fis);
           BufferedReader br = new BufferedReader(input);
           String line = null;
           String info[] = null;
           List<Map<String,String>> sourceList= new ArrayList<Map<String,String>>();
           List<Map<String,Object>> resultList= new ArrayList<Map<String,Object>>();
           if ((line = br.readLine())== null) return;//skip the first row
           while((line = br.readLine())!= null){
                    info = line.split("\t");
                    Map<String,String> emp=new HashMap<String,String>();
                    emp.put("EID",info[0]);
                    emp.put("NAME",info[1]);
                    emp.put("SURNAME",info[2]);
                    emp.put("GENDER",info[3]);
                    emp.put("STATE",info[4]);
                    emp.put("BIRTHDAY",info[5]);
                    emp.put("HIREDATE",info[6]);
                    emp.put("DEPT",info[7]);
                    emp.put("SALARY",info[8]);
                    sourceList.add(emp);
           }
           Map<String,List<Map<String,String>>> group = new HashMap<String,List<Map<String,String>>>();
           //grouping object
           for (int i = 0, len = sourceList.size(); i < len; i++) {//group datafrom different DEPT
                    Map<String,String> emp =(Map) sourceList.get(i); 
            if(group.containsKey(emp.get("DEPT"))) {
                group.get(emp.get("DEPT")).add(emp) ;
            } else {
                List<Map<String,String>> list = new ArrayList<Map<String,String>>() ;
                list.add(emp) ;
                group.put(emp.get("DEPT"),list) ;
            }
           }
           Set<String> key = group.keySet();
           for (Iterator it = key.iterator(); it.hasNext();) {//summarize the grouped data                     
String dept = (String) it.next();
                    List<Map<String,String>> list = group.get(dept);
                    double salary =0;
                    for (int i = 0, len = list.size(); i < len; i++) {
                             salary += Float.parseFloat(list.get(i).get("SALARY"));
                    }
                    Map<String,Object> result=new HashMap<String,Object>();
                    result.put("DEPT",dept);
                    result.put("SALARY",salary);
                    result.put("COUNT",list.size());
                    resultList.add(result);
           }
           for (int i = 0, len = resultList.size(); i < len; i++) {//print out the resulting data
                    System.out.println("dept="+resultList.get(i).get("DEPT")+
                                       "||salary="+resultList.get(i).get("SALARY")+
                                       "||count="+resultList.get(i).get("COUNT"));
           }
}

The results after the code is executed are as follows:

    dept=Sales||salary=1362500.0||count=187
    dept=Finance||salary=177500.0||count=24
    dept=Administration||salary=40000.0||count=4
    dept=Production||salary=663000.0||count=91
    dept=Marketing||salary=733500.0||count=99

Here myGroup function has only one grouping field. If it has multiple grouping fields, nested multi-layer collections class is needed and the code will become more complicated. As myGroup function has fixed grouping fields and summarizing fields, if there is any change about the fields, we have no choice but to modify the program. This robsthe function of the ability to deal with situations of flexible and dynamic grouping and summarizing. In order to enable it to handle these situations as well as SQL statement does, we need to develop additional program for analyzing and evaluating dynamic expressions, which is a rather difficult job.

As a programming language specially designed for processing structured (semi-structured) data and able to perform dynamic grouping and summarizing easily, esProc can rise to the occasion at this time as a better assistive tool. It can integrate with Java seamlessly, enabling Java to access and process text file data as dynamically as SQL does.

For example, group according to DEPT and seek COUNT, the number of employees, and the total amount of SALARY of each group. To do this, esProc can import from external an input parameter "groupBy" as the condition of dynamic grouping and summarizing. See the following chart:

The value of “groupBy” is DEPT:dept;count(~):count,sum(SALARY):salary. esProc needs only three lines of code as follows to do this job: 

A1Define a file object and import the data to it. The first row is the headline which uses tab as the default field separator. esProc’s IDE can display the imported data visually, like the right part of the above chart.
A2Group and summarize according to the specified field. Here macro is used to dynamically analyze the expression in which groupBy is an input parameter. esProc will first compute the expression enclosed by ${…}, then replace ${…} with the computed result acting as the macro string value and interpret and execute the result. In this example, the code we finally execute is =A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary).
A3Return the eligible result set to the external program.

When the grouping field is changed, it is no need to change the program. We just need to change the parameter groupBy. For example, group according to the two fields DEPT and GENDER and seek COUNT, the number of employees, and the total amount of SALARY of each group. The value of parameter groupBy can be expressed like this: DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary. After execution, the result set in A2 is as follows: 
Finally, call this piece of esProc code in Java to get the grouping and summarizing result byusing JDBCprovided by esProc. The code called by Java for saving the above esProc code as test.dfx file is as follows:
  // create esProc jdbc connection
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
//call esProc code (the stored procedure) in which test is the file name of dfx
com.esproc.jdbc.InternalCStatement st;
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");
// set parameters
st.setObject(1,"DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary");//the parameters are the dynamic grouping and summarizing fields
// execute the esProc stored procedure
st.execute();
// get the result set
ResultSet set = st.getResultSet();

Here the relatively simple esProc code can be called directly by Java, so it is unnecessary to write esProc script file (like the above test.dfx). The code is as follows:
st=(com. esproc.jdbc.InternalCStatement)con.createStatement();
          ResultSet set=st.executeQuery("=file(\"D:\\\\esProc\\\\employee.txt\").import@t().groups(DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary)");

The above Java code directly calls a line of code from esProc script: get data from the text file, group and summarize them according to the specified fields and return the result toset, Java's ResultSet object.

没有评论:

发表评论