Data processing with multi-database, off database, log, txt and excel: A Handy Method of Grouping and Summarizing Text File Data in Java

We often need to process text file data while programming. Here is an example for illustrating how to group and summarize text file data in Java: load employee information from text file employee.txt, group according to DEPT and seek COUNT, the number of employee, and total amount of SALARY of each group.

Text file employee.txt is in a format as follows:

EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY

1 Rebecca Moore F California 1974-11-20 2005-03-11 R&D 7000

2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000

3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000

4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000

5 Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000

6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales 11000

7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000

8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000

9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000

10 Ryan Johnson M Pennsylvania 1976-03-12 2006-03-12 R&D 13000

11 Jacob Moore M Texas 1974-12-16 2004-12-16 Sales 12000

12 Jessica Davis F New York 1980-09-11 2008-09-11 Sales 7000

13 Daniel Davis M Florida 1982-05-14 2010-05-14 Finance 10000

…

Java's way of writing code for this task is:

1.Import data from the file by rows and save them in emp, the multiple Map objects of sourceList, the List object.

2.Traverse the object of sourceList, perform grouping according to DEPT and save the result in list, which contains different List objects of group, the Map object.

3.Traverse group and then traverse each DEPT’s list object, and sum up SALARY.

4.While traversing group, save the values of DEPT, COUNT and SALARY in result, the Map object, and the results of different departments in resultList, the List object.

5.Print out the data of resultList.

The code is as follows:

public static void myGroup() throws Exception{

File file = new File("D:\\esProc\\employee.txt");

FileInputStream fis = null;

fis = new FileInputStream(file);

InputStreamReader input = new InputStreamReader(fis);

BufferedReader br = new BufferedReader(input);

String line = null;

String info[] = null;

List<Map<String,String>> sourceList= new ArrayList<Map<String,String>>();

List<Map<String,Object>> resultList= new ArrayList<Map<String,Object>>();

if ((line = br.readLine())== null) return;//skip the first row

while((line = br.readLine())!= null){

info = line.split("\t");

Map<String,String> emp=new HashMap<String,String>();

emp.put("EID",info[0]);

emp.put("NAME",info[1]);

emp.put("SURNAME",info[2]);

emp.put("GENDER",info[3]);

emp.put("STATE",info[4]);

emp.put("BIRTHDAY",info[5]);

emp.put("HIREDATE",info[6]);

emp.put("DEPT",info[7]);

emp.put("SALARY",info[8]);

sourceList.add(emp);

}

Map<String,List<Map<String,String>>> group = new HashMap<String,List<Map<String,String>>>();

//grouping object

for (int i = 0, len = sourceList.size(); i < len; i++) {//group datafrom different DEPT

Map<String,String> emp =(Map) sourceList.get(i);

if(group.containsKey(emp.get("DEPT"))) {

group.get(emp.get("DEPT")).add(emp) ;

} else {

List<Map<String,String>> list = new ArrayList<Map<String,String>>() ;

list.add(emp) ;

group.put(emp.get("DEPT"),list) ;

}

Set<String> key = group.keySet();

for (Iterator it = key.iterator(); it.hasNext();) {//summarize the grouped data

String dept = (String) it.next();

List<Map<String,String>> list = group.get(dept);

double salary =0;

for (int i = 0, len = list.size(); i < len; i++) {

salary += Float.parseFloat(list.get(i).get("SALARY"));

}

Map<String,Object> result=new HashMap<String,Object>();

result.put("DEPT",dept);

result.put("SALARY",salary);

result.put("COUNT",list.size());

resultList.add(result);

}

for (int i = 0, len = resultList.size(); i < len; i++) {//print out the resulting data

System.out.println("dept="+resultList.get(i).get("DEPT")+

"||salary="+resultList.get(i).get("SALARY")+

"||count="+resultList.get(i).get("COUNT"));

}

The results after the code is executed are as follows:

dept=Sales||salary=1362500.0||count=187

dept=Finance||salary=177500.0||count=24

dept=Administration||salary=40000.0||count=4

dept=Production||salary=663000.0||count=91

dept=Marketing||salary=733500.0||count=99

Here myGroup function has only one grouping field. If it has multiple grouping fields, nested multi-layer collections class is needed and the code will become more complicated. As myGroup function has fixed grouping fields and summarizing fields, if there is any change about the fields, we have no choice but to modify the program. This robsthe function of the ability to deal with situations of flexible and dynamic grouping and summarizing. In order to enable it to handle these situations as well as SQL statement does, we need to develop additional program for analyzing and evaluating dynamic expressions, which is a rather difficult job.

As a programming language specially designed for processing structured (semi-structured) data and able to perform dynamic grouping and summarizing easily, esProc can rise to the occasion at this time as a better assistive tool. It can integrate with Java seamlessly, enabling Java to access and process text file data as dynamically as SQL does.

For example, group according to DEPT and seek COUNT, the number of employees, and the total amount of SALARY of each group. To do this, esProc can import from external an input parameter "groupBy" as the condition of dynamic grouping and summarizing. See the following chart:

The value of “groupBy” is DEPT:dept;count(~):count,sum(SALARY):salary. esProc needs only three lines of code as follows to do this job:

A1：Define a file object and import the data to it. The first row is the headline which uses tab as the default field separator. esProc’s IDE can display the imported data visually, like the right part of the above chart.

A2：Group and summarize according to the specified field. Here macro is used to dynamically analyze the expression in which groupBy is an input parameter. esProc will first compute the expression enclosed by ${…}, then replace ${…} with the computed result acting as the macro string value and interpret and execute the result. In this example, the code we finally execute is =A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary).

A3：Return the eligible result set to the external program.

When the grouping field is changed, it is no need to change the program. We just need to change the parameter groupBy. For example, group according to the two fields DEPT and GENDER and seek COUNT, the number of employees, and the total amount of SALARY of each group. The value of parameter groupBy can be expressed like this: DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary. After execution, the result set in A2 is as follows:

Finally, call this piece of esProc code in Java to get the grouping and summarizing result byusing JDBCprovided by esProc. The code called by Java for saving the above esProc code as test.dfx file is as follows:

// create esProc jdbc connection

Class.forName("com.esproc.jdbc.InternalDriver");

con= DriverManager.getConnection("jdbc:esproc:local://");

//call esProc code (the stored procedure) in which test is the file name of dfx

com.esproc.jdbc.InternalCStatement st;

st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");

// set parameters

st.setObject(1,"DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary");//the parameters are the dynamic grouping and summarizing fields

// execute the esProc stored procedure

st.execute();

// get the result set

ResultSet set = st.getResultSet();

Here the relatively simple esProc code can be called directly by Java, so it is unnecessary to write esProc script file (like the above test.dfx). The code is as follows:

st=(com. esproc.jdbc.InternalCStatement)con.createStatement();

ResultSet set=st.executeQuery("=file(\"D:\\\\esProc\\\\employee.txt\").import@t().groups(DEPT:dept,GENDER:gender;count(~):count,sum(SALARY):salary)");

The above Java code directly calls a line of code from esProc script: get data from the text file, group and summarize them according to the specified fields and return the result toset, Java's ResultSet object.

Data processing with multi-database, off database, log, txt and excel

2014年8月17日星期日

A Handy Method of Grouping and Summarizing Text File Data in Java

没有评论:

发表评论