“Group By” Operation with Pig

SQL with SQLDF Package: 59 Seconds User Time

library(sqldf)  # provides read.csv.sql
a <- read.csv.sql('2008.csv2', sql = 'select V2, count(V1) from file group by V2', header = FALSE)

Apache Pig: 47 Seconds User Time

a = LOAD '2008.csv2' USING PigStorage(',');  
b = FOREACH (GROUP a BY $1) GENERATE group, COUNT(a.$0);
dump b;
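For readers unfamiliar with either tool, here is a rough Python sketch of what both snippets compute: a row count per distinct value of the second column (V2 in the SQL, $1 in Pig). The function name and the three-row sample are made up for illustration. The sketch counts every row and does not try to model how either tool treats NULLs in the first column, and it is not part of the timing comparison.

```python
import csv
import io
from collections import Counter


def count_by_second_column(lines):
    """Count rows per distinct value of the second column (V2 / $1)."""
    counts = Counter()
    for row in csv.reader(lines):
        # Ignore blank or malformed rows that have no second field.
        if len(row) >= 2:
            counts[row[1]] += 1
    return counts


# Tiny made-up sample standing in for 2008.csv2 (not real data).
sample = io.StringIO("2008,1,3\n2008,1,4\n2008,2,5\n")
print(dict(count_by_second_column(sample)))  # {'1': 2, '2': 1}
```

To run it against the real file, pass an open file handle instead of the sample, e.g. open('2008.csv2', newline='').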

September 27, 2014



