Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Implement HAVING Clause in SQL with Pig

SQL with SQLDF Package: 58 Seconds User Time

library(sqldf)
# With header = FALSE the columns are named V1, V2, ...; the SQL refers to the input as "file".
a <- read.csv.sql('2008.csv2', sql = 'select V2, count(V1) from file group by V2 having count(V1) > 600000', header = FALSE)
print(a)

Apache Pig: 47 Seconds User Time

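-- Load the comma-delimited file; no schema is declared, so fields are referenced by position.
a = LOAD '2008.csv2' USING PigStorage(',');
-- Group by the second field, count per group, then filter on the count (the HAVING step).
b = FILTER (FOREACH (GROUP a BY $1) GENERATE group, COUNT(a.$0)) BY $1 > 600000;
dump b;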
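
For readers who find the nested one-liner hard to parse, the same HAVING logic can be unrolled into one statement per step. This is only a sketch of the equivalent script: the aliases g and c and the field names v2 and n are introduced here for illustration and are not part of the original code.

```pig
-- Load the comma-delimited file; fields are untyped and addressed by position.
a = LOAD '2008.csv2' USING PigStorage(',');

-- GROUP BY V2: collect the rows that share the same second field.
g = GROUP a BY $1;

-- SELECT V2, COUNT(V1): one output row per group with its count.
-- (v2 and n are illustrative names, not from the original script.)
c = FOREACH g GENERATE group AS v2, COUNT(a.$0) AS n;

-- HAVING COUNT(V1) > 600000: an ordinary FILTER applied after aggregation.
b = FILTER c BY n > 600000L;

DUMP b;
```

In Pig the HAVING clause needs no special syntax: once the aggregate is materialized as a field, it is filtered like any other column.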

Written by statcompute

September 27, 2014 at 10:59 pm

Posted in Big Data, Pig Latin, SQL
