Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for the ‘Pig Latin’ Category

Implement HAVING Clause in SQL with Pig

SQL with SQLDF Package: 58 Seconds User Time

library(sqldf)
a <- read.csv.sql('2008.csv2', sql = 'select V2, count(V1) from file group by V2 having count(V1) > 600000', header = FALSE)
print(a)

Apache Pig: 47 Seconds User Time

a = LOAD '2008.csv2' USING PigStorage(',');  
b = FILTER (FOREACH (GROUP a BY $1) GENERATE group, COUNT(a.$0)) BY $1 > 600000;
dump b;
Advertisements

Written by statcompute

September 27, 2014 at 10:59 pm

Posted in Big Data, Pig Latin, SQL

Tagged with

“Group By” Operation with Pig

SQL with SQLDF Package: 59 Seconds User Time

library(sqldf)
a <- read.csv.sql('2008.csv2', sql = 'select V2, count(V1) from file group by V2';, header = FALSE)
print(a)

Apache Pig: 47 Seconds User Time

a = LOAD '2008.csv2' USING PigStorage(',');  
b = FOREACH (GROUP a BY $1) GENERATE group, COUNT(a.$0);
dump b;

Written by statcompute

September 27, 2014 at 9:23 pm

Posted in Pig Latin

Tagged with

Select Distinct Values with Pig

First of all, I used SQL statement with SQLDF package in R. It took ~51 seconds user time to select 12 rows out of 7 millions.

library(sqldf)
a <- read.csv.sql('2008.csv2', sql = "select distinct V1, V2 from file", header = FALSE)
print(a)

Next, I used Apache Pig running in the local mode and spent ~36 seconds getting the same 12 rows.

a = LOAD '2008.csv2' USING PigStorage(',');  
b = DISTINCT(FOREACH a GENERATE $0, $1);
dump b;

Although my purpose of this exercise is to learn Pig Latin through SQL statement, I am still very impressed by the efficiency of Apache Pig.

Written by statcompute

September 24, 2014 at 10:20 pm

Posted in Pig Latin

Tagged with , ,

Test Drive of PiggyBank

REGISTER /home/liuwensui/pig/lib/piggybank.jar;

DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader;

a = LOAD '/home/liuwensui/Downloads/2008.csv2' USING CSVLoader();

b = FOREACH a GENERATE $8 as x;

c = LIMIT b 4;

dump c;

/*
(WN)
(WN)
(WN)
(WN)
*/

DEFINE L org.apache.pig.piggybank.evaluation.string.LOWER;

DEFINE R org.apache.pig.piggybank.evaluation.string.REPLACE;

d = FOREACH c GENERATE x, L(x), R(x, 'WN', 'WW');

dump d;

/*
(WN,wn,WW)
(WN,wn,WW)
(WN,wn,WW)
(WN,wn,WW)
*/

Written by statcompute

September 23, 2014 at 10:27 pm

Posted in Pig Latin

Tagged with ,

Learning Pig Latin on 2014-03-01

grunt> sh head -4 2008.csv;
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA

grunt> sh sed '1d' 2008.csv > 2008.csv2;

grunt> sh head -4 2008.csv2;            
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,926,930,1054,1100,WN,1746,N612SW,88,90,78,-6,-4,IND,BWI,515,3,7,0,,0,NA,NA,NA,NA,NA

grunt> A = LOAD '2008.csv2' USING PigStorage(',');                

grunt> B = GROUP A BY ($0, $1); 

grunt> C = FOREACH B GENERATE group, COUNT(A);

grunt> D = FILTER C BY $0.$1 IN (1, 2, 3);

grunt> SPLIT D INTO D1 IF $0.$1 == 1, D2 IF $0.$1 == 2, D3 IF $0.$1 == 3;

grunt> dump D3;
((2008,3),616090)

Written by statcompute

March 1, 2014 at 11:05 pm

Posted in Big Data, Pig Latin

My First Learning Session in Pig Latin

-- load data
A = load '/home/liuwensui/Documents/data/test_pig.txt' using PigStorage(',') as (id: chararray, x: int, y: float);
dump A;

/*
(A,1,1.1)
(A,2,2.2)
(A,2,3.3)
(B,1,1.1)
(B,1,2.2)
*/

-- group data
B = group A by id;
dump B;

/*
(A,{(A,1,1.1),(A,2,2.2),(A,2,3.3)})
(B,{(B,1,1.1),(B,1,2.2)})
*/

-- summarize data by group
C1 = foreach B generate group as id1, MAX(A.x) as max_x;
dump C1;

/*
(A,2)
(B,1)
*/

C2 = foreach B generate group as id2, MIN(A.y) as min_y;
dump C2;

/*
(A,1.1)
(B,1.1)
*/

-- select data by criterion
C3 = filter C2 by id2 == 'A';
dump C3;

/*
(A,1.1)
*/

-- inner join
C4 = join C1 by id1, C3 by id2;
dump C4;

/*
(A,2,A,1.1)
*/

-- full join
C5 = join C1 by id1 full, C3 by id2;
dump C5;

/*
(A,2,A,1.1)
(B,1,,)
*/

-- union
C6 = union C4, C5;
dump C6;

/*
(A,2,A,1.1)
(B,1,,)
(A,2,A,1.1)
*/

Written by statcompute

February 23, 2014 at 11:23 pm

Posted in Big Data, Pig Latin