Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Select Distinct Values with Pig

First, I ran a SQL statement through the sqldf package in R. It took ~51 seconds of user time to select 12 distinct rows out of roughly 7 million.

library(sqldf)
# With header = FALSE the columns are named V1, V2, ...; "file" refers to the CSV being read.
a <- read.csv.sql('2008.csv2', sql = "select distinct V1, V2 from file", header = FALSE)
print(a)

Next, I ran Apache Pig in local mode, which took ~36 seconds to return the same 12 rows.

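-- No schema is declared, so fields are referenced by position ($0, $1).
a = LOAD '2008.csv2' USING PigStorage(',');
-- Project the first two fields, then drop duplicate tuples.
b = DISTINCT(FOREACH a GENERATE $0, $1);
DUMP b;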
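The script above nests the FOREACH inside the DISTINCT. As a minimal sketch, the same pipeline can also be written one step per statement, which lines up more directly with the clauses of the SQL query. The aliases `projected` and `uniq` are names I made up for illustration, and the file is assumed to be the same comma-delimited '2008.csv2'.

```pig
-- Load the comma-delimited file; without a schema, fields are addressed by position.
a = LOAD '2008.csv2' USING PigStorage(',');

-- Roughly "select V1, V2": keep only the first two fields.
projected = FOREACH a GENERATE $0, $1;

-- Roughly "select distinct": remove duplicate tuples.
uniq = DISTINCT projected;

-- Print the result to the console.
DUMP uniq;
```

Splitting the steps should not change the result; it only makes each intermediate relation available by name, for example for a DESCRIBE or a second DUMP while learning.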

Although the purpose of this exercise was to learn Pig Latin by way of SQL statements, I am still very impressed by the efficiency of Apache Pig.


Written by statcompute

September 24, 2014 at 10:20 pm

Posted in Pig Latin

