## Archive for **September 2014**

## Row Search in Parallel

I have always wondered whether the efficiency of a row search can be improved if the whole data.frame is split into chunks and the row search is then conducted within each chunk in parallel.

In the R code below, a comparison is done between the standard row search and a parallel row search with the foreach package. The result is very encouraging. For 10 replications, the elapsed time of the parallel search is only about a third of the elapsed time of the standard search.

```r
load('2008.Rdata')
data2 <- split(data, 1:8)

library(rbenchmark)
library(doParallel)
registerDoParallel(cores = 8)
library(foreach)

benchmark(replications = 10, order = "elapsed",
  non_parallel = {
    test1 <- data[which(data$ArrTime == 1500 & data$Origin == 'ABE'), ]
  },
  parallel = {
    test2 <- foreach(i = data2, .combine = rbind) %dopar%
      i[which(i$ArrTime == 1500 & i$Origin == 'ABE'), ]
  }
)
#           test replications elapsed relative user.self sys.self user.child
# 2     parallel           10   2.680    1.000     0.319    0.762     12.078
# 1 non_parallel           10   7.474    2.789     7.339    0.139      0.000
```
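The same chunk-and-filter pattern can be sketched outside of R as well. Below is a minimal Python illustration under the assumption that the table is a list of dicts: the rows are split round-robin into 8 chunks (mirroring `split(data, 1:8)`), each chunk is filtered concurrently, and the per-chunk results are concatenated (the `rbind` step). It uses threads rather than forked workers purely to keep the sketch self-contained.

```python
# Chunk-and-filter sketch: split rows into chunks, filter each chunk
# concurrently, then concatenate the per-chunk matches.
# NOTE: the list-of-dicts row layout and the 8-way split are assumptions
# made for illustration, not part of the R code above.
from concurrent.futures import ThreadPoolExecutor

def search_chunk(chunk):
    """Return rows in one chunk with ArrTime == 1500 and Origin == 'ABE'."""
    return [r for r in chunk if r["ArrTime"] == 1500 and r["Origin"] == "ABE"]

def parallel_search(rows, n_chunks=8):
    # Round-robin split, mirroring split(data, 1:8) in R.
    chunks = [rows[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = pool.map(search_chunk, chunks)
    # The rbind equivalent: flatten the per-chunk lists back together.
    return [row for part in results for row in part]
```

Whether this pays off depends on the data size and the cost of the split itself; like the R benchmark above, it is worth measuring rather than assuming.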

## Implement HAVING Clause in SQL with Pig

**SQL with SQLDF Package: 58 Seconds User Time**

```r
library(sqldf)
a <- read.csv.sql('2008.csv2',
                  sql = 'select V2, count(V1) from file group by V2 having count(V1) > 600000',
                  header = FALSE)
print(a)
```

**Apache Pig: 47 Seconds User Time**

```pig
a = LOAD '2008.csv2' USING PigStorage(',');
b = FILTER (FOREACH (GROUP a BY $1) GENERATE group, COUNT(a.$0)) BY $1 > 600000;
dump b;
```
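The HAVING semantics, group, count, then keep only groups whose count clears a threshold, can also be expressed in a few lines of plain Python. This is just an illustrative sketch: the column names (`V1` as a row id, `V2` as the grouping key) mirror the sqldf query, and the tiny threshold is for toy data, not the 600,000 cutoff above.

```python
# HAVING semantics in plain Python: GROUP BY key, COUNT(*), then
# HAVING COUNT(*) > threshold. Column names are illustrative assumptions.
from collections import Counter

def group_count_having(rows, key, threshold):
    """Return {group: count} for groups with count > threshold."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n > threshold}
```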

## “Group By” Operation with Pig

**SQL with SQLDF Package: 59 Seconds User Time**

```r
library(sqldf)
a <- read.csv.sql('2008.csv2',
                  sql = 'select V2, count(V1) from file group by V2',
                  header = FALSE)
print(a)
```

**Apache Pig: 47 Seconds User Time**

```pig
a = LOAD '2008.csv2' USING PigStorage(',');
b = FOREACH (GROUP a BY $1) GENERATE group, COUNT(a.$0);
dump b;
```
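Since sqldf runs its queries through SQLite under the hood, the same GROUP BY can be reproduced with Python's built-in sqlite3 module. The sketch below keeps the `file`/`V1`/`V2` naming from the sqldf call but uses a three-row toy table rather than the airline data.

```python
# The same GROUP BY query against an in-memory SQLite table
# (SQLite is also the engine behind sqldf in R).
# The toy rows here are assumptions for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE file (V1 TEXT, V2 TEXT)")
con.executemany("INSERT INTO file VALUES (?, ?)",
                [("1", "WN"), ("2", "WN"), ("3", "AA")])
result = con.execute("SELECT V2, COUNT(V1) FROM file GROUP BY V2").fetchall()
# result holds (carrier, count) pairs, one per group
```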

## Select Distinct Values with Pig

First of all, I used a SQL statement with the sqldf package in R. It took ~51 seconds of user time to select 12 rows out of 7 million.

```r
library(sqldf)
a <- read.csv.sql('2008.csv2',
                  sql = "select distinct V1, V2 from file",
                  header = FALSE)
print(a)
```

Next, I used Apache Pig running in the local mode and spent ~36 seconds getting the same 12 rows.

```pig
a = LOAD '2008.csv2' USING PigStorage(',');
b = DISTINCT (FOREACH a GENERATE $0, $1);
dump b;
```
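The DISTINCT operation, project two columns and deduplicate the resulting pairs, has a compact plain-Python counterpart. The sketch below assumes rows are tuples with the two projected fields in positions 0 and 1, and preserves first-seen order (SQLite and Pig make no such ordering promise).

```python
# SELECT DISTINCT $0, $1 in plain Python: project the first two fields
# of each row tuple and deduplicate, preserving first-seen order.
# The tuple row layout is an assumption for illustration.
def distinct_pairs(rows):
    """Return unique ($0, $1) pairs in order of first appearance."""
    return list(dict.fromkeys((r[0], r[1]) for r in rows))
```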

Although the purpose of this exercise is to learn Pig Latin through SQL statements, I am still very impressed by the efficiency of Apache Pig.

## Test Drive of PiggyBank

```pig
REGISTER /home/liuwensui/pig/lib/piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader;

a = LOAD '/home/liuwensui/Downloads/2008.csv2' USING CSVLoader();
b = FOREACH a GENERATE $8 as x;
c = LIMIT b 4;
dump c;
/*
(WN)
(WN)
(WN)
(WN)
*/

DEFINE L org.apache.pig.piggybank.evaluation.string.LOWER;
DEFINE R org.apache.pig.piggybank.evaluation.string.REPLACE;
d = FOREACH c GENERATE x, L(x), R(x, 'WN', 'WW');
dump d;
/*
(WN,wn,WW)
(WN,wn,WW)
(WN,wn,WW)
(WN,wn,WW)
*/
```

## Performance Improvement by Expression Evaluation in Pandas

```python
In [1]: import pandas as pd

In [2]: df1 = pd.read_csv('/home/liuwensui/Downloads/2008.csv')

In [3]: df1.shape
Out[3]: (7009728, 29)

In [4]: # STANDARD PANDAS SYNTAX

In [5]: %timeit -n 10 df2 = df1[(df1['Month'] == 1) & (df1['DayofMonth'] == 1)]
10 loops, best of 3: 85.1 ms per loop

In [6]: # EXPRESSION EVALUATION

In [7]: %timeit -n 10 df3 = df1.query('Month == 1 and DayofMonth == 1')
10 loops, best of 3: 44 ms per loop
```

## SAS Macro Aligning The Logit Variable to A Scorecard with Specific PDO and Base Point

```sas
%macro align_score(data = , y = , logit = , pdo = , base_point = , base_odds = ,
                   min_point = 100, max_point = 900);
***********************************************************;
* THE MACRO IS TO ALIGN A LOGIT VARIABLE TO A SCORE WITH  *;
* SPECIFIC PDO, BASE POINT, AND BASE ODDS                 *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA      : INPUT SAS DATA TABLE                       *;
*  Y         : PERFORMANCE VARIABLE WITH 0/1 VALUE        *;
*  LOGIT     : A LOGIT VARIABLE TO BE ALIGNED FROM        *;
*  PDO       : PDO OF THE SCORE ALIGNED TO                *;
*  BASE_POINT: BASE POINT OF THE SCORE ALIGNED TO         *;
*  BASE_ODDS : ODDS AT BASE POINT OF THE SCORE ALIGNED TO *;
*  MIN_POINT : LOWER END OF SCORE POINT, 100 BY DEFAULT   *;
*  MAX_POINT : UPPER END OF SCORE POINT, 900 BY DEFAULT   *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  ALIGN_FORMULA.SAS                                      *;
*  A SAS CODE WITH THE FORMULA TO ALIGN THE LOGIT FIELD   *;
*  TO A SPECIFIC SCORE TOGETHER WITH THE STATISTICAL      *;
*  SUMMARY OF ALIGN_SCORE                                 *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*";

%local b0 b1;

data _tmp1 (keep = &y &logit);
  set &data;
  where &y in (0, 1) and &logit ~= .;
run;

ods listing close;
ods output ParameterEstimates = _est1 (keep = variable estimate);
proc logistic data = _last_ desc;
  model &y = &logit;
run;
ods listing;

data _null_;
  set _last_;
  if _n_ = 1 then do;
    b = - (estimate + (log(&base_odds) - (log(2) / &pdo) * &base_point)) / (log(2) / &pdo);
    call symput('b0', put(b, 15.8));
  end;
  else do;
    b = estimate / (log(2) / &pdo);
    call symput('b1', put(b, 15.8));
  end;
run;

filename formula "ALIGN_FORMULA.SAS";
data _null_;
  file formula;
  put @3 3 * "*" " SCORE ALIGNMENT FORMULA OF %upcase(&logit) " 3 * "*" ";";
  put;
  put @3 "ALIGN_SCORE = MAX(MIN(ROUND((%trim(&b0)) - (%trim(&b1)) * %upcase(&logit), 1), &max_point), &min_point);";
  put;
run;

data _tmp2;
  set _tmp1;
  %inc formula;
run;

proc summary data = _last_ nway;
  class &y;
  output out = _tmp3(drop = _type_ _freq_)
    min(align_score) = min_scr max(align_score) = max_scr median(align_score) = mdn_scr;
run;

data _null_;
  set _last_;
  file formula mod;
  if _n_ = 1 then do;
    put @3 3 * "*" " STATISTICAL SUMMARY OF ALIGN_SCORE BY INPUT DATA: " 3 * "*" ";";
  end;
  put @3 "* WHEN %upcase(&y) = " &y ": MIN(SCORE) = " min_scr
      " MEDIAN(SCORE) = " mdn_scr " MAX(SCORE) = " max_scr " *;";
run;

proc datasets library = work nolist;
  delete _: (mt = data);
run;
quit;

***********************************************************;
* END OF THE MACRO                                        *;
***********************************************************;
%mend align_score;
```
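The alignment arithmetic inside the macro is compact enough to restate in Python for illustration. With the scaling factor `log(2) / pdo`, the macro derives an intercept `b0` and slope `b1` from the logistic estimates so that the score hits `base_point` at `base_odds` and moves by PDO points per doubling of the odds, then clamps the rounded score to `[min_point, max_point]`. In the sketch below, `beta0` and `beta1` stand in for the intercept and slope that PROC LOGISTIC would estimate; the sample values in the comments are assumptions, not fitted numbers.

```python
# Score-alignment arithmetic from the SAS macro, restated in Python.
# beta0 / beta1 play the role of the logistic intercept and slope
# estimates; any concrete values used here are illustrative assumptions.
from math import log

def align_score(logit, beta0, beta1, pdo, base_point, base_odds,
                min_point=100, max_point=900):
    factor = log(2) / pdo  # points per doubling of the odds
    # Same b0/b1 construction as the macro's two CALL SYMPUT steps.
    b0 = -(beta0 + (log(base_odds) - factor * base_point)) / factor
    b1 = beta1 / factor
    # ALIGN_SCORE = MAX(MIN(ROUND(b0 - b1 * LOGIT), max), min)
    return max(min(round(b0 - b1 * logit), max_point), min_point)
```

A quick sanity check of the PDO semantics: with `beta0 = 0`, `beta1 = 1`, and `base_odds = 1`, the score equals `base_point` at a logit of 0, and lowering the logit by `log(2)` (halving the modeled odds) raises the score by exactly `pdo` points.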