Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Row Search in Parallel

I’ve been always wondering whether the efficiency of row search can be improved if the whole data.frame is splitted into chunks and then the row search is conducted within each chunk in parallel.

In the R code below, a comparison is done between the standard row search and the parallel row search with the FOREACH package. The result is very encouraging. For 10 replications, the elapsed time of parallel search is only the fraction of the elapsed time of standard search.

load('2008.Rdata')
data2 <- split(data, 1:8)

library(rbenchmark)
library(doParallel)
registerDoParallel(cores = 8)
library(foreach)

benchmark(replications = 10, order = "elapsed",
  non_parallel = {
    test1 <- data[which(data$ArrTime == 1500 & data$Origin == 'ABE'), ]
  },
  parallel = {
    test2 <- foreach(i = data2, .combine = rbind) %dopar% i[which(i$ArrTime == 1500 & i$Origin == 'ABE'), ]
  }
)
#           test replications elapsed relative user.self sys.self user.child
# 2     parallel           10   2.680    1.000     0.319    0.762     12.078
# 1 non_parallel           10   7.474    2.789     7.339    0.139      0.000

Written by statcompute

September 28, 2014 at 5:41 pm

Posted in S+/R, Big Data

Tagged with ,

Implement HAVING Clause in SQL with Pig

SQL with SQLDF Package: 58 Seconds User Time

library(sqldf)
a <- read.csv.sql('2008.csv2', sql = 'select V2, count(V1) from file group by V2 having count(V1) > 600000', header = FALSE)
print(a)

Apache Pig: 47 Seconds User Time

a = LOAD '2008.csv2' USING PigStorage(',');  
b = FILTER (FOREACH (GROUP a BY $1) GENERATE group, COUNT(a.$0)) BY $1 > 600000;
dump b;

Written by statcompute

September 27, 2014 at 10:59 pm

Posted in Big Data, Pig Latin, SQL

Tagged with

“Group By” Operation with Pig

SQL with SQLDF Package: 59 Seconds User Time

library(sqldf)
a <- read.csv.sql('2008.csv2', sql = 'select V2, count(V1) from file group by V2';, header = FALSE)
print(a)

Apache Pig: 47 Seconds User Time

a = LOAD '2008.csv2' USING PigStorage(',');  
b = FOREACH (GROUP a BY $1) GENERATE group, COUNT(a.$0);
dump b;

Written by statcompute

September 27, 2014 at 9:23 pm

Posted in Pig Latin

Tagged with

Select Distinct Values with Pig

First of all, I used SQL statement with SQLDF package in R. It took ~51 seconds user time to select 12 rows out of 7 millions.

library(sqldf)
a <- read.csv.sql('2008.csv2', sql = "select distinct V1, V2 from file", header = FALSE)
print(a)

Next, I used Apache Pig running in the local mode and spent ~36 seconds getting the same 12 rows.

a = LOAD '2008.csv2' USING PigStorage(',');  
b = DISTINCT(FOREACH a GENERATE $0, $1);
dump b;

Although my purpose of this exercise is to learn Pig Latin through SQL statement, I am still very impressed by the efficiency of Apache Pig.

Written by statcompute

September 24, 2014 at 10:20 pm

Posted in Pig Latin

Tagged with , ,

Test Drive of PiggyBank

REGISTER /home/liuwensui/pig/lib/piggybank.jar;

DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader;

a = LOAD '/home/liuwensui/Downloads/2008.csv2' USING CSVLoader();

b = FOREACH a GENERATE $8 as x;

c = LIMIT b 4;

dump c;

/*
(WN)
(WN)
(WN)
(WN)
*/

DEFINE L org.apache.pig.piggybank.evaluation.string.LOWER;

DEFINE R org.apache.pig.piggybank.evaluation.string.REPLACE;

d = FOREACH c GENERATE x, L(x), R(x, 'WN', 'WW');

dump d;

/*
(WN,wn,WW)
(WN,wn,WW)
(WN,wn,WW)
(WN,wn,WW)
*/

Written by statcompute

September 23, 2014 at 10:27 pm

Posted in Pig Latin

Tagged with ,

Performance Improvement by Expression Evaluation in Pandas

In [1]: import pandas as pd

In [2]: df1 = pd.read_csv('/home/liuwensui/Downloads/2008.csv')

In [3]: df1.shape
Out[3]: (7009728, 29)

In [4]: # STANDARD PANDAS SYNTAX

In [5]: %timeit -n 10 df2 = df1[(df1['Month'] == 1) & (df1['DayofMonth'] == 1)]
10 loops, best of 3: 85.1 ms per loop

In [6]: # EXPRESSION EVALUATION

In [7]: %timeit -n 10 df3 = df1.query('Month == 1 and DayofMonth == 1')
10 loops, best of 3: 44 ms per loop

Written by statcompute

September 14, 2014 at 11:04 pm

Posted in PYTHON

Tagged with ,

SAS Macro Aligning The Logit Variable to A Scorecard with Specific PDO and Base Point

%macro align_score(data = , y = , logit = , pdo = , base_point = , base_odds = , min_point = 100, max_point = 900); 
***********************************************************; 
* THE MACRO IS TO ALIGN A LOGIT VARIABLE TO A SCORE WITH  *; 
* SPECIFIC PDO, BASE POINT, AND BASE ODDS                 *; 
* ======================================================= *; 
* PAMAMETERS:                                             *; 
*  DATA      : INPUT SAS DATA TABLE                       *; 
*  Y         : PERFORMANCE VARIABLE WITH 0/1 VALUE        *; 
*  LOGIT     : A LOGIT VARIABLE TO BE ALIGNED FROM        *; 
*  PDO       : PDO OF THE SCORE ALIGNED TO                *; 
*  BASE_POINT: BASE POINT OF THE SCORE ALIGNED TO         *; 
*  BASE_ODDS : ODDS AT BASE POINT OF THE SCORE ALIGNED TO *; 
*  MIN_POINT : LOWER END OF SCORE POINT, 100 BY DEFAULT   *; 
*  MAX_POINT : UPPER END OF SCORE POINT, 900 BY DEFAULT   *; 
* ======================================================= *; 
* OUTPUTS:                                                *; 
*  ALIGN_FORMULA.SAS                                      *; 
*  A SAS CODE WITH THE FORMULA TO ALIGN THE LOGIT FIELD   *; 
*  TO A SPECIFIC SCORE TOGETHER WITH THE STATISTICAL      *; 
*  SUMMARY OF ALIGN_SCORE                                 *;                                    
***********************************************************; 
 
options nocenter nonumber nodate mprint mlogic symbolgen 
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\&lt;&gt;*"; 
 
%local b0 b1; 
 
data _tmp1 (keep = &amp;y &amp;logit); 
  set &amp;data; 
  where &amp;y in (0, 1) and &amp;logit ~= .; 
run; 
 
ods listing close; 
ods output ParameterEstimates = _est1 (keep = variable estimate); 
proc logistic data = _last_ desc; 
  model &amp;y = &amp;logit; 
run; 
ods listing; 
 
data _null_; 
  set _last_; 
   
  if _n_ = 1 then do; 
    b = - (estimate + (log(&amp;base_odds) - (log(2) / &amp;pdo) * &amp;base_point)) / (log(2) / &amp;pdo); 
    call symput('b0', put(b, 15.8)); 
  end; 
  else do; 
    b = estimate / (log(2) / &amp;pdo); 
    call symput('b1', put(b, 15.8)); 
  end; 
run; 
 
filename formula "ALIGN_FORMULA.SAS"; 
 
data _null_; 
  file formula; 
 
  put @3 3 * "*" " SCORE ALIGNMENT FORMULA OF %upcase(&amp;logit) " 3 * "*" ";"; 
  put; 
  put @3 "ALIGN_SCORE = MAX(MIN(ROUND((%trim(&amp;b0)) - (%trim(&amp;b1)) * %upcase(&amp;logit), 1), &amp;max_point), &amp;min_point);"; 
  put; 
run; 
 
data _tmp2; 
  set _tmp1; 
  %inc formula; 
run; 
 
proc summary data = _last_ nway; 
  class &amp;y; 
  output out = _tmp3(drop = _type_ _freq_) 
  min(align_score) = min_scr max(align_score) = max_scr 
  median(align_score) = mdn_scr; 
run; 
 
data _null_; 
  set _last_; 
  file formula mod; 
 
  if _n_ = 1 then do; 
    put @3 3 * "*" " STATISTICAL SUMMARY OF ALIGN_SCORE BY INPUT DATA: " 3 * "*" ";"; 
  end; 
  put @3 "* WHEN %upcase(&amp;y) = " &amp;y ": MIN(SCORE) = " min_scr " MEDIAN(SCORE) = " mdn_scr " MAX(SCORE) = " max_scr "*;"; 
run; 
 
proc datasets library = work nolist; 
  delete _: (mt = data); 
run; 
quit; 
 
***********************************************************; 
*                     END OF THE MACRO                    *; 
***********************************************************;  
%mend align_score;

Written by statcompute

September 3, 2014 at 9:51 pm

Posted in Credit Risk, SAS, Scorecard

Follow

Get every new post delivered to your Inbox.

Join 72 other followers