I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

## Rmagic, A Handy Interface Bridging Python and R

Rmagic (http://ipython.org/ipython-doc/dev/config/extensions/rmagic.html) is an IPython extension that uses rpy2 under the hood and provides a convenient interface for accessing R from IPython. Compared with using rpy2 directly, the rmagic extension allows users to exchange objects between IPython and R more flexibly and to run a single R function or a whole block of R code conveniently. The extension is loaded with `%load_ext rmagic` before any `%R` magic is used.

Below is an example demonstrating a simple use case: pushing a pandas DataFrame object into R, converting it to an R data.frame, and then transferring it back into a new pandas DataFrame object.

```
In [1]: import pandas as pd

In [2]: # READ DATA INTO PANDAS DATAFRAME

In [4]: print pydf1.describe()
LEV_LT3     TAX_NDEB      COLLAT1        SIZE1        PROF2  \
count  4421.000000  4421.000000  4421.000000  4421.000000  4421.000000
mean      0.090832     0.824537     0.317354    13.510870     0.144593
std       0.193872     2.884129     0.227150     1.692520     0.110908
min       0.000000     0.000000     0.000000     7.738052     0.000016
25%       0.000000     0.349381     0.124094    12.316970     0.072123
50%       0.000000     0.566577     0.287613    13.539574     0.120344
75%       0.011689     0.789128     0.472355    14.751119     0.187515
max       0.998372   102.149483     0.995346    18.586632     1.590201

GROWTH2          AGE          LIQ        IND2A        IND3A  \
count  4421.000000  4421.000000  4421.000000  4421.000000  4421.000000
mean     13.619633    20.366433     0.202813     0.611626     0.190228
std      36.517739    14.538997     0.233256     0.487435     0.392526
min     -81.247627     6.000000     0.000000     0.000000     0.000000
25%      -3.563235    11.000000     0.034834     0.000000     0.000000
50%       6.164303    17.000000     0.108544     1.000000     0.000000
75%      21.951632    25.000000     0.291366     1.000000     0.000000
max     681.354187   210.000000     1.000182     1.000000     1.000000

IND4A        IND5A
count  4421.000000  4421.000000
mean      0.026917     0.099073
std       0.161859     0.298793
min       0.000000     0.000000
25%       0.000000     0.000000
50%       0.000000     0.000000
75%       0.000000     0.000000
max       1.000000     1.000000

In [5]: # CONVERT PANDAS DATAFRAME TO R DATA.FRAME

In [7]: col = pydf1.columns

In [8]: %R -i pydf1,col colnames(pydf1) <- unlist(col); print(is.matrix(pydf1))
[1] TRUE

In [9]: %R rdf <- data.frame(pydf1); print(is.data.frame(rdf))
[1] TRUE

In [10]: %R print(summary(rdf))
LEV_LT3           TAX_NDEB           COLLAT1           SIZE1
Min.   :0.00000   Min.   :  0.0000   Min.   :0.0000   Min.   : 7.738
1st Qu.:0.00000   1st Qu.:  0.3494   1st Qu.:0.1241   1st Qu.:12.317
Median :0.00000   Median :  0.5666   Median :0.2876   Median :13.540
Mean   :0.09083   Mean   :  0.8245   Mean   :0.3174   Mean   :13.511
3rd Qu.:0.01169   3rd Qu.:  0.7891   3rd Qu.:0.4724   3rd Qu.:14.751
Max.   :0.99837   Max.   :102.1495   Max.   :0.9953   Max.   :18.587
PROF2              GROWTH2             AGE              LIQ
Min.   :0.0000158   Min.   :-81.248   Min.   :  6.00   Min.   :0.00000
1st Qu.:0.0721233   1st Qu.: -3.563   1st Qu.: 11.00   1st Qu.:0.03483
Median :0.1203435   Median :  6.164   Median : 17.00   Median :0.10854
Mean   :0.1445929   Mean   : 13.620   Mean   : 20.37   Mean   :0.20281
3rd Qu.:0.1875148   3rd Qu.: 21.952   3rd Qu.: 25.00   3rd Qu.:0.29137
Max.   :1.5902009   Max.   :681.354   Max.   :210.00   Max.   :1.00018
IND2A            IND3A            IND4A             IND5A
Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000
1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000
Median :1.0000   Median :0.0000   Median :0.00000   Median :0.00000
Mean   :0.6116   Mean   :0.1902   Mean   :0.02692   Mean   :0.09907
3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000
Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000

In [11]: # CONVERT R DATA.FRAME BACK TO PANDAS DATAFRAME

In [12]: %R -d rdf

In [13]: pydf2 = pd.DataFrame(rdf)

In [14]: print pydf2.describe()
LEV_LT3     TAX_NDEB      COLLAT1        SIZE1        PROF2  \
count  4421.000000  4421.000000  4421.000000  4421.000000  4421.000000
mean      0.090832     0.824537     0.317354    13.510870     0.144593
std       0.193872     2.884129     0.227150     1.692520     0.110908
min       0.000000     0.000000     0.000000     7.738052     0.000016
25%       0.000000     0.349381     0.124094    12.316970     0.072123
50%       0.000000     0.566577     0.287613    13.539574     0.120344
75%       0.011689     0.789128     0.472355    14.751119     0.187515
max       0.998372   102.149483     0.995346    18.586632     1.590201

GROWTH2          AGE          LIQ        IND2A        IND3A  \
count  4421.000000  4421.000000  4421.000000  4421.000000  4421.000000
mean     13.619633    20.366433     0.202813     0.611626     0.190228
std      36.517739    14.538997     0.233256     0.487435     0.392526
min     -81.247627     6.000000     0.000000     0.000000     0.000000
25%      -3.563235    11.000000     0.034834     0.000000     0.000000
50%       6.164303    17.000000     0.108544     1.000000     0.000000
75%      21.951632    25.000000     0.291366     1.000000     0.000000
max     681.354187   210.000000     1.000182     1.000000     1.000000

IND4A        IND5A
count  4421.000000  4421.000000
mean      0.026917     0.099073
std       0.161859     0.298793
min       0.000000     0.000000
25%       0.000000     0.000000
50%       0.000000     0.000000
75%       0.000000     0.000000
max       1.000000     1.000000
```

Written by statcompute

May 31, 2013 at 10:25 pm

Posted in PYTHON, S+/R

## Import All Text Files in A Folder with Parallel Execution

Sometimes we need to import all files, e.g. *.txt, with the same data layout from a folder without knowing each file name and then combine all pieces together. The usual approach is to use the lapply() and do.call() functions to accomplish the task. However, when there are many such files and each file is large, this can be computationally expensive.

With the foreach package, we are able to split the data import task into pieces and distribute them to multiple CPUs for parallel execution, as shown in the code below.

```
setwd('../data/csdata')
files <- list.files(pattern = "[.]txt$")

library(rbenchmark)
benchmark(replications = 10, order = "user.self",
  LAPPLY = {
    # SINGLE-THREADED BASELINE: READ EACH FILE AND STACK THE PIECES
    data1 <- do.call(rbind, lapply(files, function(f) read.table(f, header = TRUE)))
  },
  FOREACH = {
    library(doParallel)
    registerDoParallel(cores = 4)
    library(foreach)
    data2 <- foreach(i = files, .combine = rbind) %dopar% read.table(i, header = TRUE)
  }
)

library(compare)
all.equal(data1, data2)
```

The output below shows that both methods generate identical data frames but that the foreach() approach is considerably more efficient than the lapply() approach, thanks to the parallelism.

```
     test replications elapsed relative user.self sys.self user.child sys.child
2 FOREACH           10   0.689    1.000     0.156    0.076      1.088     0.308
1  LAPPLY           10   1.078    1.565     1.076    0.004      0.000     0.000

Attaching package: ‘compare’

The following object is masked from ‘package:base’:

isTRUE

[1] TRUE
```
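
The same pattern, reading every *.txt file in a folder concurrently and stacking the pieces, can be sketched on the Python side with the standard library alone. The function names and the file layout (whitespace-delimited with a header row) are assumptions for illustration, not from the post.

```python
# Parallel import of all *.txt files in a folder (stdlib-only sketch).
import glob
import os
from concurrent.futures import ThreadPoolExecutor

def read_table(path):
    """Read one whitespace-delimited text file with a header row."""
    with open(path) as f:
        header = f.readline().split()
        rows = [line.split() for line in f if line.strip()]
    return header, rows

def import_folder(folder):
    """Read all *.txt files concurrently and rbind-style stack the rows."""
    files = sorted(glob.glob(os.path.join(folder, "*.txt")))
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(read_table, files))
    # keep the first header, stack all rows in file order
    header = results[0][0] if results else []
    rows = [row for _, file_rows in results for row in file_rows]
    return header, rows
```

Since reading files is I/O-bound, a thread pool is usually sufficient here; a ProcessPoolExecutor would be the choice for CPU-bound parsing.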

Written by statcompute

May 26, 2013 at 11:55 pm

Posted in S+/R

## Test Drive of Parallel Computing with R

Today, I did a test run of parallel computing with the snow and multicore packages in R and compared the parallelism against the single-threaded lapply() function.

In the test code below, a data.frame with 20 million rows is simulated in an Ubuntu VM with an 8-core CPU and 10 GB of memory. As the baseline, the lapply() function is employed to calculate the aggregation by groups. For comparison, the parLapply() function from the snow package and the mclapply() function from the multicore package are used to generate the identical aggregated data.

```
n <- 20000000
set.seed(2013)
df <- data.frame(id = sample(20, n, replace = TRUE), x = rnorm(n), y = runif(n), z = rpois(n, 1))

library(rbenchmark)
benchmark(replications = 5, order = "user.self",
  LAPPLY = {
    cat('LAPPLY...\n')
    df1 <- data.frame(lapply(split(df[-1], df[1]), colMeans))
  },
  SNOW = {
    library(snow)
    cat('SNOW...\n')
    cl <- makeCluster(8, type = "SOCK")
    df2 <- data.frame(parLapply(cl, split(df[-1], df[1]), colMeans))
    stopCluster(cl)
  },
  MULTICORE = {
    cat('MULTICORE...\n')
    library(multicore)
    df3 <- data.frame(mclapply(split(df[-1], df[1]), colMeans, mc.cores = 8))
  }
)

library(compare)
all.equal(df1, df2)
all.equal(df1, df3)
```

Below is the benchmark output. As shown, the parallel solutions, SNOW and MULTICORE, are more than three times as efficient as the baseline LAPPLY solution in terms of user time.

```
       test replications elapsed relative user.self sys.self user.child sys.child
3 MULTICORE            5 101.075    1.000    48.587    6.620    310.771     7.764
2      SNOW            5 127.715    1.264    53.192   13.685      0.012     0.740
1    LAPPLY            5 184.738    1.828   179.855    4.880      0.000     0.000

Attaching package: ‘compare’

The following object is masked from ‘package:base’:

isTRUE

[1] TRUE
[1] TRUE
```
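
The same split-apply-combine idea translates naturally to Python. Below is a stdlib-only sketch (function and variable names are mine, not from the post) that groups rows by an id column and computes per-group column means over a worker pool; for CPU-bound work at this scale, a ProcessPoolExecutor would be the closer analogue of mclapply(), threads just keep the sketch simple.

```python
# Group-wise column means in parallel, mirroring
# data.frame(lapply(split(df[-1], df[1]), colMeans)) from the R code.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def col_means(vals):
    """Column-wise means of a list of equal-length numeric tuples."""
    n = len(vals)
    return tuple(sum(col) / n for col in zip(*vals))

def group_means(rows, key_index=0):
    """Split rows by one column, then map the per-group reduction
    over a worker pool, like mclapply() over split(df[-1], df[1])."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[:key_index] + row[key_index + 1:])
    keys = sorted(groups)
    with ThreadPoolExecutor() as pool:
        means = list(pool.map(col_means, (groups[k] for k in keys)))
    return dict(zip(keys, means))
```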

In order to illustrate the CPU usage, multiple screenshots have also been taken to show the difference between parallelism and single-thread.

In the first screenshot, only 1 out of 8 CPUs runs at 100% with the lapply() function while the remaining 7 sit idle.

In the second screenshot, all 8 CPUs run at 100% with the parLapply() function in the snow package.

In the third screenshot, all 8 CPUs again run at 100% with the mclapply() function in the multicore package.

Written by statcompute

May 25, 2013 at 11:40 pm

Posted in S+/R

## Conversion between Factor and Dummies in R

```
data(iris)
str(iris)
# OUTPUT:
# 'data.frame': 150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

### CONVERT THE FACTOR TO DUMMIES ###
library(caret)
dummies <- predict(dummyVars(~ Species, data = iris), newdata = iris)
# OUTPUT:
#   Species.setosa Species.versicolor Species.virginica
# 1              1                  0                 0
# 2              1                  0                 0
# 3              1                  0                 0

### CONVERT DUMMIES TO THE FACTOR ###
header <- unlist(strsplit(colnames(dummies), '[.]'))[2 * (1:ncol(dummies))]
species <- factor(dummies %*% 1:ncol(dummies), labels = header)
str(species)
# OUTPUT:
#  Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

### COMPARE THE ORIGINAL AND THE CALCULATED FACTORS ###
library(compare)
all.equal(species, iris$Species)
# OUTPUT:
# [1] TRUE
```
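
For comparison, here is a library-free Python sketch of the same round trip (function names are illustrative): a categorical list is expanded into 0/1 dummy columns, and the categories are recovered from the position of the 1 in each row, much like the `dummies %*% 1:ncol(dummies)` trick in the R code above.

```python
def to_dummies(values):
    """Expand a categorical sequence into sorted levels and 0/1 rows."""
    levels = sorted(set(values))
    rows = [[1 if v == lev else 0 for lev in levels] for v in values]
    return levels, rows

def from_dummies(levels, rows):
    """Recover the categories: each row holds exactly one 1 whose
    position indexes the corresponding level."""
    return [levels[row.index(1)] for row in rows]
```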

Written by statcompute

May 18, 2013 at 10:25 pm

Posted in S+/R

## A Prototype of Monotonic Binning Algorithm with R

I’ve been asked many times whether I have a piece of R code implementing the monotonic binning algorithm, similar to the ones that I developed with SAS (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) and with Python (https://statcompute.wordpress.com/2012/12/08/monotonic-binning-with-python). Today, I finally had time to draft a quick prototype in about 20 lines of R code, which is admittedly barely usable without further polish. Still, it is a little surprising to me how efficient R can be for algorithm prototyping, much sleeker than a SAS macro.

```
library(sas7bdat)
library(Hmisc)

bin <- function(x, y){
  # START FROM AT MOST 50 BINS AND REDUCE THE COUNT UNTIL THE
  # BIN-LEVEL MEANS OF X AND Y ARE PERFECTLY RANK-CORRELATED
  n <- min(50, length(unique(x)))
  repeat {
    n   <- n - 1
    d1  <- data.frame(x, y, bin = cut2(x, g = n))
    d2  <- aggregate(d1[-3], d1[3], mean)
    cor <- cor(d2[-1], method = "spearman")
    if(abs(cor[1, 2]) == 1) break
  }
  d2[2] <- NULL
  colnames(d2) <- c('LEVEL', 'RATE')
  head <- paste(toupper(substitute(y)), " RATE by ", toupper(substitute(x)), sep = '')
  cat("+-", rep("-", nchar(head)), "-+\n", sep = '')
  cat("| ", head, ' |\n', sep = '')
  cat("+-", rep("-", nchar(head)), "-+\n", sep = '')
  print(d2)
  cat("\n")
}

# DATA IMPORTED EARLIER, E.G. WITH read.sas7bdat()
attach(data)
bin(bureau_score, bad)
bin(age_oldest_tr, bad)
bin(tot_income, bad)
bin(tot_tr, bad)
```

R output:

```
+--------------------------+
| BAD RATE by BUREAU_SCORE |
+--------------------------+
LEVEL       RATE
1  [443,618) 0.44639376
2  [618,643) 0.38446602
3  [643,658) 0.31835938
4  [658,673) 0.23819302
5  [673,686) 0.19838057
6  [686,702) 0.17850288
7  [702,715) 0.14168378
8  [715,731) 0.09815951
9  [731,752) 0.07212476
10 [752,776) 0.05487805
11 [776,848] 0.02605210

+---------------------------+
| BAD RATE by AGE_OLDEST_TR |
+---------------------------+
LEVEL       RATE
1  [  1, 34) 0.33333333
2  [ 34, 62) 0.30560928
3  [ 62, 87) 0.25145068
4  [ 87,113) 0.23346304
5  [113,130) 0.21616162
6  [130,149) 0.20036101
7  [149,168) 0.19361702
8  [168,198) 0.15530303
9  [198,245) 0.11111111
10 [245,308) 0.10700389
11 [308,588] 0.08730159

+------------------------+
| BAD RATE by TOT_INCOME |
+------------------------+
LEVEL      RATE
1 [   0,   2570) 0.2498715
2 [2570,   4510) 0.2034068
3 [4510,8147167] 0.1602327

+--------------------+
| BAD RATE by TOT_TR |
+--------------------+
LEVEL      RATE
1 [ 0,12) 0.2672370
2 [12,22) 0.1827676
3 [22,77] 0.1422764
```

Written by statcompute

May 4, 2013 at 4:33 pm

## A SAS Macro for Scorecard Evaluation with Weights

On 09/28/2012, I posted a SAS macro evaluating scorecard performance, e.g. KS / AUC statistics (https://statcompute.wordpress.com/2012/09/28/a-sas-macro-for-scorecard-performance-evaluation). However, that macro is not generic enough to handle cases with a weighting variable. In a recent project that I am working on, a weight variable is attached to each credit applicant as a result of reject inference. Therefore, there is no excuse for me to keep neglecting the need for another SAS macro that can handle a weight variable with any positive values in the scorecard evaluation. Below is a quick draft of the macro. You might have to tweak it a little to suit your needs in production.

```
%macro wt_auc(data = , score = , y = , weight = );
***********************************************************;
* THE MACRO IS TO EVALUATE THE SEPARATION POWER OF A      *;
* SCORECARD WITH WEIGHTS                                  *;
* ------------------------------------------------------- *;
* PARAMETERS:                                             *;
*  DATA  : INPUT DATASET                                  *;
*  SCORE : SCORECARD VARIABLE                             *;
*  Y     : RESPONSE VARIABLE IN (0, 1)                    *;
*  WEIGHT: WEIGHT VARIABLE WITH POSITIVE VALUES           *;
* ------------------------------------------------------- *;
* OUTPUTS:                                                *;
*  A SUMMARY TABLE WITH KS AND AUC STATISTICS             *;
* ------------------------------------------------------- *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;
options nocenter nonumber nodate mprint mlogic symbolgen
orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

data _tmp1 (keep = &score &y &weight);
set &data;
where &score ~= . and &y in (0, 1) and &weight >= 0;
run;

*** CAPTURE THE DIRECTION OF SCORE ***;
ods listing close;
ods output spearmancorr = _cor;
proc corr data = _tmp1 spearman;
var &y;
with &score;
run;
ods listing;

data _null_;
set _cor;
if &y >= 0 then do;
call symput('desc', 'descending');
end;
else do;
call symput('desc', ' ');
end;
run;

proc sql noprint;
create table
_tmp2 as
select
&score                                         as _scr,
sum(&y)                                        as _bcnt,
count(*)                                       as _cnt,
sum(case when &y = 1 then &weight else 0 end)  as _bwt,
sum(case when &y = 0 then &weight else 0 end)  as _gwt
from
_tmp1
group by
&score;

select
sum(_bwt) into :bsum
from
_tmp2;

select
sum(_gwt) into :gsum
from
_tmp2;

select
sum(_cnt) into :cnt
from
_tmp2;
quit;
%put &cnt;

proc sort data = _tmp2;
by &desc _scr;
run;

data _tmp3;
set _tmp2;
by &desc _scr;
retain _gcum _bcum _cntcum;
_gcum + _gwt;
_bcum + _bwt;
_cntcum + _cnt;
_gpct = _gcum / &gsum;
_bpct = _bcum / &bsum;
_ks   = abs(_gpct - _bpct) * 100;
_rank = int(_cntcum / ceil(&cnt / 10)) + 1;
run;

proc sort data = _tmp3 sortsize = max;
by _gpct _bpct;
run;

data _tmp4;
set _tmp3;
by _gpct _bpct;
if last._gpct then do;
_idx + 1;
output;
end;
run;

proc sql noprint;
create table
_tmp5 as
select
a._gcum as gcum,
(b._gpct - a._gpct) * (b._bpct + a._bpct) / 2 as dx
from
_tmp4 as a, _tmp4 as b
where
a._idx + 1 = b._idx;

select
sum(dx) format = 15.8 into :AUC
from
_tmp5;

select
max(_ks) format = 15.8 into :KS_STAT
from
_tmp3;

select
_scr format = 6.2 into :KS_SCORE
from
_tmp3
where
_ks = (select max(_ks) from _tmp3);

create table
_tmp6 as
select
_rank                       as rank,
min(_scr)                   as min_scr,
max(_scr)                   as max_scr,
sum(_cnt)                   as cnt,
sum(_gwt + _bwt)            as wt,
sum(_gwt)                   as gwt,
sum(_bwt)                   as bwt,
sum(_bwt) / calculated wt   as bad_rate
from
_tmp3
group by
_rank;
quit;

proc report data = _last_ spacing = 1 split = "/" headline nowd;
column("GOOD BAD SEPARATION REPORT FOR %upcase(%trim(&score)) IN DATA %upcase(%trim(&data))/
MAXIMUM KS = %trim(&ks_stat) AT SCORE POINT %trim(&ks_score) and AUC STATISTICS = %trim(&auc)/ /"
rank min_scr max_scr cnt wt gwt bwt bad_rate);
define rank       / noprint order order = data;
define min_scr    / "MIN/SCORE"             width = 10 format = 9.2        analysis min center;
define max_scr    / "MAX/SCORE"             width = 10 format = 9.2        analysis max center;
define cnt        / "RAW/COUNT"             width = 10 format = comma9.    analysis sum;
define wt         / "WEIGHTED/SUM"          width = 15 format = comma14.2  analysis sum;
define gwt        / "WEIGHTED/GOODS"        width = 15 format = comma14.2  analysis sum;
define bwt        / "WEIGHTED/BADS"         width = 15 format = comma14.2  analysis sum;
define bad_rate   / "BAD/RATE"              width = 10 format = percent9.2 order center;
rbreak after / summarize dol skip;
run;

proc datasets library = work nolist;
delete _tmp: / memtype = data;
run;
quit;

***********************************************************;
*                     END OF THE MACRO                    *;
***********************************************************;
%mend wt_auc;
```
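
To make the logic portable, here is a stdlib-only Python sketch of the same weighted KS / AUC computation (names are illustrative, and the score direction is assumed such that bads concentrate at the low end, whereas the macro detects the direction with PROC CORR): good and bad weights are aggregated by score, the weighted cumulative distributions are accumulated, KS is the largest gap between them, and AUC is the trapezoid-rule area of the cumulative bad curve against the cumulative good curve.

```python
# Weighted KS (in percent) and AUC, mirroring the _tmp2/_tmp3/_tmp5
# steps of the SAS macro above.
from collections import defaultdict

def weighted_ks_auc(scores, labels, weights):
    good, bad = defaultdict(float), defaultdict(float)
    for s, y, w in zip(scores, labels, weights):
        (bad if y == 1 else good)[s] += w
    gsum, bsum = sum(good.values()), sum(bad.values())
    ks = auc = 0.0
    gp = bp = 0.0  # cumulative weighted good / bad percentages
    for s in sorted(set(scores)):
        gp2, bp2 = gp + good[s] / gsum, bp + bad[s] / bsum
        ks = max(ks, abs(gp2 - bp2))
        auc += (gp2 - gp) * (bp2 + bp) / 2.0  # trapezoid slice
        gp, bp = gp2, bp2
    return ks * 100, auc
```

With perfectly separating scores the sketch returns KS = 100 and AUC = 1, the theoretical maxima.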

In the first demo below, a weight variable with fractional values is tested.

```
*** TEST CASE OF FRACTIONAL WEIGHTS ***;
data one;
set data.accepts;
weight = ranuni(1);
run;

%wt_auc(data = one, score = bureau_score, y = bad, weight = weight);
/*
GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA ONE
MAXIMUM KS = 34.89711721 AT SCORE POINT 678.00 and AUC STATISTICS = 0.73521009

MIN        MAX            RAW        WEIGHTED        WEIGHTED        WEIGHTED    BAD
SCORE      SCORE         COUNT             SUM           GOODS            BADS    RATE
-------------------------------------------------------------------------------------------
443.00     619.00         539          276.29          153.16          123.13   44.56%
620.00     644.00         551          273.89          175.00           98.89   36.11%
645.00     660.00         544          263.06          176.88           86.18   32.76%
661.00     676.00         555          277.26          219.88           57.38   20.70%
677.00     692.00         572          287.45          230.41           57.04   19.84%
693.00     707.00         510          251.51          208.25           43.26   17.20%
708.00     724.00         576          276.31          243.89           32.42   11.73%
725.00     746.00         566          285.53          262.73           22.80    7.98%
747.00     772.00         563          285.58          268.95           16.62    5.82%
773.00     848.00         546          272.40          264.34            8.06    2.96%
========== ========== ========== =============== =============== ===============
443.00     848.00       5,522        2,749.28        2,203.49          545.79
*/
```

In the second demo, a weight variable with positive integers is also tested.

```
*** TEST CASE OF INTEGER WEIGHTS ***;
data two;
set data.accepts;
weight = rand("poisson", 20);
run;

%wt_auc(data = two, score = bureau_score, y = bad, weight = weight);
/*
GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA TWO
MAXIMUM KS = 35.58884479 AT SCORE POINT 679.00 and AUC STATISTICS = 0.73725030

MIN        MAX            RAW        WEIGHTED        WEIGHTED        WEIGHTED    BAD
SCORE      SCORE         COUNT             SUM           GOODS            BADS    RATE
-------------------------------------------------------------------------------------------
443.00     619.00         539       10,753.00        6,023.00        4,730.00   43.99%
620.00     644.00         551       11,019.00        6,897.00        4,122.00   37.41%
645.00     660.00         544       10,917.00        7,479.00        3,438.00   31.49%
661.00     676.00         555       11,168.00        8,664.00        2,504.00   22.42%
677.00     692.00         572       11,525.00        9,283.00        2,242.00   19.45%
693.00     707.00         510       10,226.00        8,594.00        1,632.00   15.96%
708.00     724.00         576       11,497.00       10,117.00        1,380.00   12.00%
725.00     746.00         566       11,331.00       10,453.00          878.00    7.75%
747.00     772.00         563       11,282.00       10,636.00          646.00    5.73%
773.00     848.00         546       10,893.00       10,598.00          295.00    2.71%
========== ========== ========== =============== =============== ===============
443.00     848.00       5,522      110,611.00       88,744.00       21,867.00
*/
```

As shown, the macro works well in both cases. Please feel free to let me know if it helps in your cases.

Written by statcompute

May 4, 2013 at 12:22 pm