I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

## Joining DataFrame with pandas

Driven by my own curiosity, I also ran a test joining two DataFrames with the Python pandas package and compared it with R in terms of efficiency.

The pandas package provides two ways to join two DataFrames: the pandas.merge() function and the pandas.DataFrame.join() method. Based on the results shown below, the two seem very comparable in terms of speed: both take roughly twice as long as the R data.table package but are a lot faster than the other three methods, i.e. merge(), sqldf(), and the ff package.

```In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import random as rd

In [4]: rd.seed(2013)

In [5]: n = 1000000

In [6]: ldf = pd.DataFrame({"id1": rd.sample(range(n), n),
...:                     "id2": np.random.randint(0, n / 100, n),
...:                     "x1" : np.random.normal(size = n),
...:                     "x2" : np.random.uniform(size = n)})

In [7]: rdf = pd.DataFrame({"id1": rd.sample(range(n), n),
...:                     "id2": np.random.randint(0, n / 100, n),
...:                     "y1" : np.random.normal(size = n),
...:                     "y2" : np.random.uniform(size = n)})

Out[8]:
id1   id2        x1        x2
0  425177  5902  0.445512  0.393837
1   51906  6066  0.336391  0.911894
2  401789  6609  0.347967  0.719724

Out[9]:
id1   id2        y1        y2
0  724470  4559 -0.674539  0.767925
1  196104  8637  0.001924  0.246559
2  988955  2348 -0.254853  0.258329

In [10]: # METHOD 1: PANDAS.MERGE() FUNCTION

In [11]: %timeit -n10 -r3 pd.merge(ldf, rdf, how = 'inner', left_on = ['id1', 'id2'], right_on = ['id1', 'id2'])
10 loops, best of 3: 3.44 s per loop

In [12]: # METHOD 2: PANDAS.DATAFRAME.JOIN() METHOD

In [13]: %timeit -n10 -r3 ldf.join(rdf.set_index(['id1', 'id2']), on = ['id1', 'id2'], how = 'inner')
10 loops, best of 3: 3.71 s per loop
```
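As a quick sanity check (a toy example of my own, not part of the original benchmark), the two approaches can be verified to return identical inner-join results on a composite key:

```python
import pandas as pd

# Two tiny frames sharing the composite key (id1, id2)
ldf = pd.DataFrame({"id1": [1, 2, 3], "id2": [10, 20, 30], "x1": [0.1, 0.2, 0.3]})
rdf = pd.DataFrame({"id1": [2, 3, 4], "id2": [20, 30, 40], "y1": [1.0, 2.0, 3.0]})

# Method 1: pd.merge() on the key columns directly
m1 = pd.merge(ldf, rdf, how="inner", on=["id1", "id2"])

# Method 2: DataFrame.join() against a right frame indexed by the key
m2 = ldf.join(rdf.set_index(["id1", "id2"]), on=["id1", "id2"], how="inner")

# Both keep only the two matching key pairs and the same columns
assert m1.reset_index(drop=True).equals(m2.reset_index(drop=True))
print(m1)
```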

Written by statcompute

January 31, 2013 at 12:16 am

Posted in PYTHON


## Another Benchmark for Joining Two Data Frames

In my post yesterday comparing the efficiency of joining two data frames, I overlooked the computing cost of converting data.frames to data.tables / ff data objects. Today, I ran the test again, this time accounting for library loading and data conversion. With 10 replications in the rbenchmark package, the joining method with data.table is still almost 10 times faster than the others in terms of user time. Although the ff package is claimed to handle large-size data, its efficiency seems questionable.

```n <- 1000000
set.seed(2013)
ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))

library(rbenchmark)
benchmark(replications = 10, order = "user.self",
# GENERIC MERGE() IN BASE PACKAGE
merge = merge(ldf, rdf, by = c("id1", "id2")),
# DATA.TABLE PACKAGE
datatable = {
ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
merge(ldt, rdt, by = c("id1", "id2"))
},
# FF PACKAGE
ff = {
lff <- ff::as.ffdf(ldf)
rff <- ff::as.ffdf(rdf)
merge(lff, rff, by = c("id1", "id2"))
},
# SQLDF PACKAGE
sqldf = sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
"select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2"))
)

#        test replications elapsed relative user.self sys.self user.child
# 2 datatable           10  17.923    1.000    16.605    1.432          0
# 4     sqldf           10 105.002    5.859   102.294    3.345          0
# 1     merge           10 131.279    7.325   119.139   13.049          0
# 3        ff           10 187.150   10.442   154.670   33.758          0
```

Written by statcompute

January 29, 2013 at 11:30 pm

Posted in S+/R


## Efficiency in Joining Two Data Frames

In R, there are multiple ways to merge two data frames. However, there can be a huge disparity among them in terms of efficiency. Therefore, it is worthwhile to test the performance of different methods and choose the most efficient approach for real-world work.

For smaller data frames with 1,000 rows, all six methods shown below work pretty well, except that the approach with sql.select() is significantly slower than the rest. The generic merge() function in the base package is a very natural choice, with no overhead of loading additional libraries or converting data frames. sqldf() is also attractive in that it might be the most user-friendly function, with a very intuitive syntax.

```> n <- 1000
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
>
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
user  system elapsed
0.032   0.012   0.064
>
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
user  system elapsed
0.028   0.000   0.044
>
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
user  system elapsed
0.044   0.004   0.096
>
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
+                                     "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
user  system elapsed
0.168   0.008   0.332
>
> # METHOD 5: PLYR
> system.time(join5 <- plyr::join(ldf, rdf, by = c("id1", "id2"), type = "inner"))
user  system elapsed
0.088   0.020   0.152
>
> # METHOD 6: SQL.SELECT
> system.time(join6 <- sql.select("select * from ldf inner join rdf on (`ldf$id1 == rdf$id1 & ldf$id2 == rdf$id2`)"))
user  system elapsed
53.775  19.725  73.813
```

However, when it comes to mid-size data frames with 1,000,000 rows, the story changes. First of all, out of the six methods shown above, the last two fail outright due to insufficient memory in my 32-bit Ubuntu virtual machine. In this case, the data.table package shows a significant advantage after the two data.frames are converted to data.tables. In addition, it is interesting that although the ff and sqldf packages are slower than the merge() function for the small data set with 1,000 rows, both of them are slightly faster for the data with 1,000,000 rows.

```> n <- 1000 ^ 2
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
>
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
user  system elapsed
55.223  12.437  68.054
>
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
user  system elapsed
0.484   0.008   0.492
>
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
user  system elapsed
49.811  13.821  64.004
>
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
+                                     "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
user  system elapsed
40.418   1.268  42.076
```

Written by statcompute

January 29, 2013 at 12:08 am

Posted in S+/R


## Cointegration Between Time Series

In the stat workshop supporting loss forecasting, my analysts and I are frequently asked to quantify the “correlation” between time series. In the summary below, I will briefly describe a statistical method other than correlation, namely cointegration, to characterize the relationship between time series.

In empirical finance, it is a popular practice for financial practitioners to use correlation to describe the relationship between multiple time series. However, this approach has been criticized in that a relationship might be wrongly inferred due to the existence of other latent causal factors. In this case, cointegration, proposed by Engle and Granger (1987), becomes an alternative way to characterize the correlated nature of time series.

In layman’s terms, cointegration describes whether two or more time series move with a common trend. In the statistical definition, assuming two time series X_t and Y_t are each integrated of order one, i.e. I(1), if a linear combination of X_t and Y_t, e.g. Z_t = X_t + B * Y_t, is stationary, i.e. I(0), then the two series X_t and Y_t are said to be cointegrated. Since the idea of cointegration is concerned with a co-movement / long-term equilibrium among multiple time series, it is also used to test the Efficient Market Hypothesis (EMH) in econometrics.

In cointegration analysis, most practices fall into two major categories: either the minimization of certain variances or the maximization of certain correlations. For instance, the single-equation approach, such as the one suggested by Engle and Granger (1987), looks for the linear combination of X_t and Y_t with minimum variance and therefore belongs to the first category. On the other hand, the reduced-rank, system-based approach, such as the one proposed by Johansen (1988), belongs to the second category in that it looks for the linear combination of X_t and Y_t with maximum correlation.

Following the logic outlined by Engle and Granger, it is straightforward to formulate an augmented Dickey-Fuller (ADF) test for the cointegration analysis of two time series, although other unit-root tests such as Phillips–Perron or Durbin–Watson would also suffice. Given X_t and Y_t integrated of order one, the first step is to estimate a simple linear regression Y_t = a + B * X_t + e_t, in which e_t is the residual term. Afterwards, the ADF test is used to check for a unit root in e_t. If the unit-root hypothesis for e_t is rejected, then e_t is I(0) and therefore stationary, implying that X_t and Y_t are cointegrated. Otherwise, cointegration between X_t and Y_t cannot be concluded. This approach is attractive in that it is extremely easy to implement and understand. However, it is only appropriate for a system with two time series and one possible cointegrating relationship.
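To make the two-step logic concrete, below is a small pure-Python simulation of my own, not part of the original summary. It generates two cointegrated I(1) series, runs the first-step OLS regression, and uses a crude variance comparison as a stand-in for the second-step ADF test on the residual (a real analysis would run a proper unit-root test on e_t instead):

```python
import random

random.seed(2013)
n = 2000

# X_t: a pure random walk, hence I(1)
x = [0.0]
for _ in range(n - 1):
    x.append(x[-1] + random.gauss(0, 1))

# Y_t = a + B * X_t + stationary noise, so X_t and Y_t are cointegrated
a_true, b_true = 2.0, 0.5
y = [a_true + b_true * xi + random.gauss(0, 1) for xi in x]

# Step 1: OLS regression of Y_t on X_t
mx = sum(x) / n
my = sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b_hat = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a_hat = my - b_hat * mx

# Step 2 (crude stand-in for the ADF test): the residual e_t of a
# cointegrated pair should stay bounded while X_t itself wanders,
# so var(e) should be tiny relative to var(X)
e = [yi - a_hat - b_hat * xi for xi, yi in zip(x, y)]
var_e = sum(ei ** 2 for ei in e) / n
var_x = sxx / n
print(round(b_hat, 3), round(var_e, 3), round(var_x, 1))
```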

The Johansen test for cointegration is a maximum likelihood estimation procedure based on the Vector Autoregressive (VAR) model that allows for dynamic interactions among two or more series and is therefore more general than the previous approach. Consider a VAR model of order p such that Y_t = A_1 * Y_t-1 + … + A_p * Y_t-p + e_t, where Y_t is a vector of variables integrated of order one and e_t is a vector of innovations. Without loss of generality, the VAR can be re-written as delta_Y_t = PI * Y_t-1 + SUM[GAMMA_i * delta_Y_t-i] + e_t. The whole idea of the Johansen test is to decompose PI into two n by r matrices, α and β, such that PI = α * β′ and β′ * Y_t is stationary. r is the number of cointegrating relations (the cointegrating rank), and each column of β is a cointegrating vector. A likelihood ratio test can be formulated to test the null hypothesis of r cointegrating vectors against the alternative hypothesis of n cointegrating vectors. While Johansen’s method is a more powerful test for cointegration, its drawback is a more complicated implementation and interpretation.

Written by statcompute

January 13, 2013 at 2:02 pm

Posted in Econometrics


## PART – A Rule-Learning Algorithm

```> require('RWeka')
> require('pROC')
>
> # SEPARATE DATA INTO TRAINING AND TESTING SETS
> df2 <- df1[df1$CARDHLDR == 1, 2:12]
> set.seed(2013)
> rows <- sample(1:nrow(df2), nrow(df2) - 1000)
> set1 <- df2[rows, ]
> set2 <- df2[-rows, ]
>
> # BUILD A PART RULE MODEL
> mdl1 <- PART(factor(BAD) ~., data = set1)
> print(mdl1)
PART decision list
------------------

EXP_INC > 0.000774 AND
AGE > 21.833334 AND
INCOME > 2100 AND
MAJORDRG <= 0 AND
OWNRENT > 0 AND
MINORDRG <= 1: 0 (2564.0/103.0)

AGE > 21.25 AND
EXP_INC > 0.000774 AND
INCPER > 17010 AND
INCOME > 1774.583333 AND
MINORDRG <= 0: 0 (2278.0/129.0)

AGE > 20.75 AND
EXP_INC > 0.016071 AND
OWNRENT > 0 AND
SELFEMPL > 0 AND
EXP_INC <= 0.233759 AND
MINORDRG <= 1: 0 (56.0)

AGE > 20.75 AND
EXP_INC > 0.016071 AND
SELFEMPL <= 0 AND
OWNRENT > 0: 0 (1123.0/130.0)

OWNRENT <= 0 AND
AGE > 20.75 AND
MINORDRG > 0 AND

OWNRENT <= 0 AND
AGE > 20.75 AND

INCOME > 1423 AND
OWNRENT <= 0 AND
MINORDRG <= 1 AND
SELFEMPL <= 0 AND
MINORDRG <= 0: 0 (943.0/124.0)

SELFEMPL > 0 AND
MAJORDRG <= 0 AND

SELFEMPL > 0 AND
MAJORDRG <= 1 AND
MAJORDRG <= 0 AND
MINORDRG <= 0 AND
INCOME > 2708.333333: 0 (17.0)

SELFEMPL > 0 AND
MAJORDRG <= 1 AND
OWNRENT <= 0 AND
MINORDRG <= 0 AND
INCPER <= 8400: 0 (13.0)

SELFEMPL <= 0 AND
OWNRENT > 0 AND
MINORDRG <= 0 AND
MAJORDRG <= 0: 0 (107.0/15.0)

OWNRENT <= 0 AND
MINORDRG > 0 AND
MINORDRG <= 1 AND
MAJORDRG <= 1 AND
MAJORDRG <= 0 AND
SELFEMPL <= 0: 0 (87.0/13.0)

OWNRENT <= 0 AND
SELFEMPL <= 0 AND
MAJORDRG <= 0 AND
MINORDRG <= 1: 0 (373.0/100.0)

MAJORDRG > 0 AND
MINORDRG > 0 AND
MAJORDRG <= 1 AND
MINORDRG <= 1: 0 (29.0)

SELFEMPL <= 0 AND
OWNRENT > 0 AND
MAJORDRG <= 0: 0 (199.0/57.0)

OWNRENT <= 0 AND
SELFEMPL <= 0: 0 (84.0/24.0)

MAJORDRG > 1: 0 (17.0/3.0)

MAJORDRG > 0: 0 (10.0)

MAJORDRG <= 0 AND
OWNRENT <= 0: 0 (29.0/7.0)

OWNRENT > 0 AND
SELFEMPL > 0 AND
EXP_INC <= 0.218654 AND
MINORDRG <= 2 AND
MINORDRG <= 1: 0 (8.0/1.0)

OWNRENT > 0 AND
INCOME <= 2041.666667 AND
MAJORDRG > 0 AND

OWNRENT > 0 AND
AGE > 33.416668 AND
SELFEMPL > 0: 0 (10.0/1.0)

OWNRENT > 0 AND
SELFEMPL <= 0 AND
MINORDRG <= 1 AND
AGE > 33.5 AND
EXP_INC > 0.006737: 0 (6.0)

EXP_INC > 0.001179: 1 (16.0/1.0)

: 0 (3.0)

Number of Rules  : 	25

> pred1 <- data.frame(prob = predict(mdl1, newdata = set2, type = 'probability')[, 2])
> # ROC FOR TESTING SET
> roc1 <- roc(set2$BAD, pred1$prob)
> print(roc1)

Call:
roc.default(response = set2$BAD, predictor = pred1$prob)

Data: pred1$prob in 905 controls (set2$BAD 0) < 95 cases (set2$BAD 1).
Area under the curve: 0.6794
>
> # BUILD A LOGISTIC REGRESSION
> mdl2 <- Logistic(factor(BAD) ~., data = set1)
> print(mdl2)
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Class
Variable           0
====================
AGE           0.0112
MAJORDRG     -0.2312
MINORDRG     -0.1991
OWNRENT       0.2244
INCOME        0.0004
SELFEMPL     -0.1206
INCPER             0
EXP_INC       0.4472
Intercept     0.7965

Odds Ratios...
Class
Variable           0
====================
AGE           1.0113
MAJORDRG      0.7936
MINORDRG      0.8195
OWNRENT       1.2516
INCOME        1.0004
SELFEMPL      0.8864
INCPER             1
EXP_INC       1.5639

> pred2 <- data.frame(prob = predict(mdl2, newdata = set2, type = 'probability')[, 2])
> # ROC FOR TESTING SET
> roc2 <- roc(set2$BAD, pred2$prob)
> print(roc2)

Call:
roc.default(response = set2$BAD, predictor = pred2$prob)

Data: pred2$prob in 905 controls (set2$BAD 0) < 95 cases (set2$BAD 1).
Area under the curve: 0.6529
>
> # COMPARE TWO ROCS
> roc.test(roc1, roc2)

DeLong's test for two correlated ROC curves

data:  roc1 and roc2
Z = 1.0344, p-value = 0.301
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.6793894   0.6528875
```

Written by statcompute

January 12, 2013 at 12:16 am

## Oct2Py and GNU Octave

GNU Octave (www.gnu.org/software/octave) is a matrix-based computing language largely compatible with Matlab, and it has been gaining popularity in machine learning since Dr. Ng’s open course (www.coursera.org/course/ml). It is not a general-purpose programming language and is primarily designed for numerical computation. As a result, it is usually not convenient to do dirty work such as data munging within octave. In this case, python becomes a good complement.

Oct2Py is a python interface to octave, which allows users to run octave commands and m-files dynamically within python and to exchange data between two computing systems seamlessly.

In the example below, I will show how to do simple data wrangling with the oct2py package in python, convert a python numpy array to an octave binary *.mat file, load the data in the *.mat file into octave, and then estimate a logistic regression through the IRLS (iteratively reweighted least squares) method. As shown in the octave session below, the vectorized implementation makes octave very efficient for prototyping machine learning algorithms, such as logistic regression in this case.

PYTHON SESSION

```In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import oct2py as op

In [4]: # READ DATA FROM CSV

In [6]: # SCRUB RAW DATA

In [7]: df2 = df.ix[df.CARDHLDR == 1, ['AGE', 'ADEPCNT', 'MAJORDRG', 'MINORDRG', 'INCOME', 'OWNRENT', 'BAD']]

In [8]: # CONVERT DATAFRAME TO NUMPY ARRAY

In [9]: arr = np.array(df2, dtype = np.double)

In [10]: # INVOKE AN OCTAVE SESSION

In [11]: octave = op.Oct2Py()

In [12]: # PUT NUMPY ARRAY INTO OCTAVE WORKSPACE

In [13]: octave.put('m', arr)

In [14]: print octave.run('whos')
Variables in the current scope:

Attr Name        Size                     Bytes  Class
==== ====        ====                     =====  =====
ans         1x70                       763  cell
m       10499x7                     587944  double

Total is 73563 elements using 588707 bytes

In [15]: # SAVE MAT DATA TO BE USED IN OCTAVE

In [16]: octave.run('save data.mat m')
Out[16]: ''
```

OCTAVE SESSION

```octave:1> % LOAD DATA
octave:1> load data.mat
octave:2> whos
Variables in the current scope:

Attr Name        Size                     Bytes  Class
==== ====        ====                     =====  =====
ans         1x70                       763  cell
m       10499x7                     587944  double

Total is 73563 elements using 588707 bytes

octave:3> x = [ones(size(m, 1), 1) m(:, 1:6)];
octave:4> y = m(:, 7);
octave:5> % SET INITIAL VALUES OF PARAMETERS
octave:5> b = zeros(size(x, 2), 1);
octave:6> % MAX ITERATIONS
octave:6> maxiter = 100;
octave:7> % SOLVE FOR BETA THROUGH IRLS
octave:7> for i = 1:maxiter
>   xb = x * b;
>   p = 1 ./ (1 + exp(-xb));
>   w = diag(p .* (1 - p));
>   z = x * b + w \ (y - p);
>   b = (x' * w * x) \ (x' * w) * z;
>   new_ll = y' * log(p) + (1 - y)' * log(1 - p);
>   disp(sprintf('log likelihood for iteration %0.0f: %0.8f', i, new_ll));
>   if i == 1
>     old_ll = new_ll;
>   elseif new_ll > old_ll
>     old_ll = new_ll;
>   else
>     break
>   endif
> end
log likelihood for iteration 1: -7277.35224870
log likelihood for iteration 2: -3460.32494800
log likelihood for iteration 3: -3202.91405878
log likelihood for iteration 4: -3176.95671906
log likelihood for iteration 5: -3175.85136112
log likelihood for iteration 6: -3175.84836320
log likelihood for iteration 7: -3175.84836317
log likelihood for iteration 8: -3175.84836317
log likelihood for iteration 9: -3175.84836317
octave:8> % CALCULATE STANDARD ERRORS AND Z-VALUES
octave:8> stder = sqrt(diag(inv(x' * w * x)));
octave:9> z = b ./ stder;
octave:10> result = [b, stder, z];
octave:11> format short g;
octave:12> disp(result);
-0.96483     0.13317     -7.2453
-0.0094622   0.0038629     -2.4495
0.13384     0.02875      4.6554
0.21028    0.069724      3.0159
0.20073    0.048048      4.1776
-0.0004632  4.1893e-05     -11.057
-0.22627    0.077392     -2.9237
```
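For readers without octave, the same IRLS update can be sketched in pure Python (my own illustration, not part of the original post; the data are simulated, and the model is reduced to one predictor plus an intercept so the weighted normal equations collapse to a 2x2 solve):

```python
import math
import random

random.seed(2013)
n = 5000
a_true, b_true = -1.0, 2.0  # intercept and slope of the simulated model

# Simulate logistic data with a single predictor
xs = [random.uniform(-2, 2) for _ in range(n)]
ys = [1 if random.random() < 1 / (1 + math.exp(-(a_true + b_true * x))) else 0
      for x in xs]

b = [0.0, 0.0]          # [intercept, slope], starting from zero as in octave
old_ll = -float("inf")
for i in range(100):
    p = [1 / (1 + math.exp(-(b[0] + b[1] * x))) for x in xs]
    w = [pi * (1 - pi) for pi in p]
    # working response z = x * b + (y - p) / w
    z = [b[0] + b[1] * x + (y - pi) / wi
         for x, y, pi, wi in zip(xs, ys, p, w)]
    # weighted normal equations (X' W X) b = X' W z, solved as a 2x2 system
    s0 = sum(w)
    s1 = sum(wi * x for wi, x in zip(w, xs))
    s2 = sum(wi * x * x for wi, x in zip(w, xs))
    t0 = sum(wi * zi for wi, zi in zip(w, z))
    t1 = sum(wi * x * zi for wi, x, zi in zip(w, xs, z))
    det = s0 * s2 - s1 * s1
    b = [(s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det]
    new_ll = sum(y * math.log(pi) + (1 - y) * math.log(1 - pi)
                 for y, pi in zip(ys, p))
    if new_ll <= old_ll:
        break
    old_ll = new_ll

print(b)  # estimates should land near the true values (-1.0, 2.0)
```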

Written by statcompute

January 2, 2013 at 11:42 pm


## Efficiency of Extracting Rows from a Data Frame in R

In the example below, 552 rows are extracted from a data frame with 10 million rows using six different methods. The results show a significant disparity between the least and the most efficient methods in terms of CPU time. Similar to the finding in my previous post, the method with the data.table package is the most efficient solution, with 0.64s of CPU time. Albeit user-friendly, the method with sqldf() is the least efficient solution, with 82.27s of CPU time.

```> # SIMULATE A DATA.FRAME WITH 10,000,000 ROWS
> set.seed(2013)
> df <- data.frame(x1 = rpois(10000000, 1), x2 = rpois(10000000, 1), x3 = rpois(10000000, 1))
>
> # METHOD 1: EXTRACT ROWS WITH LOGICAL SUBSCRIPTS
> system.time(set1 <- df[df$x1 == 4 & df$x2 > 4 & df$x3 < 4,])
user  system elapsed
1.484   1.932   3.640
> dim(set1)
[1] 552   3
>
> # METHOD 2: EXTRACT ROWS WITH ROW INDEX
> system.time(set2 <- df[which(df$x1 == 4 & df$x2 > 4 & df$x3 < 4),])
user  system elapsed
0.856   1.200   2.197
> dim(set2)
[1] 552   3
>
> # METHOD 3: EXTRACT ROWS WITH SUBSET()
> system.time(set3 <- subset(df, x1 == 4 & x2 > 4 & x3 < 4))
user  system elapsed
1.680   2.644   4.690
> dim(set3)
[1] 552   3
>
> # METHOD 4: EXTRACT ROWS WITH SQLDF()
> require(sqldf)
> system.time(set4 <- sqldf("select * from df where x1 = 4 and x2 > 4 and x3 < 4", row.names = TRUE))
user  system elapsed
82.269  13.733  98.943
> dim(set4)
[1] 552   3
>
> # METHOD 5: EXTRACT ROWS WITH SQL.SELECT()
> system.time(set5 <- sql.select("select * from df where `x1 == 4 & x2 > 4 & x3 < 4`"))
user  system elapsed
2.800   3.152   7.107
> dim(set5)
[1] 552   3
>
> # METHOD 6: EXTRACT ROWS WITH DATA.TABLE PACKAGE
> require(data.table)
> dt <- data.table(df)
> system.time(set6 <- dt[dt$x1 == 4 & dt$x2 > 4 & dt$x3 < 4,])
user  system elapsed
0.636   0.000   0.655
> dim(set6)
[1] 552   3
```
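For comparison (my own addition, not part of the original R benchmark), the same row-extraction task in Python pandas can be done with a boolean mask or with DataFrame.query(), and the two should agree row for row:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2013)
n = 100000
df = pd.DataFrame({"x1": rng.poisson(1, n),
                   "x2": rng.poisson(1, n),
                   "x3": rng.poisson(1, n)})

# Boolean mask, the analogue of df[df$x1 == 4 & ...] in R
set1 = df[(df["x1"] == 4) & (df["x2"] > 4) & (df["x3"] < 4)]

# DataFrame.query(), closer in spirit to the sqldf() syntax
set2 = df.query("x1 == 4 and x2 > 4 and x3 < 4")

assert set1.equals(set2)
print(set1.shape)
```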

Written by statcompute

January 1, 2013 at 9:23 pm

Posted in S+/R
