## Archive for **January 2013**

## Joining DataFrames with pandas

Driven by my own curiosity, I also did a test on joining two DataFrames with the python pandas package and tried to compare it with R in terms of efficiency.

In the pandas package, there are two ways to join two DataFrames: the pandas.merge() function and the pandas.DataFrame.join() method. Based upon the results shown below, the two methods are very comparable in terms of speed: roughly twice as slow as the R data.table package, but a lot faster than the other three R methods, i.e. merge(), sqldf(), and the ff package.

```
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import random as rd

In [4]: rd.seed(2013)

In [5]: n = 1000000

In [6]: ldf = pd.DataFrame({"id1": rd.sample(range(n), n),
   ...:                     "id2": np.random.randint(0, n / 100, n),
   ...:                     "x1" : np.random.normal(size = n),
   ...:                     "x2" : np.random.uniform(size = n)})

In [7]: rdf = pd.DataFrame({"id1": rd.sample(range(n), n),
   ...:                     "id2": np.random.randint(0, n / 100, n),
   ...:                     "y1" : np.random.normal(size = n),
   ...:                     "y2" : np.random.uniform(size = n)})

In [8]: ldf.head(3)
Out[8]:
      id1   id2        x1        x2
0  425177  5902  0.445512  0.393837
1   51906  6066  0.336391  0.911894
2  401789  6609  0.347967  0.719724

In [9]: rdf.head(3)
Out[9]:
      id1   id2        y1        y2
0  724470  4559 -0.674539  0.767925
1  196104  8637  0.001924  0.246559
2  988955  2348 -0.254853  0.258329

In [10]: # METHOD 1: PANDAS.MERGE() FUNCTION

In [11]: %timeit -n10 -r3 pd.merge(ldf, rdf, how = 'inner', left_on = ['id1', 'id2'], right_on = ['id1', 'id2'])
10 loops, best of 3: 3.44 s per loop

In [12]: # METHOD 2: PANDAS.DATAFRAME.JOIN() METHOD

In [13]: %timeit -n10 -r3 ldf.join(rdf.set_index(['id1', 'id2']), on = ['id1', 'id2'], how = 'inner')
10 loops, best of 3: 3.71 s per loop
```

## Another Benchmark for Joining Two Data Frames

In yesterday's post comparing the efficiency of joining two data frames, I overlooked the computing cost of converting data.frames to data.table / ff objects. Today, I repeated the test with library loading and data conversion included. With 10 replications in the rbenchmark package, the join with data.table is roughly 6 to 9 times faster than the other three methods in terms of user time. Although the ff package is claimed to handle large data sets, its efficiency here seems questionable.

```r
n <- 1000000
set.seed(2013)
ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE),
                  x1 = rnorm(n), x2 = runif(n))
rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE),
                  y1 = rnorm(n), y2 = runif(n))

library(rbenchmark)
benchmark(replications = 10, order = "user.self",
  # GENERIC MERGE() IN BASE PACKAGE
  merge = merge(ldf, rdf, by = c("id1", "id2")),
  # DATA.TABLE PACKAGE
  datatable = {
    ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
    rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
    merge(ldt, rdt, by = c("id1", "id2"))
  },
  # FF PACKAGE
  ff = {
    lff <- ff::as.ffdf(ldf)
    rff <- ff::as.ffdf(rdf)
    merge(lff, rff, by = c("id1", "id2"))
  },
  # SQLDF PACKAGE
  sqldf = sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
                         "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2"))
)
#        test replications elapsed relative user.self sys.self user.child
# 2 datatable           10  17.923    1.000    16.605    1.432          0
# 4     sqldf           10 105.002    5.859   102.294    3.345          0
# 1     merge           10 131.279    7.325   119.139   13.049          0
# 3        ff           10 187.150   10.442   154.670   33.758          0
```

## Efficiency in Joining Two Data Frames

In R, there are multiple ways to merge two data frames. However, there can be a huge disparity among them in terms of efficiency. Therefore, it is worthwhile to benchmark the different methods and choose the right approach for real-world work.

For small data frames with 1,000 rows, all six methods shown below work reasonably well, except that the sql.select() approach is significantly slower than the rest. The generic merge() function in the base package is a natural choice, with no overhead of loading additional libraries or converting data frames. sqldf() is also attractive in that it is arguably the most user-friendly function, with a very intuitive syntax.

```
> n <- 1000
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
>
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
   user  system elapsed
  0.032   0.012   0.064
>
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
   user  system elapsed
  0.028   0.000   0.044
>
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
   user  system elapsed
  0.044   0.004   0.096
>
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
+   "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
   user  system elapsed
  0.168   0.008   0.332
>
> # METHOD 5: PLYR
> system.time(join5 <- plyr::join(ldf, rdf, by = c("id1", "id2"), type = "inner"))
   user  system elapsed
  0.088   0.020   0.152
>
> # METHOD 6: SQL.SELECT
> source("http://sqlselect.googlecode.com/svn/trunk/sql.select.R")
> system.time(join6 <- sql.select("select * from ldf inner join rdf on (`ldf$id1 == rdf$id1 & ldf$id2 == rdf$id2`)"))
   user  system elapsed
 53.775  19.725  73.813
```

However, when it comes to mid-size data frames with 1,000,000 rows, the story changes. First of all, out of the six methods shown above, the last two fail outright due to insufficient memory on my 32-bit Ubuntu virtual machine. In this case, the data.table package shows a significant advantage once the two data.frames are converted to data.tables. In addition, it is interesting that although the ff and sqldf packages are slower than the merge() function for the small data set with 1,000 rows, both are faster for the data with 1,000,000 rows.

```
> n <- 1000 ^ 2
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
>
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
   user  system elapsed
 55.223  12.437  68.054
>
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
   user  system elapsed
  0.484   0.008   0.492
>
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
   user  system elapsed
 49.811  13.821  64.004
>
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
+   "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
   user  system elapsed
 40.418   1.268  42.076
```

## About Cointegration

In the stat workshop supporting loss forecasting, my analysts and I are frequently asked to quantify the “correlation” between time series. In the summary below, I will briefly describe a statistical method other than “correlation”, namely cointegration, to characterize the relationship between time series.

In empirical finance, it is a popular practice to use correlation to describe the relationship between multiple time series. However, this approach has been criticized in that a relationship might be wrongly inferred due to the existence of other latent causal factors. In this case, cointegration, proposed by Engle and Granger (1987), becomes an alternative way to characterize this correlated nature of time series.

In layman’s terms, cointegration describes whether two or more time series move together along a common trend. In the statistical definition, assume two time series X_t and Y_t, each integrated of order one, i.e. I(1); if a linear combination of X_t and Y_t, e.g. Z_t = X_t + B * Y_t, is stationary, i.e. I(0), then the two series X_t and Y_t are said to be cointegrated. Since the idea of cointegration is concerned with a co-movement / long-term equilibrium among multiple time series, it is also used to test the Efficient Market Hypothesis (EMH) in econometrics.

In cointegration analysis, most practices fall into two major categories: either the minimization of a certain variance or the maximization of a certain correlation. For instance, the single-equation approach, such as the one suggested by Engle and Granger (1987), looks for the linear combination of X_t and Y_t with minimum variance and therefore belongs to the first category. On the other hand, the reduced-rank, system-based approach, such as the one proposed by Johansen (1988), belongs to the second category in that it looks for the linear combination of X_t and Y_t with maximum correlation.

Following the logic outlined by Engle and Granger, it is straightforward to formulate an augmented Dickey-Fuller (ADF) test for the cointegration analysis between two time series, although other unit-root tests such as Phillips–Perron or Durbin–Watson should also suffice. Given X_t and Y_t integrated of order one, the first step is to estimate a simple linear regression Y_t = a + B * X_t + e_t, in which e_t is the residual term. Afterwards, the ADF test is used to check for a unit root in e_t. If the unit-root hypothesis for e_t is rejected, then e_t is I(0) and therefore stationary, implying that X_t and Y_t are cointegrated. Otherwise, cointegration between X_t and Y_t cannot be concluded. This approach is attractive in that it is extremely easy to implement and understand. However, the method is only appropriate for a system with two time series and at most one cointegrating relationship.
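
To make the two-step procedure concrete, below is a minimal sketch in R on simulated data, assuming the tseries package is installed; the simulated series, coefficients, and variable names are purely illustrative and not taken from any workshop data.

```r
# A minimal sketch of the Engle-Granger two-step test on simulated data.
library(tseries)

set.seed(2013)
n <- 500
trend <- cumsum(rnorm(n))          # common stochastic trend, I(1)
x <- trend + rnorm(n)              # X_t carries the trend plus noise
y <- 2 + 0.5 * trend + rnorm(n)    # Y_t shares the same trend

# STEP 1: estimate the long-run relation Y_t = a + B * X_t + e_t
eg <- lm(y ~ x)

# STEP 2: test the residuals e_t for a unit root with the ADF test;
# rejecting the unit root implies e_t is I(0), i.e. X_t and Y_t are cointegrated
adf.test(residuals(eg))
```

Note that the tabulated ADF critical values are only approximate when the test is applied to estimated residuals; the Phillips–Ouliaris test in tseries::po.test(cbind(y, x)) uses critical values appropriate for this residual-based setting.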

The Johansen test for cointegration is a maximum likelihood estimation procedure based on the Vector Autoregressive (VAR) model that allows for dynamic interactions between two or more series and therefore is more general than the previous approach. Consider a VAR model of order p such that Y_t = A_1 * Y_t-1 + … + A_p * Y_t-p + e_t, where Y_t is a vector of variables integrated of order one and e_t is a vector of innovations. Without loss of generality, the VAR can be re-written in error-correction form as delta_Y_t = PI * Y_t-1 + SUM[GAMMA_i * delta_Y_t-i] + e_t. The whole idea of the Johansen test is to decompose PI into two n by r matrices, α and β, such that PI = α * β' and β' * Y_t is stationary, where r is the number of cointegrating relations (the cointegrating rank) and each column of β is a cointegrating vector. A likelihood ratio test can then be formulated to test the null hypothesis of r cointegrating vectors against the alternative hypothesis of n cointegrating vectors. While Johansen’s method is a more powerful test for cointegration, the drawback is a more complicated implementation and interpretation.
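
As a companion sketch, the Johansen procedure is implemented in the urca package; the call below reuses the simulated x and y from the Engle-Granger sketch above and picks an illustrative lag order of K = 2, so treat it as a template rather than a definitive specification.

```r
# A minimal sketch of the Johansen trace test, assuming the urca package is installed.
library(urca)

jo <- ca.jo(cbind(y, x), type = "trace", ecdet = "const", K = 2)
summary(jo)   # trace statistics with critical values for r = 0 and r <= 1;
              # the estimated beta columns are the candidate cointegrating vectors
```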

## PART – A Rule-Learning Algorithm

```
> require('RWeka')
> require('pROC')
>
> # SEPARATE DATA INTO TRAINING AND TESTING SETS
> df1 <- read.csv('credit_count.csv')
> df2 <- df1[df1$CARDHLDR == 1, 2:12]
> set.seed(2013)
> rows <- sample(1:nrow(df2), nrow(df2) - 1000)
> set1 <- df2[rows, ]
> set2 <- df2[-rows, ]
>
> # BUILD A PART RULE MODEL
> mdl1 <- PART(factor(BAD) ~., data = set1)
> print(mdl1)
PART decision list
------------------

EXP_INC > 0.000774 AND AGE > 21.833334 AND INCOME > 2100 AND MAJORDRG <= 0 AND OWNRENT > 0 AND MINORDRG <= 1: 0 (2564.0/103.0)

AGE > 21.25 AND EXP_INC > 0.000774 AND INCPER > 17010 AND INCOME > 1774.583333 AND MINORDRG <= 0: 0 (2278.0/129.0)

AGE > 20.75 AND EXP_INC > 0.016071 AND OWNRENT > 0 AND SELFEMPL > 0 AND EXP_INC <= 0.233759 AND MINORDRG <= 1: 0 (56.0)

AGE > 20.75 AND EXP_INC > 0.016071 AND SELFEMPL <= 0 AND OWNRENT > 0: 0 (1123.0/130.0)

OWNRENT <= 0 AND AGE > 20.75 AND ACADMOS <= 20 AND ADEPCNT <= 2 AND MINORDRG > 0 AND ACADMOS <= 14: 0 (175.0/10.0)

OWNRENT <= 0 AND AGE > 20.75 AND ADEPCNT <= 0: 0 (1323.0/164.0)

INCOME > 1423 AND OWNRENT <= 0 AND MINORDRG <= 1 AND ADEPCNT > 0 AND SELFEMPL <= 0 AND MINORDRG <= 0: 0 (943.0/124.0)

SELFEMPL > 0 AND MAJORDRG <= 0 AND ACADMOS > 85: 0 (24.0)

SELFEMPL > 0 AND MAJORDRG <= 1 AND MAJORDRG <= 0 AND MINORDRG <= 0 AND INCOME > 2708.333333: 0 (17.0)

SELFEMPL > 0 AND MAJORDRG <= 1 AND OWNRENT <= 0 AND MINORDRG <= 0 AND INCPER <= 8400: 0 (13.0)

SELFEMPL <= 0 AND OWNRENT > 0 AND ADEPCNT <= 0 AND MINORDRG <= 0 AND MAJORDRG <= 0: 0 (107.0/15.0)

OWNRENT <= 0 AND MINORDRG > 0 AND MINORDRG <= 1 AND MAJORDRG <= 1 AND MAJORDRG <= 0 AND SELFEMPL <= 0: 0 (87.0/13.0)

OWNRENT <= 0 AND SELFEMPL <= 0 AND MAJORDRG <= 0 AND MINORDRG <= 1: 0 (373.0/100.0)

MAJORDRG > 0 AND MINORDRG > 0 AND MAJORDRG <= 1 AND MINORDRG <= 1: 0 (29.0)

SELFEMPL <= 0 AND OWNRENT > 0 AND MAJORDRG <= 0: 0 (199.0/57.0)

OWNRENT <= 0 AND SELFEMPL <= 0: 0 (84.0/24.0)

MAJORDRG > 1: 0 (17.0/3.0)

ACADMOS <= 34 AND MAJORDRG > 0: 0 (10.0)

MAJORDRG <= 0 AND ADEPCNT <= 2 AND OWNRENT <= 0: 0 (29.0/7.0)

OWNRENT > 0 AND SELFEMPL > 0 AND EXP_INC <= 0.218654 AND MINORDRG <= 2 AND MINORDRG <= 1: 0 (8.0/1.0)

OWNRENT > 0 AND INCOME <= 2041.666667 AND MAJORDRG > 0 AND ADEPCNT > 0: 1 (5.0)

OWNRENT > 0 AND AGE > 33.416668 AND ACADMOS <= 174 AND SELFEMPL > 0: 0 (10.0/1.0)

OWNRENT > 0 AND SELFEMPL <= 0 AND MINORDRG <= 1 AND AGE > 33.5 AND EXP_INC > 0.006737: 0 (6.0)

EXP_INC > 0.001179: 1 (16.0/1.0)

: 0 (3.0)

Number of Rules  : 25

> pred1 <- data.frame(prob = predict(mdl1, newdata = set2, type = 'probability')[, 2])
> # ROC FOR TESTING SET
> print(roc1 <- roc(set2$BAD, pred1$prob))

Call:
roc.default(response = set2$BAD, predictor = pred1$prob)

Data: pred1$prob in 905 controls (set2$BAD 0) < 95 cases (set2$BAD 1).
Area under the curve: 0.6794
>
> # BUILD A LOGISTIC REGRESSION
> mdl2 <- Logistic(factor(BAD) ~., data = set1)
> print(mdl2)
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
               Class
Variable           0
====================
AGE           0.0112
ACADMOS      -0.0005
ADEPCNT      -0.0747
MAJORDRG     -0.2312
MINORDRG     -0.1991
OWNRENT       0.2244
INCOME        0.0004
SELFEMPL     -0.1206
INCPER             0
EXP_INC       0.4472
Intercept     0.7965

Odds Ratios...
               Class
Variable           0
====================
AGE           1.0113
ACADMOS       0.9995
ADEPCNT        0.928
MAJORDRG      0.7936
MINORDRG      0.8195
OWNRENT       1.2516
INCOME        1.0004
SELFEMPL      0.8864
INCPER             1
EXP_INC       1.5639

> pred2 <- data.frame(prob = predict(mdl2, newdata = set2, type = 'probability')[, 2])
> # ROC FOR TESTING SET
> print(roc2 <- roc(set2$BAD, pred2$prob))

Call:
roc.default(response = set2$BAD, predictor = pred2$prob)

Data: pred2$prob in 905 controls (set2$BAD 0) < 95 cases (set2$BAD 1).
Area under the curve: 0.6529
>
> # COMPARE TWO ROCS
> roc.test(roc1, roc2)

	DeLong's test for two correlated ROC curves

data:  roc1 and roc2
Z = 1.0344, p-value = 0.301
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
  0.6793894   0.6528875
```

## Oct2Py and GNU Octave

GNU Octave (www.gnu.org/software/octave) is a matrix-based computing language largely compatible with Matlab and is gaining popularity in machine learning after Dr Ng’s open course (www.coursera.org/course/ml). It is not a general-purpose programming language and is primarily designed for numerical computation. As a result, it is usually not convenient to do dirty work such as data munging within Octave. In this case, Python becomes a good complement.

Oct2Py is a Python interface to Octave that allows users to run Octave commands and m-files dynamically within Python and to exchange data between the two computing systems seamlessly.

In the example below, I will show how to do simple data wrangling with the oct2py package in Python, push a numpy array into the Octave workspace and save it as a binary *.mat file, load that *.mat file into Octave, and then estimate a logistic regression through the IRLS (iteratively reweighted least squares) method. As shown in the Octave session below, the vectorized implementation makes Octave very efficient for prototyping machine learning algorithms, i.e. logistic regression in this case.
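
For reference, the update inside the Octave loop below is the standard IRLS / Newton–Raphson step for the logistic log-likelihood, which can be written as:

$$
p = \frac{1}{1 + e^{-X\beta}}, \qquad
W = \mathrm{diag}\{p_i (1 - p_i)\}, \qquad
z = X\beta + W^{-1}(y - p), \qquad
\beta_{new} = (X'WX)^{-1} X'W z
$$

Each pass recomputes p, W, and z from the current beta and repeats until the log likelihood stops improving, which is exactly the stopping rule coded in the loop.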

**PYTHON SESSION**

```
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import oct2py as op

In [4]: # READ DATA FROM CSV

In [5]: df = pd.read_csv('credit_count.csv')

In [6]: # SCRUB RAW DATA

In [7]: df2 = df.ix[df.CARDHLDR == 1, ['AGE', 'ADEPCNT', 'MAJORDRG', 'MINORDRG', 'INCOME', 'OWNRENT', 'BAD']]

In [8]: # CONVERT DATAFRAME TO NUMPY ARRAY

In [9]: arr = np.array(df2, dtype = np.double)

In [10]: # INVOKE AN OCTAVE SESSION

In [11]: octave = op.Oct2Py()

In [12]: # PUT NUMPY ARRAY INTO OCTAVE WORKSPACE

In [13]: octave.put('m', arr)

In [14]: print octave.run('whos')
Variables in the current scope:

  Attr Name        Size                     Bytes  Class
  ==== ====        ====                     =====  =====
       ans         1x70                       763  cell
       m       10499x7                     587944  double

Total is 73563 elements using 588707 bytes

In [15]: # SAVE MAT DATA TO BE USED IN OCTAVE

In [16]: octave.run('save data.mat m')
Out[16]: ''
```

**OCTAVE SESSION**

```
octave:1> % LOAD DATA
octave:1> load data.mat
octave:2> whos
Variables in the current scope:

  Attr Name        Size                     Bytes  Class
  ==== ====        ====                     =====  =====
       ans         1x70                       763  cell
       m       10499x7                     587944  double

Total is 73563 elements using 588707 bytes

octave:3> x = [ones(size(m, 1), 1) m(:, 1:6)];
octave:4> y = m(:, 7);
octave:5> % SET INITIAL VALUES OF PARAMETERS
octave:5> b = zeros(size(x, 2), 1);
octave:6> % MAX ITERATIONS
octave:6> maxiter = 100;
octave:7> % SOLVE FOR BETA THROUGH IRLS
octave:7> for i = 1:maxiter
>   xb = x * b;
>   p = 1 ./ (1 + exp(-xb));
>   w = diag(p .* (1 - p));
>   z = x * b + w \ (y - p);
>   b = (x' * w * x) \ (x' * w) * z;
>   new_ll = y' * log(p) + (1 - y)' * log(1 - p);
>   disp(sprintf('log likelihood for iteration %0.0f: %0.8f', i, new_ll));
>   if i == 1
>     old_ll = new_ll;
>   elseif new_ll > old_ll
>     old_ll = new_ll;
>   else
>     break
>   endif
> end
log likelihood for iteration 1: -7277.35224870
log likelihood for iteration 2: -3460.32494800
log likelihood for iteration 3: -3202.91405878
log likelihood for iteration 4: -3176.95671906
log likelihood for iteration 5: -3175.85136112
log likelihood for iteration 6: -3175.84836320
log likelihood for iteration 7: -3175.84836317
log likelihood for iteration 8: -3175.84836317
log likelihood for iteration 9: -3175.84836317
octave:8> % CALCULATE STANDARD ERRORS AND Z-VALUES
octave:8> stder = sqrt(diag(inv(x' * w * x)));
octave:9> z = b ./ stder;
octave:10> result = [b, stder, z];
octave:11> format short g;
octave:12> disp(result);
     -0.96483      0.13317      -7.2453
   -0.0094622    0.0038629      -2.4495
      0.13384      0.02875       4.6554
      0.21028     0.069724       3.0159
      0.20073     0.048048       4.1776
   -0.0004632   4.1893e-05      -11.057
     -0.22627     0.077392      -2.9237
```

## Efficiency of Extracting Rows from a Data Frame in R

In the example below, 552 rows are extracted from a data frame with 10 million rows using six different methods. The results show a significant disparity between the least and the most efficient methods in terms of CPU time. Similar to the finding in my previous post, the method using the data.table package is the most efficient solution, at 0.64 seconds of CPU time. Albeit user-friendly, the method with sqldf() is the least efficient solution, at 82.27 seconds of CPU time.

```
> # SIMULATE A DATA.FRAME WITH 10,000,000 ROWS
> set.seed(2013)
> df <- data.frame(x1 = rpois(10000000, 1), x2 = rpois(10000000, 1), x3 = rpois(10000000, 1))
>
> # METHOD 1: EXTRACT ROWS WITH LOGICAL SUBSCRIPTS
> system.time(set1 <- df[df$x1 == 4 & df$x2 > 4 & df$x3 < 4, ])
   user  system elapsed
  1.484   1.932   3.640
> dim(set1)
[1] 552   3
>
> # METHOD 2: EXTRACT ROWS WITH ROW INDEX
> system.time(set2 <- df[which(df$x1 == 4 & df$x2 > 4 & df$x3 < 4), ])
   user  system elapsed
  0.856   1.200   2.197
> dim(set2)
[1] 552   3
>
> # METHOD 3: EXTRACT ROWS WITH SUBSET()
> system.time(set3 <- subset(df, x1 == 4 & x2 > 4 & x3 < 4))
   user  system elapsed
  1.680   2.644   4.690
> dim(set3)
[1] 552   3
>
> # METHOD 4: EXTRACT ROWS WITH SQLDF()
> require(sqldf)
> system.time(set4 <- sqldf("select * from df where x1 = 4 and x2 > 4 and x3 < 4", row.names = TRUE))
   user  system elapsed
 82.269  13.733  98.943
> dim(set4)
[1] 552   3
>
> # METHOD 5: EXTRACT ROWS WITH SQL.SELECT()
> source("http://sqlselect.googlecode.com/svn/trunk/sql.select.R")
> system.time(set5 <- sql.select("select * from df where `x1 == 4 & x2 > 4 & x3 < 4`"))
   user  system elapsed
  2.800   3.152   7.107
> dim(set5)
[1] 552   3
>
> # METHOD 6: EXTRACT ROWS WITH DATA.TABLE PACKAGE
> require(data.table)
> dt <- data.table(df)
> system.time(set6 <- dt[dt$x1 == 4 & dt$x2 > 4 & dt$x3 < 4, ])
   user  system elapsed
  0.636   0.000   0.655
> dim(set6)
[1] 552   3
```