# Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

## Modeling in R with Log Likelihood Function

Similar to NLMIXED procedure in SAS, optim() in R provides the functionality to estimate a model by specifying the log likelihood function explicitly. Below is a demo showing how to estimate a Poisson model by optim() and its comparison with glm() result.

```> df <- read.csv('credit_count.csv')
> # ESTIMATE A POISSON MODEL WITH GLM()
> mdl <- glm(MAJORDRG ~ AGE + ACADMOS + ADEPCNT + MINORDRG, family = poisson, data = df)
> summary(mdl)

Call:
glm(formula = MAJORDRG ~ AGE + ACADMOS + ADEPCNT + MINORDRG,
family = poisson, data = df)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-9.9940  -0.8907  -0.8079  -0.7633  11.6866

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3813012  0.0450281 -30.676  < 2e-16 ***
AGE          0.0056126  0.0013616   4.122 3.76e-05 ***
ACADMOS      0.0013437  0.0001975   6.803 1.03e-11 ***
ADEPCNT      0.0803056  0.0093378   8.600  < 2e-16 ***
MINORDRG     0.4499422  0.0068969  65.238  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 24954  on 13443  degrees of freedom
Residual deviance: 22026  on 13439  degrees of freedom
AIC: 28520

Number of Fisher Scoring iterations: 6

> # ESTIMATE A POISSON MODEL WITH OPTIM()
> log.like <- function(par) {
+   xb <- par[1] + par[2] * df\$AGE + par[3] * df\$ACADMOS + par[4] * df\$ADEPCNT + par[5] * df\$MINORDRG
+   mu <- exp(xb)
+   ll <- sum(log(exp(-mu) * (mu ^ df\$MAJORDRG) / factorial(df\$MAJORDRG)))
+   return(-ll)
+ }
> result <- optim(c(0, 0, 0, 0, 0), log.like, hessian = TRUE, method = "BFGS")
> stder <- sqrt(diag(solve(result\$hessian)))
> estimate <- data.frame(beta = result\$par, stder = stder, z_values = result\$par / stder)
> print(estimate)
beta        stder   z_values
1 -1.380911081 0.0450398804 -30.659741
2  0.005656423 0.0013611828   4.155520
3  0.001298029 0.0001956315   6.635072
4  0.080171673 0.0093427325   8.581180
5  0.450468859 0.0068922289  65.358952
```

Written by statcompute

December 30, 2012 at 11:54 pm

Posted in S+/R, Statistical Models, Statistics

Tagged with

## Surprising Performance of data.table in Data Aggregation

data.table (http://datatable.r-forge.r-project.org/) inherits from data.frame and provides functionality in fast subset, fast grouping, and fast joins. In previous posts, it is shown that the shortest CPU time to aggregate a data.frame with 13,444 rows and 14 columns for 10 times is 0.236 seconds with summarize() in Hmisc package. However, after the conversion from data.frame to data.table, the CPU time of aggregation improves significantly, as shown in the example below.

```> library(data.table)
data.table 1.8.6  For help type: help("data.table")
> class(df)
[1] "data.frame"
> dt <- data.table(df)
> class(dt)
[1] "data.table" "data.frame"
> system.time({
+   for (i in 1:10){
+     summ <- dt[, list(INCOME = mean(INCOME), BAD = mean(BAD)),by = list(SELFEMPL, OWNRENT)]
+   }
+ })
user  system elapsed
0.060   0.000   0.062
> print(summ)
SELFEMPL OWNRENT   INCOME        BAD
1:        0       0 2133.314 0.08470957
2:        0       1 2881.201 0.06293210
3:        1       1 3487.910 0.05316973
4:        1       0 2742.247 0.06896552
```

Written by statcompute

December 29, 2012 at 12:11 am

Posted in S+/R

Tagged with ,

## More about Aggregation by Group in R

Motivated by my young friend, HongMing Song, I managed to find more handy ways to calculate aggregated statistics by group in R. They require loading additional packages, plyr, doBy, Hmisc, and gdata, and are extremely user-friendly. In terms of CPU time, while the method with summarize() is as efficient as the 2nd method with by() introduced yesterday, summaryBy() in doBy package seems the slowest.

“Learn as if you were to live forever” – Mahatma Gandhi

```> # METHOD 5: USING DDPLY()
> library(plyr)
> summ5 <- ddply(df, .(SELFEMPL, OWNRENT), summarize, INCOME = mean(INCOME), BAD = mean(BAD))
> print(summ5)
SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        0       1 2881.201 0.06293210
3        1       0 2742.247 0.06896552
4        1       1 3487.910 0.05316973
>
> # METHOD 6: USING DOBy()
> library(doBy)
> summ6 <- summaryBy(INCOME + BAD ~ SELFEMPL + OWNRENT, data = df, fun = c(mean), keep.names = TRUE)
> print(summ6)
SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        0       1 2881.201 0.06293210
3        1       0 2742.247 0.06896552
4        1       1 3487.910 0.05316973
>
> # METHOD 7: USING SUMMARIZE()
> library(Hmisc)
> summ7 <- summarize(df[c('INCOME', 'BAD', 'SELFEMPL', 'OWNRENT')], df[c('SELFEMPL', 'OWNRENT')], colMeans, stat.name = NULL)
> print(summ7)
SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        0       1 2881.201 0.06293210
3        1       0 2742.247 0.06896552
4        1       1 3487.910 0.05316973
>
> # METHOD 8: USING FRAMEAPPLY()
> library(gdata)
> summ8 <- frameApply(df, by = c('SELFEMPL', 'OWNRENT'), on = c('INCOME', 'BAD'), fun = colMeans)
> rownames(summ8) <- NULL
> print(summ8)
SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        0       1 2881.201 0.06293210
3        1       0 2742.247 0.06896552
4        1       1 3487.910 0.05316973
```

Efficiency Comparison

```> test5 <- function(n){
+   for (i in 1:n){
+     summ5 <- ddply(df, .(SELFEMPL, OWNRENT), summarize, INCOME = mean(INCOME), BAD = mean(BAD))
+   }
+ }
> system.time(test5(10))
user  system elapsed
0.524   0.068   0.622
>
> test6 <- function(n){
+   for (i in 1:n){
+     summ6 <- summaryBy(INCOME + BAD ~ SELFEMPL + OWNRENT, data = df, fun = c(mean), keep.names = TRUE)
+   }
+ }
> system.time(test6(10))
user  system elapsed
1.800   0.060   1.903
>
> test7 <- function(n){
+   for (i in 1:n){
+     summ7 <- summarize(df[c('INCOME', 'BAD', 'SELFEMPL', 'OWNRENT')], df[c('SELFEMPL', 'OWNRENT')], colMeans, stat.name = NULL)
+   }
+ }
> system.time(test7(10))
user  system elapsed
0.236   0.020   0.274
>
> test8 <- function(n){
+   for (i in 1:n){
+     summ8 <- frameApply(df, by = c('SELFEMPL', 'OWNRENT'), on = c('INCOME', 'BAD'), fun = colMeans)
+     rownames(summ8) <- NULL
+   }
+ }
> system.time(test8(10))
user  system elapsed
0.580   0.008   0.668
```

Written by statcompute

December 24, 2012 at 12:06 pm

Posted in S+/R

Tagged with

## Aggregation by Group in R

```> df <- read.csv('credit_count.csv')
>
> # METHOD 1: USING AGGREGAGE()
> summ1 <- aggregate(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], mean)
> print(summ1)
SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        1       0 2742.247 0.06896552
3        0       1 2881.201 0.06293210
4        1       1 3487.910 0.05316973
>
> # METHOD 2: USING BY()
> temp2 <- by(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], colMeans)
> summ2 <- cbind(expand.grid(dimnames(temp2)), do.call(rbind, temp2))
> print(summ2)
SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        1       0 2742.247 0.06896552
3        0       1 2881.201 0.06293210
4        1       1 3487.910 0.05316973
>
> # METHOD 3: USING SQLDF()
> library(sqldf)
> summ3 <- sqldf("select SELFEMPL, OWNRENT, avg(INCOME) as INCOME, avg(BAD) from df
+                 group by SELFEMPL, OWNRENT")
> print(summ3)
SELFEMPL OWNRENT   INCOME   avg(BAD)
1        0       0 2133.314 0.08470957
2        0       1 2881.201 0.06293210
3        1       0 2742.247 0.06896552
4        1       1 3487.910 0.05316973
>
> # METHOD 4: USING SQL.SELECT()
Creating a generic function for ‘as.data.frame’ from package ‘base’ in the global environment
> summ4 <- sql.select("select SELFEMPL, OWNRENT, `mean(INCOME)` as INCOME, `mean(BAD)` as BAD
+                      from df group by SELFEMPL, OWNRENT")
> print(summ4)
SELFEMPL OWNRENT   INCOME        BAD
1        0       0 2133.314 0.08470957
2        0       1 2881.201 0.06293210
3        1       1 3487.910 0.05316973
4        1       0 2742.247 0.06896552
```

Efficiency Comparison among 4 Methods above

```> test1 <- function(n){
+   for (i in 1:n){
+     summ1 <- aggregate(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], mean)
+   }
+ }
> system.time(test1(10))
user  system elapsed
0.404   0.036   0.513
>
> test2 <- function(n){
+   for (i in 1:n){
+     temp2 <- by(df[c('INCOME', 'BAD')], df[c('SELFEMPL', 'OWNRENT')], colMeans)
+     summ2 <- cbind(expand.grid(dimnames(temp2)), do.call(rbind, temp2))
+   }
+ }
> system.time(test2(10))
user  system elapsed
0.244   0.020   0.309
>
> test3 <- function(n){
+   for (i in 1:n){
+     summ3 <- sqldf("select SELFEMPL, OWNRENT, avg(INCOME) as INCOME, avg(BAD) from df
+                     group by SELFEMPL, OWNRENT")
+   }
+ }
> system.time(test3(10))
user  system elapsed
0.956   0.112   1.178
>
> test4 <- function(n){
+   for (i in 1:n){
+     summ4 <- sql.select("select SELFEMPL, OWNRENT, `mean(INCOME)` as INCOME, `mean(BAD)` as BAD
+                          from df group by SELFEMPL, OWNRENT")
+   }
+ }
> system.time(test4(10))
user  system elapsed
0.432   0.112   0.601
```

Written by statcompute

December 24, 2012 at 12:22 am

Posted in S+/R

Tagged with

## Data Import Efficiency – A Case in R

Below is a piece of R snippet comparing the data import efficiencies among CSV, SQLITE, and HDF5. Similar to the case in Python posted yesterday, HDF5 shows the highest efficiency.

```> library(RSQLite)
> library(rhdf5)
> df <- read.csv('credit_count.csv')
> do.call(cat, list(nrow(df), ncol(df), '\n'))
13444 14
>
> # WRITE DF INTO SQLITE
> if(file.exists('data.db')) file.remove('data.db')
[1] TRUE
> con <- dbConnect("SQLite", dbname = "data.db")
> dbWriteTable(con, "tbl", df)
[1] TRUE
>
> # WRITE DF INTO HDF5
> if(file.exists('data.h5')) file.remove('data.h5')
[1] TRUE
> h5createFile("data.h5")
[1] TRUE
> h5write(df, 'data.h5', 'tbl')
>
> # CALCULATE CPU TIMES
> system.time(for(i in 1:10) read.csv('credit_count.csv'))
user  system elapsed
1.148   0.056   1.576
> system.time(for(i in 1:10) dbReadTable(con, 'tbl'))
user  system elapsed
0.492   0.024   0.649
> system.time(for(i in 1:10) h5read('data.h5','tbl'))
user  system elapsed
0.164   1.184   1.946
```

Written by statcompute

December 23, 2012 at 5:22 pm

Posted in S+/R

Tagged with ,

## Data Import Efficiency

HDF5 stands for hierarchical data format and is a popular data model designed to store and organize large amounts of data with interfaces in multiple computing languages such as R, PYTHON, and MATLAB.

Below is a demonstration showing the efficiency of data import from HDF5 and its comparison with CSV and SQLITE. As shown in the result, the time of data import from HDF5 is the shortest, only ~50% of import time from CSV and ~25% of import time from SQLITE.

```In [1]: import pandas as pd

In [2]: import pandas.io.sql as pd_sql

In [3]: import sqlite3 as sql

In [4]: DF = pd.read_csv('credit_count.csv')

Out[5]:
CARDHLDR  BAD        AGE  ACADMOS  ADEPCNT  MAJORDRG  MINORDRG  OWNRENT       INCOME  \
0         0    0  27.250000        4        0         0         0        0  1200.000000
1         0    0  40.833332      111        3         0         0        1  4000.000000
2         1    0  37.666668       54        3         0         0        1  3666.666667

SELFEMPL  INCPER   EXP_INC     SPENDING   LOGSPEND
0         0   18000  0.000667
1         0   13500  0.000222
2         0   11300  0.033270  121.9896773  4.8039364

In [6]: con = sql.connect('data.db') #WRITE DF INTO SQLITE DB

In [7]: pd_sql.write_frame(DF, "tbl", con)

In [8]: con.commit()

In [9]: h5 = pd.HDFStore('data.h5', 'w') #WRITE DF INTO HDF5

In [10]: h5['tbl'] = DF

In [11]: %timeit -n100 -r10 pd.read_csv('credit_count.csv')
100 loops, best of 10: 64.3 ms per loop

In [12]: %timeit -n100 -r10 pd_sql.read_frame("select * from tbl", con)
100 loops, best of 10: 114 ms per loop

In [13]: %timeit -n100 -r10 output = h5['tbl']
100 loops, best of 10: 26.3 ms per loop
```

Written by statcompute

December 22, 2012 at 11:21 pm

Posted in PYTHON

Tagged with ,

## Removing Records by Duplicate Values in R – An Efficiency Comparison

After posting “Removing Records by Duplicate Values” yesterday, I had an interesting communication thread with my friend Jeffrey Allard tonight regarding how to code this in R, a combination of order() and duplicated() or sqldf().

Afterward, I did a simple efficiency comparison between two methods as below. The comparison result is pretty self-explanatory. In terms of “user time”, dedup1() is at least 10 times more efficient than dedup2().

```> library(sqldf)
> df1 <- read.table("../data/credit_count.txt", header = TRUE, sep = ",")
> cat(nrow(df1), ncol(df1), '\n')
13444 14
> # DEDUP WITH ORDER() AND DUPLICATED()
> dedup1 <- function(n){
+   for (i in 1:n){
+     df12 <- df1[order(df1\$MAJORDRG, df1\$INCOME), ]
+     df13 <- df12[!duplicated(df12\$MAJORDRG), ]
+   }
+ }
> # DEDUP WITH SQLDF()
> dedup2 <- function(n){
+   for (i in 1:n){
+     df22 <- sqldf("select * from df1 order by MAJORDRG, INCOME")
+     df23 <- sqldf("select a.* from df22 as a inner join (select MAJORDRG, min(rowid) as min_id from df22 group by MAJORDRG) as b on a.MAJORDRG = b.MAJORDRG and a.rowid = b.min_id")
+   }
+ }
> # RUN BOTH METHODS 100 TIMES AND COMPARE CPU TIMES
> system.time(dedup2(100))
user  system elapsed
22.581   1.684  26.965
> system.time(dedup1(100))
user  system elapsed
1.732   0.080   2.033
```

Written by statcompute

December 20, 2012 at 11:30 pm

Posted in S+/R

Tagged with