## Posts Tagged ‘**R**’

## Mimicking SQLDF with MonetDBLite

Like many useRs, I am also a big fan of the sqldf package developed by Grothendieck, which uses SQL statements to manipulate data frames, with an embedded SQLite database as the default back-end.

In the examples below, I drafted a couple of R utility functions with the MonetDBLite back-end by mimicking the sqldf function. The benchmark comparison shows several interesting observations:

– The data import of csv files is more efficient with MonetDBLite than with the generic read.csv() function or the read.csv.sql() function in the sqldf package.

– The manipulation of a single data frame, such as selection, aggregation, and subqueries, is also significantly faster with MonetDBLite than with the sqldf function.

– However, the sqldf function is extremely efficient in joining two data frames, e.g. the inner join in the example.

```r
# IMPORT
monet.read.csv <- function(file) {
  monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:")
  suppressMessages(MonetDBLite::monetdb.read.csv(monet.con, file, "file", sep = ","))
  result <- DBI::dbReadTable(monet.con, "file")
  DBI::dbDisconnect(monet.con, shutdown = T)
  return(result)
}

microbenchmark::microbenchmark(monet = {df <- monet.read.csv("Downloads/nycflights.csv")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
# monet 528.5378 532.5463 539.2877 539.0902 542.4301 559.1191    10

microbenchmark::microbenchmark(read.csv = {df <- read.csv("Downloads/nycflights.csv")}, times = 10)
#Unit: seconds
#     expr      min       lq     mean   median       uq      max neval
# read.csv 2.310238 2.338134 2.360688 2.343313 2.373913 2.444814    10

# SELECTION AND AGGREGATION
monet.sql <- function(df, sql) {
  df_str <- deparse(substitute(df))
  monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:")
  suppressMessages(DBI::dbWriteTable(monet.con, df_str, df, overwrite = T))
  result <- DBI::dbGetQuery(monet.con, sql)
  DBI::dbDisconnect(monet.con, shutdown = T)
  return(result)
}

microbenchmark::microbenchmark(monet = {monet.sql(df, "select * from df sample 3")}, times = 10)
#Unit: milliseconds
#  expr     min      lq     mean   median       uq     max neval
# monet 422.761 429.428 439.0438 438.3503 447.3286 453.104    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from df order by RANDOM() limit 3")}, times = 10)
#Unit: milliseconds
#  expr      min      lq     mean   median       uq      max neval
# sqldf 903.9982 908.256 925.4255 920.2692 930.0934 963.6983    10

microbenchmark::microbenchmark(monet = {monet.sql(df, "select origin, median(distance) as med_dist from df group by origin")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
# monet 450.7862 456.9589 458.6389 458.9634 460.4402 465.2253    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select origin, median(distance) as med_dist from df group by origin")}, times = 10)
#Unit: milliseconds
#  expr      min       lq    mean   median       uq      max neval
# sqldf 833.1494 836.6816 841.952 843.5569 846.8117 851.0771    10

microbenchmark::microbenchmark(monet = {monet.sql(df, "with df1 as (select dest, avg(distance) as dist from df group by dest), df2 as (select dest, count(*) as cnts from df group by dest) select * from df1 inner join df2 on (df1.dest = df2.dest)")}, times = 10)
#Unit: milliseconds
#  expr      min       lq    mean   median       uq     max neval
# monet 426.0248 431.2086 437.634 438.4718 442.8799 451.275    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from (select dest, avg(distance) as dist from df group by dest) df1 inner join (select dest, count(*) as cnts from df group by dest) df2 on (df1.dest = df2.dest)")}, times = 10)
#Unit: seconds
#  expr      min       lq     mean   median       uq      max neval
# sqldf 1.013116 1.017248 1.024117 1.021555 1.025668 1.048133    10

# MERGE
monet.sql2 <- function(df1, df2, sql) {
  df1_str <- deparse(substitute(df1))
  df2_str <- deparse(substitute(df2))
  monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:")
  suppressMessages(DBI::dbWriteTable(monet.con, df1_str, df1, overwrite = T))
  suppressMessages(DBI::dbWriteTable(monet.con, df2_str, df2, overwrite = T))
  result <- DBI::dbGetQuery(monet.con, sql)
  DBI::dbDisconnect(monet.con, shutdown = T)
  return(result)
}

tbl1 <- monet.sql(df, "select dest, avg(distance) as dist from df group by dest")
tbl2 <- monet.sql(df, "select dest, count(*) as cnts from df group by dest")

microbenchmark::microbenchmark(monet = {monet.sql2(tbl1, tbl2, "select * from tbl1 inner join tbl2 on (tbl1.dest = tbl2.dest)")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean  median       uq      max neval
# monet 93.94973 174.2211 170.7771 178.487 182.4724 187.3155    10

microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from tbl1 inner join tbl2 on (tbl1.dest = tbl2.dest)")}, times = 10)
#Unit: milliseconds
#  expr      min       lq     mean median       uq      max neval
# sqldf 19.49334 19.60981 20.29535 20.001 20.93383 21.51837    10
```

## MLE with General Optimization Functions in R

In my previous post (https://statcompute.wordpress.com/2018/02/25/mle-in-r/), I showed how to estimate the MLE based on the log likelihood function with a general-purpose optimization algorithm, e.g. optim(), and that such optimizers are more flexible and efficient than wrappers in statistical packages.

A benchmark comparison is given below showing the use of other general optimizers commonly used in R, including **optim()**, **nlm()**, **nlminb()**, and **ucminf()**. Since these optimizers are normally designed to minimize the objective function, we need to add a minus (-) sign to the log likelihood function that we want to maximize, as shown in the minLL() function below. In addition, in order to speed up the optimization process, we can suppress the hessian in the function call. If the hessian is indeed required to calculate standard errors of estimated parameters, it can be computed afterwards by calling the hessian() function in the numDeriv package.

As shown in the benchmark result, although ucminf() is the most efficient optimization function, turning on its hessian option can increase the computing time by 70%. In addition, the second fastest function, nlminb(), has no built-in option to output the hessian. Therefore, it is sometimes preferable to estimate the model parameters first and then calculate the hessian afterwards for analysis purposes, as demonstrated below.

```r
df <- read.csv("Downloads/credit_count.txt")

### DEFINE THE OBJECTIVE FUNCTION ###
minLL <- function(par) {
  mu <- exp(par[1] + par[2] * df$AGE + par[3] * df$ACADMOS + par[4] * df$MINORDRG + par[5] * df$OWNRENT)
  return(ll <- -sum(log(exp(-mu) * (mu ^ df$MAJORDRG) / factorial(df$MAJORDRG))))
}

### BENCHMARKING ###
import::from("rbenchmark", "benchmark")
benchmark(replications = 10, order = "elapsed", relative = "elapsed",
          columns = c("test", "replications", "elapsed", "relative"),
  optim   = {optim(par = rep(0, 5), fn = minLL, hessian = F)},
  nlm     = {nlm(f = minLL, p = rep(0, 5), hessian = F)},
  nlminb  = {nlminb(start = rep(0, 5), objective = minLL)},
  ucminf  = {ucminf::ucminf(par = rep(0, 5), fn = minLL, hessian = 0)},
  hessian = {ucminf::ucminf(par = rep(0, 5), fn = minLL, hessian = 1)}
)
#     test replications elapsed relative
# 4  ucminf           10   4.044    1.000
# 3  nlminb           10   6.444    1.593
# 5 hessian           10   6.973    1.724
# 2     nlm           10   8.292    2.050
# 1   optim           10  12.027    2.974

### HOW TO CALCULATE THE HESSIAN ###
fit <- nlminb(start = rep(0, 5), objective = minLL)
import::from("numDeriv", "hessian")
std <- sqrt(diag(solve(hessian(minLL, fit$par))))
est <- data.frame(beta = fit$par, stder = std, z_values = fit$par / std)
#           beta        stder   z_values
# 1 -1.379324501 0.0438155970 -31.480217
# 2  0.010394876 0.0013645030   7.618068
# 3  0.001532188 0.0001956843   7.829894
# 4  0.461129515 0.0068557359  67.261856
# 5 -0.199393808 0.0283222704  -7.040177
```

It is worth mentioning that, although these general optimizers are fast, they are less user-friendly than wrappers in statistical packages, such as mle() or maxLik(). For instance, we have to calculate AIC or BIC from the log likelihood function, and p-values from Z-scores, on our own.
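To illustrate the extra bookkeeping, the sketch below derives AIC, BIC, standard errors, and p-values by hand from the nlminb() output. Since the credit_count data file is not assumed to be available, it uses a small simulated Poisson regression instead, with the hessian evaluated numerically via the base-R optimHess() function rather than the numDeriv package.

```r
# A minimal sketch (with simulated Poisson data, not the credit_count file)
# showing how AIC / BIC and p-values can be derived by hand from the
# optimizer output, as wrappers like mle() would do for us.
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rpois(n, exp(0.5 + 0.2 * x))

# negative log likelihood of a simple Poisson regression
minLL <- function(par) {
  mu <- exp(par[1] + par[2] * x)
  -sum(dpois(y, mu, log = TRUE))
}

fit <- nlminb(start = c(0, 0), objective = minLL)

# information criteria: 2 * negLL plus a penalty on the parameter count
k <- length(fit$par)
aic <- 2 * fit$objective + 2 * k
bic <- 2 * fit$objective + k * log(n)

# standard errors from the numerically evaluated hessian (base R)
std <- sqrt(diag(solve(optimHess(fit$par, minLL))))

# two-sided p-values based on Z-scores
pvalues <- 2 * pnorm(-abs(fit$par / std))
```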

## Read Random Rows from a Huge CSV File

Since R data frames are stored in memory, it is sometimes beneficial to sample and examine the data in a large csv file before importing it into a data frame. To the best of my knowledge, there is no off-the-shelf R function performing such data sampling at a relatively low computing cost. Therefore, I drafted two utility functions serving this particular purpose, one with the LaF package and the other with the reticulate package, leveraging the power of Python. While the first function is more efficient and samples 3 records out of 336,776 in about 100 milliseconds, the second one is more for fun and a showcase of the reticulate package.

```r
library(LaF)

sample1 <- function(file, n) {
  lf <- laf_open(detect_dm_csv(file, sep = ",", header = TRUE, factor_fraction = -1))
  return(read_lines(lf, sample(1:nrow(lf), n)))
}

sample1("Downloads/nycflights.csv", 3)
#   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
# 1 2013     9  15     1323        -6     1506       -23      MQ  N857MQ   3340
# 2 2013     3  18     1657        -4     2019         9      UA  N35271     80
# 3 2013     6   7     1325        -4     1515       -11      9E  N8477R   3867
#   origin dest air_time distance hour minute
# 1    LGA  DTW       82      502   13     23
# 2    EWR  MIA      157     1085   16     57
# 3    EWR  CVG       91      569   13     25

library(reticulate)

sample2 <- function(file, n) {
  rows <- py_eval(paste("sum(1 for line in open('", file, "'))", sep = '')) - 1
  return(import("pandas")$read_csv(file, skiprows = setdiff(1:rows, sample(1:rows, n))))
}

sample2("Downloads/nycflights.csv", 3)
#   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
# 1 2013    10   9      812        12     1010       -16      9E  N902XJ   3507
# 2 2013     4  30     1218       -10     1407       -30      EV  N18557   4091
# 3 2013     8  25     1111        -4     1238       -27      MQ  N721MQ   3281
#   origin dest air_time distance hour minute
# 1    JFK  MSY      156     1182    8     12
# 2    EWR  IND       92      645   12     18
# 3    LGA  CMH       66      479   11     11
```
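For comparison, a dependency-free sketch is also possible in base R alone. Note that it reads the whole file into memory with readLines(), so it only avoids the parsing cost of unsampled rows, not the I/O cost, and will be slower than the LaF approach on truly huge files. The function name sample0 is just for illustration and not part of the original post.

```r
# A base-R-only sketch for the same task: read the raw lines, keep only
# the sampled ones, and parse just those. No extra package is needed,
# but the full file is still read into memory once.
sample0 <- function(file, n) {
  lines <- readLines(file)                  # header + data rows
  rows <- length(lines) - 1
  picked <- sort(sample(seq_len(rows), n))  # 1-based data-row indices
  read.csv(text = paste(c(lines[1], lines[picked + 1]), collapse = "\n"))
}

# demonstrate on a small temporary csv
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:100, val = rnorm(100)), tmp, row.names = FALSE)
sample0(tmp, 3)
```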

## LogRatio Regression – A Simple Way to Model Compositional Data

Compositional data are proportions of mutually exclusive groups that sum to unity. Statistical models for compositional data are applicable in a number of areas, e.g. the product or channel mix in marketing research and the asset allocation of an investment portfolio.

In the example below, I will show how to model compositional outcomes with a simple LogRatio regression. The underlying idea is very simple. With the D-dimensional outcome [p_1, p_2, …, p_D], we can derive a (D-1)-dimensional outcome [log(p_2 / p_1), …, log(p_D / p_1)] and then estimate a multivariate regression based on the new outcome.

```r
df = get("ArcticLake", envir = asNamespace('DirichletReg'))
#    sand  silt  clay depth
# 1 0.775 0.195 0.030  10.4
# 2 0.719 0.249 0.032  11.7
# 3 0.507 0.361 0.132  12.8

lm(cbind(log(silt / sand), log(clay / sand)) ~ depth, data = df)
#Response log(silt/sand) :
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.649656   0.236733  -2.744   0.0093 **
#depth        0.037522   0.004269   8.790 1.36e-10 ***
#
#Response log(clay/sand) :
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -2.614897   0.421383  -6.206 3.31e-07 ***
#depth        0.062181   0.007598   8.184 8.00e-10 ***
```

Since log(x / y) = log(x) - log(y), we can also estimate the model with log(sand) as an offset term.

```r
lm(cbind(log(silt), log(clay)) ~ depth + offset(log(sand)), data = df)
#Response log(silt) :
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.649656   0.236733  -2.744   0.0093 **
#depth        0.037522   0.004269   8.790 1.36e-10 ***
#
#Response log(clay) :
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -2.614897   0.421383  -6.206 3.31e-07 ***
#depth        0.062181   0.007598   8.184 8.00e-10 ***
```

Alternatively, we can use the comp.reg() function in the Compositional package.

```r
Compositional::comp.reg(as.matrix(df[, 1:3]), df[, 4])
#$be
#                   [,1]        [,2]
#(Intercept) -0.64965598 -2.61489731
#x            0.03752186  0.06218069
#
#$seb
#                   [,1]        [,2]
#(Intercept) 0.236733203 0.421382652
#x           0.004268588 0.007598043
```
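To turn predicted log-ratios back into a composition, we can apply the inverse of the additive log-ratio transformation, which is a softmax over the vector [0, log-ratios], where the implicit 0 corresponds to the reference component (sand). The sketch below is self-contained: it plugs the fitted coefficients reported above into a hypothetical alr_inv() helper to predict the sand / silt / clay mix at a given depth.

```r
# Inverse additive log-ratio transformation: recover proportions that
# sum to one from the (D-1) fitted log-ratios. alr_inv is an
# illustrative helper, not a function from the post.
alr_inv <- function(logratios) {
  expd <- exp(c(0, logratios))  # implicit 0 for the reference component
  expd / sum(expd)              # proportions summing to unity
}

# predicted log-ratios at depth = 50, using the coefficients shown above
depth <- 50
lr <- c(-0.649656 + 0.037522 * depth,  # log(silt / sand)
        -2.614897 + 0.062181 * depth)  # log(clay / sand)
p <- alr_inv(lr)  # c(sand, silt, clay)
```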

## Clojure Integration with R

```clojure
(require '[tnoda.rashinban :as rr]
         '[tnoda.rashinban.core :as rc]
         '[clojure.core.matrix.dataset :as dt]
         '[clojure.core.matrix.impl.dataset :as id])

;; CREATE A TOY DATA
(def ds [{:id 1.0 :name "name1"}
         {:id 2.0 :name "name2"}
         {:id 3.0 :name "name3"}])

;; RUN THE FOLLOWING R CODE IN ADVANCE TO START THE RSERVE SERVER:
;; R -e 'library(Rserve)' -e 'Rserve(args = "--vanilla")'
;; IF YOU HAVE LITTLER INSTALLED, BELOW ALSO WORKS:
;; r -e 'library(Rserve); Rserve(args = "--vanilla")'
(rr/init)

;; PASS THE DATA FROM CLOJURE INTO R
(map (fn [x] (rr/<- (name (key x)) (val x)))
     (let [ks ((comp keys first) ds)] (zipmap ks (map #(map % ds) ks))))
(rr/<- 'header (map name ((comp keys first) ds)))

;; CREATE THE R DATA.FRAME
(rc/eval "df = data.frame(lapply(header, as.name))")

;; TEST THE R DATA.FRAME
(rc/eval "df$id")
; [1.0 2.0 3.0]
(rc/eval "df$name")
; ["name1" "name2" "name3"]

;; CONVERT THE R DATA.FRAME BACK TO THE CLOJURE MAP
(def mp (into [] (map #(zipmap (map keyword (rr/colnames 'df)) %)
                      (partition (count (rr/colnames 'df))
                                 (apply interleave (rr/matrix 'df))))))
; [{:id 1.0, :name "name1"} {:id 2.0, :name "name2"} {:id 3.0, :name "name3"}]

;; TEST THE EQUALITY BETWEEN INPUT AND OUTPUT DATA
(= mp ds)
; true

;; ALTERNATIVELY, WE CAN ALSO CONVERT THE R DATA.FRAME TO A CLOJURE DATASET
(def dt (id/dataset-from-columns (map keyword (rr/colnames 'df)) (rr/matrix 'df)))
; #dataset/dataset {:column-names [:id :name], :columns [[1.0 2.0 3.0] ["name1" "name2" "name3"]], :shape [3 2]}

;; NEXT, CONVERT THE DATASET TO THE MAP
(def mp2 (dt/row-maps dt))
; [{:id 1.0, :name "name1"} {:id 2.0, :name "name2"} {:id 3.0, :name "name3"}]
(= ds mp2)
; true
```

## MLE in R

When I learn and experiment with a new model, I always like to start with its likelihood function in order to gain a better understanding of its statistical nature. That's why I extensively used the SAS/NLMIXED procedure, which gives me more flexibility. Today, I spent a couple of hours playing with the optim() function and its wrappers, e.g. mle() and mle2(), in case I need a replacement for my favorite NLMIXED in model estimation. Overall, I feel that optim() is more flexible. The named list required by mle() or mle2() for initial parameter values is somewhat cumbersome without additional benefits. As shown in the benchmark below, optim() is also the most efficient.

```r
library(COUNT)
library(stats4)
library(bbmle)

data(rwm1984)
attach(rwm1984)

### OPTIM() ###
LogLike1 <- function(par) {
  xb <- par[1] + par[2] * outwork + par[3] * age + par[4] * female + par[5] * married
  mu <- exp(xb)
  ll <- sum(log(exp(-mu) * (mu ^ docvis) / factorial(docvis)))
  return(-ll)
}

fit1 <- optim(rep(0, 5), LogLike1, hessian = TRUE, method = "BFGS")
std1 <- sqrt(diag(solve(fit1$hessian)))
est1 <- data.frame(beta = fit1$par, stder = std1, z_values = fit1$par / std1)
#         beta        stder  z_values
#1 -0.06469676 0.0433207574 -1.493436
#2  0.27264177 0.0214085110 12.735205
#3  0.02283541 0.0008394589 27.202540
#4  0.27461355 0.0210597539 13.039732
#5 -0.11804504 0.0217745647 -5.421236

### MLE() ###
LogLike2 <- function(b0, b1, b2, b3, b4) {
  mu <- exp(b0 + b1 * outwork + b2 * age + b3 * female + b4 * married)
  -sum(log(exp(-mu) * (mu ^ docvis) / factorial(docvis)))
}

inits <- list(b0 = 0, b1 = 0, b2 = 0, b3 = 0, b4 = 0)
fit2 <- mle(LogLike2, method = "BFGS", start = inits)
std2 <- sqrt(diag(vcov(fit2)))
est2 <- data.frame(beta = coef(fit2), stder = std2, z_values = coef(fit2) / std2)
#          beta        stder  z_values
#b0 -0.06469676 0.0433417474 -1.492712
#b1  0.27264177 0.0214081592 12.735414
#b2  0.02283541 0.0008403589 27.173407
#b3  0.27461355 0.0210597350 13.039744
#b4 -0.11804504 0.0217746108 -5.421224

### BENCHMARKS ###
microbenchmark::microbenchmark(
  "optim" = {optim(rep(0, 5), LogLike1, hessian = TRUE, method = "BFGS")},
  "mle"   = {mle(LogLike2, method = "BFGS", start = inits)},
  "mle2"  = {mle2(LogLike2, method = "BFGS", start = inits)},
  times = 10
)
#  expr      min       lq     mean   median       uq      max neval
# optim 280.4829 280.7902 296.9538 284.5886 318.6975 320.5094    10
#   mle 283.6701 286.3797 302.9257 289.8849 327.1047 328.6255    10
#  mle2 387.1912 390.8239 407.5090 392.8134 427.0569 467.0013    10
```

## R Interfaces to Python Keras Package

Keras is a popular Python package for prototyping deep neural networks with multiple back-ends, including TensorFlow, CNTK, and Theano. Currently, there are two R interfaces that allow us to use Keras from R through the reticulate package. While the keras R package provides a flexible and feature-rich API, the kerasR package is more convenient and computationally efficient. For instance, in the example below, which mimics the Python code shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks, the kerasR package is at least 10% faster than the keras package in terms of computing time.

```r
df <- read.csv("credit_count.txt")
Y <- matrix(df[df$CARDHLDR == 1, ]$DEFAULT)
X <- scale(df[df$CARDHLDR == 1, ][3:14])
set.seed(2018)
rows <- sample(1:nrow(Y), nrow(Y) - 2000)
Y1 <- Y[rows, ]
Y2 <- Y[-rows, ]
X1 <- X[rows, ]
X2 <- X[-rows, ]

### USE KERAS PACKAGE (https://keras.rstudio.com) ###
library(keras)

dnn1 <- keras_model_sequential() %>%
  ### DEFINE THE INPUT LAYER ###
  layer_dense(units = 50, activation = 'relu', input_shape = ncol(X), kernel_constraint = constraint_maxnorm(4)) %>%
  layer_dropout(rate = 0.2, seed = 1) %>%
  ### DEFINE THE 1ST HIDDEN LAYER ###
  layer_dense(units = 20, activation = 'relu', kernel_constraint = constraint_maxnorm(4)) %>%
  layer_dropout(rate = 0.2, seed = 1) %>%
  ### DEFINE THE 2ND HIDDEN LAYER ###
  layer_dense(units = 20, activation = 'relu', kernel_constraint = constraint_maxnorm(4)) %>%
  layer_dropout(rate = 0.2, seed = 1) %>%
  layer_dense(units = 1, activation = 'sigmoid') %>%
  compile(loss = 'binary_crossentropy', optimizer = 'sgd', metrics = c('accuracy'))

dnn1 %>% fit(X1, Y1, batch_size = 50, epochs = 20, verbose = 0, validation_split = 0.3)
pROC::roc(as.numeric(Y2), as.numeric(predict_proba(dnn1, X2)))

### USE KERASR PACKAGE (https://github.com/statsmaths/kerasR) ###
library(kerasR)

dnn2 <- Sequential()
### DEFINE THE INPUT LAYER ###
dnn2$add(Dense(units = 50, input_shape = ncol(X), activation = 'relu', kernel_constraint = max_norm(4)))
dnn2$add(Dropout(rate = 0.2, seed = 1))
### DEFINE THE 1ST HIDDEN LAYER ###
dnn2$add(Dense(units = 20, activation = 'relu', kernel_constraint = max_norm(4)))
dnn2$add(Dropout(rate = 0.2, seed = 1))
### DEFINE THE 2ND HIDDEN LAYER ###
dnn2$add(Dense(units = 20, activation = 'relu', kernel_constraint = max_norm(4)))
dnn2$add(Dropout(rate = 0.2, seed = 1))
dnn2$add(Dense(units = 1, activation = 'sigmoid'))

keras_compile(dnn2, loss = 'binary_crossentropy', optimizer = 'sgd', metrics = 'accuracy')
keras_fit(dnn2, X1, Y1, batch_size = 50, epochs = 20, verbose = 0, validation_split = 0.3)
pROC::roc(as.numeric(Y2), as.numeric(keras_predict_proba(dnn2, X2)))
```