## Archive for the ‘**Statistics**’ Category

## Subset by Index in Clojure

In the previous post https://statcompute.wordpress.com/2018/03/23/subset-by-values-in-clojure, it’s been demonstrated how to subset by value in Clojure. In the example below, I would show how to subset by index by using the **keep-indexed** function. The key is to use the keep-indexed function locating the position of each lookup value in the country list and then subset the order list by positions, e.g. index.

(require '[clojure.pprint :as p] '[clojure.java.jdbc :as j]) (def db {:classname "org.sqlite.JDBC" :subprotocol "sqlite" :subname "/home/liuwensui/Downloads/chinook.db"}) (def orders (j/query db "select billingcountry as country, count(*) as orders from invoices group by billingcountry;")) (def clist '("USA" "India" "Canada" "France") (def country (map #(get % :country) orders)) (p/print-table (map #(nth orders %) (flatten (pmap (fn [c] (keep-indexed (fn [i v] (if (= v c) i)) country)) clist)))) ;| :country | :orders | ;|----------+---------| ;| USA | 91 | ;| India | 13 | ;| Canada | 56 | ;| France | 35 |

## SAS Implementation of ZAGA Models

In the previous post https://statcompute.wordpress.com/2017/09/17/model-non-negative-numeric-outcomes-with-zeros/, I gave a brief introduction about the ZAGA (Zero-Adjusted Gamma) model that provides us a very flexible approach to model non-negative numeric responses. Today, I will show how to implement the ZAGA model with SAS, which can be conducted either jointly or by two steps.

In SAS, the FMM procedure provides a very convenient interface to estimate the ZAGA model in 1 simple step. As shown, there are two model statements, e.g. the first one to estimate a Gamma sub-model with positive outcomes and the second used to separate the point-mass at zero from the positive. The subsequent probmodel statement then is employed to estimate the probability of a record being positive.

data ds; set "/folders/myfolders/autoclaim" (keep = clm_amt bluebook npolicy clm_freq5 mvr_pts income); where income ~= .; clm_flg = (clm_amt > 0); run; proc fmm data = ds tech = trureg; model clm_amt = bluebook npolicy / dist = gamma; model clm_amt = / dist = constant; probmodel clm_freq5 mvr_pts income; run;

An alternative way to develop a ZAGA model in two steps is to estimate a logistic regression first separating the point-mass at zero from the positive and then to estimate a Gamma regression with positive outcomes only, as illustrated below. The two-step approach is more intuitive to understand and, more importantly, is easier to implement without convergence issues as in FMM or NLMIXED procedure.

proc logistic data = ds desc; model clm_flg = clm_freq5 mvr_pts income; run; proc genmod data = ds; where clm_flg = 1; model clm_amt = bluebook npolicy / link = log dist = gamma; run;

## Mimicking SQLDF with MonetDBLite

Like many useRs, I am also a big fan of the sqldf package developed by Grothendieck, which uses SQL statement for data frame manipulations with SQLite embedded database as the default back-end.

In examples below, I drafted a couple R utility functions with the MonetDBLite back-end by mimicking the sqldf function. There are several interesting observations shown in the benchmark comparison.

– The data import for csv data files is more efficient with MonetDBLite than with the generic read.csv function or read.csv.sql function in the sqldf package.

– The data manipulation for a single data frame, such as selection, aggregation, and subquery, is also significantly faster with MonetDBLite than with the sqldf function.

– However, the sqldf function is extremely efficient in joining 2 data frames, e.g. inner join in the example.

# IMPORT monet.read.csv <- function(file) { monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:") suppressMessages(MonetDBLite::monetdb.read.csv(monet.con, file, "file", sep = ",")) result <- DBI::dbReadTable(monet.con, "file") DBI::dbDisconnect(monet.con, shutdown = T) return(result) } microbenchmark::microbenchmark(monet = {df <- monet.read.csv("Downloads/nycflights.csv")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # monet 528.5378 532.5463 539.2877 539.0902 542.4301 559.1191 10 microbenchmark::microbenchmark(read.csv = {df <- read.csv("Downloads/nycflights.csv")}, times = 10) #Unit: seconds # expr min lq mean median uq max neval # read.csv 2.310238 2.338134 2.360688 2.343313 2.373913 2.444814 10 # SELECTION AND AGGREGATION monet.sql <- function(df, sql) { df_str <- deparse(substitute(df)) monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:") suppressMessages(DBI::dbWriteTable(monet.con, df_str, df, overwrite = T)) result <- DBI::dbGetQuery(monet.con, sql) DBI::dbDisconnect(monet.con, shutdown = T) return(result) } microbenchmark::microbenchmark(monet = {monet.sql(df, "select * from df sample 3")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # monet 422.761 429.428 439.0438 438.3503 447.3286 453.104 10 microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from df order by RANDOM() limit 3")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # sqldf 903.9982 908.256 925.4255 920.2692 930.0934 963.6983 10 microbenchmark::microbenchmark(monet = {monet.sql(df, "select origin, median(distance) as med_dist from df group by origin")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # monet 450.7862 456.9589 458.6389 458.9634 460.4402 465.2253 10 microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select origin, median(distance) as med_dist from df group by origin")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # sqldf 833.1494 836.6816 841.952 843.5569 846.8117 851.0771 10 microbenchmark::microbenchmark(monet = {monet.sql(df, "with df1 as (select dest, avg(distance) as dist from df group by dest), df2 as (select dest, count(*) as cnts from df group by dest) select * from df1 inner join df2 on (df1.dest = df2.dest)")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # monet 426.0248 431.2086 437.634 438.4718 442.8799 451.275 10 microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from (select dest, avg(distance) as dist from df group by dest) df1 inner join (select dest, count(*) as cnts from df group by dest) df2 on (df1.dest = df2.dest)")}, times = 10) #Unit: seconds # expr min lq mean median uq max neval # sqldf 1.013116 1.017248 1.024117 1.021555 1.025668 1.048133 10 # MERGE monet.sql2 <- function(df1, df2, sql) { df1_str <- deparse(substitute(df1)) df2_str <- deparse(substitute(df2)) monet.con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), ":memory:") suppressMessages(DBI::dbWriteTable(monet.con, df1_str, df1, overwrite = T)) suppressMessages(DBI::dbWriteTable(monet.con, df2_str, df2, overwrite = T)) result <- DBI::dbGetQuery(monet.con, sql) DBI::dbDisconnect(monet.con, shutdown = T) return(result) } tbl1 <- monet.sql(df, "select dest, avg(distance) as dist from df group by dest") tbl2 <- monet.sql(df, "select dest, count(*) as cnts from df group by dest") microbenchmark::microbenchmark(monet = {monet.sql2(tbl1, tbl2, "select * from tbl1 inner join tbl2 on (tbl1.dest = tbl2.dest)")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # monet 93.94973 174.2211 170.7771 178.487 182.4724 187.3155 10 microbenchmark::microbenchmark(sqldf = {sqldf::sqldf("select * from tbl1 inner join tbl2 on (tbl1.dest = tbl2.dest)")}, times = 10) #Unit: milliseconds # expr min lq mean median uq max neval # sqldf 19.49334 19.60981 20.29535 20.001 20.93383 21.51837 10

## MLE with General Optimization Functions in R

In my previous post (https://statcompute.wordpress.com/2018/02/25/mle-in-r/), it is shown how to estimate the MLE based on the log likelihood function with the general-purpose optimization algorithm, e.g. optim(), and that the optimizer is more flexible and efficient than wrappers in statistical packages.

A benchmark comparison are given below showing the use case of other general optimizers commonly used in R, including **optim()**, **nlm()**, **nlminb()**, and **ucminf()**. Since these optimizers are normally designed to minimize the objective function, we need to add a minus (-) sign to the log likelihood function that we want to maximize, as shown in the minLL() function below. In addition, in order to speed up the optimization process, we can suppress the hessian in the function call. If indeed the hessian is required to calculate standard errors of estimated parameters, it can be calculated by calling the hessian() function in the numDeriv package.

As shown in the benchmark result, although the ucminf() is the most efficient optimization function, a hessian option can increase the computing time by 70%. In addition, in the second fastest nlminb() function, there is no built-in option to output the hessian. Therefore, sometimes it might be preferable to estimate model parameters first and then calculate the hessian afterwards for the analysis purpose, as demonstrated below.

df <- read.csv("Downloads/credit_count.txt") ### DEFINE THE OBJECTIVE FUNCTION ### minLL <- function(par) { mu <- exp(par[1] + par[2] * df$AGE + par[3] * df$ACADMOS + par[4] * df$MINORDRG + par[5] * df$OWNRENT) return(ll <- -sum(log(exp(-mu) * (mu ^ df$MAJORDRG) / factorial(df$MAJORDRG)))) } ### BENCHMARKING ### import::from("rbenchmark", "benchmark") benchmark(replications = 10, order = "elapsed", relative = "elapsed", columns = c("test", "replications", "elapsed", "relative"), optim = {optim(par = rep(0, 5), fn = minLL, hessian = F)}, nlm = {nlm(f = minLL, p = rep(0, 5), hessian = F)}, nlminb = {nlminb(start = rep(0, 5), objective = minLL)}, ucminf = {ucminf::ucminf(par = rep(0, 5), fn = minLL, hessian = 0)}, hessian = {ucminf::ucminf(par = rep(0, 5), fn = minLL, hessian = 1)} ) # test replications elapsed relative # 4 ucminf 10 4.044 1.000 # 3 nlminb 10 6.444 1.593 # 5 hessian 10 6.973 1.724 # 2 nlm 10 8.292 2.050 # 1 optim 10 12.027 2.974 ### HOW TO CALCULATE THE HESSIAN ### fit <- nlminb(start = rep(0, 5), objective = minLL) import::from("numDeriv", "hessian") std <- sqrt(diag(solve(hessian(minLL, fit$par)))) est <- data.frame(beta = fit$par, stder = std, z_values = fit$par / std) # beta stder z_values # 1 -1.379324501 0.0438155970 -31.480217 # 2 0.010394876 0.0013645030 7.618068 # 3 0.001532188 0.0001956843 7.829894 # 4 0.461129515 0.0068557359 67.261856 # 5 -0.199393808 0.0283222704 -7.040177

It is worth mentioning that, although these general optimizers are fast, they are less user-friendly than wrappers in statistical packages, such as mle or maxLik. For instance, we have to calculate AIC or BIC based on the log likelihood function or p-values based on Z-scores.

## Read Random Rows from A Huge CSV File

Given R data frames stored in the memory, sometimes it is beneficial to sample and examine the data in a large-size csv file before importing into the data frame. To the best of my knowledge, there is no off-shelf R function performing such data sampling with a relatively low computing cost. Therefore, I drafted two utility functions serving this particular purpose, one with the LaF library and the other with the reticulate library by leveraging the power of Python. While the first function is more efficient and samples 3 records out of 336,776 in about 100 milliseconds, the second one is more for fun and a showcase of the reticulate package.

library(LaF) sample1 <- function(file, n) { lf <- laf_open(detect_dm_csv(file, sep = ",", header = TRUE, factor_fraction = -1)) return(read_lines(lf, sample(1:nrow(lf), n))) } sample1("Downloads/nycflights.csv", 3) # year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight # 1 2013 9 15 1323 -6 1506 -23 MQ N857MQ 3340 # 2 2013 3 18 1657 -4 2019 9 UA N35271 80 # 3 2013 6 7 1325 -4 1515 -11 9E N8477R 3867 # origin dest air_time distance hour minute # 1 LGA DTW 82 502 13 23 # 2 EWR MIA 157 1085 16 57 # 3 EWR CVG 91 569 13 25 library(reticulate) sample2 <- function(file, n) { rows <- py_eval(paste("sum(1 for line in open('", file, "'))", sep = '')) - 1 return(import("pandas")$read_csv(file, skiprows = setdiff(1:rows, sample(1:rows, n)))) } sample2("Downloads/nycflights.csv", 3) # year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight # 1 2013 10 9 812 12 1010 -16 9E N902XJ 3507 # 2 2013 4 30 1218 -10 1407 -30 EV N18557 4091 # 3 2013 8 25 1111 -4 1238 -27 MQ N721MQ 3281 # origin dest air_time distance hour minute # 1 JFK MSY 156 1182 8 12 # 2 EWR IND 92 645 12 18 # 3 LGA CMH 66 479 11 11

## Updating Column Values in Clojure Map

(require '[huri.core :as h] '[clojure.core.matrix.dataset :as d] '[incanter.core :as i]) (def ds [{:id 1.0 :name "name1"} {:id 2.0 :name "name2"} {:id 3.0 :name "name3"}]) ;; UPDATE THE :NAME COLUMN IN THE DATASET ;; - IF THE VALUE IS NOT "NAME2", THEN CHANGE TO "NOT 2" ;; ;; EXPECTED OUTPUT: ;; | :id | :name | ;; |-----+-------| ;; | 1.0 | not 2 | ;; | 2.0 | name2 | ;; | 3.0 | not 2 | ;; WITH CLOJURE.CORE/UPDATE (def d1 (map (fn [x] (update x :name #(if (= "name2" %) % "not 2"))) ds)) ;; WITH CLOJURE.CORE/UPDATE-IN (def d2 (map (fn [x] (update-in x [:name] #(if (= "name2" %) % "not 2"))) ds)) ;; WITH HURI/UPDATE-COLS (def d3 (h/update-cols {:name #(if (= "name2" %) % "not 2")} ds)) ;; WITH MATRIX.DATASET/EMAP-COLUMN (def d4 (-> ds (d/dataset) (d/emap-column :name #(if (= "name2" %) % "not 2")) ((comp #(map into %) d/row-maps)))) ;; WITH INCANTER/TRANSFORM-COL (def d5 (-> ds (i/to-dataset) (i/transform-col :name #(if (= "name2" %) % "not 2")) ((comp #(map into %) second vals))))

## Adding New Columns to Clojure Map

(require '[huri.core :as h] '[clojure.core.matrix.dataset :as d] '[incanter.core :as i]) (def ds [{:id 1.0 :name "name1"} {:id 2.0 :name "name2"} {:id 3.0 :name "name3"}]) ;; ADD 2 COLUMNS TO THE DATASET ;; - ADD 2 TO ID AND NAME ADD2 ;; - CHECK NAME = "name2" AND NAME NAME2 ;; ;; EXPECTED OUTPUT: ;;| :id | :name | :add2 | :name2 | ;;|-----+-------+-------+--------| ;;| 1.0 | name1 | 3.0 | N | ;;| 2.0 | name2 | 4.0 | Y | ;;| 3.0 | name3 | 5.0 | N | ;; WITH PLAIN CLOJURE ;; #1 - MERGE (def d1 (map #(merge % {:add2 (+ (:id %) 2) :name2 (if (= "name2" (:name %)) "Y" "N")}) ds)) ;; #2 - MERGE-WITH (def d2 (map #(merge-with into % {:add2 (+ (:id %) 2) :name2 (if (= "name2" (:name %)) "Y" "N")}) ds)) ;; #3 - ASSOC (def d3 (map #(assoc % :add2 (+ (:id %) 2) :name2 (if (= "name2" (:name %)) "Y" "N")) ds)) ;; #4 - CONJ (def d4 (map #(conj % {:add2 (+ (:id %) 2) :name2 (if (= "name2" (:name %)) "Y" "N")}) ds)) ;; #5 - CONCAT (def d5 (map #(into {} (concat % {:add2 (+ (:id %) 2) :name2 (if (= "name2" (:name %)) "Y" "N")})) ds)) ;; WITH HURI (def d6 (h/derive-cols {:name2 [#(if (= "name2" %) "Y" "N") :name] :add2 [#(+ 2 %) :id]} ds)) ;; WITH CORE.MATRIX API (def d7 (-> ds (d/dataset) (d/add-column :add2 (map #(+ 2 %) (map :id ds))) (d/add-column :name2 (map #(if (= "name2" %) "Y" "N") (map :name ds))) (d/row-maps))) ;; WITH INCANTER API (def d8 (->> ds (i/to-dataset) (i/add-derived-column :add2 [:id] #(+ 2 %)) (i/add-derived-column :name2 [:name] #(if (= "name2" %) "Y" "N")) ((comp second vals)))) ;; CHECK THE DATA EQUALITY (= d1 d2 d3 d4 d5 d6 d7 d8) ;; true