## Parse CSV File with Headers in Clojure

When importing CSV files into Incanter datasets, we might tend to use the read-dataset function in the Incanter package. Today, after messing around with several CSV parsers in Clojure, I found that it might be a better strategy to import the CSV file with another parser as a LazySeq and then convert it to an Incanter dataset.

```clojure
; (defproject prj "0.1.0-SNAPSHOT"
;   :dependencies [[org.clojure/clojure "1.8.0"]
;                  [org.clojars.bmabey/csvlib "0.3.6"]
;                  [ultra-csv "0.2.1"]
;                  [incanter "1.5.7"]
;                  [semantic-csv "0.2.1-alpha1"]
;                  [org.clojure/data.csv "0.1.4"]])

(require '[csvlib :as csvlib]
         '[ultra-csv.core :as ultra]
         '[semantic-csv.core :as semantic]
         '[clojure-csv.core :as csv]
         '[clojure.java.io :as io]
         '[incanter.core :as ic]
         '[incanter.io :as ii])

;; INCANTER PACKAGE
(time (ic/$ (range 0 3) '(:dest :origin :hour)
            (ii/read-dataset "/home/liuwensui/Downloads/nycflights.csv" :header true :delim \,)))
; "Elapsed time: 14085.382359 msecs"
; | :dest | :origin | :hour |
; |-------+---------+-------|
; |   IAH |     EWR |     5 |
; |   IAH |     LGA |     5 |
; |   MIA |     JFK |     5 |

;; CSVLIB PACKAGE
(time (ic/$ (range 0 3) '(:dest :origin :hour)
            (ic/to-dataset (csvlib/read-csv "/home/liuwensui/Downloads/nycflights.csv" :headers? true))))
; "Elapsed time: 0.933955 msecs"
; | dest | origin | hour |
; |------+--------+------|
; |  IAH |    EWR |    5 |
; |  IAH |    LGA |    5 |
; |  MIA |    JFK |    5 |

;; ULTRA-CSV PACKAGE
(time (ic/$ (range 0 3) '(:dest :origin :hour)
            (ic/to-dataset (ultra/read-csv "/home/liuwensui/Downloads/nycflights.csv"))))
; "Elapsed time: 30.599593 msecs"
; | :dest | :origin | :hour |
; |-------+---------+-------|
; |   IAH |     EWR |     5 |
; |   IAH |     LGA |     5 |
; |   MIA |     JFK |     5 |

;; SEMANTIC-CSV PACKAGE
(time (ic/$ (range 0 3) '(:dest :origin :hour)
            (ic/to-dataset
              (with-open [in-file (io/reader "/home/liuwensui/Downloads/nycflights.csv")]
                (->> (csv/parse-csv in-file)
                     semantic/mappify
                     doall)))))
; "Elapsed time: 8338.366317 msecs"
; | :dest | :origin | :hour |
; |-------+---------+-------|
; |   IAH |     EWR |     5 |
; |   IAH |     LGA |     5 |
; |   MIA |     JFK |     5 |
```

## For Loop and Map in Clojure

```clojure
;; DEFINE THE DATABASE
(def db {:classname   "org.sqlite.JDBC"
         :subprotocol "sqlite"
         :subname     "/home/liuwensui/Downloads/chinook.db"})

;; CALL PACKAGES
(require '[clojure.java.jdbc :as j]
         '[criterium.core :as c]
         '[clojure.core.reducers :as r])

;; DEFINE THE QUERY TO PULL THE TABLE LIST
(def qry (str "select tbl_name as tbl from sqlite_master where type = 'table';"))

;; PULL THE TABLE LIST AND CONVERT IT TO A LIST
(def tbl (apply list (for [i (j/query db qry)] (get i :tbl))))

;; PRINT 5 RECORDS OF THE TABLE LIST
(doseq [i (take 5 tbl)] (prn i))
; "albums"
; "sqlite_sequence"
; "artists"
; "customers"
; "employees"

;; DEFINE A FUNCTION TO COUNT RECORDS IN A TABLE
(defn cnt [x]
  (j/query db (str "select '" x "' as tbl, count(*) as cnt from " x ";")))

;; TEST THE FUNCTION
(cnt "albums")
; ({:tbl "albums", :cnt 347})

;; FOR LOOP
;; NOTE: for returns a lazy seq, so this timing covers only creating the
;; lazy seq, not executing the queries
(c/quick-bench (for [i tbl] (cnt i)))
; Execution time mean : 14.156879 ns

;; MAP FUNCTION
;; NOTE: map is also lazy, hence the nanosecond-scale timing
(c/quick-bench (map cnt tbl))
; Execution time mean : 15.623790 ns

;; PMAP FUNCTION - PARALLEL MAP
(c/quick-bench (pmap cnt tbl))
; Execution time mean : 456.780027 µs

;; REDUCERS MAP FUNCTION
(defn rmap [f l] (into () (r/map f l)))
(c/quick-bench (rmap cnt tbl))
; Execution time mean : 8.376753 ms
```

## Clojure and SQLite

First of all, we need to define the database specification.

```clojure
(def db {:classname   "org.sqlite.JDBC"
         :subprotocol "sqlite"
         :subname     "/home/liuwensui/Downloads/chinook.db"})
```

We can use JDBC to query data from the table, which is the same approach mentioned in “Clojure Data Analysis Cookbook”.

```clojure
;; project.clj
;; (defproject prj "0.1.0-SNAPSHOT"
;;   :dependencies [[org.clojure/clojure "1.8.0"]
;;                  [org.clojure/java.jdbc "0.7.5"]
;;                  [org.xerial/sqlite-jdbc "3.7.2"]])

(require '[clojure.pprint :as p]
         '[clojure.java.jdbc :as j])

(p/print-table
  (j/query db (str "select tbl_name, type from sqlite_master where type = 'table' limit 3;")))
;; |       :tbl_name | :type |
;; |-----------------+-------|
;; |          albums | table |
;; | sqlite_sequence | table |
;; |         artists | table |
```

Alternatively, we can use the ClojureQL package, as shown below.

```clojure
;; project.clj
;; (defproject prj "0.1.0-SNAPSHOT"
;;   :dependencies [[org.clojure/clojure "1.8.0"]
;;                  [clojureql "1.0.5"]
;;                  [org.xerial/sqlite-jdbc "3.7.2"]])

(require '[clojure.pprint :as p]
         '[clojureql.core :as l])

(p/print-table
  @(-> (l/select (l/table db :sqlite_master)
                 (l/where (= :type "table")))
       (l/take 3)
       (l/project [:tbl_name :type])))
;; | :type |       :tbl_name |
;; |-------+-----------------|
;; | table |          albums |
;; | table | sqlite_sequence |
;; | table |         artists |
```

After the data import, we can easily convert the query result to an Incanter dataset or a Clojure map for further data munging.

## MLE in R

When I learn and experiment with a new model, I always like to start with its likelihood function in order to gain a better understanding of its statistical nature. That's why I have extensively used the SAS/NLMIXED procedure, which gives me more flexibility. Today, I spent a couple of hours playing with the optim() function and its wrappers, e.g. mle() and mle2(), in case I might need a replacement for my favorite NLMIXED in model estimation. Overall, I feel that optim() is more flexible. The named list required by mle() or mle2() for the initial values of parameters is somewhat cumbersome without additional benefits. As shown in the benchmark below, optim() is also the most efficient.
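All three routines below minimize the same negative log-likelihood of a Poisson regression with a log link, where y is the outcome docvis and x collects the intercept and the four covariates. Written out:

```latex
% Poisson log-likelihood with mu_i = exp(x_i' * beta)
\ell(\beta) \;=\; \sum_{i=1}^{n} \Big[ -\mu_i + y_i \log \mu_i - \log(y_i!) \Big],
\qquad \mu_i = \exp\!\big(x_i^{\top}\beta\big)
```

The code maximizes this likelihood by passing the negated sum, -ℓ(β), to the minimizers.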

```r
library(COUNT)
library(stats4)
library(bbmle)

data(rwm1984)
attach(rwm1984)

### OPTIM() ###
LogLike1 <- function(par) {
  xb <- par[1] + par[2] * outwork + par[3] * age + par[4] * female + par[5] * married
  mu <- exp(xb)
  ll <- sum(log(exp(-mu) * (mu ^ docvis) / factorial(docvis)))
  return(-ll)
}

fit1 <- optim(rep(0, 5), LogLike1, hessian = TRUE, method = "BFGS")
std1 <- sqrt(diag(solve(fit1$hessian)))
est1 <- data.frame(beta = fit1$par, stder = std1, z_values = fit1$par / std1)
#         beta        stder  z_values
#1 -0.06469676 0.0433207574 -1.493436
#2  0.27264177 0.0214085110 12.735205
#3  0.02283541 0.0008394589 27.202540
#4  0.27461355 0.0210597539 13.039732
#5 -0.11804504 0.0217745647 -5.421236

### MLE() ###
LogLike2 <- function(b0, b1, b2, b3, b4) {
  mu <- exp(b0 + b1 * outwork + b2 * age + b3 * female + b4 * married)
  -sum(log(exp(-mu) * (mu ^ docvis) / factorial(docvis)))
}

inits <- list(b0 = 0, b1 = 0, b2 = 0, b3 = 0, b4 = 0)
fit2 <- mle(LogLike2, method = "BFGS", start = inits)
std2 <- sqrt(diag(vcov(fit2)))
est2 <- data.frame(beta = coef(fit2), stder = std2, z_values = coef(fit2) / std2)
#          beta        stder  z_values
#b0 -0.06469676 0.0433417474 -1.492712
#b1  0.27264177 0.0214081592 12.735414
#b2  0.02283541 0.0008403589 27.173407
#b3  0.27461355 0.0210597350 13.039744
#b4 -0.11804504 0.0217746108 -5.421224

### BENCHMARKS ###
microbenchmark::microbenchmark(
  "optim" = {optim(rep(0, 5), LogLike1, hessian = TRUE, method = "BFGS")},
  "mle"   = {mle(LogLike2, method = "BFGS", start = inits)},
  "mle2"  = {mle2(LogLike2, method = "BFGS", start = inits)},
  times = 10
)
#  expr      min       lq     mean   median       uq      max neval
# optim 280.4829 280.7902 296.9538 284.5886 318.6975 320.5094    10
#   mle 283.6701 286.3797 302.9257 289.8849 327.1047 328.6255    10
#  mle2 387.1912 390.8239 407.5090 392.8134 427.0569 467.0013    10
```

## Modeling Dollar Amounts in Regression Setting

After switching roles from credit risk to operational risk in 2015, I spent countless weekend hours at Starbucks researching how to model operational losses in a regression setting in light of the heightened regulatory scrutiny. While I feel very comfortable with various frequency models, how to model severity and loss remains challenging both conceptually and empirically. The same challenge also holds true for modeling other financial measures in dollar amounts, such as balance, profit, or cost.

Most practitioners still prefer modeling severity and loss under the Gaussian distributional assumption, explicitly or implicitly. In practice, there are three commonly used approaches, as elaborated below.

– First of all, the simple OLS regression modeling severity and loss directly without any transformation remains the number one choice due to its simplicity. However, given the inconsistency between the empirical data range and the conceptual domain of a Gaussian distribution, it is evident that this approach is problematic.

– Secondly, the OLS regression modeling LOG-transformed severity and loss under the Lognormal distributional assumption is also a common approach. In this method, Log(Y) instead of Y is estimated. However, since E(Log(Y)|X) != Log(E(Y|X)), an estimation bias is introduced and should be corrected, e.g. by adding MSE / 2 back on the log scale. In addition, the strictly positive domain of a Lognormal might not work well for losses with a lower bound that can be either zero or a known threshold value.

– Lastly, the Tobit regression under a censored Normal distribution seems a viable solution that supports the non-negative or above-threshold values observed in severity or loss measures. Nonetheless, the censorship assumption itself is questionable, given that the unobservability of negative or below-threshold values is not due to censoring but attributable to the nature of the data-collection process. Therefore, the argument for data censorship is not well supported.
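To make the retransformation bias in the second approach concrete: if Log(Y) given X is Normal with mean x'β and variance σ², the conditional mean of Y on the original scale is

```latex
E(Y \mid X) \;=\; \exp\!\left( x^{\top}\beta + \frac{\sigma^{2}}{2} \right)
\;\neq\; \exp\!\big( E(\log Y \mid X) \big) \;=\; \exp\!\left( x^{\top}\beta \right)
```

so simply exponentiating the fitted values of the log-scale OLS understates the conditional mean, and adding MSE / 2 (the sample estimate of σ²/2) to the linear predictor before exponentiating corrects the bias.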

Considering the aforementioned challenges, I investigated and experimented with various approaches given different data characteristics observed empirically.

– In cases of severity or loss observed in the range of (0, inf), a GLM under the Gamma or Inverse Gaussian distributional assumption can be considered (https://statcompute.wordpress.com/2015/08/16/some-considerations-of-modeling-severity-in-operational-losses). In addition, the mean-variance relationship can be employed to identify the appropriate distribution, by either the modified Park test (https://statcompute.wordpress.com/2016/11/20/modified-park-test-in-sas) or the value of the power parameter in the Tweedie distribution (https://statcompute.wordpress.com/2017/06/24/using-tweedie-parameter-to-identify-distributions).

– In cases of severity or loss observed in the range of [alpha, inf) with alpha being positive, a regression under the type-I Pareto distribution (https://statcompute.wordpress.com/2016/12/11/estimate-regression-with-type-i-pareto-response) can be considered. However, there is a caveat that the conditional mean only exists when the shape parameter is greater than 1.

– In cases of severity or loss observed in the range of [0, inf) with a small number of zeros, a regression under the Lomax distribution (https://statcompute.wordpress.com/2016/11/13/parameter-estimation-of-pareto-type-ii-distribution-with-nlmixed-in-sas) or the Tweedie distribution (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm) can be considered. For the Lomax model, it is worth pointing out that the shape parameter alpha has to be greater than 2 in order to have both the mean and the variance defined.

– In cases of severity or loss observed in the range of [0, inf) with many zeros, a ZAGA or ZAIG model (https://statcompute.wordpress.com/2017/09/17/model-non-negative-numeric-outcomes-with-zeros) can be considered by assuming the measure is governed by a mixture between a point mass at zero and a standard Gamma or Inverse Gaussian distribution. As a result, a ZA model consists of 2 sub-models: a nu model separating zeros from positive values and a mu model estimating the conditional mean of the positive values.
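For the zero-adjusted models in the last case, writing ν for the probability of a zero from the nu model and μ for the conditional mean of the positive part from the mu model, the overall expectation combines the two sub-models:

```latex
P(Y = 0 \mid X) = \nu,
\qquad
E(Y \mid X) = (1 - \nu)\,\mu
```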

## R Interfaces to Python Keras Package

Keras is a popular Python package for prototyping deep neural networks with multiple backends, including TensorFlow, CNTK, and Theano. Currently, there are two R interfaces that allow us to use Keras from R through the reticulate package. While the keras R package provides a flexible and feature-rich API, the kerasR package is more convenient and computationally efficient. For instance, in the example below, which mimics the Python code shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks, the kerasR package is at least 10% faster than the keras package in terms of computing time.

```r
df <- read.csv("credit_count.txt")
Y <- matrix(df[df$CARDHLDR == 1, ]$DEFAULT)
X <- scale(df[df$CARDHLDR == 1, ][3:14])
set.seed(2018)
rows <- sample(1:nrow(Y), nrow(Y) - 2000)
Y1 <- Y[rows, ]
Y2 <- Y[-rows, ]
X1 <- X[rows, ]
X2 <- X[-rows, ]

### USE KERAS PACKAGE (https://keras.rstudio.com) ###
library(keras)

dnn1 <- keras_model_sequential() %>%
  ### DEFINE THE INPUT LAYER ###
  layer_dense(units = 50, activation = 'relu', input_shape = ncol(X),
              kernel_constraint = constraint_maxnorm(4)) %>%
  layer_dropout(rate = 0.2, seed = 1) %>%
  ### DEFINE THE 1ST HIDDEN LAYER ###
  layer_dense(units = 20, activation = 'relu', kernel_constraint = constraint_maxnorm(4)) %>%
  layer_dropout(rate = 0.2, seed = 1) %>%
  ### DEFINE THE 2ND HIDDEN LAYER ###
  layer_dense(units = 20, activation = 'relu', kernel_constraint = constraint_maxnorm(4)) %>%
  layer_dropout(rate = 0.2, seed = 1) %>%
  layer_dense(units = 1, activation = 'sigmoid') %>%
  compile(loss = 'binary_crossentropy', optimizer = 'sgd', metrics = c('accuracy'))

dnn1 %>% fit(X1, Y1, batch_size = 50, epochs = 20, verbose = 0, validation_split = 0.3)
pROC::roc(as.numeric(Y2), as.numeric(predict_proba(dnn1, X2)))

### USE KERASR PACKAGE (https://github.com/statsmaths/kerasR) ###
library(kerasR)

dnn2 <- Sequential()
### DEFINE THE INPUT LAYER ###
dnn2$add(Dense(units = 50, input_shape = ncol(X), activation = 'relu',
               kernel_constraint = max_norm(4)))
dnn2$add(Dropout(rate = 0.2, seed = 1))
### DEFINE THE 1ST HIDDEN LAYER ###
dnn2$add(Dense(units = 20, activation = 'relu', kernel_constraint = max_norm(4)))
dnn2$add(Dropout(rate = 0.2, seed = 1))
### DEFINE THE 2ND HIDDEN LAYER ###
dnn2$add(Dense(units = 20, activation = 'relu', kernel_constraint = max_norm(4)))
dnn2$add(Dropout(rate = 0.2, seed = 1))
dnn2$add(Dense(units = 1, activation = 'sigmoid'))

keras_compile(dnn2, loss = 'binary_crossentropy', optimizer = 'sgd', metrics = 'accuracy')
keras_fit(dnn2, X1, Y1, batch_size = 50, epochs = 20, verbose = 0, validation_split = 0.3)
pROC::roc(as.numeric(Y2), as.numeric(keras_predict_proba(dnn2, X2)))
```

## Additional Thoughts on Estimating LGD with Proportional Odds Model

In my previous post (https://statcompute.wordpress.com/2018/01/28/modeling-lgd-with-proportional-odds-model), I discussed how to use Proportional Odds Models in LGD model development. In particular, I mentioned that we would estimate a sub-model, which can be a Gamma or Simplex regression, to project the conditional mean of LGD values in the (0, 1) range. However, it is worth pointing out that, if we define a finer LGD segmentation, this sub-model becomes optional. A standalone Proportional Odds Model without any sub-model is more than sufficient for the purpose of stress testing, e.g. CCAR.

In the example below, I will define 5 categories based upon LGD values in the [0, 1] range, estimate a Proportional Odds Model as usual, and then demonstrate how to apply the model outcome in a stress-testing setting with a stressed model input, e.g. LTV.

First of all, I defined 5 instead of 3 categories for LGD values, as shown below. Nonetheless, we could use an even finer category definition in practice to achieve a more accurate outcome.

```r
df <- read.csv("lgd.csv")
df$lgd <- round(1 - df$Recovery_rate, 4)
l1 <- c(-Inf, 0, 0.0999, 0.4999, 0.9999, Inf)
l2 <- c("A", "B", "C", "D", "E")
df$lgd_cat <- cut(df$lgd, breaks = l1, labels = l2, ordered_result = T)
summary(df$lgd_cat)

m1 <- ordinal::clm(lgd_cat ~ LTV, data = df)
#Coefficients:
#    Estimate Std. Error z value Pr(>|z|)
#LTV   2.3841     0.1083   22.02   <2e-16 ***
#
#Threshold coefficients:
#    Estimate Std. Error z value
#A|B  0.54082    0.07897   6.848
#B|C  2.12270    0.08894  23.866
#C|D  3.18098    0.10161  31.307
#D|E  4.80338    0.13174  36.460
```

After the model estimation, it is straightforward to calculate the probability of each LGD category. The only question remaining is how to calculate the LGD projection for each individual account as well as for the whole portfolio. In order to calculate the LGD projection, we need two factors, namely the probability and the expected mean of each LGD category, such that

Estimated_LGD = SUM_i [Prob(category i) * LGD_Mean(category i)], where i = A, B, C, D, and E in this particular case.
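Under the proportional odds parameterization used by ordinal::clm, with thresholds θ_j and slope β, the pieces of this formula are the cumulative probabilities and their successive differences:

```latex
P(Y \le j \mid x) = \frac{\exp(\theta_j - \beta x)}{1 + \exp(\theta_j - \beta x)},
\qquad
P(Y = j \mid x) = P(Y \le j \mid x) - P(Y \le j - 1 \mid x)
```

which is exactly how the category probabilities are computed from m1$Theta and m1$beta in the calculation that follows.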

The calculation is shown below, with an estimated LGD = 0.23 that is consistent with the actual LGD = 0.23 for the whole portfolio.

```r
prob_A <- exp(df$LTV * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[1]))
prob_B <- exp(df$LTV * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[2])) - prob_A
prob_C <- exp(df$LTV * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B
prob_D <- exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C
prob_E <- 1 - exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4]))

pred <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E)
sum(apply(pred, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2])
#[1] 0.2262811
```

One might wonder how to apply the model outcome with simple averages in stress testing, where the model input is stressed, e.g. made more severe, and might also be concerned about the lack of model sensitivity. In the demonstration below, let's stress the model input LTV by 50% and then evaluate the stressed LGD.

```r
df$LTV_ST <- df$LTV * 1.5

prob_A <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[1]))
prob_B <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[2])) - prob_A
prob_C <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B
prob_D <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C
prob_E <- 1 - exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]))

pred_ST <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E)
sum(apply(pred_ST, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2])
#[1] 0.3600153
```

As shown above, although we only use simple averages as the expected mean of each LGD category, the overall LGD still increases by ~60%. The reason is that, with a more stressed model input, the Proportional Odds Model pushes more accounts into categories with higher LGD. For instance, the output below shows that, if LTV is stressed by 50% across the board, ~146% more accounts would roll into the most severe LGD category without any recovery.

```r
apply(pred_ST, 2, mean) / apply(pred, 2, mean)
#   prob_A    prob_B    prob_C    prob_D    prob_E
#0.6715374 0.7980619 1.0405573 1.4825803 2.4639293
```