## Query CSV Data with Apache Drill

The most interesting feature of Apache Drill (https://drill.apache.org) is the functionality of querying the raw data in original sources, e.g. json, parquet, or even csv, directly from the file system through the so-called storage plugin. In the demonstration below, I will show how to extract the data from a csv file.

To do the test drive of Apache Drill without the installation, I downloaded the docker container from https://hub.docker.com/r/harisekhon/apache-drill and then started the container by “docker run -it harisekhon/apache-drill”.

Once in the Drill shell, we can use “show schemas;” to list storage plugins that have been configured in the system. As shown below, while there are several default storage plugins, the one that we will use is “dfs” that points to the local file system.

0: jdbc:drill:zk=local> show schemas; +---------------------+ | SCHEMA_NAME | +---------------------+ | INFORMATION_SCHEMA | | cp.default | | dfs.default | | dfs.root | | dfs.tmp | | sys | +---------------------+

We can explore file formats supported in the dfs local file system by the command “describe schema dfs;”. As shown below, in the pre-configured storage plugin, 2 entries are specifically related to the csv data format that will be used later. The entry “csv” supports data files without headers and the entry “csvh” supports data files with headers.

"csv" : { "type" : "text", "extensions" : [ "csv" ], "delimiter" : "," }, "csvh" : { "type" : "text", "extensions" : [ "csvh" ], "extractHeader" : true, "delimiter" : "," }

Next, we can use the command “show files in dfs;” to explore the file system. For instance, running the command “show files in dfs.`apache-drill/sample-data`;” will list all files in the folder “sample-data”, which includes a csv file named “nycflights.csv” that was copied from the host by “docker cp nycflights.csv 1fc4a0087a3e:/apache-drill/sample-data”, where “1fc4a0087a3e” is the container_id.

From the initial output below, it appears that there are headers in this csv file.

0: jdbc:drill:zk=local> select columns[0] as x1, columns[1] as x2 from dfs.`apache-drill/sample-data/nycflights.csv` limit 5; +-------+--------+ | x1 | x2 | +-------+--------+ | year | month | | 2013 | 1 | | 2013 | 1 | | 2013 | 1 | | 2013 | 1 | +-------+--------+

As a result, we need to change the file extension from “csv” to “csvh” in the file system and then query the data again. It turns out successful this time.

0: jdbc:drill:zk=local> select `year`, `month` from dfs.`apache-drill/sample-data/nycflights.csvh` limit 5; +-------+--------+ | year | month | +-------+--------+ | 2013 | 1 | | 2013 | 1 | | 2013 | 1 | | 2013 | 1 | | 2013 | 1 | +-------+--------+

From here, we can explore the data in the raw csv file directly. For instancce, we can aggregate the number of flights by airports.

0: jdbc:drill:zk=local> select origin, count(*) as cnts from dfs.`apache-drill/sample-data/nycflights.csvh` group by origin; +---------+---------+ | origin | cnts | +---------+---------+ | JFK | 111279 | | EWR | 120835 | | LGA | 104662 | +---------+---------+

Or we can extract all flights from the airport LGA and then export the data to a JSON file for the future analyses.

0: jdbc:drill:zk=local> use dfs.tmp; +-------+--------------------------------------+ | ok | summary | +-------+--------------------------------------+ | true | Default schema changed to [dfs.tmp] | +-------+--------------------------------------+ 0: jdbc:drill:zk=local> alter session set `store.format` = 'json'; +-------+------------------------+ | ok | summary | +-------+------------------------+ | true | store.format updated. | +-------+------------------------+ 0: jdbc:drill:zk=local> create table dfs.tmp.lga as select * from dfs.`apache-drill/sample-data/nycflights.csvh` where origin = 'LGA'; +-----------+----------------------------+ | Fragment | Number of records written | +-----------+----------------------------+ | 0_0 | 104662 | +-----------+----------------------------+ 0: jdbc:drill:zk=local> show files in dfs.tmp.`lga`; +-------------+--------------+---------+-----------+--------+--------+ | name | isDirectory | isFile | length | owner | group | +-------------+--------------+---------+-----------+--------+--------+ | 0_0_0.json | false | true | 34156554 | root | root | +-------------+--------------+---------+-----------+--------+--------+

## Monotonic WoE Binning for LGD Models

While the monotonic binning algorithm has been widely used in scorecard and PD model (Probability of Default) developments, the similar idea can be generalized to LGD (Loss Given Default) models. In the post below, two SAS macros performing the monotonic binning for LGD are demonstrated.

The first one tends to generate relatively coarse bins based on iterative grouping, which requires a longer computing time.

%macro lgd_bin1(data = , y = , x = ); %let maxbin = 20; data _tmp1 (keep = x y); set &data; y = min(1, max(0, &y)); x = &x; run; proc sql noprint; select count(distinct x) into :xflg from _last_; quit; %let nbin = %sysfunc(min(&maxbin, &xflg)); %if &nbin > 2 %then %do; %do j = &nbin %to 2 %by -1; proc rank data = _tmp1 groups = &j out = _data_ (keep = x rank y); var x; ranks rank; run; proc summary data = _last_ nway; class rank; output out = _tmp2 (drop = _type_ rename = (_freq_ = freq)) sum(y) = bads mean(y) = bad_rate min(x) = minx max(x) = maxx; run; proc sql noprint; select case when min(bad_rate) > 0 then 1 else 0 end into :minflg from _tmp2; select case when max(bad_rate) < 1 then 1 else 0 end into :maxflg from _tmp2; quit; %if &minflg = 1 & &maxflg = 1 %then %do; proc corr data = _tmp2 spearman noprint outs = _corr; var minx; with bad_rate; run; proc sql noprint; select case when abs(minx) = 1 then 1 else 0 end into :cor from _corr where _type_ = 'CORR'; quit; %if &cor = 1 %then %goto loopout; %end; %end; %end; %loopout: proc sql noprint; create table _tmp3 as select a.rank + 1 as bin, a.minx as minx, a.maxx as maxx, a.freq as freq, a.freq / b.freq as dist, a.bad_rate as avg_lgd, a.bads / b.bads as bpct, (a.freq - a.bads) / (b.freq - b.bads) as gpct, log(calculated bpct / calculated gpct) as woe, (calculated bpct - calculated gpct) / calculated woe as iv from _tmp2 as a, (select sum(freq) as freq, sum(bads) as bads from _tmp2) as b; quit; proc print data = _last_ noobs label; var minx maxx freq dist avg_lgd woe; format freq comma8. dist percent10.2; label minx = "Lower Limit" maxx = "Upper Limit" freq = "Freq" dist = "Dist" avg_lgd = "Average LGD" woe = "WoE"; sum freq dist; run; %mend lgd_bin1;

The second one can generate much finer bins based on the idea of isotonic regressions and is more computationally efficient.

%macro lgd_bin2(data = , y = , x = ); data _data_ (keep = x y); set &data; y = min(1, max(0, &y)); x = &x; run; proc transreg data = _last_ noprint; model identity(y) = monotone(x); output out = _tmp1 tip = _t; run; proc summary data = _last_ nway; class _tx; output out = _data_ (drop = _freq_ _type_) mean(y) = lgd; run; proc sort data = _last_; by lgd; run; data _tmp2; set _last_; by lgd; _idx = _n_; if lgd = 0 then _idx = _idx + 1; if lgd = 1 then _idx = _idx - 1; run; proc sql noprint; create table _tmp3 as select a.*, b._idx from _tmp1 as a inner join _tmp2 as b on a._tx = b._tx; create table _tmp4 as select min(a.x) as minx, max(a.x) as maxx, sum(a.y) as bads, count(a.y) as freq, count(a.y) / b.freq as dist, mean(a.y) as avg_lgd, sum(a.y) / b.bads as bpct, sum(1 - a.y) / (b.freq - b.bads) as gpct, log(calculated bpct / calculated gpct) as woe, (calculated bpct - calculated gpct) * calculated woe as iv from _tmp3 as a, (select count(*) as freq, sum(y) as bads from _tmp3) as b group by a._idx; quit; proc print data = _last_ noobs label; var minx maxx freq dist avg_lgd woe; format freq comma8. dist percent10.2; label minx = "Lower Limit" maxx = "Upper Limit" freq = "Freq" dist = "Dist" avg_lgd = "Average LGD" woe = "WoE"; sum freq dist; run; %mend lgd_bin2;

Below is the output comparison between two macros with the testing data downloaded from http://www.creditriskanalytics.net/datasets-private.html. Should you have any feedback, please feel free to leave me a message.

## Granular Monotonic Binning in SAS

In the post (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), it is shown how to do a finer monotonic binning with isotonic regression in R.

Below is a SAS macro implementing the monotonic binning with the same idea of isotonic regression. This macro is more efficient than the one shown in (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) without iterative binning and is also able to significantly increase the binning granularity.

%macro monobin(data = , y = , x = ); options mprint mlogic; data _data_ (keep = _x _y); set &data; where &y in (0, 1) and &x ~= .; _y = &y; _x = &x; run; proc transreg data = _last_ noprint; model identity(_y) = monotone(_x); output out = _tmp1 tip = _t; run; proc summary data = _last_ nway; class _t_x; output out = _data_ (drop = _freq_ _type_) mean(_y) = _rate; run; proc sort data = _last_; by _rate; run; data _tmp2; set _last_; by _rate; _idx = _n_; if _rate = 0 then _idx = _idx + 1; if _rate = 1 then _idx = _idx - 1; run; proc sql noprint; create table _tmp3 as select a.*, b._idx from _tmp1 as a inner join _tmp2 as b on a._t_x = b._t_x; create table _tmp4 as select a._idx, min(a._x) as _min_x, max(a._x) as _max_x, sum(a._y) as _bads, count(a._y) as _freq, mean(a._y) as _rate, sum(a._y) / b.bads as _bpct, sum(1 - a._y) / (b.freq - b.bads) as _gpct, log(calculated _bpct / calculated _gpct) as _woe, (calculated _bpct - calculated _gpct) * calculated _woe as _iv from _tmp3 as a, (select count(*) as freq, sum(_y) as bads from _tmp3) as b group by a._idx; quit; title "Monotonic WoE Binning for %upcase(%trim(&x))"; proc print data = _last_ label noobs; var _min_x _max_x _bads _freq _rate _woe _iv; label _min_x = "Lower" _max_x = "Upper" _bads = "#Bads" _freq = "#Freq" _rate = "BadRate" _woe = "WoE" _iv = "IV"; sum _bads _freq _iv; run; title; %mend monobin;

Below is the sample output for LTV, showing an identical binning scheme to the one generated by the R isobin() function.

## Model Non-Negative Numeric Outcomes with Zeros

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in the operational loss model development. Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, which however might not be well supported empirically from the data generation standpoint.

In examples below, we demonstrated another flexible option, namely Zero-Adjusted (ZA) models, in both scenarios of modeling non-negative numeric outcomes, one with a small number of zeros and the other with a large number of zeros. The basic idea of ZA models is very intuitive and similar to the concept of Hurdle models for count outcomes. In a nutshell, non-negative numeric outcomes can be considered two data generation processes, one for point-mass at zeros and the other governed by a statistical distribution for positive outcomes. The latter could be either Gamma or Inverse Gaussian.

First of all, we sampled down an auto-claim data in a way that only 10 claims are zeros and the rest are all positive. While 10 is an arbitrary choice in the example, other small numbers should show similar results.

pkgs <- list("cplm", "gamlss", "MLmetrics") lapply(pkgs, require, character.only = T) data(AutoClaim, package = "cplm") df1 <- na.omit(AutoClaim) # SMALL NUMBER OF ZEROS set.seed(2017) smp <- sample(seq(nrow(df1[df1$CLM_AMT == 0, ])), size = 10, replace = FALSE) df2 <- rbind(df1[df1$CLM_AMT > 0, ], df1[df1$CLM_AMT == 0, ][smp, ])

Next, we applied both Tweedie and zero-adjusted Gamma (ZAGA) models to the data with only 10 zero outcomes. It is worth mentioning that ZAGA doesn’t have to be overly complex in this case. As shown below, while we estimated the Gamma Mu parameter with model attributes, the Nu parameter to separate zeros is just a constant with the intercept = -5.4. Both Tweedie and GAZA models gave very similar estimated parameters and predictive measures with MAPE = 0.61.

tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.194e+00 7.234e-02 113.277 < 2e-16 *** # BLUEBOOK 2.047e-05 3.068e-06 6.671 3.21e-11 *** # NPOLICY 7.274e-02 3.102e-02 2.345 0.0191 * MAPE(df2$CLM_AMT, fitted(tw)) # 0.6053669 zaga0 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.671e-02 175.629 < 2e-16 *** # BLUEBOOK 2.053e-05 2.090e-06 9.821 < 2e-16 *** # NPOLICY 6.948e-02 2.057e-02 3.377 0.000746 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -5.3886 0.3169 -17 <2e-16 *** MAPE(df2$CLM_AMT, (1 - fitted(zaga0, what = "nu")) * fitted(zaga0, what = "mu")) # 0.6053314

In the next case, we used the full data with a large number of zeros in the response and then applied both Tweedie and ZAGA models again. However, in ZAGA model, we estimated two sub-models this time, one for the Nu parameter to separate zeros from non-zeros and the other for the Mu parameter to model non-zero outcomes. As shown below, ZAGA outperformed Tweedie in terms of MAPE due to the advantage that ZAGA is able to explain two data generation schemes separately with different model attributes, which is the capability beyond what Tweedie can provide.

# LARGE NUMBER OF ZEROS tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY + CLM_FREQ5 + MVR_PTS + INCOME, data = df1) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 6.854e+00 1.067e-01 64.241 < 2e-16 *** # BLUEBOOK 1.332e-05 4.495e-06 2.963 0.00305 ** # NPOLICY 4.380e-02 3.664e-02 1.195 0.23196 # CLM_FREQ5 2.064e-01 2.937e-02 7.026 2.29e-12 *** # MVR_PTS 1.066e-01 1.510e-02 7.063 1.76e-12 *** # INCOME -4.606e-06 8.612e-07 -5.348 9.12e-08 *** MAPE(df1$CLM_AMT, fitted(tw)) # 1.484484 zaga1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.682e-02 175.218 < 2e-16 *** # BLUEBOOK 2.053e-05 2.091e-06 9.816 < 2e-16 *** # NPOLICY 6.948e-02 2.067e-02 3.362 0.000778 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaga1, what = "nu")) * fitted(zaga1, what = "mu")) # 1.470228

Given the great flexibility of ZA models, we also have the luxury to explore other candidates than ZAGA. For instance, if the positive part of non-negative outcomes demonstrates a high variance, we can also try a zero-inflated Inverse Gaussian (ZAIG) model, as shown below.

zaig1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAIG") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.205e+00 5.836e-02 140.591 < 2e-16 *** # BLUEBOOK 2.163e-05 2.976e-06 7.268 3.97e-13 *** # NPOLICY 5.898e-02 2.681e-02 2.200 0.0278 * # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaig1, what = "nu")) * fitted(zaig1, what = "mu")) # 1.469236

## Variable Selection with Elastic Net

LASSO has been a popular algorithm for the variable selection and extremely effective with high-dimension data. However, it often tends to “over-regularize” a model that might be overly compact and therefore under-predictive.

The Elastic Net addresses the aforementioned “over-regularization” by balancing between LASSO and ridge penalties. In particular, a hyper-parameter, namely Alpha, would be used to regularize the model such that the model would become a LASSO in case of Alpha = 1 and a ridge in case of Alpha = 0. In practice, Alpha can be tuned easily by the cross-validation. Below is a demonstration of Elastic Net with R glmnet package and its comparison with LASSO and ridge models.

pkgs <- list("glmnet", "doParallel", "foreach", "pROC") lapply(pkgs, require, character.only = T) registerDoParallel(cores = 4) df1 <- read.csv("Downloads/credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] set.seed(2017) n <- nrow(df2) sample <- sample(seq(n), size = n * 0.5, replace = FALSE) train <- df2[sample, -1] test <- df2[-sample, -1] mdlY <- as.factor(as.matrix(train["DEFAULT"])) mdlX <- as.matrix(train[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))]) newY <- as.factor(as.matrix(test["DEFAULT"])) newX <- as.matrix(test[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))])

First of all, we estimates a LASSO model with Alpha = 1. The function cv.glmnet() is used to search for a regularization parameter, namely Lambda, that controls the penalty strength. As shown below, the model only identifies 2 attributes out of total 12.

# LASSO WITH ALPHA = 1 cv1 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = 1) md1 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv1$lambda.1se, alpha = 1) coef(md1) #(Intercept) -1.963030e+00 #AGE . #ACADMOS . #ADEPCNT . #MAJORDRG . #MINORDRG . #OWNRENT . #INCOME -5.845981e-05 #SELFEMPL . #INCPER . #EXP_INC . #SPENDING . #LOGSPEND -4.015902e-02 roc(newY, as.numeric(predict(md1, newX, type = "response"))) #Area under the curve: 0.636

We next estimates a ridge model as below by setting Alpha = 0. Similarly, Lambda is searched by the cross-validation. Since the ridge penalty would only regularize the magnitude of each coefficient, we end up with a “full” model with all model attributes. The model performance is slightly better with 10 more variables, which is a debatable outcome.

# RIDGE WITH ALPHA = 0 cv2 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = 0) md2 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv2$lambda.1se, alpha = 0) coef(md2) #(Intercept) -2.221016e+00 #AGE -4.184422e-04 #ACADMOS -3.085096e-05 #ADEPCNT 1.485114e-04 #MAJORDRG 6.684849e-03 #MINORDRG 1.006660e-03 #OWNRENT -9.082750e-03 #INCOME -6.960253e-06 #SELFEMPL 3.610381e-03 #INCPER -3.881890e-07 #EXP_INC -1.416971e-02 #SPENDING -1.638184e-05 #LOGSPEND -6.213884e-03 roc(newY, as.numeric(predict(md2, newX, type = "response"))) #Area under the curve: 0.6435

At last, we use the Elastic Net by tuning the value of Alpha through a line search with the parallelism. In this particular case, Alpha = 0.3 is chosen through the cross-validation. As shown below, 6 variables are used in the model that even performs better than the ridge model with all 12 attributes.

# ELASTIC NET WITH 0 < ALPHA < 1 a <- seq(0.1, 0.9, 0.05) search <- foreach(i = a, .combine = rbind) %dopar% { cv <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = i) data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i) } cv3 <- search[search$cvm == min(search$cvm), ] md3 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv3$lambda.1se, alpha = cv3$alpha) coef(md3) #(Intercept) -1.434700e+00 #AGE -8.426525e-04 #ACADMOS . #ADEPCNT . #MAJORDRG 6.276924e-02 #MINORDRG . #OWNRENT -2.780958e-02 #INCOME -1.305118e-04 #SELFEMPL . #INCPER -2.085349e-06 #EXP_INC . #SPENDING . #LOGSPEND -9.992808e-02 roc(newY, as.numeric(predict(md3, newX, type = "response"))) #Area under the curve: 0.6449

## DART: Dropout Regularization in Boosting Ensembles

The dropout approach developed by Hinton has been widely employed in deep learnings to prevent the deep neural network from overfitting, as shown in https://statcompute.wordpress.com/2017/01/02/dropout-regularization-in-deep-neural-networks.

In the paper http://proceedings.mlr.press/v38/korlakaivinayak15.pdf, the dropout can also be used to address the overfitting in boosting tree ensembles, e.g. MART, caused by the so-called “over-specialization”. In particular, while first few trees added at the beginning of ensembles would dominate the model performance, the rest added later can only improve the prediction for a small subset, which increases the risk of overfitting. The idea of DART is to build an ensemble by randomly dropping boosting tree members. The percentage of dropouts can determine the degree of regularization for boosting tree ensembles.

Below is a demonstration showing the implementation of DART with the R xgboost package. First of all, after importing the data, we divided it into two pieces, one for training and the other for testing.

pkgs <- c('pROC', 'xgboost') lapply(pkgs, require, character.only = T) df1 <- read.csv("Downloads/credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] set.seed(2017) n <- nrow(df2) sample <- sample(seq(n), size = n / 2, replace = FALSE) train <- df2[sample, -1] test <- df2[-sample, -1]

For the comparison purpose, we first developed a boosting tree ensemble without dropouts, as shown below. For the simplicity, all parameters were chosen heuristically. The max_depth is set to 3 due to the fact that the boosting tends to work well with so-called “weak” learners, e.g. simple trees. While ROC for the training set can be as high as 0.95, ROC for the testing set is only 0.60 in our case, implying the overfitting issue.

mart.parm <- list(booster = "gbtree", nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc") mart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = mart.parm, nrounds = 500, verbose = 0, seed = 2017) pred1 <- predict(mart, as.matrix(train[, -1])) pred2 <- predict(mart, as.matrix(test[, -1])) roc(as.factor(train$DEFAULT), pred1) # Area under the curve: 0.9459 roc(as.factor(test$DEFAULT), pred2) # Area under the curve: 0.6046

With the same set of parameters, we refitted the ensemble with dropouts, e.g. DART. As shown below, by dropping 10% tree members, ROC for the testing set can increase from 0.60 to 0.65. In addition, the performance disparity between training and testing sets with DART decreases significantly.

dart.parm <- list(booster = "dart", rate_drop = 0.1, nthread = 4, eta = 0.1, max_depth = 3, subsample = 1, eval_metric = "auc") dart <- xgboost(data = as.matrix(train[, -1]), label = train[, 1], params = dart.parm, nrounds = 500, verbose = 0, seed = 2017) pred1 <- predict(dart, as.matrix(train[, -1])) pred2 <- predict(dart, as.matrix(test[, -1])) roc(as.factor(train$DEFAULT), pred1) # Area under the curve: 0.7734 roc(as.factor(test$DEFAULT), pred2) # Area under the curve: 0.6517

Besides rate_drop = 0.1, a wide range of dropout rates have also been tested. In most cases, DART outperforms its counterpart without the dropout regularization.

## Model Operational Losses with Copula Regression

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately.

In the prevailing modeling practice for operational losses, it is often convenient to assume a functional independence between frequency and severity models, which might not be the case empirically. For instance, in the economic downturn, both the frequency and the severity of consumer frauds might tend to increase simultaneously. With the independence assumption, while we can argue that same variables could be included in both frequency and severity models and therefore induce a certain correlation, the frequency-severity dependence and the its contribution to the loss distribution might be overlooked.

In the context of Copula, the distribution of operational losses can be considered a joint distribution determined by both marginal distributions and a parameter measuring the dependence between marginals, of which marginal distributions can be Poisson for the frequency and Gamma for the severity. Depending on the dependence structure in the data, various copula functions might be considered. For instance, a product copula can be used to describe the independence. In the example shown below, a Gumbel copula is considered given that it is often used to describe the positive dependence on the right tail, e.g. high severity and high frequency. For details, the book “Copula Modeling” by Trivedi and Zimmer is a good reference to start with.

In the demonstration, we simulated both frequency and severity measures driven by the same set of co-variates. Both are positively correlated with the Kendall’s tau = 0.5 under the assumption of Gumbel copula.

library(CopulaRegression) # number of observations to simulate n <- 100 # seed value for the simulation set.seed(2017) # design matrices with a constant column X <- cbind(rep(1, n), runif(n), runif(n)) # define coefficients for both Poisson and Gamma regressions p_beta <- g_beta <- c(3, -2, 1) # define the Gamma dispersion delta <- 1 # define the Kendall's tau tau <- 0.5 # copula parameter based on tau theta <- 1 / (1 - tau) # define the Gumbel Copula family <- 4 # simulate outcomes out <- simulate_regression_data(n, g_beta, p_beta, X, X, delta, tau, family, zt = FALSE) G <- out[, 1] P <- out[, 2]

After the simulation, a Copula regression is estimated with Poisson and Gamma marginals for the frequency and the severity respectively. As shown in the model estimation, estimated parameters with related inferences are different between independent and dependent assumptions.

m <- copreg(G, P, X, family = 4, sd.error = TRUE, joint = TRUE, zt = FALSE) coef <- c("_CONST", "X1", "X2") cols <- c("ESTIMATE", "STD. ERR", "Z-VALUE") g_est <- cbind(m$alpha, m$sd.alpha, m$alpha / m$sd.alpha) p_est <- cbind(m$beta, m$sd.beta, m$beta / m$sd.beta) g_est0 <- cbind(m$alpha0, m$sd.alpha0, m$alpha0 / m$sd.alpha0) p_est0 <- cbind(m$beta0, m$sd.beta0, m$beta0 / m$sd.beta0) rownames(g_est) <- rownames(g_est0) <- rownames(p_est) <- rownames(p_est0) <- coef colnames(g_est) <- colnames(g_est0) <- colnames(p_est) <- colnames(p_est0) <- cols # estimated coefficients for the Gamma regression assumed dependence print(g_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.9710512 0.2303651 12.897141 # X1 -1.8047627 0.2944627 -6.129003 # X2 0.9071093 0.2995218 3.028526 # estimated coefficients for the Gamma regression assumed dependence print(p_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.954519 0.06023353 49.05107 # X1 -1.967023 0.09233056 -21.30414 # X2 1.025863 0.08254870 12.42736 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(g_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 3.020771 0.2499246 12.086727 # X1 -1.777570 0.3480328 -5.107478 # X2 0.905527 0.3619011 2.502140 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(p_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.939787 0.06507502 45.17536 # X1 -2.010535 0.10297887 -19.52376 # X2 1.088269 0.09334663 11.65837

If we compare conditional loss distributions under different dependence assumptions, it shows that the predicted loss with Copula regression tends to have a fatter right tail and therefore should be considered more conservative.

df <- data.frame(g = G, p = P, x1 = X[, 2], x2 = X[, 3]) glm_p <- glm(p ~ x1 + x2, data = df, family = poisson(log)) glm_g <- glm(g ~ x1 + x2, data = df, family = Gamma(log)) loss_dep <- predict(m, X, X, independence = FALSE)[3][[1]][[1]] loss_ind <- fitted(glm_p) * fitted(glm_g) den <- data.frame(loss = c(loss_dep, loss_ind), lines = rep(c("DEPENDENCE", "INDEPENDENCE"), each = n)) ggplot(den, aes(x = loss, fill = lines)) + geom_density(alpha = 0.5)