## Archive for the ‘**S+/R**’ Category

## Monotonic Binning with Smbinning Package

The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in the scorecard development. However, there are several improvement opportunities in my view:

1. First of all, the underlying algorithm in the smbinning() function utilizes the recursive partitioning, which does not necessarily guarantee the monotonicity.

2. Secondly, the density in each generated bin is not even. The frequency in some bins could be much higher than the one in others.

3. At last, the function might not provide the binning outcome for some variables due to the lack of statistical significance.

In light of the above, I wrote an enhanced version by utilizing the smbinning.custom() function, shown as below. The idea is very simple. Within the repeat loop, we would bin the variable iteratively until a certain criterion is met and then feed the list of cut points into the smbinning.custom() function. As a result, we are able to achieve a set of monotonic bins with similar frequencies regardless of the so-called “statistical significance”, which is a premature step for the variable transformation in my mind.

monobin <- function(data, y, x) { d1 <- data[c(y, x)] n <- min(20, nrow(unique(d1[x]))) repeat { d1$bin <- Hmisc::cut2(d1[, x], g = n) d2 <- aggregate(d1[-3], d1[3], mean) c <- cor(d2[-1], method = "spearman") if(abs(c[1, 2]) == 1 | n == 2) break n <- n - 1 } d3 <- aggregate(d1[-3], d1[3], max) cuts <- d3[-length(d3[, 3]), 3] return(smbinning::smbinning.custom(d1, y, x, cuts)) }

Below are a couple comparisons between the generic smbinning() and the home-brew monobin() functions with the use of a toy data.

In the first example, we applied the smbinning() function to a variable named “rev_util”. As shown in the highlighted rows in the column “BadRate”, the binning outcome is not monotonic.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 0 965 716 249 965 716 249 0.1653 0.7420 0.2580 2.8755 1.0562 -0.2997 0.0162 2 <= 5 522 496 26 1487 1212 275 0.0894 0.9502 0.0498 19.0769 2.9485 1.5925 0.1356 3 <= 24 1166 1027 139 2653 2239 414 0.1998 0.8808 0.1192 7.3885 1.9999 0.6440 0.0677 4 <= 40 779 651 128 3432 2890 542 0.1335 0.8357 0.1643 5.0859 1.6265 0.2705 0.0090 5 <= 73 1188 932 256 4620 3822 798 0.2035 0.7845 0.2155 3.6406 1.2922 -0.0638 0.0008 6 <= 96 684 482 202 5304 4304 1000 0.1172 0.7047 0.2953 2.3861 0.8697 -0.4863 0.0316 7 > 96 533 337 196 5837 4641 1196 0.0913 0.6323 0.3677 1.7194 0.5420 -0.8140 0.0743 8 Missing 0 0 0 5837 4641 1196 0.0000 NaN NaN NaN NaN NaN NaN 9 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.3352

Next, we did the same with the monobin() function. As shown below, the algorithm provided a monotonic binning at the cost of granularity. Albeit coarse, the result is directionally correct with no inversion.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 30 2962 2495 467 2962 2495 467 0.5075 0.8423 0.1577 5.3426 1.6757 0.3198 0.0471 2 > 30 2875 2146 729 5837 4641 1196 0.4925 0.7464 0.2536 2.9438 1.0797 -0.2763 0.0407 3 Missing 0 0 0 5837 4641 1196 0.0000 NaN NaN NaN NaN NaN NaN 4 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.0878

In the second example, we applied the smbinning() function to a variable named “bureau_score”. As shown in the highlighted rows, the frequencies in these two bins are much higher than the rest.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 605 324 167 157 324 167 157 0.0555 0.5154 0.4846 1.0637 0.0617 -1.2942 0.1233 2 <= 632 468 279 189 792 446 346 0.0802 0.5962 0.4038 1.4762 0.3895 -0.9665 0.0946 3 <= 662 896 608 288 1688 1054 634 0.1535 0.6786 0.3214 2.1111 0.7472 -0.6087 0.0668 4 <= 699 1271 1016 255 2959 2070 889 0.2177 0.7994 0.2006 3.9843 1.3824 0.0264 0.0002 5 <= 717 680 586 94 3639 2656 983 0.1165 0.8618 0.1382 6.2340 1.8300 0.4741 0.0226 6 <= 761 1118 1033 85 4757 3689 1068 0.1915 0.9240 0.0760 12.1529 2.4976 1.1416 0.1730 7 > 761 765 742 23 5522 4431 1091 0.1311 0.9699 0.0301 32.2609 3.4739 2.1179 0.2979 8 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 9 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.8066

With the monobin() function applied to the same variable, we were able to get a set of more granular bins with similar frequencies.

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate Odds LnOdds WoE IV 1 <= 617 513 284 229 513 284 229 0.0879 0.5536 0.4464 1.2402 0.2153 -1.1407 0.1486 2 <= 642 515 317 198 1028 601 427 0.0882 0.6155 0.3845 1.6010 0.4706 -0.8853 0.0861 3 <= 657 512 349 163 1540 950 590 0.0877 0.6816 0.3184 2.1411 0.7613 -0.5946 0.0363 4 <= 672 487 371 116 2027 1321 706 0.0834 0.7618 0.2382 3.1983 1.1626 -0.1933 0.0033 5 <= 685 494 396 98 2521 1717 804 0.0846 0.8016 0.1984 4.0408 1.3964 0.0405 0.0001 6 <= 701 521 428 93 3042 2145 897 0.0893 0.8215 0.1785 4.6022 1.5265 0.1706 0.0025 7 <= 714 487 418 69 3529 2563 966 0.0834 0.8583 0.1417 6.0580 1.8014 0.4454 0.0144 8 <= 730 489 441 48 4018 3004 1014 0.0838 0.9018 0.0982 9.1875 2.2178 0.8619 0.0473 9 <= 751 513 476 37 4531 3480 1051 0.0879 0.9279 0.0721 12.8649 2.5545 1.1986 0.0859 10 <= 775 492 465 27 5023 3945 1078 0.0843 0.9451 0.0549 17.2222 2.8462 1.4903 0.1157 11 > 775 499 486 13 5522 4431 1091 0.0855 0.9739 0.0261 37.3846 3.6213 2.2653 0.2126 12 Missing 315 210 105 5837 4641 1196 0.0540 0.6667 0.3333 2.0000 0.6931 -0.6628 0.0282 13 Total 5837 4641 1196 NA NA NA 1.0000 0.7951 0.2049 3.8804 1.3559 0.0000 0.7810

## Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has a probability function shown as below

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In the formulation, the scale parameter **0 < a < y** and the shape parameter **k > 1 **.

The positive lower bound of Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of ABA operational risk consortium data is $10,000 and any loss event below the threshold value would be not reported, which might add the complexity in the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response ** y` = severity – threshold ** to accommodate the threshold value such that the supporting domain of y` could start from 0 and that the Gamma, Inverse Gaussian, or Lognormal regression can still be applicable. However, under the distributional assumption of Type-I Pareto with a known lower end, we do not need to shift the severity measure anymore but model it directly based on the probability function.

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound ** a = 2 ** by using the **VGAM** package.

library(VGAM) set.seed(2017) n <- 200 a <- 2 x <- runif(n) k <- exp(1 + 5 * x) pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x) fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE) summary(fit) # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) 1.0322 0.1363 7.574 3.61e-14 *** # x 4.9815 0.2463 20.229 < 2e-16 *** AIC(fit) # -644.458 BIC(fit) # -637.8614

The SAS code below estimating the Type-I Pareto regression provides almost identical model estimation.

proc nlmixed data = pdata; parms b0 = 0.1 b1 = 0.1; k = exp(b0 + b1 * x); a = 2; lh = k * (a ** k) / (y ** (k + 1)); ll = log(lh); model y ~ general(ll); run; /* Fit Statistics -2 Log Likelihood -648.5 AIC (smaller is better) -644.5 AICC (smaller is better) -644.4 BIC (smaller is better) -637.9 Parameter Estimate Standard DF t Value Pr > |t| Error b0 1.0322 0.1385 200 7.45 <.0001 b1 4.9815 0.2518 200 19.78 <.0001 */

At last, it is worth pointing out that the conditional mean of Type-I Pareto response is not equal to ** exp(x * beta) ** but ** a * k / (k – 1) ** with ** k = exp(x * beta) **. Therefore, the conditional mean only exists when ** k > 1 **, which might cause numerical issues in the model estimation.

## More about Flexible Frequency Models

Modeling the frequency is one of the most important aspects in operational risk models. In the previous post (https://statcompute.wordpress.com/2016/05/13/more-flexible-approaches-to-model-frequency), the importance of flexible modeling approaches for both under-dispersion and over-dispersion has been discussed.

In addition to the quasi-poisson regression, three flexible frequency modeling techniques, including generalized poisson, double poisson, and Conway-Maxwell poisson, with their implementations in R should also be demonstrated below. While the example is specifically related to the over-dispersed data simulated with the negative binomial distributional assumption, these approaches can be generalized to the under-dispersed data as well given their flexibility. However, as demonstrated below, the calculation of parameters for these modeling approaches is not straight-forward.

**Over-Dispersed Data Simulation**

> set.seed(1) > ### SIMULATE NEG. BINOMIAL WITH MEAN(X) = MU AND VAR(X) = MU + MU ^ 2 / THETA > df <- data.frame(y = MASS::rnegbin(1000, mu = 10, theta = 5)) > ### DATA MEAN > mean(df$y) [1] 9.77 > ### DATA VARIANCE > var(df$y) [1] 30.93003003

**Generalized Poisson Regression**

> library(VGAM) > gpois <- vglm(y ~ 1, data = df, family = genpoisson) > gpois.theta <- exp(coef(gpois)[2]) > gpois.lambda <- (exp(coef(gpois)[1]) - 1) / (exp(coef(gpois)[1]) + 1) > ### ESTIMATE MEAN = THETA / (1 - LAMBDA) > gpois.theta / (1 - gpois.lambda) (Intercept):2 9.77 > ### ESTIMATE VARIANCE = THETA / ((1 - LAMBDA) ^ 3) > gpois.theta / ((1 - gpois.lambda) ^ 3) (Intercept):2 31.45359991

**Double Poisson Regression**

> ### DOUBLE POISSON > library(gamlss) > dpois <- gamlss(y ~ 1, data = df, family = DPO, control = gamlss.control(n.cyc = 100)) > ### ESTIMATE MEAN > dpois.mu <- exp(dpois$mu.coefficients) > dpois.mu (Intercept) 9.848457877 > ### ESTIMATE VARIANCE = MU * SIGMA > dpois.sigma <- exp(dpois$sigma.coefficients) > dpois.mu * dpois.sigma (Intercept) 28.29229702

**Conway-Maxwell Poisson Regression**

> ### CONWAY-MAXWELL POISSON > library(CompGLM) > cpois <- glm.comp(y ~ 1, data = df) > cpois.lambda <- exp(cpois$beta) > cpois.nu <- exp(cpois$zeta) > ### ESTIMATE MEAN = LAMBDA ^ (1 / NU) - (NU - 1) / (2 * NU) > cpois.lambda ^ (1 / cpois.nu) - (cpois.nu - 1) / (2 * cpois.nu) (Intercept) 9.66575376 > ### ESTIMATE VARIANCE = LAMBDA ** (1 / NU) / NU > cpois.lambda ^ (1 / cpois.nu) / cpois.nu (Intercept) 29.69861239

## Fastest Way to Add New Variables to A Large Data.Frame

pkgs <- list("hflights", "doParallel", "foreach", "dplyr", "rbenchmark", "data.table") lapply(pkgs, require, character.only = T) data(hflights) benchmark(replications = 10, order = "user.self", relative = "user.self", transform = { ### THE GENERIC FUNCTION MODIFYING THE DATA.FRAME, SIMILAR TO DATA.FRAME() ### transform(hflights, wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), delay = ArrDelay + DepDelay) }, within = { ### EVALUATE THE EXPRESSION WITHIN THE LOCAL ENVIRONMENT ### within(hflights, {wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'); delay = ArrDelay + DepDelay}) }, mutate = { ### THE SPECIFIC FUNCTION IN DPLYR PACKAGE TO ADD VARIABLES ### mutate(hflights, wday = ifelse(DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), delay = ArrDelay + DepDelay) }, foreach = { ### SPLIT AND THEN COMBINE IN PARALLEL ### registerDoParallel(cores = 2) v <- c(names(hflights), 'wday', 'delay') f <- expression(ifelse(hflights$DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), hflights$ArrDelay + hflights$DepDelay) df <- foreach(fn = iter(f), .combine = mutate, .init = hflights) %dopar% { eval(fn) } names(df) <- v }, data.table = { ### DATA.TABLE ### data.table(hflights)[, c("wday", "delay") := list(ifelse(hflights$DayOfWeek %in% c(6, 7), 'weekend', 'weekday'), hflights$ArrDelay + hflights$DepDelay)] } ) # test replications elapsed relative user.self sys.self user.child # 4 foreach 10 1.442 1.000 0.240 0.144 0.848 # 2 within 10 0.667 2.783 0.668 0.000 0.000 # 3 mutate 10 0.679 2.833 0.680 0.000 0.000 # 5 data.table 10 0.955 3.983 0.956 0.000 0.000 # 1 transform 10 1.732 7.200 1.728 0.000 0.000

## Risk Models with Generalized PLS

While developing risk models with hundreds of potential variables, we often run into the situation that risk characteristics or macro-economic indicators are highly correlated, namely multicollinearity. In such cases, we might have to drop variables with high VIFs or employ “variable shrinkage” methods, e.g. lasso or ridge, to suppress variables with colinearity.

Feature extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns around model interpretability and implementation. In the example below, it is shown that there shouldn’t any hurdle in the model implementation, e.g. score, given that coefficients can be extracted from a GPLS model in the similar way from a GLM model. In addition, compared with GLM with 8 variables, GPLS with only 5 components is able to provide a comparable performance in the hold-out testing data.

**R Code**

library(gpls) library(pROC) df1 <- read.csv("credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)] set.seed(2016) n <- nrow(df2) sample <- sample(seq(n), size = n / 2, replace = FALSE) train <- df2[sample, ] test <- df2[-sample, ] m1 <- glm(DEFAULT ~ ., data = train, family = "binomial") cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n") print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response"))) cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n") print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response"))) m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5) cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n") print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1])) cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n") print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1])) cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n") print(data.frame(glm = m1$coefficients, gpls = m2$coefficients))

**Output**

### ROC OF GLM PREDICTION WITH TRAINING DATA ### Call: roc.default(response = train$DEFAULT, predictor = predict(m1, newdata = train, type = "response")) Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1). Area under the curve: 0.6641 ### ROC OF GLM PREDICTION WITH TESTING DATA ### Call: roc.default(response = test$DEFAULT, predictor = predict(m1, newdata = test, type = "response")) Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1). Area under the curve: 0.6537 ### ROC OF GPLS PREDICTION WITH TRAINING DATA ### Call: roc.default(response = train$DEFAULT, predictor = predict(m2, newdata = train)$predicted[, 1]) Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1). Area under the curve: 0.6627 ### ROC OF GPLS PREDICTION WITH TESTING DATA ### Call: roc.default(response = test$DEFAULT, predictor = predict(m2, newdata = test)$predicted[, 1]) Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1). Area under the curve: 0.6542 ### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ### glm gpls (Intercept) -0.1940785071 -0.1954618828 AGE -0.0122709412 -0.0147883358 ACADMOS 0.0005302022 0.0003671781 ADEPCNT 0.1090667092 0.1352491711 MAJORDRG 0.0757313171 0.0813835741 MINORDRG 0.2621574192 0.2547176301 OWNRENT -0.2803919685 -0.1032119571 INCOME -0.0004222914 -0.0004531543 LOGSPEND -0.1688395555 -0.1525963363

## More Flexible Approaches to Model Frequency

(The post below is motivated by my friend Matt Flynn https://www.linkedin.com/in/matthew-flynn-1b443b11)

In the context of operational loss forecast models, the standard Poisson regression is the most popular way to model frequency measures. Conceptually speaking, there is a restrictive assumption for the standard Poisson regression, namely Equi-Dispersion, which requires the equality between the conditional mean and the variance such that E(Y) = var(Y). However, in real-world frequency outcomes, the assumption of Equi-Dispersion is always problematic. On the contrary, the empirical data often presents either an excessive variance, namely Over-Dispersion, or an insufficient variance, namely Under-Dispersion. The application of a standard Poisson regression to the over-dispersed data will lead to deflated standard errors of parameter estimates and therefore inflated t-statistics.

In cases of Over-Dispersion, the Negative Binomial (NB) regression has been the most common alternative to the standard Poisson regression by including a dispersion parameter to accommodate the excessive variance in the data. In the formulation of NB regression, the variance is expressed as a quadratic function of the conditional mean such that the variance is guaranteed to be higher than the conditional mean. However, it is not flexible enough to allow for both Over-Dispersion and Under-Dispersion. Therefore, more generalizable approaches are called for.

Two additional frequency modeling methods, including Quasi-Poisson (QP) regression and Conway-Maxwell Poisson (CMP) regression, are discussed. In the case of Quasi-Poisson, E(Y) = λ and var(Y) = θ • λ. While θ > 1 addresses Over-Dispersion, θ < 1 governs Under-Dispersion. Since QP regression is estimated with QMLE, likelihood-based statistics, such as AIC and BIC, won’t be available. Instead, quasi-AIC and quasi-BIC are provided. In the case of Conway-Maxwell Poisson, E(Y) = λ ** (1 / v) – (v – 1) / (2 • v) and var(Y) = (1 / v) • λ ** (1 / v), where λ doesn’t represent the conditional mean anymore but a location parameter. While v < 1 enables us to model the long-tailed distribution reflected as Over-Dispersion, v > 1 takes care of the short-tailed distribution reflected as Under-Dispersion. Since CMP regression is estimated with MLE, likelihood-based statistics, such as AIC and BIC, are available at a high computing cost.

Below demonstrates how to estimate QP and CMP regressions with R and a comparison of their computing times. If the modeling purpose is mainly for the prediction without focusing on the statistical reference, QP regression would be an excellent choice for most practitioners. Otherwise, CMP regression is an elegant model to address various levels of dispersion parsimoniously.

# data source: www.jstatsoft.org/article/view/v027i08 load("../Downloads/DebTrivedi.rda") library(rbenchmark) library(CompGLM) benchmark(replications = 3, order = "user.self", quasi.poisson = { m1 <- glm(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, family = "quasipoisson") }, conway.maxwell = { m2 <- glm.comp(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, lamStart = m1$coefficient s) } ) # test replications elapsed relative user.self sys.self user.child # 1 quasi.poisson 3 0.084 1.000 0.084 0.000 0 # 2 conway.maxwell 3 42.466 505.548 42.316 0.048 0 summary(m1) summary(m2)

**Quasi-Poisson Regression**

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.886462 0.069644 12.729 < 2e-16 *** healthpoor 0.235673 0.046284 5.092 3.69e-07 *** healthexcellent -0.360188 0.078441 -4.592 4.52e-06 *** hosp 0.163246 0.015594 10.468 < 2e-16 *** numchron 0.144652 0.011894 12.162 < 2e-16 *** privinsyes 0.304691 0.049879 6.109 1.09e-09 *** school 0.028953 0.004812 6.016 1.93e-09 *** gendermale -0.092460 0.033830 -2.733 0.0063 ** medicaidyes 0.297689 0.063787 4.667 3.15e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for quasipoisson family taken to be 6.697556) Null deviance: 26943 on 4405 degrees of freedom Residual deviance: 23027 on 4397 degrees of freedom AIC: NA

**Conway-Maxwell Poisson Regression**

Beta: Estimate Std.Error t.value p.value (Intercept) -0.23385559 0.16398319 -1.4261 0.15391 healthpoor 0.03226830 0.01325437 2.4345 0.01495 * healthexcellent -0.08361733 0.00687228 -12.1673 < 2e-16 *** hosp 0.01743416 0.01500555 1.1618 0.24536 numchron 0.02186788 0.00209274 10.4494 < 2e-16 *** privinsyes 0.05193645 0.00184446 28.1581 < 2e-16 *** school 0.00490214 0.00805940 0.6083 0.54305 gendermale -0.01485663 0.00076861 -19.3292 < 2e-16 *** medicaidyes 0.04861617 0.00535814 9.0733 < 2e-16 *** Zeta: Estimate Std.Error t.value p.value (Intercept) -3.4642316 0.0093853 -369.11 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 AIC: 24467.13 Log-Likelihood: -12223.56

## Improve SVM Tuning through Parallelism

As pointed out in the chapter 10 of “The Elements of Statistical Learning”, ANN and SVM (support vector machines) share similar pros and cons, e.g. lack of interpretability and good predictive power. However, in contrast to ANN usually suffering from local minima solutions, SVM is always able to converge globally. In addition, SVM is less prone to over-fitting given a good choice of free parameters, which usually can be identified through cross-validations.

In the R package “e1071”, tune() function can be used to search for SVM parameters but is extremely inefficient due to the sequential instead of parallel executions. In the code snippet below, a parallelism-based algorithm performs the grid search for SVM parameters through the K-fold cross validation.

pkgs <- c('foreach', 'doParallel') lapply(pkgs, require, character.only = T) registerDoParallel(cores = 4) ### PREPARE FOR THE DATA ### df1 <- read.csv("credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, ] x <- paste("AGE + ACADMOS + ADEPCNT + MAJORDRG + MINORDRG + OWNRENT + INCOME + SELFEMPL + INCPER + EXP_INC") fml <- as.formula(paste("as.factor(DEFAULT) ~ ", x)) ### SPLIT DATA INTO K FOLDS ### set.seed(2016) df2$fold <- caret::createFolds(1:nrow(df2), k = 4, list = FALSE) ### PARAMETER LIST ### cost <- c(10, 100) gamma <- c(1, 2) parms <- expand.grid(cost = cost, gamma = gamma) ### LOOP THROUGH PARAMETER VALUES ### result <- foreach(i = 1:nrow(parms), .combine = rbind) %do% { c <- parms[i, ]$cost g <- parms[i, ]$gamma ### K-FOLD VALIDATION ### out <- foreach(j = 1:max(df2$fold), .combine = rbind, .inorder = FALSE) %dopar% { deve <- df2[df2$fold != j, ] test <- df2[df2$fold == j, ] mdl <- e1071::svm(fml, data = deve, type = "C-classification", kernel = "radial", cost = c, gamma = g, probability = TRUE) pred <- predict(mdl, test, decision.values = TRUE, probability = TRUE) data.frame(y = test$DEFAULT, prob = attributes(pred)$probabilities[, 2]) } ### CALCULATE SVM PERFORMANCE ### roc <- pROC::roc(as.factor(out$y), out$prob) data.frame(parms[i, ], roc = roc$auc[1]) }