While developing risk models with hundreds of potential variables, we often run into the situation that risk characteristics or macro-economic indicators are highly correlated, namely multicollinearity. In such cases, we might have to drop variables with high VIFs or employ “variable shrinkage” methods, e.g. lasso or ridge, to suppress variables with colinearity.

Feature extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns around model interpretability and implementation. In the example below, it is shown that there shouldn’t any hurdle in the model implementation, e.g. score, given that coefficients can be extracted from a GPLS model in the similar way from a GLM model. In addition, compared with GLM with 8 variables, GPLS with only 5 components is able to provide a comparable performance in the hold-out testing data.

**R Code**

library(gpls) library(pROC) df1 <- read.csv("credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)] set.seed(2016) n <- nrow(df2) sample <- sample(seq(n), size = n / 2, replace = FALSE) train <- df2[sample, ] test <- df2[-sample, ] m1 <- glm(DEFAULT ~ ., data = train, family = "binomial") cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n") print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response"))) cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n") print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response"))) m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5) cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n") print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1])) cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n") print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1])) cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n") print(data.frame(glm = m1$coefficients, gpls = m2$coefficients))

**Output**

### ROC OF GLM PREDICTION WITH TRAINING DATA ### Call: roc.default(response = train$DEFAULT, predictor = predict(m1, newdata = train, type = "response")) Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1). Area under the curve: 0.6641 ### ROC OF GLM PREDICTION WITH TESTING DATA ### Call: roc.default(response = test$DEFAULT, predictor = predict(m1, newdata = test, type = "response")) Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1). Area under the curve: 0.6537 ### ROC OF GPLS PREDICTION WITH TRAINING DATA ### Call: roc.default(response = train$DEFAULT, predictor = predict(m2, newdata = train)$predicted[, 1]) Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1). Area under the curve: 0.6627 ### ROC OF GPLS PREDICTION WITH TESTING DATA ### Call: roc.default(response = test$DEFAULT, predictor = predict(m2, newdata = test)$predicted[, 1]) Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1). Area under the curve: 0.6542 ### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ### glm gpls (Intercept) -0.1940785071 -0.1954618828 AGE -0.0122709412 -0.0147883358 ACADMOS 0.0005302022 0.0003671781 ADEPCNT 0.1090667092 0.1352491711 MAJORDRG 0.0757313171 0.0813835741 MINORDRG 0.2621574192 0.2547176301 OWNRENT -0.2803919685 -0.1032119571 INCOME -0.0004222914 -0.0004531543 LOGSPEND -0.1688395555 -0.1525963363