Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for the ‘Credit Risk’ Category

Monotonic Binning with Smbinning Package

The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in the scorecard development. However, there are several improvement opportunities in my view:

1. First of all, the underlying algorithm in the smbinning() function utilizes the recursive partitioning, which does not necessarily guarantee the monotonicity.
2. Secondly, the density in each generated bin is not even. The frequency in some bins could be much higher than the one in others.
3. At last, the function might not provide the binning outcome for some variables due to the lack of statistical significance.

In light of the above, I wrote an enhanced version by utilizing the smbinning.custom() function, shown as below. The idea is very simple. Within the repeat loop, we would bin the variable iteratively until a certain criterion is met and then feed the list of cut points into the smbinning.custom() function. As a result, we are able to achieve a set of monotonic bins with similar frequencies regardless of the so-called “statistical significance”, which is a premature step for the variable transformation in my mind.

monobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  n <- min(20, nrow(unique(d1[x])))
  repeat {
    d1$bin <- Hmisc::cut2(d1[, x], g = n)
    d2 <- aggregate(d1[-3], d1[3], mean)
    c <- cor(d2[-1], method = "spearman")
    if(abs(c[1, 2]) == 1 | n == 2) break
    n <- n - 1
  }
  d3 <- aggregate(d1[-3], d1[3], max)
  cuts <- d3[-length(d3[, 3]), 3]
  return(smbinning::smbinning.custom(d1, y, x, cuts))
}

Below are a couple comparisons between the generic smbinning() and the home-brew monobin() functions with the use of a toy data.

In the first example, we applied the smbinning() function to a variable named “rev_util”. As shown in the highlighted rows in the column “BadRate”, the binning outcome is not monotonic.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 0    965     716    249       965        716       249 0.1653   0.7420  0.2580  2.8755 1.0562 -0.2997 0.0162
2     <= 5    522     496     26      1487       1212       275 0.0894   0.9502  0.0498 19.0769 2.9485  1.5925 0.1356
3    <= 24   1166    1027    139      2653       2239       414 0.1998   0.8808  0.1192  7.3885 1.9999  0.6440 0.0677
4    <= 40    779     651    128      3432       2890       542 0.1335   0.8357  0.1643  5.0859 1.6265  0.2705 0.0090
5    <= 73   1188     932    256      4620       3822       798 0.2035   0.7845  0.2155  3.6406 1.2922 -0.0638 0.0008
6    <= 96    684     482    202      5304       4304      1000 0.1172   0.7047  0.2953  2.3861 0.8697 -0.4863 0.0316
7     > 96    533     337    196      5837       4641      1196 0.0913   0.6323  0.3677  1.7194 0.5420 -0.8140 0.0743
8  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN     NaN    NaN     NaN    NaN
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.3352

Next, we did the same with the monobin() function. As shown below, the algorithm provided a monotonic binning at the cost of granularity. Albeit coarse, the result is directionally correct with no inversion.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
1    <= 30   2962    2495    467      2962       2495       467 0.5075   0.8423  0.1577 5.3426 1.6757  0.3198 0.0471
2     > 30   2875    2146    729      5837       4641      1196 0.4925   0.7464  0.2536 2.9438 1.0797 -0.2763 0.0407
3  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN    NaN    NaN     NaN    NaN
4    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.0878

In the second example, we applied the smbinning() function to a variable named “bureau_score”. As shown in the highlighted rows, the frequencies in these two bins are much higher than the rest.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1   <= 605    324     167    157       324        167       157 0.0555   0.5154  0.4846  1.0637 0.0617 -1.2942 0.1233
2   <= 632    468     279    189       792        446       346 0.0802   0.5962  0.4038  1.4762 0.3895 -0.9665 0.0946
3   <= 662    896     608    288      1688       1054       634 0.1535   0.6786  0.3214  2.1111 0.7472 -0.6087 0.0668
4   <= 699   1271    1016    255      2959       2070       889 0.2177   0.7994  0.2006  3.9843 1.3824  0.0264 0.0002
5   <= 717    680     586     94      3639       2656       983 0.1165   0.8618  0.1382  6.2340 1.8300  0.4741 0.0226
6   <= 761   1118    1033     85      4757       3689      1068 0.1915   0.9240  0.0760 12.1529 2.4976  1.1416 0.1730
7    > 761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609 3.4739  2.1179 0.2979
8  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.8066

With the monobin() function applied to the same variable, we were able to get a set of more granular bins with similar frequencies.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
10   <= 775    492     465     27      5023       3945      1078 0.0843   0.9451  0.0549 17.2222 2.8462  1.4903 0.1157
11    > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810

Written by statcompute

January 22, 2017 at 11:05 pm

Scorecard Development with Data from Multiple Sources

This week, one of my friends asked me a very interesting and practical question in the scorecard development. The model development data were collected from multiple independent sources with various data sizes, heterogeneous risk profiles and different bad rates. While the performance statistics seem satisfactory on the model training dataset, the model doesn’t generalize well with new accounts that might come from a unknown source. The situation is very common in a consulting company where a risk or marketing model is sometimes developed with the data from multiple organizations.

To better understand the issue, I simulated a dataset consisting of two groups. In the dataset, while x0 and x1 govern the group segmentation, x2 and x3 define the bad definition. It is important to point out that the group information “grp” is only known in the model development sample but is unknown in the production population. Therefore, the variable “grp”, albeit predictive, can not be explicitly used in the model estimation.

data one;
  do i = 1 to 100000;
    x0 = ranuni(0);
    x1 = ranuni(1);
    x2 = ranuni(2);
    x3 = ranuni(3);
    if 1 + x0 * 2 + x1 * 4 + rannor(1) > 5 then do;
      grp = 1;
      if x2 * 2 + x3 * 4 + rannor(2) > 5 then bad = 1;
    	else bad = 0;
    end;
    else do;
      grp = 0;
      if x2 * 4 + x3 * 2 + rannor(3) > 4 then bad = 1;
    	else bad = 0;
    end;
    output;
  end;
run;

Our first approach is to use all variables x0 – x3 to build a logistic regression and then evaluate the model altogether and by groups.

proc logistic data = one desc noprint;
  model bad = x0 x1 x2 x3;
  score data = one out = mdl1 (rename = (p_1 = score1));
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1
                                MAXIMUM KS = 59.5763 AT SCORE POINT 0.2281
               ( AUC STATISTICS = 0.8800, GINI COEFFICIENT = 0.7599, DIVERGENCE = 2.6802 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6800     0.9699       2,057      7,943     10,000   79.43%      79.43%    33.81%      33.81%
   |      0.4679     0.6799       4,444      5,556     10,000   55.56%      67.50%    23.65%      57.46%
   |      0.3094     0.4679       6,133      3,867     10,000   38.67%      57.89%    16.46%      73.92%
   |      0.1947     0.3094       7,319      2,681     10,000   26.81%      50.12%    11.41%      85.33%
   |      0.1181     0.1946       8,364      1,636     10,000   16.36%      43.37%     6.96%      92.29%
   |      0.0690     0.1181       9,044        956     10,000    9.56%      37.73%     4.07%      96.36%
   |      0.0389     0.0690       9,477        523     10,000    5.23%      33.09%     2.23%      98.59%
   |      0.0201     0.0389       9,752        248     10,000    2.48%      29.26%     1.06%      99.64%
   V      0.0085     0.0201       9,925         75     10,000    0.75%      26.09%     0.32%      99.96%
 GOOD     0.0005     0.0085       9,991          9     10,000    0.09%      23.49%     0.04%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9699      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.0327 AT SCORE POINT 0.2457
               ( AUC STATISTICS = 0.8872, GINI COEFFICIENT = 0.7744, DIVERGENCE = 2.8605 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7086     0.9699       1,051      6,162      7,213   85.43%      85.43%    30.51%      30.51%
   |      0.5019     0.7086       2,452      4,762      7,214   66.01%      75.72%    23.58%      54.10%
   |      0.3407     0.5019       3,710      3,504      7,214   48.57%      66.67%    17.35%      71.45%
   |      0.2195     0.3406       4,696      2,517      7,213   34.90%      58.73%    12.46%      83.91%
   |      0.1347     0.2195       5,650      1,564      7,214   21.68%      51.32%     7.74%      91.66%
   |      0.0792     0.1347       6,295        919      7,214   12.74%      44.89%     4.55%      96.21%
   |      0.0452     0.0792       6,737        476      7,213    6.60%      39.42%     2.36%      98.56%
   |      0.0234     0.0452       7,000        214      7,214    2.97%      34.86%     1.06%      99.62%
   V      0.0099     0.0234       7,150         64      7,214    0.89%      31.09%     0.32%      99.94%
 GOOD     0.0007     0.0099       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0007     0.9699      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 1))
                                MAXIMUM KS = 53.0942 AT SCORE POINT 0.2290
               ( AUC STATISTICS = 0.8486, GINI COEFFICIENT = 0.6973, DIVERGENCE = 2.0251 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5863     0.9413       1,351      1,435      2,786   51.51%      51.51%    43.48%      43.48%
   |      0.3713     0.5862       2,136        651      2,787   23.36%      37.43%    19.73%      63.21%
   |      0.2299     0.3712       2,340        446      2,786   16.01%      30.29%    13.52%      76.73%
   |      0.1419     0.2298       2,525        262      2,787    9.40%      25.07%     7.94%      84.67%
   |      0.0832     0.1419       2,584        202      2,786    7.25%      21.50%     6.12%      90.79%
   |      0.0480     0.0832       2,643        144      2,787    5.17%      18.78%     4.36%      95.15%
   |      0.0270     0.0480       2,682        104      2,786    3.73%      16.63%     3.15%      98.30%
   |      0.0140     0.0270       2,741         46      2,787    1.65%      14.76%     1.39%      99.70%
   V      0.0058     0.0140       2,776         10      2,786    0.36%      13.16%     0.30%     100.00%
 GOOD     0.0005     0.0058       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9413      24,564      3,300     27,864

As shown in the above output, while the overall model performance looks ok, it doesn’t generalize well in the dataset from the 2nd group with a smaller size. While the overall KS could be as high as 60, the KS for the 2nd group is merely 53. The reason is that the overall model performance is heavily influenced by the dataset from the 1st group with the larger size. Therefore, the estimated model is biased toward the risk profile reflected in the 1st group.

To alleviate the bias in the first model, we could first introduce a look-alike model driven by x0 – x1 with the purpose to profile the group and then build two separate risk models with x2 – x3 only for 1st and 2nd groups respectively. As a result, the final predicted probability should be the composite of all three sub-models, as shown below. The model evaluation is also provided to compared with the first model.

proc logistic data = one desc noprint;
  where grp = 0;
  model bad = x2 x3;
  score data = one out = mdl20(rename = (p_1 = p_10));
run;

proc logistic data = one desc noprint;
  where grp = 1;
  model bad = x2 x3;
  score data = one out = mdl21(rename = (p_1 = p_11));
run;

proc logistic data = one desc noprint;
  model grp = x0 x1;
  score data = one out = seg;
run;

data mdl2;
  merge seg mdl20 mdl21;
  by i;
  score2 = p_10 * (1 - p_1) + p_11 * p_1;
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2
                                MAXIMUM KS = 60.6234 AT SCORE POINT 0.2469
               ( AUC STATISTICS = 0.8858, GINI COEFFICIENT = 0.7715, DIVERGENCE = 2.8434 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6877     0.9677       2,011      7,989     10,000   79.89%      79.89%    34.00%      34.00%
   |      0.4749     0.6876       4,300      5,700     10,000   57.00%      68.45%    24.26%      58.27%
   |      0.3125     0.4748       6,036      3,964     10,000   39.64%      58.84%    16.87%      75.14%
   |      0.1932     0.3124       7,451      2,549     10,000   25.49%      50.51%    10.85%      85.99%
   |      0.1142     0.1932       8,379      1,621     10,000   16.21%      43.65%     6.90%      92.89%
   |      0.0646     0.1142       9,055        945     10,000    9.45%      37.95%     4.02%      96.91%
   |      0.0345     0.0646       9,533        467     10,000    4.67%      33.19%     1.99%      98.90%
   |      0.0166     0.0345       9,800        200     10,000    2.00%      29.29%     0.85%      99.75%
   V      0.0062     0.0166       9,946         54     10,000    0.54%      26.10%     0.23%      99.98%
 GOOD     0.0001     0.0062       9,995          5     10,000    0.05%      23.49%     0.02%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9677      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.1591 AT SCORE POINT 0.2458
               ( AUC STATISTICS = 0.8880, GINI COEFFICIENT = 0.7759, DIVERGENCE = 2.9130 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7221     0.9677       1,075      6,138      7,213   85.10%      85.10%    30.40%      30.40%
   |      0.5208     0.7221       2,436      4,778      7,214   66.23%      75.66%    23.66%      54.06%
   |      0.3533     0.5208       3,670      3,544      7,214   49.13%      66.82%    17.55%      71.61%
   |      0.2219     0.3532       4,726      2,487      7,213   34.48%      58.73%    12.32%      83.92%
   |      0.1309     0.2219       5,617      1,597      7,214   22.14%      51.41%     7.91%      91.83%
   |      0.0731     0.1309       6,294        920      7,214   12.75%      44.97%     4.56%      96.39%
   |      0.0387     0.0731       6,762        451      7,213    6.25%      39.44%     2.23%      98.62%
   |      0.0189     0.0387       7,009        205      7,214    2.84%      34.86%     1.02%      99.63%
   V      0.0074     0.0189       7,152         62      7,214    0.86%      31.09%     0.31%      99.94%
 GOOD     0.0002     0.0073       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0002     0.9677      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 1))
                                MAXIMUM KS = 57.6788 AT SCORE POINT 0.1979
               ( AUC STATISTICS = 0.8717, GINI COEFFICIENT = 0.7434, DIVERGENCE = 2.4317 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5559     0.9553       1,343      1,443      2,786   51.79%      51.79%    43.73%      43.73%
   |      0.3528     0.5559       2,001        786      2,787   28.20%      40.00%    23.82%      67.55%
   |      0.2213     0.3528       2,364        422      2,786   15.15%      31.71%    12.79%      80.33%
   |      0.1372     0.2213       2,513        274      2,787    9.83%      26.24%     8.30%      88.64%
   |      0.0840     0.1372       2,588        198      2,786    7.11%      22.42%     6.00%      94.64%
   |      0.0484     0.0840       2,683        104      2,787    3.73%      19.30%     3.15%      97.79%
   |      0.0256     0.0483       2,729         57      2,786    2.05%      16.84%     1.73%      99.52%
   |      0.0118     0.0256       2,776         11      2,787    0.39%      14.78%     0.33%      99.85%
   V      0.0040     0.0118       2,781          5      2,786    0.18%      13.16%     0.15%     100.00%
 GOOD     0.0001     0.0040       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9553      24,564      3,300     27,864

After comparing KS statistics from two modeling approaches, we can see that, while the performance from the 2nd approach on the overall sample is only slightly better than the one from the 1st approach, the KS on the 2nd group with a smaller size, e.g. grp = 1, increases from 53 upto 58 by 8.6%. While the example is just for two groups, it is trivial to generalize in cases with more than two groups.

Written by statcompute

June 18, 2016 at 2:43 pm

Risk Models with Generalized PLS

While developing risk models with hundreds of potential variables, we often run into the situation that risk characteristics or macro-economic indicators are highly correlated, namely multicollinearity. In such cases, we might have to drop variables with high VIFs or employ “variable shrinkage” methods, e.g. lasso or ridge, to suppress variables with colinearity.

Feature extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns around model interpretability and implementation. In the example below, it is shown that there shouldn’t any hurdle in the model implementation, e.g. score, given that coefficients can be extracted from a GPLS model in the similar way from a GLM model. In addition, compared with GLM with 8 variables, GPLS with only 5 components is able to provide a comparable performance in the hold-out testing data.

R Code

library(gpls)
library(pROC)

df1 <- read.csv("credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)]
set.seed(2016)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, ]
test <- df2[-sample, ]

m1 <- glm(DEFAULT ~ ., data = train, family = "binomial")
cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response")))
cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response")))

m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5)
cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1]))
cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1]))

cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n")
print(data.frame(glm = m1$coefficients, gpls = m2$coefficients))

Output

### ROC OF GLM PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m1,     newdata = train, type = "response"))

Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6641

### ROC OF GLM PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m1,     newdata = test, type = "response"))

Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6537

### ROC OF GPLS PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m2,     newdata = train)$predicted[, 1])

Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6627

### ROC OF GPLS PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m2,     newdata = test)$predicted[, 1])

Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6542

### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###
                      glm          gpls
(Intercept) -0.1940785071 -0.1954618828
AGE         -0.0122709412 -0.0147883358
ACADMOS      0.0005302022  0.0003671781
ADEPCNT      0.1090667092  0.1352491711
MAJORDRG     0.0757313171  0.0813835741
MINORDRG     0.2621574192  0.2547176301
OWNRENT     -0.2803919685 -0.1032119571
INCOME      -0.0004222914 -0.0004531543
LOGSPEND    -0.1688395555 -0.1525963363

Written by statcompute

June 12, 2016 at 5:55 pm

Prediction Intervals for Poisson Regression

Different from the confidence interval that is to address the uncertainty related to the conditional mean, the prediction interval is to accommodate the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk due to aforementioned uncertainties.

While calculating prediction interval of OLS regression based on the Gaussian distributional assumption is relatively straightforward with the off-shelf solution in R, it could be more complicated in a Generalized Linear Model, e.g. Poisson regression. In this post, I am going to show two empirical methods, one based on bootstrapping and the other based on simulation, calculating the prediction interval of a Poisson regression. Because of the high computing cost, the parallelism with foreach() function will be used to improve the efficiency.

First of all, let’s estimate a Poisson regression with glm() and generate a couple fake new data points to calculate model predictions. Since the toy data is very small with only 32 records with all categorical predictors, I doubled the sample size by rbind() to ensure the appropriate data coverage in the bootstrapping.

pkgs <- c('doParallel', 'foreach')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
new_fake <- df[1:5, 1:2]

The first method shown below is based on the bootstrapping with following steps:

1. Bootstrapped the original model development sample by the random sample with replacements;

2. Repeated the above many times, e.g. 1000, to generate different bootstrapped samples;

3. Refitted models with bootstrapped samples;

4. Generated predictions with new data points, e.g. “new_fake”, but with refitted models;

5. Generated random numbers based on Poisson distribution with the mean, e.g. lambda, equal to the predicted values from refitted models

6. Collected all Poisson random numbers from the previous step and calculated the percentiles.

boot_pi <- function(model, pdata, n, p) {
  odata <- model$data
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ]
    bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = boot_ci[, 1], upper = boot_ci[, 2]))
}

boot_pi(mdl, new_fake, 1000, 0.95)
#      pred lower upper
#1 12.63040     6    21
#2 38.69738    25    55
#3 26.97271    16    39
#4 10.69951     4    18
#5 52.50839    35    70

The second method is based on the simulation and outlined as below:

1. Re-produced the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to predicted values from the original model;

2. Repeated the above simulations many times, e.g. 1000, to generate many response series;

3. Generated 1000 updated model samples by replacing the original response with the new response generated from simulations;

4. Refitted models with these updated samples

5. Generated predictions with new data points, e.g. “new_fake”, but with refitted models;

6. Generated Poisson random numbers with lambda equal to the predicted values from refitted models

7. Collected all Poisson random numbers from the previous step and calculated the percentiles.

sim_pi <- function(model, pdata, n, p) {
  odata <- model$data
  yhat <- predict(model, type = "response")
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    sim_y <- rpois(length(yhat), lambda = yhat)
    sdata <- data.frame(y = sim_y, odata[names(model$x)])
    refit <- glm(y ~ ., data = sdata, family = poisson)
    bpred <- predict(refit, type = "response", newdata = pdata)
    rpois(length(bpred),lambda = bpred)
  }
  sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = sim_ci[, 1], upper = sim_ci[, 2]))
}

sim_pi(mdl, new_fake, 1000, 0.95)
#      pred lower upper
#1 12.63040     6    21
#2 38.69738    26    52
#3 26.97271    17    39
#4 10.69951     4    18
#5 52.50839    38    68

As demonstrated above, after a large number of replications, outcomes from both methods are highly consistent.

Written by statcompute

December 20, 2015 at 2:54 pm

Calculate Leave-One-Out Prediction for GLM

In the model development, the “leave-one-out” prediction is a way of cross-validation, calculated as below:
1. First of all, after a model is developed, each observation used in the model development is removed in turn and then the model is refitted with the remaining observations
2. The out-of-sample prediction for the refitted model is calculated with the removed observation one by one to assemble the LOO, e.g. leave-one-out predicted values for the whole model development sample.
The loo_predict() function below is a general routine to calculate the LOO prediction for any GLM object, which can be further employed to investigate the model stability and predictability.

> pkgs <- c('doParallel', 'foreach')
> lapply(pkgs, require, character.only = T)
[[1]]
[1] TRUE

[[2]]
[1] TRUE

> registerDoParallel(cores = 8)
>
> data(AutoCollision, package = "insuranceData")
> # A GAMMA GLM #
> model1 <- glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = Gamma(link = "log"))
> # A POISSON GLM #
> model2 <- glm(Claim_Count ~ Age + Vehicle_Use, data = AutoCollision, family = poisson(link = "log"))
>
> loo_predict <- function(obj) {
+   yhat <- foreach(i = 1:nrow(obj$data), .combine = rbind) %dopar% {
+     predict(update(obj, data = obj$data[-i, ]), obj$data[i,], type = "response")
+   }
+   return(data.frame(result = yhat[, 1], row.names = NULL))
+ }
> # TEST CASE 1
> test1 <- loo_predict(model1)
> test1$result
 [1] 303.7393 328.7292 422.6610 375.5023 240.9785 227.6365 288.4404 446.5589
 [9] 213.9368 244.7808 278.7786 443.2256 213.9262 243.2495 266.9166 409.2565
[17] 175.0334 172.0683 197.2911 326.5685 187.2529 215.9931 249.9765 349.3873
[25] 190.1174 218.6321 243.7073 359.9631 192.3655 215.5986 233.1570 348.2781
> # TEST CASE 2
> test2 <- loo_predict(model2)
> test2$result
 [1]  11.15897  37.67273  28.76127  11.54825  50.26364 152.35489 122.23782
 [8]  44.57048 129.58158 465.84173 260.48114 107.23832 167.40672 510.41127
[15] 316.50765 121.75804 172.56928 546.25390 341.03826 134.04303 359.30141
[22] 977.29107 641.69934 251.32547 248.79229 684.86851 574.13994 238.42209
[29] 148.77733 504.12221 422.75047 167.61203

Written by statcompute

December 13, 2015 at 1:36 am

Download Federal Reserve Economic Data (FRED) with Python

In the operational loss calculation, it is important to use CPI (Consumer Price Index) adjusting historical losses. Below is an example showing how to download CPI data online directly from Federal Reserve Bank of St. Louis and then to calculate monthly and quarterly CPI adjustment factors with Python.

In [1]: import pandas_datareader.data as web

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: import datetime as dt

In [5]: # SET START AND END DATES OF THE SERIES

In [6]: sdt = dt.datetime(2000, 1, 1)

In [7]: edt = dt.datetime(2015, 9, 1)

In [8]: cpi = web.DataReader("CPIAUCNS", "fred", sdt, edt)

In [9]: cpi.head()
Out[9]:
            CPIAUCNS
DATE
2000-01-01     168.8
2000-02-01     169.8
2000-03-01     171.2
2000-04-01     171.3
2000-05-01     171.5

In [10]: df1 = pd.DataFrame({'month': [dt.datetime.strftime(i, "%Y-%m") for i in cpi.index]})

In [11]: df1['qtr'] = [str(x.year) + "-Q" + str(x.quarter) for x in cpi.index]

In [12]: df1['m_cpi'] = cpi.values

In [13]: df1.index = cpi.index

In [14]: grp = df1.groupby('qtr', as_index = False)

In [15]: df2 = grp['m_cpi'].agg({'q_cpi': np.mean})

In [16]: df3 = pd.merge(df1, df2, how = 'inner', left_on = 'qtr', right_on = 'qtr')

In [17]: maxm_cpi = np.array(df3.m_cpi)[-1]

In [18]: maxq_cpi = np.array(df3.q_cpi)[-1]

In [19]: df3['m_factor'] = maxm_cpi / df3.m_cpi

In [20]: df3['q_factor'] = maxq_cpi / df3.q_cpi

In [21]: df3.index = cpi.index

In [22]: final = df3.sort_index(ascending = False)

In [23]: final.head(12)
Out[23]:
              month      qtr    m_cpi       q_cpi  m_factor  q_factor
DATE
2015-09-01  2015-09  2015-Q3  237.945  238.305000  1.000000  1.000000
2015-08-01  2015-08  2015-Q3  238.316  238.305000  0.998443  1.000000
2015-07-01  2015-07  2015-Q3  238.654  238.305000  0.997029  1.000000
2015-06-01  2015-06  2015-Q2  238.638  237.680667  0.997096  1.002627
2015-05-01  2015-05  2015-Q2  237.805  237.680667  1.000589  1.002627
2015-04-01  2015-04  2015-Q2  236.599  237.680667  1.005689  1.002627
2015-03-01  2015-03  2015-Q1  236.119  234.849333  1.007733  1.014714
2015-02-01  2015-02  2015-Q1  234.722  234.849333  1.013731  1.014714
2015-01-01  2015-01  2015-Q1  233.707  234.849333  1.018134  1.014714
2014-12-01  2014-12  2014-Q4  234.812  236.132000  1.013343  1.009202
2014-11-01  2014-11  2014-Q4  236.151  236.132000  1.007597  1.009202
2014-10-01  2014-10  2014-Q4  237.433  236.132000  1.002156  1.009202

Written by statcompute

December 10, 2015 at 9:28 pm

Quasi-Binomial Model in SAS

Similar to quasi-Poisson regressions, quasi-binomial regressions try to address the excessive variance by the inclusion of a dispersion parameter. In addition to addressing the over-dispersion, quasi-binomial regressions also demonstrate extra values in other areas, such as LGD model development in credit risk modeling, due to its flexible distributional assumption.

Measuring the ratio between NCO and GCO, LGD could take any value in the range [0, 1] with no unanimous consensus on the distributional assumption currently in the industry. An advantage of quasi-binomial regression is that it makes no assumption of a specific distribution but merely specifies the conditional mean for a given model response. As a result, the trade-off is the lack of likelihood-based measures such as AIC and BIC.

Below is a demonstration on how to estimate a quasi-binomial model with GLIMMIX procedure in SAS.

proc glimmix data = _last_;
  model y = age number start / link = logit solution;
  _variance_ = _mu_ * (1-_mu_);
  random _residual_;
run;  
/*
              Model Information
Data Set                     WORK.KYPHOSIS   
Response Variable            y               
Response Distribution        Unknown         
Link Function                Logit           
Variance Function            _mu_ * (1-_mu_) 
Variance Matrix              Diagonal        
Estimation Technique         Quasi-Likelihood
Degrees of Freedom Method    Residual        

                       Parameter Estimates 
                         Standard
Effect       Estimate       Error       DF    t Value    Pr > |t|
Intercept     -2.0369      1.3853       77      -1.47      0.1455
age           0.01093    0.006160       77       1.77      0.0800
number         0.4106      0.2149       77       1.91      0.0598
start         -0.2065     0.06470       77      -3.19      0.0020
Residual       0.9132           .        .        .         .    
*/

For the comparison purpose, the same model is also estimated with R glm() function, showing identical outputs.

summary(glm(data = kyphosis, Kyphosis ~ ., family = quasibinomial))
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)   
#(Intercept) -2.03693    1.38527  -1.470  0.14552   
#Age          0.01093    0.00616   1.774  0.07996 . 
#Number       0.41060    0.21489   1.911  0.05975 . 
#Start       -0.20651    0.06470  -3.192  0.00205 **
#---
#(Dispersion parameter for quasibinomial family taken to be 0.913249)

Written by statcompute

November 1, 2015 at 1:20 pm