## Archive for the 'Credit Risk' Category

## Monotonic WoE Binning for LGD Models

While the monotonic binning algorithm has been widely used in scorecard and PD (Probability of Default) model developments, a similar idea can be generalized to LGD (Loss Given Default) models. In the post below, two SAS macros performing monotonic binning for LGD are demonstrated.

The first one tends to generate relatively coarse bins based on iterative grouping and therefore requires a longer computing time.

```sas
%macro lgd_bin1(data = , y = , x = );
%let maxbin = 20;

data _tmp1 (keep = x y);
  set &data;
  y = min(1, max(0, &y));
  x = &x;
run;

proc sql noprint;
  select count(distinct x) into :xflg from _last_;
quit;

%let nbin = %sysfunc(min(&maxbin, &xflg));

%if &nbin > 2 %then %do;
  %do j = &nbin %to 2 %by -1;
    proc rank data = _tmp1 groups = &j out = _data_ (keep = x rank y);
      var x;
      ranks rank;
    run;

    proc summary data = _last_ nway;
      class rank;
      output out = _tmp2 (drop = _type_ rename = (_freq_ = freq))
        sum(y) = bads mean(y) = bad_rate min(x) = minx max(x) = maxx;
    run;

    proc sql noprint;
      select case when min(bad_rate) > 0 then 1 else 0 end into :minflg from _tmp2;
      select case when max(bad_rate) < 1 then 1 else 0 end into :maxflg from _tmp2;
    quit;

    %if &minflg = 1 & &maxflg = 1 %then %do;
      proc corr data = _tmp2 spearman noprint outs = _corr;
        var minx;
        with bad_rate;
      run;

      proc sql noprint;
        select case when abs(minx) = 1 then 1 else 0 end into :cor
        from _corr where _type_ = 'CORR';
      quit;

      %if &cor = 1 %then %goto loopout;
    %end;
  %end;
%end;

%loopout:
proc sql noprint;
  create table _tmp3 as
  select
    a.rank + 1                                            as bin,
    a.minx                                                as minx,
    a.maxx                                                as maxx,
    a.freq                                                as freq,
    a.freq / b.freq                                       as dist,
    a.bad_rate                                            as avg_lgd,
    a.bads / b.bads                                       as bpct,
    (a.freq - a.bads) / (b.freq - b.bads)                 as gpct,
    log(calculated bpct / calculated gpct)                as woe,
    (calculated bpct - calculated gpct) * calculated woe  as iv
  from _tmp2 as a, (select sum(freq) as freq, sum(bads) as bads from _tmp2) as b;
quit;

proc print data = _last_ noobs label;
  var minx maxx freq dist avg_lgd woe;
  format freq comma8. dist percent10.2;
  label minx = "Lower Limit" maxx = "Upper Limit" freq = "Freq"
        dist = "Dist" avg_lgd = "Average LGD" woe = "WoE";
  sum freq dist;
run;
%mend lgd_bin1;
```

Note that the IV contribution of each bin is computed as `(bpct - gpct) * woe`, consistent with the second macro below; the original draft divided by `woe`, which was a typo.
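The WoE and IV calculations inside the macro follow the standard definitions: WoE is the log-ratio of a bin's share of bads to its share of goods, and IV sums `(bpct - gpct) * woe` across bins. Below is a hypothetical Python sketch of the same arithmetic on toy counts (the function name and the counts are made up for illustration; they are not part of the SAS macro):

```python
import math

def woe_iv(bads, goods):
    """WoE and IV from per-bin bad/good counts, following the formulas in the macro."""
    tb, tg = sum(bads), sum(goods)
    woes, iv = [], 0.0
    for b, g in zip(bads, goods):
        bpct, gpct = b / tb, g / tg   # bin's share of all bads / all goods
        woe = math.log(bpct / gpct)   # weight of evidence of the bin
        woes.append(woe)
        iv += (bpct - gpct) * woe     # the bin's information value contribution
    return woes, iv

# three toy bins: bads concentrate in the first bin, goods in the last
woes, iv = woe_iv([30, 20, 10], [20, 30, 50])
```

Each IV term is non-negative by construction, so the total IV only grows as the binning separates bads from goods more sharply.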

The second one can generate much finer bins based on the idea of isotonic regression and is more computationally efficient.

```sas
%macro lgd_bin2(data = , y = , x = );
data _data_ (keep = x y);
  set &data;
  y = min(1, max(0, &y));
  x = &x;
run;

proc transreg data = _last_ noprint;
  model identity(y) = monotone(x);
  output out = _tmp1 tip = _t;
run;

proc summary data = _last_ nway;
  class _tx;
  output out = _data_ (drop = _freq_ _type_) mean(y) = lgd;
run;

proc sort data = _last_;
  by lgd;
run;

data _tmp2;
  set _last_;
  by lgd;
  _idx = _n_;
  if lgd = 0 then _idx = _idx + 1;
  if lgd = 1 then _idx = _idx - 1;
run;

proc sql noprint;
  create table _tmp3 as
  select a.*, b._idx
  from _tmp1 as a inner join _tmp2 as b on a._tx = b._tx;

  create table _tmp4 as
  select
    min(a.x)                                              as minx,
    max(a.x)                                              as maxx,
    sum(a.y)                                              as bads,
    count(a.y)                                            as freq,
    count(a.y) / b.freq                                   as dist,
    mean(a.y)                                             as avg_lgd,
    sum(a.y) / b.bads                                     as bpct,
    sum(1 - a.y) / (b.freq - b.bads)                      as gpct,
    log(calculated bpct / calculated gpct)                as woe,
    (calculated bpct - calculated gpct) * calculated woe  as iv
  from _tmp3 as a, (select count(*) as freq, sum(y) as bads from _tmp3) as b
  group by a._idx;
quit;

proc print data = _last_ noobs label;
  var minx maxx freq dist avg_lgd woe;
  format freq comma8. dist percent10.2;
  label minx = "Lower Limit" maxx = "Upper Limit" freq = "Freq"
        dist = "Dist" avg_lgd = "Average LGD" woe = "WoE";
  sum freq dist;
run;
%mend lgd_bin2;
```

Below is the output comparison between the two macros, with the testing data downloaded from http://www.creditriskanalytics.net/datasets-private.html. Should you have any feedback, please feel free to leave me a message.

## Granular Monotonic Binning in SAS

In an earlier post (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), it was shown how to do a finer monotonic binning with isotonic regression in R.

Below is a SAS macro implementing monotonic binning with the same isotonic-regression idea. This macro is more efficient than the one shown in an earlier post (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) because it requires no iterative binning, and it is also able to significantly increase the binning granularity.

```sas
%macro monobin(data = , y = , x = );
options mprint mlogic;

data _data_ (keep = _x _y);
  set &data;
  where &y in (0, 1) and &x ~= .;
  _y = &y;
  _x = &x;
run;

proc transreg data = _last_ noprint;
  model identity(_y) = monotone(_x);
  output out = _tmp1 tip = _t;
run;

proc summary data = _last_ nway;
  class _t_x;
  output out = _data_ (drop = _freq_ _type_) mean(_y) = _rate;
run;

proc sort data = _last_;
  by _rate;
run;

data _tmp2;
  set _last_;
  by _rate;
  _idx = _n_;
  if _rate = 0 then _idx = _idx + 1;
  if _rate = 1 then _idx = _idx - 1;
run;

proc sql noprint;
  create table _tmp3 as
  select a.*, b._idx
  from _tmp1 as a inner join _tmp2 as b on a._t_x = b._t_x;

  create table _tmp4 as
  select
    a._idx,
    min(a._x)                                                as _min_x,
    max(a._x)                                                as _max_x,
    sum(a._y)                                                as _bads,
    count(a._y)                                              as _freq,
    mean(a._y)                                               as _rate,
    sum(a._y) / b.bads                                       as _bpct,
    sum(1 - a._y) / (b.freq - b.bads)                        as _gpct,
    log(calculated _bpct / calculated _gpct)                 as _woe,
    (calculated _bpct - calculated _gpct) * calculated _woe  as _iv
  from _tmp3 as a, (select count(*) as freq, sum(_y) as bads from _tmp3) as b
  group by a._idx;
quit;

title "Monotonic WoE Binning for %upcase(%trim(&x))";
proc print data = _last_ label noobs;
  var _min_x _max_x _bads _freq _rate _woe _iv;
  label _min_x = "Lower" _max_x = "Upper" _bads = "#Bads" _freq = "#Freq"
        _rate = "BadRate" _woe = "WoE" _iv = "IV";
  sum _bads _freq _iv;
run;
title;
%mend monobin;

Below is the sample output for LTV, showing an identical binning scheme to the one generated by the R isobin() function.

## Finer Monotonic Binning Based on Isotonic Regression

In my earlier post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple of potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable might tend to be low. Secondly, the binned variable might not be granular enough to reflect the nature of the data.

In light of the aforementioned, I drafted an improved function isobin() based on the isotonic regression (https://en.wikipedia.org/wiki/Isotonic_regression), as shown below.

```r
isobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  d2 <- d1[!is.na(d1[x]), ]
  c <- cor(d2[, 2], d2[, 1], method = "spearman", use = "complete.obs")
  reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1])
  k <- knots(as.stepfun(reg))
  sm1 <- smbinning.custom(d1, y, x, k)
  c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint)
  c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, " "))))
  c3 <- c2[!is.na(c2)]
  return(smbinning.custom(d1, y, x, c3[-length(c3)]))
}
```
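The mechanics behind isoreg() can be illustrated with a plain pool-adjacent-violators (PAV) routine: adjacent observations whose means violate monotonicity are pooled into blocks, and the block boundaries (the knots of the fitted step function) become candidate cut points. Below is a hypothetical Python sketch of PAV for an assumed increasing relationship; the function and variable names are made up for illustration:

```python
def pav(x, y):
    """Pool-adjacent-violators: fit a nondecreasing step function of y on x."""
    pairs = sorted(zip(x, y))
    merged = []  # each block: [sum_y, count, xs]
    for px, py in pairs:
        merged.append([py, 1, [px]])
        # pool backwards while an earlier block's mean is not strictly below;
        # pooling ties as well guarantees strictly increasing block means
        while len(merged) > 1 and merged[-2][0] / merged[-2][1] >= merged[-1][0] / merged[-1][1]:
            s, n, xs = merged.pop()
            merged[-1][0] += s
            merged[-1][1] += n
            merged[-1][2] += xs
    cuts = [max(b[2]) for b in merged[:-1]]  # upper bound of each block but the last
    fits = [b[0] / b[1] for b in merged]     # block means, strictly increasing
    return cuts, fits

x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 1, 0, 1, 1]
cuts, fits = pav(x, y)
```

The resulting cut points partition x into bins whose bad rates are strictly monotonic by construction, which is exactly why the isotonic approach needs no iterative re-binning.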

Compared with the legacy monobin(), the isobin() function is able to significantly increase the binning granularity as well as moderately improve the Information Value.

**LTV Binning with isobin() Function**

```
   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <=  46     81      78      3        81         78         3 0.0139   0.9630  0.0370 26.0000 3.2581  1.9021 0.0272
2    <=  71    312     284     28       393        362        31 0.0535   0.9103  0.0897 10.1429 2.3168  0.9608 0.0363
3    <=  72     22      20      2       415        382        33 0.0038   0.9091  0.0909 10.0000 2.3026  0.9466 0.0025
4    <=  73     27      24      3       442        406        36 0.0046   0.8889  0.1111  8.0000 2.0794  0.7235 0.0019
5    <=  81    303     268     35       745        674        71 0.0519   0.8845  0.1155  7.6571 2.0356  0.6797 0.0194
6    <=  83    139     122     17       884        796        88 0.0238   0.8777  0.1223  7.1765 1.9708  0.6149 0.0074
7    <=  90    631     546     85      1515       1342       173 0.1081   0.8653  0.1347  6.4235 1.8600  0.5040 0.0235
8    <=  94    529     440     89      2044       1782       262 0.0906   0.8318  0.1682  4.9438 1.5981  0.2422 0.0049
9    <=  95    145     119     26      2189       1901       288 0.0248   0.8207  0.1793  4.5769 1.5210  0.1651 0.0006
10   <= 100    907     709    198      3096       2610       486 0.1554   0.7817  0.2183  3.5808 1.2756 -0.0804 0.0010
11   <= 101    195     151     44      3291       2761       530 0.0334   0.7744  0.2256  3.4318 1.2331 -0.1229 0.0005
12   <= 110   1217     934    283      4508       3695       813 0.2085   0.7675  0.2325  3.3004 1.1940 -0.1619 0.0057
13   <= 112    208     158     50      4716       3853       863 0.0356   0.7596  0.2404  3.1600 1.1506 -0.2054 0.0016
14   <= 115    253     183     70      4969       4036       933 0.0433   0.7233  0.2767  2.6143 0.9610 -0.3950 0.0075
15   <= 136    774     548    226      5743       4584      1159 0.1326   0.7080  0.2920  2.4248 0.8857 -0.4702 0.0333
16   <= 138     27      18      9      5770       4602      1168 0.0046   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0024
17    > 138     66      39     27      5836       4641      1195 0.0113   0.5909  0.4091  1.4444 0.3677 -0.9882 0.0140
18  Missing      1       0      1      5837       4641      1196 0.0002   0.0000  1.0000  0.0000   -Inf    -Inf    Inf
19    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.1897
```

**LTV Binning with monobin() Function**

```
  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
1   <=  85   1025     916    109      1025        916       109 0.1756   0.8937  0.1063 8.4037 2.1287  0.7727 0.0821
2   <=  94   1019     866    153      2044       1782       262 0.1746   0.8499  0.1501 5.6601 1.7334  0.3775 0.0221
3   <= 100   1052     828    224      3096       2610       486 0.1802   0.7871  0.2129 3.6964 1.3074 -0.0486 0.0004
4   <= 105    808     618    190      3904       3228       676 0.1384   0.7649  0.2351 3.2526 1.1795 -0.1765 0.0045
5   <= 114    985     748    237      4889       3976       913 0.1688   0.7594  0.2406 3.1561 1.1493 -0.2066 0.0076
6    > 114    947     665    282      5836       4641      1195 0.1622   0.7022  0.2978 2.3582 0.8579 -0.4981 0.0461
7  Missing      1       0      1      5837       4641      1196 0.0002   0.0000  1.0000 0.0000   -Inf    -Inf    Inf
8    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.1628
```

**Bureau_Score Binning with isobin() Function**

```
   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 491      4       1      3         4          1         3 0.0007   0.2500  0.7500  0.3333 -1.0986 -2.4546 0.0056
2    <= 532     24       9     15        28         10        18 0.0041   0.3750  0.6250  0.6000 -0.5108 -1.8668 0.0198
3    <= 559     51      24     27        79         34        45 0.0087   0.4706  0.5294  0.8889 -0.1178 -1.4737 0.0256
4    <= 560      2       1      1        81         35        46 0.0003   0.5000  0.5000  1.0000  0.0000 -1.3559 0.0008
5    <= 572     34      17     17       115         52        63 0.0058   0.5000  0.5000  1.0000  0.0000 -1.3559 0.0143
6    <= 602    153      84     69       268        136       132 0.0262   0.5490  0.4510  1.2174  0.1967 -1.1592 0.0459
7    <= 605     56      31     25       324        167       157 0.0096   0.5536  0.4464  1.2400  0.2151 -1.1408 0.0162
8    <= 606     14       8      6       338        175       163 0.0024   0.5714  0.4286  1.3333  0.2877 -1.0683 0.0035
9    <= 607     17      10      7       355        185       170 0.0029   0.5882  0.4118  1.4286  0.3567 -0.9993 0.0037
10   <= 632    437     261    176       792        446       346 0.0749   0.5973  0.4027  1.4830  0.3940 -0.9619 0.0875
11   <= 639    150      95     55       942        541       401 0.0257   0.6333  0.3667  1.7273  0.5465 -0.8094 0.0207
12   <= 653    451     300    151      1393        841       552 0.0773   0.6652  0.3348  1.9868  0.6865 -0.6694 0.0412
13   <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
14   <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
15   <= 667     57      44     13      1845       1175       670 0.0098   0.7719  0.2281  3.3846  1.2192 -0.1367 0.0002
16   <= 677    381     300     81      2226       1475       751 0.0653   0.7874  0.2126  3.7037  1.3093 -0.0466 0.0001
17   <= 679     66      53     13      2292       1528       764 0.0113   0.8030  0.1970  4.0769  1.4053  0.0494 0.0000
18   <= 683    160     129     31      2452       1657       795 0.0274   0.8062  0.1938  4.1613  1.4258  0.0699 0.0001
19   <= 689    203     164     39      2655       1821       834 0.0348   0.8079  0.1921  4.2051  1.4363  0.0804 0.0002
20   <= 699    304     249     55      2959       2070       889 0.0521   0.8191  0.1809  4.5273  1.5101  0.1542 0.0012
21   <= 707    312     268     44      3271       2338       933 0.0535   0.8590  0.1410  6.0909  1.8068  0.4509 0.0094
22   <= 717    368     318     50      3639       2656       983 0.0630   0.8641  0.1359  6.3600  1.8500  0.4941 0.0132
23   <= 721    134     119     15      3773       2775       998 0.0230   0.8881  0.1119  7.9333  2.0711  0.7151 0.0094
24   <= 723     49      44      5      3822       2819      1003 0.0084   0.8980  0.1020  8.8000  2.1748  0.8188 0.0043
25   <= 739    425     394     31      4247       3213      1034 0.0728   0.9271  0.0729 12.7097  2.5424  1.1864 0.0700
26   <= 746    166     154     12      4413       3367      1046 0.0284   0.9277  0.0723 12.8333  2.5520  1.1961 0.0277
27   <= 756    234     218     16      4647       3585      1062 0.0401   0.9316  0.0684 13.6250  2.6119  1.2560 0.0422
28   <= 761    110     104      6      4757       3689      1068 0.0188   0.9455  0.0545 17.3333  2.8526  1.4967 0.0260
29   <= 763     46      44      2      4803       3733      1070 0.0079   0.9565  0.0435 22.0000  3.0910  1.7351 0.0135
30   <= 767     96      92      4      4899       3825      1074 0.0164   0.9583  0.0417 23.0000  3.1355  1.7795 0.0293
31   <= 772     77      74      3      4976       3899      1077 0.0132   0.9610  0.0390 24.6667  3.2055  1.8495 0.0249
32   <= 787    269     260      9      5245       4159      1086 0.0461   0.9665  0.0335 28.8889  3.3635  2.0075 0.0974
33   <= 794     95      93      2      5340       4252      1088 0.0163   0.9789  0.0211 46.5000  3.8395  2.4835 0.0456
34    > 794    182     179      3      5522       4431      1091 0.0312   0.9835  0.0165 59.6667  4.0888  2.7328 0.0985
35  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
36    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8357
```

**Bureau_Score Binning with monobin() Function**

```
   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
10   <= 775    492     465     27      5023       3945      1078 0.0843   0.9451  0.0549 17.2222 2.8462  1.4903 0.1157
11    > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810
```

## Monotonic Binning with Smbinning Package

The R package smbinning (http://www.scoringmodeling.com/rpackage/smbinning) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in scorecard development. However, there are several improvement opportunities in my view:

1. First of all, the underlying algorithm in the smbinning() function utilizes recursive partitioning, which does not necessarily guarantee monotonicity.

2. Secondly, the distribution of observations across the generated bins is uneven; the frequency in some bins can be much higher than in others.

3. Lastly, the function might not provide a binning outcome for some variables due to the lack of statistical significance.

In light of the above, I wrote an enhanced version by utilizing the smbinning.custom() function, as shown below. The idea is very simple. Within the repeat loop, we bin the variable iteratively until a certain criterion is met and then feed the list of cut points into the smbinning.custom() function. As a result, we are able to achieve a set of monotonic bins with similar frequencies regardless of the so-called "statistical significance", which in my view is a premature consideration at the variable transformation stage.

```r
monobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  n <- min(20, nrow(unique(d1[x])))
  repeat {
    d1$bin <- Hmisc::cut2(d1[, x], g = n)
    d2 <- aggregate(d1[-3], d1[3], mean)
    c <- cor(d2[-1], method = "spearman")
    if (abs(c[1, 2]) == 1 | n == 2) break
    n <- n - 1
  }
  d3 <- aggregate(d1[-3], d1[3], max)
  cuts <- d3[-length(d3[, 3]), 3]
  return(smbinning::smbinning.custom(d1, y, x, cuts))
}
```
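The repeat loop above shrinks the number of equal-frequency bins until the bin-level bad rates are strictly monotonic, which is what a Spearman correlation of ±1 between the bin means amounts to. Below is a hypothetical Python sketch of the same loop; the function name and the stopping logic mirror the R code but are otherwise made up for illustration:

```python
def monotonic_quantile_bins(x, y, max_bins=20):
    """Shrink the equal-frequency bin count until bin-level means of y are strictly monotonic in x."""
    pairs = sorted(zip(x, y))
    for n in range(min(max_bins, len(set(x))), 1, -1):
        size = len(pairs) / n
        # equal-frequency grouping into n bins
        bins = [pairs[int(i * size):int((i + 1) * size)] for i in range(n)]
        rates = [sum(v for _, v in b) / len(b) for b in bins]
        inc = all(a < b for a, b in zip(rates, rates[1:]))
        dec = all(a > b for a, b in zip(rates, rates[1:]))
        if inc or dec or n == 2:
            # cut points: the maximum x in each bin except the last
            cuts = [max(k for k, _ in b) for b in bins[:-1]]
            return cuts, rates

# a strictly increasing toy relationship keeps all 20 bins
cuts, rates = monotonic_quantile_bins(list(range(100)), [i / 100 for i in range(100)])
```

With a strictly monotonic relationship the loop exits on the first pass; with a noisy relationship it coarsens the grid until monotonicity holds, which is exactly why this approach tends to produce fewer bins than the isotonic one.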

Below are a couple of comparisons between the generic smbinning() and the home-brew monobin() functions, using a toy dataset.

In the first example, we applied the smbinning() function to a variable named "rev_util". As shown in the "BadRate" column, the binning outcome is not monotonic.

```
  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <=  0    965     716    249       965        716       249 0.1653   0.7420  0.2580  2.8755 1.0562 -0.2997 0.0162
2    <=  5    522     496     26      1487       1212       275 0.0894   0.9502  0.0498 19.0769 2.9485  1.5925 0.1356
3    <= 24   1166    1027    139      2653       2239       414 0.1998   0.8808  0.1192  7.3885 1.9999  0.6440 0.0677
4    <= 40    779     651    128      3432       2890       542 0.1335   0.8357  0.1643  5.0859 1.6265  0.2705 0.0090
5    <= 73   1188     932    256      4620       3822       798 0.2035   0.7845  0.2155  3.6406 1.2922 -0.0638 0.0008
6    <= 96    684     482    202      5304       4304      1000 0.1172   0.7047  0.2953  2.3861 0.8697 -0.4863 0.0316
7     > 96    533     337    196      5837       4641      1196 0.0913   0.6323  0.3677  1.7194 0.5420 -0.8140 0.0743
8  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN     NaN    NaN     NaN    NaN
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.3352
```

Next, we did the same with the monobin() function. As shown below, the algorithm provided a monotonic binning at the cost of granularity. Albeit coarse, the result is directionally correct with no inversion.

```
  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
1    <= 30   2962    2495    467      2962       2495       467 0.5075   0.8423  0.1577 5.3426 1.6757  0.3198 0.0471
2     > 30   2875    2146    729      5837       4641      1196 0.4925   0.7464  0.2536 2.9438 1.0797 -0.2763 0.0407
3  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN    NaN    NaN     NaN    NaN
4    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.0878
```

In the second example, we applied the smbinning() function to a variable named "bureau_score". As shown below, the frequencies in two of the bins are much higher than in the rest.

```
  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1   <= 605    324     167    157       324        167       157 0.0555   0.5154  0.4846  1.0637 0.0617 -1.2942 0.1233
2   <= 632    468     279    189       792        446       346 0.0802   0.5962  0.4038  1.4762 0.3895 -0.9665 0.0946
3   <= 662    896     608    288      1688       1054       634 0.1535   0.6786  0.3214  2.1111 0.7472 -0.6087 0.0668
4   <= 699   1271    1016    255      2959       2070       889 0.2177   0.7994  0.2006  3.9843 1.3824  0.0264 0.0002
5   <= 717    680     586     94      3639       2656       983 0.1165   0.8618  0.1382  6.2340 1.8300  0.4741 0.0226
6   <= 761   1118    1033     85      4757       3689      1068 0.1915   0.9240  0.0760 12.1529 2.4976  1.1416 0.1730
7    > 761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609 3.4739  2.1179 0.2979
8  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.8066
```

With the monobin() function applied to the same variable, we were able to get a set of more granular bins with similar frequencies.

```
   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
10   <= 775    492     465     27      5023       3945      1078 0.0843   0.9451  0.0549 17.2222 2.8462  1.4903 0.1157
11    > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810
```

## Scorecard Development with Data from Multiple Sources

This week, one of my friends asked me a very interesting and practical question about scorecard development. The model development data were collected from multiple independent sources with various data sizes, heterogeneous risk profiles, and different bad rates. While the performance statistics seem satisfactory on the model training dataset, the model doesn't generalize well to new accounts that might come from an unknown source. The situation is very common in a consulting company, where a risk or marketing model is sometimes developed with data from multiple organizations.

To better understand the issue, I simulated a dataset consisting of two groups. In the dataset, while x0 and x1 govern the group segmentation, x2 and x3 define the bad definition. It is important to point out that the group information "grp" is only known in the model development sample but is unknown in the production population. Therefore, the variable "grp", albeit predictive, cannot be explicitly used in the model estimation.

```sas
data one;
  do i = 1 to 100000;
    x0 = ranuni(0);
    x1 = ranuni(1);
    x2 = ranuni(2);
    x3 = ranuni(3);
    if 1 + x0 * 2 + x1 * 4 + rannor(1) > 5 then do;
      grp = 1;
      if x2 * 2 + x3 * 4 + rannor(2) > 5 then bad = 1;
      else bad = 0;
    end;
    else do;
      grp = 0;
      if x2 * 4 + x3 * 2 + rannor(3) > 4 then bad = 1;
      else bad = 0;
    end;
    output;
  end;
run;
```

Our first approach is to use all variables x0 - x3 to build a logistic regression and then evaluate the model both on the overall sample and by group.

```sas
proc logistic data = one desc noprint;
  model bad = x0 x1 x2 x3;
  score data = one out = mdl1 (rename = (p_1 = score1));
run;
```

```
GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1
MAXIMUM KS = 59.5763 AT SCORE POINT 0.2281
( AUC STATISTICS = 0.8800, GINI COEFFICIENT = 0.7599, DIVERGENCE = 2.6802 )

         MIN     MAX    GOOD     BAD   TOTAL     BAD  CUMULATIVE      BAD  CUMU. BAD
       SCORE   SCORE       #       #       #    RATE    BAD RATE  PERCENT    PERCENT
 -----------------------------------------------------------------------------------
 BAD  0.6800  0.9699   2,057   7,943  10,000  79.43%      79.43%   33.81%     33.81%
  |   0.4679  0.6799   4,444   5,556  10,000  55.56%      67.50%   23.65%     57.46%
  |   0.3094  0.4679   6,133   3,867  10,000  38.67%      57.89%   16.46%     73.92%
  |   0.1947  0.3094   7,319   2,681  10,000  26.81%      50.12%   11.41%     85.33%
  |   0.1181  0.1946   8,364   1,636  10,000  16.36%      43.37%    6.96%     92.29%
  |   0.0690  0.1181   9,044     956  10,000   9.56%      37.73%    4.07%     96.36%
  |   0.0389  0.0690   9,477     523  10,000   5.23%      33.09%    2.23%     98.59%
  |   0.0201  0.0389   9,752     248  10,000   2.48%      29.26%    1.06%     99.64%
  V   0.0085  0.0201   9,925      75  10,000   0.75%      26.09%    0.32%     99.96%
 GOOD 0.0005  0.0085   9,991       9  10,000   0.09%      23.49%    0.04%    100.00%
      ========== ========== ========== ========== ==========
      0.0005  0.9699  76,506  23,494 100,000

GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 0))
MAXIMUM KS = 61.0327 AT SCORE POINT 0.2457
( AUC STATISTICS = 0.8872, GINI COEFFICIENT = 0.7744, DIVERGENCE = 2.8605 )

         MIN     MAX    GOOD     BAD   TOTAL     BAD  CUMULATIVE      BAD  CUMU. BAD
       SCORE   SCORE       #       #       #    RATE    BAD RATE  PERCENT    PERCENT
 -----------------------------------------------------------------------------------
 BAD  0.7086  0.9699   1,051   6,162   7,213  85.43%      85.43%   30.51%     30.51%
  |   0.5019  0.7086   2,452   4,762   7,214  66.01%      75.72%   23.58%     54.10%
  |   0.3407  0.5019   3,710   3,504   7,214  48.57%      66.67%   17.35%     71.45%
  |   0.2195  0.3406   4,696   2,517   7,213  34.90%      58.73%   12.46%     83.91%
  |   0.1347  0.2195   5,650   1,564   7,214  21.68%      51.32%    7.74%     91.66%
  |   0.0792  0.1347   6,295     919   7,214  12.74%      44.89%    4.55%     96.21%
  |   0.0452  0.0792   6,737     476   7,213   6.60%      39.42%    2.36%     98.56%
  |   0.0234  0.0452   7,000     214   7,214   2.97%      34.86%    1.06%     99.62%
  V   0.0099  0.0234   7,150      64   7,214   0.89%      31.09%    0.32%     99.94%
 GOOD 0.0007  0.0099   7,201      12   7,213   0.17%      27.99%    0.06%    100.00%
      ========== ========== ========== ========== ==========
      0.0007  0.9699  51,942  20,194  72,136

GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 1))
MAXIMUM KS = 53.0942 AT SCORE POINT 0.2290
( AUC STATISTICS = 0.8486, GINI COEFFICIENT = 0.6973, DIVERGENCE = 2.0251 )

         MIN     MAX    GOOD     BAD   TOTAL     BAD  CUMULATIVE      BAD  CUMU. BAD
       SCORE   SCORE       #       #       #    RATE    BAD RATE  PERCENT    PERCENT
 -----------------------------------------------------------------------------------
 BAD  0.5863  0.9413   1,351   1,435   2,786  51.51%      51.51%   43.48%     43.48%
  |   0.3713  0.5862   2,136     651   2,787  23.36%      37.43%   19.73%     63.21%
  |   0.2299  0.3712   2,340     446   2,786  16.01%      30.29%   13.52%     76.73%
  |   0.1419  0.2298   2,525     262   2,787   9.40%      25.07%    7.94%     84.67%
  |   0.0832  0.1419   2,584     202   2,786   7.25%      21.50%    6.12%     90.79%
  |   0.0480  0.0832   2,643     144   2,787   5.17%      18.78%    4.36%     95.15%
  |   0.0270  0.0480   2,682     104   2,786   3.73%      16.63%    3.15%     98.30%
  |   0.0140  0.0270   2,741      46   2,787   1.65%      14.76%    1.39%     99.70%
  V   0.0058  0.0140   2,776      10   2,786   0.36%      13.16%    0.30%    100.00%
 GOOD 0.0005  0.0058   2,786       0   2,786   0.00%      11.84%    0.00%    100.00%
      ========== ========== ========== ========== ==========
      0.0005  0.9413  24,564   3,300  27,864
```

As shown in the above output, while the overall model performance looks OK, the model doesn't generalize well for the 2nd group, which has a smaller sample size. While the overall KS could be as high as 60, the KS for the 2nd group is merely 53. The reason is that the overall model performance is heavily influenced by the dataset from the 1st group with the larger size. Therefore, the estimated model is biased toward the risk profile reflected in the 1st group.

To alleviate the bias in the first model, we could first introduce a look-alike model driven by x0 - x1 with the purpose of profiling the group, and then build two separate risk models with x2 - x3 only, for the 1st and 2nd groups respectively. As a result, the final predicted probability should be the composite of all three sub-models, as shown below. The model evaluation is also provided for comparison with the first model.

```sas
proc logistic data = one desc noprint;
  where grp = 0;
  model bad = x2 x3;
  score data = one out = mdl20(rename = (p_1 = p_10));
run;

proc logistic data = one desc noprint;
  where grp = 1;
  model bad = x2 x3;
  score data = one out = mdl21(rename = (p_1 = p_11));
run;

proc logistic data = one desc noprint;
  model grp = x0 x1;
  score data = one out = seg;
run;

data mdl2;
  merge seg mdl20 mdl21;
  by i;
  score2 = p_10 * (1 - p_1) + p_11 * p_1;
run;
```

```
GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2
MAXIMUM KS = 60.6234 AT SCORE POINT 0.2469
( AUC STATISTICS = 0.8858, GINI COEFFICIENT = 0.7715, DIVERGENCE = 2.8434 )

         MIN     MAX    GOOD     BAD   TOTAL     BAD  CUMULATIVE      BAD  CUMU. BAD
       SCORE   SCORE       #       #       #    RATE    BAD RATE  PERCENT    PERCENT
 -----------------------------------------------------------------------------------
 BAD  0.6877  0.9677   2,011   7,989  10,000  79.89%      79.89%   34.00%     34.00%
  |   0.4749  0.6876   4,300   5,700  10,000  57.00%      68.45%   24.26%     58.27%
  |   0.3125  0.4748   6,036   3,964  10,000  39.64%      58.84%   16.87%     75.14%
  |   0.1932  0.3124   7,451   2,549  10,000  25.49%      50.51%   10.85%     85.99%
  |   0.1142  0.1932   8,379   1,621  10,000  16.21%      43.65%    6.90%     92.89%
  |   0.0646  0.1142   9,055     945  10,000   9.45%      37.95%    4.02%     96.91%
  |   0.0345  0.0646   9,533     467  10,000   4.67%      33.19%    1.99%     98.90%
  |   0.0166  0.0345   9,800     200  10,000   2.00%      29.29%    0.85%     99.75%
  V   0.0062  0.0166   9,946      54  10,000   0.54%      26.10%    0.23%     99.98%
 GOOD 0.0001  0.0062   9,995       5  10,000   0.05%      23.49%    0.02%    100.00%
      ========== ========== ========== ========== ==========
      0.0001  0.9677  76,506  23,494 100,000

GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 0))
MAXIMUM KS = 61.1591 AT SCORE POINT 0.2458
( AUC STATISTICS = 0.8880, GINI COEFFICIENT = 0.7759, DIVERGENCE = 2.9130 )

         MIN     MAX    GOOD     BAD   TOTAL     BAD  CUMULATIVE      BAD  CUMU. BAD
       SCORE   SCORE       #       #       #    RATE    BAD RATE  PERCENT    PERCENT
 -----------------------------------------------------------------------------------
 BAD  0.7221  0.9677   1,075   6,138   7,213  85.10%      85.10%   30.40%     30.40%
  |   0.5208  0.7221   2,436   4,778   7,214  66.23%      75.66%   23.66%     54.06%
  |   0.3533  0.5208   3,670   3,544   7,214  49.13%      66.82%   17.55%     71.61%
  |   0.2219  0.3532   4,726   2,487   7,213  34.48%      58.73%   12.32%     83.92%
  |   0.1309  0.2219   5,617   1,597   7,214  22.14%      51.41%    7.91%     91.83%
  |   0.0731  0.1309   6,294     920   7,214  12.75%      44.97%    4.56%     96.39%
  |   0.0387  0.0731   6,762     451   7,213   6.25%      39.44%    2.23%     98.62%
  |   0.0189  0.0387   7,009     205   7,214   2.84%      34.86%    1.02%     99.63%
  V   0.0074  0.0189   7,152      62   7,214   0.86%      31.09%    0.31%     99.94%
 GOOD 0.0002  0.0073   7,201      12   7,213   0.17%      27.99%    0.06%    100.00%
      ========== ========== ========== ========== ==========
      0.0002  0.9677  51,942  20,194  72,136

GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 1))
MAXIMUM KS = 57.6788 AT SCORE POINT 0.1979
( AUC STATISTICS = 0.8717, GINI COEFFICIENT = 0.7434, DIVERGENCE = 2.4317 )

         MIN     MAX    GOOD     BAD   TOTAL     BAD  CUMULATIVE      BAD  CUMU. BAD
       SCORE   SCORE       #       #       #    RATE    BAD RATE  PERCENT    PERCENT
 -----------------------------------------------------------------------------------
 BAD  0.5559  0.9553   1,343   1,443   2,786  51.79%      51.79%   43.73%     43.73%
  |   0.3528  0.5559   2,001     786   2,787  28.20%      40.00%   23.82%     67.55%
  |   0.2213  0.3528   2,364     422   2,786  15.15%      31.71%   12.79%     80.33%
  |   0.1372  0.2213   2,513     274   2,787   9.83%      26.24%    8.30%     88.64%
  |   0.0840  0.1372   2,588     198   2,786   7.11%      22.42%    6.00%     94.64%
  |   0.0484  0.0840   2,683     104   2,787   3.73%      19.30%    3.15%     97.79%
  |   0.0256  0.0483   2,729      57   2,786   2.05%      16.84%    1.73%     99.52%
  |   0.0118  0.0256   2,776      11   2,787   0.39%      14.78%    0.33%     99.85%
  V   0.0040  0.0118   2,781       5   2,786   0.18%      13.16%    0.15%    100.00%
 GOOD 0.0001  0.0040   2,786       0   2,786   0.00%      11.84%    0.00%    100.00%
      ========== ========== ========== ========== ==========
      0.0001  0.9553  24,564   3,300  27,864
```
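The composite in the data step, score2 = p_10 * (1 - p_1) + p_11 * p_1, is simply a mixture of the two group-level risk scores weighted by the predicted group membership. A hypothetical Python sketch of the scoring logic (the function name and inputs are made up for illustration):

```python
def composite_score(p_grp1, p_bad_grp0, p_bad_grp1):
    """Blend two group-level bad probabilities with the predicted group membership."""
    return p_bad_grp0 * (1 - p_grp1) + p_bad_grp1 * p_grp1

# an account very likely from group 0 keeps (mostly) the group-0 risk score
s = composite_score(0.05, 0.30, 0.10)
```

Because the weight comes from the look-alike model rather than the unobservable "grp" flag, the same scoring formula can be applied in production where the source is unknown.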

After comparing KS statistics from the two modeling approaches, we can see that, while the performance of the 2nd approach on the overall sample is only slightly better than that of the 1st approach, the KS on the 2nd group with a smaller size, i.e. grp = 1, increases from 53 up to 58, an 8.6% improvement. While the example covers just two groups, it is trivial to generalize to cases with more than two groups.

## Risk Models with Generalized PLS

While developing risk models with hundreds of potential variables, we often run into the situation where risk characteristics or macro-economic indicators are highly correlated, namely multicollinearity. In such cases, we might have to drop variables with high VIFs or employ "variable shrinkage" methods, e.g. lasso or ridge, to suppress variables with collinearity.
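As a quick reminder of why the VIF screen mentioned above matters: in the two-predictor case the VIF collapses to 1 / (1 - r²), where r is the pairwise correlation, so near-collinear pairs blow the VIF up very quickly. Below is a hypothetical Python sketch (function names and toy data are made up for illustration):

```python
def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_vars(x1, x2):
    """VIF of either predictor in the two-variable case: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]  # nearly collinear with x1
```

With more than two predictors, the same idea generalizes via the R² of regressing each predictor on the others, which is what standard VIF routines compute.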

Feature extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns around model interpretability and implementation. In the example below, it is shown that there shouldn't be any hurdle in model implementation, e.g. scoring, given that coefficients can be extracted from a GPLS model in a similar way as from a GLM model. In addition, compared with a GLM with 8 variables, a GPLS with only 5 components is able to provide comparable performance on the hold-out testing data.

**R Code**

```r
library(gpls)
library(pROC)

df1 <- read.csv("credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)]

set.seed(2016)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, ]
test <- df2[-sample, ]

m1 <- glm(DEFAULT ~ ., data = train, family = "binomial")
cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response")))
cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response")))

m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5)
cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1]))
cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1]))

cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n")
print(data.frame(glm = m1$coefficients, gpls = m2$coefficients))
```

**Output**

```
### ROC OF GLM PREDICTION WITH TRAINING DATA ###
Call:
roc.default(response = train$DEFAULT, predictor = predict(m1, newdata = train, type = "response"))
Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6641

### ROC OF GLM PREDICTION WITH TESTING DATA ###
Call:
roc.default(response = test$DEFAULT, predictor = predict(m1, newdata = test, type = "response"))
Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6537

### ROC OF GPLS PREDICTION WITH TRAINING DATA ###
Call:
roc.default(response = train$DEFAULT, predictor = predict(m2, newdata = train)$predicted[, 1])
Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6627

### ROC OF GPLS PREDICTION WITH TESTING DATA ###
Call:
roc.default(response = test$DEFAULT, predictor = predict(m2, newdata = test)$predicted[, 1])
Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6542

### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###
                      glm          gpls
(Intercept) -0.1940785071 -0.1954618828
AGE         -0.0122709412 -0.0147883358
ACADMOS      0.0005302022  0.0003671781
ADEPCNT      0.1090667092  0.1352491711
MAJORDRG     0.0757313171  0.0813835741
MINORDRG     0.2621574192  0.2547176301
OWNRENT     -0.2803919685 -0.1032119571
INCOME      -0.0004222914 -0.0004531543
LOGSPEND    -0.1688395555 -0.1525963363
```
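Since GPLS returns coefficients on the original predictors, scoring a new record is identical to scoring a GLM: apply the logistic transform to the linear predictor built from the intercept and coefficients. A hypothetical Python sketch (the function name is made up; the numeric inputs are illustrative, loosely patterned on the coefficient table above, not an actual scored record):

```python
import math

def logistic_score(intercept, coefs, x):
    """Score a record from an intercept and coefficient vector, exactly as with a GLM."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

# illustrative: intercept plus coefficients for two predictors, e.g. an age and a flag
p = logistic_score(-0.195, [-0.0148, 0.1352], [45, 1])
```

This is why the interpretability concern is the real hurdle for PLS-type models, not the implementation: the scoring equation has exactly the GLM form.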

## Prediction Intervals for Poisson Regression

Different from the confidence interval, which addresses the uncertainty around the conditional mean, the prediction interval accommodates the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk arising from the aforementioned uncertainties.

While calculating the prediction interval of an OLS regression based on the Gaussian distributional assumption is relatively straightforward with the off-the-shelf solution in R, it can be more complicated in a Generalized Linear Model, e.g. Poisson regression. In this post, I am going to show two empirical methods, one based on bootstrapping and the other based on simulation, for calculating the prediction interval of a Poisson regression. Because of the high computing cost, parallelism with the foreach() function will be used to improve the efficiency.
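For reference, the off-the-shelf OLS solution mentioned above is the `interval` argument of `predict.lm()`. A minimal sketch with a built-in dataset (mtcars is used here purely for illustration, not the post's data):

```r
# Fit a simple OLS regression on a built-in dataset.
fit <- lm(mpg ~ wt, data = mtcars)
new <- data.frame(wt = c(2.5, 3.5))

# Confidence interval for the conditional mean ...
predict(fit, newdata = new, interval = "confidence", level = 0.95)
# ... versus the prediction interval for a new observation.
predict(fit, newdata = new, interval = "prediction", level = 0.95)
```

For each new point, the prediction interval is wider than the confidence interval, since it also reflects the residual variance. `predict.glm()` offers no such argument, which is why the empirical methods below are needed.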

First of all, let’s estimate a Poisson regression with glm() and generate a couple of fake new data points to calculate model predictions. Since the toy data is very small, with only 32 records and all categorical predictors, I doubled the sample size with rbind() to ensure appropriate data coverage in the bootstrapping.

```r
pkgs <- c('doParallel', 'foreach')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
new_fake <- df[1:5, 1:2]
```

The first method, shown below, is based on bootstrapping with the following steps:

1. Bootstrap the original model development sample by random sampling with replacement;

2. Repeat the above many times, e.g. 1,000, to generate different bootstrapped samples;

3. Refit the model with each bootstrapped sample;

4. Generate predictions for the new data points, e.g. “new_fake”, with each refitted model;

5. Generate Poisson random numbers with the mean, i.e. lambda, equal to the predicted values from the refitted models;

6. Collect all Poisson random numbers from the previous step and calculate the percentiles.

```r
boot_pi <- function(model, pdata, n, p) {
  odata <- model$data
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ]
    bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"),
                    lower = boot_ci[, 1], upper = boot_ci[, 2]))
}

boot_pi(mdl, new_fake, 1000, 0.95)
#       pred lower upper
# 1 12.63040     6    21
# 2 38.69738    25    55
# 3 26.97271    16    39
# 4 10.69951     4    18
# 5 52.50839    35    70
```

The second method is based on simulation and outlined below:

1. Reproduce the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to the predicted values from the original model;

2. Repeat the above simulation many times, e.g. 1,000, to generate many response series;

3. Generate 1,000 updated model samples by replacing the original response with the new responses generated from the simulations;

4. Refit the model with each updated sample;

5. Generate predictions for the new data points, e.g. “new_fake”, with each refitted model;

6. Generate Poisson random numbers with lambda equal to the predicted values from the refitted models;

7. Collect all Poisson random numbers from the previous step and calculate the percentiles.

```r
sim_pi <- function(model, pdata, n, p) {
  odata <- model$data
  yhat <- predict(model, type = "response")
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    sim_y <- rpois(length(yhat), lambda = yhat)
    # rebuild the sample with the simulated response and the model's predictors
    sdata <- data.frame(y = sim_y, odata[all.vars(formula(model))[-1]])
    refit <- glm(y ~ ., data = sdata, family = poisson)
    bpred <- predict(refit, type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"),
                    lower = sim_ci[, 1], upper = sim_ci[, 2]))
}

sim_pi(mdl, new_fake, 1000, 0.95)
#       pred lower upper
# 1 12.63040     6    21
# 2 38.69738    26    52
# 3 26.97271    17    39
# 4 10.69951     4    18
# 5 52.50839    38    68
```

As demonstrated above, after a large number of replications, outcomes from both methods are highly consistent.
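To see why the empirical effort is worthwhile, it helps to contrast both methods with a naive interval taken directly from Poisson quantiles at the fitted mean, which ignores parameter uncertainty altogether. The helper name naive_pi below is my own, not from the post:

```r
# Naive interval from Poisson quantiles at the fitted mean; it reflects only
# the Poisson sampling noise and ignores the estimation uncertainty that the
# bootstrap and simulation methods capture.
naive_pi <- function(model, pdata, p) {
  lp <- (1 - p) / 2
  pred <- predict(model, newdata = pdata, type = "response")
  data.frame(pred = pred,
             lower = qpois(lp, lambda = pred),
             upper = qpois(1 - lp, lambda = pred))
}

# naive_pi(mdl, new_fake, 0.95)  # using the model and new data points from above
```

With a well-estimated model on ample data the three intervals should be close; the gap between the naive and the empirical intervals widens as the parameter uncertainty grows.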