## Archive for the ‘**Loss Forecasting**’ Category

## Prediction Intervals for Poisson Regression

While the confidence interval addresses the uncertainty around the conditional mean, the prediction interval also accommodates the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to quantify the potential model risk due to these uncertainties.

While calculating the prediction interval of an OLS regression under the Gaussian distributional assumption is relatively straightforward with off-the-shelf solutions in R, it is more complicated in a Generalized Linear Model such as Poisson regression. In this post, I am going to show two empirical methods for calculating the prediction interval of a Poisson regression, one based on bootstrapping and the other based on simulation. Because of the high computing cost, parallelism with the foreach() function will be used to improve efficiency.
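To see the gap between the two intervals concretely in the Gaussian case, below is a minimal Python sketch (not part of the original R example) that computes both intervals for an OLS fit; the coefficients, noise level, and new data point are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

np.random.seed(2016)
n = 100
x = np.random.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept
y = 1.0 + 0.5 * x + np.random.normal(0, 1, n)

# OLS fit via least squares
beta, res_ss, _, _ = np.linalg.lstsq(X, y, rcond=None)
s2 = res_ss[0] / (n - 2)                      # residual variance
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 5.0])                     # a hypothetical new data point
h = x0 @ XtX_inv @ x0
t = stats.t.ppf(0.975, n - 2)

ci_half = t * np.sqrt(s2 * h)        # confidence interval half-width (mean)
pi_half = t * np.sqrt(s2 * (1 + h))  # prediction interval half-width (new obs)
print(ci_half < pi_half)             # the prediction interval is always wider
```

The extra "1 +" under the square root is exactly the prediction-error term that makes the prediction interval wider than the confidence interval.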

First of all, let’s estimate a Poisson regression with glm() and generate a few fake new data points on which to calculate model predictions. Since the toy data is very small, with only 32 records and all categorical predictors, I doubled the sample size with rbind() to ensure appropriate data coverage in the bootstrapping.

```r
pkgs <- c('doParallel', 'foreach')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
new_fake <- df[1:5, 1:2]
```

The first method, shown below, is based on bootstrapping with the following steps:

1. Draw a bootstrap sample from the original model development sample by random sampling with replacement;

2. Repeat the above many times, e.g. 1,000, to generate different bootstrap samples;

3. Refit the model on each bootstrap sample;

4. Generate predictions for the new data points, e.g. "new_fake", with each refitted model;

5. Generate Poisson random numbers with the mean, i.e. lambda, equal to the predicted values from the refitted models;

6. Collect all Poisson random numbers from the previous step and calculate the percentiles.

```r
boot_pi <- function(model, pdata, n, p) {
  odata <- model$data
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ]
    bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"),
                    lower = boot_ci[, 1], upper = boot_ci[, 2]))
}

boot_pi(mdl, new_fake, 1000, 0.95)
#       pred lower upper
# 1 12.63040     6    21
# 2 38.69738    25    55
# 3 26.97271    16    39
# 4 10.69951     4    18
# 5 52.50839    35    70
```

The second method is based on simulation and outlined below:

1. Re-generate the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to the predicted values from the original model;

2. Repeat the above simulation many times, e.g. 1,000, to generate many response series;

3. Generate 1,000 updated model samples by replacing the original response with the new responses generated from the simulations;

4. Refit the model on each updated sample;

5. Generate predictions for the new data points, e.g. "new_fake", with each refitted model;

6. Generate Poisson random numbers with lambda equal to the predicted values from the refitted models;

7. Collect all Poisson random numbers from the previous step and calculate the percentiles.

```r
sim_pi <- function(model, pdata, n, p) {
  odata <- model$data
  yhat <- predict(model, type = "response")
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    sim_y <- rpois(length(yhat), lambda = yhat)
    # keep the original predictors and replace the response with the simulated one
    sdata <- data.frame(y = sim_y, odata[all.vars(formula(model))[-1]])
    refit <- glm(y ~ ., data = sdata, family = poisson)
    bpred <- predict(refit, type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"),
                    lower = sim_ci[, 1], upper = sim_ci[, 2]))
}

sim_pi(mdl, new_fake, 1000, 0.95)
#       pred lower upper
# 1 12.63040     6    21
# 2 38.69738    26    52
# 3 26.97271    17    39
# 4 10.69951     4    18
# 5 52.50839    38    68
```

As demonstrated above, after a large number of replications, the outcomes from the two methods are highly consistent.

## Modeling Frequency in Operational Losses with Python

Poisson and Negative Binomial regressions are two popular approaches to modeling frequency measures of operational losses and can be implemented in Python with the statsmodels package as below:

```python
In [1]: import pandas as pd

In [2]: import statsmodels.api as sm

In [3]: import statsmodels.formula.api as smf

In [4]: df = pd.read_csv("AutoCollision.csv")

In [5]: # FITTING A POISSON REGRESSION

In [6]: poisson = smf.glm(formula = "Claim_Count ~ Age + Vehicle_Use", data = df, family = sm.families.Poisson(sm.families.links.log))

In [7]: poisson.fit().summary()
Out[7]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            Claim_Count   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:                 Poisson   Df Model:                           10
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -204.40
Date:                Tue, 08 Dec 2015   Deviance:                       184.72
Time:                        20:31:27   Pearson chi2:                     184.
No. Iterations:                     9
=============================================================================================
                                coef    std err          z      P>|z|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     2.3702      0.110     21.588      0.000       2.155     2.585
Age[T.21-24]                  1.4249      0.118     12.069      0.000       1.193     1.656
Age[T.25-29]                  2.3465      0.111     21.148      0.000       2.129     2.564
Age[T.30-34]                  2.5153      0.110     22.825      0.000       2.299     2.731
Age[T.35-39]                  2.5821      0.110     23.488      0.000       2.367     2.798
Age[T.40-49]                  3.2247      0.108     29.834      0.000       3.013     3.437
Age[T.50-59]                  3.0019      0.109     27.641      0.000       2.789     3.215
Age[T.60+]                    2.6391      0.110     24.053      0.000       2.424     2.854
Vehicle_Use[T.DriveLong]      0.9246      0.036     25.652      0.000       0.854     0.995
Vehicle_Use[T.DriveShort]     1.2856      0.034     37.307      0.000       1.218     1.353
Vehicle_Use[T.Pleasure]       0.1659      0.041      4.002      0.000       0.085     0.247
=============================================================================================
"""

In [8]: # FITTING A NEGATIVE BINOMIAL REGRESSION

In [9]: nbinom = smf.glm(formula = "Claim_Count ~ Age + Vehicle_Use", data = df, family = sm.families.NegativeBinomial(sm.families.links.log))

In [10]: nbinom.fit().summary()
Out[10]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            Claim_Count   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:        NegativeBinomial   Df Model:                           10
Link Function:                    log   Scale:                 0.0646089484752
Method:                          IRLS   Log-Likelihood:                -198.15
Date:                Tue, 08 Dec 2015   Deviance:                       1.4436
Time:                        20:31:27   Pearson chi2:                     1.36
No. Iterations:                    11
=============================================================================================
                                coef    std err          z      P>|z|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     2.2939      0.153     14.988      0.000       1.994     2.594
Age[T.21-24]                  1.4546      0.183      7.950      0.000       1.096     1.813
Age[T.25-29]                  2.4133      0.183     13.216      0.000       2.055     2.771
Age[T.30-34]                  2.5636      0.183     14.042      0.000       2.206     2.921
Age[T.35-39]                  2.6259      0.183     14.384      0.000       2.268     2.984
Age[T.40-49]                  3.2408      0.182     17.760      0.000       2.883     3.598
Age[T.50-59]                  2.9717      0.183     16.283      0.000       2.614     3.329
Age[T.60+]                    2.6404      0.183     14.463      0.000       2.283     2.998
Vehicle_Use[T.DriveLong]      0.9480      0.128      7.408      0.000       0.697     1.199
Vehicle_Use[T.DriveShort]     1.3402      0.128     10.480      0.000       1.090     1.591
Vehicle_Use[T.Pleasure]       0.3265      0.128      2.548      0.011       0.075     0.578
=============================================================================================
"""
```

Although Quasi-Poisson regression is not currently supported by the statsmodels package, we are still able to estimate the model with the rpy2 package by using R in the back-end. As shown in the output below, the parameter estimates in the Quasi-Poisson model are identical to the ones in the standard Poisson model. If we want a flexible modeling approach for frequency measures in the operational loss forecast without pursuing the more complex Negative Binomial model, Quasi-Poisson regression can be considered a serious contender.

```python
In [11]: # FITTING A QUASI-POISSON REGRESSION

In [12]: import rpy2.robjects as ro

In [13]: from rpy2.robjects import pandas2ri

In [14]: pandas2ri.activate()

In [15]: rdf = pandas2ri.py2ri_pandasdataframe(df)

In [16]: qpoisson = ro.r.glm('Claim_Count ~ Age + Vehicle_Use', data = rdf, family = ro.r('quasipoisson(link = "log")'))

In [17]: print ro.r.summary(qpoisson)

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             2.3702     0.3252   7.288 3.55e-07 ***
Age21-24                1.4249     0.3497   4.074 0.000544 ***
Age25-29                2.3465     0.3287   7.140 4.85e-07 ***
Age30-34                2.5153     0.3264   7.705 1.49e-07 ***
Age35-39                2.5821     0.3256   7.929 9.49e-08 ***
Age40-49                3.2247     0.3202  10.072 1.71e-09 ***
Age50-59                3.0019     0.3217   9.331 6.42e-09 ***
Age60+                  2.6391     0.3250   8.120 6.48e-08 ***
Vehicle_UseDriveLong    0.9246     0.1068   8.660 2.26e-08 ***
Vehicle_UseDriveShort   1.2856     0.1021  12.595 2.97e-11 ***
Vehicle_UsePleasure     0.1659     0.1228   1.351 0.191016
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 8.774501)

    Null deviance: 6064.97  on 31  degrees of freedom
Residual deviance:  184.72  on 21  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4
```

## Are These Losses from The Same Distribution?

In Advanced Measurement Approaches (AMA) for Operational Risk models, the bank needs to segment operational losses into homogeneous segments known as "Units of Measure (UoM)", which are often defined by the combination of lines of business (LOB) and Basel II event types. However, how do we demonstrate that the losses in one UoM are statistically different from the ones in another UoM? The answer is to test whether the losses from various UoMs are distributionally different or equivalent.

Empirically, the Kolmogorov-Smirnov (K-S) test is often used to check whether two samples are drawn from the same distribution. In the example below, although x and y share the same mean, the K-S test still shows a significant result due to the different variances.

```r
n <- 300
set.seed(2015)
x <- rnorm(n, 0, 1)
y <- rnorm(n, 0, 1.5)

### 2-SAMPLE DISTRIBUTIONAL COMPARISON ###
ks.test(x, y, alternative = "two.sided")
#         Two-sample Kolmogorov-Smirnov test
#
# data:  x and y
# D = 0.1567, p-value = 0.001268
# alternative hypothesis: two-sided
```

However, the K-S test cannot be generalized to K-sample cases with K > 2. In such a scenario, the univariate coverage test or the more general multivariate MRPP test might be more appropriate. The Blossom package developed by Talbert and Cade (https://www.fort.usgs.gov/products/23735) provides convenient functions implementing both tests, as shown below.

```r
z <- rnorm(n, 0, 2)
df <- data.frame(x = c(x, y, z), g = unlist(lapply(c(1, 2, 3), rep, n)))

### K-SAMPLE DISTRIBUTIONAL COMPARISON ###

# COVERAGE TEST FOR THE UNIVARIATE RESPONSES
library(Blossom)
ctest <- coverage(df$x, df$g)
summary(ctest)
# Results:
#   Observed coverage statistic             : 1.870273
#   Mean of coverage statistic              : 1.774817
#   Estimated variance of coverage statistic: 0.002275862
#   Standard deviation of the variance
#   of the coverage statistic               : 5.108031e-05
#
#   Observed standardized coverage statistic: 2.00093
#   Skewness of observed coverage statistic : 0.08127759
#   Probability (Pearson Type III)
#   of a larger or equal coverage statistic : 0.02484709
#   Probability (Resampled)
#   of a larger or equal coverage statistic : 0.02475*

# MULTIRESPONSE PERMUTATION PROCEDURE FOR MULTIVARIATE RESPONSES
mtest <- mrpp(x, g, df)
summary(mtest)
# Results:
#   Delta Observed              :  1.676303
#   Delta Expected              :  1.708194
#   Delta Variance              :  4.262303e-06
#   Delta Skewness              : -1.671773
#
#   Standardized test statistic : -15.44685
#   Probability (Pearson Type III)
#   of a smaller or equal delta :  9.433116e-09***
```

## To Difference or Not To Difference?

In textbooks on time series analysis, we’ve been taught to difference a time series in order to obtain a stationary series, which can be justified by various plots and statistical tests.

In real-world time series analysis, things are not always as clear-cut as in the textbook. For instance, although the ACF plot might show a not-so-slow decay pattern, the ADF test still can’t reject the null hypothesis of a unit root. In such cases, many analysts tend to difference the time series in order to be, in their view, on the safe side.

However, is it really a safe practice to difference a time series regardless, just to have a stationary series to model? In the example below, I will show that inappropriately differencing a time series can lead the model development in an undesirable direction.

First of all, let’s simulate a univariate series under the Gaussian distributional assumption. By construction, this series has to be stationary.

```r
> library(urca)
> library(forecast)
> library(normwhn.test)
> x <- rnorm(100)
> par(mfrow = c(2, 1))
> acf(x)
> pacf(x)
> whitenoise.test(x)
[1] "no. of observations"
[1] 100
[1] "T"
[1] 50
[1] "CVM stat MN"
[1] 0.8687478
[1] "tMN"
[1] -0.9280931
[1] "test value"
[1] 0.6426144
> x.adf <- ur.df(x, type = c("none"), selectlags = "BIC")
> summary(x.adf)

###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################

Test regression none

Call:
lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max
-1.75385 -0.60585 -0.03467  0.61702  3.10100

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
z.lag.1    -1.008829   0.143635  -7.024  3.1e-10 ***
z.diff.lag  0.002833   0.101412   0.028    0.978
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9501 on 96 degrees of freedom
Multiple R-squared: 0.5064, Adjusted R-squared: 0.4961
F-statistic: 49.25 on 2 and 96 DF, p-value: 1.909e-15

Value of test-statistic is: -7.0235

Critical values for test statistics:
     1pct  5pct 10pct
tau1 -2.6 -1.95 -1.61

> x.kpss <- ur.kpss(x, type = "mu", lags = "short")
> summary(x.kpss)

#######################
# KPSS Unit Root Test #
#######################

Test is of type: mu with 4 lags.

Value of test-statistic is: 0.4136

Critical value for a significance level of:
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

> auto.arima(x, ic = 'bic')
Series: x
ARIMA(0,0,0) with zero mean

sigma^2 estimated as 0.8829:  log likelihood=-135.67
AIC=273.34   AICc=273.38   BIC=275.94
```

As shown in the above output:

1) Since x is simulated under the normal assumption, the series should be white noise by definition.

2) The ACF plot shows no autocorrelation at all, as it should.

3) In the ADF test, the null hypothesis of a unit root is rejected.

4) In the KPSS test, the null hypothesis of stationarity is not rejected.

5) The output from auto.arima() suggests an ARIMA(0, 0, 0) model, which is completely in line with the assumption.

However, what would happen if we take the difference of x anyway?

```r
> difx <- diff(x)
> par(mfrow = c(2, 1))
> acf(difx)
> pacf(difx)
> whitenoise.test(difx)
[1] "no. of observations"
[1] 99
[1] "T"
[1] 49
[1] "CVM stat MN"
[1] 1.669876
[1] "tMN"
[1] 4.689132
[1] "test value"
[1] 0.01904923
> auto.arima(difx, ic = 'bic')
Series: difx
ARIMA(0,0,1) with zero mean

Coefficients:
          ma1
      -0.9639
s.e.   0.0327

sigma^2 estimated as 0.901:  log likelihood=-136.64
AIC=277.27   AICc=277.4   BIC=282.46
```

The above output is quite interesting, in that we just artificially "created" a model by over-differencing a white noise series.

1) After over-differencing, the series is no longer white noise, with the null hypothesis rejected, e.g. p-value = 0.02.

2) In addition, auto.arima() suggests that an ARIMA(0, 0, 1) model might fit the data.
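The over-differencing effect can also be verified numerically. Below is a quick sketch in Python with numpy (a companion to the R session above, with an illustrative sample size): differencing white noise induces an MA(1) structure whose theoretical lag-1 autocorrelation is -0.5.

```python
import numpy as np

np.random.seed(2015)
x = np.random.normal(size=100000)  # white noise by construction
dx = np.diff(x)                    # the over-differenced series

# sample lag-1 autocorrelation of the differenced series
r1 = np.corrcoef(dx[:-1], dx[1:])[0, 1]
print(round(r1, 2))  # close to the theoretical -0.5
```

The spurious negative autocorrelation at lag 1 is precisely what leads auto.arima() to pick up an artificial MA(1) term.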

## A SAS Macro for Scorecard Evaluation with Weights

On 09/28/2012, I posted a SAS macro evaluating scorecard performance, e.g. KS / AUC statistics (https://statcompute.wordpress.com/2012/09/28/a-sas-macro-for-scorecard-performance-evaluation). However, that macro is not generic enough to handle cases with a weighting variable. In a recent project that I am working on, a weight is attached to each credit applicant due to the reject inference. Therefore, there is no excuse for me to continue neglecting the necessity of developing another SAS macro that can handle a weight variable with any positive values in the scorecard evaluation. Below is a quick draft of the macro. You might have to tweak it a little to suit your needs in production.

```sas
%macro wt_auc(data = , score = , y = , weight = );
***********************************************************;
* THE MACRO IS TO EVALUATE THE SEPARATION POWER OF A      *;
* SCORECARD WITH WEIGHTS                                  *;
* ------------------------------------------------------- *;
* PARAMETERS:                                             *;
*  DATA  : INPUT DATASET                                  *;
*  SCORE : SCORECARD VARIABLE                             *;
*  Y     : RESPONSE VARIABLE IN (0, 1)                    *;
*  WEIGHT: WEIGHT VARIABLE WITH POSITIVE VALUES           *;
* ------------------------------------------------------- *;
* OUTPUTS:                                                *;
*  A SUMMARY TABLE WITH KS AND AUC STATISTICS             *;
* ------------------------------------------------------- *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

data _tmp1 (keep = &score &y &weight);
  set &data;
  where &score ~= . and &y in (0, 1) and &weight >= 0;
run;

*** CAPTURE THE DIRECTION OF SCORE ***;
ods listing close;
ods output spearmancorr = _cor;
proc corr data = _tmp1 spearman;
  var &y;
  with &score;
run;
ods listing;

data _null_;
  set _cor;
  if &y >= 0 then do;
    call symput('desc', 'descending');
  end;
  else do;
    call symput('desc', ' ');
  end;
run;

proc sql noprint;
create table _tmp2 as
select &score                                        as _scr,
       sum(&y)                                       as _bcnt,
       count(*)                                      as _cnt,
       sum(case when &y = 1 then &weight else 0 end) as _bwt,
       sum(case when &y = 0 then &weight else 0 end) as _gwt
from _tmp1 group by &score;

select sum(_bwt) into :bsum from _tmp2;
select sum(_gwt) into :gsum from _tmp2;
select sum(_cnt) into :cnt  from _tmp2;
quit;
%put &cnt;

proc sort data = _tmp2;
  by &desc _scr;
run;

data _tmp3;
  set _tmp2;
  by &desc _scr;
  retain _gcum _bcum _cntcum;
  _gcum + _gwt;
  _bcum + _bwt;
  _cntcum + _cnt;
  _gpct = _gcum / &gsum;
  _bpct = _bcum / &bsum;
  _ks = abs(_gpct - _bpct) * 100;
  _rank = int(_cntcum / ceil(&cnt / 10)) + 1;
run;

proc sort data = _tmp3 sortsize = max;
  by _gpct _bpct;
run;

data _tmp4;
  set _tmp3;
  by _gpct _bpct;
  if last._gpct then do;
    _idx + 1;
    output;
  end;
run;

proc sql noprint;
create table _tmp5 as
select a._gcum as gcum,
       (b._gpct - a._gpct) * (b._bpct + a._bpct) / 2 as dx
from _tmp4 as a, _tmp4 as b
where a._idx + 1 = b._idx;

select sum(dx)  format = 15.8 into :AUC      from _tmp5;
select max(_ks) format = 15.8 into :KS_STAT  from _tmp3;
select _scr     format = 6.2  into :KS_SCORE from _tmp3
where _ks = (select max(_ks) from _tmp3);

create table _tmp6 as
select _rank            as rank,
       min(_scr)        as min_scr,
       max(_scr)        as max_scr,
       sum(_cnt)        as cnt,
       sum(_gwt + _bwt) as wt,
       sum(_gwt)        as gwt,
       sum(_bwt)        as bwt,
       sum(_bwt) / calculated wt as bad_rate
from _tmp3 group by _rank;
quit;

proc report data = _last_ spacing = 1 split = "/" headline nowd;
  column("GOOD BAD SEPARATION REPORT FOR %upcase(%trim(&score)) IN DATA %upcase(%trim(&data))/ MAXIMUM KS = %trim(&ks_stat) AT SCORE POINT %trim(&ks_score) and AUC STATISTICS = %trim(&auc)/ /"
         rank min_scr max_scr cnt wt gwt bwt bad_rate);
  define rank     / noprint order order = data;
  define min_scr  / "MIN/SCORE"      width = 10 format = 9.2        analysis min center;
  define max_scr  / "MAX/SCORE"      width = 10 format = 9.2        analysis max center;
  define cnt      / "RAW/COUNT"      width = 10 format = comma9.    analysis sum;
  define wt       / "WEIGHTED/SUM"   width = 15 format = comma14.2  analysis sum;
  define gwt      / "WEIGHTED/GOODS" width = 15 format = comma14.2  analysis sum;
  define bwt      / "WEIGHTED/BADS"  width = 15 format = comma14.2  analysis sum;
  define bad_rate / "BAD/RATE"       width = 10 format = percent9.2 order center;
  rbreak after / summarize dol skip;
run;

proc datasets library = work nolist;
  delete _tmp: / memtype = data;
run;
quit;

***********************************************************;
*                   END OF THE MACRO                      *;
***********************************************************;
%mend wt_auc;
```

In the first demo below, a weight variable with fractional values is tested.

```sas
*** TEST CASE OF FRACTIONAL WEIGHTS ***;
data one;
  set data.accepts;
  weight = ranuni(1);
run;

%wt_auc(data = one, score = bureau_score, y = bad, weight = weight);
/*
 GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA ONE
 MAXIMUM KS = 34.89711721 AT SCORE POINT 678.00 and AUC STATISTICS = 0.73521009

    MIN        MAX        RAW        WEIGHTED        WEIGHTED        WEIGHTED      BAD
   SCORE      SCORE      COUNT         SUM             GOODS           BADS        RATE
 -------------------------------------------------------------------------------------------
   443.00     619.00        539        276.29          153.16          123.13     44.56%
   620.00     644.00        551        273.89          175.00           98.89     36.11%
   645.00     660.00        544        263.06          176.88           86.18     32.76%
   661.00     676.00        555        277.26          219.88           57.38     20.70%
   677.00     692.00        572        287.45          230.41           57.04     19.84%
   693.00     707.00        510        251.51          208.25           43.26     17.20%
   708.00     724.00        576        276.31          243.89           32.42     11.73%
   725.00     746.00        566        285.53          262.73           22.80      7.98%
   747.00     772.00        563        285.58          268.95           16.62      5.82%
   773.00     848.00        546        272.40          264.34            8.06      2.96%
 ========== ========== ========== =============== =============== ===============
   443.00     848.00      5,522       2,749.28        2,203.49          545.79
*/
```

In the second demo, a weight variable with positive integers is also tested.

```sas
*** TEST CASE OF INTEGER WEIGHTS ***;
data two;
  set data.accepts;
  weight = rand("poisson", 20);
run;

%wt_auc(data = two, score = bureau_score, y = bad, weight = weight);
/*
 GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA TWO
 MAXIMUM KS = 35.58884479 AT SCORE POINT 679.00 and AUC STATISTICS = 0.73725030

    MIN        MAX        RAW        WEIGHTED        WEIGHTED        WEIGHTED      BAD
   SCORE      SCORE      COUNT         SUM             GOODS           BADS        RATE
 -------------------------------------------------------------------------------------------
   443.00     619.00        539      10,753.00        6,023.00        4,730.00    43.99%
   620.00     644.00        551      11,019.00        6,897.00        4,122.00    37.41%
   645.00     660.00        544      10,917.00        7,479.00        3,438.00    31.49%
   661.00     676.00        555      11,168.00        8,664.00        2,504.00    22.42%
   677.00     692.00        572      11,525.00        9,283.00        2,242.00    19.45%
   693.00     707.00        510      10,226.00        8,594.00        1,632.00    15.96%
   708.00     724.00        576      11,497.00       10,117.00        1,380.00    12.00%
   725.00     746.00        566      11,331.00       10,453.00          878.00     7.75%
   747.00     772.00        563      11,282.00       10,636.00          646.00     5.73%
   773.00     848.00        546      10,893.00       10,598.00          295.00     2.71%
 ========== ========== ========== =============== =============== ===============
   443.00     848.00      5,522     110,611.00       88,744.00       21,867.00
*/
```

As shown, the macro works well in both cases. Please feel free to let me know if it helps in your cases.

## Disaggregating Annual Losses into Each Quarter

In loss forecasting, it is often necessary to disaggregate annual losses into each quarter. The simplest method to convert a low-frequency time series to a high-frequency one is interpolation, such as the one implemented in the EXPAND procedure of SAS/ETS. In the example below, there is a series of annual loss projections from 2013 through 2016, and an interpolation by natural spline is used to convert the annual losses into quarterly ones.

**SAS Code:**

```sas
data annual;
  input loss year mmddyy8.;
  format year mmddyy8.;
datalines;
19270175 12/31/13
18043897 12/31/14
17111193 12/31/15
17011107 12/31/16
;
run;

proc expand data = annual out = quarterly from = year to = quarter;
  id year;
  convert loss / observed = total method = spline(natural);
run;

proc sql;
select year(year)                                        as year,
       sum(case when qtr(year) = 1 then loss else 0 end) as qtr1,
       sum(case when qtr(year) = 2 then loss else 0 end) as qtr2,
       sum(case when qtr(year) = 3 then loss else 0 end) as qtr3,
       sum(case when qtr(year) = 4 then loss else 0 end) as qtr4,
       sum(loss)                                         as total
from quarterly group by calculated year;
quit;
```

**Output:**

```
year        qtr1        qtr2        qtr3        qtr4       total
2013     4868536     4844486     4818223     4738931    19270175
2014     4560049     4535549     4510106     4438194    18043897
2015     4279674     4276480     4287373     4267666    17111193
2016     4215505     4220260     4279095     4296247    17011107
```

While mathematical interpolation is easy to implement, it might be difficult to justify and interpret from the business standpoint. In reality, there might be an assumption that the loss trend follows the movement of the macro-economy. Therefore, it might be advantageous to disaggregate annual losses into quarterly ones with the inclusion of one or more economic indicators. This approach is implemented in the tempdisagg package in R. Below is a demo with the same loss data used above; this time, however, the disaggregation of annual losses is accomplished based upon a macro-economic indicator.

**R Code:**

```r
library(tempdisagg)

loss <- c(19270175, 18043897, 17111193, 17011107)
loss.a <- ts(loss, frequency = 1, start = 2013)
econ <- c(7.74, 7.67, 7.62, 7.48, 7.32, 7.11, 6.88, 6.63,
          6.41, 6.26, 6.12, 6.01, 5.93, 5.83, 5.72, 5.59)
econ.q <- ts(econ, frequency = 4, start = 2013)
summary(mdl <- td(loss.a ~ econ.q))
print(predict(mdl))
```

**Output:**

```
Call:
td(formula = loss.a ~ econ.q)

Residuals:
Time Series:
Start = 2013
End = 2016
Frequency = 1
[1]  199753 -234384 -199257  233888

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2416610     359064   6.730   0.0214 *
econ.q        308226      53724   5.737   0.0291 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

'chow-lin-maxlog' disaggregation with 'sum' conversion
4 low-freq. obs. converted to 16 high-freq. obs.
Adjusted R-squared: 0.9141      AR1-Parameter: 0 (truncated)

        Qtr1    Qtr2    Qtr3    Qtr4
2013 4852219 4830643 4815232 4772080
2014 4614230 4549503 4478611 4401554
2015 4342526 4296292 4253140 4219235
2016 4302864 4272041 4238136 4198067
```

In practice, if a simple and flexible solution is desired without the need for interpretation, then mathematical interpolation might be a good choice. On the other hand, if there is a strong belief that the macro-economy drives the loss trend, then the regression-based method implemented in the tempdisagg package might be preferred. In our example, however, both methods generate very similar results.

## Another Class of Risk Models

In retail banking, it is a key interest to predict the probability of accounts’ adverse behaviors, such as delinquencies or defaults. A widely accepted practice in the industry is to classify accounts into two groups, the good and the bad, based upon the presence of certain adverse behaviors, and then to model this binary outcome with discriminant models, e.g. logistic regression. However, an obvious limitation of discriminant models based upon a binary outcome is that the two-state classification over-simplifies the adverse behaviors of accounts. What financially impacts a financial institution is not only the presence of a certain adverse behavior but also the frequency of such behavior.

Given the definition of the binary outcome, it is important to notice that delinquencies can also be measured directly as the frequency of over-due payments. Therefore, instead of modeling the binary outcome, a more sensible alternative might be to model the frequency of delinquencies within a given valuation horizon. In the statistical context, the genuine model for a count outcome, e.g. a frequency, is the Poisson regression model with probability function

f(Y_i | X_i) = exp(-λ_i) * (λ_i ^ Y_i) / Y_i!, where λ_i = exp(X_i`B)

It is assumed that each observed outcome Y_i is drawn from a Poisson distribution with conditional mean λ_i given the covariate vector X_i for case i. In the Poisson model, a strong assumption is that the mean is equal to the variance, E(Y_i | X_i) = Var(Y_i | X_i) = λ_i, which is known as Equi-Dispersion. However, in practice, this assumption is too restrictive for many empirical applications. In real-world count outcomes, the variance often exceeds the mean, namely Over-Dispersion, due to various reasons such as excess zeroes or a long right tail. For instance, in a credit card portfolio, the majority of cardholders should have zero delinquencies at any point in time, while a few might have more than three. Similar to the consequence of heteroskedasticity in a linear regression, Over-Dispersion in a Poisson model will lead to deflated standard errors of parameter estimates and therefore inflated t-statistics. Hence, the Poisson model is often inadequate and practically unusable.
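The contrast between Equi-Dispersion and Over-Dispersion can be illustrated with a short simulation; the Python sketch below uses purely hypothetical lambdas and mixing weights.

```python
import numpy as np

np.random.seed(1)
n = 100000

# a pure Poisson outcome: mean and variance are both close to lambda
pois = np.random.poisson(lam=2.0, size=n)
print(pois.mean(), pois.var())  # both near 2.0

# a heterogeneous portfolio: 90% low-risk (lambda = 0.1) and 10%
# high-risk (lambda = 5) accounts -- excess zeroes plus a long right
# tail, so the variance clearly exceeds the mean
lam = np.where(np.random.uniform(size=n) < 0.9, 0.1, 5.0)
mixed = np.random.poisson(lam)
print(mixed.mean(), mixed.var())  # variance far above the mean
```

The second series is exactly the kind of over-dispersed count data that a single Poisson model cannot accommodate.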

Considered a generalization of the basic Poisson model, the Negative Binomial (NB) model accommodates Over-Dispersion in the data by including a dispersion parameter. In a NB model, it is assumed that the conditional mean λ_i for case i is determined not only by the observed heterogeneity explained by the covariate vector X_i but also by the unobserved heterogeneity, denoted ε_i, that is independent of X_i such that

λ_i = exp(X_i`B + ε_i) = exp(X_i`B) * exp(ε_i), where exp(ε_i) ~ Gamma(1/α, 1/α)

While there are many variants of the NB model, the most common one is the NB2 model proposed by Cameron and Trivedi (1986) with probability function

f(Y_i | X_i) = Gamma(Y_i + 1/α) / [Gamma(Y_i + 1) * Gamma(1/α)] * [(1/α) / (1/α + λ_i)] ^ (1/α) * [λ_i / (1/α + λ_i)] ^ Y_i, where α is the dispersion parameter

For NB2 model, its conditional mean E(Y_i | X_i) is still λ_i, while its variance Var(Y_i | X_i) becomes λ_i + α * λ_i ^ 2. Since both λ_i > 0 and α > 0, the variance must exceed the mean and therefore the issue of Over-Dispersion has been addressed.
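The NB2 moment conditions can be checked by simulation. In the Python sketch below, exp(ε_i) is drawn from Gamma(1/α, 1/α), i.e. shape 1/α and scale α in numpy's parameterization, so that E(Y) = λ and Var(Y) = λ + α * λ^2; the values of λ and α are illustrative assumptions.

```python
import numpy as np

np.random.seed(2)
n, lam, alpha = 200000, 3.0, 0.5

# exp(eps) ~ Gamma(1/alpha, 1/alpha): mean 1, variance alpha
mult = np.random.gamma(shape=1 / alpha, scale=alpha, size=n)
y = np.random.poisson(lam * mult)  # the gamma-mixed Poisson, i.e. NB2

print(y.mean())  # ~ lambda = 3.0
print(y.var())   # ~ lambda + alpha * lambda^2 = 3.0 + 0.5 * 9.0 = 7.5
```

Since both λ > 0 and α > 0, the simulated variance sits above the mean, as the formula requires.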

A major limitation of standard count data models, such as Poisson and Negative Binomial, is that the data is assumed to be generated by a single process. However, in many cases, it might be more appropriate to assume that the data is governed by two or more processes. For instance, it is believed that the risk drivers of a first-time delinquent account might be very different from the ones of an account that has been delinquent multiple times. From the business standpoint, the assumption of multiple processes is particularly attractive in that it provides the potential to segment the portfolio into two or more sub-groups based upon their delinquency patterns and loan characteristics.

Known as a two-part model, the Hurdle Poisson model assumes that count outcomes come from two systematically different processes: a Binomial distribution determining the probability of zero counts and a Truncated-at-Zero Poisson governing positive outcomes. The probability function can be expressed as

for Y_i = 0, f(Y_i | X_i) = θ_i, where θ_i = Prob(Y_i = 0)

for Y_i > 0, f(Y_i | X_i) = (1 – θ_i) * exp(-λ_i) * λ_i ^ Y_i / {[1 – exp(-λ_i)] * Y_i!}, where λ_i = exp(X_i`B)

In this modeling framework, the first process can be analyzed with a logistic regression, and the second can be reflected by a Truncated-at-Zero Poisson model. An advantage of the Hurdle model is that it is flexible enough to effectively model both Over-Dispersed data with too many zeroes and Under-Dispersed data with too few zeroes.
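The two-part probability function above can be sketched in a few lines of Python with scipy; theta and lam below are hypothetical parameter values, and summing the probabilities confirms that the two parts form a proper distribution.

```python
import numpy as np
from scipy.stats import poisson

def hurdle_pmf(y, theta, lam):
    """P(Y = y) under a Hurdle Poisson model: theta at zero, and a
    truncated-at-zero Poisson re-scaled by (1 - theta) for y > 0."""
    if y == 0:
        return theta
    return (1 - theta) * poisson.pmf(y, lam) / (1 - np.exp(-lam))

theta, lam = 0.6, 2.0                                  # illustrative values
probs = [hurdle_pmf(y, theta, lam) for y in range(50)]
print(sum(probs))  # ~ 1.0: a proper probability distribution
```

Note that theta enters the zero probability directly, so the model can push the zero mass above or below exp(-lam), covering both over- and under-dispersed zeroes.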

Similar to the Hurdle model, the Zero-Inflated Poisson (ZIP) model is another way to model count outcomes with excess zeroes under the assumption of two components. However, it is slightly different from the Hurdle model in the sense that zero outcomes are assumed to come from two different sources, one generating only zero outcomes and the other generating both zero and nonzero outcomes. Specifically, a Binomial distribution decides if an individual is from the Always-Zero or the Not-Always-Zero group, and then a standard Poisson distribution describes counts in the Not-Always-Zero group. The probability function of the ZIP model is given as

for Y_i = 0, f(Y_i | X_i) = ω_i + (1 – ω_i) * exp(-λ_i), where ω_i = Prob(case i belongs to the Always-Zero group)

for Y_i > 0, f(Y_i | X_i) = (1 – ω_i) * exp(-λ_i) * λ_i ^ Y_i / Y_i!

With a similar idea to the Hurdle model, the ZIP model can also be represented jointly by two different sub-models. A logistic regression is used to separate the Always-Zero group from the Not-Always-Zero group, and a basic Poisson model is applied to individuals in the Not-Always-Zero group. From a business perspective, the ZIP model describes an important fact: some not-at-risk accounts are so well established that they will never have financial problems, while the at-risk ones might get into trouble during tough times. Therefore, risk exposures and underlying metrics for accounts with the same zero count might still be differentiable.
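Analogously, the ZIP probability function, with ω denoting the probability of belonging to the Always-Zero group, can be sketched as below; omega and lam are hypothetical values.

```python
import numpy as np
from scipy.stats import poisson

def zip_pmf(y, omega, lam):
    """P(Y = y) under a ZIP model: zeroes from both the Always-Zero
    group and the Poisson component; positives from the Poisson only."""
    if y == 0:
        return omega + (1 - omega) * np.exp(-lam)
    return (1 - omega) * poisson.pmf(y, lam)

omega, lam = 0.3, 2.0                                  # illustrative values
probs = [zip_pmf(y, omega, lam) for y in range(50)]
print(sum(probs))              # ~ 1.0
print(zip_pmf(0, omega, lam))  # zero mass inflated above exp(-lam)
```

Unlike the Hurdle model, the zero probability here mixes both sources, so the ZIP model can only inflate, never deflate, the zero mass relative to a standard Poisson.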

In practice, a sharp dichotomization between the at-risk group and the not-at-risk group might not be realistic. Even a customer in good financial condition might be exposed to risks in certain situations. Therefore, it might make sense to split the whole portfolio into a few segments with different levels of risk exposure. A Latent Class Poisson model provides such a mechanism by assuming that the population of interest is actually a mixture of S > 1 latent (unobservable) components and that each individual is a draw from one of these latent groups. The probability function of a Latent Class Poisson model with S = 2 classes can be obtained as

f(Y_i | X_i) = P1_i * exp(-λ1_i) * λ1_i ^ Y_i / Y_i! + P2_i * exp(-λ2_i) * λ2_i ^ Y_i / Y_i!, where P1_i + P2_i = 1

Each latent component in the mixture is assumed to have a different parameter λ_i, which will account for the unobserved heterogeneity in the population. For instance, in the case of S = 2, a portfolio is assumed a mixture between a high risk group and a low risk one. Impacts of predictors are allowed to differ across different latent groups, providing a possibility of more informative and flexible interpretations.
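The two-class mixture above can be sketched as below; the mixing weight and the two lambdas, representing a low-risk majority and a high-risk minority, are illustrative assumptions.

```python
from scipy.stats import poisson

def lc_pmf(y, p1, lam1, lam2):
    """Mixture of two Poisson components with weights p1 and 1 - p1."""
    return p1 * poisson.pmf(y, lam1) + (1 - p1) * poisson.pmf(y, lam2)

p1, lam1, lam2 = 0.8, 0.5, 4.0  # low-risk majority, high-risk minority
probs = [lc_pmf(y, p1, lam1, lam2) for y in range(60)]
print(sum(probs))               # ~ 1.0

# the portfolio-level mean blends the two components
mean = sum(y * lc_pmf(y, p1, lam1, lam2) for y in range(60))
print(mean)                     # ~ p1 * lam1 + (1 - p1) * lam2 = 1.2
```

Because the two components have distinct lambdas, the mixture variance exceeds the mixture mean, so the latent class structure also absorbs over-dispersion.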

Besides the models discussed above, it is also worth pointing out that discrete choice models, such as Logit or Probit, have been widely used to model count outcomes as well. However, such a discrete choice model needs to be based upon a sequential or ordered response instead of a multinomial one, namely the ordered Logit.