## Archive for the ‘Loss Forecasting’ Category

## Model Operational Losses with Copula Regression

The previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm) explained why we should consider modeling operational losses in non-material UoMs directly with Tweedie models. For material UoMs with significant losses, however, it is still beneficial to model the frequency and the severity separately.

In the prevailing modeling practice for operational losses, it is often convenient to assume functional independence between the frequency and severity models, which might not hold empirically. For instance, in an economic downturn, both the frequency and the severity of consumer frauds might increase simultaneously. Under the independence assumption, one can argue that including the same variables in both the frequency and severity models induces a certain correlation; nonetheless, the frequency-severity dependence and its contribution to the loss distribution might be overlooked.

In the context of copulas, the distribution of operational losses can be considered a joint distribution determined by both the marginal distributions and a parameter measuring the dependence between the marginals, where the marginals can be Poisson for the frequency and Gamma for the severity. Depending on the dependence structure in the data, various copula functions might be considered. For instance, the product copula describes independence. In the example shown below, a Gumbel copula is used, given that it is often employed to describe positive dependence in the right tail, e.g. high severity together with high frequency. For details, the book “Copula Modeling” by Trivedi and Zimmer is a good reference to start with.
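As a quick numerical illustration of the Gumbel copula properties mentioned above, the Python sketch below computes the copula parameter implied by a given Kendall’s tau, the copula CDF, and the upper-tail dependence coefficient; the formulas are the standard ones for the Gumbel family, and the numbers plugged in are only illustrative.

```python
import math

def gumbel_theta(tau):
    # Gumbel copula: Kendall's tau = 1 - 1 / theta, hence theta = 1 / (1 - tau)
    return 1.0 / (1.0 - tau)

def gumbel_cdf(u, v, theta):
    # C(u, v) = exp(-[(-ln u)^theta + (-ln v)^theta]^(1 / theta))
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))

def gumbel_upper_tail(theta):
    # upper-tail dependence coefficient: lambda_U = 2 - 2^(1 / theta)
    return 2.0 - 2.0 ** (1.0 / theta)

theta = gumbel_theta(0.5)          # tau = 0.5 implies theta = 2
print(theta)                       # 2.0
print(gumbel_upper_tail(theta))    # ~0.586, i.e. strong right-tail dependence
# with positive dependence, C(0.5, 0.5) exceeds the product copula value 0.25
print(gumbel_cdf(0.5, 0.5, theta))
```

At theta = 1, the Gumbel copula collapses to the product copula, i.e. independence, which is why tau = 0 maps to theta = 1 in the first formula.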

In the demonstration, we simulate both frequency and severity measures driven by the same set of covariates. The two are positively correlated with Kendall’s tau = 0.5 under the Gumbel copula assumption.

```r
library(CopulaRegression)

# number of observations to simulate
n <- 100
# seed value for the simulation
set.seed(2017)
# design matrices with a constant column
X <- cbind(rep(1, n), runif(n), runif(n))
# define coefficients for both Poisson and Gamma regressions
p_beta <- g_beta <- c(3, -2, 1)
# define the Gamma dispersion
delta <- 1
# define the Kendall's tau
tau <- 0.5
# copula parameter based on tau
theta <- 1 / (1 - tau)
# define the Gumbel copula
family <- 4
# simulate outcomes
out <- simulate_regression_data(n, g_beta, p_beta, X, X, delta, tau, family, zt = FALSE)
G <- out[, 1]
P <- out[, 2]
```

After the simulation, a copula regression is estimated with Poisson and Gamma marginals for the frequency and the severity, respectively. As shown in the model output, the estimated parameters and related inferences differ between the independence and dependence assumptions.

```r
m <- copreg(G, P, X, family = 4, sd.error = TRUE, joint = TRUE, zt = FALSE)
coef <- c("_CONST", "X1", "X2")
cols <- c("ESTIMATE", "STD. ERR", "Z-VALUE")
g_est <- cbind(m$alpha, m$sd.alpha, m$alpha / m$sd.alpha)
p_est <- cbind(m$beta, m$sd.beta, m$beta / m$sd.beta)
g_est0 <- cbind(m$alpha0, m$sd.alpha0, m$alpha0 / m$sd.alpha0)
p_est0 <- cbind(m$beta0, m$sd.beta0, m$beta0 / m$sd.beta0)
rownames(g_est) <- rownames(g_est0) <- rownames(p_est) <- rownames(p_est0) <- coef
colnames(g_est) <- colnames(g_est0) <- colnames(p_est) <- colnames(p_est0) <- cols

# estimated coefficients for the Gamma regression assuming dependence
print(g_est)
#          ESTIMATE  STD. ERR   Z-VALUE
# _CONST  2.9710512 0.2303651 12.897141
# X1     -1.8047627 0.2944627 -6.129003
# X2      0.9071093 0.2995218  3.028526

# estimated coefficients for the Poisson regression assuming dependence
print(p_est)
#         ESTIMATE   STD. ERR   Z-VALUE
# _CONST  2.954519 0.06023353  49.05107
# X1     -1.967023 0.09233056 -21.30414
# X2      1.025863 0.08254870  12.42736

# estimated coefficients for the Gamma regression assuming independence
# should be identical to the glm() outcome
print(g_est0)
#         ESTIMATE  STD. ERR   Z-VALUE
# _CONST  3.020771 0.2499246 12.086727
# X1     -1.777570 0.3480328 -5.107478
# X2      0.905527 0.3619011  2.502140

# estimated coefficients for the Poisson regression assuming independence
# should be identical to the glm() outcome
print(p_est0)
#         ESTIMATE   STD. ERR   Z-VALUE
# _CONST  2.939787 0.06507502  45.17536
# X1     -2.010535 0.10297887 -19.52376
# X2      1.088269 0.09334663  11.65837
```

Comparing the conditional loss distributions under the two dependence assumptions shows that the predicted loss from the copula regression tends to have a fatter right tail and therefore should be considered more conservative.

```r
library(ggplot2)

df <- data.frame(g = G, p = P, x1 = X[, 2], x2 = X[, 3])
glm_p <- glm(p ~ x1 + x2, data = df, family = poisson(log))
glm_g <- glm(g ~ x1 + x2, data = df, family = Gamma(log))
loss_dep <- predict(m, X, X, independence = FALSE)[3][[1]][[1]]
loss_ind <- fitted(glm_p) * fitted(glm_g)
den <- data.frame(loss = c(loss_dep, loss_ind),
                  lines = rep(c("DEPENDENCE", "INDEPENDENCE"), each = n))
ggplot(den, aes(x = loss, fill = lines)) + geom_density(alpha = 0.5)
```

## Prediction Intervals for Poisson Regression

Different from the confidence interval, which addresses the uncertainty around the conditional mean, the prediction interval accommodates the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk due to the aforementioned uncertainties.

While calculating the prediction interval of an OLS regression under the Gaussian distributional assumption is relatively straightforward with off-the-shelf solutions in R, it can be more complicated for a Generalized Linear Model, e.g. Poisson regression. In this post, I am going to show two empirical methods for calculating the prediction interval of a Poisson regression, one based on bootstrapping and the other based on simulation. Because of the high computing cost, parallelism with the foreach() function is used to improve efficiency.

First of all, let’s estimate a Poisson regression with glm() and generate a few fake new data points for model predictions. Since the toy data is very small, with only 32 records and all categorical predictors, I doubled the sample size with rbind() to ensure appropriate data coverage in the bootstrapping.

```r
pkgs <- c('doParallel', 'foreach')
lapply(pkgs, require, character.only = TRUE)
registerDoParallel(cores = 4)

data(AutoCollision, package = "insuranceData")
df <- rbind(AutoCollision, AutoCollision)
mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log"))
new_fake <- df[1:5, 1:2]
```

The first method, shown below, is based on bootstrapping with the following steps:

1. Draw a random sample with replacement from the original model development data;

2. Repeat the above many times, e.g. 1,000, to generate different bootstrap samples;

3. Refit the model with each bootstrap sample;

4. Generate predictions for the new data points, e.g. “new_fake”, with each refitted model;

5. Draw random numbers from the Poisson distribution with the mean, e.g. lambda, equal to the predicted values from the refitted models;

6. Collect all Poisson random numbers from the previous step and calculate the percentiles.

```r
boot_pi <- function(model, pdata, n, p) {
  odata <- model$data
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ]
    bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"),
                    lower = boot_ci[, 1], upper = boot_ci[, 2]))
}

boot_pi(mdl, new_fake, 1000, 0.95)
#       pred lower upper
# 1 12.63040     6    21
# 2 38.69738    25    55
# 3 26.97271    16    39
# 4 10.69951     4    18
# 5 52.50839    35    70
```

The second method is based on simulation and is outlined below:

1. Reproduce the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to the predicted values from the original model;

2. Repeat the simulation many times, e.g. 1,000, to generate many response series;

3. Generate 1,000 updated model samples by replacing the original response with the simulated responses;

4. Refit the model with each updated sample;

5. Generate predictions for the new data points, e.g. “new_fake”, with each refitted model;

6. Draw Poisson random numbers with lambda equal to the predicted values from the refitted models;

7. Collect all Poisson random numbers from the previous step and calculate the percentiles.

```r
sim_pi <- function(model, pdata, n, p) {
  odata <- model$data
  yhat <- predict(model, type = "response")
  lp <- (1 - p) / 2
  up <- 1 - lp
  set.seed(2016)
  seeds <- round(runif(n, 1, 1000), 0)
  sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% {
    set.seed(seeds[i])
    sim_y <- rpois(length(yhat), lambda = yhat)
    # rebuild the sample with the simulated response and the original predictors
    sdata <- data.frame(y = sim_y, odata[all.vars(formula(model))[-1]])
    refit <- glm(y ~ ., data = sdata, family = poisson)
    bpred <- predict(refit, type = "response", newdata = pdata)
    rpois(length(bpred), lambda = bpred)
  }
  sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up)))
  return(data.frame(pred = predict(model, newdata = pdata, type = "response"),
                    lower = sim_ci[, 1], upper = sim_ci[, 2]))
}

sim_pi(mdl, new_fake, 1000, 0.95)
#       pred lower upper
# 1 12.63040     6    21
# 2 38.69738    26    52
# 3 26.97271    17    39
# 4 10.69951     4    18
# 5 52.50839    38    68
```

As demonstrated above, with a large number of replications, the outcomes from both methods are highly consistent.

## Modeling Frequency in Operational Losses with Python

Poisson and Negative Binomial regressions are two popular approaches to modeling frequency measures in operational losses and can be implemented in Python with the statsmodels package as below:

```python
In [1]: import pandas as pd

In [2]: import statsmodels.api as sm

In [3]: import statsmodels.formula.api as smf

In [4]: df = pd.read_csv("AutoCollision.csv")

In [5]: # FITTING A POISSON REGRESSION

In [6]: poisson = smf.glm(formula = "Claim_Count ~ Age + Vehicle_Use", data = df,
   ...:                   family = sm.families.Poisson(sm.families.links.log))

In [7]: poisson.fit().summary()
Out[7]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            Claim_Count   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:                 Poisson   Df Model:                           10
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -204.40
Date:                Tue, 08 Dec 2015   Deviance:                       184.72
Time:                        20:31:27   Pearson chi2:                     184.
No. Iterations:                     9
=============================================================================================
                                coef    std err          z      P>|z|  [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     2.3702      0.110     21.588      0.000      2.155     2.585
Age[T.21-24]                  1.4249      0.118     12.069      0.000      1.193     1.656
Age[T.25-29]                  2.3465      0.111     21.148      0.000      2.129     2.564
Age[T.30-34]                  2.5153      0.110     22.825      0.000      2.299     2.731
Age[T.35-39]                  2.5821      0.110     23.488      0.000      2.367     2.798
Age[T.40-49]                  3.2247      0.108     29.834      0.000      3.013     3.437
Age[T.50-59]                  3.0019      0.109     27.641      0.000      2.789     3.215
Age[T.60+]                    2.6391      0.110     24.053      0.000      2.424     2.854
Vehicle_Use[T.DriveLong]      0.9246      0.036     25.652      0.000      0.854     0.995
Vehicle_Use[T.DriveShort]     1.2856      0.034     37.307      0.000      1.218     1.353
Vehicle_Use[T.Pleasure]       0.1659      0.041      4.002      0.000      0.085     0.247
=============================================================================================
"""

In [8]: # FITTING A NEGATIVE BINOMIAL REGRESSION

In [9]: nbinom = smf.glm(formula = "Claim_Count ~ Age + Vehicle_Use", data = df,
   ...:                  family = sm.families.NegativeBinomial(sm.families.links.log))

In [10]: nbinom.fit().summary()
Out[10]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            Claim_Count   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:        NegativeBinomial   Df Model:                           10
Link Function:                    log   Scale:                 0.0646089484752
Method:                          IRLS   Log-Likelihood:                -198.15
Date:                Tue, 08 Dec 2015   Deviance:                       1.4436
Time:                        20:31:27   Pearson chi2:                     1.36
No. Iterations:                    11
=============================================================================================
                                coef    std err          z      P>|z|  [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     2.2939      0.153     14.988      0.000      1.994     2.594
Age[T.21-24]                  1.4546      0.183      7.950      0.000      1.096     1.813
Age[T.25-29]                  2.4133      0.183     13.216      0.000      2.055     2.771
Age[T.30-34]                  2.5636      0.183     14.042      0.000      2.206     2.921
Age[T.35-39]                  2.6259      0.183     14.384      0.000      2.268     2.984
Age[T.40-49]                  3.2408      0.182     17.760      0.000      2.883     3.598
Age[T.50-59]                  2.9717      0.183     16.283      0.000      2.614     3.329
Age[T.60+]                    2.6404      0.183     14.463      0.000      2.283     2.998
Vehicle_Use[T.DriveLong]      0.9480      0.128      7.408      0.000      0.697     1.199
Vehicle_Use[T.DriveShort]     1.3402      0.128     10.480      0.000      1.090     1.591
Vehicle_Use[T.Pleasure]       0.3265      0.128      2.548      0.011      0.075     0.578
=============================================================================================
"""
```

Although Quasi-Poisson regression is not currently supported by the statsmodels package, we are still able to estimate the model with the rpy2 package by using R in the back-end. As shown in the output below, parameter estimates in the Quasi-Poisson model are identical to the ones in the standard Poisson model. If we want a flexible modeling approach for frequency measures in the operational loss forecast without pursuing the more complex Negative Binomial model, Quasi-Poisson regression can be considered a serious contender.

```python
In [11]: # FITTING A QUASI-POISSON REGRESSION

In [12]: import rpy2.robjects as ro

In [13]: from rpy2.robjects import pandas2ri

In [14]: pandas2ri.activate()

In [15]: rdf = pandas2ri.py2ri_pandasdataframe(df)

In [16]: qpoisson = ro.r.glm('Claim_Count ~ Age + Vehicle_Use', data = rdf,
    ...:                     family = ro.r('quasipoisson(link = "log")'))

In [17]: print ro.r.summary(qpoisson)
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             2.3702     0.3252   7.288 3.55e-07 ***
Age21-24                1.4249     0.3497   4.074 0.000544 ***
Age25-29                2.3465     0.3287   7.140 4.85e-07 ***
Age30-34                2.5153     0.3264   7.705 1.49e-07 ***
Age35-39                2.5821     0.3256   7.929 9.49e-08 ***
Age40-49                3.2247     0.3202  10.072 1.71e-09 ***
Age50-59                3.0019     0.3217   9.331 6.42e-09 ***
Age60+                  2.6391     0.3250   8.120 6.48e-08 ***
Vehicle_UseDriveLong    0.9246     0.1068   8.660 2.26e-08 ***
Vehicle_UseDriveShort   1.2856     0.1021  12.595 2.97e-11 ***
Vehicle_UsePleasure     0.1659     0.1228   1.351 0.191016
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 8.774501)

    Null deviance: 6064.97 on 31 degrees of freedom
Residual deviance:  184.72 on 21 degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4
```

## Are These Losses from The Same Distribution?

In Advanced Measurement Approaches (AMA) to operational risk modeling, a bank needs to segment operational losses into homogeneous segments known as “Units of Measure (UoM)”, which are often defined by the combination of lines of business (LOB) and Basel II event types. However, how do we demonstrate that the losses in one UoM are statistically different from the ones in another? The answer is to test whether the losses from various UoMs are distributionally different or equivalent.

Empirically, the Kolmogorov-Smirnov (K-S) test is often used to check whether two samples come from the same distribution. In the example below, although x and y share the same mean, the K-S test still shows a significant result due to the different variances.

```r
n <- 300
set.seed(2015)
x <- rnorm(n, 0, 1)
y <- rnorm(n, 0, 1.5)

### 2-SAMPLE DISTRIBUTIONAL COMPARISON ###
ks.test(x, y, alternative = "two.sided")
#         Two-sample Kolmogorov-Smirnov test
#
# data:  x and y
# D = 0.1567, p-value = 0.001268
# alternative hypothesis: two-sided
```
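The same two-sample comparison can be reproduced in Python with scipy.stats.ks_2samp; the sketch below mirrors the R example but uses its own seed and a larger sample, so the exact statistic will differ.

```python
import numpy as np
from scipy import stats

np.random.seed(2015)
n = 3000
x = np.random.normal(0, 1.0, n)
y = np.random.normal(0, 1.5, n)

# two-sample Kolmogorov-Smirnov test: same mean, different variance
d, p = stats.ks_2samp(x, y)
print(round(d, 4), p)  # the difference in variance alone drives a significant result
```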

However, the K-S test cannot be directly generalized to K-sample cases with K > 2. In such a scenario, the univariate coverage test or the more general multivariate MRPP test might be more appropriate. The Blossom package developed by Talbert and Cade (https://www.fort.usgs.gov/products/23735) provides convenient functions implementing both tests, as shown below.

```r
z <- rnorm(n, 0, 2)
df <- data.frame(x = c(x, y, z), g = unlist(lapply(c(1, 2, 3), rep, n)))

### K-SAMPLE DISTRIBUTIONAL COMPARISON ###

# COVERAGE TEST FOR THE UNIVARIATE RESPONSES
library(Blossom)
ctest <- coverage(df$x, df$g)
summary(ctest)
# Results:
#   Observed coverage statistic             : 1.870273
#   Mean of coverage statistic              : 1.774817
#   Estimated variance of coverage statistic: 0.002275862
#   Standard deviation of the variance
#     of the coverage statistic             : 5.108031e-05
#
#   Observed standardized coverage statistic: 2.00093
#   Skewness of observed coverage statistic : 0.08127759
#   Probability (Pearson Type III)
#     of a larger or equal coverage statistic: 0.02484709
#   Probability (Resampled)
#     of a larger or equal coverage statistic: 0.02475*

# MULTIRESPONSE PERMUTATION PROCEDURE FOR MULTIVARIATE RESPONSES
mtest <- mrpp(x, g, df)
summary(mtest)
# Results:
#   Delta Observed                : 1.676303
#   Delta Expected                : 1.708194
#   Delta Variance                : 4.262303e-06
#   Delta Skewness                : -1.671773
#
#   Standardized test statistic   : -15.44685
#   Probability (Pearson Type III)
#     of a smaller or equal delta : 9.433116e-09***
```
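If Python is preferred for the k-sample comparison, scipy.stats.anderson_ksamp offers a k-sample Anderson-Darling test as one alternative to the coverage / MRPP tests used above; note this is a substitution for illustration, not the Blossom functionality.

```python
import numpy as np
from scipy import stats

np.random.seed(2015)
n = 300
x = np.random.normal(0, 1.0, n)
y = np.random.normal(0, 1.5, n)
z = np.random.normal(0, 2.0, n)

# k-sample Anderson-Darling test across the three samples
res = stats.anderson_ksamp([x, y, z])
print(res.statistic)
print(res.significance_level)  # scipy caps this value to the [0.001, 0.25] range
```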

## To Difference or Not To Difference?

In the textbooks on time series analysis, we’ve been taught to difference a time series in order to obtain a stationary series, which can be justified by various plots and statistical tests.

In real-world time series analysis, things are not always as clear-cut as in the textbook. For instance, although the ACF plot shows a not-so-slow decay pattern, the ADF test nevertheless fails to reject the null hypothesis of a unit root. In such cases, many analysts might tend to difference the time series just to be on the safe side.

However, is it really safe to difference a time series anyway just to obtain a stationary series to model? In the example below, I will show that inappropriately differencing a time series can lead the model development in an undesirable direction.

First of all, let’s simulate a univariate series under the Gaussian distributional assumption. By construction, this series is stationary.

```r
> library(urca)
> library(forecast)
> library(normwhn.test)
> x <- rnorm(100)
> par(mfrow = c(2, 1))
> acf(x)
> pacf(x)
> whitenoise.test(x)
[1] "no. of observations"
[1] 100
[1] "T"
[1] 50
[1] "CVM stat MN"
[1] 0.8687478
[1] "tMN"
[1] -0.9280931
[1] "test value"
[1] 0.6426144
> x.adf <- ur.df(x, type = c("none"), selectlags = "BIC")
> summary(x.adf)

###############################################
# Augmented Dickey-Fuller Test Unit Root Test #
###############################################

Test regression none

Call:
lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)

Residuals:
     Min       1Q   Median       3Q      Max
-1.75385 -0.60585 -0.03467  0.61702  3.10100

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
z.lag.1    -1.008829   0.143635  -7.024  3.1e-10 ***
z.diff.lag  0.002833   0.101412   0.028    0.978
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9501 on 96 degrees of freedom
Multiple R-squared: 0.5064, Adjusted R-squared: 0.4961
F-statistic: 49.25 on 2 and 96 DF, p-value: 1.909e-15

Value of test-statistic is: -7.0235

Critical values for test statistics:
     1pct  5pct 10pct
tau1 -2.6 -1.95 -1.61

> x.kpss <- ur.kpss(x, type = "mu", lags = "short")
> summary(x.kpss)

#######################
# KPSS Unit Root Test #
#######################

Test is of type: mu with 4 lags.

Value of test-statistic is: 0.4136

Critical value for a significance level of:
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739

> auto.arima(x, ic = 'bic')
Series: x
ARIMA(0,0,0) with zero mean

sigma^2 estimated as 0.8829: log likelihood=-135.67
AIC=273.34  AICc=273.38  BIC=275.94
```

As shown in the above output:

1) Since x is simulated under the normal assumption, the series should be white noise by definition.

2) The ACF plot shows no autocorrelation at all, as it should.

3) In the ADF test, the null hypothesis of a unit root is rejected.

4) In the KPSS test, the null hypothesis of stationarity is not rejected.

5) The output from auto.arima() suggests an ARIMA(0, 0, 0) model, which is completely in line with the assumption.

However, what would happen if we take the difference of x anyway?

```r
> difx <- diff(x)
> par(mfrow = c(2, 1))
> acf(difx)
> pacf(difx)
> whitenoise.test(difx)
[1] "no. of observations"
[1] 99
[1] "T"
[1] 49
[1] "CVM stat MN"
[1] 1.669876
[1] "tMN"
[1] 4.689132
[1] "test value"
[1] 0.01904923
> auto.arima(difx, ic = 'bic')
Series: difx
ARIMA(0,0,1) with zero mean

Coefficients:
          ma1
      -0.9639
s.e.   0.0327

sigma^2 estimated as 0.901: log likelihood=-136.64
AIC=277.27  AICc=277.4  BIC=282.46
```

The above output is quite interesting in that we have artificially “created” a model by over-differencing a white noise series.

1) After over-differencing, the series is no longer white noise, with the null hypothesis rejected, e.g. p-value = 0.02.

2) In addition, auto.arima() suggests that an ARIMA(0, 0, 1) model might fit the data.
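The ARIMA(0, 0, 1) result is no accident: differencing white noise e(t) yields d(t) = e(t) - e(t-1), an MA(1) process with a unit root in the MA polynomial and a lag-1 autocorrelation of exactly -0.5. The short Python sketch below verifies this empirically.

```python
import numpy as np

np.random.seed(1)
e = np.random.normal(size = 5000)  # white noise
d = np.diff(e)                     # over-differenced series, d(t) = e(t) - e(t-1)

def acf1(s):
    # sample lag-1 autocorrelation
    s = s - s.mean()
    return np.sum(s[1:] * s[:-1]) / np.sum(s * s)

print(round(acf1(e), 3))  # close to 0: white noise has no autocorrelation
print(round(acf1(d), 3))  # close to -0.5: the signature of an over-differenced series
```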

## A SAS Macro for Scorecard Evaluation with Weights

On 09/28/2012, I posted a SAS macro evaluating scorecard performance, e.g. KS / AUC statistics (https://statcompute.wordpress.com/2012/09/28/a-sas-macro-for-scorecard-performance-evaluation). However, that macro is not generic enough to handle cases with a weighting variable. In a recent project I am working on, a weight is attached to each credit applicant due to the reject inference. Therefore, there is no excuse for me to keep neglecting the necessity of developing another SAS macro that can accommodate a weight variable with any positive values in the scorecard evaluation. Below is a quick draft of the macro. You might have to tweak it a little to suit your needs in production.

```sas
%macro wt_auc(data = , score = , y = , weight = );
***********************************************************;
* THE MACRO IS TO EVALUATE THE SEPARATION POWER OF A      *;
* SCORECARD WITH WEIGHTS                                  *;
* ------------------------------------------------------- ;
* PARAMETERS:                                             *;
*  DATA  : INPUT DATASET                                  *;
*  SCORE : SCORECARD VARIABLE                             *;
*  Y     : RESPONSE VARIABLE IN (0, 1)                    *;
*  WEIGHT: WEIGHT VARIABLE WITH POSITIVE VALUES           *;
* ------------------------------------------------------- ;
* OUTPUTS:                                                *;
*  A SUMMARY TABLE WITH KS AND AUC STATISTICS             *;
* ------------------------------------------------------- ;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

data _tmp1 (keep = &score &y &weight);
  set &data;
  where &score ~= . and &y in (0, 1) and &weight >= 0;
run;

*** CAPTURE THE DIRECTION OF SCORE ***;
ods listing close;
ods output spearmancorr = _cor;
proc corr data = _tmp1 spearman;
  var &y;
  with &score;
run;
ods listing;

data _null_;
  set _cor;
  if &y >= 0 then do;
    call symput('desc', 'descending');
  end;
  else do;
    call symput('desc', ' ');
  end;
run;

proc sql noprint;
create table _tmp2 as
select &score as _scr, sum(&y) as _bcnt, count(*) as _cnt,
       sum(case when &y = 1 then &weight else 0 end) as _bwt,
       sum(case when &y = 0 then &weight else 0 end) as _gwt
from _tmp1
group by &score;

select sum(_bwt) into :bsum from _tmp2;
select sum(_gwt) into :gsum from _tmp2;
select sum(_cnt) into :cnt from _tmp2;
quit;
%put &cnt;

proc sort data = _tmp2;
  by &desc _scr;
run;

data _tmp3;
  set _tmp2;
  by &desc _scr;
  retain _gcum _bcum _cntcum;
  _gcum + _gwt;
  _bcum + _bwt;
  _cntcum + _cnt;
  _gpct = _gcum / &gsum;
  _bpct = _bcum / &bsum;
  _ks = abs(_gpct - _bpct) * 100;
  _rank = int(_cntcum / ceil(&cnt / 10)) + 1;
run;

proc sort data = _tmp3 sortsize = max;
  by _gpct _bpct;
run;

data _tmp4;
  set _tmp3;
  by _gpct _bpct;
  if last._gpct then do;
    _idx + 1;
    output;
  end;
run;

proc sql noprint;
create table _tmp5 as
select a._gcum as gcum,
       (b._gpct - a._gpct) * (b._bpct + a._bpct) / 2 as dx
from _tmp4 as a, _tmp4 as b
where a._idx + 1 = b._idx;

select sum(dx) format = 15.8 into :AUC from _tmp5;
select max(_ks) format = 15.8 into :KS_STAT from _tmp3;
select _scr format = 6.2 into :KS_SCORE from _tmp3
where _ks = (select max(_ks) from _tmp3);

create table _tmp6 as
select _rank as rank, min(_scr) as min_scr, max(_scr) as max_scr,
       sum(_cnt) as cnt, sum(_gwt + _bwt) as wt,
       sum(_gwt) as gwt, sum(_bwt) as bwt,
       sum(_bwt) / calculated wt as bad_rate
from _tmp3
group by _rank;
quit;

proc report data = _last_ spacing = 1 split = "/" headline nowd;
  column("GOOD BAD SEPARATION REPORT FOR %upcase(%trim(&score)) IN DATA %upcase(%trim(&data))/
          MAXIMUM KS = %trim(&ks_stat) AT SCORE POINT %trim(&ks_score) and AUC STATISTICS = %trim(&auc)/ /"
         rank min_scr max_scr cnt wt gwt bwt bad_rate);
  define rank     / noprint order order = data;
  define min_scr  / "MIN/SCORE"      width = 10 format = 9.2       analysis min center;
  define max_scr  / "MAX/SCORE"      width = 10 format = 9.2       analysis max center;
  define cnt      / "RAW/COUNT"      width = 10 format = comma9.   analysis sum;
  define wt       / "WEIGHTED/SUM"   width = 15 format = comma14.2 analysis sum;
  define gwt      / "WEIGHTED/GOODS" width = 15 format = comma14.2 analysis sum;
  define bwt      / "WEIGHTED/BADS"  width = 15 format = comma14.2 analysis sum;
  define bad_rate / "BAD/RATE"       width = 10 format = percent9.2 order center;
  rbreak after / summarize dol skip;
run;

proc datasets library = work nolist;
  delete _tmp: / memtype = data;
run;
quit;

***********************************************************;
* END OF THE MACRO                                        *;
***********************************************************;
%mend wt_auc;
```
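The core computation inside the macro, accumulating weighted goods and bads along the sorted score, taking the maximum gap as KS, and integrating one cumulative curve against the other with the trapezoid rule for AUC, can be sketched in a few lines of Python. This is an illustrative re-implementation, not the SAS macro itself; in particular it assumes lower scores are riskier, whereas the macro detects the direction with a Spearman correlation.

```python
import numpy as np

def wt_ks_auc(score, y, weight):
    # sort by ascending score (assumes lower scores are riskier)
    order = np.argsort(score)
    y = np.asarray(y)[order]
    w = np.asarray(weight)[order]
    bwt = np.where(y == 1, w, 0.0)  # weighted bads
    gwt = np.where(y == 0, w, 0.0)  # weighted goods
    bpct = np.cumsum(bwt) / bwt.sum()
    gpct = np.cumsum(gwt) / gwt.sum()
    ks = np.max(np.abs(gpct - bpct)) * 100
    # AUC as the trapezoid-rule integral of cumulative bads against cumulative goods
    auc = np.sum(np.diff(gpct) * (bpct[1:] + bpct[:-1]) / 2)
    return ks, auc

# a perfectly separating toy score: all bads score below all goods
ks, auc = wt_ks_auc([1, 2, 3, 4], [1, 1, 0, 0], [1.0, 1.0, 1.0, 1.0])
print(ks, auc)  # 100.0 1.0
```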

In the first demo below, a weight variable with fractional values is tested.

```sas
*** TEST CASE OF FRACTIONAL WEIGHTS ***;
data one;
  set data.accepts;
  weight = ranuni(1);
run;

%wt_auc(data = one, score = bureau_score, y = bad, weight = weight);
/*
GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA ONE
MAXIMUM KS = 34.89711721 AT SCORE POINT 678.00 and AUC STATISTICS = 0.73521009

   MIN        MAX        RAW        WEIGHTED        WEIGHTED        WEIGHTED     BAD
  SCORE      SCORE      COUNT         SUM            GOODS            BADS      RATE
-------------------------------------------------------------------------------------------
  443.00     619.00        539        276.29          153.16          123.13   44.56%
  620.00     644.00        551        273.89          175.00           98.89   36.11%
  645.00     660.00        544        263.06          176.88           86.18   32.76%
  661.00     676.00        555        277.26          219.88           57.38   20.70%
  677.00     692.00        572        287.45          230.41           57.04   19.84%
  693.00     707.00        510        251.51          208.25           43.26   17.20%
  708.00     724.00        576        276.31          243.89           32.42   11.73%
  725.00     746.00        566        285.53          262.73           22.80    7.98%
  747.00     772.00        563        285.58          268.95           16.62    5.82%
  773.00     848.00        546        272.40          264.34            8.06    2.96%
========== ========== ========== =============== =============== ===============
  443.00     848.00      5,522      2,749.28        2,203.49          545.79
*/
```

In the second demo, a weight variable with positive integers is also tested.

```sas
*** TEST CASE OF INTEGER WEIGHTS ***;
data two;
  set data.accepts;
  weight = rand("poisson", 20);
run;

%wt_auc(data = two, score = bureau_score, y = bad, weight = weight);
/*
GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA TWO
MAXIMUM KS = 35.58884479 AT SCORE POINT 679.00 and AUC STATISTICS = 0.73725030

   MIN        MAX        RAW        WEIGHTED        WEIGHTED        WEIGHTED     BAD
  SCORE      SCORE      COUNT         SUM            GOODS            BADS      RATE
-------------------------------------------------------------------------------------------
  443.00     619.00        539      10,753.00        6,023.00        4,730.00  43.99%
  620.00     644.00        551      11,019.00        6,897.00        4,122.00  37.41%
  645.00     660.00        544      10,917.00        7,479.00        3,438.00  31.49%
  661.00     676.00        555      11,168.00        8,664.00        2,504.00  22.42%
  677.00     692.00        572      11,525.00        9,283.00        2,242.00  19.45%
  693.00     707.00        510      10,226.00        8,594.00        1,632.00  15.96%
  708.00     724.00        576      11,497.00       10,117.00        1,380.00  12.00%
  725.00     746.00        566      11,331.00       10,453.00          878.00   7.75%
  747.00     772.00        563      11,282.00       10,636.00          646.00   5.73%
  773.00     848.00        546      10,893.00       10,598.00          295.00   2.71%
========== ========== ========== =============== =============== ===============
  443.00     848.00      5,522    110,611.00       88,744.00       21,867.00
*/
```

As shown, the macro works well in both cases. Please feel free to let me know if it helps in your cases.

## Disaggregating Annual Losses into Each Quarter

In loss forecasting, it is often necessary to disaggregate annual losses into quarters. The simplest method to convert a low-frequency time series to a high-frequency one is interpolation, such as the one implemented in the EXPAND procedure of SAS/ETS. In the example below, there is a series of annual loss projections from 2013 through 2016. An interpolation by natural spline is used to convert the annual losses into quarterly ones.

**SAS Code:**

```sas
data annual;
  input loss year mmddyy8.;
  format year mmddyy8.;
datalines;
19270175 12/31/13
18043897 12/31/14
17111193 12/31/15
17011107 12/31/16
;
run;

proc expand data = annual out = quarterly from = year to = quarter;
  id year;
  convert loss / observed = total method = spline(natural);
run;

proc sql;
select year(year) as year,
       sum(case when qtr(year) = 1 then loss else 0 end) as qtr1,
       sum(case when qtr(year) = 2 then loss else 0 end) as qtr2,
       sum(case when qtr(year) = 3 then loss else 0 end) as qtr3,
       sum(case when qtr(year) = 4 then loss else 0 end) as qtr4,
       sum(loss) as total
from quarterly
group by calculated year;
quit;
```

**Output:**

```
year      qtr1      qtr2      qtr3      qtr4      total
2013   4868536   4844486   4818223   4738931   19270175
2014   4560049   4535549   4510106   4438194   18043897
2015   4279674   4276480   4287373   4267666   17111193
2016   4215505   4220260   4279095   4296247   17011107
```

While mathematical interpolation is easy to implement, it might be difficult to justify and interpret from the business standpoint. In reality, there might be an assumption that the loss trend follows the movement of the macro-economy. Therefore, it might be advantageous to disaggregate annual losses into quarterly ones with the inclusion of one or more economic indicators. This approach is implemented in the tempdisagg package in R. Below is a demo with the same loss data used above; this time, however, the disaggregation of annual losses is driven by a macro-economic indicator.

**R Code:**

```r
library(tempdisagg)

loss <- c(19270175, 18043897, 17111193, 17011107)
loss.a <- ts(loss, frequency = 1, start = 2013)
econ <- c(7.74, 7.67, 7.62, 7.48, 7.32, 7.11, 6.88, 6.63,
          6.41, 6.26, 6.12, 6.01, 5.93, 5.83, 5.72, 5.59)
econ.q <- ts(econ, frequency = 4, start = 2013)
summary(mdl <- td(loss.a ~ econ.q))
print(predict(mdl))
```

**Output:**

```
Call:
td(formula = loss.a ~ econ.q)

Residuals:
Time Series:
Start = 2013
End = 2016
Frequency = 1
[1]  199753 -234384 -199257  233888

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2416610     359064   6.730   0.0214 *
econ.q        308226      53724   5.737   0.0291 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

'chow-lin-maxlog' disaggregation with 'sum' conversion
4 low-freq. obs. converted to 16 high-freq. obs.
Adjusted R-squared: 0.9141    AR1-Parameter: 0 (truncated)

        Qtr1    Qtr2    Qtr3    Qtr4
2013 4852219 4830643 4815232 4772080
2014 4614230 4549503 4478611 4401554
2015 4342526 4296292 4253140 4219235
2016 4302864 4272041 4238136 4198067
```

In practice, if a simple and flexible solution is desired without the need for interpretation, mathematical interpolation might be a good choice. On the other hand, if there is a strong belief that the macro-economy drives the loss trend, the regression-based method implemented in the tempdisagg package might be preferred. In our example, however, both methods generate very similar results.
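For readers with neither SAS nor the tempdisagg package, the indicator-driven idea can be approximated in a few lines of Python: allocate each annual total across its four quarters in proportion to the quarterly values of the economic indicator. This pro-rata scheme is a deliberate simplification of the Chow-Lin regression used by td() and will not reproduce its exact numbers, but it preserves the annual totals by construction.

```python
import numpy as np

annual = {2013: 19270175, 2014: 18043897, 2015: 17111193, 2016: 17011107}
econ = np.array([7.74, 7.67, 7.62, 7.48, 7.32, 7.11, 6.88, 6.63,
                 6.41, 6.26, 6.12, 6.01, 5.93, 5.83, 5.72, 5.59])

quarterly = {}
for i, (year, total) in enumerate(sorted(annual.items())):
    ind = econ[4 * i : 4 * i + 4]
    # allocate the annual total in proportion to the indicator within the year
    quarterly[year] = total * ind / ind.sum()

# each year's quarters sum back to the annual total
for year, q in quarterly.items():
    print(year, np.round(q), round(q.sum()))
```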