## Archive for the ‘**Loss Forecasting**’ Category

## Model Non-Negative Numeric Outcomes with Zeros

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in the operational loss model development. Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, which however might not be well supported empirically from the data generation standpoint.

In examples below, we demonstrated another flexible option, namely Zero-Adjusted (ZA) models, in both scenarios of modeling non-negative numeric outcomes, one with a small number of zeros and the other with a large number of zeros. The basic idea of ZA models is very intuitive and similar to the concept of Hurdle models for count outcomes. In a nutshell, non-negative numeric outcomes can be considered two data generation processes, one for point-mass at zeros and the other governed by a statistical distribution for positive outcomes. The latter could be either Gamma or Inverse Gaussian.

First of all, we sampled down an auto-claim data in a way that only 10 claims are zeros and the rest are all positive. While 10 is an arbitrary choice in the example, other small numbers should show similar results.

pkgs <- list("cplm", "gamlss", "MLmetrics") lapply(pkgs, require, character.only = T) data(AutoClaim, package = "cplm") df1 <- na.omit(AutoClaim) # SMALL NUMBER OF ZEROS set.seed(2017) smp <- sample(seq(nrow(df1[df1$CLM_AMT == 0, ])), size = 10, replace = FALSE) df2 <- rbind(df1[df1$CLM_AMT > 0, ], df1[df1$CLM_AMT == 0, ][smp, ])

Next, we applied both Tweedie and zero-adjusted Gamma (ZAGA) models to the data with only 10 zero outcomes. It is worth mentioning that ZAGA doesn’t have to be overly complex in this case. As shown below, while we estimated the Gamma Mu parameter with model attributes, the Nu parameter to separate zeros is just a constant with the intercept = -5.4. Both Tweedie and GAZA models gave very similar estimated parameters and predictive measures with MAPE = 0.61.

tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.194e+00 7.234e-02 113.277 < 2e-16 *** # BLUEBOOK 2.047e-05 3.068e-06 6.671 3.21e-11 *** # NPOLICY 7.274e-02 3.102e-02 2.345 0.0191 * MAPE(df2$CLM_AMT, fitted(tw)) # 0.6053669 zaga0 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.671e-02 175.629 < 2e-16 *** # BLUEBOOK 2.053e-05 2.090e-06 9.821 < 2e-16 *** # NPOLICY 6.948e-02 2.057e-02 3.377 0.000746 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) -5.3886 0.3169 -17 <2e-16 *** MAPE(df2$CLM_AMT, (1 - fitted(zaga0, what = "nu")) * fitted(zaga0, what = "mu")) # 0.6053314

In the next case, we used the full data with a large number of zeros in the response and then applied both Tweedie and ZAGA models again. However, in ZAGA model, we estimated two sub-models this time, one for the Nu parameter to separate zeros from non-zeros and the other for the Mu parameter to model non-zero outcomes. As shown below, ZAGA outperformed Tweedie in terms of MAPE due to the advantage that ZAGA is able to explain two data generation schemes separately with different model attributes, which is the capability beyond what Tweedie can provide.

# LARGE NUMBER OF ZEROS tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY + CLM_FREQ5 + MVR_PTS + INCOME, data = df1) # Estimate Std. Error t value Pr(>|t|) # (Intercept) 6.854e+00 1.067e-01 64.241 < 2e-16 *** # BLUEBOOK 1.332e-05 4.495e-06 2.963 0.00305 ** # NPOLICY 4.380e-02 3.664e-02 1.195 0.23196 # CLM_FREQ5 2.064e-01 2.937e-02 7.026 2.29e-12 *** # MVR_PTS 1.066e-01 1.510e-02 7.063 1.76e-12 *** # INCOME -4.606e-06 8.612e-07 -5.348 9.12e-08 *** MAPE(df1$CLM_AMT, fitted(tw)) # 1.484484 zaga1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAGA") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.203e+00 4.682e-02 175.218 < 2e-16 *** # BLUEBOOK 2.053e-05 2.091e-06 9.816 < 2e-16 *** # NPOLICY 6.948e-02 2.067e-02 3.362 0.000778 *** # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaga1, what = "nu")) * fitted(zaga1, what = "mu")) # 1.470228

Given the great flexibility of ZA models, we also have the luxury to explore other candidates than ZAGA. For instance, if the positive part of non-negative outcomes demonstrates a high variance, we can also try a zero-inflated Inverse Gaussian (ZAIG) model, as shown below.

zaig1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAIG") # Mu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 8.205e+00 5.836e-02 140.591 < 2e-16 *** # BLUEBOOK 2.163e-05 2.976e-06 7.268 3.97e-13 *** # NPOLICY 5.898e-02 2.681e-02 2.200 0.0278 * # Nu Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.153e+00 5.077e-02 22.72 <2e-16 *** # CLM_FREQ5 -3.028e-01 2.283e-02 -13.26 <2e-16 *** # MVR_PTS -1.509e-01 1.217e-02 -12.41 <2e-16 *** # INCOME 7.285e-06 6.269e-07 11.62 <2e-16 *** MAPE(df1$CLM_AMT, (1 - fitted(zaig1, what = "nu")) * fitted(zaig1, what = "mu")) # 1.469236

## Model Operational Losses with Copula Regression

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately.

In the prevailing modeling practice for operational losses, it is often convenient to assume a functional independence between frequency and severity models, which might not be the case empirically. For instance, in the economic downturn, both the frequency and the severity of consumer frauds might tend to increase simultaneously. With the independence assumption, while we can argue that same variables could be included in both frequency and severity models and therefore induce a certain correlation, the frequency-severity dependence and the its contribution to the loss distribution might be overlooked.

In the context of Copula, the distribution of operational losses can be considered a joint distribution determined by both marginal distributions and a parameter measuring the dependence between marginals, of which marginal distributions can be Poisson for the frequency and Gamma for the severity. Depending on the dependence structure in the data, various copula functions might be considered. For instance, a product copula can be used to describe the independence. In the example shown below, a Gumbel copula is considered given that it is often used to describe the positive dependence on the right tail, e.g. high severity and high frequency. For details, the book “Copula Modeling” by Trivedi and Zimmer is a good reference to start with.

In the demonstration, we simulated both frequency and severity measures driven by the same set of co-variates. Both are positively correlated with the Kendall’s tau = 0.5 under the assumption of Gumbel copula.

library(CopulaRegression) # number of observations to simulate n <- 100 # seed value for the simulation set.seed(2017) # design matrices with a constant column X <- cbind(rep(1, n), runif(n), runif(n)) # define coefficients for both Poisson and Gamma regressions p_beta <- g_beta <- c(3, -2, 1) # define the Gamma dispersion delta <- 1 # define the Kendall's tau tau <- 0.5 # copula parameter based on tau theta <- 1 / (1 - tau) # define the Gumbel Copula family <- 4 # simulate outcomes out <- simulate_regression_data(n, g_beta, p_beta, X, X, delta, tau, family, zt = FALSE) G <- out[, 1] P <- out[, 2]

After the simulation, a Copula regression is estimated with Poisson and Gamma marginals for the frequency and the severity respectively. As shown in the model estimation, estimated parameters with related inferences are different between independent and dependent assumptions.

m <- copreg(G, P, X, family = 4, sd.error = TRUE, joint = TRUE, zt = FALSE) coef <- c("_CONST", "X1", "X2") cols <- c("ESTIMATE", "STD. ERR", "Z-VALUE") g_est <- cbind(m$alpha, m$sd.alpha, m$alpha / m$sd.alpha) p_est <- cbind(m$beta, m$sd.beta, m$beta / m$sd.beta) g_est0 <- cbind(m$alpha0, m$sd.alpha0, m$alpha0 / m$sd.alpha0) p_est0 <- cbind(m$beta0, m$sd.beta0, m$beta0 / m$sd.beta0) rownames(g_est) <- rownames(g_est0) <- rownames(p_est) <- rownames(p_est0) <- coef colnames(g_est) <- colnames(g_est0) <- colnames(p_est) <- colnames(p_est0) <- cols # estimated coefficients for the Gamma regression assumed dependence print(g_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.9710512 0.2303651 12.897141 # X1 -1.8047627 0.2944627 -6.129003 # X2 0.9071093 0.2995218 3.028526 # estimated coefficients for the Gamma regression assumed dependence print(p_est) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.954519 0.06023353 49.05107 # X1 -1.967023 0.09233056 -21.30414 # X2 1.025863 0.08254870 12.42736 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(g_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 3.020771 0.2499246 12.086727 # X1 -1.777570 0.3480328 -5.107478 # X2 0.905527 0.3619011 2.502140 # estimated coefficients for the Gamma regression assumed independence # should be identical to GLM() outcome print(p_est0) # ESTIMATE STD. ERR Z-VALUE # _CONST 2.939787 0.06507502 45.17536 # X1 -2.010535 0.10297887 -19.52376 # X2 1.088269 0.09334663 11.65837

If we compare conditional loss distributions under different dependence assumptions, it shows that the predicted loss with Copula regression tends to have a fatter right tail and therefore should be considered more conservative.

df <- data.frame(g = G, p = P, x1 = X[, 2], x2 = X[, 3]) glm_p <- glm(p ~ x1 + x2, data = df, family = poisson(log)) glm_g <- glm(g ~ x1 + x2, data = df, family = Gamma(log)) loss_dep <- predict(m, X, X, independence = FALSE)[3][[1]][[1]] loss_ind <- fitted(glm_p) * fitted(glm_g) den <- data.frame(loss = c(loss_dep, loss_ind), lines = rep(c("DEPENDENCE", "INDEPENDENCE"), each = n)) ggplot(den, aes(x = loss, fill = lines)) + geom_density(alpha = 0.5)

## Prediction Intervals for Poisson Regression

Different from the confidence interval that is to address the uncertainty related to the conditional mean, the prediction interval is to accommodate the additional uncertainty associated with prediction errors. As a result, the prediction interval is always wider than the confidence interval in a regression model. In the context of risk modeling, the prediction interval is often used to address the potential model risk due to aforementioned uncertainties.

While calculating prediction interval of OLS regression based on the Gaussian distributional assumption is relatively straightforward with the off-shelf solution in R, it could be more complicated in a Generalized Linear Model, e.g. Poisson regression. In this post, I am going to show two empirical methods, one based on bootstrapping and the other based on simulation, calculating the prediction interval of a Poisson regression. Because of the high computing cost, the parallelism with foreach() function will be used to improve the efficiency.

First of all, let’s estimate a Poisson regression with glm() and generate a couple fake new data points to calculate model predictions. Since the toy data is very small with only 32 records with all categorical predictors, I doubled the sample size by rbind() to ensure the appropriate data coverage in the bootstrapping.

pkgs <- c('doParallel', 'foreach') lapply(pkgs, require, character.only = T) registerDoParallel(cores = 4) data(AutoCollision, package = "insuranceData") df <- rbind(AutoCollision, AutoCollision) mdl <- glm(Claim_Count ~ Age + Vehicle_Use, data = df, family = poisson(link = "log")) new_fake <- df[1:5, 1:2]

The first method shown below is based on the bootstrapping with following steps:

1. Bootstrapped the original model development sample by the random sample with replacements;

2. Repeated the above many times, e.g. 1000, to generate different bootstrapped samples;

3. Refitted models with bootstrapped samples;

4. Generated predictions with new data points, e.g. “new_fake”, but with refitted models;

5. Generated random numbers based on Poisson distribution with the mean, e.g. lambda, equal to the predicted values from refitted models

6. Collected all Poisson random numbers from the previous step and calculated the percentiles.

boot_pi <- function(model, pdata, n, p) { odata <- model$data lp <- (1 - p) / 2 up <- 1 - lp set.seed(2016) seeds <- round(runif(n, 1, 1000), 0) boot_y <- foreach(i = 1:n, .combine = rbind) %dopar% { set.seed(seeds[i]) bdata <- odata[sample(seq(nrow(odata)), size = nrow(odata), replace = TRUE), ] bpred <- predict(update(model, data = bdata), type = "response", newdata = pdata) rpois(length(bpred), lambda = bpred) } boot_ci <- t(apply(boot_y, 2, quantile, c(lp, up))) return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = boot_ci[, 1], upper = boot_ci[, 2])) } boot_pi(mdl, new_fake, 1000, 0.95) # pred lower upper #1 12.63040 6 21 #2 38.69738 25 55 #3 26.97271 16 39 #4 10.69951 4 18 #5 52.50839 35 70

The second method is based on the simulation and outlined as below:

1. Re-produced the model response variable, e.g. Claim_Count, by simulating Poisson random numbers with lambda equal to predicted values from the original model;

2. Repeated the above simulations many times, e.g. 1000, to generate many response series;

3. Generated 1000 updated model samples by replacing the original response with the new response generated from simulations;

4. Refitted models with these updated samples

5. Generated predictions with new data points, e.g. “new_fake”, but with refitted models;

6. Generated Poisson random numbers with lambda equal to the predicted values from refitted models

7. Collected all Poisson random numbers from the previous step and calculated the percentiles.

sim_pi <- function(model, pdata, n, p) { odata <- model$data yhat <- predict(model, type = "response") lp <- (1 - p) / 2 up <- 1 - lp set.seed(2016) seeds <- round(runif(n, 1, 1000), 0) sim_y <- foreach(i = 1:n, .combine = rbind) %dopar% { set.seed(seeds[i]) sim_y <- rpois(length(yhat), lambda = yhat) sdata <- data.frame(y = sim_y, odata[names(model$x)]) refit <- glm(y ~ ., data = sdata, family = poisson) bpred <- predict(refit, type = "response", newdata = pdata) rpois(length(bpred),lambda = bpred) } sim_ci <- t(apply(sim_y, 2, quantile, c(lp, up))) return(data.frame(pred = predict(model, newdata = pdata, type = "response"), lower = sim_ci[, 1], upper = sim_ci[, 2])) } sim_pi(mdl, new_fake, 1000, 0.95) # pred lower upper #1 12.63040 6 21 #2 38.69738 26 52 #3 26.97271 17 39 #4 10.69951 4 18 #5 52.50839 38 68

As demonstrated above, after a large number of replications, outcomes from both methods are highly consistent.

## Modeling Frequency in Operational Losses with Python

Poisson and Negative Binomial regressions are two popular approaches to model frequency measures in the operational loss and can be implemented in Python with the statsmodels package as below:

In [1]: import pandas as pd In [2]: import statsmodels.api as sm In [3]: import statsmodels.formula.api as smf In [4]: df = pd.read_csv("AutoCollision.csv") In [5]: # FITTING A POISSON REGRESSION In [6]: poisson = smf.glm(formula = "Claim_Count ~ Age + Vehicle_Use", data = df, family = sm.families.Poisson(sm.families.links.log)) In [7]: poisson.fit().summary() Out[7]: <class 'statsmodels.iolib.summary.Summary'> """ Generalized Linear Model Regression Results ============================================================================== Dep. Variable: Claim_Count No. Observations: 32 Model: GLM Df Residuals: 21 Model Family: Poisson Df Model: 10 Link Function: log Scale: 1.0 Method: IRLS Log-Likelihood: -204.40 Date: Tue, 08 Dec 2015 Deviance: 184.72 Time: 20:31:27 Pearson chi2: 184. No. Iterations: 9 ============================================================================================= coef std err z P>|z| [95.0% Conf. Int.] --------------------------------------------------------------------------------------------- Intercept 2.3702 0.110 21.588 0.000 2.155 2.585 Age[T.21-24] 1.4249 0.118 12.069 0.000 1.193 1.656 Age[T.25-29] 2.3465 0.111 21.148 0.000 2.129 2.564 Age[T.30-34] 2.5153 0.110 22.825 0.000 2.299 2.731 Age[T.35-39] 2.5821 0.110 23.488 0.000 2.367 2.798 Age[T.40-49] 3.2247 0.108 29.834 0.000 3.013 3.437 Age[T.50-59] 3.0019 0.109 27.641 0.000 2.789 3.215 Age[T.60+] 2.6391 0.110 24.053 0.000 2.424 2.854 Vehicle_Use[T.DriveLong] 0.9246 0.036 25.652 0.000 0.854 0.995 Vehicle_Use[T.DriveShort] 1.2856 0.034 37.307 0.000 1.218 1.353 Vehicle_Use[T.Pleasure] 0.1659 0.041 4.002 0.000 0.085 0.247 ============================================================================================= """ In [8]: # FITTING A NEGATIVE BINOMIAL REGRESSION In [9]: nbinom = smf.glm(formula = "Claim_Count ~ Age + Vehicle_Use", data = df, family = sm.families.NegativeBinomial(sm.families.links.log)) In [10]: nbinom.fit().summary() Out[10]: <class 'statsmodels.iolib.summary.Summary'> """ Generalized Linear Model Regression Results ============================================================================== Dep. Variable: Claim_Count No. Observations: 32 Model: GLM Df Residuals: 21 Model Family: NegativeBinomial Df Model: 10 Link Function: log Scale: 0.0646089484752 Method: IRLS Log-Likelihood: -198.15 Date: Tue, 08 Dec 2015 Deviance: 1.4436 Time: 20:31:27 Pearson chi2: 1.36 No. Iterations: 11 ============================================================================================= coef std err z P>|z| [95.0% Conf. Int.] --------------------------------------------------------------------------------------------- Intercept 2.2939 0.153 14.988 0.000 1.994 2.594 Age[T.21-24] 1.4546 0.183 7.950 0.000 1.096 1.813 Age[T.25-29] 2.4133 0.183 13.216 0.000 2.055 2.771 Age[T.30-34] 2.5636 0.183 14.042 0.000 2.206 2.921 Age[T.35-39] 2.6259 0.183 14.384 0.000 2.268 2.984 Age[T.40-49] 3.2408 0.182 17.760 0.000 2.883 3.598 Age[T.50-59] 2.9717 0.183 16.283 0.000 2.614 3.329 Age[T.60+] 2.6404 0.183 14.463 0.000 2.283 2.998 Vehicle_Use[T.DriveLong] 0.9480 0.128 7.408 0.000 0.697 1.199 Vehicle_Use[T.DriveShort] 1.3402 0.128 10.480 0.000 1.090 1.591 Vehicle_Use[T.Pleasure] 0.3265 0.128 2.548 0.011 0.075 0.578 ============================================================================================= """

Although Quasi-Poisson regressions is not currently supported by the statsmodels package, we are still able to estimate the model with the rpy2 package by using R in the back-end. As shown in the output below, parameter estimates in Quasi-Poisson model are identical to the ones in standard Poisson model. In case that we want a flexible model approach for frequency measures in the operational loss forecast without pursuing more complex Negative Binomial model, Quasi-Poisson regression can be considered a serious contender.

In [11]: # FITTING A QUASI-POISSON REGRESSION In [12]: import rpy2.robjects as ro In [13]: from rpy2.robjects import pandas2ri In [14]: pandas2ri.activate() In [15]: rdf = pandas2ri.py2ri_pandasdataframe(df) In [16]: qpoisson = ro.r.glm('Claim_Count ~ Age + Vehicle_Use', data = rdf, family = ro.r('quasipoisson(link = "log")')) In [17]: print ro.r.summary(qpoisson) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.3702 0.3252 7.288 3.55e-07 *** Age21-24 1.4249 0.3497 4.074 0.000544 *** Age25-29 2.3465 0.3287 7.140 4.85e-07 *** Age30-34 2.5153 0.3264 7.705 1.49e-07 *** Age35-39 2.5821 0.3256 7.929 9.49e-08 *** Age40-49 3.2247 0.3202 10.072 1.71e-09 *** Age50-59 3.0019 0.3217 9.331 6.42e-09 *** Age60+ 2.6391 0.3250 8.120 6.48e-08 *** Vehicle_UseDriveLong 0.9246 0.1068 8.660 2.26e-08 *** Vehicle_UseDriveShort 1.2856 0.1021 12.595 2.97e-11 *** Vehicle_UsePleasure 0.1659 0.1228 1.351 0.191016 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for quasipoisson family taken to be 8.774501) Null deviance: 6064.97 on 31 degrees of freedom Residual deviance: 184.72 on 21 degrees of freedom AIC: NA Number of Fisher Scoring iterations: 4

## Are These Losses from The Same Distribution?

In Advanced Measurement Approaches (AMA) for Operational Risk models, the bank needs to segment operational losses into homogeneous segments known as “Unit of Measures (UoM)”, which are often defined by the combination of lines of business (LOB) and Basel II event types. However, how do we support whether the losses in one UoM are statistically different from the ones in another UoM? The answer is to test if the losses from various UoMs are distributionally different or equivalent.

Empirically, Kolmogorov-Smirnov (K-S) test is often used to test if two samples are from the same distribution. In the example below, although x and y share the same mean, K-S test still shows a significant result due to different variances.

n <- 300 set.seed(2015) x <- rnorm(n, 0, 1) y <- rnorm(n, 0, 1.5) ### 2-SAMPLE DISTRIBUTIONAL COMPARISON ### ks.test(x, y, alternative = "two.sided") # Two-sample Kolmogorov-Smirnov test # # data: x and y # D = 0.1567, p-value = 0.001268 # alternative hypothesis: two-sided

However, K-S test cannot be generalized to K-sample cases, where K > 2. In such scenario, the univariate coverage test or the more general multivariate MRPP test might be more appropriate. The Blossom package developed by Talbert and Cade (https://www.fort.usgs.gov/products/23735) provides convenient functions implementing both tests, as shown below.

z <- rnorm(n, 0, 2) df <- data.frame(x = c(x, y, z), g = unlist(lapply(c(1, 2, 3), rep, n))) ### K-SAMPLE DISTRIBUTIONAL COMPARISON ### # COVERAGE TEST FOR THE UNIVARIATE RESPONSES library(Blossom) ctest <- coverage(df$x, df$g) summary(ctest) # Results: # Observed coverage statistic : 1.870273 # Mean of coverage statistic : 1.774817 # Estimated variance of coverage statistic: 0.002275862 # Standard deviation of the variance # of the coverage statistic : 5.108031e-05 # # Observed standardized coverage statistic: 2.00093 # Skewness of observed coverage statistic : 0.08127759 # Probability (Pearson Type III) # of a larger or equal coverage statistic : 0.02484709 # Probability (Resampled) # of a largeror equal coverage statistic : 0.02475* # MULTIRESPONSE PERMUTATION PROCEDURE FOR MULTIVARIATE RESPONSES mtest <- mrpp(x, g, df) summary(mtest) # Results: # Delta Observed : 1.676303 # Delta Expected : 1.708194 # Delta Variance : 4.262303e-06 # Delta Skewness : -1.671773 # # Standardized test statistic : -15.44685 # Probability (Pearson Type III) # of a smaller or equal delta : 9.433116e-09***

## To Difference or Not To Difference?

In the textbook of time series analysis, we’ve been taught to difference the time series in order to have a stationary series, which can be justified by various plots and statistical tests.

In the real-world time series analysis, things are not always as clear as shown in the textbook. For instance, although the ACF plot shows a not-so-slow decay pattern, ADF test however can’t reject the null hypothesis of a unit root. In such cases, many analysts might tend to difference the time series to be on the safe side in their view.

However, is it really a safe practice to difference a time series anyway to have a stationary series to model? In the example below, I will show that inappropriately differencing a time series would lead the model development to an undesirable direction.

First of all, let’s simulate an univariate series under the Gaussian distributional assumption. By theory, this series has to be stationary.

> library(urca) > library(forecast) > library(normwhn.test) > x <- rnorm(100) > par(mfrow = c(2, 1)) > acf(x) > pacf(x) > whitenoise.test(x) [1] "no. of observations" [1] 100 [1] "T" [1] 50 [1] "CVM stat MN" [1] 0.8687478 [1] "tMN" [1] -0.9280931 [1] "test value" [1] 0.6426144 > x.adf <- ur.df(x, type = c("none"), selectlags = "BIC") > summary(x.adf) ############################################### # Augmented Dickey-Fuller Test Unit Root Test # ############################################### Test regression none Call: lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag) Residuals: Min 1Q Median 3Q Max -1.75385 -0.60585 -0.03467 0.61702 3.10100 Coefficients: Estimate Std. Error t value Pr(>|t|) z.lag.1 -1.008829 0.143635 -7.024 3.1e-10 *** z.diff.lag 0.002833 0.101412 0.028 0.978 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.9501 on 96 degrees of freedom Multiple R-squared: 0.5064, Adjusted R-squared: 0.4961 F-statistic: 49.25 on 2 and 96 DF, p-value: 1.909e-15 Value of test-statistic is: -7.0235 Critical values for test statistics: 1pct 5pct 10pct tau1 -2.6 -1.95 -1.61 > x.pkss <- ur.kpss(x, type = "mu", lags = "short") > summary(x.pkss) ####################### # KPSS Unit Root Test # ####################### Test is of type: mu with 4 lags. Value of test-statistic is: 0.4136 Critical value for a significance level of: 10pct 5pct 2.5pct 1pct critical values 0.347 0.463 0.574 0.739 > auto.arima(x, ic = 'bic') Series: x ARIMA(0,0,0) with zero mean sigma^2 estimated as 0.8829: log likelihood=-135.67 AIC=273.34 AICc=273.38 BIC=275.94

As shown in the above output:

1) Since x is simulated with the normal assumption, the series should be a white noise by definition.

2) ACF plot shows no auto-correlation at all, as it should.

3) In ADF test, the null hypothesis of unit root is rejected.

4) In PKSS test, the null hypothesis of stationarity is not rejected.

5) The output from auto.arima() suggests an ARIMA(0, 0, 0) model, which is completely in line with the assumption.

However, what would happen if we take the difference of x anyway?

> difx <- diff(x) > par(mfrow = c(2, 1)) > acf(difx) > pacf(difx) > whitenoise.test(difx) [1] "no. of observations" [1] 99 [1] "T" [1] 49 [1] "CVM stat MN" [1] 1.669876 [1] "tMN" [1] 4.689132 [1] "test value" [1] 0.01904923 > auto.arima(difx, ic = 'bic') Series: difx ARIMA(0,0,1) with zero mean Coefficients: ma1 -0.9639 s.e. 0.0327 sigma^2 estimated as 0.901: log likelihood=-136.64 AIC=277.27 AICc=277.4 BIC=282.46

The above output is quite interesting in a way that we just artificially “created” a model by over-differencing the white noise series.

1) After over-differenced, the series is not a white noise anymore with the null hypothesis rejected, e.g. p-value = 0.02.

2) In addition, the auto.arima() suggests that an ARIMA(0, 0, 1) model might fit the data.

## A SAS Macro for Scorecard Evaluation with Weights

On 09/28/2012, I posted a SAS macro evaluation the scorecard performance, e.g. KS / AUC statistics (https://statcompute.wordpress.com/2012/09/28/a-sas-macro-for-scorecard-performance-evaluation). However, this macro is not generic enough to handle cases with a weighting variable. In a recent project that I am working on, there is a weight variable attached to each credit applicant due to the reject inference. Therefore, there is no excuse for me to continue neglecting the necessity of developing another SAS macro that can take care of the weight variable with any positive values in the scorecard evaluation. Below is a quick draft of the macro. You might have to tweak it a little to suit your needs in the production.

%macro wt_auc(data = , score = , y = , weight = ); ***********************************************************; * THE MACRO IS TO EVALUATE THE SEPARATION POWER OF A *; * SCORECARD WITH WEIGHTS *; * ------------------------------------------------------- *; * PARAMETERS: *; * DATA : INPUT DATASET *; * SCORE : SCORECARD VARIABLE *; * Y : RESPONSE VARIABLE IN (0, 1) *; * WEIGHT: WEIGHT VARIABLE WITH POSITIVE VALUES *; * ------------------------------------------------------- *; * OUTPUTS: *; * A SUMMARY TABLE WITH KS AND AUC STATISTICS *; * ------------------------------------------------------- *; * CONTACT: *; * WENSUI.LIU@53.COM *; ***********************************************************; options nocenter nonumber nodate mprint mlogic symbolgen orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*"; data _tmp1 (keep = &score &y &weight); set &data; where &score ~= . and &y in (0, 1) and &weight >= 0; run; *** CAPTURE THE DIRECTION OF SCORE ***; ods listing close; ods output spearmancorr = _cor; proc corr data = _tmp1 spearman; var &y; with &score; run; ods listing; data _null_; set _cor; if &y >= 0 then do; call symput('desc', 'descending'); end; else do; call symput('desc', ' '); end; run; proc sql noprint; create table _tmp2 as select &score as _scr, sum(&y) as _bcnt, count(*) as _cnt, sum(case when &y = 1 then &weight else 0 end) as _bwt, sum(case when &y = 0 then &weight else 0 end) as _gwt from _tmp1 group by &score; select sum(_bwt) into :bsum from _tmp2; select sum(_gwt) into :gsum from _tmp2; select sum(_cnt) into :cnt from _tmp2; quit; %put &cnt; proc sort data = _tmp2; by &desc _scr; run; data _tmp3; set _tmp2; by &desc _scr; retain _gcum _bcum _cntcum; _gcum + _gwt; _bcum + _bwt; _cntcum + _cnt; _gpct = _gcum / &gsum; _bpct = _bcum / &bsum; _ks = abs(_gpct - _bpct) * 100; _rank = int(_cntcum / ceil(&cnt / 10)) + 1; run; proc sort data = _tmp3 sortsize = max; by _gpct _bpct; run; data _tmp4; set _tmp3; by _gpct _bpct; if last._gpct then do; _idx + 1; output; end; run; proc sql noprint; create table _tmp5 as select a._gcum as gcum, (b._gpct - a._gpct) * (b._bpct + a._bpct) / 2 as dx from _tmp4 as a, _tmp4 as b where a._idx + 1 = b._idx; select sum(dx) format = 15.8 into :AUC from _tmp5; select max(_ks) format = 15.8 into :KS_STAT from _tmp3; select _scr format = 6.2 into :KS_SCORE from _tmp3 where _ks = (select max(_ks) from _tmp3); create table _tmp6 as select _rank as rank, min(_scr) as min_scr, max(_scr) as max_scr, sum(_cnt) as cnt, sum(_gwt + _bwt) as wt, sum(_gwt) as gwt, sum(_bwt) as bwt, sum(_bwt) / calculated wt as bad_rate from _tmp3 group by _rank; quit; proc report data = _last_ spacing = 1 split = "/" headline nowd; column("GOOD BAD SEPARATION REPORT FOR %upcase(%trim(&score)) IN DATA %upcase(%trim(&data))/ MAXIMUM KS = %trim(&ks_stat) AT SCORE POINT %trim(&ks_score) and AUC STATISTICS = %trim(&auc)/ /" rank min_scr max_scr cnt wt gwt bwt bad_rate); define rank / noprint order order = data; define min_scr / "MIN/SCORE" width = 10 format = 9.2 analysis min center; define max_scr / "MAX/SCORE" width = 10 format = 9.2 analysis max center; define cnt / "RAW/COUNT" width = 10 format = comma9. analysis sum; define wt / "WEIGHTED/SUM" width = 15 format = comma14.2 analysis sum; define gwt / "WEIGHTED/GOODS" width = 15 format = comma14.2 analysis sum; define bwt / "WEIGHTED/BADS" width = 15 format = comma14.2 analysis sum; define bad_rate / "BAD/RATE" width = 10 format = percent9.2 order center; rbreak after / summarize dol skip; run; proc datasets library = work nolist; delete _tmp: / memtype = data; run; quit; ***********************************************************; * END OF THE MACRO *; ***********************************************************; %mend wt_auc;

In the first demo below, a weight variable with fractional values is tested.

*** TEST CASE OF FRACTIONAL WEIGHTS ***; data one; set data.accepts; weight = ranuni(1); run; %wt_auc(data = one, score = bureau_score, y = bad, weight = weight); /* GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA ONE MAXIMUM KS = 34.89711721 AT SCORE POINT 678.00 and AUC STATISTICS = 0.73521009 MIN MAX RAW WEIGHTED WEIGHTED WEIGHTED BAD SCORE SCORE COUNT SUM GOODS BADS RATE ------------------------------------------------------------------------------------------- 443.00 619.00 539 276.29 153.16 123.13 44.56% 620.00 644.00 551 273.89 175.00 98.89 36.11% 645.00 660.00 544 263.06 176.88 86.18 32.76% 661.00 676.00 555 277.26 219.88 57.38 20.70% 677.00 692.00 572 287.45 230.41 57.04 19.84% 693.00 707.00 510 251.51 208.25 43.26 17.20% 708.00 724.00 576 276.31 243.89 32.42 11.73% 725.00 746.00 566 285.53 262.73 22.80 7.98% 747.00 772.00 563 285.58 268.95 16.62 5.82% 773.00 848.00 546 272.40 264.34 8.06 2.96% ========== ========== ========== =============== =============== =============== 443.00 848.00 5,522 2,749.28 2,203.49 545.79 */

In the second demo, a weight variable with positive integers is also tested.

*** TEST CASE OF INTEGER WEIGHTS ***; data two; set data.accepts; weight = rand("poisson", 20); run; %wt_auc(data = two, score = bureau_score, y = bad, weight = weight); /* GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA TWO MAXIMUM KS = 35.58884479 AT SCORE POINT 679.00 and AUC STATISTICS = 0.73725030 MIN MAX RAW WEIGHTED WEIGHTED WEIGHTED BAD SCORE SCORE COUNT SUM GOODS BADS RATE ------------------------------------------------------------------------------------------- 443.00 619.00 539 10,753.00 6,023.00 4,730.00 43.99% 620.00 644.00 551 11,019.00 6,897.00 4,122.00 37.41% 645.00 660.00 544 10,917.00 7,479.00 3,438.00 31.49% 661.00 676.00 555 11,168.00 8,664.00 2,504.00 22.42% 677.00 692.00 572 11,525.00 9,283.00 2,242.00 19.45% 693.00 707.00 510 10,226.00 8,594.00 1,632.00 15.96% 708.00 724.00 576 11,497.00 10,117.00 1,380.00 12.00% 725.00 746.00 566 11,331.00 10,453.00 878.00 7.75% 747.00 772.00 563 11,282.00 10,636.00 646.00 5.73% 773.00 848.00 546 10,893.00 10,598.00 295.00 2.71% ========== ========== ========== =============== =============== =============== 443.00 848.00 5,522 110,611.00 88,744.00 21,867.00 */

As shown, the macro works well in both cases. Please feel free to let me know if it helps in your cases.