Improve General Regression Neural Network by Monotonic Binning

A major criticism of the binning algorithm, as well as of the WoE transformation, is that the use of binned predictors will decrease the model’s predictive power due to the loss of data granularity after the WoE transformation. Since talk is cheap, I will use the example below to show that using the monotonic binning algorithm to pre-process predictors in a GRNN actually alleviates over-fitting and improves the prediction accuracy for the hold-out sample.

First of all, the whole dataset was split in half, with one half used as the training sample and the other as the hold-out sample. The smoothing parameter, i.e. sigma, was chosen through a random search and happened to be 2.198381 for both GRNNs.

  1. For the first GRNN with untransformed raw predictors, the AUC for the training sample is 0.69 and the AUC for the hold-out sample is 0.66.
  2. For the second GRNN with WoE-transformed predictors, the AUC for the training sample is 0.72 and the AUC for the hold-out sample is 0.69.

In this particular example, it is clearly shown that there is roughly a 4% – 5% improvement in the AUC statistic for both training and hold-out samples through the use of monotonic binning and WoE transformations.

df1 <- read.table("credit_count.txt", header = T, sep = ",")
df2 <- df1[which(df1$CARDHLDR == 1), ]
Y <- df2$DEFAULT
X <- scale(df2[, 3:ncol(df2)])
i <- sample(seq(length(Y)), length(Y) / 2)
# WITHOUT BINNING
Y1 <- Y[i]
Y2 <- Y[-i]
X1 <- X[i, ]
X2 <- X[-i, ]
net11 <- grnn.fit(x = X1, y = Y1)
test1 <- grnn.search_auc(net11, gen_latin(1, 3, 10), nfolds = 4)
# $best
# sigma auc
# 2.198381 0.6297201
net12 <- grnn.fit(x = X1, y = Y1, sigma = test1$best$sigma)
MLmetrics::AUC(grnn.parpred(net12, X1), Y1)
# 0.6855638
MLmetrics::AUC(grnn.parpred(net12, X2), Y2)
# 0.6555798
# WITH BINNING
df3 <- data.frame(df2[, 3:ncol(df2)], Y)
bin_out <- batch_bin(df3, method = 3)
df_woe <- batch_woe(df3, bin_out$BinLst)
W <- scale(df_woe$df[, -1])
W1 <- W[i, ]
W2 <- W[-i, ]
net21 <- grnn.fit(x = W1, y = Y1)
test2 <- grnn.search_auc(net21, gen_latin(1, 3, 10), nfolds = 4)
# $best
# sigma auc
# 2.198381 0.6820317
net22 <- grnn.fit(x = W1, y = Y1, sigma = test2$best$sigma)
MLmetrics::AUC(grnn.parpred(net22, W1), Y1)
# 0.7150051
MLmetrics::AUC(grnn.parpred(net22, W2), Y2)
# 0.6884229


GRNN with Small Samples

After a bank launches a new product or acquires a new portfolio, the risk modeling team is often faced with the challenge of estimating the corresponding performance, e.g. risk or loss, with a limited number of data points conditional on business drivers or macro-economic indicators. For instance, CCAR requires projecting the 9-quarter loss regardless of the portfolio age. In such cases, the prevalent practice based upon conventional regression models might not be applicable given the requirement for a sufficient number of samples in order to draw statistical inference. As a result, we would have to rely on the input of SMEs (Subject Matter Experts) to gauge the performance based on similar products and portfolios, or fall back on simple statistical metrics, such as the average or median, that can’t be intuitively related to predictors.

With the GRNN implemented in the YAGeR project (https://github.com/statcompute/yager), it is however technically feasible to project the expected performance conditional on predictors, due to the fact that the projected Y_i of a future case is determined by the distance between the predictor vector X_i and each X vector in the training sample, subject to a smoothing parameter, namely Sigma. While more samples in the training data are certainly helpful to estimate a generalizable model, a couple of data points, or even only one or two in the extreme case, are conceptually sufficient to form a GRNN that is able to generate sensible projections without violating statistical assumptions.

Following are a couple of practical considerations.

  1. Although normalizing the input data, e.g. the X matrix, in a GRNN is usually necessary for numerical reasons, the exact scaling is not required. Practically, a “rough” scaling can be employed, and the ranges or variances used in the normalization can be based upon the historical data of X that might not be fully reflected in a training sample of such a small size.
  2. With limited data points in the training data, the Sigma value can be chosen by L-O-O (Leave-One-Out) cross-validation, as sketched after this list, or empirically based upon another GRNN with a similar data structure that might or might not be related to the training data. What’s more, it is easy enough to dynamically fine-tune or refresh the Sigma value as more data samples become available over time.
  3. While there is no requirement for variable selection in a GRNN, the model developer does have the flexibility of judgmentally choosing predictors based upon prior information and eliminating variables not showing correct marginal effects in the PDP (https://statcompute.wordpress.com/2019/10/19/partial-dependence-plot-pdp-of-grnn).
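
To make the L-O-O idea in the second point concrete, below is a rough sketch of a generic GRNN predictor and a leave-one-out sigma search in base R. It is not the yager implementation; the function names grnn_pred() and loo_sigma() are hypothetical, and the kernel is the usual Gaussian weighting of training responses by the distance to the new predictor vector.

grnn_pred <- function(x_trn, y_trn, x_new, sigma) {
  # WEIGHTED AVERAGE OF TRAINING RESPONSES WITH GAUSSIAN KERNEL WEIGHTS
  apply(as.matrix(x_new), 1, function(v) {
    d2 <- rowSums(sweep(as.matrix(x_trn), 2, v) ^ 2)
    w <- exp(-d2 / (2 * sigma ^ 2))
    sum(w * y_trn) / sum(w)
  })
}

loo_sigma <- function(x, y, sigmas) {
  # LEAVE-ONE-OUT MEAN SQUARED ERROR FOR EACH CANDIDATE SIGMA
  mse <- sapply(sigmas, function(s)
           mean(sapply(seq_along(y), function(i)
             (y[i] - grnn_pred(x[-i, , drop = FALSE], y[-i], x[i, , drop = FALSE], s)) ^ 2)))
  sigmas[which.min(mse)]
}

# E.G. WITH A SMALL TRAINING SAMPLE X1 / Y1:
# loo_sigma(X1, Y1, seq(0.5, 3, by = 0.25))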

Below is an example of using 100 data points as the training sample to predict the LGD, which lies in the unit interval, for 1,000 hold-out cases with both a GLM and a GRNN. Out of 100 trials, the GLM outperformed the simple average only 32 times, while the GRNN did better 76 times.

source("yager.R")
df <- read.table("lgd", header = T)[, 1:8]
Y <- 1 - df$rr
X <- scale(df[, 2:8])
pre.N <- 1000
trn.N <- 100
try.N <- 100
seeds <- floor(with(set.seed(2020), runif(try.N) * 1e8))
test_glm <- function(seed) {
  i1 <- with(set.seed(seed), sample(seq(length(Y)), pre.N))
  Y1 <- Y[i1]
  X1 <- X[i1, ]
  Y2 <- Y[-i1]
  X2 <- X[-i1, ]
  i2 <- with(set.seed(seed), sample(seq(length(Y2)), trn.N))
  gm <- glm(Y2 ~ ., data = data.frame(Y2, X2)[i2, ], family = quasibinomial)
  round(MLmetrics::R2_Score(predict(gm, newdata = data.frame(X1), type = "response"), Y1), 4)
}
perf.glm <- Reduce(c, lapply(seeds, test_glm))
summary(perf.glm)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.39300 -0.10483 -0.02280 -0.05135 0.01230 0.08920
sum(perf.glm > 0) / length(perf.glm)
# [1] 0.32
test_grnn <- function(seed) {
  i1 <- with(set.seed(seed), sample(seq(length(Y)), pre.N))
  Y1 <- Y[i1]
  X1 <- X[i1, ]
  Y2 <- Y[-i1]
  X2 <- X[-i1, ]
  i2 <- with(set.seed(seed), sample(seq(length(Y2)), trn.N))
  gn <- grnn.fit(X2[i2, ], Y2[i2])
  round(MLmetrics::R2_Score(grnn.predict(gn, X1), Y1), 4)
}
perf.grnn <- Reduce(c, lapply(seeds, test_grnn))
summary(perf.grnn)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -0.06130 0.00075 0.03075 0.02739 0.05437 0.10000
sum(perf.grnn > 0) / length(perf.grnn)
# [1] 0.76


GRNN vs. GAM

In practice, a GRNN is very similar to a GAM (Generalized Additive Model) in the sense that they both share the flexibility of approximating non-linear functions. In the example below, both GRNN and GAM were applied to the kyphosis data that has been widely used in GAM examples, and they revealed very similar patterns of functional relationships between model predictors and the response (red for GRNN and blue for GAM). However, while we have to determine the degrees of freedom for each predictor in order to control the smoothness of a GAM model, there is only one tuning parameter governing the overall fitting of a GRNN model.

data(kyphosis, package = "gam")
y <- ifelse(kyphosis$Kyphosis == "present", 1, 0)
x <- scale(kyphosis[, -1])
### FIT A GRNN
net1 <- grnn.fit(x = x, y = y)
test <- grnn.search_auc(net1, sigmas = gen_sobol(min = 0.5, max = 1.5, n = 50), nfolds = 20)
net2 <- grnn.fit(x = x, y = y, sigma = min(test$best$sigma))
### FIT A GAM
library(gam)
gam1 <- gam(y ~ Age + Number + Start, data = data.frame(y, x), family = binomial)
step <- step.Gam(gam1, data = data.frame(x, y), direction = "both",
                 scope = list("Age" = ~1 + Age + s(Age, 3) + s(Age, 4) + s(Age, 5),
                              "Number" = ~1 + Number + s(Number, 3) + s(Number, 4) + s(Number, 5),
                              "Start" = ~1 + Start + s(Start, 3) + s(Start, 4) + s(Start, 5)))
# Start: y ~ Age + Number + Start; AIC= 69.3799
# Step:1 y ~ s(Age, 3) + Number + Start ; AIC= 66.1469
# Step:2 y ~ s(Age, 3) + Number + s(Start, 3) ; AIC= 64.1875
gam2 <- gam::gam(y ~ s(Age, 3) + Number + s(Start, 3), data = data.frame(x, y), family = binomial)
### PLOTTING
par(mfrow = c(2, 3))
for (i in 1:ncol(net2$x)) grnn.margin(net2, i)
plot(gam2, col = "blue", lwd = 5)


(Figure: marginal effects of Age, Number, and Start – GRNN in red, GAM in blue)

Modeling Practices of Operational Losses in CCAR

Based upon the CCAR 2019 benchmark report published by O.R.X, 88% of survey participants that submitted results to the Fed used regression models to project operational losses, demonstrating a strong convergence. As described in the report, OLS regression under the Gaussian distribution still seems to be the most prevalent approach.

Below is the summary of modeling approaches for operational losses based on my own experience and knowledge that might be helpful for other banking practitioners in CCAR.

Modeling Frequency
|
|– Equi-Dispersion (Baseline)
| |
| `– Standard Poisson
|
|– Over-Dispersion
| |
| |– Negative Binomial
| |
| |– Zero-Inflated Poisson
| |
| `– Finite Mixture Poisson
|
`– Over- or Under-Dispersion
|
|– Quasi-Poisson
|
|– Hurdle Poisson
|
|– Generalized Poisson
|
|– Double Poisson
|
|– Conway-Maxwell Poisson
|
`– Hyper-Poisson
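
As a quick illustration of a few branches in the tree above, the sketch below fits several of these frequency models in R on a simulated over-dispersed count outcome, using glm() together with the MASS and pscl packages; the data and variable names are hypothetical.

library(MASS)  # glm.nb FOR THE NEGATIVE BINOMIAL
library(pscl)  # zeroinfl / hurdle FOR ZERO-INFLATED AND HURDLE POISSON

# SIMULATED OVER-DISPERSED COUNTS FOR ILLUSTRATION ONLY
set.seed(1)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
cnt <- rnbinom(n, mu = exp(0.2 + 0.5 * x1 - 0.3 * x2), size = 1.5)
dat <- data.frame(cnt, x1, x2)

m_pois  <- glm(cnt ~ x1 + x2, data = dat, family = poisson)       # EQUI-DISPERSION BASELINE
m_quasi <- glm(cnt ~ x1 + x2, data = dat, family = quasipoisson)  # OVER- OR UNDER-DISPERSION
m_nb    <- glm.nb(cnt ~ x1 + x2, data = dat)                      # NEGATIVE BINOMIAL
m_zip   <- zeroinfl(cnt ~ x1 + x2 | x1, data = dat)               # ZERO-INFLATED POISSON
m_hdl   <- hurdle(cnt ~ x1 + x2 | x1, data = dat)                 # HURDLE POISSON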


Modeling Loss / Average Loss
|
|– Loss without Zeroes
| |
| |– Gamma
| |
| `– Inverse Gaussian
|
|– Loss with Some Zeros
| |
| |– Intercept-only ZAGA (Zero-Adjusted Gamma)
| |
| |– Intercept-only ZAIG (Zero-Adjusted Inverse Gaussian)
| |
| `– Tweedie
|
|– Loss with Many Zeros
| |
| |– ZAGA (Zero-Adjusted Gamma)
| |
| `– ZAIG (Zero-Adjusted Inverse Gaussian)
|
`– Loss with Nonzero Threshold Xm
|
|– Tobit
|
|– Pareto
|
`– Shifted Response with Y' = Y – Xm or Y' = Ln(Y / Xm)
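
Similarly, here is a brief sketch of the simplest branch above, i.e. strictly positive losses without zeros, fitting Gamma and Inverse Gaussian GLMs on simulated data; the Tweedie and zero-adjusted models are demonstrated with real data later in this post.

# SIMULATED STRICTLY POSITIVE LOSSES FOR ILLUSTRATION ONLY
set.seed(1)
n <- 1000
x <- rnorm(n)
sev <- rgamma(n, shape = 2, rate = 2 / exp(1 + 0.5 * x))

m_gamma <- glm(sev ~ x, family = Gamma(link = "log"))             # GAMMA REGRESSION
m_ig    <- glm(sev ~ x, family = inverse.gaussian(link = "log"))  # INVERSE GAUSSIAN REGRESSION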


Co-integration and Mean Reverting Portfolio

In the previous post (https://statcompute.wordpress.com/2018/07/29/co-integration-and-pairs-trading), it was shown how to identify two co-integrated stocks for the pairs trade. In the example below, I will show how to form a mean-reverting portfolio with three or more co-integrated stocks and how to find the linear combination of these stocks that is stationary.

First of all, we downloaded the price series of three stocks from finance.yahoo.com.

### GET DATA FROM YAHOO FINANCE
quantmod::getSymbols("FITB", from = "2010-01-01")
FITB <- get("FITB")[, 6]
quantmod::getSymbols("MTB", from = "2010-01-01")
MTB <- get("MTB")[, 6]
quantmod::getSymbols("BAC", from = "2010-01-01")
BAC <- get("BAC")[, 6]

For the residual-based co-integration test, we can utilize the Pu statistic in the Phillips-Ouliaris test to identify the co-integration among the three stocks. As shown below, the null hypothesis of no co-integration is rejected, indicating that these three stocks are co-integrated and therefore form a mean-reverting portfolio. The test regression used to derive the residual for the statistic is also given.

k <- trunc(4 + (length(FITB) / 100) ^ 0.25)
po.test <- urca::ca.po(cbind(FITB, MTB, BAC), demean = "constant", lag = "short", type = "Pu")
#Value of test-statistic is: 62.7037
#Critical values of Pu are:
#                  10pct    5pct    1pct
#critical values 33.6955 40.5252 53.8731

po.test@testreg
#                     Estimate Std. Error t value Pr(>|t|)
#(Intercept)         -1.097465   0.068588  -16.00   <2e-16 ***
#z[, -1]MTB.Adjusted  0.152637   0.001487  102.64   <2e-16 ***
#z[, -1]BAC.Adjusted  0.140457   0.007930   17.71   <2e-16 ***

Based on the test regression output, a linear combination can be derived as [FITB + 1.097465 – 0.152637 * MTB – 0.140457 * BAC]. The ADF test result confirms that this linear combination of the three stocks is indeed stationary.

ts1 <- FITB + 1.097465 - 0.152637 * MTB - 0.140457 * BAC
tseries::adf.test(ts1, k = k)
#Dickey-Fuller = -4.1695, Lag order = 6, p-value = 0.01 

Alternatively, we can utilize the Johansen test, which is based upon the likelihood ratio, to identify the co-integration. While the null hypothesis of no co-integration (r = 0) is rejected, the null hypothesis of r <= 1 cannot be rejected, suggesting that there exists one co-integrating relationship at the 5% significance level.

js.test <- urca::ca.jo(cbind(FITB, MTB, BAC), type = "trace", K = k, spec = "longrun", ecdet = "const")
#          test 10pct  5pct  1pct
#r <= 2 |  3.26  7.52  9.24 12.97
#r <= 1 | 19.72 17.85 19.96 24.60
#r = 0  | 45.88 32.00 34.91 41.07

js.test@V
#                 FITB.Adjusted.l6 MTB.Adjusted.l6 BAC.Adjusted.l6   constant
#FITB.Adjusted.l6        1.0000000        1.000000        1.000000  1.0000000
#MTB.Adjusted.l6        -0.1398349       -0.542546       -0.522351 -0.1380191
#BAC.Adjusted.l6        -0.1916826        1.548169        3.174651 -0.9654671
#constant                0.6216917       17.844653      -20.329085  6.8713179

Similarly, based on the first eigenvector above, a linear combination can be derived as [FITB + 0.6216917 – 0.1398349 * MTB – 0.1916826 * BAC]. The ADF test result also confirms that this linear combination of the three stocks is stationary.

ts2 <- FITB + 0.6216917 - 0.1398349 * MTB - 0.1916826 * BAC
tseries::adf.test(ts2, k = k)
#Dickey-Fuller = -4.0555, Lag order = 6, p-value = 0.01

Two-Stage Estimation of Switching Regression

The switching regression is an extension of the Heckit model, which is also known as the type-V Tobit model and assumes that there is a multivariate normal distribution for three latent variables Y1*, Y2*, and Y3* such that
A. Y1 = 1 for Y1* > 0 and Y1 = 0 for Y1* <= 0;
B. Y2 = Y2* for Y1 = 1 and Y2 = 0 for Y1 = 0;
C. Y3 = Y3* for Y1 = 0 and Y3 = 0 for Y1 = 1.
Therefore, Y2 and Y3 would not be observable at the same time.

In SAS, the switching regression can be implemented with the QLIM procedure, as shown in (http://support.sas.com/documentation/cdl/en/etsug/63939/HTML/default/viewer.htm#etsug_qlim_sect039.htm). However, when using the QLIM procedure in practice, I sometimes find that the MLE might not converge given the complexity of the likelihood function. In the example below, a two-stage estimation approach by using simple LOGISTIC and REG procedures is demonstrated. Benefits of the two-stage estimation are twofold. First of all, it is extremely easy to implement in practice. Secondly, when the MLE is preferred, estimated parameters from the two-stage approach can be used to provide initial values in the optimization to help the MLE convergence.

data d1;
  keep y1 y2 y3 x1 x2;
  do i = 1 to 500;
    x1 = rannor(1);
    x2 = rannor(1);
    u1 = rannor(1);
    u2 = rannor(1);
    u3 = rannor(1);
    y1l = 1 + 2 * x1 + 3 * x2 + u1;
    y2l = 1 + 2 * x1 + u1 * 0.2 + u2;
    y3l = 1 - 2 * x2 + u1 * 0.1 - u2 * 0.5 + u3 * 0.5;
    if y1l > 0 then y1 = 1;
    else y1 = 0;
    if y1l > 0 then y2 = y2l;
    else y2 = 0;
    if y1l <= 0 then y3 = y3l;
    else y3 = 0;
    output;
  end;
run;

*** 1-STEP MLE ***;
proc qlim data = d1;
  model y1 = x1 x2 / discrete;
  model y2 = x1 / select(y1 = 1);
  model y3 = x2 / select(y1 = 0);
run;

/*
Parameter      DF      Estimate        Standard     t Value   Approx
                                       Error                  P-Value
y2.Intercept   1       0.931225        0.080241     11.61     <.0001
y2.x1          1       1.970194        0.06801      28.97     <.0001
_Sigma.y2      1       1.050489        0.042064     24.97     <.0001
y3.Intercept   1       0.936837        0.09473       9.89     <.0001
y3.x2          1      -2.043977        0.071986    -28.39     <.0001
_Sigma.y3      1       0.710451        0.037412     18.99     <.0001
y1.Intercept   1       1.040852        0.127171      8.18     <.0001
y1.x1          1       1.900394        0.19335       9.83     <.0001
y1.x2          1       2.590489        0.257989     10.04     <.0001
_Rho.y1.y2     1       0.147923        0.2156        0.69     0.4927
_Rho.y1.y3     1       0.324967        0.166508      1.95     0.051
*/

*** 2-STAGE APPROACH ***;
proc logistic data = d1 desc;
  model y1 = x1 x2 / link = probit;
  output out = d2 xbeta = xb;
run;
/*
Parameter  DF  Estimate  Standard  Wald          P-Value
                         Error     Chi-Square
Intercept  1   1.0406    0.1296     64.5117      <.0001
x1         1   1.8982    0.1973     92.5614      <.0001
x2         1   2.6223    0.2603    101.47        <.0001
*/

data d3;
  set d2;
  if y1 = 1 then imr = pdf('normal', xb) / cdf('normal', xb);
  else imr = pdf('normal', xb) / (1 - cdf('normal', xb));
run;

proc reg data = d3 plots = none;
  where y1 = 1;
  model y2 = x1 imr;
run;
/*
Variable   DF  Parameter  Standard  t Value  P-Value
               Estimate   Error
Intercept  1   0.94043    0.0766    12.28    <.0001
x1         1   1.96494    0.06689   29.38    <.0001
imr        1   0.11476    0.20048    0.57    0.5674
*/

proc reg data = d3 plots = none;
  where y1 = 0;
  model y3 = x2 imr;
run;
/*
Variable   DF  Parameter  Standard  t Value  P-Value
               Estimate   Error
Intercept  1    0.92982   0.09493     9.79   <.0001
x2         1   -2.04808   0.07194   -28.47   <.0001
imr        1   -0.21852   0.1244     -1.76   0.0807
*/

*** SET INITIAL VALUES IN MLE BASED ON TWO-STAGE OUTCOMES ***;
proc qlim data = d1;
  init y1.intercept = 1.0406  y1.x1 = 1.8982      y1.x2 = 2.6223
       y2.intercept = 0.9404  y2.x1 = 1.9649  _sigma.y2 = 1.0539
       y3.intercept = 0.9298  y3.x2 = -2.048  _sigma.y3 = 0.7070;
  model y1 = x1 x2 / discrete;
  model y2 = x1 / select(y1 = 1);
  model y3 = x2 / select(y1 = 0);
run;
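
For reference, the same two-stage logic can be sketched in R with a probit glm(), the inverse Mills ratio, and two OLS regressions on the selected sub-samples. The snippet below mirrors the simulated data and the two-stage steps above; it is a rough sketch for illustration rather than a replacement for the QLIM procedure.

set.seed(1)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
u1 <- rnorm(n); u2 <- rnorm(n); u3 <- rnorm(n)
y1l <- 1 + 2 * x1 + 3 * x2 + u1
y2l <- 1 + 2 * x1 + 0.2 * u1 + u2
y3l <- 1 - 2 * x2 + 0.1 * u1 - 0.5 * u2 + 0.5 * u3
y1 <- as.numeric(y1l > 0)
y2 <- ifelse(y1 == 1, y2l, 0)
y3 <- ifelse(y1 == 0, y3l, 0)
d <- data.frame(y1, y2, y3, x1, x2)

# STAGE 1: PROBIT MODEL FOR THE SELECTION EQUATION
pr <- glm(y1 ~ x1 + x2, data = d, family = binomial(link = "probit"))
xb <- predict(pr, type = "link")
d$imr <- ifelse(d$y1 == 1, dnorm(xb) / pnorm(xb), dnorm(xb) / (1 - pnorm(xb)))

# STAGE 2: OLS WITH THE INVERSE MILLS RATIO AS AN ADDITIONAL REGRESSOR
m2 <- lm(y2 ~ x1 + imr, data = d[d$y1 == 1, ])
m3 <- lm(y3 ~ x2 + imr, data = d[d$y1 == 0, ])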

Co-integration and Pairs Trading

The co-integration is an important statistical concept behind the statistical arbitrage strategy named “Pairs Trading”. While projecting a stock price with time series models is by all means difficult, it is technically feasible to find a pair of (or even a portfolio of) stocks sharing the common trend such that a linear combination of two series is stationary, which is so-called co-integration. The underlying logic of Pairs Trading is to monitor movements of co-integrated stocks and to look for trading opportunities when the divergence presents. Under the mean-reversion assumption, the stock price would tend to move back to the long-term equilibrium. As a result, the spread between two co-integrated stock prices would eventually converge. Furthermore, given the stationarity of the spread between co-integrated stocks, it becomes possible to forecast such spread with time series models.

Below is an R utility function that helps identify pairwise co-integrations, based upon the Johansen test, out of an arbitrary number of stock prices provided in a list of tickers.

For instance, based on a starting date of 2010/01/01 and a list of tickers for major US banks, we are able to identify 23 pairs of co-integrated stock prices out of 78 pairwise combinations. It is interesting to see that the stock prices of two regional players, namely Fifth Third and M&T, are highly co-integrated, as visualized in the chart below.


pkgs <- list("quantmod", "doParallel", "foreach", "urca")
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

jtest <- function(t1, t2) {
  start <- sd
  getSymbols(t1, from = start)
  getSymbols(t2, from = start)
  j <- summary(ca.jo(cbind(get(t1)[, 6], get(t2)[, 6])))
  r <- data.frame(stock1 = t1, stock2 = t2, stat = j@teststat[2])
  r[, c("pct10", "pct5", "pct1")] <- j@cval[2, ]
  return(r)
}

pair <- function(lst) {
  d2 <- data.frame(t(combn(lst, 2)))
  stat <- foreach(i = 1:nrow(d2), .combine = rbind) %dopar% jtest(as.character(d2[i, 1]), as.character(d2[i, 2]))
  stat <- stat[order(-stat$stat), ]
  # THE ORIGINAL PIECE GENERATING * COULDN'T BE DISPLAYED PROPERLY IN WORDPRESS;
  # A REASONABLE RECONSTRUCTION BASED ON THE OUTPUT BELOW:
  stat$coint <- ifelse(stat$stat > stat$pct1, "***",
                ifelse(stat$stat > stat$pct5, "**",
                ifelse(stat$stat > stat$pct10, "*", "")))
  rownames(stat) <- NULL
  return(stat)
}

sd <- "2010-01-01"
tickers <- c("FITB", "BBT", "MTB", "STI", "PNC", "HBAN", "CMA", "USB", "KEY", "JPM", "C", "BAC", "WFC")
pair(tickers)

   stock1 stock2      stat pct10 pct5  pct1 coint
1     STI    JPM 27.207462 12.91 14.9 19.19  ***
2    FITB    MTB 21.514142 12.91 14.9 19.19  ***
3     MTB    KEY 20.760885 12.91 14.9 19.19  ***
4    HBAN    KEY 19.247719 12.91 14.9 19.19  ***
5       C    BAC 18.573168 12.91 14.9 19.19   **
6    HBAN    JPM 18.019051 12.91 14.9 19.19   **
7    FITB    BAC 17.490536 12.91 14.9 19.19   **
8     PNC   HBAN 16.959451 12.91 14.9 19.19   **
9    FITB    BBT 16.727097 12.91 14.9 19.19   **
10    MTB   HBAN 15.852456 12.91 14.9 19.19   **
11    PNC    JPM 15.822610 12.91 14.9 19.19   **
12    CMA    BAC 15.685086 12.91 14.9 19.19   **
13   HBAN    BAC 15.446149 12.91 14.9 19.19   **
14    BBT    MTB 15.256334 12.91 14.9 19.19   **
15    MTB    JPM 15.178646 12.91 14.9 19.19   **
16    BBT   HBAN 14.808770 12.91 14.9 19.19    *
17    KEY    BAC 14.576440 12.91 14.9 19.19    *
18   FITB    JPM 14.272424 12.91 14.9 19.19    *
19    STI    BAC 14.253971 12.91 14.9 19.19    *
20   FITB    PNC 14.215647 12.91 14.9 19.19    *
21    MTB    BAC 13.891615 12.91 14.9 19.19    *
22    MTB    PNC 13.668863 12.91 14.9 19.19    *
23    KEY    JPM 12.952239 12.91 14.9 19.19    *

(Chart: price series of the co-integrated FITB and MTB stocks)

Modeling Dollar Amounts in Regression Setting

After switching roles from credit risk to operational risk in 2015, I spent countless weekend hours in Starbucks researching how to model operational losses in the regression setting in light of the heightened scrutiny. While I feel very comfortable with various frequency models, how to model severity and loss remains challenging both conceptually and empirically. The same challenge also holds true for modeling other financial measures in dollar amounts, such as balance, profit, or cost.

Most practitioners still prefer modeling severity and loss under the Gaussian distributional assumption explicitly or implicitly. In practice, there are 3 commonly used approaches, as elaborated below.

– First of all, the simple OLS regression to model severity and loss directly without any transformation remains the number one choice due to its simplicity. Given the inconsistency between the empirical data range and the conceptual domain of a Gaussian distribution, it is evident that this approach is problematic.

– Secondly, the OLS regression on LOG-transformed severity and loss under the Lognormal distributional assumption is also a common approach. In this method, Log(Y) instead of Y is estimated. However, given E(Log(Y)|X) != Log(E(Y|X)), an estimation bias is introduced and should be corrected by MSE / 2, as sketched after this list. In addition, the positive domain of a Lognormal might not work well for losses with a lower bound that can be either zero or a known threshold value.

– At last, the Tobit regression under the censored Normal distribution seems a viable solution that supports the non-negative or above-threshold values seen in severity or loss measures. Nonetheless, the censorship assumption itself is questionable, given that the unobservability of negative or below-threshold values is not due to censoring but attributable to the nature of the data collection process. Therefore, the argument for data censorship is not well supported.
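
To make the bias correction mentioned in the second approach concrete, below is a tiny sketch on simulated data: under Log(Y) = X * Beta + e with e ~ N(0, s2), E(Y|X) = exp(X * Beta + s2 / 2), so the naive back-transformation exp(fitted) should be adjusted by exp(MSE / 2).

set.seed(1)
n <- 1000
x <- rnorm(n)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.8))

fit <- lm(log(y) ~ x)
s2 <- summary(fit)$sigma ^ 2            # MSE OF THE LOG-SCALE MODEL
naive <- exp(fitted(fit))               # BIASED: EXP(E(LOG(Y)|X))
corrected <- exp(fitted(fit) + s2 / 2)  # BIAS-CORRECTED E(Y|X)
c(mean(y), mean(naive), mean(corrected))
# THE CORRECTED PREDICTION SHOULD TRACK THE OBSERVED MEAN MUCH MORE CLOSELY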

Considering the aforementioned challenges, I investigated and experimented with various approaches given different data characteristics observed empirically.

– In cases of severity or loss observed in the range of (0, inf), a GLM under the Gamma or Inverse Gaussian distributional assumption can be considered (https://statcompute.wordpress.com/2015/08/16/some-considerations-of-modeling-severity-in-operational-losses). In addition, the mean-variance relationship can be employed to assess the appropriateness of the assumed distribution by either the modified Park test (https://statcompute.wordpress.com/2016/11/20/modified-park-test-in-sas) or the value of the power parameter in the Tweedie distribution (https://statcompute.wordpress.com/2017/06/24/using-tweedie-parameter-to-identify-distributions).

– In cases of severity or loss observed in the range of [alpha, inf) with alpha being positive, a regression under the type-I Pareto distribution (https://statcompute.wordpress.com/2016/12/11/estimate-regression-with-type-i-pareto-response) can be considered. However, there is a caveat that the conditional mean only exists when the shape parameter is larger than 1.

– In cases of severity or loss observed in the range of [0, inf) with a small number of zeros, a regression under the Lomax distribution (https://statcompute.wordpress.com/2016/11/13/parameter-estimation-of-pareto-type-ii-distribution-with-nlmixed-in-sas) or the Tweedie distribution (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm) can be considered. For the Lomax model, it is worth pointing out that the shape parameter alpha has to be larger than 2 in order to have both the mean and variance defined.

– In cases of severity or loss observed in the range of [0, inf) with many zeros, a ZAGA or ZAIG model (https://statcompute.wordpress.com/2017/09/17/model-non-negative-numeric-outcomes-with-zeros) can be considered by assuming the measure is governed by a mixture distribution between a point mass at zero and a standard Gamma or Inverse Gaussian. As a result, a ZA model consists of two sub-models: a nu model separating zeros from positive values and a mu model estimating the conditional mean of positive values.

Additional Thoughts on Estimating LGD with Proportional Odds Model

In my previous post (https://statcompute.wordpress.com/2018/01/28/modeling-lgd-with-proportional-odds-model), I discussed how to use Proportional Odds Models in LGD model development. In particular, I mentioned that we would estimate a sub-model, which can be a Gamma or Simplex regression, to project the conditional mean for LGD values in the (0, 1) range. However, it is worth pointing out that, if we define a finer LGD segmentation, this sub-model becomes completely optional. A standalone Proportional Odds Model without any sub-model is more than sufficient to serve the purpose of stress testing, e.g. CCAR.

In the example below, I will define 5 categories based upon LGD values in the [0, 1] range, estimate a Proportional Odds Model as usual, and then demonstrate how to apply the model outcome in the setting of stress testing with the stressed model input, e.g. LTV.

First of all, I defined 5 instead of 3 categories for LGD values, as shown below. Nonetheless, we could use an even finer category definition in practice to achieve a more accurate outcome.


df <- read.csv("lgd.csv")
df$lgd <- round(1 - df$Recovery_rate, 4)
l1 <- c(-Inf, 0, 0.0999, 0.4999, 0.9999, Inf)
l2 <- c("A", "B", "C", "D", "E")
df$lgd_cat <- cut(df$lgd, breaks = l1, labels = l2, ordered_result = T)
summary(df$lgd_cat)
m1 <- ordinal::clm(lgd_cat ~ LTV, data = df)
#Coefficients:
#    Estimate Std. Error z value Pr(>|z|)    
#LTV   2.3841     0.1083   22.02   <2e-16 ***
#
#Threshold coefficients:
#    Estimate Std. Error z value
#A|B  0.54082    0.07897   6.848
#B|C  2.12270    0.08894  23.866
#C|D  3.18098    0.10161  31.307
#D|E  4.80338    0.13174  36.460

After the model estimation, it is straightforward to calculate the probability of each LGD category. The only question remaining is how to calculate the LGD projection for each individual account as well as for the whole portfolio. In order to calculate the LGD projection, we need two factors, namely the probability and the expected mean of each LGD category, such that

Estimated_LGD = SUM_i [Prob(category i) * LGD_Mean(category i)], where i = A, B, C, D, and E in this particular case.

The calculation is shown below, with the estimated LGD = 0.23 being consistent with the actual LGD = 0.23 for the whole portfolio.


prob_A <- exp(df$LTV * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[1])) 
prob_B <- exp(df$LTV * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[2])) - prob_A
prob_C <- exp(df$LTV * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B
prob_D <- exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C
prob_E <- 1 - exp(df$LTV * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV * (-m1$beta) + m1$Theta[4]))
pred <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E)
sum(apply(pred, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2])
#[1] 0.2262811

One might wonder how to apply the model outcome with simple averages in stress testing, where the model input is stressed, e.g. made more severe, and might also be concerned about a lack of model sensitivity. In the demonstration below, let’s stress the model input LTV by 50% and then evaluate the stressed LGD.


df$LTV_ST <- df$LTV * 1.5
prob_A <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[1]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[1])) 
prob_B <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[2]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[2])) - prob_A
prob_C <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[3]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[3])) - prob_A - prob_B
prob_D <- exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4])) - prob_A - prob_B - prob_C
prob_E <- 1 - exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]) / (1 + exp(df$LTV_ST * (-m1$beta) + m1$Theta[4]))
pred_ST <- data.frame(prob_A, prob_B, prob_C, prob_D, prob_E)
sum(apply(pred_ST, 2, mean) * aggregate(df['lgd'], df['lgd_cat'], mean)[2])
#[1] 0.3600153

As shown above, although we only use simple averages as the expected mean for each LGD category, the overall LGD still increases by ~60%. The reason is that, with the more stressed model input, the Proportional Odds Model pushes more accounts into categories with higher LGD. For instance, the output below shows that, if LTV is stressed by 50% overall, ~146% more accounts would roll into the most severe LGD category without any recovery.


apply(pred_ST, 2, mean) / apply(pred, 2, mean)
#   prob_A    prob_B    prob_C    prob_D    prob_E 
#0.6715374 0.7980619 1.0405573 1.4825803 2.4639293

Estimating Parameters of A Hyper-Poisson Distribution in SAS

Similar to the COM-Poisson, Double-Poisson, and Generalized Poisson distributions discussed in my previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models/), the Hyper-Poisson distribution is another extension of the standard Poisson that is able to accommodate both under-dispersion and over-dispersion, which are common in real-world problems. Given the complexity of its parameterization and computation, the Hyper-Poisson is somewhat under-investigated. To the best of my knowledge, there is no off-the-shelf computing routine in SAS for the Hyper-Poisson distribution and only an R function available at http://www4.ujaen.es/~ajsaez/hp.fit.r written by A.J. Sáez-Castillo and A. Conde-Sánchez (2013).

The SAS code presented below is the starting point of my attempt at the Hyper-Poisson and its potential applications. The purpose is to replicate the calculation results shown in Table 6 of “On the Hyper-Poisson Distribution and its Generalization with Applications” by Bayo H. Lawal (2017) (http://www.journalrepository.org/media/journals/BJMCS_6/2017/Mar/Lawal2132017BJMCS32184.pdf). Accordingly, the parameterization employed in my SAS code closely follows Bayo H. Lawal (2017) instead of A.J. Sáez-Castillo and A. Conde-Sánchez (2013).


data d1;
  input y n @@;
datalines;
0 121 1 85 2 19 3 1 4 0 5 0 6 1
;
run;

data df;
  set d1;
  where n > 0;
  do i = 1 to n;
    output;
  end;
run;

proc nlmixed data = df;
  parms lambda = 1 beta = 1;
  theta = 1;
  do k = 1 to 100;
    theta = theta + gamma(beta) * (lambda ** k) / gamma(beta + k);
  end;
  prob = (gamma(beta) / gamma(beta + y)) * ((lambda ** y) / theta);
  ll = log(prob);
  model y ~ general(ll);
run;

/*
                     Standard
Parameter  Estimate     Error    DF  t Value  Pr > |t|   Alpha
lambda       0.3752    0.1178   227     3.19    0.0016    0.05
beta         0.5552    0.2266   227     2.45    0.0150    0.05 
*/

As shown, the estimated Lambda = 0.3752 and the estimated Beta = 0.5552 are identical to what is presented in the paper. The next step is to explore its applications in frequency modeling as well as its value in business cases.
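
For a quick cross-check outside of SAS, the same likelihood can also be maximized with optim() in base R, following the parameterization above. The sketch below reconstructs the 227 observations from the frequency table and should produce estimates close to the NLMIXED output.

y <- rep(c(0, 1, 2, 3, 6), c(121, 85, 19, 1, 1))

hp_nll <- function(par) {
  lambda <- par[1]; beta <- par[2]
  k <- 0:100
  theta <- sum(gamma(beta) * lambda ^ k / gamma(beta + k))
  -sum(log(gamma(beta) / gamma(beta + y) * lambda ^ y / theta))
}

optim(c(1, 1), hp_nll, method = "L-BFGS-B", lower = c(1e-6, 1e-6))$par
# EXPECT SOMETHING CLOSE TO LAMBDA = 0.3752 AND BETA = 0.5552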


Model Non-Negative Numeric Outcomes with Zeros

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in operational loss model development. The Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are the Poisson sum of Gamma outcomes, which, however, might not be well supported empirically from the data generation standpoint.

In the examples below, we demonstrate another flexible option, namely Zero-Adjusted (ZA) models, in two scenarios of modeling non-negative numeric outcomes: one with a small number of zeros and the other with a large number of zeros. The basic idea of ZA models is very intuitive and similar to the concept of Hurdle models for count outcomes. In a nutshell, non-negative numeric outcomes can be considered the result of two data generation processes: one for the point mass at zero and the other governed by a statistical distribution for positive outcomes, which could be either Gamma or Inverse Gaussian.

First of all, we sampled down an auto-claim dataset such that only 10 claims are zeros and the rest are all positive. While 10 is an arbitrary choice in this example, other small numbers should show similar results.

pkgs <- list("cplm", "gamlss", "MLmetrics")
lapply(pkgs, require, character.only = T)

data(AutoClaim, package = "cplm")
df1 <- na.omit(AutoClaim)

# SMALL NUMBER OF ZEROS
set.seed(2017)
smp <- sample(seq(nrow(df1[df1$CLM_AMT == 0, ])), size = 10, replace = FALSE)
df2 <- rbind(df1[df1$CLM_AMT > 0, ], df1[df1$CLM_AMT == 0, ][smp, ])

Next, we applied both Tweedie and zero-adjusted Gamma (ZAGA) models to the data with only 10 zero outcomes. It is worth mentioning that ZAGA doesn’t have to be overly complex in this case. As shown below, while we estimated the Gamma Mu parameter with model attributes, the Nu parameter to separate zeros is just a constant with the intercept = -5.4. Both Tweedie and ZAGA models gave very similar estimated parameters and predictive measures with MAPE = 0.61.

tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2)
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.194e+00  7.234e-02 113.277  < 2e-16 ***
# BLUEBOOK    2.047e-05  3.068e-06   6.671 3.21e-11 ***
# NPOLICY     7.274e-02  3.102e-02   2.345   0.0191 *  

MAPE(df2$CLM_AMT, fitted(tw))
# 0.6053669

zaga0 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2, family = "ZAGA")
# Mu Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.203e+00  4.671e-02 175.629  < 2e-16 ***
# BLUEBOOK    2.053e-05  2.090e-06   9.821  < 2e-16 ***
# NPOLICY     6.948e-02  2.057e-02   3.377 0.000746 ***
# Nu Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -5.3886     0.3169     -17   <2e-16 ***

MAPE(df2$CLM_AMT, (1 - fitted(zaga0, what = "nu")) * fitted(zaga0, what = "mu"))
# 0.6053314

In the next case, we used the full data with a large number of zeros in the response and then applied both Tweedie and ZAGA models again. However, in the ZAGA model, we estimated two sub-models this time: one for the Nu parameter to separate zeros from non-zeros and the other for the Mu parameter to model non-zero outcomes. As shown below, ZAGA outperformed Tweedie in terms of MAPE due to the advantage that ZAGA is able to explain the two data generation schemes separately with different model attributes, which is a capability beyond what Tweedie can provide.

# LARGE NUMBER OF ZEROS
tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY + CLM_FREQ5 + MVR_PTS + INCOME, data = df1)
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  6.854e+00  1.067e-01  64.241  < 2e-16 ***
# BLUEBOOK     1.332e-05  4.495e-06   2.963  0.00305 ** 
# NPOLICY      4.380e-02  3.664e-02   1.195  0.23196    
# CLM_FREQ5    2.064e-01  2.937e-02   7.026 2.29e-12 ***
# MVR_PTS      1.066e-01  1.510e-02   7.063 1.76e-12 ***
# INCOME      -4.606e-06  8.612e-07  -5.348 9.12e-08 ***

MAPE(df1$CLM_AMT, fitted(tw))
# 1.484484

zaga1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAGA")
# Mu Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.203e+00  4.682e-02 175.218  < 2e-16 ***
# BLUEBOOK    2.053e-05  2.091e-06   9.816  < 2e-16 ***
# NPOLICY     6.948e-02  2.067e-02   3.362 0.000778 ***
# Nu Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.153e+00  5.077e-02   22.72   <2e-16 ***
# CLM_FREQ5   -3.028e-01  2.283e-02  -13.26   <2e-16 ***
# MVR_PTS     -1.509e-01  1.217e-02  -12.41   <2e-16 ***
# INCOME       7.285e-06  6.269e-07   11.62   <2e-16 ***

MAPE(df1$CLM_AMT, (1 - fitted(zaga1, what = "nu")) * fitted(zaga1, what = "mu"))
# 1.470228

Given the great flexibility of ZA models, we also have the luxury to explore candidates other than ZAGA. For instance, if the positive part of the non-negative outcomes demonstrates a high variance, we can also try a zero-adjusted Inverse Gaussian (ZAIG) model, as shown below.

zaig1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAIG")
# Mu Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.205e+00  5.836e-02 140.591  < 2e-16 ***
# BLUEBOOK    2.163e-05  2.976e-06   7.268 3.97e-13 ***
# NPOLICY     5.898e-02  2.681e-02   2.200   0.0278 *  
# Nu Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept)  1.153e+00  5.077e-02   22.72   <2e-16 ***
# CLM_FREQ5   -3.028e-01  2.283e-02  -13.26   <2e-16 ***
# MVR_PTS     -1.509e-01  1.217e-02  -12.41   <2e-16 ***
# INCOME       7.285e-06  6.269e-07   11.62   <2e-16 ***

MAPE(df1$CLM_AMT, (1 - fitted(zaig1, what = "nu")) * fitted(zaig1, what = "mu"))
# 1.469236

Finer Monotonic Binning Based on Isotonic Regression

In my earlier post (https://statcompute.wordpress.com/2017/01/22/monotonic-binning-with-smbinning-package/), I wrote a monobin() function based on the smbinning package by Herman Jopia to improve the monotonic binning algorithm. The function works well and provides robust binning outcomes. However, there are a couple of potential drawbacks due to the coarse binning. First of all, the derived Information Value for each binned variable tends to be low. Secondly, the binned variable might not be granular enough to reflect the nature of the data.

In light of the aforementioned, I drafted an improved function isobin() based on the isotonic regression (https://en.wikipedia.org/wiki/Isotonic_regression), as shown below.

isobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  d2 <- d1[!is.na(d1[x]), ]
  c <- cor(d2[, 2], d2[, 1], method = "spearman", use = "complete.obs")
  reg <- isoreg(d2[, 2], c / abs(c) * d2[, 1])
  k <- knots(as.stepfun(reg))
  sm1 <- smbinning.custom(d1, y, x, k)
  c1 <- subset(sm1$ivtable, subset = CntGood * CntBad > 0, select = Cutpoint)
  c2 <- suppressWarnings(as.numeric(unlist(strsplit(c1$Cutpoint, " "))))
  c3 <- c2[!is.na(c2)]
  return(smbinning.custom(d1, y, x, c3[-length(c3)]))
}

Compared with the legacy monobin(), the isobin() function is able to significantly increase the binning granularity as well as moderately improve the Information Value.

LTV Binning with isobin() Function

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 46     81      78      3        81         78         3 0.0139   0.9630  0.0370 26.0000 3.2581  1.9021 0.0272
2     <= 71    312     284     28       393        362        31 0.0535   0.9103  0.0897 10.1429 2.3168  0.9608 0.0363
3     <= 72     22      20      2       415        382        33 0.0038   0.9091  0.0909 10.0000 2.3026  0.9466 0.0025
4     <= 73     27      24      3       442        406        36 0.0046   0.8889  0.1111  8.0000 2.0794  0.7235 0.0019
5     <= 81    303     268     35       745        674        71 0.0519   0.8845  0.1155  7.6571 2.0356  0.6797 0.0194
6     <= 83    139     122     17       884        796        88 0.0238   0.8777  0.1223  7.1765 1.9708  0.6149 0.0074
7     <= 90    631     546     85      1515       1342       173 0.1081   0.8653  0.1347  6.4235 1.8600  0.5040 0.0235
8     <= 94    529     440     89      2044       1782       262 0.0906   0.8318  0.1682  4.9438 1.5981  0.2422 0.0049
9     <= 95    145     119     26      2189       1901       288 0.0248   0.8207  0.1793  4.5769 1.5210  0.1651 0.0006
10   <= 100    907     709    198      3096       2610       486 0.1554   0.7817  0.2183  3.5808 1.2756 -0.0804 0.0010
11   <= 101    195     151     44      3291       2761       530 0.0334   0.7744  0.2256  3.4318 1.2331 -0.1229 0.0005
12   <= 110   1217     934    283      4508       3695       813 0.2085   0.7675  0.2325  3.3004 1.1940 -0.1619 0.0057
13   <= 112    208     158     50      4716       3853       863 0.0356   0.7596  0.2404  3.1600 1.1506 -0.2054 0.0016
14   <= 115    253     183     70      4969       4036       933 0.0433   0.7233  0.2767  2.6143 0.9610 -0.3950 0.0075
15   <= 136    774     548    226      5743       4584      1159 0.1326   0.7080  0.2920  2.4248 0.8857 -0.4702 0.0333
16   <= 138     27      18      9      5770       4602      1168 0.0046   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0024
17    > 138     66      39     27      5836       4641      1195 0.0113   0.5909  0.4091  1.4444 0.3677 -0.9882 0.0140
18  Missing      1       0      1      5837       4641      1196 0.0002   0.0000  1.0000  0.0000   -Inf    -Inf    Inf
19    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.1897

LTV Binning with monobin() Function

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
1    <= 85   1025     916    109      1025        916       109 0.1756   0.8937  0.1063 8.4037 2.1287  0.7727 0.0821
2    <= 94   1019     866    153      2044       1782       262 0.1746   0.8499  0.1501 5.6601 1.7334  0.3775 0.0221
3   <= 100   1052     828    224      3096       2610       486 0.1802   0.7871  0.2129 3.6964 1.3074 -0.0486 0.0004
4   <= 105    808     618    190      3904       3228       676 0.1384   0.7649  0.2351 3.2526 1.1795 -0.1765 0.0045
5   <= 114    985     748    237      4889       3976       913 0.1688   0.7594  0.2406 3.1561 1.1493 -0.2066 0.0076
6    > 114    947     665    282      5836       4641      1195 0.1622   0.7022  0.2978 2.3582 0.8579 -0.4981 0.0461
7  Missing      1       0      1      5837       4641      1196 0.0002   0.0000  1.0000 0.0000   -Inf    -Inf    Inf
8    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.1628

Bureau_Score Binning with isobin() Function

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds  LnOdds     WoE     IV
1    <= 491      4       1      3         4          1         3 0.0007   0.2500  0.7500  0.3333 -1.0986 -2.4546 0.0056
2    <= 532     24       9     15        28         10        18 0.0041   0.3750  0.6250  0.6000 -0.5108 -1.8668 0.0198
3    <= 559     51      24     27        79         34        45 0.0087   0.4706  0.5294  0.8889 -0.1178 -1.4737 0.0256
4    <= 560      2       1      1        81         35        46 0.0003   0.5000  0.5000  1.0000  0.0000 -1.3559 0.0008
5    <= 572     34      17     17       115         52        63 0.0058   0.5000  0.5000  1.0000  0.0000 -1.3559 0.0143
6    <= 602    153      84     69       268        136       132 0.0262   0.5490  0.4510  1.2174  0.1967 -1.1592 0.0459
7    <= 605     56      31     25       324        167       157 0.0096   0.5536  0.4464  1.2400  0.2151 -1.1408 0.0162
8    <= 606     14       8      6       338        175       163 0.0024   0.5714  0.4286  1.3333  0.2877 -1.0683 0.0035
9    <= 607     17      10      7       355        185       170 0.0029   0.5882  0.4118  1.4286  0.3567 -0.9993 0.0037
10   <= 632    437     261    176       792        446       346 0.0749   0.5973  0.4027  1.4830  0.3940 -0.9619 0.0875
11   <= 639    150      95     55       942        541       401 0.0257   0.6333  0.3667  1.7273  0.5465 -0.8094 0.0207
12   <= 653    451     300    151      1393        841       552 0.0773   0.6652  0.3348  1.9868  0.6865 -0.6694 0.0412
13   <= 662    295     213     82      1688       1054       634 0.0505   0.7220  0.2780  2.5976  0.9546 -0.4014 0.0091
14   <= 665    100      77     23      1788       1131       657 0.0171   0.7700  0.2300  3.3478  1.2083 -0.1476 0.0004
15   <= 667     57      44     13      1845       1175       670 0.0098   0.7719  0.2281  3.3846  1.2192 -0.1367 0.0002
16   <= 677    381     300     81      2226       1475       751 0.0653   0.7874  0.2126  3.7037  1.3093 -0.0466 0.0001
17   <= 679     66      53     13      2292       1528       764 0.0113   0.8030  0.1970  4.0769  1.4053  0.0494 0.0000
18   <= 683    160     129     31      2452       1657       795 0.0274   0.8062  0.1938  4.1613  1.4258  0.0699 0.0001
19   <= 689    203     164     39      2655       1821       834 0.0348   0.8079  0.1921  4.2051  1.4363  0.0804 0.0002
20   <= 699    304     249     55      2959       2070       889 0.0521   0.8191  0.1809  4.5273  1.5101  0.1542 0.0012
21   <= 707    312     268     44      3271       2338       933 0.0535   0.8590  0.1410  6.0909  1.8068  0.4509 0.0094
22   <= 717    368     318     50      3639       2656       983 0.0630   0.8641  0.1359  6.3600  1.8500  0.4941 0.0132
23   <= 721    134     119     15      3773       2775       998 0.0230   0.8881  0.1119  7.9333  2.0711  0.7151 0.0094
24   <= 723     49      44      5      3822       2819      1003 0.0084   0.8980  0.1020  8.8000  2.1748  0.8188 0.0043
25   <= 739    425     394     31      4247       3213      1034 0.0728   0.9271  0.0729 12.7097  2.5424  1.1864 0.0700
26   <= 746    166     154     12      4413       3367      1046 0.0284   0.9277  0.0723 12.8333  2.5520  1.1961 0.0277
27   <= 756    234     218     16      4647       3585      1062 0.0401   0.9316  0.0684 13.6250  2.6119  1.2560 0.0422
28   <= 761    110     104      6      4757       3689      1068 0.0188   0.9455  0.0545 17.3333  2.8526  1.4967 0.0260
29   <= 763     46      44      2      4803       3733      1070 0.0079   0.9565  0.0435 22.0000  3.0910  1.7351 0.0135
30   <= 767     96      92      4      4899       3825      1074 0.0164   0.9583  0.0417 23.0000  3.1355  1.7795 0.0293
31   <= 772     77      74      3      4976       3899      1077 0.0132   0.9610  0.0390 24.6667  3.2055  1.8495 0.0249
32   <= 787    269     260      9      5245       4159      1086 0.0461   0.9665  0.0335 28.8889  3.3635  2.0075 0.0974
33   <= 794     95      93      2      5340       4252      1088 0.0163   0.9789  0.0211 46.5000  3.8395  2.4835 0.0456
34    > 794    182     179      3      5522       4431      1091 0.0312   0.9835  0.0165 59.6667  4.0888  2.7328 0.0985
35  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000  0.6931 -0.6628 0.0282
36    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804  1.3559  0.0000 0.8357

Bureau_Score Binning with monobin() Function

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
10   <= 775    492     465     27      5023       3945      1078 0.0843   0.9451  0.0549 17.2222 2.8462  1.4903 0.1157
11    > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810

Modeling Generalized Poisson Regression in SAS

The Generalized Poisson (GP) regression is a very flexible statistical model for count outcomes in that it can accommodate both over-dispersion and under-dispersion, which makes it a very practical modeling approach for real-world problems and a serious contender to the Quasi-Poisson regression.

Prob(Y) = Alpha / Y! * (Alpha + Xi * Y) ^ (Y – 1) * EXP(-Alpha – Xi * Y)
E(Y) = Mu = Alpha / (1 – Xi)
Var(Y) = Mu / (1 – Xi) ^ 2

While the GP regression can be conveniently estimated with the HMM procedure in SAS, I always like to dive a little deeper into its model specification and likelihood function to get a better understanding. For instance, there is a slight difference in GP model outcomes between the HMM procedure in SAS and the VGAM package in R. After looking into the details, I realized that the difference is solely due to different parameterizations.

Basically, there are three steps for estimating a GP regression with the NLMIXED procedure. Due to the complexity of the GP likelihood function and its convergence process, it is always good practice to estimate a baseline standard Poisson regression as a starting point and then output its parameter estimates into a table, e.g. _EST as shown below.

ods output ParameterEstimates = _est;
proc genmod data = mylib.credit_count;
  model majordrg = age acadmos minordrg ownrent / dist = poisson link = log;
run;

After acquiring parameter estimates from the standard Poisson regression, we can use them to construct initial values of parameter estimates for the Generalized Poisson regression. In the code snippet below, we used the SQL procedure to create two macro variables that we are going to use in the final estimation of the GP regression.

proc sql noprint;
select
  "_"||compress(upcase(parameter), ' ')||" = "||compress(put(estimate, 10.2), ' ')
into
  :_parm separated by ' '
from  
  _est;
  
select
  case 
    when upcase(parameter) = 'INTERCEPT' then "_"||compress(upcase(parameter), ' ')
    else "_"||compress(upcase(parameter), ' ')||" * "||compress(upcase(parameter), ' ')
  end
into
  :_xb separated by ' + '    
from  
  _est
where
  upcase(parameter) ~= 'SCALE';  
quit;

/*
%put &_parm;
_INTERCEPT = -1.38 _AGE = 0.01 _ACADMOS = 0.00 _MINORDRG = 0.46 _OWNRENT = -0.20 _SCALE = 1.00

%put &_xb;
 _INTERCEPT + _AGE * AGE + _ACADMOS * ACADMOS + _MINORDRG * MINORDRG + _OWNRENT * OWNRENT
*/

In the last step, we used the NLMIXED procedure to estimate the GP regression by specifying its log likelihood function, which generates model results identical to those from the HMM procedure. It is worth mentioning that the expected mean _mu = exp(x * beta) in SAS, whereas the term exp(x * beta) refers to the _alpha parameter in R. Therefore, the intercept would be different between SAS and R, primarily due to the different parameterizations of the identical statistical logic.

proc nlmixed data = mylib.credit_count;
  parms &_parm.;
  _xb = &_xb.;
  _xi = 1 - exp(-_scale);
  _mu = exp(_xb);  
  _alpha = _mu * (1 - _xi);
  _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg);
  ll = log(_prob);
  model majordrg ~ general(ll);
run;

In addition to HMM and NLMIXED procedures, GLIMMIX can also be employed to estimate the GP regression, as shown below. In this case, we need to specify both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count;
  model majordrg = age acadmos minordrg ownrent / link = log solution;
  _xi = 1 - 1 / exp(_phi_);
  _variance_ = _mu_ / (1 - _xi) ** 2;
  _alpha = _mu_ * (1 - _xi);
  _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg);  
  _logl_ = log(_prob);
run;
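
For a sanity check of the likelihood outside of SAS, the same GP log likelihood can be maximized with optim() in base R. The sketch below is for illustration only: it simulates Poisson counts, which correspond to the special case of Xi = 0, so the estimated Xi should come out close to zero.

set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rpois(n, exp(0.5 + 0.3 * x))  # XI = 0 REDUCES THE GP TO THE STANDARD POISSON

gp_nll <- function(par) {
  b0 <- par[1]; b1 <- par[2]; xi <- par[3]
  mu <- exp(b0 + b1 * x)
  alpha <- mu * (1 - xi)
  if (xi >= 1 | any(alpha + xi * y <= 0)) return(1e10)
  -sum(log(alpha) - lfactorial(y) + (y - 1) * log(alpha + xi * y) - alpha - xi * y)
}

optim(c(0, 0, 0), gp_nll)$par
# EXPECT B0 AND B1 CLOSE TO 0.5 AND 0.3 WITH XI CLOSE TO 0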

Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has a density function shown below

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In the formulation, the scale parameter satisfies 0 < a < y and the shape parameter satisfies k > 0, with the conditional mean defined only when k > 1.

The positive lower bound of the Type-I Pareto distribution is particularly appealing for modeling the severity measure in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of the ABA operational risk consortium data is $10,000, and any loss event below the threshold value would not be reported, which might add complexity to the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response y' = severity – threshold to accommodate the threshold value such that the supporting domain of y' could start from 0 and the Gamma, Inverse Gaussian, or Lognormal regression is still applicable. However, under the distributional assumption of Type-I Pareto with a known lower end, we do not need to shift the severity measure anymore but can model it directly based on the probability function.

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound a = 2 by using the VGAM package.

library(VGAM)
set.seed(2017)
n <- 200
a <- 2
x <- runif(n)
k <- exp(1 + 5 * x)
pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x)
fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE)
summary(fit)
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   1.0322     0.1363   7.574 3.61e-14 ***
# x             4.9815     0.2463  20.229  < 2e-16 ***
AIC(fit)
#  -644.458
BIC(fit)
#  -637.8614

The SAS code below, estimating the same Type-I Pareto regression, provides an almost identical model estimation.

proc nlmixed data = pdata;
  parms b0 = 0.1 b1 = 0.1;
  k = exp(b0 + b1 * x);
  a = 2;
  lh = k * (a ** k) / (y ** (k + 1));
  ll = log(lh);
  model y ~ general(ll);
run;
/*
Fit Statistics
-2 Log Likelihood               -648.5
AIC (smaller is better)         -644.5
AICC (smaller is better)        -644.4
BIC (smaller is better)         -637.9

Parameter Estimate   Standard   DF    t Value   Pr > |t|
                     Error 
b0        1.0322     0.1385     200    7.45     <.0001 	
b1        4.9815     0.2518     200   19.78     <.0001 	
*/

At last, it is worth pointing out that the conditional mean of a Type-I Pareto response is not equal to exp(x * beta) but rather a * k / (k – 1) with k = exp(x * beta). Therefore, the conditional mean only exists when k > 1, which might cause numerical issues in the model estimation.
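
Continuing from the R example above, the short follow-up sketch below computes the fitted conditional mean from the estimated coefficients with the formula a * k / (k - 1).

b0 <- 1.0322; b1 <- 4.9815; a <- 2
k_hat <- exp(b0 + b1 * pdata$x)
cond_mean <- ifelse(k_hat > 1, a * k_hat / (k_hat - 1), NA)  # UNDEFINED WHEN K <= 1
summary(cond_mean)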

Pregibon Test for Goodness of Link in SAS

When estimating generalized linear models for binary outcomes, we often choose the logit link function by default and seldom consider alternatives such as probit or cloglog. The Pregibon test (Pregibon, 1980) provides a means to check the goodness of link with the simple logic outlined below.

1. First of all, we can estimate the regression model with the hypothesized link function, e.g. logit;
2. After the model estimation, we calculate yhat and yhat ^ 2 and then estimate a secondary regression with the identical response variable Y and link function but with yhat and yhat ^ 2 as model predictors (with the intercept).
3. If the link function is correctly specified, then the t-value of yhat ^ 2 should be insignificant.
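
Before the SAS macro, here is a minimal R sketch of the same three steps, mirroring the macro below in using the fitted probability and its square as predictors of the secondary regression, applied to the kyphosis data.

data(kyphosis, package = "rpart")
y <- as.numeric(kyphosis$Kyphosis == "present")

for (lnk in c("logit", "probit", "cloglog")) {
  m1 <- glm(y ~ Age + Number + Start, data = kyphosis, family = binomial(link = lnk))
  p1 <- predict(m1, type = "response")
  m2 <- glm(y ~ p1 + I(p1 ^ 2), family = binomial(link = lnk))
  # THE LINK IS CONSIDERED ADEQUATE IF THE SQUARED TERM IS INSIGNIFICANT
  print(c(link = lnk, round(summary(m2)$coefficients["I(p1^2)", 3:4], 4)))
}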

The SAS macro shown below is an implementation of the Pregibon test in the context of logistic regressions. However, the same idea can be generalized to any GLM.

%macro pregibon(data = , y = , x = );
***********************************************************;
* SAS MACRO PERFORMING PREGIBON TEST FOR GOODNESS OF LINK *;
* ======================================================= *;
* INPUT PARAMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  Y    : THE DEPENDENT VARIABLE WITH 0 / 1 VALUES        *;
*  X    : MODEL PREDICTORS                                *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;
options mprint mlogic nocenter;

%let links = logit probit cloglog;
%let loop = 1;

proc sql noprint;
  select n(&data) - 3 into :df from &data;
quit; 

%do %while (%scan(&links, &loop) ne %str());

  %let link = %scan(&links, &loop);
  
  proc logistic data = &data noprint desc;
    model &y = &x / link = &link;
    score data = &data out = _out1;
  run;
  
  data _out2;
    set _out1(rename = (p_1 = p1));
    p2 = p1 * p1;
  run;
  
  ods listing close;
  ods output ParameterEstimates = _parm;  
  proc logistic data = _out2 desc;
    model &y = p1 p2 /  link = &link ;
  run;
  ods listing;
    
  %if &loop = 1 %then %do;
    data _parm1;
      format link $10.;
      set _parm(where = (variable = "p2"));
      link = upcase("&link");
    run;
  %end;
  %else %do;
    data _parm1;
      set _parm1 _parm(where = (variable = "p2") in = new);
      if new then link = upcase("&link");
    run;
  %end;
  
  data _parm2(drop = variable);
    set _parm1;
    _t = estimate / stderr;
    _df = &df;
    _p = (1 - probt(abs(_t), _df)) * 2;
  run;
  
  %let loop = %eval(&loop + 1);

%end;

title;
proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * PREGIBON TEST FOR GOODNESS OF LINK
           * H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY * "
         link _t _df _p);
  define link / "LINK FUNCTION" width = 15 order order = data;          
  define _t   / "T-VALUE"       width = 15 format = 12.4;
  define _df  / "DF"            width = 10;
  define _p   / "P-VALUE"       width = 15 format = 12.4;
run;

%mend;

After applying the macro to the kyphosis data (https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html), we can see that both logit and probit can be considered appropriate link functions in this specific case, while cloglog might not be a good choice.

             PREGIBON TEST FOR GOODNESS OF LINK
        H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY

 LINK FUNCTION           T-VALUE         DF         P-VALUE
-----------------------------------------------------------
 LOGIT                   -1.6825         78          0.0965
 PROBIT                  -1.7940         78          0.0767
 CLOGLOG                 -2.3632         78          0.0206
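For a cross-check in R, below is a minimal sketch of the same Pregibon logic with glm() on the kyphosis data. It mirrors the macro by regressing the response on the scored probability and its square under the same link, so only the qualitative conclusion is expected to match the table above.

library(rpart)
data(kyphosis)
m1 <- glm(Kyphosis ~ Age + Number + Start, data = kyphosis, family = binomial("logit"))
p1 <- predict(m1, type = "response")                  # scored probability, i.e. P_1 in the macro
m2 <- glm(Kyphosis ~ p1 + I(p1 ^ 2), data = kyphosis, family = binomial("logit"))
summary(m2)$coefficients                              # check the significance of the squared term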

Scorecard Development with Data from Multiple Sources

This week, one of my friends asked me a very interesting and practical question in the scorecard development. The model development data were collected from multiple independent sources with various data sizes, heterogeneous risk profiles, and different bad rates. While the performance statistics seem satisfactory on the model training dataset, the model doesn't generalize well to new accounts that might come from an unknown source. The situation is very common in a consulting company where a risk or marketing model is sometimes developed with the data from multiple organizations.

To better understand the issue, I simulated a dataset consisting of two groups. In the dataset, while x0 and x1 govern the group segmentation, x2 and x3 define the bad definition. It is important to point out that the group information "grp" is only known in the model development sample but is unknown in the production population. Therefore, the variable "grp", albeit predictive, cannot be explicitly used in the model estimation.

data one;
  do i = 1 to 100000;
    x0 = ranuni(0);
    x1 = ranuni(1);
    x2 = ranuni(2);
    x3 = ranuni(3);
    if 1 + x0 * 2 + x1 * 4 + rannor(1) > 5 then do;
      grp = 1;
      if x2 * 2 + x3 * 4 + rannor(2) > 5 then bad = 1;
    	else bad = 0;
    end;
    else do;
      grp = 0;
      if x2 * 4 + x3 * 2 + rannor(3) > 4 then bad = 1;
    	else bad = 0;
    end;
    output;
  end;
run;

Our first approach is to use all variables x0 – x3 to build a logistic regression and then evaluate the model both overall and by group.

proc logistic data = one desc noprint;
  model bad = x0 x1 x2 x3;
  score data = one out = mdl1 (rename = (p_1 = score1));
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1
                                MAXIMUM KS = 59.5763 AT SCORE POINT 0.2281
               ( AUC STATISTICS = 0.8800, GINI COEFFICIENT = 0.7599, DIVERGENCE = 2.6802 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6800     0.9699       2,057      7,943     10,000   79.43%      79.43%    33.81%      33.81%
   |      0.4679     0.6799       4,444      5,556     10,000   55.56%      67.50%    23.65%      57.46%
   |      0.3094     0.4679       6,133      3,867     10,000   38.67%      57.89%    16.46%      73.92%
   |      0.1947     0.3094       7,319      2,681     10,000   26.81%      50.12%    11.41%      85.33%
   |      0.1181     0.1946       8,364      1,636     10,000   16.36%      43.37%     6.96%      92.29%
   |      0.0690     0.1181       9,044        956     10,000    9.56%      37.73%     4.07%      96.36%
   |      0.0389     0.0690       9,477        523     10,000    5.23%      33.09%     2.23%      98.59%
   |      0.0201     0.0389       9,752        248     10,000    2.48%      29.26%     1.06%      99.64%
   V      0.0085     0.0201       9,925         75     10,000    0.75%      26.09%     0.32%      99.96%
 GOOD     0.0005     0.0085       9,991          9     10,000    0.09%      23.49%     0.04%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9699      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.0327 AT SCORE POINT 0.2457
               ( AUC STATISTICS = 0.8872, GINI COEFFICIENT = 0.7744, DIVERGENCE = 2.8605 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7086     0.9699       1,051      6,162      7,213   85.43%      85.43%    30.51%      30.51%
   |      0.5019     0.7086       2,452      4,762      7,214   66.01%      75.72%    23.58%      54.10%
   |      0.3407     0.5019       3,710      3,504      7,214   48.57%      66.67%    17.35%      71.45%
   |      0.2195     0.3406       4,696      2,517      7,213   34.90%      58.73%    12.46%      83.91%
   |      0.1347     0.2195       5,650      1,564      7,214   21.68%      51.32%     7.74%      91.66%
   |      0.0792     0.1347       6,295        919      7,214   12.74%      44.89%     4.55%      96.21%
   |      0.0452     0.0792       6,737        476      7,213    6.60%      39.42%     2.36%      98.56%
   |      0.0234     0.0452       7,000        214      7,214    2.97%      34.86%     1.06%      99.62%
   V      0.0099     0.0234       7,150         64      7,214    0.89%      31.09%     0.32%      99.94%
 GOOD     0.0007     0.0099       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0007     0.9699      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 1))
                                MAXIMUM KS = 53.0942 AT SCORE POINT 0.2290
               ( AUC STATISTICS = 0.8486, GINI COEFFICIENT = 0.6973, DIVERGENCE = 2.0251 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5863     0.9413       1,351      1,435      2,786   51.51%      51.51%    43.48%      43.48%
   |      0.3713     0.5862       2,136        651      2,787   23.36%      37.43%    19.73%      63.21%
   |      0.2299     0.3712       2,340        446      2,786   16.01%      30.29%    13.52%      76.73%
   |      0.1419     0.2298       2,525        262      2,787    9.40%      25.07%     7.94%      84.67%
   |      0.0832     0.1419       2,584        202      2,786    7.25%      21.50%     6.12%      90.79%
   |      0.0480     0.0832       2,643        144      2,787    5.17%      18.78%     4.36%      95.15%
   |      0.0270     0.0480       2,682        104      2,786    3.73%      16.63%     3.15%      98.30%
   |      0.0140     0.0270       2,741         46      2,787    1.65%      14.76%     1.39%      99.70%
   V      0.0058     0.0140       2,776         10      2,786    0.36%      13.16%     0.30%     100.00%
 GOOD     0.0005     0.0058       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9413      24,564      3,300     27,864

As shown in the output above, while the overall model performance looks acceptable, the model doesn't generalize well to the smaller 2nd group. While the overall KS is close to 60, the KS for the 2nd group is merely 53. The reason is that the overall model performance is heavily influenced by the larger 1st group. Therefore, the estimated model is biased toward the risk profile reflected in the 1st group.

To alleviate the bias in the first model, we could first introduce a look-alike model driven by x0 – x1 with the purpose of profiling the group membership and then build two separate risk models with x2 – x3 only, for the 1st and 2nd groups respectively. As a result, the final predicted probability is the composite of all three sub-models, as shown below. The model evaluation is also provided for comparison with the first model.

proc logistic data = one desc noprint;
  where grp = 0;
  model bad = x2 x3;
  score data = one out = mdl20(rename = (p_1 = p_10));
run;

proc logistic data = one desc noprint;
  where grp = 1;
  model bad = x2 x3;
  score data = one out = mdl21(rename = (p_1 = p_11));
run;

proc logistic data = one desc noprint;
  model grp = x0 x1;
  score data = one out = seg;
run;

data mdl2;
  merge seg mdl20 mdl21;
  by i;
  score2 = p_10 * (1 - p_1) + p_11 * p_1;
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2
                                MAXIMUM KS = 60.6234 AT SCORE POINT 0.2469
               ( AUC STATISTICS = 0.8858, GINI COEFFICIENT = 0.7715, DIVERGENCE = 2.8434 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6877     0.9677       2,011      7,989     10,000   79.89%      79.89%    34.00%      34.00%
   |      0.4749     0.6876       4,300      5,700     10,000   57.00%      68.45%    24.26%      58.27%
   |      0.3125     0.4748       6,036      3,964     10,000   39.64%      58.84%    16.87%      75.14%
   |      0.1932     0.3124       7,451      2,549     10,000   25.49%      50.51%    10.85%      85.99%
   |      0.1142     0.1932       8,379      1,621     10,000   16.21%      43.65%     6.90%      92.89%
   |      0.0646     0.1142       9,055        945     10,000    9.45%      37.95%     4.02%      96.91%
   |      0.0345     0.0646       9,533        467     10,000    4.67%      33.19%     1.99%      98.90%
   |      0.0166     0.0345       9,800        200     10,000    2.00%      29.29%     0.85%      99.75%
   V      0.0062     0.0166       9,946         54     10,000    0.54%      26.10%     0.23%      99.98%
 GOOD     0.0001     0.0062       9,995          5     10,000    0.05%      23.49%     0.02%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9677      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.1591 AT SCORE POINT 0.2458
               ( AUC STATISTICS = 0.8880, GINI COEFFICIENT = 0.7759, DIVERGENCE = 2.9130 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7221     0.9677       1,075      6,138      7,213   85.10%      85.10%    30.40%      30.40%
   |      0.5208     0.7221       2,436      4,778      7,214   66.23%      75.66%    23.66%      54.06%
   |      0.3533     0.5208       3,670      3,544      7,214   49.13%      66.82%    17.55%      71.61%
   |      0.2219     0.3532       4,726      2,487      7,213   34.48%      58.73%    12.32%      83.92%
   |      0.1309     0.2219       5,617      1,597      7,214   22.14%      51.41%     7.91%      91.83%
   |      0.0731     0.1309       6,294        920      7,214   12.75%      44.97%     4.56%      96.39%
   |      0.0387     0.0731       6,762        451      7,213    6.25%      39.44%     2.23%      98.62%
   |      0.0189     0.0387       7,009        205      7,214    2.84%      34.86%     1.02%      99.63%
   V      0.0074     0.0189       7,152         62      7,214    0.86%      31.09%     0.31%      99.94%
 GOOD     0.0002     0.0073       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0002     0.9677      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 1))
                                MAXIMUM KS = 57.6788 AT SCORE POINT 0.1979
               ( AUC STATISTICS = 0.8717, GINI COEFFICIENT = 0.7434, DIVERGENCE = 2.4317 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5559     0.9553       1,343      1,443      2,786   51.79%      51.79%    43.73%      43.73%
   |      0.3528     0.5559       2,001        786      2,787   28.20%      40.00%    23.82%      67.55%
   |      0.2213     0.3528       2,364        422      2,786   15.15%      31.71%    12.79%      80.33%
   |      0.1372     0.2213       2,513        274      2,787    9.83%      26.24%     8.30%      88.64%
   |      0.0840     0.1372       2,588        198      2,786    7.11%      22.42%     6.00%      94.64%
   |      0.0484     0.0840       2,683        104      2,787    3.73%      19.30%     3.15%      97.79%
   |      0.0256     0.0483       2,729         57      2,786    2.05%      16.84%     1.73%      99.52%
   |      0.0118     0.0256       2,776         11      2,787    0.39%      14.78%     0.33%      99.85%
   V      0.0040     0.0118       2,781          5      2,786    0.18%      13.16%     0.15%     100.00%
 GOOD     0.0001     0.0040       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9553      24,564      3,300     27,864

After comparing KS statistics from the two modeling approaches, we can see that, while the performance of the 2nd approach on the overall sample is only slightly better than that of the 1st approach, the KS on the smaller 2nd group, e.g. grp = 1, increases from 53 to 58, or by roughly 8.6%. While the example only covers two groups, it is straightforward to generalize the approach to more than two groups, as sketched below.
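Below is a hypothetical R sketch of that generalization: a multinomial look-alike model on x0 – x1 plus one risk model per group on x2 – x3, with the final score as the probability-weighted average across groups. The function name composite_score and its arguments are illustrative only, and the nnet package is assumed to be available.

library(nnet)

composite_score <- function(df, grp = "grp", bad = "bad",
                            seg_x = c("x0", "x1"), risk_x = c("x2", "x3")) {
  df[[grp]] <- factor(df[[grp]])
  g_lvl <- levels(df[[grp]])
  # look-alike model profiling the group membership
  seg_fit <- nnet::multinom(reformulate(seg_x, grp), data = df, trace = FALSE)
  p_seg <- predict(seg_fit, newdata = df, type = "probs")
  if (length(g_lvl) == 2) p_seg <- cbind(1 - p_seg, p_seg)  # the binary case returns a vector
  colnames(p_seg) <- g_lvl
  # one risk model per group, each scoring the full sample
  p_risk <- sapply(g_lvl, function(g) {
    fit_g <- glm(reformulate(risk_x, bad), data = df[df[[grp]] == g, ], family = binomial)
    predict(fit_g, newdata = df, type = "response")
  })
  # composite probability of bad, weighted by the group membership probabilities
  rowSums(p_seg[, g_lvl, drop = FALSE] * p_risk[, g_lvl, drop = FALSE])
}

With two groups, the function collapses to the same composite as score2 above, i.e. p_10 * (1 – p_1) + p_11 * p_1, once the simulated data have been moved into an R data frame.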

Duplicate Breusch-Godfrey Test Logic in SAS Autoreg Procedure

Since it appears that SAS and R might give slightly different B-G test results, I spent a couple of hours duplicating the logic of the B-G test implemented in the SAS Autoreg procedure. The resulting SAS macro should give my team more flexibility to perform the B-G test in CCAR 2017 model developments in cases where models are not estimated with the Autoreg procedure.

B-G Test with Proc Autoreg

data one;
  do i = 1 to 100;
    x1 = uniform(1);
    x2 = uniform(2);
    r  = normal(3) * 0.5;
    y = 10 + 8 * x1 + 6 * x2 + r;
    output;
  end;
run;

proc autoreg data = one;
  model y = x1 x2 / godfrey = 4;
run;
quit;

/*
Godfrey's Serial Correlation Test

Alternative            LM    Pr > LM
AR(1)              0.2976     0.5854
AR(2)              1.5919     0.4512
AR(3)              1.7168     0.6332
AR(4)              1.7839     0.7754
*/

Home-brew SAS Macro

%macro bgtest(data = , r = , x = , order = 4);
options nocenter nonumber nodate mprint mlogic symbolgen
        formchar = "|----|+|---+=|-/\<>*";

proc sql noprint;
select
  mean(&r) format = 12.8 into :m
from
  &data;
quit;

data _1 (drop = _i);
  set &data (keep = &r &x);
  %do i = 1 %to &order;
    _lag&i._&r = lag&i.(&r);
  %end;
  _i + 1;
  _index = _i - &order;
  array _l _lag:;
  do over _l;
    if _l = . then _l = &m;
  end;
run;
 
proc reg data = _last_ noprint;
  model &r =  &x _lag:;
  output out = _2 p = rhat;
run;
 
proc sql noprint;  
create table
  _result as
select
  sum((rhat - &m) ** 2) / sum((&r - &m) ** 2)  as _r2,
  (select count(*) from _2) * calculated _r2   as _chisq,
  1 - probchi(calculated _chisq, &order.)      as _p_chisq,
  &order                                       as _df
from
  _2;
quit;

title;
proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
           * H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO &order * "
          _chisq _df _p_chisq);
  define _chisq   / "CHI-SQUARE" width = 20 format = 15.10;
  define _df      / "DF"         width = 10;
  define _p_chisq / "P-VALUE"    width = 20 format = 15.10;
run;
%mend bgtest;

proc reg data = one noprint;
  model y = x1 x2;
  output out = two r = r2;
run;
quit;

data _null_;
  do i = 1 to 4;
    call execute('%bgtest(data = two, r = r2, x = x1 x2, order = '||put(i, 2.)||');');
  end;
run;

/*
       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 1
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         0.2976458421          1         0.5853620441

       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 2
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         1.5918785412          2         0.4511572771

       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 3
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         1.7167785901          3         0.6332099963

       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 4
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         1.7839349922          4         0.7754201982
*/
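For reference, the R counterpart is a one-liner with lmtest::bgtest(). The random draws below will not match the SAS streams, so the statistics are illustrative only.

library(lmtest)
set.seed(1)
x1 <- runif(100)
x2 <- runif(100)
y <- 10 + 8 * x1 + 6 * x2 + rnorm(100) * 0.5
for (p in 1:4) print(bgtest(y ~ x1 + x2, order = p, type = "Chisq"))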

More Flexible Approaches to Model Frequency

(The post below is motivated by my friend Matt Flynn https://www.linkedin.com/in/matthew-flynn-1b443b11)

In the context of operational loss forecast models, the standard Poisson regression is the most popular way to model frequency measures. Conceptually speaking, there is a restrictive assumption for the standard Poisson regression, namely Equi-Dispersion, which requires the equality between the conditional mean and the variance such that E(Y) = var(Y). However, in real-world frequency outcomes, the assumption of Equi-Dispersion is rarely satisfied. Instead, the empirical data often present either excessive variance, namely Over-Dispersion, or insufficient variance, namely Under-Dispersion. The application of a standard Poisson regression to over-dispersed data will lead to deflated standard errors of parameter estimates and therefore inflated t-statistics.

In cases of Over-Dispersion, the Negative Binomial (NB) regression has been the most common alternative to the standard Poisson regression by including a dispersion parameter to accommodate the excessive variance in the data. In the formulation of NB regression, the variance is expressed as a quadratic function of the conditional mean such that the variance is guaranteed to be higher than the conditional mean. However, it is not flexible enough to allow for both Over-Dispersion and Under-Dispersion. Therefore, more generalizable approaches are called for.

Two additional frequency modeling methods, the Quasi-Poisson (QP) regression and the Conway-Maxwell Poisson (CMP) regression, are discussed below. In the case of Quasi-Poisson, E(Y) = λ and var(Y) = θ • λ. While θ > 1 addresses Over-Dispersion, θ < 1 governs Under-Dispersion. Since QP regression is estimated with QMLE, likelihood-based statistics, such as AIC and BIC, won't be available. Instead, quasi-AIC and quasi-BIC are provided. In the case of Conway-Maxwell Poisson, E(Y) ≈ λ ** (1 / v) – (v – 1) / (2 • v) and var(Y) ≈ (1 / v) • λ ** (1 / v), where λ doesn't represent the conditional mean anymore but a location parameter. While v < 1 enables us to model the long-tailed distribution reflected as Over-Dispersion, v > 1 takes care of the short-tailed distribution reflected as Under-Dispersion. Since CMP regression is estimated with MLE, likelihood-based statistics, such as AIC and BIC, are available at a higher computing cost.
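For intuition, the dispersion parameter θ in the QP regression is typically estimated as the Pearson chi-square statistic divided by the residual degrees of freedom, e.g. the short R sketch below, which assumes the DebTrivedi data used in the next snippet have already been loaded.

m0 <- glm(ofp ~ health + hosp + numchron + privins + school + gender + medicaid,
          data = DebTrivedi, family = poisson)
sum(residuals(m0, type = "pearson") ^ 2) / df.residual(m0)  # should match the quasipoisson dispersion below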

Below demonstrates how to estimate QP and CMP regressions with R and compares their computing times. If the modeling purpose is mainly prediction without focusing on statistical inference, QP regression would be an excellent choice for most practitioners. Otherwise, CMP regression is an elegant model to address various levels of dispersion parsimoniously.

# data source: www.jstatsoft.org/article/view/v027i08
load("../Downloads/DebTrivedi.rda")

library(rbenchmark)
library(CompGLM)

benchmark(replications = 3, order = "user.self",
  quasi.poisson = {
    m1 <- glm(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, family = "quasipoisson")
  },
  conway.maxwell = {
    m2 <- glm.comp(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, lamStart = m1$coefficients)
  }
)
#             test replications elapsed relative user.self sys.self user.child
# 1  quasi.poisson            3   0.084    1.000     0.084    0.000          0
# 2 conway.maxwell            3  42.466  505.548    42.316    0.048          0

summary(m1)
summary(m2) 

Quasi-Poisson Regression

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.886462   0.069644  12.729  < 2e-16 ***
healthpoor       0.235673   0.046284   5.092 3.69e-07 ***
healthexcellent -0.360188   0.078441  -4.592 4.52e-06 ***
hosp             0.163246   0.015594  10.468  < 2e-16 ***
numchron         0.144652   0.011894  12.162  < 2e-16 ***
privinsyes       0.304691   0.049879   6.109 1.09e-09 ***
school           0.028953   0.004812   6.016 1.93e-09 ***
gendermale      -0.092460   0.033830  -2.733   0.0063 **
medicaidyes      0.297689   0.063787   4.667 3.15e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 6.697556)

    Null deviance: 26943  on 4405  degrees of freedom
Residual deviance: 23027  on 4397  degrees of freedom
AIC: NA

Conway-Maxwell Poisson Regression

Beta:
                   Estimate   Std.Error  t.value p.value
(Intercept)     -0.23385559  0.16398319  -1.4261 0.15391
healthpoor       0.03226830  0.01325437   2.4345 0.01495 *
healthexcellent -0.08361733  0.00687228 -12.1673 < 2e-16 ***
hosp             0.01743416  0.01500555   1.1618 0.24536
numchron         0.02186788  0.00209274  10.4494 < 2e-16 ***
privinsyes       0.05193645  0.00184446  28.1581 < 2e-16 ***
school           0.00490214  0.00805940   0.6083 0.54305
gendermale      -0.01485663  0.00076861 -19.3292 < 2e-16 ***
medicaidyes      0.04861617  0.00535814   9.0733 < 2e-16 ***

Zeta:
              Estimate  Std.Error t.value   p.value
(Intercept) -3.4642316  0.0093853 -369.11 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

AIC: 24467.13
Log-Likelihood: -12223.56

Calculating ACF with Data Step Only

In SAS/ETS, it is trivial to calculate the ACF of a time series with the ARIMA procedure. However, the downside is that, in addition to the ACF, you will get more output than necessary without knowing the underlying mechanism. The SAS macro below is a clean routine written with simple data steps that shows how to calculate the ACF step by step and generates nothing but a table with the ACF and the related lag, without using the SAS/ETS module at all. It is easy to write a wrapper around this macro for further analysis.

%macro acf(data = , var = , out = acf);
***********************************************************;
* SAS MACRO CALCULATING AUTOCORRELATION FUNCTION WITH     *;
* DATA STEP ONLY                                          *;
* ======================================================= *;
* INPUT PARAMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  VAR  : THE TIME SERIES TO TEST FOR INDEPENDENCE        *;
* ======================================================= *;
* OUTPUT:                                                 *;
*  OUT : AN OUTPUT SAS DATA TABLE WITH ACF AND LAG        *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

%local nobs;
data _1 (keep = &var);
  set &data end = eof;
  if eof then do;
    call execute('%let nobs = '||put(_n_, 8.)||';');
  end;
run;

proc sql noprint;
  select mean(&var) into :mean_x from _last_;
quit;

%do i = 1 %to %eval(&nobs - 1);

  data _2(keep = _:);
    set _1;
    _x = &var;
    _lag = lag&i.(_x);
  run;

  proc sql ;
  create table
    _3 as
  select
    (_x - &mean_x) ** 2               as _den,
    (_x - &mean_x) * (_lag - &mean_x) as _num
  from
    _last_;

  create table
    _4 as
  select
    &i                    as lag,
    sum(_num) / sum(_den) as acf
  from
    _last_;

  %if &i = 1 %then %do;
  create table 
    &out as
  select
    *
  from
    _4;
  %end;
  %else %do;
  insert into &out
  select
    *
  from
    _4;
  %end;

  drop table _2, _3, _4;
  quit;
%end;

%mend acf;
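As a sanity check, the same quantity can be computed in a few lines of R, where the lag-k autocorrelation is the sum of lagged cross-products divided by the total sum of squares around the mean, matching acf() exactly.

acf_manual <- function(x, k) {
  n <- length(x)
  m <- mean(x)
  sum((x[(k + 1):n] - m) * (x[1:(n - k)] - m)) / sum((x - m) ^ 2)
}
set.seed(1)
x <- rnorm(100)
c(acf_manual(x, 1), acf(x, lag.max = 1, plot = FALSE)$acf[2])  # the two values should agree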

Estimate Quasi-Binomial Model with GENMOD Procedure in SAS

In my previous post (https://statcompute.wordpress.com/2015/11/01/quasi-binomial-model-in-sas/), I've shown why Quasi-Binomial models are of interest to financial practitioners and how to estimate them with the GLIMMIX procedure in SAS.

In the demonstration below, I will show an example of how to estimate a Quasi-Binomial model with the GENMOD procedure by specifying the VARIANCE and DEVIANCE functions. While the model estimation is a lot faster with GENMOD than with GLIMMIX, additional steps are necessary to ensure the correct statistical inference.

ods listing close;
ods output modelfit = fit;
ods output parameterestimates = parm1;
proc genmod data = kyphosis;
  model y = age number start / link = logit noscale;
  variance v = _mean_ * (1 - _mean_);
  deviance d = (-2) * log((_mean_ ** _resp_) * ((1 - _mean_) ** (1 - _resp_)));
run;
ods listing;

proc sql noprint;
select
  distinct valuedf format = 12.8, df format = 8.0 into :disp, :df
from
  fit
where 
  index(criterion, "Pearson") > 0;

select
  value format = 12.4 into :ll
from
  fit
where
  criterion = "Log Likelihood";

select
  sum(df) into :k
from
  parm1;
quit;

%let aic = %sysevalf((-&ll + &k) * 2); 
%let bic = %sysevalf(-&ll * 2 + &k * %sysfunc(log(&df + &k)));

data parm2 (keep = parameter estimate stderr2 df t_value p_value);
  set parm1;
  where df > 0;

  stderr2 = stderr * (&disp ** 0.5);
  df = &df;
  t_value = estimate / stderr2;
  p_value = (1 - probt(abs(t_value), &df)) * 2;
run;

title;
proc report data = parm2 spacing = 1 headline nowindows split = "*";
  column(" Parameter Estimate of Quasi-Binomial Model "
         parameter estimate stderr2 t_value df p_value);
  compute before _page_;
    line @5 "Fit Statistics";
	line " ";
	line @3 "Quasi Log Likelihood: %sysfunc(round(&ll, 0.01))";
	line @3 "Quasi-AIC (smaller is better): %sysfunc(round(&aic, 0.01))";
	line @3 "Quasi-BIC (smaller is better): %sysfunc(round(&bic, 0.01))";
	line @3 "(Dispersion Parameter for Quasi-Binomial is %sysfunc(round(&disp, 0.0001)))";
	line " ";
  endcomp;
  define parameter / "Parameter" width = 10 order order = data;
  define estimate  / "Estimate" width = 10 format = 10.4;
  define stderr2   / "Std Err"  width = 10 format = 10.4;
  define t_value   / "T Value"  width = 10 format = 10.2;
  define df        / "DF"       width = 5  format = 4.0;
  define p_value   / "Pr > |t|" width = 10 format = 10.4;
run;
quit;
  
/*
    Fit Statistics

  Quasi Log Likelihood: -30.69
  Quasi-AIC (smaller is better): 69.38
  Quasi-BIC (smaller is better): 78.96
  (Dispersion Parameter for Quasi-Binomial is 0.9132)

          Parameter Estimate of Quasi-Binomial Model
 Parameter    Estimate    Std Err    T Value    DF   Pr > |t|
 ------------------------------------------------------------
 Intercept     -2.0369     1.3853      -1.47    77     0.1455
 Age            0.0109     0.0062       1.77    77     0.0800
 Number         0.4106     0.2149       1.91    77     0.0598
 Start         -0.2065     0.0647      -3.19    77     0.0020
*/
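As a side note, the standard error adjustment above, i.e. scaling by the square root of the Pearson dispersion, can be cross-checked in R with a small sketch on the kyphosis data from the rpart package; the two columns below should be identical.

library(rpart)
data(kyphosis)
m_bin <- glm(Kyphosis ~ Age + Number + Start, data = kyphosis, family = binomial)
phi <- sum(residuals(m_bin, type = "pearson") ^ 2) / df.residual(m_bin)
cbind(scaled_binomial = summary(m_bin)$coefficients[, "Std. Error"] * sqrt(phi),
      quasibinomial = summary(update(m_bin, family = quasibinomial))$coefficients[, "Std. Error"])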

SAS Macro Performing Breusch–Godfrey Test for Serial Correlation

%macro bgtest(data = , r = , x = , order = 1);
********************************************************************;
* SAS MACRO PERFORMING BREUSCH-GODFREY TEST FOR SERIAL CORRELATION *;
* BY FOLLOWING THE LOGIC OF BGTEST() IN R LMTEST PACKAGE           *;
* ================================================================ *;
*  INPUT PARAMETERS:                                                *;
*  DATA  : INPUT SAS DATA TABLE                                    *;
*  R     : RESIDUALS TO TEST SERIAL CORRELATION                    *;
*  X     : INDEPENDENT VARIABLES IN THE ORIGINAL REGRESSION MODEL  *;
*  ORDER : THE ORDER OF SERIAL CORRELATION                         *;
* ================================================================ *;
* AUTHOR: WENSUI.LIU@53.COM                                        *;
********************************************************************;

data _1 (drop = _i);
  set &data (keep = &r &x);
  %do i = 1 %to &order;
    _lag&i._&r = lag&i.(&r);
  %end;
  _i + 1;
  _index = _i - &order;
  if _index > 0 then output;
run;

ods listing close;
proc reg data = _last_;
  model &r = &x _lag:;
  output out = _2 p = yhat;
run;

ods listing;
proc sql noprint;
create table
  _result as
select
  (select count(*) from _2) * sum(yhat ** 2) / sum(&r ** 2)   as _chisq,
  1 - probchi(calculated _chisq, &order.)                     as _p_chisq,
  &order                                                      as _df
from
  _2;
quit;

title;
proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
           * H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO &order * "
          _chisq _df _p_chisq);
  define _chisq   / "CHI-SQUARE" width = 20 format = 15.10;
  define _df      / "DF"         width = 10;
  define _p_chisq / "P-VALUE"    width = 20 format = 15.10;
run;

%mend bgtest;

Calculate Leave-One-Out Prediction for GLM

In the model development, the "leave-one-out" prediction is a way of cross-validation, calculated as below:
1. First of all, after a model is developed, each observation used in the model development is removed in turn and the model is refitted with the remaining observations;
2. The out-of-sample prediction for the refitted model is then calculated with the removed observation, one at a time, to assemble the LOO, i.e. leave-one-out, predicted values for the whole model development sample.
The loo_predict() function below is a general routine to calculate the LOO prediction for any GLM object, which can be further employed to investigate the model stability and predictability.

> pkgs <- c('doParallel', 'foreach')
> lapply(pkgs, require, character.only = T)
[[1]]
[1] TRUE

[[2]]
[1] TRUE

> registerDoParallel(cores = 8)
>
> data(AutoCollision, package = "insuranceData")
> # A GAMMA GLM #
> model1 <- glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = Gamma(link = "log"))
> # A POISSON GLM #
> model2 <- glm(Claim_Count ~ Age + Vehicle_Use, data = AutoCollision, family = poisson(link = "log"))
>
> loo_predict <- function(obj) {
+   yhat <- foreach(i = 1:nrow(obj$data), .combine = rbind) %dopar% {
+     predict(update(obj, data = obj$data[-i, ]), obj$data[i,], type = "response")
+   }
+   return(data.frame(result = yhat[, 1], row.names = NULL))
+ }
> # TEST CASE 1
> test1 <- loo_predict(model1)
> test1$result
 [1] 303.7393 328.7292 422.6610 375.5023 240.9785 227.6365 288.4404 446.5589
 [9] 213.9368 244.7808 278.7786 443.2256 213.9262 243.2495 266.9166 409.2565
[17] 175.0334 172.0683 197.2911 326.5685 187.2529 215.9931 249.9765 349.3873
[25] 190.1174 218.6321 243.7073 359.9631 192.3655 215.5986 233.1570 348.2781
> # TEST CASE 2
> test2 <- loo_predict(model2)
> test2$result
 [1]  11.15897  37.67273  28.76127  11.54825  50.26364 152.35489 122.23782
 [8]  44.57048 129.58158 465.84173 260.48114 107.23832 167.40672 510.41127
[15] 316.50765 121.75804 172.56928 546.25390 341.03826 134.04303 359.30141
[22] 977.29107 641.69934 251.32547 248.79229 684.86851 574.13994 238.42209
[29] 148.77733 504.12221 422.75047 167.61203

Multivariate Adaptive Regression Splines with Python

In [1]: import statsmodels.datasets as datasets

In [2]: import sklearn.metrics as metrics

In [3]: from numpy import log

In [4]: from pyearth import Earth as earth

In [5]: boston = datasets.get_rdataset("Boston", "MASS").data

In [6]: x = boston.ix[:, 0:boston.shape[1] - 1]

In [7]: xlabel = list(x.columns)

In [8]: y = boston.ix[:, boston.shape[1] - 1]

In [9]: model = earth(enable_pruning = True, penalty = 3, minspan_alpha = 0.05, endspan_alpha = 0.05)

In [10]: model.fit(x, log(y), xlabels = xlabel)
Out[10]:
Earth(allow_linear=None, check_every=None, enable_pruning=True, endspan=None,
   endspan_alpha=0.05, max_degree=None, max_terms=None,
   min_search_points=None, minspan=None, minspan_alpha=0.05, penalty=3,
   smooth=None, thresh=None)

In [11]: print model.summary()
Earth Model
--------------------------------------
Basis Function   Pruned  Coefficient
--------------------------------------
(Intercept)      No      2.93569
h(lstat-5.9)     No      -0.0249319
h(5.9-lstat)     No      0.067697
h(rm-6.402)      No      0.230063
h(6.402-rm)      Yes     None
crim             Yes     None
h(dis-1.5004)    No      -0.0369272
h(1.5004-dis)    No      1.70517
h(ptratio-18.6)  No      -0.0393493
h(18.6-ptratio)  No      0.0278253
nox              No      -0.767146
h(indus-25.65)   No      -0.147471
h(25.65-indus)   Yes     None
h(black-395.24)  Yes     None
h(395.24-black)  Yes     None
h(crim-24.8017)  Yes     None
h(24.8017-crim)  No      0.0292657
rad              No      0.00807783
h(rm-7.853)      Yes     None
h(7.853-rm)      Yes     None
chas             Yes     None
--------------------------------------
MSE: 0.0265, GCV: 0.0320, RSQ: 0.8409, GRSQ: 0.8090

In [12]: metrics.r2_score(log(y), model.predict(x))
Out[12]: 0.840861083407211

Modeling Frequency in Operational Losses with Python

Poisson and Negative Binomial regressions are two popular approaches to model frequency measures in the operational loss and can be implemented in Python with the statsmodels package as below:

In [1]: import pandas as pd

In [2]: import statsmodels.api as sm

In [3]: import statsmodels.formula.api as smf

In [4]: df = pd.read_csv("AutoCollision.csv")

In [5]: # FITTING A POISSON REGRESSION

In [6]: poisson = smf.glm(formula = "Claim_Count ~ Age + Vehicle_Use", data = df, family = sm.families.Poisson(sm.families.links.log))

In [7]: poisson.fit().summary()
Out[7]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            Claim_Count   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:                 Poisson   Df Model:                           10
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -204.40
Date:                Tue, 08 Dec 2015   Deviance:                       184.72
Time:                        20:31:27   Pearson chi2:                     184.
No. Iterations:                     9
=============================================================================================
                                coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     2.3702      0.110     21.588      0.000         2.155     2.585
Age[T.21-24]                  1.4249      0.118     12.069      0.000         1.193     1.656
Age[T.25-29]                  2.3465      0.111     21.148      0.000         2.129     2.564
Age[T.30-34]                  2.5153      0.110     22.825      0.000         2.299     2.731
Age[T.35-39]                  2.5821      0.110     23.488      0.000         2.367     2.798
Age[T.40-49]                  3.2247      0.108     29.834      0.000         3.013     3.437
Age[T.50-59]                  3.0019      0.109     27.641      0.000         2.789     3.215
Age[T.60+]                    2.6391      0.110     24.053      0.000         2.424     2.854
Vehicle_Use[T.DriveLong]      0.9246      0.036     25.652      0.000         0.854     0.995
Vehicle_Use[T.DriveShort]     1.2856      0.034     37.307      0.000         1.218     1.353
Vehicle_Use[T.Pleasure]       0.1659      0.041      4.002      0.000         0.085     0.247
=============================================================================================
"""

In [8]: # FITTING A NEGATIVE BINOMIAL REGRESSION

In [9]: nbinom = smf.glm(formula = "Claim_Count ~ Age + Vehicle_Use", data = df, family = sm.families.NegativeBinomial(sm.families.links.log))

In [10]: nbinom.fit().summary()
Out[10]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            Claim_Count   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:        NegativeBinomial   Df Model:                           10
Link Function:                    log   Scale:                 0.0646089484752
Method:                          IRLS   Log-Likelihood:                -198.15
Date:                Tue, 08 Dec 2015   Deviance:                       1.4436
Time:                        20:31:27   Pearson chi2:                     1.36
No. Iterations:                    11
=============================================================================================
                                coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     2.2939      0.153     14.988      0.000         1.994     2.594
Age[T.21-24]                  1.4546      0.183      7.950      0.000         1.096     1.813
Age[T.25-29]                  2.4133      0.183     13.216      0.000         2.055     2.771
Age[T.30-34]                  2.5636      0.183     14.042      0.000         2.206     2.921
Age[T.35-39]                  2.6259      0.183     14.384      0.000         2.268     2.984
Age[T.40-49]                  3.2408      0.182     17.760      0.000         2.883     3.598
Age[T.50-59]                  2.9717      0.183     16.283      0.000         2.614     3.329
Age[T.60+]                    2.6404      0.183     14.463      0.000         2.283     2.998
Vehicle_Use[T.DriveLong]      0.9480      0.128      7.408      0.000         0.697     1.199
Vehicle_Use[T.DriveShort]     1.3402      0.128     10.480      0.000         1.090     1.591
Vehicle_Use[T.Pleasure]       0.3265      0.128      2.548      0.011         0.075     0.578
=============================================================================================
"""

Although Quasi-Poisson regression is not currently supported by the statsmodels package, we are still able to estimate the model with the rpy2 package by using R in the back-end. As shown in the output below, parameter estimates in the Quasi-Poisson model are identical to the ones in the standard Poisson model. In case we want a flexible modeling approach for frequency measures in the operational loss forecast without pursuing the more complex Negative Binomial model, Quasi-Poisson regression can be considered a serious contender.

In [11]: # FITTING A QUASI-POISSON REGRESSION

In [12]: import rpy2.robjects as ro

In [13]: from rpy2.robjects import pandas2ri

In [14]: pandas2ri.activate()

In [15]: rdf = pandas2ri.py2ri_pandasdataframe(df)

In [16]: qpoisson = ro.r.glm('Claim_Count ~ Age + Vehicle_Use', data = rdf, family = ro.r('quasipoisson(link = "log")'))

In [17]: print ro.r.summary(qpoisson)

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             2.3702     0.3252   7.288 3.55e-07 ***
Age21-24                1.4249     0.3497   4.074 0.000544 ***
Age25-29                2.3465     0.3287   7.140 4.85e-07 ***
Age30-34                2.5153     0.3264   7.705 1.49e-07 ***
Age35-39                2.5821     0.3256   7.929 9.49e-08 ***
Age40-49                3.2247     0.3202  10.072 1.71e-09 ***
Age50-59                3.0019     0.3217   9.331 6.42e-09 ***
Age60+                  2.6391     0.3250   8.120 6.48e-08 ***
Vehicle_UseDriveLong    0.9246     0.1068   8.660 2.26e-08 ***
Vehicle_UseDriveShort   1.2856     0.1021  12.595 2.97e-11 ***
Vehicle_UsePleasure     0.1659     0.1228   1.351 0.191016
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 8.774501)

    Null deviance: 6064.97  on 31  degrees of freedom
Residual deviance:  184.72  on 21  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

Modeling Severity in Operational Losses with Python

When modeling severity measurements in the operational loss with Generalized Linear Models, we might have a couple of choices based on different distributional assumptions, including Gamma, Inverse Gaussian, and Lognormal. However, based on my observations from empirical work, the differences in parameter estimates among these three popular candidates are rather immaterial from the practical standpoint.

Below is a demonstration showing how to model the severity with the insurance data under the aforementioned three distributions. As shown, albeit with inferential differences, the three models yield similar coefficients.

In [1]: # LOAD PACKAGES

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: import statsmodels.api as sm

In [5]: import statsmodels.formula.api as smf

In [6]: df = pd.read_csv("AutoCollision.csv")

In [7]: df.head()
Out[7]:
     Age Vehicle_Use  Severity  Claim_Count
0  17-20    Pleasure    250.48           21
1  17-20  DriveShort    274.78           40
2  17-20   DriveLong    244.52           23
3  17-20    Business    797.80            5
4  21-24    Pleasure    213.71           63

In [8]: # FIT A GAMMA REGRESSION

In [9]: gamma = smf.glm(formula = "Severity ~ Age + Vehicle_Use", data = df, family = sm.families.Gamma(sm.families.links.log))

In [10]: gamma.fit().summary()
Out[10]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:               Severity   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:                   Gamma   Df Model:                           10
Link Function:                    log   Scale:                 0.0299607547345
Method:                          IRLS   Log-Likelihood:                -161.35
Date:                Sun, 06 Dec 2015   Deviance:                      0.58114
Time:                        12:59:17   Pearson chi2:                    0.629
No. Iterations:                     8
=============================================================================================
                                coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     6.2413      0.101     61.500      0.000         6.042     6.440
Age[T.21-24]                 -0.2080      0.122     -1.699      0.089        -0.448     0.032
Age[T.25-29]                 -0.2303      0.122     -1.881      0.060        -0.470     0.010
Age[T.30-34]                 -0.2630      0.122     -2.149      0.032        -0.503    -0.023
Age[T.35-39]                 -0.5311      0.122     -4.339      0.000        -0.771    -0.291
Age[T.40-49]                 -0.3820      0.122     -3.121      0.002        -0.622    -0.142
Age[T.50-59]                 -0.3741      0.122     -3.057      0.002        -0.614    -0.134
Age[T.60+]                   -0.3939      0.122     -3.218      0.001        -0.634    -0.154
Vehicle_Use[T.DriveLong]     -0.3573      0.087     -4.128      0.000        -0.527    -0.188
Vehicle_Use[T.DriveShort]    -0.5053      0.087     -5.839      0.000        -0.675    -0.336
Vehicle_Use[T.Pleasure]      -0.5886      0.087     -6.801      0.000        -0.758    -0.419
=============================================================================================
"""

In [11]: # FIT A INVERSE GAUSSIAN REGRESSION

In [12]: igauss = smf.glm(formula = "Severity ~ Age + Vehicle_Use", data = df, family = sm.families.InverseGaussian(sm.families.links.log))

In [13]: igauss.fit().summary()
Out[13]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:               Severity   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:         InverseGaussian   Df Model:                           10
Link Function:                    log   Scale:               8.73581523073e-05
Method:                          IRLS   Log-Likelihood:                -156.44
Date:                Sun, 06 Dec 2015   Deviance:                    0.0015945
Time:                        13:01:14   Pearson chi2:                  0.00183
No. Iterations:                     7
=============================================================================================
                                coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     6.1776      0.103     59.957      0.000         5.976     6.379
Age[T.21-24]                 -0.1475      0.116     -1.269      0.204        -0.375     0.080
Age[T.25-29]                 -0.1632      0.116     -1.409      0.159        -0.390     0.064
Age[T.30-34]                 -0.2079      0.115     -1.814      0.070        -0.433     0.017
Age[T.35-39]                 -0.4732      0.108     -4.361      0.000        -0.686    -0.261
Age[T.40-49]                 -0.3299      0.112     -2.954      0.003        -0.549    -0.111
Age[T.50-59]                 -0.3206      0.112     -2.866      0.004        -0.540    -0.101
Age[T.60+]                   -0.3465      0.111     -3.115      0.002        -0.565    -0.128
Vehicle_Use[T.DriveLong]     -0.3334      0.084     -3.992      0.000        -0.497    -0.170
Vehicle_Use[T.DriveShort]    -0.4902      0.081     -6.055      0.000        -0.649    -0.332
Vehicle_Use[T.Pleasure]      -0.5743      0.080     -7.206      0.000        -0.731    -0.418
=============================================================================================
"""

In [14]: # FIT A LOGNORMAL REGRESSION

In [15]: df['Log_Severity'] = np.log(df.Severity)

In [16]: lognormal = smf.glm(formula = "Log_Severity ~ Age + Vehicle_Use", data = df, family = sm.families.Gaussian())

In [17]: lognormal.fit().summary()
Out[17]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:           Log_Severity   No. Observations:                   32
Model:                            GLM   Df Residuals:                       21
Model Family:                Gaussian   Df Model:                           10
Link Function:               identity   Scale:                 0.0265610360381
Method:                          IRLS   Log-Likelihood:                 19.386
Date:                Sun, 06 Dec 2015   Deviance:                      0.55778
Time:                        13:02:12   Pearson chi2:                    0.558
No. Iterations:                     4
=============================================================================================
                                coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept                     6.1829      0.096     64.706      0.000         5.996     6.370
Age[T.21-24]                 -0.1667      0.115     -1.447      0.148        -0.393     0.059
Age[T.25-29]                 -0.1872      0.115     -1.624      0.104        -0.413     0.039
Age[T.30-34]                 -0.2163      0.115     -1.877      0.061        -0.442     0.010
Age[T.35-39]                 -0.4901      0.115     -4.252      0.000        -0.716    -0.264
Age[T.40-49]                 -0.3347      0.115     -2.904      0.004        -0.561    -0.109
Age[T.50-59]                 -0.3267      0.115     -2.835      0.005        -0.553    -0.101
Age[T.60+]                   -0.3467      0.115     -3.009      0.003        -0.573    -0.121
Vehicle_Use[T.DriveLong]     -0.3481      0.081     -4.272      0.000        -0.508    -0.188
Vehicle_Use[T.DriveShort]    -0.4903      0.081     -6.016      0.000        -0.650    -0.331
Vehicle_Use[T.Pleasure]      -0.5726      0.081     -7.027      0.000        -0.732    -0.413
=============================================================================================
"""
 

Quasi-Binomial Model in SAS

Similar to quasi-Poisson regressions, quasi-binomial regressions try to address the excessive variance by including a dispersion parameter. In addition to addressing over-dispersion, quasi-binomial regressions also demonstrate extra value in other areas, such as LGD model development in credit risk modeling, due to their flexible distributional assumption.

Measured as the ratio between NCO (net charge-off) and GCO (gross charge-off), LGD could take any value in the range [0, 1], and there is currently no unanimous consensus in the industry on its distributional assumption. An advantage of quasi-binomial regression is that it makes no assumption of a specific distribution but merely specifies the conditional mean for a given model response. The trade-off is the lack of likelihood-based measures such as AIC and BIC.

Below is a demonstration on how to estimate a quasi-binomial model with GLIMMIX procedure in SAS.

proc glimmix data = _last_;
  model y = age number start / link = logit solution;
  _variance_ = _mu_ * (1-_mu_);
  random _residual_;
run;  
/*
              Model Information
Data Set                     WORK.KYPHOSIS   
Response Variable            y               
Response Distribution        Unknown         
Link Function                Logit           
Variance Function            _mu_ * (1-_mu_) 
Variance Matrix              Diagonal        
Estimation Technique         Quasi-Likelihood
Degrees of Freedom Method    Residual        

                       Parameter Estimates 
                         Standard
Effect       Estimate       Error       DF    t Value    Pr > |t|
Intercept     -2.0369      1.3853       77      -1.47      0.1455
age           0.01093    0.006160       77       1.77      0.0800
number         0.4106      0.2149       77       1.91      0.0598
start         -0.2065     0.06470       77      -3.19      0.0020
Residual       0.9132           .        .        .         .    
*/

For comparison purposes, the same model is also estimated with the R glm() function below, showing identical outputs.

summary(glm(data = kyphosis, Kyphosis ~ ., family = quasibinomial))
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)   
#(Intercept) -2.03693    1.38527  -1.470  0.14552   
#Age          0.01093    0.00616   1.774  0.07996 . 
#Number       0.41060    0.21489   1.911  0.05975 . 
#Start       -0.20651    0.06470  -3.192  0.00205 **
#---
#(Dispersion parameter for quasibinomial family taken to be 0.913249)

SAS Macro for Engle-Granger Co-integration Test

In the coursework of time series analysis, we've been taught that a time series regression of Y on X is valid only when both X and Y are stationary, due to the so-called "spurious regression problem". However, there is one exception: if X and Y, albeit non-stationary, share a common trend such that their trends cancel each other out, then X and Y are co-integrated and the regression of Y on X is valid. As a result, it is important to test for co-integration between X and Y.

Following the definition of co-integration, it is straightforward to formulate a procedure for the co-integration test. First of all, construct a linear combination of Y and X such that e = Y – (a + b * X). Secondly, test whether e is stationary with the ADF test. If e is stationary, then X and Y are co-integrated. This two-stage procedure is also called the Engle-Granger co-integration test, sketched in R below before the SAS implementation.
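Below is a minimal R sketch of the two-stage logic on simulated co-integrated series, assuming the tseries package is available. Note that, strictly speaking, the Engle-Granger test uses its own critical values rather than the standard ADF ones, so the sketch is only a quick approximation.

library(tseries)
set.seed(1)
x <- cumsum(rnorm(200))          # a random walk
y <- 2 + 0.5 * x + rnorm(200)    # y shares the stochastic trend of x
e <- residuals(lm(y ~ x))        # stage 1: the linear combination e = y - (a + b * x)
adf.test(e)                      # stage 2: stationary residuals suggest co-integration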

Below is a SAS macro implementing Engle-Granger co-integration test to show the long-term relationship between GDP and other macro-economic variables, e.g. Personal Consumption and Personal Disposable Income.

SAS Macro

%macro eg_coint(data = , y = , xs = );
*********************************************************************;
* THIS SAS MACRO IMPLEMENTS THE ENGLE-GRANGER COINTEGRATION TEST IN *;
* A BATCH MODE TO PROCESS MANY TIME SERIES                          *;
*********************************************************************;
* INPUT PARAMETERS:                                                 *;
*   DATA: A INPUT SAS DATASET                                       *;
*   Y   : A DEPENDENT VARIABLE IN THE COINTEGRATION REGRESSION      *;
*   XS  : A LIST OF INDEPENDENT VARIABLES IN THE COINTEGRATION      *;
*         REGRESSION                                                *;
*********************************************************************;
* AUTHOR: WENSUI.LIU@53.COM                                         *;
*********************************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*";

%local sig loop;

%let sig = 0.1;

%let loop = 1;

%do %while (%scan(&xs, &loop) ne %str());

  %let x = %scan(&xs, &loop);

  ods listing close;
  ods output FitStatistics = _fit;
  proc reg data = &data;
    model &y = &x;
    output out = _1 residual = r;
  run;
  quit;

  proc sql noprint;
    select cvalue2 into :r2 from _fit where upcase(label2) = "R-SQUARE";
  quit;

  proc arima data = _1;
    ods output stationaritytests = _adf1 (where = (upcase(type) = "ZERO MEAN" and lags = 1) drop = rho probrho fvalue probf);
    identify var = r stationarity = (adf = 1);
  run;
  quit;
  ods listing;

  %if &loop = 1 %then %do;
    data _adf;
      format vars $32. lterm_r2 best12. flg_coint $3.;
      set _adf1 (drop = type lags);
      vars = upcase("&x");
      lterm_r2 = &r2;
      if probtau < &sig then flg_coint = "YES";
      else flg_coint = "NO";
    run;
  %end;
  %else %do;
    data _adf;
      set _adf _adf1 (in = new drop = type lags);
      if new then do;
        vars = upcase("&x");
        lterm_r2 = &r2;
        if probtau < &sig then flg_coint = "YES";
          else flg_coint = "NO";
      end;
    run;
  %end;
    
  %let loop = %eval(&loop + 1);
%end;

proc sort data = _last_;
  by descending flg_coint probtau;
run;

proc report data = _last_ box spacing = 1 split = "/" nowd;
  COLUMN("ENGLE-GRANGER COINTEGRATION TEST BETWEEN %UPCASE(&Y) AND EACH VARIABLE BELOW/ "
         vars lterm_r2 flg_coint tau probtau);
  define vars      / "VARIABLES"                      width = 35;
  define lterm_r2  / "LONG-RUN/R-SQUARED"             width = 15 format =  9.4 center;
  define flg_coint / "COINTEGRATION/FLAG"             width = 15 center;
  define tau       / "TAU STATISTIC/FOR ADF TEST"     width = 20 format = 15.4;
  define probtau   / "P-VALUE FOR/ADF TEST"           width = 15 format =  9.4 center;
run;

%mend eg_coint;

%eg_coint(data = sashelp.citiqtr, y = gdp, xs = gyd gc);

SAS Output

----------------------------------------------------------------------------------------------------------
|                  ENGLE-GRANGER COINTEGRATION TEST BETWEEN GDP AND EACH VARIABLE BELOW                  |
|                                                                                                        |
|                                       LONG-RUN      COINTEGRATION         TAU STATISTIC   P-VALUE FOR  |
|VARIABLES                              R-SQUARED         FLAG               FOR ADF TEST    ADF TEST    |
|--------------------------------------------------------------------------------------------------------|
|GC                                 |      0.9985   |      YES      |             -2.8651|      0.0051   |
|-----------------------------------+---------------+---------------+--------------------+---------------|
|GYD                                |      0.9976   |      YES      |             -1.7793|      0.0715   |
----------------------------------------------------------------------------------------------------------

From the output, it is interesting to see that GDP in the U.S. is driven more by Personal Consumption than by Personal Disposable Income.

SAS Macro to Test Stationarity in Batch

To determine whether a time series is stationary or has a unit root, three methods can be used:

A. The most intuitive way, which is also sufficient in most cases, is to eyeball the ACF (Autocorrelation Function) plot of the time series. An ACF pattern with a fast decay might imply a stationary series.
B. Statistical tests for unit roots, e.g. the ADF (Augmented Dickey–Fuller) or PP (Phillips–Perron) test, could be employed as well. With the null hypothesis of a unit root, a statistically significant outcome might suggest a stationary series.
C. In addition to the aforementioned tests for unit roots, statistical tests for stationarity, e.g. the KPSS (Kwiatkowski–Phillips–Schmidt–Shin) test, might be a useful complement. With the null hypothesis of stationarity, a statistically insignificant outcome might suggest a stationary series.

By testing for both the unit root and stationarity, the analyst should be able to gain a better understanding of the nature of a specific time series.
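For readers who prefer R, a rough equivalent of the same battery of tests can be assembled with the tseries package, as sketched below; x stands for any numeric time series, and the flag logic simply mirrors the one used in the SAS macro that follows.

# a minimal R sketch running ADF, PP, and KPSS tests on a single series
library(tseries)

stationary_tests <- function(x, sig = 0.1) {
  adf  <- adf.test(x)    # H0: unit root
  pp   <- pp.test(x)     # H0: unit root
  kpss <- kpss.test(x)   # H0: stationarity
  flag <- (adf$p.value < sig) | (pp$p.value < sig) | (kpss$p.value > sig)
  data.frame(adf_p = adf$p.value, pp_p = pp$p.value,
             kpss_p = kpss$p.value, stationary_flag = flag)
}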

The SAS macro below is a convenient wrapper of stationarity tests for many time series in the production environment. (Please note that this macro only works for SAS 9.2 or above.)

%macro stationary(data = , vars =);
***********************************************************;
* THIS SAS MACRO IS TO DO STATIONARITY TESTS FOR MANY     *;
* TIME SERIES                                             *;
* ------------------------------------------------------- *;
* INPUT PARAMETERS:                                       *;
*   DATA: A INPUT SAS DATASET                             *;
*   VARS: A LIST OF TIME SERIES                           *;
* ------------------------------------------------------- *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*";

%local sig loop;

%let sig = 0.1;

%let loop = 1;

%do %while (%scan(&vars, &loop) ne %str());

  %let x = %scan(&vars, &loop);

  proc sql noprint;
    select int(12 * ((count(&x) / 100) ** 0.25)) into :nlag1 from &data;

    select int(max(1, (count(&x) ** 0.5) / 5)) into :nlag2 from &data;
  quit;

  ods listing close;
  ods output kpss = _kpss (drop = model lags rename = (prob = probeta))
             adf = _adf  (drop = model lags rho probrho fstat probf rename = (tau = adf_tau probtau = adf_probtau))
             philperron = _pp  (drop = model lags rho probrho rename = (tau = pp_tau probtau = pp_probtau));
  proc autoreg data = &data;
    model &x = / noint stationarity = (adf = &nlag1, phillips = &nlag2, kpss = (kernel = nw lag = &nlag1));
  run;
  quit;
  ods listing;

  proc sql noprint;
    create table
      _1 as 
    select
      upcase("&x")           as vars length = 32,
      upcase(_adf.type)      as type,
      _adf.adf_tau,
      _adf.adf_probtau,
      _pp.pp_tau,
      _pp.pp_probtau,
      _kpss.eta,
      _kpss.probeta,
      case 
        when _adf.adf_probtau < &sig or _pp.pp_probtau < &sig or _kpss.probeta > &sig then "*"
        else " "
      end                    as _flg,
      &loop                  as _i,
      monotonic()            as _j
    from 
      _adf inner join _pp on _adf.type = _pp.type inner join _kpss on _adf.type = _kpss.type;
  quit;

  %if &loop = 1 %then %do;
    data _result;
      set _1;
    run;
  %end;
  %else %do;
    proc append base = _result data = _1;
    run;
  %end;

  proc datasets library = work nolist;
    delete _1 _adf _pp _kpss / memtype = data;
  quit;

  %let loop = %eval(&loop + 1);
%end;

proc sort data = _result;
  by _i _j;
run;

proc report data = _result box spacing = 1 split = "/" nowd;
  column("STATISTICAL TESTS FOR STATIONARITY/ "
         vars type adf_tau adf_probtau pp_tau pp_probtau eta probeta _flg);
  define vars        / "VARIABLES/ "                  width = 20 group order order = data;
  define type        / "TYPE/ "                       width = 15 order order = data;
  define adf_tau     / "ADF TEST/FOR/UNIT ROOT"       width = 10 format = 8.2;
  define adf_probtau / "P-VALUE/FOR/ADF TEST"         width = 10 format = 8.4 center;
  define pp_tau      / "PP TEST/FOR/UNIT ROOT"        width = 10 format = 8.2;
  define pp_probtau  / "P-VALUE/FOR/PP TEST"          width = 10 format = 8.4 center;
  define eta         / "KPSS TEST/FOR/STATIONARY"     width = 10 format = 8.2;
  define probeta     / "P-VALUE/FOR/KPSS TEST"        width = 10 format = 8.4 center;
  define _flg        / "STATIONARY/FLAG"              width = 10 center;
run;

%mend stationary;

Granger Causality Test

# READ QUARTERLY DATA FROM CSV
library(zoo)
ts1 <- read.zoo('Documents/data/macros.csv', header = T, sep = ",", FUN = as.yearqtr)

# CONVERT THE DATA TO STATIONARY TIME SERIES
ts1$hpi_rate <- log(ts1$hpi / lag(ts1$hpi))
ts1$unemp_rate <- log(ts1$unemp / lag(ts1$unemp))
ts2 <- ts1[1:nrow(ts1) - 1, c(3, 4)]

# METHOD 1: LMTEST PACKAGE
library(lmtest)
grangertest(unemp_rate ~ hpi_rate, order = 1, data = ts2)
# Granger causality test
#
# Model 1: unemp_rate ~ Lags(unemp_rate, 1:1) + Lags(hpi_rate, 1:1)
# Model 2: unemp_rate ~ Lags(unemp_rate, 1:1)
#   Res.Df Df      F  Pr(>F)
# 1     55
# 2     56 -1 4.5419 0.03756 *
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# METHOD 2: VARS PACKAGE
library(vars)
var <- VAR(ts2, p = 1, type = "const")
causality(var, cause = "hpi_rate")$Granger
#         Granger causality H0: hpi_rate do not Granger-cause unemp_rate
#
# data:  VAR object var
# F-Test = 4.5419, df1 = 1, df2 = 110, p-value = 0.0353

# AUTOMATICALLY SEARCH FOR THE MOST SIGNIFICANT RESULT
for (i in 1:4)
  {
  cat("LAG =", i)
  print(causality(VAR(ts2, p = i, type = "const"), cause = "hpi_rate")$Granger)
  }

Estimating Time Series Models for Count Outcomes with SAS

In SAS, there is no out-of-the-box procedure to estimate time series models for count outcomes similar to the one shown here (https://statcompute.wordpress.com/2015/03/31/modeling-count-time-series-with-tscount-package). However, as long as we understand the likelihood function of the Poisson distribution, it is straightforward to estimate such a model with PROC MODEL in the SAS/ETS module.

Below is a demonstration of how to estimate a Poisson time series model with the identity link function. As shown, the parameter estimates and the related inferences are extremely close to the ones estimated with the tscount package in R.

data polio;
  idx + 1;
  input y @@;
datalines;
0  1  0  0  1  3  9  2  3  5  3  5  2  2  0  1  0  1  3  3  2  1  1  5  0
3  1  0  1  4  0  0  1  6 14  1  1  0  0  1  1  1  1  0  1  0  1  0  1  0
1  0  1  0  1  0  1  0  0  2  0  1  0  1  0  0  1  2  0  0  1  2  0  3  1
1  0  2  0  4  0  2  1  1  1  1  0  1  1  0  2  1  3  1  2  4  0  0  0  1
0  1  0  2  2  4  2  3  3  0  0  2  7  8  2  4  1  1  2  4  0  1  1  1  3
0  0  0  0  1  0  1  1  0  0  0  0  0  1  2  0  2  0  0  0  1  0  1  0  1
0  2  0  0  1  2  0  1  0  0  0  1  2  1  0  1  3  6
;
run;

proc model data = polio;
  parms b0 = 0.5 b1 = 0.1 b2 = 0.1;
  yhat = b0 + b1 * zlag1(y) + b2 * zlag1(yhat);
  y = yhat;
  lk = exp(-yhat) * (yhat ** y) / fact(y);
  ll = -log(lk);
  errormodel y ~ general(ll);
  fit y / fiml converge = 1e-8;
run;

/* OUTPUT:
            Nonlinear Liklhood Summary of Residual Errors

                  DF      DF                                        Adj
Equation       Model   Error        SSE        MSE   R-Square      R-Sq
y                  3     165      532.6     3.2277     0.0901    0.0791

           Nonlinear Liklhood Parameter Estimates

                              Approx                  Approx
Parameter       Estimate     Std Err    t Value     Pr > |t|
b0              0.606313      0.1680       3.61       0.0004
b1              0.349495      0.0690       5.06       <.0001
b2              0.206877      0.1397       1.48       0.1405

Number of Observations       Statistics for System

Used               168    Log Likelihood    -278.6615
Missing              0
*/

Modeling Count Time Series with tscount Package

The example below shows how to estimate a simple univariate Poisson time series model with the tscount package. While the model estimation is straightforward and yields parameter estimates very similar to the ones generated with the acp package (https://statcompute.wordpress.com/2015/03/29/autoregressive-conditional-poisson-model-i), the prediction mechanism is a bit tricky.

1) For the in-sample and the 1-step-ahead predictions:

yhat_[t] = beta0 + beta1 * y_[t - 1] + beta2 * yhat_[t - 1]

2) For the out-of-sample predictions with the observed Y unavailable:

yhat_[t] = beta0 + beta1 * yhat_[t - 1] + beta2 * yhat_[t - 1]

library(tscount)

mdl <- tsglm(cnt$y, model = list(past_obs = 1, past_mean = 1), distr = "poisson")
summary(mdl)
# tsglm(ts = cnt$y, model = list(past_obs = 1, past_mean = 1), 
#     distr = "poisson")
#
# Coefficients:
#              Estimate  Std. Error
# (Intercept)     0.632      0.1774
# beta_1          0.350      0.0687
# alpha_1         0.184      0.1455
# Standard errors obtained by normal approximation.
#
# Link function: identity 
# Distribution family: poisson 
# Number of coefficients: 3 
# Log-likelihood: -279.2738 
# AIC: 564.5476 
# BIC: 573.9195 

### in-sample prediction ###
cnt$yhat <- mdl$fitted.values
tail(cnt, 3)
#     y      yhat
# 166 1 0.8637023
# 167 3 1.1404714
# 168 6 1.8918651

### manually check ###
beta <- mdl$coefficients
pv167 <- beta[1] + beta[2] * cnt$y[166] + beta[3] * cnt$yhat[166] 
#  1.140471
pv168 <- beta[1] + beta[2] * cnt$y[167] + beta[3] * cnt$yhat[167] 
#  1.891865 

### out-of-sample prediction ###
oot <- predict(mdl, n.ahead = 3)
# [1] 3.080667 2.276211 1.846767

### manually check ###
ov2 <- beta[1] + beta[2] * oot[[1]][1] + beta[3] * oot[[1]][1] 
#  2.276211
ov3 <- beta[1] + beta[2] * oot[[1]][2] + beta[3] * oot[[1]][2] 
#  1.846767

Autoregressive Conditional Poisson Model – I

Modeling time series of count outcomes is of interest in operational risk when forecasting the frequency of losses. Below is an example showing how to estimate a simple ACP(1, 1) model, i.e. Autoregressive Conditional Poisson, without covariates using the acp package.

library(acp)

### acp(1, 1) without covariates ###
mdl <- acp(y ~ -1, data = cnt)
summary(mdl)
# acp.formula(formula = y ~ -1, data = cnt)
#
#   Estimate   StdErr t.value   p.value    
# a 0.632670 0.169027  3.7430 0.0002507 ***
# b 0.349642 0.067414  5.1865 6.213e-07 ***
# c 0.184509 0.134154  1.3753 0.1708881    

### generate predictions ###
f <- predict(mdl)
pred <- data.frame(yhat = f, cnt)
tail(pred, 5)
#          yhat y
# 164 1.5396921 1
# 165 1.2663993 0
# 166 0.8663321 1
# 167 1.1421586 3
# 168 1.8923355 6

### calculate predictions manually ###
pv167 <- mdl$coef[1] + mdl$coef[2] * pred$y[166] + mdl$coef[3] * pred$yhat[166] 
# [1] 1.142159

pv168 <- mdl$coef[1] + mdl$coef[2] * pred$y[167] + mdl$coef[3] * pred$yhat[167] 
# [1] 1.892336

plot.ts(pred, main = "Predictions")


Estimating a Beta Regression with The Variable Dispersion in R

pkgs <- c('sas7bdat', 'betareg', 'lmtest')
lapply(pkgs, require, character.only = T)

df1 <- read.sas7bdat("lgd.sas7bdat")
df2 <- df1[which(df1$y < 1), ]

xvar <- paste("x", 1:7, sep = '', collapse = " + ")
fml1 <- as.formula(paste("y ~ ", xvar))
fml2 <- as.formula(paste("y ~ ", xvar, "|", xvar))

# FIT A BETA MODEL WITH THE FIXED PHI
beta1 <- betareg(fml1, data = df2)
summary(beta1)

# Coefficients (mean model with logit link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.500242   0.329670  -4.551 5.35e-06 ***
# x1           0.007516   0.026020   0.289 0.772680    
# x2           0.429756   0.135899   3.162 0.001565 ** 
# x3           0.099202   0.022285   4.452 8.53e-06 ***
# x4           2.465055   0.415657   5.931 3.02e-09 ***
# x5          -0.003687   0.001070  -3.446 0.000568 ***
# x6           0.007181   0.001821   3.943 8.06e-05 ***
# x7           0.128796   0.186715   0.690 0.490319    
#
# Phi coefficients (precision model with identity link):
#       Estimate Std. Error z value Pr(>|z|)    
# (phi)   3.6868     0.1421   25.95   <2e-16 ***

# FIT A BETA MODEL WITH THE VARIABLE PHI
beta2 <- betareg(fml2, data = df2)
summary(beta2)

# Coefficients (mean model with logit link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.996661   0.336445  -5.935 2.95e-09 ***
# x1           0.007033   0.026809   0.262 0.793072    
# x2           0.371098   0.135186   2.745 0.006049 ** 
# x3           0.133356   0.022624   5.894 3.76e-09 ***
# x4           2.951245   0.401493   7.351 1.97e-13 ***
# x5          -0.003475   0.001119  -3.105 0.001902 ** 
# x6           0.006528   0.001884   3.466 0.000529 ***
# x7           0.100274   0.190915   0.525 0.599424    
#
# Phi coefficients (precision model with log link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -0.454399   0.452302  -1.005 0.315072    
# x1           0.009119   0.035659   0.256 0.798150    
# x2           0.611049   0.188225   3.246 0.001169 ** 
# x3           0.092073   0.030678   3.001 0.002689 ** 
# x4           2.248399   0.579440   3.880 0.000104 ***
# x5          -0.002188   0.001455  -1.504 0.132704    
# x6          -0.000317   0.002519  -0.126 0.899847    
# x7          -0.166226   0.256199  -0.649 0.516457    

# LIKELIHOOD RATIO TEST TO COMPARE BOTH BETA MODELS
lrtest(beta1, beta2) 

# Likelihood ratio test
#
# Model 1: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7
# Model 2: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 | x1 + x2 + x3 + x4 + x5 + x6 + x7
#   #Df LogLik Df Chisq Pr(>Chisq)    
# 1   9 231.55                        
# 2  16 257.24  7 51.38  7.735e-09 ***

Simplex Model in R

R CODE

library(simplexreg)
library(foreign)

### http://fmwww.bc.edu/repec/bocode/k/k401.dta ###
data <- read.dta("/home/liuwensui/Documents/data/k401.dta")

mdl <- simplexreg(prate ~ mrate + totemp + age + sole|mrate + totemp + age + sole, type = "hetero", link = "logit", data = data, subset = prate < 1)

summary(mdl) 

R OUTPUT

simplexreg(formula = prate ~ mrate + totemp + age + sole | mrate + totemp + 
    age + sole, data = data, subset = prate < 1, type = "hetero", link = "logit")

standard Pearson residuals:
    Min      1Q  Median      3Q     Max 
-6.1724 -0.5369  0.0681  0.5379  2.2987 

Coefficients (mean model with logit link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  8.817e-01  4.036e-02  21.848  < 2e-16 ***
mrate        2.710e-01  4.880e-02   5.553 2.81e-08 ***
totemp      -8.454e-06  1.164e-06  -7.266 3.70e-13 ***
age          2.762e-02  2.702e-03  10.225  < 2e-16 ***
sole         1.079e-01  4.684e-02   2.304   0.0212 *  

Coefficients (dispersion model with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.668e+00  5.395e-02  30.918  < 2e-16 ***
mrate       8.775e-01  4.472e-02  19.621  < 2e-16 ***
totemp      7.432e-06  1.434e-06   5.182  2.2e-07 ***
age         2.816e-02  3.242e-03   8.688  < 2e-16 ***
sole        7.744e-01  5.966e-02  12.981  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Log-likelihood:  2370,  p-value: 0.4693177 
Deviance: 2711 
Number of Fisher Scoring iterations:  20 

SAS CODE & OUTPUT FOR COMPARISON

proc nlmixed data = one tech = trureg maxiter = 100;
  parms b0 = 0  b1 = 0  b2 = 0  b3 = 0  b4 = 0
        c0 = 2  c1 = 0  c2 = 0  c3 = 0  c4 = 0 ;
  xb = b0 + b1 * mrate + b2 * totemp + b3 * age + b4 * sole;
  xc = c0 + c1 * mrate + c2 * totemp + c3 * age + c4 * sole;
  mu_xb = 1 / (1 + exp(-xb));
  s2 = exp(xc);
  v = (prate * (1 - prate)) ** 3;
  d = (prate - mu_xb) ** 2 / (prate * (1 - prate) * mu_xb ** 2 * (1 - mu_xb) ** 2);
  lh = (2 * constant('pi') * s2 * v) ** (-0.5) * exp(-(2 * s2) ** (-1) * d);
  ll = log(lh);
  model prate ~ general(ll);
run;
/*
                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|    Alpha
b0            0.8817    0.03843   2711     22.94     <.0001     0.05
b1            0.2710    0.04540   2711      5.97     <.0001     0.05
b2          -8.45E-6    1.35E-6   2711     -6.26     <.0001     0.05
b3           0.02762   0.002588   2711     10.67     <.0001     0.05
b4            0.1079    0.04792   2711      2.25     0.0244     0.05
c0            1.6680    0.05490   2711     30.38     <.0001     0.05
c1            0.8775    0.07370   2711     11.91     <.0001     0.05
c2          7.431E-6   1.935E-6   2711      3.84     0.0001     0.05
c3           0.02816   0.003224   2711      8.73     <.0001     0.05
c4            0.7744    0.06194   2711     12.50     <.0001     0.05
*/

Multinomial Logit with Python

In [1]: import statsmodels.api as st

In [2]: iris = st.datasets.get_rdataset('iris', 'datasets')

In [3]: ### get the y 

In [4]: y = iris.data.Species

In [5]: print y.head(3)
0    setosa
1    setosa
2    setosa
Name: Species, dtype: object

In [6]: ### get the x 

In [7]: x = iris.data.ix[:, 0]

In [8]: x = st.add_constant(x, prepend = False)

In [9]: print x.head(3)
   Sepal.Length  const
0           5.1      1
1           4.9      1
2           4.7      1

In [10]: ### specify the model

In [11]: mdl = st.MNLogit(y, x)

In [12]: mdl_fit = mdl.fit()
Optimization terminated successfully.
         Current function value: 0.606893
         Iterations 8

In [13]: ### print model summary ###

In [14]: print mdl_fit.summary()
                          MNLogit Regression Results                          
==============================================================================
Dep. Variable:                Species   No. Observations:                  150
Model:                        MNLogit   Df Residuals:                      146
Method:                           MLE   Df Model:                            2
Date:                Fri, 23 Aug 2013   Pseudo R-squ.:                  0.4476
Time:                        22:22:58   Log-Likelihood:                -91.034
converged:                       True   LL-Null:                       -164.79
                                        LLR p-value:                 9.276e-33
=====================================================================================
Species=versicolor       coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Sepal.Length           4.8157      0.907      5.310      0.000         3.038     6.593
const                -26.0819      4.889     -5.335      0.000       -35.665   -16.499
--------------------------------------------------------------------------------------
Species=virginica       coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Sepal.Length          6.8464      1.022      6.698      0.000         4.843     8.850
const               -38.7590      5.691     -6.811      0.000       -49.913   -27.605
=====================================================================================

In [15]: ### marginal effects ###

In [16]: mdl_margeff = mdl_fit.get_margeff()

In [17]: print mdl_margeff.summary()
       MNLogit Marginal Effects      
=====================================
Dep. Variable:                Species
Method:                          dydx
At:                           overall
=====================================================================================
    Species=setosa      dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Sepal.Length          -0.3785      0.003   -116.793      0.000        -0.385    -0.372
--------------------------------------------------------------------------------------
Species=versicolor      dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Sepal.Length           0.0611      0.022      2.778      0.005         0.018     0.104
--------------------------------------------------------------------------------------
Species=virginica      dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Sepal.Length          0.3173      0.022     14.444      0.000         0.274     0.360
=====================================================================================

In [18]: ### aic and bic ###

In [19]: print mdl_fit.aic
190.06793279

In [20]: print mdl_fit.bic
202.110473966

Prototyping Multinomial Logit with R

Recently, I have been working on a new modeling proposal based on competing risks and need to prototype multinomial logit models in R. There are two R packages implementing multinomial logit models that I’ve tested, namely nnet and VGAM. Model outputs with the iris data are shown below.

data(iris)

### method 1: nnet package ###
library(nnet)
mdl1 <- multinom(Species ~ Sepal.Length, data = iris, model = TRUE)
summary(mdl1)

# Coefficients:
#            (Intercept) Sepal.Length
# versicolor   -26.08339     4.816072
# virginica    -38.76786     6.847957
#
# Std. Errors:
#            (Intercept) Sepal.Length
# versicolor    4.889635    0.9069211
# virginica     5.691596    1.0223867

### method 2: vgam package ### 
library(VGAM)
mdl2 <- vglm(Species ~ Sepal.Length, data = iris, multinomial(refLevel = 1)) 
summary(mdl2)

# Coefficients:
#                Estimate Std. Error z value
# (Intercept):1  -26.0819    4.88924 -5.3346
# (Intercept):2  -38.7590    5.69064 -6.8110
# Sepal.Length:1   4.8157    0.90683  5.3105
# Sepal.Length:2   6.8464    1.02222  6.6976

However, in my view, the above methods are not flexible enough for real-world problems. For instance, there is no off-the-shelf solution for variable selection in the above multinomial logit models. Instead of building one multinomial logit model, we might develop two separate binomial logit models to accomplish the same task.

### method 3: two binary logit models ### 
iris$y <- ifelse(iris$Species == 'setosa', 0, 1)
mdl31 <- glm(y ~ Sepal.Length, data = iris, subset = (Species != 'virginica'), family = binomial)
summary(mdl31)

#  Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept)   -27.831      5.434  -5.122 3.02e-07 ***
# Sepal.Length    5.140      1.007   5.107 3.28e-07 ***

mdl32 <- glm(y ~ Sepal.Length, data = iris, subset = (Species != 'versicolor'), family = binomial)
summary(mdl32)

# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept)   -38.547      9.557  -4.033 5.50e-05 ***
# Sepal.Length    6.805      1.694   4.016 5.91e-05 ***

As shown above, the third approach yields a similar set of parameter estimates with much simpler models.

Dispersion Models

Last week, I read an interesting article, “Dispersion Models in Regression Analysis” by Peter Song (http://www.pakjs.com/journals/25%284%29/25%284%299.pdf), which describes a class of models more general than the classic generalized linear models based on the error distribution.

A dispersion model can be defined by two parameters, a location parameter mu and a dispersion parameter sigma ^ 2, and has a very general form of probability function formulated as:
p(y; mu, sigma ^ 2) = {2 * pi * sigma ^ 2 * V(.)} ^ -0.5 * exp{-1 / (2 * sigma ^ 2) * D(.)}
where the variance function V(.) and the deviance function D(.) vary by distribution. For instance, in a Poisson model,
D(.) = 2 * (y * log(y / mu) - y + mu)
V(.) = mu
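As a quick numerical sanity check, which is not part of the original article, the R snippet below evaluates this density for the Poisson case with sigma ^ 2 = 1, using V(.) evaluated at the observed y as in the SAS code further down, and compares it against dpois(); the dispersion-model form is simply the Stirling-type (saddlepoint) approximation of the exact Poisson probability.

# compare the dispersion-model density against the exact Poisson pmf
mu <- 10
y  <- 1:30
d  <- 2 * (y * log(y / mu) - y + mu)        # deviance function D(.)
p_disp <- (2 * pi * y) ^ (-0.5) * exp(-d / 2)
round(cbind(y, exact = dpois(y, mu), approx = p_disp), 4)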

Below is a piece of SAS code estimating a Poisson model under both the classic error-distribution assumption and the dispersion-model assumption.

data one;
  do i = 1 to 1000;
    x = ranuni(i);
    y = ranpoi(i, exp(2 + x * 2 + rannor(1) * 0.1));
    output;
  end;
run;

*** fit a poisson model with classic GLM ***;
proc nlmixed data = one tech = trureg;
  parms b0 = 0 b1 = 0;
  mu = exp(b0 + b1 * x);
  ll = -mu + y * log(mu) - log(fact(y));
  model y ~ general(ll);
run;
/*
             Fit Statistics
-2 Log Likelihood                 6118.0
AIC (smaller is better)           6122.0
AICC (smaller is better)          6122.0
BIC (smaller is better)           6131.8

                                             Parameter Estimates
                         Standard
Parameter    Estimate       Error      DF    t Value    Pr > |t|     Alpha       Lower       Upper    Gradient
b0             2.0024     0.01757    1000     113.95      <.0001      0.05      1.9679      2.0369    5.746E-9
b1             1.9883     0.02518    1000      78.96      <.0001      0.05      1.9388      2.0377    1.773E-9
*/

*** fit a poisson model with dispersion probability ***;
*** proposed by Jorgensen in 1987                   ***;
proc nlmixed data = one tech = trureg;
  parms b0 = 0 b1 = 0 s2 = 1;
  mu = exp(b0 + b1 * x);
  d  = 2 * (y * log(y / mu) - y + mu);
  v  = y;
  lh = (2 * constant('pi') * s2 * v) **  (-0.5) * exp(-(2 * s2) ** (-1) * d);
  ll = log(lh);
  model y ~ general(ll);
run;
/*
             Fit Statistics
-2 Log Likelihood                 6066.2
AIC (smaller is better)           6072.2
AICC (smaller is better)          6072.2
BIC (smaller is better)           6086.9

                                             Parameter Estimates
                         Standard
Parameter    Estimate       Error      DF    t Value    Pr > |t|     Alpha       Lower       Upper    Gradient
b0             2.0024     0.02015    1000      99.37      <.0001      0.05      1.9629      2.0420    2.675E-6
b1             1.9883     0.02888    1000      68.86      <.0001      0.05      1.9316      2.0449    1.903E-6
s2             1.3150     0.05881    1000      22.36      <.0001      0.05      1.1996      1.4304    -0.00002
*/

Please note that although both methods yield the same parameter estimates, there are slight differences in the standard errors and therefore the t-values. In addition, despite one more estimated parameter in the model, AIC / BIC are even lower in the dispersion model.

Estimating Composite Models for Count Outcomes with FMM Procedure

Once upon a time when I learned SAS, it was still version 6.X. As an old dog with 10+ years of experience in SAS, I’ve been trying my best to keep up with the new tricks in each release of SAS. In SAS 9.3, my favorite novel feature in SAS/STAT is the FMM procedure to estimate finite mixture models.

In 2008 when I drafted “Count Data Models in SAS®” (www2.sas.com/proceedings/forum2008/371-2008.pdf), it was pretty cumbersome to specify the log likelihood function of a composite model for count outcomes, e.g. a hurdle Poisson or zero-inflated Poisson model. However, with the availability of the FMM procedure, estimating composite models has never been easier.

In the demonstration below, I am going to show a side-by-side comparison of how to estimate three types of composite models for count outcomes, including hurdle Poisson, zero-inflated Poisson, and finite mixture Poisson models, with the FMM and NLMIXED procedures respectively. As shown, both procedures provide identical model estimates.
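For readers who want an R cross-check of the first two specifications, the pscl package provides the hurdle() and zeroinfl() functions. The sketch below assumes the same credit-count data has been loaded into a data frame named tmp1 with lower-case variable names; note that the zero-part parameterization differs from the FMM mixing-probability model, so those coefficients are not directly comparable sign-for-sign.

# a hedged R sketch of the hurdle and zero-inflated Poisson models with pscl,
# assuming tmp1 mirrors the credit-count data used in the SAS examples below
library(pscl)

h <- hurdle(majordrg ~ age + acadmos + minordrg + logspend,
            data = tmp1, dist = "poisson", zero.dist = "binomial")
z <- zeroinfl(majordrg ~ age + acadmos + minordrg + logspend,
              data = tmp1, dist = "poisson")
summary(h)
summary(z)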

Hurdle Poisson Model

*** HURDLE POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = truncpoisson;
  model majordrg = / dist = constant;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8201.0
AIC  (smaller is better)      8221.0
AICC (smaller is better)      8221.0
BIC  (smaller is better)      8293.5

Parameter Estimates for 'Truncated Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.0706    0.3081    -6.72    <.0001
        1  AGE         0.01796  0.005482     3.28    0.0011
        1  ACADMOS    0.000852  0.000700     1.22    0.2240
        1  MINORDRG     0.1739   0.03441     5.05    <.0001
        1  LOGSPEND     0.1229   0.04219     2.91    0.0036

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -4.2309      0.1808     -23.40      <.0001
AGE           0.01694    0.003323       5.10      <.0001
ACADMOS      0.002240    0.000492       4.55      <.0001
MINORDRG       0.7653     0.03842      19.92      <.0001
LOGSPEND       0.2301     0.02683       8.58      <.0001
*/

*** HURDLE POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -4 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -2 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0	B2_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  p0 = 1 / (1 + exp_eta1);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  if majordrg = 0 then _prob_ = p0;
  else _prob_ = (1 - p0) * exp(-exp_eta2) * (exp_eta2 ** majordrg) / ((1 - exp(-exp_eta2)) * fact(majordrg));
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8201.0
AIC (smaller is better)           8221.0
AICC (smaller is better)          8221.0
BIC (smaller is better)           8293.5

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -4.2309     0.1808    1E4    -23.40     <.0001
B1_age          0.01694   0.003323    1E4      5.10     <.0001
B1_acadmos     0.002240   0.000492    1E4      4.55     <.0001
B1_minordrg      0.7653    0.03842    1E4     19.92     <.0001
B1_logspend      0.2301    0.02683    1E4      8.58     <.0001
============
B2_intercept    -2.0706     0.3081    1E4     -6.72     <.0001
B2_age          0.01796   0.005482    1E4      3.28     0.0011
B2_acadmos     0.000852   0.000700    1E4      1.22     0.2240
B2_minordrg      0.1739    0.03441    1E4      5.05     <.0001
B2_logspend      0.1229    0.04219    1E4      2.91     0.0036
*/

Zero-Inflated Poisson Model

*** ZERO-INFLATED POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = poisson;
  model majordrg = / dist = constant;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8147.9
AIC  (smaller is better)      8167.9
AICC (smaller is better)      8167.9
BIC  (smaller is better)      8240.5

Parameter Estimates for 'Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.2780    0.3002    -7.59    <.0001
        1  AGE         0.01956  0.006019     3.25    0.0012
        1  ACADMOS    0.000249  0.000668     0.37    0.7093
        1  MINORDRG     0.1176   0.02711     4.34    <.0001
        1  LOGSPEND     0.1644   0.03531     4.66    <.0001

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -1.9111      0.4170      -4.58      <.0001
AGE          -0.00082    0.008406      -0.10      0.9218
ACADMOS      0.002934    0.001085       2.70      0.0068
MINORDRG       1.4424      0.1361      10.59      <.0001
LOGSPEND      0.09562     0.05080       1.88      0.0598
*/

*** ZERO-INFLATED POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -2 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -2 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0	B2_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  p0 = 1 / (1 + exp_eta1);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  if majordrg = 0 then _prob_ = p0 + (1 - p0) * exp(-exp_eta2);
  else _prob_ = (1 - p0) * exp(-exp_eta2) * (exp_eta2 ** majordrg) / fact(majordrg);
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8147.9
AIC (smaller is better)           8167.9
AICC (smaller is better)          8167.9
BIC (smaller is better)           8240.5

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -1.9111     0.4170    1E4     -4.58     <.0001
B1_age         -0.00082   0.008406    1E4     -0.10     0.9219
B1_acadmos     0.002934   0.001085    1E4      2.70     0.0068
B1_minordrg      1.4424     0.1361    1E4     10.59     <.0001
B1_logspend     0.09562    0.05080    1E4      1.88     0.0598
============
B2_intercept    -2.2780     0.3002    1E4     -7.59     <.0001
B2_age          0.01956   0.006019    1E4      3.25     0.0012
B2_acadmos     0.000249   0.000668    1E4      0.37     0.7093
B2_minordrg      0.1176    0.02711    1E4      4.34     <.0001
B2_logspend      0.1644    0.03531    1E4      4.66     <.0001
*/

Two-Class Finite Mixture Poisson Model

*** TWO-CLASS FINITE MIXTURE POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = poisson k = 2;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8136.8
AIC  (smaller is better)      8166.8
AICC (smaller is better)      8166.9
BIC  (smaller is better)      8275.7

Parameter Estimates for 'Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.4449    0.3497    -6.99    <.0001
        1  AGE         0.02214  0.006628     3.34    0.0008
        1  ACADMOS    0.000529  0.000770     0.69    0.4920
        1  MINORDRG    0.05054   0.04015     1.26    0.2081
        1  LOGSPEND     0.2140   0.04127     5.18    <.0001
        2  Intercept   -8.0935    1.5915    -5.09    <.0001
        2  AGE         0.01150   0.01294     0.89    0.3742
        2  ACADMOS    0.004567  0.002055     2.22    0.0263
        2  MINORDRG     0.2638    0.6770     0.39    0.6968
        2  LOGSPEND     0.6826    0.2203     3.10    0.0019

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -1.4275      0.5278      -2.70      0.0068
AGE          -0.00277     0.01011      -0.27      0.7844
ACADMOS      0.001614    0.001440       1.12      0.2623
MINORDRG       1.5865      0.1791       8.86      <.0001
LOGSPEND     -0.06949     0.07436      -0.93      0.3501
*/

*** TWO-CLASS FINITE MIXTURE POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -2 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -8 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0 B2_logspend = 0
		B3_intercept = -1 B3_age = 0 B3_acadmos = 0 B3_minordrg = 0 B3_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  prob1 = exp(-exp_eta1) * exp_eta1 ** majordrg / fact(majordrg);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  prob2 = exp(-exp_eta2) * exp_eta2 ** majordrg / fact(majordrg);
  eta3 = B3_intercept + B3_age * age + B3_acadmos * acadmos + B3_minordrg * minordrg + B3_logspend * logspend;
  exp_eta3 = exp(eta3);
  p = exp_eta3 / (1 + exp_eta3);
  _prob_ = p * prob1 + (1 - p) * prob2;
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8136.8
AIC (smaller is better)           8166.8
AICC (smaller is better)          8166.9
BIC (smaller is better)           8275.7

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -2.4449     0.3497    1E4     -6.99     <.0001
B1_age          0.02214   0.006628    1E4      3.34     0.0008
B1_acadmos     0.000529   0.000770    1E4      0.69     0.4920
B1_minordrg     0.05054    0.04015    1E4      1.26     0.2081
B1_logspend      0.2140    0.04127    1E4      5.18     <.0001
============
B2_intercept    -8.0935     1.5916    1E4     -5.09     <.0001
B2_age          0.01150    0.01294    1E4      0.89     0.3742
B2_acadmos     0.004567   0.002055    1E4      2.22     0.0263
B2_minordrg      0.2638     0.6770    1E4      0.39     0.6968
B2_logspend      0.6826     0.2203    1E4      3.10     0.0020
============
B3_intercept    -1.4275     0.5278    1E4     -2.70     0.0068
B3_age         -0.00277    0.01011    1E4     -0.27     0.7844
B3_acadmos     0.001614   0.001440    1E4      1.12     0.2623
B3_minordrg      1.5865     0.1791    1E4      8.86     <.0001
B3_logspend    -0.06949    0.07436    1E4     -0.93     0.3501
*/

Disaggregating Annual Losses into Each Quarter

In loss forecasting, it is often necessary to disaggregate annual losses into quarters. The simplest method to convert a low-frequency time series into a high-frequency one is interpolation, such as the one implemented in the EXPAND procedure of SAS/ETS. In the example below, there is a series of annual loss projections from 2013 through 2016, and an interpolation by the natural spline is used to convert the annual losses into quarterly ones.
SAS Code:

data annual;
  input loss year mmddyy8.;
  format year mmddyy8.;
datalines;
19270175 12/31/13
18043897 12/31/14
17111193 12/31/15
17011107 12/31/16
;
run;

proc expand data = annual out = quarterly from = year to = quarter;
  id year;
  convert loss / observed = total method = spline(natural);
run;

proc sql;
select 
  year(year) as year, 
  sum(case when qtr(year) = 1 then loss else 0 end) as qtr1,
  sum(case when qtr(year) = 2 then loss else 0 end) as qtr2,
  sum(case when qtr(year) = 3 then loss else 0 end) as qtr3,
  sum(case when qtr(year) = 4 then loss else 0 end) as qtr4,
  sum(loss) as total
from
  quarterly
group by
  calculated year;
quit;

Output:

    year      qtr1      qtr2      qtr3      qtr4     total

    2013   4868536   4844486   4818223   4738931  19270175
    2014   4560049   4535549   4510106   4438194  18043897
    2015   4279674   4276480   4287373   4267666  17111193
    2016   4215505   4220260   4279095   4296247  17011107

While the mathematical interpolation is easy to implement, it might be difficult to justify and interpret from the business standpoint. In reality, there might be an assumption that the loss trend follows the movement of the macro-economy. Therefore, it might be advantageous to disaggregate annual losses into quarterly ones with the inclusion of one or more economic indicators. This approach can be implemented with the tempdisagg package in R. Below is a demo with the same loss data used above; however, the disaggregation of annual losses is now based upon a macro-economic indicator.
R Code:

library(tempdisagg)

loss <- c(19270175, 18043897, 17111193, 17011107)
loss.a <- ts(loss, frequency = 1, start = 2013)

econ <- c(7.74, 7.67, 7.62, 7.48, 7.32, 7.11, 6.88, 6.63, 6.41, 6.26, 6.12, 6.01, 5.93, 5.83, 5.72, 5.59)
econ.q <- ts(econ, frequency = 4, start = 2013)

summary(mdl <- td(loss.a ~ econ.q))
print(predict(mdl))

Output:

Call:
td(formula = loss.a ~ econ.q)

Residuals:
Time Series:
Start = 2013
End = 2016
Frequency = 1
[1]  199753 -234384 -199257  233888

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2416610     359064   6.730   0.0214 *
econ.q        308226      53724   5.737   0.0291 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

'chow-lin-maxlog' disaggregation with 'sum' conversion
4 low-freq. obs. converted to 16 high-freq. obs.
Adjusted R-squared: 0.9141      AR1-Parameter:     0 (truncated)
        Qtr1    Qtr2    Qtr3    Qtr4
2013 4852219 4830643 4815232 4772080
2014 4614230 4549503 4478611 4401554
2015 4342526 4296292 4253140 4219235
2016 4302864 4272041 4238136 4198067

In practice, if a simple and flexible solution is desired without the need for interpretation, then the mathematical interpolation might be a good choice. On the other hand, if there is a strong belief that the macro-economy might drive the loss trend, then the regression-based method implemented in the tempdisagg package might be preferred. In our example, however, both methods generate extremely similar results.

Lagrange Multiplier (LM) Test for Over-Dispersion

While Poisson regression is often used as a baseline model for count data, its assumption of equi-dispersion is too restrictive for many empirical applications. In practice, the variance of observed count data usually exceeds the mean, namely over-dispersion, due to unobserved heterogeneity and/or excess zeroes. With consequences similar to those of heteroskedasticity in linear regression, over-dispersion in a Poisson regression will deflate the standard errors of parameter estimates and therefore inflate the t-statistics. After developing a Poisson regression, it is always sound practice to run an additional analysis for over-dispersion.

Below is a SAS macro to test for over-dispersion based upon the Lagrange Multiplier (LM) test introduced by William Greene (2002) in his famous “Econometric Analysis”. The statistic follows a chi-square distribution with 1 degree of freedom, and the null hypothesis implies equi-dispersion in the outcomes from the tested Poisson model.
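Since the statistic only depends on the observed counts and the fitted means, it is also trivial to compute outside of SAS; the short R function below mirrors the IML logic of the macro, where y is the observed count outcome and lambda the vector of predicted values from a Poisson model.

# a small R sketch of Greene's LM test for over-dispersion
lm_overdisp <- function(y, lambda) {
  e <- y - lambda
  n <- length(y)
  lm_stat <- (sum(e ^ 2) - n * mean(y)) ^ 2 / (2 * sum(lambda ^ 2))
  c(LM = lm_stat, p_value = 1 - pchisq(lm_stat, df = 1))
}
# e.g. fit <- glm(y ~ x, family = poisson); lm_overdisp(y, fitted(fit))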

%macro lm(data = , y = , pred_y = );
***************************************************;
* This macro is to test the over-dispersion based *;
* on outcomes from a poisson model                *;
*                            -- wensui.liu@53.com *;
***************************************************;
* parameters:                                     *;
*  data  : the input dataset                      *;
*  y     : observed count outcome                 *;
*  pred_y: predicted outcome from poisson model   *;
***************************************************;
* reference:                                      *;
*  w. greene (2002), econometric analysis         *;
***************************************************;

proc iml;
  use &data;
  read all var {&y} into y;
  read all var {&pred_y} into lambda;
  close &data;

  e = (y - lambda);
  n = nrow(y);
  ybar = y`[, :];
  LM = (e` * e - n * ybar) ** 2 / (2 * lambda` * lambda);
  Pvalue = 1 - probchi(LM, 1);
  title 'LM TEST FOR OVER-DISPERSION';
  print LM Pvalue;
  title;
quit;

***************************************************;
*                 end of macro                    *;
***************************************************;
%mend lm;

Next, a use case of the aforementioned LM test is demonstrated. First of all, a vector of Poisson outcomes is simulated with 10% excess zeros and therefore over-dispersion.

*** SIMULATE A POISSON VECTOR WITH EXCESSIVE ZEROS ***;
data one;
  do i = 1 to 1000;
    x = ranuni(i);
    if i <= 900 then y = ranpoi(i, exp(x * 2));
    else y = 0;
    output;
  end;
run;

A Poisson regression is estimated with the simulated count outcomes including the excess zeros. After the calculation of predicted values, the LM test is used to check for over-dispersion. As shown below, the null hypothesis of equi-dispersion is rejected with LM-stat = 31.18.

*** TEST DISPERSION WITH EXCESSIVE ZEROS ***;
ods listing close;
proc genmod data = one;
  model y =  x / dist = poisson;
  output out = out1 p = predicted;
run;
ods listing;

%lm(data = out1, y = y, pred_y = predicted);
/*
LM TEST FOR OVER-DISPERSION

       LM    PVALUE

31.182978 2.3482E-8
*/

Another Poisson regression is also estimated with the simulated count outcomes excluding the 10% excess zeros. As expected, with outcomes from this newly estimated Poisson model, the null hypothesis of equi-dispersion is not rejected.

*** TEST DISPERSION WITHOUT EXCESSIVE ZEROS ***;
ods listing close;
proc genmod data = one;
  where i <= 900;
  model y =  x / dist = poisson;
  output out = out2 p = predicted;
run;
ods listing;

%lm(data = out2, y = y, pred_y = predicted);
/*
LM TEST FOR OVER-DISPERSION

       LM    PVALUE

 0.052131 0.8193959
*/

How to Score Outcomes from Count Models

When calculating the prediction from a count model, many people like to use the expected mean directly. However, from the business standpoint, it might be more appealing to calculate the probability of a specific count outcome. For instance, in retail banking, it is often of interest to know the probability of an account having one or more delinquencies and then to convert this probability to a certain score point. A widely accepted practice is to develop a logistic regression predicting the delinquent account, e.g. Y = 1 for delinquencies >= 1. However, it is also possible to develop a count model, e.g. negative binomial, predicting the number of delinquencies and then to estimate the probability of one or more delinquencies given the expected mean.
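To make the idea concrete, the short R sketch below, which is not part of the original post, converts the expected count from a negative binomial model into the probability of one or more events; k denotes the estimated dispersion parameter, and the value 4.0362 simply echoes the one hard-coded in the SAS example further down.

# Pr(Y >= 1) = 1 - Pr(Y = 0) under a negative binomial with mean yhat
prob_ge1 <- function(yhat, k = 4.0362) {
  1 - dnbinom(0, size = 1 / k, mu = yhat)
}
prob_ge1(c(0.1, 0.5, 1.0))    # larger expected counts give larger probabilities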

In the demonstration below, a scoring scheme for count models is shown. From the output, it is clear that the predictiveness of the negative binomial model is comparable to that of the logistic model in terms of the KS and AUC statistics.

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

libname data 'C:\Users\liuwensui\projects\data';

%include 'C:\Users\liuwensui\projects\code\ks_macro.sas';

data tmp1;
  set data.credit_count;
  if majordrg = 0 then bad = 0;
  else bad = 1;
run;
    
proc logistic data = tmp1 desc;
  model bad = AGE ACADMOS ADEPCNT MINORDRG OWNRENT EXP_INC;
  score data = tmp1 out = logit_out1(rename = (p_1 = logit_prob1));
run;

proc genmod data = tmp1;
  model majordrg = AGE ACADMOS ADEPCNT MINORDRG OWNRENT EXP_INC / dist = nb;
  output out = nb_out1 p = yhat;
run;

data nb_out1;
  set nb_out1;
  nb_prob1 = 1 - pdf('negbinomial', 0, (1 / 4.0362) / (Yhat + (1 / 4.0362)), (1 / 4.0362));
run;

%separation(data = logit_out1, score = logit_prob1, y = bad);
/*
                            GOOD BAD SEPARATION REPORT FOR LOGIT_PROB1 IN DATA LOGIT_OUT1                           
                                     MAXIMUM KS = 35.5049 AT SCORE POINT 0.1773                                     
                     ( AUC STATISTICS = 0.7373, GINI COEFFICIENT = 0.4747, DIVERGENCE = 0.6511 )                    
                                                                                                                    
          MIN        MAX           GOOD        BAD      TOTAL               BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #       ODDS    RATE      BAD RATE  PERCENT      PERCENT
 -------------------------------------------------------------------------------------------------------------------
  BAD     0.3369     0.9998         557        787      1,344       0.71   58.56%      58.56%    30.73%      30.73% 
   |      0.2157     0.3369         944        401      1,345       2.35   29.81%      44.18%    15.66%      46.39% 
   |      0.1802     0.2157       1,039        305      1,344       3.41   22.69%      37.02%    11.91%      58.30% 
   |      0.1619     0.1802       1,099        246      1,345       4.47   18.29%      32.34%     9.61%      67.90% 
   |      0.1489     0.1619       1,124        220      1,344       5.11   16.37%      29.14%     8.59%      76.49% 
   |      0.1383     0.1489       1,171        174      1,345       6.73   12.94%      26.44%     6.79%      83.29% 
   |      0.1255     0.1383       1,213        131      1,344       9.26    9.75%      24.06%     5.12%      88.40% 
   |      0.1109     0.1255       1,254         91      1,345      13.78    6.77%      21.89%     3.55%      91.96% 
   V      0.0885     0.1109       1,246         98      1,344      12.71    7.29%      20.27%     3.83%      95.78% 
 GOOD     0.0001     0.0885       1,236        108      1,344      11.44    8.04%      19.05%     4.22%     100.00% 
       ========== ========== ========== ========== ==========                                                       
          0.0001     0.9998      10,883      2,561     13,444                                                       
*/
    
%separation(data = nb_out1, score = nb_prob1, y = bad);
/*
                               GOOD BAD SEPARATION REPORT FOR NB_PROB1 IN DATA NB_OUT1                              
                                     MAXIMUM KS = 35.8127 AT SCORE POINT 0.2095                                     
                     ( AUC STATISTICS = 0.7344, GINI COEFFICIENT = 0.4687, DIVERGENCE = 0.7021 )                    
                                                                                                                    
          MIN        MAX           GOOD        BAD      TOTAL               BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #       ODDS    RATE      BAD RATE  PERCENT      PERCENT
 -------------------------------------------------------------------------------------------------------------------
  BAD     0.2929     0.8804         561        783      1,344       0.72   58.26%      58.26%    30.57%      30.57% 
   |      0.2367     0.2929         944        401      1,345       2.35   29.81%      44.03%    15.66%      46.23% 
   |      0.2117     0.2367       1,025        319      1,344       3.21   23.74%      37.27%    12.46%      58.69% 
   |      0.1947     0.2117       1,106        239      1,345       4.63   17.77%      32.39%     9.33%      68.02% 
   |      0.1813     0.1947       1,131        213      1,344       5.31   15.85%      29.08%     8.32%      76.34% 
   |      0.1675     0.1813       1,191        154      1,345       7.73   11.45%      26.14%     6.01%      82.35% 
   |      0.1508     0.1675       1,208        136      1,344       8.88   10.12%      23.86%     5.31%      87.66% 
   |      0.1298     0.1508       1,247         98      1,345      12.72    7.29%      21.78%     3.83%      91.49% 
   V      0.0978     0.1297       1,242        102      1,344      12.18    7.59%      20.21%     3.98%      95.47% 
 GOOD     0.0000     0.0978       1,228        116      1,344      10.59    8.63%      19.05%     4.53%     100.00% 
       ========== ========== ========== ========== ==========                                                       
          0.0000     0.8804      10,883      2,561     13,444                                                       
*/

A SAS Macro for Breusch-Pagan Test

In SAS, the Breusch-Pagan test for heteroscedasticity in a linear regression can be conducted with the MODEL procedure in SAS/ETS, as shown in the code snippet below.

data one;
  do i = 1 to 100;
    x1 = uniform(1);
    x2 = uniform(2);
    r  = normal(1) * 0.1;
    if x2 > 0.5 then r = r * 2;
    y = 10 + 8 * x1 + 6 * x2 + r;
    output;
  end;
run;

proc model data = one;
  parms b0 b1 b2;
  y = b0 + b1 * x1 + b2 * x2;
  fit y / breusch = (1 x1 x2);
run;
/*
                                 Heteroscedasticity Test
Equation        Test               Statistic     DF    Pr > ChiSq    Variables

y               Breusch-Pagan          10.44      2        0.0054    1, x1, x2           
*/

However, in a forecasting model that I have recently been working on, I find it inconvenient to use “proc model” every time I want to run the Breusch-Pagan test. I would rather prefer a more generic solution that is not tied to a specific SAS module or procedure and that only needs a minimal set of inputs instead of a fully specified model. As a result, I drafted a simple SAS macro for the Breusch-Pagan test, which gives results identical to the ones from the MODEL procedure. Hopefully, others might find this macro useful as well.
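Incidentally, the same n * R-squared logic is easy to reproduce in R as a cross-check before looking at the macro; the sketch below simulates data in the same way as the SAS example above and compares the manual statistic against the studentized Breusch-Pagan statistic reported by lmtest::bptest(), which is based on the same auxiliary regression.

# a minimal R sketch of the n * R-squared version of the Breusch-Pagan test
library(lmtest)

set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
r  <- rnorm(n) * 0.1 * ifelse(x2 > 0.5, 2, 1)
y  <- 10 + 8 * x1 + 6 * x2 + r

fit  <- lm(y ~ x1 + x2)
aux  <- lm(residuals(fit) ^ 2 ~ x1 + x2)     # auxiliary regression
chi2 <- n * summary(aux)$r.squared           # n * R-squared
c(chi_square = chi2, p_value = 1 - pchisq(chi2, df = 2))

bptest(fit)                                  # should agree with the manual result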

proc reg data = one;
  model y = x1 x2;
  output out = two r = resid;
run;

%macro hetero_bp(r = , x = , data = );
***********************************************************;
* THE SAS MACRO IS TO CALCULATE BREUSCH-PAGAN TEST FOR    *
* HETEROSKEDASTICITY                                      *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  R   : RESIDUAL VALUES FROM A LINEAR REGRESSION         *;
*  X   : A LIST OF NUMERIC VARIABLES TO MODEL ERROR       *;
*        VARIANCE IN BREUSCH-PAGAN TEST                   *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;
    
data _data_(keep = r2 &x);
  set &data;
  where &r ~= .;
  r2 = &r ** 2;
run;

ods output nobs = _nobs_;
ods output anova = _anova_;
ods output fitstatistics = _fits_;
ods listing close;
proc reg data = _last_;
  model r2 = &x;
run;
ods listing;

proc sql noprint;
  select distinct NObsUsed into :n from _nobs_;
  select df into :df from _anova_ where upcase(compress(source, ' ')) = 'MODEL';
  select nvalue2 into :r2 from _fits_ where upcase(compress(label2, ' ')) = 'R-SQUARE';
quit;
%put &r2;

data _result_;
  chi_square = &r2 * &n;
  df         = &df;
  p_value    = 1 - probchi(chi_square, df);
run;

proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY * "
          chi_square df p_value);
  define chi_square  / "CHI-SQUARE"  width = 15;
  define df          / "DF"          width = 5;
  define p_value     / "P-VALUE"     width = 15 format = 12.8;
run;

proc datasets library = work;
  delete _: / memtype = data;
run;  
      
%mend hetero_bp;

%hetero_bp(r = resid, x = x1 x2, data = two);
/*
 BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY
                                          
      CHI-SQUARE    DF         P-VALUE    
 -----------------------------------------
         10.4389     2      0.00541030    
*/
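
For a quick sanity check outside of SAS, the same n * R-square logic can be reproduced in R in a few lines. Below is a minimal sketch assuming simulated data comparable to the SAS example above and assuming the lmtest package is available; the variable names are hypothetical.

set.seed(1)
x1 <- runif(100)
x2 <- runif(100)
r  <- rnorm(100) * 0.1
r  <- ifelse(x2 > 0.5, r * 2, r)
y  <- 10 + 8 * x1 + 6 * x2 + r

fit <- lm(y ~ x1 + x2)
# hand-rolled Breusch-Pagan: n times the R-square of the auxiliary regression
aux <- lm(residuals(fit) ^ 2 ~ x1 + x2)
chi <- length(y) * summary(aux)$r.squared
c(chi_square = chi, p_value = 1 - pchisq(chi, df = 2))
# packaged version, with studentize = FALSE to match the classical statistic
lmtest::bptest(fit, ~ x1 + x2, studentize = FALSE)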

About Cointegration

In the stat workshop supporting loss forecasting, my analysts and I are frequently asked to quantify the “correlation” between time series. In the summary below, I will briefly describe a statistical concept other than correlation, namely cointegration, to characterize the relationship between time series.

In empirical finance, it is a popular practice to use correlation to describe the relationship between multiple time series. However, this approach has been criticized in that a relationship might be wrongly inferred due to the existence of other latent causal factors. In this case, cointegration, proposed by Engle and Granger (1987), becomes an alternative way to characterize the correlated nature of time series.

In layman’s terms, cointegration describes whether two or more time series are moving with a common trend. In the statistical definition, assuming two time series X_t and Y_t are individually integrated of order one, i.e. I(1), if a linear combination of X_t and Y_t, e.g. Z_t = X_t + B * Y_t, is stationary, i.e. I(0), then the two series X_t and Y_t are said to be co-integrated. Since the idea of co-integration is concerned with a co-movement / long-term equilibrium among multiple time series, it is also used to test the Efficient Market Hypothesis (EMH) in econometrics.

In the cointegration analysis, most practices fall into two major categories, either the minimization of certain variances or the maximization of certain correlations. For instance, the single-equation approach, such as the one suggested by Engle and Granger (1987), looks for the linear combination of X_t and Y_t with minimum variance and therefore belongs to the first category. On the other hand, the reduced-rank system-based approach, such as the one proposed by Johansen (1988), belongs to the second category in that it looks for the linear combination of X_t and Y_t with maximum correlation.

Following the logic outlined by Engle and Granger, it is straightforward to formulate an augmented Dickey-Fuller (ADF) test for the cointegration analysis between two time series, although other unit-root tests such as Phillips–Perron or Durbin–Watson should also suffice. Given X_t and Y_t integrated of order one, the first step is to estimate a simple linear regression Y_t = a + B * X_t + e_t, in which e_t is just the residual term. Afterwards, the ADF test is used to check for the existence of a unit root in e_t. If the unit-root hypothesis for e_t is rejected, then e_t is I(0) and therefore stationary, implying that X_t and Y_t are co-integrated. Otherwise, co-integration between X_t and Y_t cannot be concluded. This approach is attractive in that it is extremely easy to implement and understand. However, this method is only appropriate for a system with two time series and one possible cointegrating relationship.
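
To make the two-step logic concrete, below is a minimal R sketch on simulated series, assuming the tseries package is available. Note that adf.test() reports standard Dickey-Fuller p-values, while a formal Engle-Granger test would use critical values adjusted for the fact that the residuals come from an estimated cointegrating regression.

library(tseries)
set.seed(2013)
x <- cumsum(rnorm(500))           # an I(1) series by construction
y <- 1 + 0.8 * x + rnorm(500)     # co-integrated with x by construction
e <- residuals(lm(y ~ x))         # step 1: the cointegrating regression
adf.test(e)                       # step 2: unit-root test on the residuals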

The Johansen test for cointegration is a maximum likelihood estimation procedure based on the Vector Autoregressive (VAR) model that allows for dynamic interactions between two or more series and is therefore more general than the previous approach. Consider a VAR model with order p such that Y_t = A_1 * Y_t-1 + … + A_p * Y_t-p + e_t, where Y_t is a vector of variables integrated of order one and e_t is a vector of innovations. Without loss of generality, the VAR can be re-written as delta_Y_t = PI * Y_t-1 + SUM[GAMMA_i * delta_Y_t-i]. The whole idea of the Johansen test is to decompose PI into two n by r matrices, α and β, such that PI = α * β` and β` * Y_t is stationary. r is the number of co-integrating relations (the cointegrating rank) and each column of β is a cointegrating vector. A likelihood ratio test can be formulated to test the null hypothesis of at most r co-integrating vectors against the alternative hypothesis of n cointegrating vectors. While Johansen’s method is a more powerful test for cointegration, the drawback is a more complicated implementation and interpretation.
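
For the Johansen procedure, a minimal sketch with the urca package on the same simulated x and y might look like the one below; the trace statistics are compared against the tabulated critical values to pick the cointegrating rank r.

library(urca)
jo <- ca.jo(cbind(y, x), type = "trace", ecdet = "const", K = 2)
summary(jo)   # trace statistics and critical values
jo@V          # estimated cointegrating vectors, i.e. the columns of beta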

Another Way to Access R from Python – PypeR

Different from RPy2, PypeR provides another simple way to access R from Python through pipes (http://www.jstatsoft.org/v35/c02/paper). This handy feature enables data analysts to do the data munging with python and the statistical analysis with R by passing objects interactively between two computing systems.

Below is a simple demonstration on how to call R within Python through PypeR, estimate a Beta regression, and then return the model prediction from R back to Python.

In [1]: # LOAD PYTHON PACKAGES

In [2]: import pandas as pd

In [3]: import pyper as pr

In [4]: # READ DATA

In [5]: data = pd.read_table("/home/liuwensui/Documents/data/csdata.txt", header = 0)

In [6]: # CREATE A R INSTANCE WITH PYPER

In [7]: r = pr.R(use_pandas = True)

In [8]: # PASS DATA FROM PYTHON TO R

In [9]: r.assign("rdata", data)

In [10]: # SHOW DATA SUMMARY

In [11]: print r("summary(rdata)")
try({summary(rdata)})
    LEV_LT3           TAX_NDEB           COLLAT1           SIZE1       
 Min.   :0.00000   Min.   :  0.0000   Min.   :0.0000   Min.   : 7.738  
 1st Qu.:0.00000   1st Qu.:  0.3494   1st Qu.:0.1241   1st Qu.:12.317  
 Median :0.00000   Median :  0.5666   Median :0.2876   Median :13.540  
 Mean   :0.09083   Mean   :  0.8245   Mean   :0.3174   Mean   :13.511  
 3rd Qu.:0.01169   3rd Qu.:  0.7891   3rd Qu.:0.4724   3rd Qu.:14.751  
 Max.   :0.99837   Max.   :102.1495   Max.   :0.9953   Max.   :18.587  
     PROF2              GROWTH2             AGE              LIQ         
 Min.   :0.0000158   Min.   :-81.248   Min.   :  6.00   Min.   :0.00000  
 1st Qu.:0.0721233   1st Qu.: -3.563   1st Qu.: 11.00   1st Qu.:0.03483  
 Median :0.1203435   Median :  6.164   Median : 17.00   Median :0.10854  
 Mean   :0.1445929   Mean   : 13.620   Mean   : 20.37   Mean   :0.20281  
 3rd Qu.:0.1875148   3rd Qu.: 21.952   3rd Qu.: 25.00   3rd Qu.:0.29137  
 Max.   :1.5902009   Max.   :681.354   Max.   :210.00   Max.   :1.00018  
     IND2A            IND3A            IND4A             IND5A        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :1.0000   Median :0.0000   Median :0.00000   Median :0.00000  
 Mean   :0.6116   Mean   :0.1902   Mean   :0.02692   Mean   :0.09907  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  


In [12]: # LOAD R PACKAGE

In [13]: r("library(betareg)")
Out[13]: 'try({library(betareg)})\nLoading required package: Formula\n'

In [14]: # ESTIMATE A BETA REGRESSION

In [15]: r("m <- betareg(LEV_LT3 ~ SIZE1 + PROF2 + GROWTH2 + AGE + IND3A, data = rdata, subset = LEV_LT3 > 0)")
Out[15]: 'try({m <- betareg(LEV_LT3 ~ SIZE1 + PROF2 + GROWTH2 + AGE + IND3A, data = rdata, subset = LEV_LT3 > 0)})\n'

In [16]: # OUTPUT MODEL SUMMARY

In [17]: print r("summary(m)")
try({summary(m)})

Call:
betareg(formula = LEV_LT3 ~ SIZE1 + PROF2 + GROWTH2 + AGE + IND3A, data = rdata, 
    subset = LEV_LT3 > 0)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-7.2802 -0.5194  0.0777  0.6037  5.8777 

Coefficients (mean model with logit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.229773   0.312990   3.929 8.53e-05 ***
SIZE1       -0.105009   0.021211  -4.951 7.39e-07 ***
PROF2       -2.414794   0.377271  -6.401 1.55e-10 ***
GROWTH2      0.003306   0.001043   3.169  0.00153 ** 
AGE         -0.004999   0.001795  -2.786  0.00534 ** 
IND3A        0.688314   0.074069   9.293  < 2e-16 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)   3.9362     0.1528   25.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 266.7 on 7 Df
Pseudo R-squared: 0.1468
Number of iterations: 25 (BFGS) + 2 (Fisher scoring) 


In [18]: # CALCULATE MODEL PREDICTION

In [19]: r("beta_fit <- predict(m, link = 'response')")
Out[19]: "try({beta_fit <- predict(m, link = 'response')})\n"

In [20]: # SHOW PREDICTION SUMMARY IN R

In [21]: print r("summary(beta_fit)")
try({summary(beta_fit)})
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1634  0.3069  0.3465  0.3657  0.4007  0.6695 


In [22]: # PASS DATA FROM R TO PYTHON

In [23]: pydata = pd.DataFrame(r.get("beta_fit"), columns = ["y_hat"])

In [24]: # SHOW PREDICTION SUMMARY IN PYTHON

In [25]: pydata.y_hat.describe()
Out[25]: 
count    1116.000000
mean        0.365675
std         0.089804
min         0.163388
25%         0.306897
50%         0.346483
75%         0.400656
max         0.669489

Another Class of Risk Models

In retail banking, it is a key interest to predict the probability of accounts’ adverse behaviors, such as delinquencies or defaults. A widely accepted practice in the industry is to classify accounts into two groups, the good and the bad, based upon the presence of certain adverse behaviors and then to model this binary outcome with discriminant models, e.g. logistic regression. However, an obvious limitation of discriminant models based upon the binary outcome is that the two-state classification over-simplifies adverse behaviors of accounts. What financially impacts a financial institution is not only the presence of a certain adverse behavior but also the frequency of such behavior.

In the definition of the binary outcome, it is important to notice that delinquencies can also be measured directly as the frequency of over-due payments. Therefore, instead of modeling the binary outcome, a more sensible alternative might be to model the frequency of delinquencies within a given valuation horizon. In the statistical context, the genuine model for a count outcome, e.g. frequency, is the Poisson regression model with probability function

f(Y_i | X_i) = exp(-λ_i) * (λ_i ^ Y_i) / Y_i!, where λ_i = exp(X_i`B)

It is assumed that each observed outcome Y_i is drawn from a Poisson distribution with the conditional mean λ_i on a given covariate vector X_i for case i. In the Poisson model, a strong assumption is that the mean is equal to the variance such that E(Y_i | X_i) = Var(Y_i | X_i) = λ_i, which is also known as Equi-Dispersion. However, in practice, this Equi-Dispersion assumption is too restrictive for many empirical applications. In real-world count outcomes, the variance often exceeds the mean, namely Over-Dispersion, due to various reasons, such as excess zeroes or a long right tail. For instance, in a credit card portfolio, the majority of cardholders should have zero delinquency at any point in time, while a few might have more than three. Similar to the consequence of heteroskedasticity in a linear regression, Over-Dispersion in a Poisson model will lead to deflated standard errors of parameter estimates and therefore inflated t-statistics. Hence, the Poisson model is often inadequate and practically unusable.

Considered a generalization of basic Poisson model, Negative Binomial (NB) model accommodates Over-Dispersion in data by including a dispersion parameter. In a NB model, it is assumed that the conditional mean λ_i for case i is determined not only by the observed heterogeneity explained by the covariate vector X_i but also by the unobserved heterogeneity denoted as ε_i that is independent of X_i such that

λ_i = exp(X_i`B + ε_i) = exp(X_i`B) * exp(ε_i), where exp(ε_i) ~ Gamma(1/α, 1/α)

While there are many variants of the NB model, the most common one is the NB2 model proposed by Cameron and Trivedi (1986) with probability function

f(Y_i | X_i) = Gamma(Y_i + 1/α) / [Gamma(Y_i + 1) * Gamma(1/α)] * [(1/α) / (1/α + λ_i)] ^ (1/α) * [λ_i / (1/α + λ_i)] ^ Y_i, where α is the dispersion parameter

For NB2 model, its conditional mean E(Y_i | X_i) is still λ_i, while its variance Var(Y_i | X_i) becomes λ_i + α * λ_i ^ 2. Since both λ_i > 0 and α > 0, the variance must exceed the mean and therefore the issue of Over-Dispersion has been addressed.
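
As a side note, the contrast between the Poisson and NB2 specifications is easy to see in R. Below is a minimal sketch assuming a hypothetical data frame df with a count outcome cnt and predictors x1 and x2; the theta reported by glm.nb() corresponds to 1 / α in the notation above.

library(MASS)
m_pois <- glm(cnt ~ x1 + x2, family = poisson, data = df)
m_nb2  <- glm.nb(cnt ~ x1 + x2, data = df)
# a quick over-dispersion check: Pearson chi-square over the residual df
sum(residuals(m_pois, type = "pearson") ^ 2) / df.residual(m_pois)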

A major limitation of standard count data models, such as Poisson and Negative Binomial, is that the data is assumed to be generated by a single process. However, in many cases, it might be more appropriate to assume that the data is governed by two or more processes. For instance, it is believed that risk drivers of the first-time delinquent account might be very different from the ones of an account who had been delinquent for multiple times. From the business standpoint, the assumption of multiple processes is particularly attractive in that it provides the potential to segment the portfolio into two or more sub-groups based upon their delinquent pattern and loan characteristics.

Known as the two-part model, Hurdle Poisson model assumes that count outcomes come from two systematically different processes, a Binomial distribution determining the probability of zero counts and a Truncated-at-Zero Poisson governing positive outcomes. The probability function can be expressed as

for Y_i = 0, f(Y_i | X_i) = θ_i, where θ_i = Prob(Y_i = 0)
for Y_i > 0, f(Y_i | X_i) = (1 – θ_i) * exp(-λ_i) * λ_i ^ Y_i / {[1 – exp(-λ_i)] * Y_i!}, where λ_i = exp(X_i`B)

In the modeling framework, the first process can be analyzed by a logistic regression and the second can be reflected by a Truncated-at-Zero Poisson model. An advantage of Hurdle Model is that it is so flexible as to effectively model both Over-Dispersed data with too many zeroes and Under-Dispersed data with too few zeroes.
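
A minimal sketch of the Hurdle Poisson model in R, assuming the pscl package and the same hypothetical data frame df, is shown below; the zero part is a binomial (logit) model and the count part is a truncated-at-zero Poisson.

library(pscl)
m_hurdle <- hurdle(cnt ~ x1 + x2, data = df, dist = "poisson", zero.dist = "binomial")
summary(m_hurdle)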

Alike to Hurdle model, Zero-Inflated Poisson (ZIP) model is another way to model count outcomes with excess zeroes under the assumption of two components. However, it is slightly different from Hurdle model in the sense that zero outcomes are assumed to come from two different sources, one generating only zero outcomes and the other generating both zero and nonzero outcomes. Specifically, a Binomial distribution decides if an individual is from the Always-Zero or the Not-Always-Zero group and then a standard Poisson distribution describes counts in the Not-always-zero group. The probability function of ZIP model is given as

for Y_i = 0, f(Y_i | X_i) = ω_i + (1 – ω_i) * exp(-λ_i), where ω_i = Prob(Y_i comes from the Always-Zero group)
for Y_i > 0, f(Y_i | X_i) = (1 – ω_i) * exp(-λ_i) * λ_i ^ Y_i / Y_i!

Similar to the Hurdle model, the ZIP model can also be represented jointly by two different sub-models. A logistic regression is used to separate the Always-Zero group from the Not-Always-Zero group, and a basic Poisson model is applied to individuals in the Not-Always-Zero group. From a business perspective, the ZIP model describes an important fact that some not-at-risk accounts are so well established that they will never have financial problems, while the at-risk ones might get into trouble during tough times. Therefore, risk exposures and underlying metrics for accounts with the same zero count might still be differentiable.
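
Similarly, a minimal sketch of the ZIP model with the pscl package and the same hypothetical data frame df might look like the one below; the formula part before "|" drives the Poisson mean and the part after "|" drives the probability of the Always-Zero group.

library(pscl)
m_zip <- zeroinfl(cnt ~ x1 + x2 | x1 + x2, data = df, dist = "poisson")
summary(m_zip)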

In practice, a sharp dichotomization between the at-risk group and the not-at-risk group might not be realistic. Even a customer with a good financial condition might be exposed to risks in certain situations. Therefore, it might make sense to split the whole portfolio into a couple of segments with different levels of risk exposure. A Latent Class Poisson model provides such a mechanism by assuming that the population of interest is actually a mixture of S > 1 latent (unobservable) components and that each individual is considered a draw from one of these latent groups. The probability function of a Latent Class Poisson model with S = 2 classes can be obtained as

F(Y_i | X_i) = P1_i * exp(-λ1_i) * λ1_i ^ Y_i / Y_i! + P2_i * exp(-λ2_i) * λ2_i ^ Y_i / Y_i!, where P1_i + P2_i = 1

Each latent component in the mixture is assumed to have its own parameter λ, which accounts for the unobserved heterogeneity in the population. For instance, in the case of S = 2, a portfolio is assumed to be a mixture of a high-risk group and a low-risk one. Impacts of predictors are allowed to differ across latent groups, allowing for more informative and flexible interpretations.
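
A minimal sketch of a two-class Latent Class Poisson model, assuming the flexmix package and the same hypothetical data frame df, is given below; each latent class gets its own Poisson regression, and the mixing proportions follow from the fit.

library(flexmix)
m_lcp <- flexmix(cnt ~ x1 + x2, data = df, k = 2, model = FLXMRglm(family = "poisson"))
parameters(m_lcp)   # class-specific coefficients
prior(m_lcp)        # estimated mixing proportions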

Besides the models discussed above, it is also worth pointing out that discrete choice models, such as Logit or Probit, have been widely used to model count outcomes as well. However, such a discrete choice model needs to be based upon a sequential or ordered rather than multinomial response, namely an ordered Logit.

Adjustment of Selection Bias in Marketing Campaign

A standard practice to evaluate the effect of a marketing campaign is to divide the targeted prospects into two testing groups, a control group without the marketing intervention and a treatment group with the intervention, and then to compare the difference in results between the two groups given observed characteristics. The sole purpose is to see whether the intervention has a causal effect on the results that we are interested in, e.g. response or conversion rate. From the statistical perspective, it is desirable to randomly assign targeted individuals to one of the testing groups such that the background information of individuals in different groups is comparable or balanced and the assignment of individuals to groups is independent of the outcomes of the intervention. This practice is also called Randomization. However, in many marketing campaigns that we observed, the randomization mentioned above is prohibitively difficult due to various constraints in the real world. An example is that the assignment of an individual to a specific group might be somehow determined by his background information, which might in turn be related to his response to the campaign. As a result, this characteristic heterogeneity of targeted individuals between testing groups will give a biased estimate of the campaign effect, which is also called selection bias. A typical symptom of such bias is that the campaign effect looks more optimistic than it is supposed to be.

The selection bias is a common issue in many observational studies in social science. While different methods have been proposed to adjust or correct this bias, we’d like to demonstrate two modeling techniques to correct the selection bias in the marketing campaign, namely the propensity score method and the Heckman selection method.

Introduced by Rosenbaum and Rubin (1983), the propensity score can be defined as the conditional probability of an individual receiving a specific exposure (treatment or campaign) given certain observed background information. The idea of the propensity score is to partial out the observed characteristic heterogeneity of individuals between testing groups so as to make the assignment of groups conditionally independent of the campaign outcome. While there are numerous ways, such as matching or weighting, to implement the propensity score, we are particularly interested in the one proposed by Cela (2003) due to its simplicity and flexibility. The implementation of Cela’s method is straightforward and considered a two-step approach.

Step One
First of all, we build a logistic regression using the assignment of groups as the response variable and then estimate the probability of an individual assigned to the group with marketing intervention given a set of background variables. The propensity score model can be formulated as:

Prob(d = 1 | x) = p = EXP(T * x) / [1 + EXP(T * x)]

where d = 1 for an individual assigned to the intervention group, p is the propensity score, and x is the covariate matrix of background information. In this step, while we propose using a logistic regression, other models designed for binary outcome, such as probit model or classification tree, might also work well.

Step Two
Secondly, we build another logistic regression to estimate the causal effect of the marketing campaign such that

Prob(y = 1 | z, d, p) = EXP(A * z + B * d + C * p) / [1 + EXP(A * z + B * d + C * p)]

where y = 1 for the individual with a positive response and z is the covariate matrix of control factors and background information. In this model, the propensity score acts as a control factor such that the outcome y is independent of the group assignment d after partialling out the observed characteristic heterogeneity incorporated in the propensity score p. In the formulation of our second model, the parameter B is the estimated causal effect of the marketing intervention that we are interested in, while the term C * p partials out the observed selection effect. While we are using a logistic regression for the binary outcome in our second model, any model within the framework of Generalized Linear Models (GLM) should be applicable for the appropriate response variable.
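
A minimal sketch of this two-step scheme in R is shown below, assuming a hypothetical data frame camp with a treatment flag d, a response y, background covariates x1 and x2, and a control factor z1.

ps_fit  <- glm(d ~ x1 + x2, family = binomial, data = camp)       # step one
camp$p  <- fitted(ps_fit)                                         # the propensity score
out_fit <- glm(y ~ z1 + d + p, family = binomial, data = camp)    # step two
summary(out_fit)   # the coefficient of d is the adjusted campaign effect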

While the propensity score method is able to adjust the selection bias due to observed characteristics, it has been criticized for failing to address the bias arising from the unobservable. Originally proposed by Heckman (1976), the Heckman selection method is able to control the bias due to the unobservable, provided the assumption about the underlying distribution is valid. In the framework of the Heckman model, it is assumed that there are two simultaneous processes in the model, one called the Selection Equation and the other the Outcome Equation, whose error terms follow a Bivariate Normal Distribution with nonzero correlation. While the two equations can be estimated simultaneously, we prefer a two-stage estimation method due to its simplicity and its advantage in algorithm convergence.

Step One
Similar to the propensity score method, we fit a probit model to estimate the probability of an individual assigned to the group with marketing intervention.

Prob(d = 1 | x) = G(T * x)

where G(.) is the cumulative distribution function of the Normal Distribution. Please note that the use of the probit model is determined by the Normality assumption of the Heckman model. Based upon the result from this probit model, we calculate the Inverse Mills Ratio (IMR) for each individual, which is considered a measure of selection bias and can be formulated as

IMR = g(T * x) / G(T * x)

where g(.) is the probability density function and x is the covariate matrix in Selection Equation.

Step Two
We build a second probit model by including IMR as one of the predictors, which can be formulated as

Prob(y = 1 | z, d, IMR) = G(A * z + B * d + C * IMR)

where y = 1 for the individual with a positive response and z is the covariate matrix of control factors and background information. While the significance of the parameter C indicates the existence of selection bias, the lack of such significance doesn’t necessarily imply that there is no bias. The consistency of estimates in the Heckman model strongly relies on the distributional assumption, which is difficult to justify in real-world data.
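
A minimal sketch of this two-stage correction in R, with the same hypothetical data frame camp, might look like the one below; the Inverse Mills Ratio from the selection probit enters the outcome probit as an extra regressor.

sel_fit  <- glm(d ~ x1 + x2, family = binomial(link = "probit"), data = camp)
xb       <- predict(sel_fit, type = "link")
camp$imr <- dnorm(xb) / pnorm(xb)                                  # Inverse Mills Ratio
out_fit  <- glm(y ~ z1 + d + imr, family = binomial(link = "probit"), data = camp)
summary(out_fit)   # the significance of imr hints at selection bias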

Risk Classification of Auto Insurance Claims

In the insurance industry, Poisson regression (PR) has been widely used to model insurance claims. In real-world auto insurance data, more than 90% of the insured report zero claims, which might bring difficulties in two aspects if PR is employed. First of all, the observed number of zero-count outcomes is larger than the number predicted by PR. Secondly, the variance might exceed the mean in the observed outcome, a violation of the equi-dispersion assumption in PR. In order to improve the model performance, alternatives to PR should be considered.

As the most common alternative to PR, Negative Binomial (NB) regression addresses the issue of over-dispersion by including a random component with Gamma distribution. In NB, the variance can be expressed as the mean plus a non-negative term such that

V(Y|X) = mu + mu^2 / theta >= mu = E(Y|X)

Therefore, NB has a more flexible variance structure than PR does. While NB is able to capture the heterogeneity in the count outcome, it is under the criticism of failing to provide a satisfactory explanation on excess zeros.

Known as the two-part model, Hurdle regression (Mullahy 1986) provides a more appealing interpretation for the count outcome with excess zeros. While the first part separates the potential insurance claimants (Y ≥ 1) from the non-claimants (Y = 0) with a Binomial model, the second part models the number of claims made by those potential claimants (Y ≥ 1) with a truncated-at-zero Poisson model. Thus, the density of Hurdle model can be expressed as

F(Y|X) = θ for Y = 0

(1 – θ) * G(Y) / [1 – G(0)] for Y > 0

θ is the probability of non-claimants and G(.) is a count density, e.g. Poisson or Negative Binomial. The major motivation of Hurdle regression in the auto insurance context is that the insured individual tends to behave differently once he/she makes the first claim, which is in line with our observation of most human behaviors.

Falling into the class of finite mixture models, Zero-Inflated Poisson (ZIP) regression (Lambert 1992) tries to accommodate excess zeros by assuming that zero count outcomes come from two different sources, one always generating zero outcomes and the other generating both zero and nonzero outcomes. For instance, in the insured population, some never make an insurance claim (Y = 0) despite the occurrence of accidents, while the others make claims (Y ≥ 0) whenever accidents happen. The density of the ZIP model can be formulated as

F(Y|X) = ω + (1 – ω) * H(0) for Y = 0

(1 – ω) * H(Y) for Y > 0

ω is the probability of an individual belonging to the always-zero group and H(.) is a count density, e.g. Poisson or Negative Binomial. It is noted that, unlike θ in Hurdle model, ω in ZIP model is not directly observable but determined by the claim pattern of the insured. From the marketing perspective, ZIP model is attractive in that the insurance company is able to differentiate the insured and therefore to charge different premiums based upon their claim behaviors.

Considered a generalization of the ZIP model, Latent Class Poisson (LCP) regression (Wedel 1993) assumes that all individuals, instead of just the ones with zero counts, are drawn from a finite mixture of Poisson distributions. Given the fact that even a careful driver might also have a chance to file a claim, it is more desirable to avoid the somewhat unrealistic dichotomization based solely upon the claim pattern and instead to distinguish the insured by latent risk factors such as lifestyle, attitude toward risk, financial ability, and so on. For a population consisting of K mixing components with the proportion π, the density of LCP is given as

F(Y|X) = SUM πi * Fi(Y|X)

πi is the probability of an individual coming from component i, with SUM(πi, i = 1, …, K) = 1, and Fi(.) is the count density for component i. In the mixture distribution, impacts of each explanatory variable are allowed to differ across components to account for the heterogeneity in the population. Therefore, LCP provides the ability to differentiate the insured in a more flexible manner and a better framework for market segmentation.

Composite Conditional Mean and Variance Modeling in Time Series

In time series analysis, it is often necessary to model both conditional mean and conditional variance simultaneously, which is so-called composite modeling. For instance, while the conditional mean is an AR(1) model, the conditional variance can be a GARCH(1, 1) model.

In SAS/ETS module, it is convenient to build such composite models with AUTOREG procedure if the conditional mean specification is as simple as shown below.

data garch1;
  lu = 0;
  lh = 0;
  do i = 1 to 5000;
    x = ranuni(1);
    h = 0.3 + 0.4 * lu ** 2 + 0.5 * lh;
    u = sqrt(h) * rannor(1);
    y = 1 + 3 * x + u;
    lu = u;
    lh = h;
    output;
  end;
run;

proc autoreg data = _last_;
  model y = x / garch = (p = 1, q = 1);
run;
/*
                                    Standard                 Approx
Variable        DF     Estimate        Error    t Value    Pr > |t|

Intercept        1       1.0125       0.0316      32.07      <.0001
x                1       2.9332       0.0536      54.72      <.0001
ARCH0            1       0.2886       0.0256      11.28      <.0001
ARCH1            1       0.3881       0.0239      16.22      <.0001
GARCH1           1       0.5040       0.0239      21.10      <.0001
*/

However, when the conditional mean has a more complex structure, then the MODEL procedure instead of AUTOREG should be used. Below is a good example showing the flexibility of the MODEL procedure. In the demonstration, the conditional mean is an ARMA(1, 1) model and the conditional variance is a GARCH(1, 1) model.

data garch2;
  lu = 0;
  lh = 0;
  ly = 0;
  do i = 1 to 5000;
    x = ranuni(1);
    h = 0.3 + 0.4 * lu ** 2 + 0.5 * lh;
    u = sqrt(h) * rannor(1);
    y = 1 + 3 * x + 0.6 * (ly - 1) + u - 0.7 * lu;
    lu = u;
    lh = h;
    ly = y;
    output;
  end;
run;

proc model data = _last_;
  parms mu x_beta ar1 ma1 arch0 arch1 garch1;
  y = mu + x_beta * x + ar1 * zlag1(y - mu) + ma1 * zlag1(resid.y);
  h.y = arch0 + arch1 * xlag(resid.y ** 2, mse.y) +
        garch1 * xlag(h.y, mse.y);
  fit y / method = marquardt fiml;
run;
/*
                              Approx                  Approx
Parameter       Estimate     Std Err    t Value     Pr > |t|

mu              0.953905      0.0673      14.18       <.0001
x_beta           2.92509      0.0485      60.30       <.0001
ar1             0.613025     0.00819      74.89       <.0001
ma1             0.700154      0.0126      55.49       <.0001
arch0           0.288948      0.0257      11.26       <.0001
arch1           0.387436      0.0238      16.28       <.0001
garch1          0.504588      0.0237      21.26       <.0001
*/
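
For readers working in R, a similar composite model can be fit with the fGarch package. Below is a minimal sketch that simulates a comparable ARMA(1, 1) + GARCH(1, 1) process; for simplicity, the exogenous regressor x is dropped, so only the conditional mean and variance dynamics are estimated.

library(fGarch)
set.seed(2013)
n <- 5000
u <- h <- y <- rep(0, n)
for (i in 2:n) {
  h[i] <- 0.3 + 0.4 * u[i - 1] ^ 2 + 0.5 * h[i - 1]
  u[i] <- sqrt(h[i]) * rnorm(1)
  y[i] <- 1 + 0.6 * (y[i - 1] - 1) + u[i] - 0.7 * u[i - 1]
}
fit <- garchFit(formula = ~ arma(1, 1) + garch(1, 1), data = y, trace = FALSE)
summary(fit)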

Marginal Effects in Two-Part Fractional Models

As shown in “Two-Part Fractional Model” posted on 09/25/2012, sometimes it might be beneficial to model fractional outcomes in the range of [0, 1] with composite models, e.g. a two-part model, especially when there are a non-trivial number of boundary outcomes. However, the marginal effect of X in a two-part model is not as straightforward to calculate as in a one-part model shown in “Marginal Effects in Tobit Models” posted on 10/06/2012.

In the demonstration below, I will show how to calculate the marginal effect of X in a two-part model following a logic similar to the McDonald and Moffitt decomposition. Since the expected outcome is the product of the probability of a positive outcome from the first part and the expected value of the positive outcome from the second part, the marginal effect of each X is the sum of two terms, one being the derivative of the first part times the prediction of the second and the other being the derivative of the second part times the prediction of the first.

proc nlmixed data = one tech = congra maxiter = 1000;
  parms b10 = -9.3586 b11 = -0.0595 b12 =  1.7644 b13 =  0.5994 b14 = -2.5496
        b15 = -0.0007 b16 = -0.0011 b17 = -1.6359
        b20 =  0.3401 b21 =  0.0274 b22 =  0.1437 b23 =  0.0229 b24 =  0.4656
        b25 =  0.0011 b26 =  0.0021 b27 =  0.1977  s  =  0.2149;
  logit_xb = b10 + b11 * x1 + b12 * x2 + b13 * x3 + b14 * x4 +
             b15 * x5 + b16 * x6 + b17 * x7;
  nls_xb = b20 + b21 * x1 + b22 * x2 + b23 * x3 + b24 * x4 +
           b25 * x5 + b26 * x6 + b27 * x7;
  p1 = 1 / (1 + exp(-logit_xb));
  p2 = 1 / (1 + exp(-nls_xb));
  if y = 0 then ll = log(1 - p1);
  else ll = log(p1) + log(pdf('normal', y, p2, s));
  model y ~ general(ll);
  predict logit_xb out = out_1 (rename = (pred = part1_xb) keep = _id_ pred y);
  predict p1       out = out_2 (rename = (pred = part1_p)  keep = _id_ pred);
  predict nls_xb   out = out_3 (rename = (pred = part2_xb) keep = _id_ pred);
  predict p2       out = out_4 (rename = (pred = part2_p)  keep = _id_ pred);
run;

data out;
  merge out_1 out_2 out_3 out_4;
  by _id_;

  margin1_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -0.05948) * part2_p;
  margin1_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.01115) * part1_p;
  x1_margin = margin1_part1 + margin1_part2;

  margin2_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * 1.7645) * part2_p;
  margin2_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.4363) * part1_p;
  x2_margin = margin2_part1 + margin2_part2;

  margin3_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * 0.5994) * part2_p;
  margin3_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.1139) * part1_p;
  x3_margin = margin3_part1 + margin3_part2;

  margin4_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -2.5496) * part2_p;
  margin4_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -2.8755) * part1_p;
  x4_margin = margin4_part1 + margin4_part2;

  margin5_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -0.00071) * part2_p;
  margin5_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * 0.004091) * part1_p;
  x5_margin = margin5_part1 + margin5_part2;

  margin6_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -0.00109) * part2_p;
  margin6_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.00839) * part1_p;
  x6_margin = margin6_part1 + margin6_part2;

  margin7_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -1.6359) * part2_p;
  margin7_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.1666) * part1_p;
  x7_margin = margin7_part1 + margin7_part2;
run;

proc means data = _last_ mean;
  var x:;
run;
/*
Variable                 Mean

x1_margin          -0.0039520
x2_margin           0.0739847
x3_margin           0.0270673
x4_margin          -0.3045967
x5_margin         0.000191015
x6_margin        -0.000533998
x7_margin          -0.1007960
*/
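
The same decomposition is easy to replicate in R. Below is a minimal sketch assuming a model matrix X (including the intercept column) along with coefficient vectors b1 from the logit part and b2 from the logistic-mean part, all of which are hypothetical; dlogis(xb) equals EXP(xb) / (1 + EXP(xb)) ^ 2.

xb1 <- X %*% b1
xb2 <- X %*% b2
margin_k <- function(k)
  mean(dlogis(xb1) * b1[k] * plogis(xb2) + dlogis(xb2) * b2[k] * plogis(xb1))
# average marginal effect of each predictor (the first element is the intercept)
sapply(seq_along(b1), margin_k)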

Modeling Heteroscedasticity Directly in NLS

****** nonlinear least square regression without heteroscedasticity ******;
proc nlmixed data = data.sme tech = congra;
  where y > 0  and y < 1;
  parms b0 =  1.78 b1 = -0.01 b2 = -0.43 b3 = -0.11 b4 = -2.93
        b5 =  0.01 b6 = -0.01 b7 = -0.17; 
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 +
       b5 * x5 + b6 * x6 + b7 * x7;
  mu = 1 / (1 + exp(-xb));
  lh = pdf('normal', y, mu, s);
  ll = log(lh);
  model y ~ general(ll);
run;
/*
             Fit Statistics
-2 Log Likelihood                 -264.5
AIC (smaller is better)           -246.5
AICC (smaller is better)          -246.3
BIC (smaller is better)           -201.4

                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|    
b0            1.7813     0.3400   1116      5.24     <.0001   
b1          -0.01203    0.02735   1116     -0.44     0.6602  
b2           -0.4305     0.1437   1116     -3.00     0.0028    
b3           -0.1147    0.02287   1116     -5.02     <.0001   
b4           -2.9302     0.4657   1116     -6.29     <.0001    
b5          0.004095   0.001074   1116      3.81     0.0001   
b6          -0.00839   0.002110   1116     -3.98     <.0001    
b7           -0.1710     0.1977   1116     -0.87     0.3871   
s             0.2149   0.004549   1116     47.24     <.0001  
*/
 
****** nonlinear least square regression with heteroscedasticity ******;
proc nlmixed data = data.sme tech = congra;
  where y > 0 and y < 1;
  parms b0 =  1.78 b1 = -0.01 b2 = -0.43 b3 = -0.11 b4 = -2.93
        b5 =  0.01 b6 = -0.01 b7 = -0.17 s  =  0.21
        a1 = -0.13 a2 = -15.62 a3 = 0.09 a4 = -1.27 a5 = 0.01
        a6 = -0.02 a7 =  0.47;
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 +
       b5 * x5 + b6 * x6 + b7 * x7;
  xa = a1 * x1 + a2 * x2 + a3 * x3 + a4 * x4 +
       a5 * x5 + a6 * x6 + a7 * x7;
  mu = 1 / (1 + exp(-xb));
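  * the error standard deviation is allowed to vary with predictors through exp(xa) ;
  * so that a constant variance is recovered when all a coefficients equal zero     ;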
  si = (s ** 2 * (1 + exp(xa))) ** 0.5;       
  lh = pdf('normal', y, mu, si);
  ll = log(lh);
  model y ~ general(ll);
run;
/*
             Fit Statistics
-2 Log Likelihood                 -325.9
AIC (smaller is better)           -293.9
AICC (smaller is better)          -293.4
BIC (smaller is better)           -213.6

                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|    
b0            2.0343     0.3336   1116      6.10     <.0001     
b1          0.003764    0.02408   1116      0.16     0.8758    
b2          -0.08544     0.1501   1116     -0.57     0.5693     
b3           -0.1495    0.02263   1116     -6.61     <.0001    
b4           -2.6251     0.4379   1116     -6.00     <.0001    
b5          0.003331   0.001115   1116      2.99     0.0029    
b6          -0.00644   0.001989   1116     -3.24     0.0012     
b7           -0.1836     0.1938   1116     -0.95     0.3436    
s             0.1944   0.005067   1116     38.35     <.0001    
a1           -0.1266     0.3389   1116     -0.37     0.7088     
a2          -15.6169     5.2424   1116     -2.98     0.0030     
a3           0.09074    0.03282   1116      2.76     0.0058     
a4           -1.2681     3.8044   1116     -0.33     0.7389    
a5          0.007411   0.005267   1116      1.41     0.1597    
a6          -0.01738    0.01527   1116     -1.14     0.2550    
a7            0.4805     1.1232   1116      0.43     0.6689    
*/

Marginal Effects in Tobit Models

proc qlim data = data.sme;
  model y = x1 - x7;
  endogenous y ~ censored(lb = 0 ub = 1);
  output out = out1 marginal;
run;
/*
                                 Standard                 Approx
Parameter        Estimate           Error    t Value    Pr > |t|
Intercept       -2.204123        0.118473     -18.60      <.0001
x1              -0.015086        0.008345      -1.81      0.0707
x2               0.376830        0.048772       7.73      <.0001
x3               0.141672        0.008032      17.64      <.0001
x4              -0.813496        0.133564      -6.09      <.0001
x5            0.000036437        0.000320       0.11      0.9094
x6              -0.001152        0.000704      -1.64      0.1016
x7              -0.392152        0.060902      -6.44      <.0001
_Sigma           0.497938        0.012319      40.42      <.0001
*/

proc means data = out1 mean;
  var meff_x:;
run;
/* Auto Calculation:
Variable    Label                                 Mean
------------------------------------------------------
Meff_x1     Marginal effect of x1 on y      -0.0036988
Meff_x2     Marginal effect of x2 on y       0.0923919
Meff_x3     Marginal effect of x3 on y       0.0347354
Meff_x4     Marginal effect of x4 on y      -0.1994545
Meff_x5     Marginal effect of x5 on y    8.9337756E-6
Meff_x6     Marginal effect of x6 on y    -0.000282493
Meff_x7     Marginal effect of x7 on y      -0.0961485
*/

data one;
  set data.sme;
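  * marginal effect of x_k on the expected observed y, censored within [0, 1]  ;
  * equals [PROBNORM((1 - xb) / sigma) - PROBNORM((0 - xb) / sigma)] * beta_k  ;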
  
  _xb_ = -2.204123 + x1 * -0.015086 + x2 * 0.376830 + x3 * 0.141672 + x4 * -0.813496 + 
         x5 * 0.000036437 + x6 * -0.001152 + x7 * -0.392152;
  _phi_lb = probnorm((0 - _xb_) / 0.497938);
  _phi_ub = probnorm((1 - _xb_) / 0.497938);
  _pdf_lb = pdf('normal', (0 - _xb_) / 0.497938);
  _pdf_ub = pdf('normal', (1 - _xb_) / 0.497938);
  _imr = (_pdf_lb - _pdf_ub) / (_phi_ub - _phi_lb);
  _margin_x1 = (_phi_ub - _phi_lb) * -0.015086;
  _margin_x2 = (_phi_ub - _phi_lb) * 0.376830;
  _margin_x3 = (_phi_ub - _phi_lb) * 0.141672;
  _margin_x4 = (_phi_ub - _phi_lb) * -0.813496;
  _margin_x5 = (_phi_ub - _phi_lb) * 0.000036437;
  _margin_x6 = (_phi_ub - _phi_lb) * -0.001152;
  _margin_x7 = (_phi_ub - _phi_lb) * -0.392152;
run;

proc means data = one mean;
  var _margin_x:;
run; 
/* Manual Calculation:
Variable              Mean
--------------------------
_margin_x1      -0.0036988
_margin_x2       0.0923924
_margin_x3       0.0347356
_margin_x4      -0.1994555
_margin_x5    8.9337391E-6
_margin_x6    -0.000282451
_margin_x7      -0.0961491
*/

Marginal Effects (on Binary Outcome)

In regression models, the marginal effect of an explanatory variable X is the partial derivative of the prediction with respect to X and measures the expected change in the response variable per unit change in X with the other explanatory variables held constant. In the interpretation of a regression model, presenting marginal effects often brings more information than looking at coefficients alone. Below, I will use two types of regression models for binary outcomes, e.g. logit and probit, to show that although coefficients estimated from the same set of Xs might differ substantially between the two models, marginal effects of each X in both models actually look very similar.

As shown below, parameter estimates from logit and probit look very different due to different model specifications and assumptions. As a result, it is not meaningful to compare the effect and sensitivity of each predictor across the two models simply by looking at their coefficients.

proc qlim data = one;
  model bad = bureau_score ltv / discrete(d = logit);
  output out = out1 marginal;
run;
/* logit estimates
                                    Standard                 Approx
Parameter           Estimate           Error    t Value    Pr > |t|
Intercept           7.080229        0.506910      13.97      <.0001
bureau_score       -0.016705        0.000735     -22.74      <.0001
ltv                 0.028055        0.002361      11.88      <.0001
*/

proc qlim data = one;
  model bad = bureau_score ltv / discrete(d = probit);
  output out = out2 marginal;
run;
/* probit estimates
                                    Standard                 Approx
Parameter           Estimate           Error    t Value    Pr > |t|
Intercept           4.023515        0.285587      14.09      <.0001
bureau_score       -0.009500        0.000403     -23.56      <.0001
ltv                 0.015690        0.001316      11.93      <.0001
*/

However, are these two models really so different from each other? Comparing marginal effects instead of parameter estimates might be able to bring us more useful information.

proc means data = out1 mean;
  var meff_p2_:;
run;
/* marginal effects from logit
Variable                            Mean
----------------------------------------
Meff_P2_bureau_score          -0.0022705
Meff_P2_ltv                    0.0038132
----------------------------------------
*/

proc means data = out2 mean;
  var meff_p2_:;
run;
/* marginal effects from probit
Variable                            Mean
----------------------------------------
Meff_P2_bureau_score          -0.0022553
Meff_P2_ltv                    0.0037249
----------------------------------------
*/

It turns out that marginal effects of each predictor between two models are reasonably close.

Although it is easy to calculate marginal effects with SAS QLIM procedure, it might still be better to understand the underlying math and then compute them yourself with SAS data steps. Below is a demo on how to manually calculate marginal effects of a logit model following the formulation:
MF_x_i = EXP(XB) / ((1 + EXP(XB)) ^ 2) * beta_i for the ith predictor.

proc logistic data = one desc;
  model bad = bureau_score ltv;
  output out = out3 xbeta = xb;
run;
/* model estimates:
                                  Standard          Wald
Parameter       DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept        1      7.0802      0.5069      195.0857        <.0001
bureau_score     1     -0.0167    0.000735      516.8737        <.0001
ltv              1      0.0281     0.00236      141.1962        <.0001
*/

data out3;
  set out3;
  margin_bureau_score = exp(xb) / ((1 + exp(xb)) ** 2) * (-0.0167);
  margin_ltv = exp(xb) / ((1 + exp(xb)) ** 2) * (0.0281);
run;

proc means data = out3 mean;
  var margin_bureau_score margin_ltv;
run;
/* manual calculated marginal effects:
Variable                       Mean
-----------------------------------
margin_bureau_score      -0.0022698
margin_ltv                0.0038193
-----------------------------------
*/
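
The same average marginal effects can be reproduced in R with a few lines. Below is a minimal sketch assuming a data frame one with the variables bad, bureau_score, and ltv used in the SAS example above.

fit <- glm(bad ~ bureau_score + ltv, family = binomial, data = one)
xb  <- predict(fit, type = "link")
# average marginal effect of each predictor: mean of dlogis(xb) times its beta
colMeans(outer(dlogis(xb), coef(fit)[-1]))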

Two-Part Fractional Model

1. Two-step estimation method:
– SAS code

data one;
  set data.sme;
  _id_ + 1;
  if y = 0 then y2 = 0;
  else y2 = 1;
run;

proc logistic data = one desc;
  model y2 = x1 - x7;
run;

proc nlmixed data = one tech = congra;
  where y2 = 1;
  parms b0 =  2.14 b1 = -0.01 b2 = -0.55 b3 = -0.14 b4 = -3.47
        b5 =  0.01 b6 = -0.01 b7 = -0.19 s = 0.1;
  _xb_ = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 +
         b5 * x5 + b6 * x6 + b7 * x7;
  _mu_ = 1 / (1 + exp(-_xb_));
  lh = pdf('normal', y, _mu_, s);
  ll = log(lh);
  model y ~ general(ll);
  predict ll out = out1(keep = _id_ pred rename = (pred = ln_like1));
run;

– Outputs

*** output for logit component ***
         Model Fit Statistics
                             Intercept
              Intercept            and
Criterion          Only     Covariates
AIC            4997.648       4066.398
SC             5004.043       4117.551
-2 Log L       4995.648       4050.398

             Analysis of Maximum Likelihood Estimates
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1     -9.3586      0.4341      464.8713        <.0001
x1            1     -0.0595      0.0341        3.0445        0.0810
x2            1      1.7644      0.1836       92.3757        <.0001
x3            1      0.5994      0.0302      392.7404        <.0001
x4            1     -2.5496      0.5084       25.1488        <.0001
x5            1    -0.00071     0.00128        0.3094        0.5780
x6            1    -0.00109     0.00267        0.1659        0.6838
x7            1     -1.6359      0.2443       44.8427        <.0001

*** output for the positive component ***
             Fit Statistics
-2 Log Likelihood                 -264.5
AIC (smaller is better)           -246.5
AICC (smaller is better)          -246.3
BIC (smaller is better)           -201.4

                 Parameter Estimates
                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|
b0            1.7947     0.3401   1116      5.28     <.0001
b1          -0.01216    0.02737   1116     -0.44     0.6568
b2           -0.4306     0.1437   1116     -3.00     0.0028
b3           -0.1157    0.02287   1116     -5.06     <.0001
b4           -2.9267     0.4656   1116     -6.29     <.0001
b5          0.004092   0.001074   1116      3.81     0.0001
b6          -0.00838   0.002110   1116     -3.97     <.0001
b7           -0.1697     0.1977   1116     -0.86     0.3909
s             0.2149   0.004549   1116     47.24     <.0001

2. Joint estimation method:
– SAS code

proc nlmixed data = one tech = congra maxiter = 1000;
  parms b10 = -9.3586 b11 = -0.0595 b12 =  1.7644 b13 =  0.5994 b14 = -2.5496
        b15 = -0.0007 b16 = -0.0011 b17 = -1.6359
        b20 =  0.3401 b21 =  0.0274 b22 =  0.1437 b23 =  0.0229 b24 =  0.4656
        b25 =  0.0011 b26 =  0.0021 b27 =  0.1977  s  =  0.2149;
  logit_xb = b10 + b11 * x1 + b12 * x2 + b13 * x3 + b14 * x4 +
             b15 * x5 + b16 * x6 + b17 * x7;
  nls_xb = b20 + b21 * x1 + b22 * x2 + b23 * x3 + b24 * x4 +
           b25 * x5 + b26 * x6 + b27 * x7;
  p1 = 1 / (1 + exp(-logit_xb));
  if y = 0 then ll = log(1 - p1);
  else ll = log(p1) + log(pdf('normal', y, 1 / (1 + exp(-nls_xb)), s));
  model y ~ general(ll);
run;

– Outputs

             Fit Statistics
-2 Log Likelihood                 3785.9
AIC (smaller is better)           3819.9
AICC (smaller is better)          3820.0
BIC (smaller is better)           3928.6

                 Parameter Estimates
                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|
b10          -9.3586     0.4341   4421    -21.56     <.0001
b11         -0.05948    0.03408   4421     -1.75     0.0810
b12           1.7645     0.1836   4421      9.61     <.0001
b13           0.5994    0.03024   4421     19.82     <.0001
b14          -2.5496     0.5084   4421     -5.01     <.0001
b15         -0.00071   0.001276   4421     -0.56     0.5784
b16         -0.00109   0.002673   4421     -0.41     0.6836
b17          -1.6359     0.2443   4421     -6.70     <.0001
b20           1.7633     0.3398   4421      5.19     <.0001
b21         -0.01115    0.02730   4421     -0.41     0.6830
b22          -0.4363     0.1437   4421     -3.04     0.0024
b23          -0.1139    0.02285   4421     -4.98     <.0001
b24          -2.8755     0.4643   4421     -6.19     <.0001
b25         0.004091   0.001074   4421      3.81     0.0001
b26         -0.00839   0.002109   4421     -3.98     <.0001
b27          -0.1666     0.1975   4421     -0.84     0.3991
s             0.2149   0.004550   4421     47.24     <.0001

As shown above, both estimation methods give similar parameter estimates. In addition, the sum of the log likelihoods from the two models in the two-step estimation is exactly equal to the log likelihood from the joint estimation, e.g. 4050.4 + (-264.5) = 3785.9 on the -2 log-likelihood scale.

A Distribution-Free Alternative to Vuong Test

The Vuong test has been widely used in nonnested model selection under the normality assumption. While the Vuong test determines whether the average log-likelihood ratio is statistically different from zero, the distribution-free test proposed by Clarke determines whether or not the median log-likelihood ratio is statistically different from zero.

Below is the SAS macro to implement Clarke test.

%macro clarke(data = , ll1 = , q1 = , ll2 = , q2 = );
***********************************************************;
* THE SAS MACRO IS TO PERFORM AN ALTERNATIVE TO VUONG     *;
* TEST, DISTRIBUTION-FREE CLARKE TEST, FOR THE MODEL      *;
* COMPARISON.                                             *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  LL1 : LOG LIKELIHOOD FUNCTION OF THE MODEL 1           *;
*  Q1  : DEGREE OF FREEDOM OF THE MODEL 1                 *;
*  LL2 : LOG LIKELIHOOD FUNCTION OF THE MODEL 2           *;
*  Q2  : DEGREE OF FREEDOM OF THE MODEL 2                 *;
* ======================================================= *;
* REFERENCE:                                              *;
*  A SIMPLE DISTRIBUTION-FREE TEST FOR NONNESTED MODEL    *;
*  SELECTION, KEVIN CLARKE, 2007                          *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options mprint mlogic formchar = "|----|+|---+=|-/\<>*" nocenter nonumber nodate;

data _tmp1;
  set &data;
  where &ll1 ~= 0 and &ll2 ~= 0;
run;

proc sql noprint;
  select count(*) into :nobs from _tmp1;
quit;

%let schwarz = %sysevalf((&q1 - &q2) * %sysfunc(log(&nobs)) / &nobs);

data _tmp2;
  set _tmp1;
  z = &ll1 - &ll2 - &schwarz;
  b1 = 0;
  b2 = 0;
  if z > 0 then b1 = 1;
  if z < 0 then b2 = 1;
run;

proc sql;
  create table
    _tmp3 as
  select 
    cdf("binomial", count(*) - sum(b1), 0.5, count(*))             as p1,
    cdf("binomial", count(*) - sum(b2), 0.5, count(*))             as p2,
    min(1, cdf("binomial", count(*) - sum(b2), 0.5, count(*)) * 2) as p3
  from 
    _tmp2;
quit;

proc report data = _tmp3 spacing = 1 split = "*" headline nowindows;
  column("Null Hypothesis: MDL1 = MDL2" p1 p2 p3);
  define p1  / "MDL1 > MDL2"  width = 15 format = 12.8 order;
  define p2  / "MDL1 < MDL2"  width = 15 format = 12.8;
  define p3  / "MDL1 != MDL2" width = 15 format = 12.8;
run;

%mend clarke;
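
For reference, the same sign test is only a few lines in R. Below is a minimal sketch assuming ll1 and ll2 are vectors of observation-level log likelihoods from two nonnested models with q1 and q2 estimated parameters respectively.

clarke <- function(ll1, q1, ll2, q2) {
  n <- length(ll1)
  z <- ll1 - ll2 - (q1 - q2) * log(n) / n    # Schwarz-corrected LR contributions
  # under the null, the number of positive z's follows Binomial(n, 0.5)
  binom.test(sum(z > 0), n, p = 0.5, alternative = "two.sided")
}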

A SAS Macro Implementing Bi-variate Granger Causality Test

In loss forecasting, it is often of interest to know: 1) whether a time series, e.g. a macro-economic variable, is useful to predict another, e.g. portfolio loss; 2) how many lags contribute such predictive power. Granger (1969) proposed that a time series X is said to Granger-cause Y if lagged values of X are able to provide statistically significant information to predict future values of Y.
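
For a quick cross-check outside of SAS, the same bi-variate test is available in R through lmtest::grangertest(); below is a minimal sketch assuming a hypothetical data frame ts_df holding two time series y and x.

library(lmtest)
grangertest(y ~ x, order = 3, data = ts_df)   # does x Granger-cause y at 3 lags?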

The SAS macro below shows how to implement the Granger Causality test in a bi-variate sense.

%macro causal(data = , y = , drivers = , max_lags = );
***********************************************************;
* THIS SAS MACRO IS AN IMPLEMENTATION OF BI-VARIATE       *;
* GRANGER CAUSALITY TEST PROPOSED BY GRANGER (1969)       *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA     : INPUT SAS DATA TABLE                        *;
*  Y        : A CONTINUOUS TIME SERIES RESPONSE VARIABLE  *;
*  DRIVERS  : A LIST OF TIME SERIES PREDICTORS            *;
*  MAX_LAGS : MAX # OF LAGS TO SEARCH FOR CAUSAL          *;
*             RELATIONSHIPS                               *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM, LOSS FORECASTING & RISK MODELING    *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen orientation = landscape 
        ls = 150 formchar = "|----|+|---+=|-/\<>*";

%macro granger(data = , dep = , indep = , nlag = );

%let lag_dep = ;
%let lag_indep = ;

data _tmp1;
  set &data (keep = &dep &indep);

  %do i = 1 %to &nlag;
  lag&i._&dep = lag&i.(&dep);
  lag&i._&indep = lag&i.(&indep);

  %let lag_dep = &lag_dep lag&i._&dep;
  %let lag_indep = &lag_indep lag&i._&indep;
  %end;
run;

proc corr data = _tmp1 noprint outp = _corr1(rename = (&dep = value) where = (_type_ = 'CORR')) nosimple;
  var &dep;
  with lag&nlag._&indep;
run;

proc corr data = _tmp1 noprint outp = _corr2(rename = (&dep = value) where = (_type_ = 'CORR')) nosimple;
  var &dep;
  with lag&nlag._&indep;
  partial lag&nlag._&dep;
run;

proc reg data = _tmp1 noprint;
  model &dep = &lag_dep;
  output out = _rest1 r = rest_e;
run;

proc reg data = _tmp1 noprint;
  model &dep = &lag_dep &lag_indep;
  output out = _full1 r = full_e;
run;

proc sql noprint;
  select sum(full_e * full_e) into :full_sse1 from _full1;

  select sum(rest_e * rest_e) into :rest_sse1 from _rest1;

  select count(*) into :n from _full1;

  select value into :cor1 from _corr1;

  select value into :cor2 from _corr2;
quit;

data _result;
  format dep $20. ind $20.;
  dep   = "&dep";
  ind   = "%upcase(&indep)";
  nlag = &nlag;

  corr1 = &cor1;
  corr2 = &cor2;

  f_test1  = ((&rest_sse1 - &full_sse1) / &nlag) / (&full_sse1 / (&n -  2 * &nlag - 1));
  p_ftest1 = 1 - probf(f_test1, &nlag, &n -  2 * &nlag - 1);

  chisq_test1 = (&n * (&rest_sse1 - &full_sse1)) / &full_sse1;
  p_chisq1    = 1 - probchi(chisq_test1, &nlag);

  format flag1 $3.;
  if max(p_ftest1, p_chisq1) < 0.01 then flag1 = "***";
  else if max(p_ftest1, p_chisq1) < 0.05 then flag1 = " **";
  else if max(p_ftest1, p_chisq1) < 0.1 then flag1 = "  *";
  else flag1 = "   ";
run;

%mend granger;

data _in1;
  set &data (keep = &y &drivers);
run;

%let var_loop = 1;

%do %while (%scan(&drivers, &var_loop) ne %str());

  %let driver = %scan(&drivers, &var_loop);

  %do lag_loop = 1 %to &max_lags;

    %granger(data = _in1, dep = &y, indep = &driver, nlag = &lag_loop);

    %if &var_loop = 1 & &lag_loop = 1 %then %do;
      data _final;
        set _result;
      run;
    %end;
    %else %do;
      data _final;
        set _final _result;
      run;
    %end;
  %end;

  %let var_loop = %eval(&var_loop + 1);
%end;

title;
proc report data  = _last_ box spacing = 1 split = "/" nowd;
  column("GRANGER CAUSALITY TEST FOR %UPCASE(&y) UPTO &MAX_LAGS LAGS"
         ind nlag corr1 corr2 f_test1 chisq_test1 flag1);

  define ind         / "DRIVERS"                width = 20 center group order order = data;
  define nlag        / "LAG"                    width = 3  format = 3.   center order order = data;
  define corr1       / "PEARSON/CORRELATION"    width = 12 format = 8.4  center;
  define corr2       / "PARTIAL/CORRELATION"    width = 12 format = 8.4  center;
  define f_test1     / "CAUSAL/F-STAT"          width = 12 format = 10.4 center;
  define chisq_test1 / "CAUSAL/CHISQ-STAT"      width = 12 format = 10.4 center;
  define flag1       / "CAUSAL/FLAG"            width = 8  right;
run;

%mend causal;

%causal(data = sashelp.citimon, y = RTRR, drivers = CCIUTC LHUR FSPCON, max_lags = 6);

Based upon the output shown below, we might tentatively conclude that LHUR (UNEMPLOYMENT RATE) with a 3-month lag is helpful in predicting RTRR (RETAIL SALES).

 ---------------------------------------------------------------------------------------
 |                     GRANGER CAUSALITY TEST FOR RTRR UPTO 6 LAGS                     |
 |                           PEARSON      PARTIAL       CAUSAL       CAUSAL      CAUSAL|
 |      DRIVERS        LAG CORRELATION  CORRELATION     F-STAT     CHISQ-STAT      FLAG|
 |-------------------------------------------------------------------------------------| 
 |       CCIUTC       |  1|    0.9852  |    0.1374  |     2.7339 |     2.7917 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  2|    0.9842  |    0.1114  |     0.7660 |     1.5867 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  3|    0.9834  |    0.0778  |     0.8186 |     2.5803 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  4|    0.9829  |    0.1047  |     0.7308 |     3.1165 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  5|    0.9825  |    0.0926  |     0.7771 |     4.2043 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  6|    0.9819  |    0.0868  |     0.7085 |     4.6695 |        | 
 |--------------------+---+------------+------------+------------+------------+--------| 
 |        LHUR        |  1|   -0.7236  |    0.0011  |     0.0000 |     0.0000 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  2|   -0.7250  |    0.0364  |     1.4136 |     2.9282 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  3|   -0.7268  |    0.0759  |     2.4246 |     7.6428 |       *| 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  4|   -0.7293  |    0.0751  |     2.1621 |     9.2208 |       *| 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  5|   -0.7312  |    0.1045  |     2.1148 |    11.4422 |       *| 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  6|   -0.7326  |    0.1365  |     1.9614 |    12.9277 |       *| 
 |--------------------+---+------------+------------+------------+------------+--------| 
 |       FSPCON       |  1|    0.9484  |    0.0431  |     0.2631 |     0.2687 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  2|    0.9481  |    0.0029  |     0.6758 |     1.3998 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  3|    0.9483  |   -0.0266  |     0.4383 |     1.3817 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  4|    0.9484  |   -0.0360  |     0.9219 |     3.9315 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  5|    0.9494  |   -0.0793  |     0.9008 |     4.8739 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  6|    0.9492  |   -0.0999  |     0.9167 |     6.0421 |        | 
 --------------------------------------------------------------------------------------- 
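
For readers working in R rather than SAS, a quick cross-check of the same pairwise hypothesis can be run with lmtest::grangertest(), which performs an F-type Wald test comparing the restricted and full lag regressions, analogous to f_test1 in the macro above. The sketch below is only illustrative: it assumes the RTRR and LHUR series have been exported from sashelp.citimon into an R data frame named citimon, which is not part of the original SAS program.

# hypothetical cross-check in R, assuming the RTRR and LHUR series from
# sashelp.citimon have been exported to a data frame named citimon
library(lmtest)

# F-test of the full model (3 lags of RTRR plus 3 lags of LHUR) against the
# restricted model (3 lags of RTRR only), i.e. whether LHUR Granger-causes RTRR
grangertest(RTRR ~ LHUR, order = 3, data = citimon)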

Information Criteria and Vuong Test

When it comes to model selection between two non-nested models, an information criterion, e.g. AIC or BIC, is often used, and the model with the lower value is preferred.

However, even with AIC or BIC, we are still unable to answer the question of whether model A is significantly better than model B in a probabilistic sense. Proposed by Quang Vuong (1989), the Vuong test deems one model better than its rival when its individual log likelihoods are significantly higher. A demonstration of the Vuong test is given below.

First of all, two models for proportional outcomes, namely TOBIT regression and NLS (Nonlinear Least Squares) regression, are estimated below. For each model, information criteria, e.g. AIC and BIC, are calculated and the likelihood of each individual record is computed. As shown in the following output, the NLS regression has a lower BIC and therefore might be considered the “better” model.
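
To make the inputs of the Vuong test concrete, the R snippet below sketches how the individual log likelihoods could be extracted on the NLS side. The logistic mean function, the data frame df, and the variables y and x are illustrative assumptions rather than the original model specification; the TOBIT side would be handled analogously with the censored normal density.

# a minimal sketch of the per-record likelihood on the NLS side, assuming a
# proportional outcome y in (0, 1) and a single predictor x in a data frame df;
# the logistic mean function and all names are illustrative assumptions only
fit_nls <- nls(y ~ 1 / (1 + exp(-(b0 + b1 * x))), data = df,
               start = list(b0 = 0, b1 = 0))

# information criteria for the NLS fit
AIC(fit_nls)
BIC(fit_nls)

# individual log likelihoods under the gaussian error assumption implied by NLS,
# with the MLE of sigma so that sum(ll_nls) matches logLik(fit_nls)
sigma_mle <- sqrt(sum(residuals(fit_nls) ^ 2) / length(residuals(fit_nls)))
ll_nls <- dnorm(df$y, mean = fitted(fit_nls), sd = sigma_mle, log = TRUE)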

Next, with the likelihood of each individual record from both models, the Vuong statistic is calculated with the formulation given below.

Vuong statistic = [LR(model1, model2) – C] / sqrt(N * V) ~ N(0, 1)

LR(…) is the summation of the individual log likelihood ratios between the 2 models. “C” is a correction term for the difference in DF (Degrees of Freedom) between the 2 models. “N” is the number of records. “V” is the variance of the individual log likelihood ratios between the 2 models. Vuong (1989) demonstrated that this statistic is asymptotically distributed as a standard normal N(0, 1). As a result, model 1 is considered better when the Vuong statistic is greater than 1.96, and model 2 is considered better when the Vuong statistic is less than -1.96.
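
As a rough translation of the formula above into code, the R sketch below computes the Vuong statistic from two vectors of individual log likelihoods. The inputs ll1, ll2 and the parameter counts k1, k2 are hypothetical placeholders, and the BIC-type term (k1 - k2) / 2 * log(N) is just one common choice for the correction “C”.

# a minimal sketch of the Vuong statistic, assuming ll1 and ll2 are vectors of
# individual log likelihoods from model 1 and model 2, and k1, k2 are the
# numbers of estimated parameters (all hypothetical inputs for illustration)
vuong_stat <- function(ll1, ll2, k1, k2) {
  n <- length(ll1)
  lr <- ll1 - ll2                        # individual log likelihood ratios
  c_adj <- (k1 - k2) / 2 * log(n)        # BIC-type correction for the DF difference
  v <- mean((lr - mean(lr)) ^ 2)         # variance of the individual ratios
  z <- (sum(lr) - c_adj) / sqrt(n * v)   # [LR - C] / sqrt(N * V)
  data.frame(z = z, p_value = 2 * pnorm(-abs(z)))
}

# e.g. vuong_stat(ll_tobit, ll_nls, k1 = 3, k2 = 3), where ll_tobit and ll_nls
# would hold the per-record log likelihoods from the two fitted models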

As shown in the output, although model 2, i.e. the NLS regression, is preferred by the lower BIC, the Vuong statistic doesn’t provide evidence that the NLS regression is significantly better than the TOBIT regression; instead, it indicates that both models are equally close to the true model.