Modeling Practices of Operational Losses in CCAR

Based upon the CCAR 2019 benchmark report published by O.R.X, 88% of survey participants that submitted results to the Fed used regression models to project operational losses, demonstrating a strong convergence. As described in the report, OLS regression under the Gaussian distribution still appears to be the most prevalent approach.

Below is a summary of modeling approaches for operational losses, based on my own experience and knowledge, that might be helpful to other banking practitioners in CCAR.

Modeling Frequency
|
|– Equi-Dispersion (Baseline)
| |
| `– Standard Poisson
|
|– Over-Dispersion
| |
| |– Negative Binomial
| |
| |– Zero-Inflated Poisson
| |
| `– Finite Mixture Poisson
|
`– Over- or Under-Dispersion
|
|– Quasi-Poisson
|
|– Hurdle Poisson
|
|– Generalized Poisson
|
|– Double Poisson
|
|– Conway-Maxwell Poisson
|
`– Hyper-Poisson
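
For a quick illustration of how several of these frequency models could be fitted in R, below is a minimal sketch with simulated data; the data frame sim and the drivers x1 and x2 are hypothetical.

library(MASS)   # glm.nb() for the Negative Binomial
library(pscl)   # zeroinfl() and hurdle() for zero-inflated and hurdle Poisson
set.seed(1)
sim <- data.frame(x1 = runif(500), x2 = runif(500))
sim$freq <- rnbinom(500, mu = exp(0.5 + 0.8 * sim$x1 - 0.5 * sim$x2), size = 2)  # over-dispersed counts
m_poi <- glm(freq ~ x1 + x2, data = sim, family = poisson)                 # standard Poisson (equi-dispersion)
m_qp  <- glm(freq ~ x1 + x2, data = sim, family = quasipoisson)            # quasi-Poisson (over- or under-dispersion)
m_nb  <- glm.nb(freq ~ x1 + x2, data = sim)                                # negative binomial (over-dispersion)
m_zip <- zeroinfl(freq ~ x1 + x2 | x1 + x2, data = sim, dist = "poisson")  # zero-inflated Poisson
m_hdl <- hurdle(freq ~ x1 + x2 | x1 + x2, data = sim, dist = "poisson")    # hurdle Poisson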

Modeling Loss / Average Loss
|
|– Loss without Zeroes
| |
| |– Gamma
| |
| `– Inverse Gaussian
|
|– Loss with Some Zeros
| |
| |– Intercept-only ZAGA (Zero-Adjusted Gamma)
| |
| |– Intercept-only ZAIG (Zero-Adjusted Inverse Gaussian)
| |
| `– Tweedie
|
|– Loss with Many Zeros
| |
| |– ZAGA (Zero-Adjusted Gamma)
| |
| `– ZAIG (Zero-Adjusted Inverse Gaussian)
|
`– Loss with Nonzero Threshold Xm
|
|– Tobit
|
|– Pareto
|
`– Shifted Response with Y' = Y – Xm or Y' = Ln(Y / Xm)
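
Similarly, a minimal R sketch of a few of the loss models above is given below with simulated data; the variable names are hypothetical, and the Tweedie power of 1.5 is only an illustrative choice.

library(statmod)   # tweedie() variance family for glm()
set.seed(1)
x <- runif(500)
loss <- rgamma(500, shape = 2, scale = exp(1 + x) / 2)                        # strictly positive losses
m_ga <- glm(loss ~ x, family = Gamma(link = "log"))                           # Gamma GLM
m_ig <- glm(loss ~ x, family = inverse.gaussian(link = "log"))                # Inverse Gaussian GLM
loss0 <- ifelse(runif(500) < 0.1, 0, loss)                                    # the same losses with some zeros
m_tw <- glm(loss0 ~ x, family = tweedie(var.power = 1.5, link.power = 0))     # Tweedie GLM with a log link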

Monotonic Binning with GBM

In addition to the monotonic binning algorithms introduced in my previous post (https://statcompute.wordpress.com/2019/03/10/a-summary-of-my-home-brew-binning-algorithms-for-scorecard-development), two more functions based on Generalized Boosted Regression Models (GBM) have been added to my GitHub repository, gbm_bin() and gbmcv_bin().

The function gbm_bin() estimates a GBM model without cross-validation and tends to generate a more granular binning outcome.

gbm_bin(df, bad, tot_derog)
# $df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 04 $X > 3 & $X <= 9 848 0.1453 0 282 0.3325 0.6593 0.0750 3.2492
# 05 $X > 9 225 0.0385 0 77 0.3422 0.7025 0.0228 0.0000
# $cuts
# [1] 1 2 3 9

The function gbmcv_bin() estimates a GBM model with cross-validation (CV) and therefore generates a more stable but coarser binning outcome. However, the computation is more expensive due to the CV, especially for large datasets.

gbmcv_bin(df, bad, tot_derog)
### OUTPUT ###
# $df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 1405 0.2407 0 445 0.3167 0.5871 0.0970 0.0000
# $cuts
# [1] 1 2

Motivated by an idea from my friend Talbot (https://www.linkedin.com/in/talbot-katz-b76785), I also drafted a function pava_bin() based upon the Pool Adjacent Violators Algorithm (PAVA) and compared it with the iso_bin() function based on isotonic regression. As shown in the comparison below, there is no difference in the binning outcome. However, the computing cost of the pava_bin() function is higher, given that PAVA is an iterative algorithm solving for monotonicity.
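
As a toy illustration of why the two approaches coincide (with made-up bad rates rather than the data above), pooling adjacent violators is exactly what the isotonic regression in stats::isoreg() does:

r <- c(0.10, 0.25, 0.20, 0.30, 0.45)   # raw bad rates with one monotonicity violation
isoreg(seq_along(r), r)$yf             # pooled, monotone rates
# [1] 0.100 0.225 0.225 0.300 0.450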

pava_bin(df, bad, tot_derog)$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 04 $X > 3 & $X <= 23 1064 0.1823 0 353 0.3318 0.6557 0.0931 0.4370
# 05 $X > 23 9 0.0015 0 6 0.6667 2.0491 0.0090 0.0000
iso_bin(df, bad, tot_derog)$df
# bin rule freq dist mv_cnt bad_freq bad_rate woe iv ks
# 00 is.na($X) 213 0.0365 213 70 0.3286 0.6416 0.0178 2.7716
# 01 $X <= 1 3741 0.6409 0 560 0.1497 -0.3811 0.0828 18.9469
# 02 $X > 1 & $X <= 2 478 0.0819 0 121 0.2531 0.2740 0.0066 16.5222
# 03 $X > 2 & $X <= 3 332 0.0569 0 86 0.2590 0.3050 0.0058 14.6321
# 04 $X > 3 & $X <= 23 1064 0.1823 0 353 0.3318 0.6557 0.0931 0.4370
# 05 $X > 23 9 0.0015 0 6 0.6667 2.0491 0.0090 0.0000

More Flexible Ordinal Outcome Models

In the previous post (https://statcompute.wordpress.com/2018/08/26/adjacent-categories-and-continuation-ratio-logit-models-for-ordinal-outcomes), we’ve shown alternative models for ordinal outcomes in addition to the commonly used Cumulative Logit model under the proportional odds assumption, which is also known as the Proportional Odds model. A potential drawback of the Proportional Odds model is its lack of flexibility due to the restrictive proportional odds assumption, the violation of which might lead to model mis-specification. As a result, Cumulative Logit models with more flexible assumptions are called for.

In the example below, I will first show how to estimate credit ratings from corporate financial performance measures with a Cumulative Logit model under the proportional odds assumption, expressed as Logit(Y <= j) = A_j – X * B, where A_j depends on the category j.

pkgs <- list("maxLik", "VGAM")
sapply(pkgs, require, character.only = T)
data(data_cr, envir = .GlobalEnv, package = "mvord")
data_cr$RATING <- pmax(data_cr$rater1, data_cr$rater2, na.rm = T)
x <- c("LR", "LEV", "PR", "RSIZE", "BETA")
# LR   : LIQUIDITY RATIO
# LEV  : LEVERAGE RATIO
# PR   : PROFITABILITY RATIO
# RSIZE: LOG OF RELATIVE SIZE
# BETA : SYSTEMATIC RISK
y <- "RATING"
df <- data_cr[!is.na(data_cr[, y]), c(x, y)]
table(df[, y]) / length(df[, y])
#         A         B         C         D         E
# 0.1047198 0.1681416 0.3023599 0.2994100 0.1253687

### CUMULATIVE LOGIT MODEL ASSUMED PROPORTIONAL ODDS ###
# BELOW IS THE SIMPLER EQUIVALENT:
# vglm(RATING ~ LR + LEV + PR + RSIZE + BETA, data = df, family = cumulative(parallel = T))

ll1 <- function(param) {
  plist <- c("a_A", "a_B", "a_C", "a_D", "b_LR", "b_LE", "b_PR", "b_RS", "b_BE")
  sapply(1:length(plist), function(i) assign(plist[i], param[i], envir = .GlobalEnv))
  XB_A <- with(df, a_A - (b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_B <- with(df, a_B - (b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_C <- with(df, a_C - (b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_D <- with(df, a_D - (b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  prob_A <- 1 / (1 + exp(-XB_A))
  prob_B <- 1 / (1 + exp(-XB_B)) - prob_A
  prob_C <- 1 / (1 + exp(-XB_C)) - prob_A - prob_B
  prob_D <- 1 / (1 + exp(-XB_D)) - prob_A - prob_B - prob_C
  prob_E <- 1 - prob_A - prob_B - prob_C - prob_D
  CAT <- data.frame(sapply(c("A", "B", "C", "D", "E"), function(x) assign(x, df[, y] == x)))
  LH <- with(CAT, (prob_A ^ A) * (prob_B ^ B) * (prob_C ^ C) * (prob_D ^ D) * (prob_E ^ E))
  return(sum(log(LH)))
}

start1 <- c(a_A = 0, a_B = 2, a_C = 3, a_D = 4, b_LR = 0, b_LE = 0, b_PR = 0, b_RS = 0, b_BE = 0)
summary(m1 <- maxLik(logLik = ll1, start = start1))
#     Estimate Std. error t value Pr(t)
#a_A  15.53765    0.77215  20.123  <2e-16 ***
#a_B  18.26195    0.84043  21.729  <2e-16 ***
#a_C  21.61729    0.94804  22.802  <2e-16 ***
#a_D  25.88787    1.10522  23.423  <2e-16 ***
#b_LR  0.29070    0.11657   2.494  0.0126 *
#b_LE  0.83977    0.07220  11.631  <2e-16 ***
#b_PR -5.10955    0.35531 -14.381  <2e-16 ***
#b_RS -2.18552    0.09982 -21.895  <2e-16 ***
#b_BE  3.26811    0.21696  15.063  <2e-16 ***

In the above output, the attribute “liquidity ratio” (LR) is somewhat less significant than the others, implying a potential opportunity for further improvement by relaxing the proportional odds assumption. As a result, I will try a different class of Cumulative Logit models, namely (unconstrained) Partial-Proportional Odds models, which allow non-proportional odds for a subset of model attributes, e.g. LR in our case. The formulation now becomes Logit(Y <= j) = A_j – X * B – Z * G_j, where both A_j and G_j vary by the category j.

### CUMULATIVE LOGIT MODEL ASSUMED UNCONSTRAINED PARTIAL-PROPORTIONAL ODDS ###
# BELOW IS THE SIMPLER EQUIVALENT:
# vglm(RATING ~ LR + LEV + PR + RSIZE + BETA, data = df, family = cumulative(parallel = F ~ LR))

ll2 <- function(param) {
  plist <- c("a_A", "a_B", "a_C", "a_D", "b_LRA", "b_LRB", "b_LRC", "b_LRD", "b_LE", "b_PR", "b_RS", "b_BE")
  sapply(1:length(plist), function(i) assign(plist[i], param[i], envir = .GlobalEnv))
  XB_A <- with(df, a_A - (b_LRA * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_B <- with(df, a_B - (b_LRB * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_C <- with(df, a_C - (b_LRC * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_D <- with(df, a_D - (b_LRD * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  prob_A <- 1 / (1 + exp(-XB_A))
  prob_B <- 1 / (1 + exp(-XB_B)) - prob_A
  prob_C <- 1 / (1 + exp(-XB_C)) - prob_A - prob_B
  prob_D <- 1 / (1 + exp(-XB_D)) - prob_A - prob_B - prob_C
  prob_E <- 1 - prob_A - prob_B - prob_C - prob_D
  CAT <- data.frame(sapply(c("A", "B", "C", "D", "E"), function(x) assign(x, df[, y] == x)))
  LH <- with(CAT, (prob_A ^ A) * (prob_B ^ B) * (prob_C ^ C) * (prob_D ^ D) * (prob_E ^ E))
  return(sum(log(LH)))
}

start2 <- c(a_A = 0.1, a_B = 0.2, a_C = 0.3, a_D = 0.4, b_LRA = 0, b_LRB = 0, b_LRC = 0, b_LRD = 0, b_LE = 0, b_PR = 0, b_RS = 0, b_BE = 0)
summary(m2 <- maxLik(logLik = ll2, start = start2))
#Estimates:
#      Estimate Std. error t value Pr(t)
#a_A   15.30082    0.83936  18.229  <2e-16 ***
#a_B   18.14795    0.81325  22.315  <2e-16 ***
#a_C   21.72469    0.89956  24.150  <2e-16 ***
#a_D   25.92697    1.07749  24.062  <2e-16 ***
#b_LRA  0.12442    0.30978   0.402  0.6880
#b_LRB  0.21127    0.20762   1.018  0.3089
#b_LRC  0.36097    0.16687   2.163  0.0305 *
#b_LRD  0.31404    0.22090   1.422  0.1551
#b_LE   0.83949    0.07155  11.733  <2e-16 ***
#b_PR  -5.09891    0.35249 -14.465  <2e-16 ***
#b_RS  -2.18589    0.09540 -22.913  <2e-16 ***
#b_BE   3.26529    0.20993  15.554  <2e-16 ***

As shown above, under the partial-proportional odds assumption, there are 4 parameters estimated for LR, three of which are not significant, and therefore the additional flexibility is not justified. In fact, the AIC of the 2nd model (AIC = 1103.60) is even higher than the AIC of the 1st model (AIC = 1098.18).

In light of the above observation, I will introduce the 3rd model, which is known as the Constrained Partial-Proportional Odds model and expressed as Logit(Y <= j) = A_j – X * B – Z * G * gamma_j, where A_j and gamma_j vary by the category j. It is worth pointing out that gamma_j is a pre-specified fixed scalar and does not need to be estimated. Based on the unconstrained model outcome, we can set gamma_1 = 1, gamma_2 = 2, and gamma_3 = gamma_4 = 3 for LR in our case.

### CUMULATIVE LOGIT MODEL ASSUMED CONSTRAINED PARTIAL-PROPORTIONAL ODDS ###

ll3 <- function(param) {
  plist <- c("a_A", "a_B", "a_C", "a_D", "b_LR", "b_LE", "b_PR", "b_RS", "b_BE")
  sapply(1:length(plist), function(i) assign(plist[i], param[i], envir = .GlobalEnv))
  gamma <- c(1, 2, 3, 3)
  XB_A <- with(df, a_A - (gamma[1] * b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_B <- with(df, a_B - (gamma[2] * b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_C <- with(df, a_C - (gamma[3] * b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_D <- with(df, a_D - (gamma[4] * b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  prob_A <- 1 / (1 + exp(-XB_A))
  prob_B <- 1 / (1 + exp(-XB_B)) - prob_A
  prob_C <- 1 / (1 + exp(-XB_C)) - prob_A - prob_B
  prob_D <- 1 / (1 + exp(-XB_D)) - prob_A - prob_B - prob_C
  prob_E <- 1 - prob_A - prob_B - prob_C - prob_D
  CAT <- data.frame(sapply(c("A", "B", "C", "D", "E"), function(x) assign(x, df[, y] == x)))
  LH <- with(CAT, (prob_A ^ A) * (prob_B ^ B) * (prob_C ^ C) * (prob_D ^ D) * (prob_E ^ E))
  return(sum(log(LH)))
}

start3 <- c(a_A = 1, a_B = 2, a_C = 3, a_D = 4, b_LR = 0.1, b_LE = 0, b_PR = 0, b_RS = 0, b_BE = 0)
summary(m3 <- maxLik(logLik = ll3, start = start3))
#Estimates:
#     Estimate Std. error t value Pr(t)
#a_A  15.29442    0.60659  25.214 < 2e-16 ***
#a_B  18.18220    0.65734  27.660 < 2e-16 ***
#a_C  21.70599    0.75181  28.872 < 2e-16 ***
#a_D  25.98491    0.88104  29.493 < 2e-16 ***
#b_LR  0.11351    0.04302   2.638 0.00833 **
#b_LE  0.84012    0.06939  12.107 < 2e-16 ***
#b_PR -5.10025    0.33481 -15.233 < 2e-16 ***
#b_RS -2.18708    0.08118 -26.940 < 2e-16 ***
#b_BE  3.26689    0.19958  16.369 < 2e-16 ***

As shown above, after the introduction of gamma_j as the constrained scalar, the statistical significance of LR has improved, with a slightly lower AIC = 1097.64.

To be complete, I’d like to mention the last model today, which is named the Stereotype model. The idea of Stereotype models is very similar to that of adjacent-categories models and is to estimate Log(P(Y = j) / P(Y = j + 1)) or, more often, Log(P(Y = j) / P(Y = C)), where C represents a baseline category. However, the right-hand side is expressed as Log(…) = A_j – (X * B) * phi_j, where phi_j is a scaling parameter such that 1 = phi_1 > phi_2 > … > phi_J = 0. As a result, the coefficient of each model attribute could also vary by the category j, introducing more flexibility at the cost of being more difficult to estimate.

### STEREOTYPE MODEL ###
# BELOW IS THE SIMPLER EQUIVALENT:
# rrvglm(sapply(c("A", "B", "C", "D", "E"), function(x) df[, y] == x)~ LR + LEV + PR + RSIZE + BETA, multinomial, data = df)

ll4 <- function(param) {
  plist <- c("a_A", "a_B", "a_C", "a_D", "b_LR", "b_LE", "b_PR", "b_RS", "b_BE", "phi_B", "phi_C", "phi_D")
  sapply(1:length(plist), function(i) assign(plist[i], param[i], envir = .GlobalEnv))
  XB_A <- with(df, a_A - (b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_B <- with(df, a_B - phi_B * (b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_C <- with(df, a_C - phi_C * (b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  XB_D <- with(df, a_D - phi_D * (b_LR * LR + b_LE * LEV + b_PR * PR + b_RS * RSIZE + b_BE * BETA))
  prob_A <- exp(XB_A) / (exp(XB_A) + exp(XB_B) + exp(XB_C) + exp(XB_D) + 1)
  prob_B <- exp(XB_B) / (exp(XB_A) + exp(XB_B) + exp(XB_C) + exp(XB_D) + 1)
  prob_C <- exp(XB_C) / (exp(XB_A) + exp(XB_B) + exp(XB_C) + exp(XB_D) + 1)
  prob_D <- exp(XB_D) / (exp(XB_A) + exp(XB_B) + exp(XB_C) + exp(XB_D) + 1)
  prob_E <- 1 - prob_A - prob_B - prob_C - prob_D
  CAT <- data.frame(sapply(c("A", "B", "C", "D", "E"), function(x) assign(x, df[, y] == x)))
  LH <- with(CAT, (prob_A ^ A) * (prob_B ^ B) * (prob_C ^ C) * (prob_D ^ D) * (prob_E ^ E))
  return(sum(log(LH)))
}

start4 <- c(a_A = 1, a_B = 2, a_C = 3, a_D = 4, b_LR = 0.1, b_LE = 0, b_PR = 0, b_RS = 0, b_BE = 0, phi_B = 0.1, phi_C = 0.2, phi_D = 0.3)
summary(m4 <- maxLik(logLik = ll4, start = start4))
#Estimates:
#       Estimate Std. error t value Pr(t)
#a_A    67.73429    2.37424  28.529  <2e-16 ***
#a_B    55.86469    1.94442  28.731  <2e-16 ***
#a_C    41.27477    1.47960  27.896  <2e-16 ***
#a_D    22.24244    1.83137  12.145  <2e-16 ***
#b_LR    0.86975    0.37481   2.320  0.0203 *
#b_LE    2.79215    0.23373  11.946  <2e-16 ***
#b_PR  -16.66836    1.17569 -14.178  <2e-16 ***
#b_RS   -7.24921    0.33460 -21.665  <2e-16 ***
#b_BE   10.57411    0.72796  14.526  <2e-16 ***
#phi_B   0.77172    0.03155  24.461  <2e-16 ***
#phi_C   0.52806    0.03187  16.568  <2e-16 ***
#phi_D   0.26040    0.02889   9.013  <2e-16 ***

Ordered Probit Model and Price Movements of High-Frequency Trades

The analysis of high-frequency stock transactions has played an important role in algorithmic trading, and the results can be used to monitor stock movements and to develop trading strategies. In the paper “An Ordered Probit Analysis of Transaction Stock Prices” (1992), Hausman, Lo, and MacKinlay discussed estimating trade-by-trade stock price changes with the ordered probit model by incorporating potential model drivers, including previous price changes, trading volumes, and the time between consecutive trades. Following the same logic, Tsay demonstrated how to employ the ordered probit model to project price movements of high-frequency stock trades in his book “An Introduction to Analysis of Financial Data with R” (2013).

The exercise below simply mimics the analysis shown in Chapter 6 of Tsay’s book. Please note that the output of the rms::orm() function differs slightly from that of MASS::polr() used in the book due to a different parameterization. Otherwise, results are largely consistent.


cat = read.table("Downloads/chap6/taq-cat-t-jan042010.txt", header = T)

### CALCULATE PRICE DIFFERENCE ###
pchg = cat$price[2:nrow(cat)] - cat$price[1:(nrow(cat) - 1)]

### CATEGORIZE PRICE CHANGES ###
cchg = as.factor(memisc::cases((pchg < -0.01) -> 1,
                               (pchg >= -0.01 & pchg < 0) -> 2,
                               (pchg == 0) -> 3,
                               (pchg > 0 & pchg <= 0.01) -> 4,
                               (pchg > 0.01) -> 5))

### PLOT HISTOGRAM OF PRICE CHANGES ###
barplot(table(cchg) / length(cchg), space = 0, col = "gray", border = NA, main = "Distribution of Price Changes", xlab = "Price Movements")

[Figure: Distribution of Price Changes]

From the histogram above, it is interesting to see that the distribution of price movements looks very symmetrical and centered around zero and that price changes between consecutive trades are mostly within the range of 1 to 2 cents.


y_raw = pchg[4:length(cchg)]
y = cchg[4:length(cchg)]

### CREATE LAGGED Y AS MODEL PREDICTORS ###
y1 = cchg[3:(length(y) + 2)]
y2 = cchg[2:(length(y) + 1)]

### CREATE LAGGED PRICE CHANGES AS MODEL PREDICTORS ###
pch1 = pchg[3:(length(y) + 2)]
pch2 = pchg[2:(length(y) + 1)]
pch3 = pchg[1:length(y)]

### CREATE LAGGED TRADING VOLUME AS MODEL PREDICTORS ###
vol1 = cat$size[4:(3 + length(y))] / 100
vol2 = cat$size[3:(2 + length(y))] / 100

### CREATE LAGGED SECONDS BETWEEN TRADES AS MODEL PREDICTORS ###
cat$time = strptime(paste(sprintf("%02d", cat$hour), sprintf("%02d", cat$minute), sprintf("%02d", cat$second), sep = ':'), "%H:%M:%S")
tdif = as.numeric(difftime(cat$time[-1], cat$time[-length(cat$time)]))
tdif1 = tdif[3:(length(y) + 2)]
tdif2 = tdif[2:(length(y) + 1)]

df = data.frame(y, y1, y2, vol1, vol2, tdif1, tdif2, pch1, pch2, pch3)

### VOL1 / TDIF1 / TDIF2 ARE NOT SIGNIFICANT ###
m1 = rms::orm(y ~ y1 + y2 + pch1 + pch2 + pch3 + vol1 + vol2 + tdif1 + tdif2, data = df, family = probit)
#       Coef     S.E.   Wald Z Pr(>|Z|)
# vol1    0.0011 0.0012   0.88 0.3775  
# tdif1  -0.0030 0.0034  -0.88 0.3783  
# tdif2  -0.0018 0.0035  -0.52 0.6058

### REFIT THE MODEL WITH SIGNIFICANT DRIVERS ###
m2 = update(m1, y ~ y1 + y2 + pch1 + pch2 + pch3 + vol2)

### PREDICT PROBABILITY OF EACH CATEGORY ###
head(predict(m1, type = "fitted.ind"), 3)
#          y=1        y=2       y=3        y=4         y=5
#1 0.017586540 0.08172596 0.6655605 0.17209486 0.063032101
#2 0.098890397 0.22135286 0.6180407 0.05228561 0.009430461
#3 0.001268321 0.01270428 0.4104822 0.30700447 0.268540702

### PREDICT CUMULATIVE PROBABILITY OF EACH CATEGORY ###
head(predict(m2, type = "fitted"), 3)
#       y>=2      y>=3       y>=4        y>=5
#1 0.9824135 0.9006875 0.23512696 0.063032101
#2 0.9011096 0.6797567 0.06171607 0.009430461
#3 0.9987317 0.9860274 0.57554517 0.268540702

### MODEL ACCURACY ASSESSMENT FOR PREDICTING PRICE INCREASES ###
pROC::roc(ifelse(y_raw > 0, 1, 0), predict(m2, type = "fitted")[, 3])
# Area under the curve: 0.6994

par(mfrow = c(2, 1))
ts.plot(y_raw, main = "Price Changes", ylab = "Price Changes")
ts.plot(predict(m2, type = "fitted")[, 3], main = "Probability of Price Increase", ylab = "Probability")

Co-integration and Pairs Trading

Co-integration is an important statistical concept behind the statistical arbitrage strategy known as “Pairs Trading”. While projecting a stock price with time series models is by all means difficult, it is technically feasible to find a pair of (or even a portfolio of) stocks sharing a common trend such that a linear combination of the two series is stationary, which is the so-called co-integration. The underlying logic of Pairs Trading is to monitor the movements of co-integrated stocks and to look for trading opportunities when a divergence presents itself. Under the mean-reversion assumption, the stock prices would tend to move back to the long-term equilibrium and, as a result, the spread between two co-integrated stock prices would eventually converge. Furthermore, given the stationarity of the spread between co-integrated stocks, it becomes possible to forecast such a spread with time series models.
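
As a simple sketch of the last point (with simulated series rather than actual stock prices), once the co-integrating relation is estimated, the spread can be tested for stationarity and then forecasted with a standard time series model:

library(urca)
set.seed(1)
n <- 500
trend <- cumsum(rnorm(n))                            # common stochastic trend
p1 <- trend + rnorm(n, sd = 0.5)                     # two co-integrated "price" series
p2 <- 0.8 * trend + rnorm(n, sd = 0.5)
spread <- residuals(lm(p1 ~ p2))                     # estimated co-integrating spread
summary(ur.df(spread, type = "drift", lags = 1))     # ADF-type test: the spread should be stationary
predict(arima(spread, order = c(1, 0, 0)), n.ahead = 5)   # forecast the mean-reverting spread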

Below is an R utility function that helps identify pairwise co-integrations, based upon the Johansen Test, out of an arbitrary number of stock prices provided in a list of tickers.

For instance, based on a starting date of 2010-01-01 and a list of tickers for major US banks, we are able to identify 23 pairs of co-integrated stock prices out of 78 pairwise combinations. It is interesting to see that the stock prices of two regional players, e.g. Fifth Third and M&T, are highly co-integrated, as visualized in the chart below.


pkgs <- list("quantmod", "doParallel", "foreach", "urca")
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

jtest <- function(t1, t2) {
  start <- sd
  getSymbols(t1, from = start)
  getSymbols(t2, from = start)
  j <- summary(ca.jo(cbind(get(t1)[, 6], get(t2)[, 6])))
  r <- data.frame(stock1 = t1, stock2 = t2, stat = j@teststat[2])
  r[, c("pct10", "pct5", "pct1")] <- j@cval[2, ]
  return(r)
}

pair <- function(lst) {
  d2 <- data.frame(t(combn(lst, 2)))
  stat <- foreach(i = 1:nrow(d2), .combine = rbind) %dopar% jtest(as.character(d2[i, 1]), as.character(d2[i, 2]))
  stat <- stat[order(-stat$stat), ]
  # FLAG SIGNIFICANCE AGAINST THE JOHANSEN CRITICAL VALUES
  # (THE ORIGINAL PIECE GENERATING * COULDN'T BE DISPLAYED PROPERLY IN WORDPRESS; RECONSTRUCTED FROM THE OUTPUT BELOW)
  stat$coint <- ifelse(stat$stat > stat$pct1, "***", ifelse(stat$stat > stat$pct5, "**", ifelse(stat$stat > stat$pct10, "*", "")))
  rownames(stat) <- NULL
  return(stat)
}

sd <- "2010-01-01"
tickers <- c("FITB", "BBT", "MTB", "STI", "PNC", "HBAN", "CMA", "USB", "KEY", "JPM", "C", "BAC", "WFC")
pair(tickers)

   stock1 stock2      stat pct10 pct5  pct1 coint
1     STI    JPM 27.207462 12.91 14.9 19.19  ***
2    FITB    MTB 21.514142 12.91 14.9 19.19  ***
3     MTB    KEY 20.760885 12.91 14.9 19.19  ***
4    HBAN    KEY 19.247719 12.91 14.9 19.19  ***
5       C    BAC 18.573168 12.91 14.9 19.19   **
6    HBAN    JPM 18.019051 12.91 14.9 19.19   **
7    FITB    BAC 17.490536 12.91 14.9 19.19   **
8     PNC   HBAN 16.959451 12.91 14.9 19.19   **
9    FITB    BBT 16.727097 12.91 14.9 19.19   **
10    MTB   HBAN 15.852456 12.91 14.9 19.19   **
11    PNC    JPM 15.822610 12.91 14.9 19.19   **
12    CMA    BAC 15.685086 12.91 14.9 19.19   **
13   HBAN    BAC 15.446149 12.91 14.9 19.19   **
14    BBT    MTB 15.256334 12.91 14.9 19.19   **
15    MTB    JPM 15.178646 12.91 14.9 19.19   **
16    BBT   HBAN 14.808770 12.91 14.9 19.19    *
17    KEY    BAC 14.576440 12.91 14.9 19.19    *
18   FITB    JPM 14.272424 12.91 14.9 19.19    *
19    STI    BAC 14.253971 12.91 14.9 19.19    *
20   FITB    PNC 14.215647 12.91 14.9 19.19    *
21    MTB    BAC 13.891615 12.91 14.9 19.19    *
22    MTB    PNC 13.668863 12.91 14.9 19.19    *
23    KEY    JPM 12.952239 12.91 14.9 19.19    *

[Charts: price series of the co-integrated pair FITB (Fifth Third) and MTB (M&T)]

SAS Implementation of ZAGA Models

In the previous post (https://statcompute.wordpress.com/2017/09/17/model-non-negative-numeric-outcomes-with-zeros/), I gave a brief introduction to the ZAGA (Zero-Adjusted Gamma) model, which provides a very flexible approach to modeling non-negative numeric responses. Today, I will show how to implement the ZAGA model in SAS, which can be done either jointly or in two steps.

In SAS, the FMM procedure provides a very convenient interface to estimate the ZAGA model in one simple step. As shown, there are two MODEL statements: the first estimates a Gamma sub-model for positive outcomes and the second separates the point mass at zero from the positive outcomes. The subsequent PROBMODEL statement is then employed to estimate the probability of a record being positive.


data ds;
  set "/folders/myfolders/autoclaim" (keep = clm_amt bluebook npolicy clm_freq5 mvr_pts income);
  where income ~= .;
  clm_flg = (clm_amt > 0);
run;

proc fmm data = ds tech = trureg;
  model clm_amt = bluebook npolicy / dist = gamma;
  model clm_amt = / dist = constant;
  probmodel clm_freq5 mvr_pts income;
run;

An alternative way to develop a ZAGA model in two steps is to first estimate a logistic regression separating the point mass at zero from the positive outcomes and then to estimate a Gamma regression with positive outcomes only, as illustrated below. The two-step approach is more intuitive to understand and, more importantly, easier to implement without the convergence issues seen in the FMM or NLMIXED procedures.


proc logistic data = ds desc;
  model clm_flg = clm_freq5 mvr_pts income;
run;

proc genmod data = ds;
  where clm_flg = 1;
  model clm_amt = bluebook npolicy / link = log dist = gamma;
run;

LogRatio Regression – A Simple Way to Model Compositional Data

Compositional data are proportions of mutually exclusive groups that sum up to unity. Statistical models for compositional data are applicable in a number of areas, e.g. the product or channel mix in marketing research and the asset allocation of an investment portfolio.

In the example below, I will show how to model compositional outcomes with a simple LogRatio regression. The underlying idea is very simple. With the D-dimensional outcome [p_1, p_2, …, p_D], we can derive a (D-1)-dimensional outcome [log(p_2 / p_1), …, log(p_D / p_1)] and then estimate a multivariate regression based on the new outcome.

df = get("ArcticLake", envir = asNamespace('DirichletReg'))

#   sand  silt  clay depth
#1 0.775 0.195 0.030  10.4
#2 0.719 0.249 0.032  11.7
#3 0.507 0.361 0.132  12.8

lm(cbind(log(silt / sand), log(clay / sand)) ~ depth, data = df)

#Response log(silt/sand):
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.649656   0.236733  -2.744   0.0093 **
#depth        0.037522   0.004269   8.790 1.36e-10 ***
#
#Response log(clay/sand) :
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -2.614897   0.421383  -6.206 3.31e-07 ***
#depth        0.062181   0.007598   8.184 8.00e-10 ***

Since log(x / y) = log(x) – log(y), we can also estimate the model with log(sand) as an offset term.


lm(cbind(log(silt), log(clay)) ~ depth + offset(log(sand)), data = df)

#Response log(silt) :
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.649656   0.236733  -2.744   0.0093 **
#depth        0.037522   0.004269   8.790 1.36e-10 ***
#
#Response log(clay) :
#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
#(Intercept) -2.614897   0.421383  -6.206 3.31e-07 ***
#depth        0.062181   0.007598   8.184 8.00e-10 ***

Alternatively, we can also use the comp.reg function in the Compositional package.


Compositional::comp.reg(as.matrix(df[, 1:3]), df[, 4])

#$be
#                   [,1]        [,2]
#(Intercept) -0.64965598 -2.61489731
#x            0.03752186  0.06218069
#
#$seb
#                   [,1]        [,2]
#(Intercept) 0.236733203 0.421382652
#x           0.004268588 0.007598043

MLE in R

When I learn and experiment with a new model, I always like to start with its likelihood function in order to gain a better understanding of its statistical nature. That’s why I have extensively used the SAS/NLMIXED procedure, which gives me more flexibility. Today, I spent a couple of hours playing with the optim() function and its wrappers, e.g. mle() and mle2(), in case I might need a replacement for my favorite NLMIXED in model estimation. Overall, I feel that optim() is more flexible. The named list required by mle() or mle2() for initial parameter values is somewhat cumbersome without additional benefits. As shown in the benchmark below, optim() is also the most efficient.


library(COUNT)
library(stats4)
library(bbmle)
data(rwm1984)
attach(rwm1984)

### OPTIM() ###
LogLike1 <- function(par) {
  xb <- par[1] + par[2] * outwork + par[3] * age + par[4] * female + par[5] * married 
  mu <- exp(xb)
  ll <- sum(log(exp(-mu) * (mu ^ docvis) / factorial(docvis)))
  return(-ll)
}
fit1 <- optim(rep(0, 5), LogLike1, hessian = TRUE, method = "BFGS")
std1 <- sqrt(diag(solve(fit1$hessian)))
est1 <- data.frame(beta = fit1$par, stder = std1, z_values = fit1$par / std1)
#         beta        stder  z_values
#1 -0.06469676 0.0433207574 -1.493436
#2  0.27264177 0.0214085110 12.735205
#3  0.02283541 0.0008394589 27.202540
#4  0.27461355 0.0210597539 13.039732
#5 -0.11804504 0.0217745647 -5.421236

### MLE() ###
LogLike2 <- function(b0, b1, b2, b3, b4) {
  mu <- exp(b0 + b1 * outwork + b2 * age + b3 * female + b4 * married)
  -sum(log(exp(-mu) * (mu ^ docvis) / factorial(docvis)))
}
inits <- list(b0 = 0, b1 = 0, b2 = 0, b3 = 0, b4 = 0)
fit2 <- mle(LogLike2, method = "BFGS", start = inits)
std2 <- sqrt(diag(vcov(fit2)))
est2 <- data.frame(beta = coef(fit2), stder = std2, z_values = coef(fit2) / std2)
#          beta        stder  z_values
#b0 -0.06469676 0.0433417474 -1.492712
#b1  0.27264177 0.0214081592 12.735414
#b2  0.02283541 0.0008403589 27.173407
#b3  0.27461355 0.0210597350 13.039744
#b4 -0.11804504 0.0217746108 -5.421224

### BENCHMARKS ###
microbenchmark::microbenchmark(
  "optim" = {optim(rep(0, 5), LogLike1, hessian = TRUE, method = "BFGS")},
  "mle"   = {mle(LogLike2, method = "BFGS", start = inits)},
  "mle2"  = {mle2(LogLike2, method = "BFGS", start = inits)},
  times = 10
)
#  expr      min       lq     mean   median       uq      max neval
# optim 280.4829 280.7902 296.9538 284.5886 318.6975 320.5094    10
#   mle 283.6701 286.3797 302.9257 289.8849 327.1047 328.6255    10
#  mle2 387.1912 390.8239 407.5090 392.8134 427.0569 467.0013    10

Modeling Dollar Amounts in Regression Setting

After switching roles from credit risk to operational risk in 2015, I spent countless weekend hours in Starbucks researching how to model operational losses in the regression setting in light of the heightened scrutiny. While I feel very comfortable with various frequency models, how to model severity and loss remains challenging both conceptually and empirically. The same challenge also holds true for modeling other financial measures in dollar amounts, such as balance, profit, or cost.

Most practitioners still prefer modeling severity and loss under the Gaussian distributional assumption explicitly or implicitly. In practice, there are 3 commonly used approaches, as elaborated below.

– First of all, the simple OLS regression modeling severity and loss directly without any transformation remains the number one choice due to its simplicity. Given the inconsistency between the empirical data range and the conceptual domain of a Gaussian distribution, it is evident that this approach is problematic.

– Secondly, the OLS regression modeling LOG-transformed severity and loss under the Lognormal distributional assumption is also a common approach. In this method, Log(Y) instead of Y is estimated. However, given E(Log(Y)|X) != Log(E(Y|X)), an estimation bias is introduced and therefore should be corrected by adding MSE / 2 to the predicted log value, as sketched after this list. In addition, the positive domain of a Lognormal might not work well in cases of losses with a lower bound that can be either zero or a known threshold value.

– Lastly, the Tobit regression under the censored Normal distribution seems a viable solution that supports the non-negative or above-threshold values seen in severity or loss measures. Nonetheless, the censorship assumption itself is questionable, given that the unobservability of negative or below-threshold values is not due to censoring but is attributable to the nature of the data collection process. Therefore, the argument for data censorship is not well supported.
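
Below is a quick sketch of the bias correction mentioned in the second point, with simulated Lognormal data:

set.seed(1)
x <- runif(1000)
y <- exp(1 + 2 * x + rnorm(1000, sd = 0.5))       # Lognormal losses
m <- lm(log(y) ~ x)
mse <- sum(residuals(m) ^ 2) / df.residual(m)
pred_naive <- exp(fitted(m))                      # biased: understates E(Y|X)
pred_adj <- exp(fitted(m) + mse / 2)              # corrected by MSE / 2
c(mean(y), mean(pred_naive), mean(pred_adj))      # the corrected prediction is much closer on average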

Considering the aforementioned challenges, I investigated and experimented with various approaches given different data characteristics observed empirically.

– In cases of severity or loss observed in the range of (0, inf), a GLM under the Gamma or Inverse Gaussian distributional assumption can be considered (https://statcompute.wordpress.com/2015/08/16/some-considerations-of-modeling-severity-in-operational-losses). In addition, the mean-variance relationship can be employed to assess the appropriateness of the distribution by either the modified Park test (https://statcompute.wordpress.com/2016/11/20/modified-park-test-in-sas), sketched after this list, or the value of the power parameter in the Tweedie distribution (https://statcompute.wordpress.com/2017/06/24/using-tweedie-parameter-to-identify-distributions).

– In cases of severity or loss observed in the range of [alpha, inf) with alpha being positive, a regression under the Type-I Pareto distribution (https://statcompute.wordpress.com/2016/12/11/estimate-regression-with-type-i-pareto-response) can be considered. However, there is a caveat that the conditional mean only exists when the shape parameter is larger than 1.

– In cases of severity or loss observed in the range of [0, inf) with a small number of zeros, a regression under the Lomax distribution (https://statcompute.wordpress.com/2016/11/13/parameter-estimation-of-pareto-type-ii-distribution-with-nlmixed-in-sas) or the Tweedie distribution (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm) can be considered. For the Lomax model, it is worth pointing out that the shape parameter alpha has to be larger than 2 in order to have both the mean and the variance defined.

– In cases of severity or loss observed in the range of [0, inf) with many zeros, a ZAGA or ZAIG model (https://statcompute.wordpress.com/2017/09/17/model-non-negative-numeric-outcomes-with-zeros) can be considered by assuming the measure is governed by a mixture of a point mass at zero and a standard Gamma or Inverse Gaussian distribution. As a result, a ZA model consists of 2 sub-models, a nu model separating zeros from positive values and a mu model estimating the conditional mean of positive values.
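
For reference, below is a minimal sketch of the modified Park test mentioned in the first point, with simulated Gamma data and illustrative names; the slope on log(fitted) approximates the power in the mean-variance relationship (about 1 for Poisson-like, 2 for Gamma, 3 for Inverse Gaussian):

set.seed(1)
x <- runif(1000)
y <- rgamma(1000, shape = 2, scale = exp(1 + 0.5 * x) / 2)       # Gamma outcome, variance ~ mean ^ 2
m <- glm(y ~ x, family = Gamma(link = "log"))
r2 <- (y - fitted(m)) ^ 2                                        # squared raw residuals
park <- glm(r2 ~ log(fitted(m)), family = quasi(link = "log", variance = "mu^2"))
summary(park)$coefficients["log(fitted(m))", ]                   # slope should be close to 2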

Model Non-Negative Numeric Outcomes with Zeros

As mentioned in the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm/), we often need to model non-negative numeric outcomes with zeros in operational loss model development. The Tweedie GLM provides a convenient interface to model non-negative losses directly by assuming that aggregated losses are a Poisson sum of Gamma outcomes, which, however, might not be well supported empirically from the data generation standpoint.

In the examples below, we demonstrate another flexible option, namely Zero-Adjusted (ZA) models, in two scenarios of modeling non-negative numeric outcomes, one with a small number of zeros and the other with a large number of zeros. The basic idea of ZA models is very intuitive and similar to the concept of Hurdle models for count outcomes. In a nutshell, non-negative numeric outcomes can be considered the result of two data generation processes, one for the point mass at zero and the other governed by a statistical distribution for positive outcomes. The latter could be either Gamma or Inverse Gaussian.

First of all, we sampled down an auto-claim dataset such that only 10 claims are zero and the rest are all positive. While 10 is an arbitrary choice in the example, other small numbers should show similar results.

pkgs <- list("cplm", "gamlss", "MLmetrics")
lapply(pkgs, require, character.only = T)

data(AutoClaim, package = "cplm")
df1 <- na.omit(AutoClaim)

# SMALL NUMBER OF ZEROS
set.seed(2017)
smp <- sample(seq(nrow(df1[df1$CLM_AMT == 0, ])), size = 10, replace = FALSE)
df2 <- rbind(df1[df1$CLM_AMT > 0, ], df1[df1$CLM_AMT == 0, ][smp, ])

Next, we applied both Tweedie and Zero-Adjusted Gamma (ZAGA) models to the data with only 10 zero outcomes. It is worth mentioning that ZAGA doesn’t have to be overly complex in this case. As shown below, while we estimated the Gamma Mu parameter with model attributes, the Nu parameter separating zeros is just a constant with the intercept = -5.4. Both Tweedie and ZAGA models gave very similar estimated parameters and predictive measures, with MAPE = 0.61.

tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2)
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.194e+00  7.234e-02 113.277  < 2e-16 ***
# BLUEBOOK    2.047e-05  3.068e-06   6.671 3.21e-11 ***
# NPOLICY     7.274e-02  3.102e-02   2.345   0.0191 *  

MAPE(df2$CLM_AMT, fitted(tw))
# 0.6053669

zaga0 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, data = df2, family = "ZAGA")
# Mu Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.203e+00  4.671e-02 175.629  < 2e-16 ***
# BLUEBOOK    2.053e-05  2.090e-06   9.821  < 2e-16 ***
# NPOLICY     6.948e-02  2.057e-02   3.377 0.000746 ***
# Nu Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -5.3886     0.3169     -17   <2e-16 ***

MAPE(df2$CLM_AMT, (1 - fitted(zaga0, what = "nu")) * fitted(zaga0, what = "mu"))
# 0.6053314

In the next case, we used the full data with a large number of zeros in the response and then applied both Tweedie and ZAGA models again. However, in the ZAGA model, we estimated two sub-models this time, one for the Nu parameter separating zeros from non-zeros and the other for the Mu parameter modeling non-zero outcomes. As shown below, ZAGA outperformed Tweedie in terms of MAPE due to the advantage that ZAGA is able to explain the two data generation schemes separately with different model attributes, a capability beyond what Tweedie can provide.

# LARGE NUMBER OF ZEROS
tw <- cpglm(CLM_AMT ~ BLUEBOOK + NPOLICY + CLM_FREQ5 + MVR_PTS + INCOME, data = df1)
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  6.854e+00  1.067e-01  64.241  < 2e-16 ***
# BLUEBOOK     1.332e-05  4.495e-06   2.963  0.00305 ** 
# NPOLICY      4.380e-02  3.664e-02   1.195  0.23196    
# CLM_FREQ5    2.064e-01  2.937e-02   7.026 2.29e-12 ***
# MVR_PTS      1.066e-01  1.510e-02   7.063 1.76e-12 ***
# INCOME      -4.606e-06  8.612e-07  -5.348 9.12e-08 ***

MAPE(df1$CLM_AMT, fitted(tw))
# 1.484484

zaga1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAGA")
# Mu Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.203e+00  4.682e-02 175.218  < 2e-16 ***
# BLUEBOOK    2.053e-05  2.091e-06   9.816  < 2e-16 ***
# NPOLICY     6.948e-02  2.067e-02   3.362 0.000778 ***
# Nu Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  1.153e+00  5.077e-02   22.72   <2e-16 ***
# CLM_FREQ5   -3.028e-01  2.283e-02  -13.26   <2e-16 ***
# MVR_PTS     -1.509e-01  1.217e-02  -12.41   <2e-16 ***
# INCOME       7.285e-06  6.269e-07   11.62   <2e-16 ***

MAPE(df1$CLM_AMT, (1 - fitted(zaga1, what = "nu")) * fitted(zaga1, what = "mu"))
# 1.470228

Given the great flexibility of ZA models, we also have the luxury to explore candidates other than ZAGA. For instance, if the positive part of non-negative outcomes demonstrates a high variance, we can also try a Zero-Adjusted Inverse Gaussian (ZAIG) model, as shown below.

zaig1 <- gamlss(CLM_AMT ~ BLUEBOOK + NPOLICY, nu.formula = ~(CLM_FREQ5 + MVR_PTS + INCOME), data = df1, family = "ZAIG")
# Mu Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 8.205e+00  5.836e-02 140.591  < 2e-16 ***
# BLUEBOOK    2.163e-05  2.976e-06   7.268 3.97e-13 ***
# NPOLICY     5.898e-02  2.681e-02   2.200   0.0278 *  
# Nu Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept)  1.153e+00  5.077e-02   22.72   <2e-16 ***
# CLM_FREQ5   -3.028e-01  2.283e-02  -13.26   <2e-16 ***
# MVR_PTS     -1.509e-01  1.217e-02  -12.41   <2e-16 ***
# INCOME       7.285e-06  6.269e-07   11.62   <2e-16 ***

MAPE(df1$CLM_AMT, (1 - fitted(zaig1, what = "nu")) * fitted(zaig1, what = "mu"))
# 1.469236

Model Operational Losses with Copula Regression

In the previous post (https://statcompute.wordpress.com/2017/06/29/model-operational-loss-directly-with-tweedie-glm), it has been explained why we should consider modeling operational losses for non-material UoMs directly with Tweedie models. However, for material UoMs with significant losses, it is still beneficial to model the frequency and the severity separately.

In the prevailing modeling practice for operational losses, it is often convenient to assume functional independence between frequency and severity models, which might not be the case empirically. For instance, in an economic downturn, both the frequency and the severity of consumer frauds might tend to increase simultaneously. Under the independence assumption, while we can argue that the same variables could be included in both frequency and severity models and therefore induce a certain correlation, the frequency-severity dependence and its contribution to the loss distribution might be overlooked.

In the context of copulas, the distribution of operational losses can be considered a joint distribution determined by both the marginal distributions and a parameter measuring the dependence between the marginals, where the marginal distributions can be Poisson for the frequency and Gamma for the severity. Depending on the dependence structure in the data, various copula functions might be considered. For instance, a product copula can be used to describe independence. In the example shown below, a Gumbel copula is considered, given that it is often used to describe positive dependence in the right tail, e.g. high severity and high frequency. For details, the book “Copula Modeling” by Trivedi and Zimmer is a good reference to start with.

In the demonstration, we simulated both frequency and severity measures driven by the same set of covariates. The two are positively correlated with Kendall’s tau = 0.5 under the assumption of a Gumbel copula.

library(CopulaRegression)
# number of observations to simulate
n <- 100
# seed value for the simulation
set.seed(2017)
# design matrices with a constant column
X <- cbind(rep(1, n), runif(n), runif(n))
# define coefficients for both Poisson and Gamma regressions
p_beta <- g_beta <- c(3, -2, 1)
# define the Gamma dispersion
delta <- 1
# define the Kendall's tau
tau <- 0.5
# copula parameter based on tau
theta <- 1 / (1 - tau)
# define the Gumbel Copula 
family <- 4
# simulate outcomes
out <- simulate_regression_data(n, g_beta, p_beta, X, X, delta, tau, family, zt = FALSE)
G <- out[, 1]
P <- out[, 2]

After the simulation, a Copula regression is estimated with Poisson and Gamma marginals for the frequency and the severity, respectively. As shown in the model estimation, the estimated parameters and related inferences differ between the independence and dependence assumptions.

m <- copreg(G, P, X, family = 4, sd.error = TRUE, joint = TRUE, zt = FALSE)
coef <- c("_CONST", "X1", "X2")
cols <- c("ESTIMATE", "STD. ERR", "Z-VALUE")
g_est <- cbind(m$alpha, m$sd.alpha, m$alpha / m$sd.alpha)
p_est <- cbind(m$beta, m$sd.beta, m$beta / m$sd.beta)
g_est0 <- cbind(m$alpha0, m$sd.alpha0, m$alpha0 / m$sd.alpha0)
p_est0 <- cbind(m$beta0, m$sd.beta0, m$beta0 / m$sd.beta0)
rownames(g_est) <- rownames(g_est0) <- rownames(p_est) <- rownames(p_est0) <- coef
colnames(g_est) <- colnames(g_est0) <- colnames(p_est) <- colnames(p_est0) <- cols

# estimated coefficients for the Gamma regression assumed dependence 
print(g_est)
#          ESTIMATE  STD. ERR   Z-VALUE
# _CONST  2.9710512 0.2303651 12.897141
# X1     -1.8047627 0.2944627 -6.129003
# X2      0.9071093 0.2995218  3.028526

# estimated coefficients for the Poisson regression assumed dependence
print(p_est)
#         ESTIMATE   STD. ERR   Z-VALUE
# _CONST  2.954519 0.06023353  49.05107
# X1     -1.967023 0.09233056 -21.30414
# X2      1.025863 0.08254870  12.42736

# estimated coefficients for the Gamma regression assumed independence 
# should be identical to GLM() outcome
print(g_est0)
#         ESTIMATE  STD. ERR   Z-VALUE
# _CONST  3.020771 0.2499246 12.086727
# X1     -1.777570 0.3480328 -5.107478
# X2      0.905527 0.3619011  2.502140

# estimated coefficients for the Poisson regression assumed independence
# should be identical to GLM() outcome
print(p_est0)
#         ESTIMATE   STD. ERR   Z-VALUE
# _CONST  2.939787 0.06507502  45.17536
# X1     -2.010535 0.10297887 -19.52376
# X2      1.088269 0.09334663  11.65837

If we compare the conditional loss distributions under different dependence assumptions, it shows that the predicted loss from the Copula regression tends to have a fatter right tail and therefore should be considered more conservative.

df <- data.frame(g = G, p = P, x1 = X[, 2], x2 = X[, 3])
glm_p <- glm(p ~ x1 + x2, data = df, family = poisson(log))
glm_g <- glm(g ~ x1 + x2, data = df, family = Gamma(log))
loss_dep <- predict(m, X, X, independence = FALSE)[3][[1]][[1]]
loss_ind <- fitted(glm_p) * fitted(glm_g)
den <- data.frame(loss = c(loss_dep, loss_ind), lines = rep(c("DEPENDENCE", "INDEPENDENCE"), each = n))
library(ggplot2)
ggplot(den, aes(x = loss, fill = lines)) + geom_density(alpha = 0.5)

[Figure: densities of predicted losses under the dependence and independence assumptions]

Using Tweedie Parameter to Identify Distributions

In the development of operational loss models, it is important to identify which distribution should be used to model operational risk measures, e.g. frequency and severity. For instance, why should we use the Gamma distribution instead of the Inverse Gaussian distribution to model the severity?

In my previous post (https://statcompute.wordpress.com/2016/11/20/modified-park-test-in-sas), it is shown how to use the Modified Park test to identify the mean-variance relationship and then decide the corresponding distribution of operational risk measures. Following similar logic, we can also leverage the flexibility of the Tweedie distribution to accomplish the same goal. Based upon the parameterization of a Tweedie distribution, the variance = Phi * (Mu ** P), where Mu is the mean and P is the power parameter. Depending on the specific value of P, the Tweedie distribution can accommodate several important distributions commonly used in operational risk modeling, including the Poisson, Gamma, and Inverse Gaussian. For instance,

  • With P = 0, the variance would be independent of the mean, indicating a Normal distribution.
  • With P = 1, the variance would be in a linear form of the mean, indicating a Poisson-like distribution.
  • With P = 2, the variance would be in a quadratic form of the mean, indicating a Gamma distribution.
  • With P = 3, the variance would be in a cubic form of the mean, indicating an Inverse Gaussian distribution.

In the example below, it is shown that the value of P is in the neighborhood of 1 for the frequency measure and is near 3 for the severity measure and that, given P closer to 3, the Inverse Gaussian regression would fit the severity better than the Gamma regression.

library(statmod)
library(tweedie)
data(AutoCollision, package = "insuranceData")  # the AutoCollision data is assumed to come from the insuranceData package

profile1 <- tweedie.profile(Claim_Count ~ Age + Vehicle_Use, data = AutoCollision, p.vec = seq(1.1, 3.0, 0.1), fit.glm = TRUE)
print(profile1$p.max)
# [1] 1.216327
# The P parameter close to 1 indicates that the claim_count might follow a Poisson-like distribution

profile2 <- tweedie.profile(Severity ~ Age + Vehicle_Use, data = AutoCollision, p.vec = seq(1.1, 3.0, 0.1), fit.glm = TRUE)
print(profile2$p.max)
# [1] 2.844898
# The P parameter close to 3 indicates that the severity might follow an Inverse Gaussian distribution

BIC(glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = Gamma(link = log)))
# [1] 360.8064

BIC(glm(Severity ~ Age + Vehicle_Use, data = AutoCollision, family = inverse.gaussian(link = log)))
# [1] 350.2504

Together with the Modified Park test, the estimation of P in a Tweedie distribution can help us identify the correct distribution to employ in operational loss models in the context of GLM.

Double Poisson Regression in SAS

In the previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models), I’ve shown how to estimate the double Poisson (DP) regression in R with the gamlss package. The hurdle in estimating a DP regression is the calculation of a normalizing constant in the DP density function, which can be computed either as the sum of an infinite series or by a closed-form approximation. In the example below, I will show how to estimate the DP regression in SAS with the GLIMMIX procedure.
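
For reference, the DP density implemented in the GLIMMIX code below has the form

f(y; mu, theta) = c(mu, theta) * (theta ^ 0.5) * ((exp(-mu) * mu ^ y / y!) ^ theta) * ((exp(-y) * y ^ y / y!) ^ (1 - theta))

where the normalizing constant c(mu, theta) is the reciprocal of the corresponding infinite sum over y or, approximately, 1 / (1 + (1 - theta) / (12 * theta * mu) * (1 + 1 / (theta * mu))).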

First of all, I will show how to estimate DP regression by using the exact DP density function. In this case, we will approximate the normalizing constant by computing a partial sum of the infinite series, as highlighted below.

data poi;
  do n = 1 to 5000;
    x1 = ranuni(1);
    x2 = ranuni(2);
    x3 = ranuni(3);
    y = ranpoi(4, exp(1 * x1 - 2 * x2 + 3 * x3));
    output;
  end;
run;

proc glimmix data = poi;
  nloptions tech = quanew update = bfgs maxiter = 1000;
  model y = x1 x2 x3 / link = log solution;
  theta = exp(_phi_);
  _variance_ = _mu_ / theta;
  p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta;
  p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta);
  f = (theta ** 0.5) * ((exp(-_mu_)) ** theta);  
  do i = 1 to 100;
    f = f + (theta ** 0.5) * ((exp(-i) * (i ** i) / fact(i)) ** (1 - theta)) * ((exp(-_mu_) * (_mu_ ** i) / fact(i)) ** theta);
  end;
  k = 1 / f;
  prob = k * (theta ** 0.5) * p_y * p_u;
  if log(prob) ~= . then _logl_ = log(prob);
run;

Next, I will show the same estimation routine by using the closed form approximation.

proc glimmix data = poi;
  nloptions tech = quanew update = bfgs maxiter = 1000;
  model y = x1 x2 x3 / link = log solution;
  theta = exp(_phi_);
  _variance_ = _mu_ / theta;
  p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta;
  p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta);
  k = 1 / (1 + (1 - theta) / (12 * theta * _mu_) * (1 + 1 / (theta * _mu_)));
  prob = k * (theta ** 0.5) * p_y * p_u;
  if log(prob) ~= . then _logl_ = log(prob);
run;

While the first approach is more accurate in closely following the DP density function, the second approach is more efficient with a significantly lower computing cost. Either way, both are much faster than the corresponding R function gamlss().

Monotonic Binning with Smbinning Package

The R package smbinning (https://cran.r-project.org/web/packages/smbinning/index.html) provides a very user-friendly interface for the WoE (Weight of Evidence) binning algorithm employed in scorecard development. However, there are several improvement opportunities in my view:

1. First of all, the underlying algorithm in the smbinning() function utilizes recursive partitioning, which does not necessarily guarantee monotonicity.
2. Secondly, the density across generated bins is not even; the frequency in some bins could be much higher than in others.
3. Lastly, the function might not provide a binning outcome for some variables due to the lack of statistical significance.

In light of the above, I wrote an enhanced version by utilizing the smbinning.custom() function, as shown below. The idea is very simple: within the repeat loop, we bin the variable iteratively until a certain criterion is met and then feed the list of cut points into the smbinning.custom() function. As a result, we are able to achieve a set of monotonic bins with similar frequencies regardless of the so-called “statistical significance”, which is a premature step for the variable transformation in my mind.

monobin <- function(data, y, x) {
  d1 <- data[c(y, x)]
  n <- min(20, nrow(unique(d1[x])))
  repeat {
    d1$bin <- Hmisc::cut2(d1[, x], g = n)
    d2 <- aggregate(d1[-3], d1[3], mean)
    c <- cor(d2[-1], method = "spearman")
    if(abs(c[1, 2]) == 1 | n == 2) break
    n <- n - 1
  }
  d3 <- aggregate(d1[-3], d1[3], max)
  cuts <- d3[-length(d3[, 3]), 3]
  return(smbinning::smbinning.custom(d1, y, x, cuts))
}
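
As a hypothetical usage example, assuming a toy data frame df with a binary performance flag bad, the IV tables compared below would come from calls along these lines:

smbinning::smbinning(df, y = "bad", x = "rev_util")$ivtable
monobin(df, "bad", "rev_util")$ivtable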

Below are a couple of comparisons between the generic smbinning() and the home-brew monobin() functions with the use of a toy dataset.

In the first example, we applied the smbinning() function to a variable named "rev_util". As shown in the "BadRate" column, the binning outcome is not monotonic.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1     <= 0    965     716    249       965        716       249 0.1653   0.7420  0.2580  2.8755 1.0562 -0.2997 0.0162
2     <= 5    522     496     26      1487       1212       275 0.0894   0.9502  0.0498 19.0769 2.9485  1.5925 0.1356
3    <= 24   1166    1027    139      2653       2239       414 0.1998   0.8808  0.1192  7.3885 1.9999  0.6440 0.0677
4    <= 40    779     651    128      3432       2890       542 0.1335   0.8357  0.1643  5.0859 1.6265  0.2705 0.0090
5    <= 73   1188     932    256      4620       3822       798 0.2035   0.7845  0.2155  3.6406 1.2922 -0.0638 0.0008
7     > 96    533     337    196      5837       4641      1196 0.0913   0.6323  0.3677  1.7194 0.5420 -0.8140 0.0743
8  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN     NaN    NaN     NaN    NaN
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.3352

Next, we did the same with the monobin() function. As shown below, the algorithm provided a monotonic binning at the cost of granularity. Albeit coarse, the result is directionally correct with no inversion.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate   Odds LnOdds     WoE     IV
2     > 30   2875    2146    729      5837       4641      1196 0.4925   0.7464  0.2536 2.9438 1.0797 -0.2763 0.0407
3  Missing      0       0      0      5837       4641      1196 0.0000      NaN     NaN    NaN    NaN     NaN    NaN
4    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049 3.8804 1.3559  0.0000 0.0878

In the second example, we applied the smbinning() function to a variable named “bureau_score”. As shown below, the frequencies in a couple of bins are much higher than in the rest.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1   <= 605    324     167    157       324        167       157 0.0555   0.5154  0.4846  1.0637 0.0617 -1.2942 0.1233
2   <= 632    468     279    189       792        446       346 0.0802   0.5962  0.4038  1.4762 0.3895 -0.9665 0.0946
3   <= 662    896     608    288      1688       1054       634 0.1535   0.6786  0.3214  2.1111 0.7472 -0.6087 0.0668
4   <= 699   1271    1016    255      2959       2070       889 0.2177   0.7994  0.2006  3.9843 1.3824  0.0264 0.0002
5   <= 717    680     586     94      3639       2656       983 0.1165   0.8618  0.1382  6.2340 1.8300  0.4741 0.0226
7    > 761    765     742     23      5522       4431      1091 0.1311   0.9699  0.0301 32.2609 3.4739  2.1179 0.2979
8  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
9    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.8066

With the monobin() function applied to the same variable, we were able to get a set of more granular bins with similar frequencies.

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate BadRate    Odds LnOdds     WoE     IV
1    <= 617    513     284    229       513        284       229 0.0879   0.5536  0.4464  1.2402 0.2153 -1.1407 0.1486
2    <= 642    515     317    198      1028        601       427 0.0882   0.6155  0.3845  1.6010 0.4706 -0.8853 0.0861
3    <= 657    512     349    163      1540        950       590 0.0877   0.6816  0.3184  2.1411 0.7613 -0.5946 0.0363
4    <= 672    487     371    116      2027       1321       706 0.0834   0.7618  0.2382  3.1983 1.1626 -0.1933 0.0033
5    <= 685    494     396     98      2521       1717       804 0.0846   0.8016  0.1984  4.0408 1.3964  0.0405 0.0001
6    <= 701    521     428     93      3042       2145       897 0.0893   0.8215  0.1785  4.6022 1.5265  0.1706 0.0025
7    <= 714    487     418     69      3529       2563       966 0.0834   0.8583  0.1417  6.0580 1.8014  0.4454 0.0144
8    <= 730    489     441     48      4018       3004      1014 0.0838   0.9018  0.0982  9.1875 2.2178  0.8619 0.0473
9    <= 751    513     476     37      4531       3480      1051 0.0879   0.9279  0.0721 12.8649 2.5545  1.1986 0.0859
11   > 775    499     486     13      5522       4431      1091 0.0855   0.9739  0.0261 37.3846 3.6213  2.2653 0.2126
12  Missing    315     210    105      5837       4641      1196 0.0540   0.6667  0.3333  2.0000 0.6931 -0.6628 0.0282
13    Total   5837    4641   1196        NA         NA        NA 1.0000   0.7951  0.2049  3.8804 1.3559  0.0000 0.7810

Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has a probability density function shown below

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In the formulation, the scale parameter a defines the positive lower bound such that 0 < a <= y, and the shape parameter satisfies k > 0, with k > 1 required for a finite mean.

The positive lower bound of the Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of the ABA operational risk consortium data is $10,000, and any loss event below the threshold value would not be reported, which adds complexity to the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response y' = severity – threshold to accommodate the threshold value, so that the support of y' starts from 0 and the Gamma, Inverse Gaussian, or Lognormal regression remains applicable. However, under the distributional assumption of Type-I Pareto with a known lower bound, we do not need to shift the severity measure anymore but can model it directly based on the probability function.

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound a = 2 by using the VGAM package.

library(VGAM)
set.seed(2017)
n <- 200
a <- 2
x <- runif(n)
k <- exp(1 + 5 * x)
pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x)
fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE)
summary(fit)
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   1.0322     0.1363   7.574 3.61e-14 ***
# x             4.9815     0.2463  20.229  < 2e-16 ***
AIC(fit)
#  -644.458
BIC(fit)
#  -637.8614

The SAS code below, estimating the same Type-I Pareto regression with PROC NLMIXED, provides almost identical parameter estimates.

proc nlmixed data = pdata;
  parms b0 = 0.1 b1 = 0.1;
  k = exp(b0 + b1 * x);
  a = 2;
  lh = k * (a ** k) / (y ** (k + 1));
  ll = log(lh);
  model y ~ general(ll);
run;
/*
Fit Statistics
-2 Log Likelihood               -648.5
AIC (smaller is better)         -644.5
AICC (smaller is better)        -644.4
BIC (smaller is better)         -637.9

Parameter   Estimate   Standard Error   DF   t Value   Pr > |t|
b0          1.0322     0.1385          200      7.45     <.0001
b1          4.9815     0.2518          200     19.78     <.0001
*/

Lastly, it is worth pointing out that the conditional mean of the Type-I Pareto response is not equal to exp(x * beta) but a * k / (k – 1) with k = exp(x * beta). Therefore, the conditional mean only exists when k > 1, which might cause numerical issues in the model estimation.
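
As a quick illustration, below is a minimal R sketch, assuming the fit object and the lower bound a from the VGAM snippet above, that recovers the implied conditional mean.

k_hat <- exp(predict(fit))          # fitted shape k = exp(x * beta) under the log link
mu_hat <- a * k_hat / (k_hat - 1)   # implied conditional mean, defined only when k > 1
summary(mu_hat)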

Pregibon Test for Goodness of Link in SAS

When estimating generalized linear models for binary outcomes, we often choose the logit link function by default and seldom consider alternatives such as probit or cloglog. The Pregibon test (Pregibon, 1980) provides a means to check the goodness of link with the simple logic outlined below.

1. First of all, we can estimate the regression model with the hypothesized link function, e.g. logit;
2. After the model estimation, we calculate yhat and yhat ^ 2 and then estimate a secondary regression with the identical response variable Y and link function but with yhat and yhat ^ 2 as model predictors (with the intercept).
3. If the link function is correctly specified, then the t-value of yhat ^ 2 should be insignificant (a minimal R sketch of this logic follows).
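
On a simulated dataset with hypothetical variable names, the logic above can be sketched in R as follows; the SAS macro further below is a fuller implementation across several link functions.

set.seed(1)
df <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
df$y <- rbinom(1000, 1, plogis(0.5 * df$x1 - 0.3 * df$x2))
m1 <- glm(y ~ x1 + x2, data = df, family = binomial("logit"))
df$p1 <- predict(m1, type = "response")   # yhat, the fitted value under the hypothesized link
df$p2 <- df$p1 ^ 2                        # yhat ^ 2
m2 <- glm(y ~ p1 + p2, data = df, family = binomial("logit"))
summary(m2)$coefficients                  # an insignificant p2 supports the logit link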

The SAS macro shown below is an implementation of the Pregibon test in the context of logistic regressions. However, the same idea can be generalized to any GLM.

%macro pregibon(data = , y = , x = );
***********************************************************;
* SAS MACRO PERFORMING PREGIBON TEST FOR GOODNESS OF LINK *;
* ======================================================= *;
* INPUT PARAMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  Y    : THE DEPENDENT VARIABLE WITH 0 / 1 VALUES        *;
*  X    : MODEL PREDICTORS                                *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;
options mprint mlogic nocenter;

%let links = logit probit cloglog;
%let loop = 1;

proc sql noprint;
  select count(*) - 3 into :df from &data;
quit;

%do %while (%scan(&links, &loop) ne %str());

  %let link = %scan(&links, &loop);
  
  proc logistic data = &data noprint desc;
    model &y = &x / link = &link;
    score data = &data out = _out1;
  run;
  
  data _out2;
    set _out1(rename = (p_1 = p1));
    p2 = p1 * p1;
  run;
  
  ods listing close;
  ods output ParameterEstimates = _parm;  
  proc logistic data = _out2 desc;
    model &y = p1 p2 /  link = &link ;
  run;
  ods listing;
    
  %if &loop = 1 %then %do;
    data _parm1;
      format link $10.;
      set _parm(where = (variable = "p2"));
      link = upcase("&link");
    run;
  %end;
  %else %do;
    data _parm1;
      set _parm1 _parm(where = (variable = "p2") in = new);
      if new then link = upcase("&link");
    run;
  %end;
  
  data _parm2(drop = variable);
    set _parm1;
    _t = estimate / stderr;
    _df = &df;
    _p = (1 - probt(abs(_t), _df)) * 2;
  run;
  
  %let loop = %eval(&loop + 1);

%end;

title;
proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * PREGIBON TEST FOR GOODNESS OF LINK
           * H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY * "
         link _t _df _p);
  define link / "LINK FUNCTION" width = 15 order order = data;          
  define _t   / "T-VALUE"       width = 15 format = 12.4;
  define _df  / "DF"            width = 10;
  define _p   / "P-VALUE"       width = 15 format = 12.4;
run;

%mend;

After applying the macro to the kyphosis data (https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html), we can see that both the logit and probit links can be considered appropriate in this specific case, while cloglog might not be a good choice.

             PREGIBON TEST FOR GOODNESS OF LINK
        H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY

 LINK FUNCTION           T-VALUE         DF         P-VALUE
-----------------------------------------------------------
 LOGIT                   -1.6825         78          0.0965
 PROBIT                  -1.7940         78          0.0767
 CLOGLOG                 -2.3632         78          0.0206

More about Flexible Frequency Models

Modeling the frequency is one of the most important aspects in operational risk models. In the previous post (https://statcompute.wordpress.com/2016/05/13/more-flexible-approaches-to-model-frequency), the importance of flexible modeling approaches for both under-dispersion and over-dispersion has been discussed.

In addition to the Quasi-Poisson regression, three flexible frequency modeling techniques, namely the Generalized Poisson, the Double Poisson, and the Conway-Maxwell Poisson, are demonstrated below together with their implementations in R. While the example is specifically related to over-dispersed data simulated under the Negative Binomial distributional assumption, these approaches can be generalized to under-dispersed data as well given their flexibility. However, as demonstrated below, recovering the mean and the variance from the estimated parameters of these models is not straightforward.

Over-Dispersed Data Simulation

> set.seed(1)
> ### SIMULATE NEG. BINOMIAL WITH MEAN(X) = MU AND VAR(X) = MU + MU ^ 2 / THETA
> df <- data.frame(y = MASS::rnegbin(1000, mu = 10, theta = 5))
> ### DATA MEAN
> mean(df$y)
[1] 9.77
> ### DATA VARIANCE
> var(df$y)
[1] 30.93003003

Generalized Poisson Regression

> library(VGAM)
> gpois <- vglm(y ~ 1, data = df, family = genpoisson)
> gpois.theta <- exp(coef(gpois)[2])
> gpois.lambda <- (exp(coef(gpois)[1]) - 1) / (exp(coef(gpois)[1]) + 1)
> ### ESTIMATE MEAN = THETA / (1 - LAMBDA)
> gpois.theta / (1 - gpois.lambda)
(Intercept):2
         9.77
> ### ESTIMATE VARIANCE = THETA / ((1 - LAMBDA) ^ 3)
> gpois.theta / ((1 - gpois.lambda) ^ 3)
(Intercept):2
  31.45359991

Double Poisson Regression

> ### DOUBLE POISSON
> library(gamlss)
> dpois <- gamlss(y ~ 1, data = df, family = DPO, control = gamlss.control(n.cyc = 100))
> ### ESTIMATE MEAN
> dpois.mu <- exp(dpois$mu.coefficients)
> dpois.mu
(Intercept)
9.848457877
> ### ESTIMATE VARIANCE = MU * SIGMA
> dpois.sigma <- exp(dpois$sigma.coefficients)
> dpois.mu * dpois.sigma
(Intercept)
28.29229702

Conway-Maxwell Poisson Regression

> ### CONWAY-MAXWELL POISSON
> library(CompGLM)
> cpois <- glm.comp(y ~ 1, data = df)
> cpois.lambda <- exp(cpois$beta)
> cpois.nu <- exp(cpois$zeta)
> ### ESTIMATE MEAN = LAMBDA ^ (1 / NU) - (NU - 1) / (2 * NU)
> cpois.lambda ^ (1 / cpois.nu) - (cpois.nu - 1) / (2 * cpois.nu)
(Intercept)
 9.66575376
> ### ESTIMATE VARIANCE = LAMBDA ** (1 / NU) / NU
> cpois.lambda ^ (1 / cpois.nu) / cpois.nu
(Intercept)
29.69861239

Risk Models with Generalized PLS

While developing risk models with hundreds of potential variables, we often run into the situation where risk characteristics or macro-economic indicators are highly correlated, namely multicollinearity. In such cases, we might have to drop variables with high VIFs or employ “variable shrinkage” methods, e.g. lasso or ridge, to suppress variables with collinearity.

Feature extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns around model interpretability and implementation. In the example below, it is shown that there shouldn’t be any hurdle in the model implementation, e.g. scoring, given that coefficients can be extracted from a GPLS model in a similar way as from a GLM model. In addition, compared with the GLM with 8 variables, the GPLS with only 5 components is able to provide a comparable performance in the hold-out testing data.

R Code

library(gpls)
library(pROC)

df1 <- read.csv("credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)]
set.seed(2016)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, ]
test <- df2[-sample, ]

m1 <- glm(DEFAULT ~ ., data = train, family = "binomial")
cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response")))
cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response")))

m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5)
cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n")
print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1]))
cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n")
print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1]))

cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n")
print(data.frame(glm = m1$coefficients, gpls = m2$coefficients))

Output

### ROC OF GLM PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m1,     newdata = train, type = "response"))

Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6641

### ROC OF GLM PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m1,     newdata = test, type = "response"))

Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6537

### ROC OF GPLS PREDICTION WITH TRAINING DATA ###

Call:
roc.default(response = train$DEFAULT, predictor = predict(m2,     newdata = train)$predicted[, 1])

Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1).
Area under the curve: 0.6627

### ROC OF GPLS PREDICTION WITH TESTING DATA ###

Call:
roc.default(response = test$DEFAULT, predictor = predict(m2,     newdata = test)$predicted[, 1])

Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1).
Area under the curve: 0.6542

### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###
                      glm          gpls
(Intercept) -0.1940785071 -0.1954618828
AGE         -0.0122709412 -0.0147883358
ACADMOS      0.0005302022  0.0003671781
ADEPCNT      0.1090667092  0.1352491711
MAJORDRG     0.0757313171  0.0813835741
MINORDRG     0.2621574192  0.2547176301
OWNRENT     -0.2803919685 -0.1032119571
INCOME      -0.0004222914 -0.0004531543
LOGSPEND    -0.1688395555 -0.1525963363
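
As a sanity check on the implementation point above, below is a minimal sketch, assuming the m2 object and the test data from the R code above, that reproduces the GPLS score manually from the extracted coefficients, in the same way one would score a GLM.

xmat <- model.matrix(DEFAULT ~ ., data = test)            # intercept plus the 8 predictors
p_manual <- 1 / (1 + exp(-xmat %*% m2$coefficients))      # score through the logistic link
head(p_manual)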

More Flexible Approaches to Model Frequency

(The post below is motivated by my friend Matt Flynn https://www.linkedin.com/in/matthew-flynn-1b443b11)

In the context of operational loss forecast models, the standard Poisson regression is the most popular way to model frequency measures. Conceptually speaking, there is a restrictive assumption for the standard Poisson regression, namely Equi-Dispersion, which requires the equality between the conditional mean and the variance such that E(Y) = Var(Y). However, in real-world frequency outcomes, the assumption of Equi-Dispersion is often violated. Instead, the empirical data frequently present either an excessive variance, namely Over-Dispersion, or an insufficient variance, namely Under-Dispersion. Applying a standard Poisson regression to over-dispersed data leads to deflated standard errors of parameter estimates and therefore inflated t-statistics.

In cases of Over-Dispersion, the Negative Binomial (NB) regression has been the most common alternative to the standard Poisson regression by including a dispersion parameter to accommodate the excessive variance in the data. In the formulation of NB regression, the variance is expressed as a quadratic function of the conditional mean such that the variance is guaranteed to be higher than the conditional mean. However, it is not flexible enough to allow for both Over-Dispersion and Under-Dispersion. Therefore, more generalizable approaches are called for.

Two additional frequency modeling methods, including Quasi-Poisson (QP) regression and Conway-Maxwell Poisson (CMP) regression, are discussed. In the case of Quasi-Poisson, E(Y) = λ and var(Y) = θ • λ. While θ > 1 addresses Over-Dispersion, θ < 1 governs Under-Dispersion. Since QP regression is estimated with QMLE, likelihood-based statistics, such as AIC and BIC, won’t be available. Instead, quasi-AIC and quasi-BIC are provided. In the case of Conway-Maxwell Poisson, E(Y) = λ ** (1 / v) – (v – 1) / (2 • v) and var(Y) = (1 / v) • λ ** (1 / v), where λ doesn’t represent the conditional mean anymore but a location parameter. While v < 1 enables us to model the long-tailed distribution reflected as Over-Dispersion, v > 1 takes care of the short-tailed distribution reflected as Under-Dispersion. Since CMP regression is estimated with MLE, likelihood-based statistics, such as AIC and BIC, are available at a high computing cost.

Below is a demonstration of how to estimate QP and CMP regressions in R, together with a comparison of their computing times. If the modeling purpose is mainly prediction without a focus on statistical inference, QP regression would be an excellent choice for most practitioners. Otherwise, CMP regression is an elegant model to address various levels of dispersion parsimoniously.

# data source: www.jstatsoft.org/article/view/v027i08
load("../Downloads/DebTrivedi.rda")

library(rbenchmark)
library(CompGLM)

benchmark(replications = 3, order = "user.self",
  quasi.poisson = {
    m1 <- glm(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, family = "quasipoisson")
  },
  conway.maxwell = {
    m2 <- glm.comp(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, lamStart = m1$coefficients)
  }
)
#             test replications elapsed relative user.self sys.self user.child
# 1  quasi.poisson            3   0.084    1.000     0.084    0.000          0
# 2 conway.maxwell            3  42.466  505.548    42.316    0.048          0

summary(m1)
summary(m2) 

Quasi-Poisson Regression

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      0.886462   0.069644  12.729  < 2e-16 ***
healthpoor       0.235673   0.046284   5.092 3.69e-07 ***
healthexcellent -0.360188   0.078441  -4.592 4.52e-06 ***
hosp             0.163246   0.015594  10.468  < 2e-16 ***
numchron         0.144652   0.011894  12.162  < 2e-16 ***
privinsyes       0.304691   0.049879   6.109 1.09e-09 ***
school           0.028953   0.004812   6.016 1.93e-09 ***
gendermale      -0.092460   0.033830  -2.733   0.0063 **
medicaidyes      0.297689   0.063787   4.667 3.15e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 6.697556)

    Null deviance: 26943  on 4405  degrees of freedom
Residual deviance: 23027  on 4397  degrees of freedom
AIC: NA

Conway-Maxwell Poisson Regression

Beta:
                   Estimate   Std.Error  t.value p.value
(Intercept)     -0.23385559  0.16398319  -1.4261 0.15391
healthpoor       0.03226830  0.01325437   2.4345 0.01495 *
healthexcellent -0.08361733  0.00687228 -12.1673 < 2e-16 ***
hosp             0.01743416  0.01500555   1.1618 0.24536
numchron         0.02186788  0.00209274  10.4494 < 2e-16 ***
privinsyes       0.05193645  0.00184446  28.1581 < 2e-16 ***
school           0.00490214  0.00805940   0.6083 0.54305
gendermale      -0.01485663  0.00076861 -19.3292 < 2e-16 ***
medicaidyes      0.04861617  0.00535814   9.0733 < 2e-16 ***

Zeta:
              Estimate  Std.Error t.value   p.value
(Intercept) -3.4642316  0.0093853 -369.11 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

AIC: 24467.13
Log-Likelihood: -12223.56

Estimate Quasi-Binomial Model with GENMOD Procedure in SAS

In my previous post (https://statcompute.wordpress.com/2015/11/01/quasi-binomial-model-in-sas/), I’ve shown why there is an interest in estimating Quasi-Binomial models for financial practitioners and how to estimate with GLIMMIX procedure in SAS.

In the demonstration below, I will show an example of how to estimate a Quasi-Binomial model with the GENMOD procedure by specifying the VARIANCE and DEVIANCE statements. While the model estimation is a lot faster with GENMOD than with GLIMMIX in terms of CPU time, additional steps are necessary to ensure the correct statistical inference.

ods listing close;
ods output modelfit = fit;
ods output parameterestimates = parm1;
proc genmod data = kyphosis;
  model y = age number start / link = logit noscale;
  variance v = _mean_ * (1 - _mean_);
  deviance d = (-2) * log((_mean_ ** _resp_) * ((1 - _mean_) ** (1 - _resp_)));
run;
ods listing;

proc sql noprint;
select
  distinct valuedf format = 12.8, df format = 8.0 into :disp, :df
from
  fit
where 
  index(criterion, "Pearson") > 0;

select
  value format = 12.4 into :ll
from
  fit
where
  criterion = "Log Likelihood";

select
  sum(df) into :k
from
  parm1;
quit;

%let aic = %sysevalf((-&ll + &k) * 2); 
%let bic = %sysevalf(-&ll * 2 + &k * %sysfunc(log(&df + &k)));

data parm2 (keep = parameter estimate stderr2 df t_value p_value);
  set parm1;
  where df > 0;

  stderr2 = stderr * (&disp ** 0.5);
  df = &df;
  t_value = estimate / stderr2;
  p_value = (1 - probt(abs(t_value), &df)) * 2;
run;

title;
proc report data = parm2 spacing = 1 headline nowindows split = "*";
  column(" Parameter Estimate of Quasi-Binomial Model "
         parameter estimate stderr2 t_value df p_value);
  compute before _page_;
    line @5 "Fit Statistics";
	line " ";
	line @3 "Quasi Log Likelihood: %sysfunc(round(&ll, 0.01))";
	line @3 "Quasi-AIC (smaller is better): %sysfunc(round(&aic, 0.01))";
	line @3 "Quasi-BIC (smaller is better): %sysfunc(round(&bic, 0.01))";
	line @3 "(Dispersion Parameter for Quasi-Binomial is %sysfunc(round(&disp, 0.0001)))";
	line " ";
  endcomp;
  define parameter / "Parmeter" width = 10 order order = data;
  define estimate  / "Estimate" width = 10 format = 10.4;
  define stderr2   / "Std Err"  width = 10 format = 10.4;
  define t_value   / "T Value"  width = 10 format = 10.2;
  define df        / "DF"       width = 5  format = 4.0;
  define p_value   / "Pr > |t|" width = 10 format = 10.4;
run;
quit;
  
/*
    Fit Statistics

  Quasi Log Likelihood: -30.69
  Quasi-AIC (smaller is better): 69.38
  Quasi-BIC (smaller is better): 78.96
  (Dispersion Parameter for Quasi-Binomial is 0.9132)

          Parameter Estimate of Quasi-Binomial Model
 Parmeter     Estimate    Std Err    T Value    DF   Pr > |t|
 ------------------------------------------------------------
 Intercept     -2.0369     1.3853      -1.47    77     0.1455
 Age            0.0109     0.0062       1.77    77     0.0800
 Number         0.4106     0.2149       1.91    77     0.0598
 Start         -0.2065     0.0647      -3.19    77     0.0020
*/
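
As a rough cross-check, the same quasi-binomial model can be fit in R with glm(); the sketch below assumes the kyphosis data from the rpart package with the outcome recoded to 0 / 1 as in the SAS example.

data(kyphosis, package = "rpart")
kyphosis$y <- ifelse(kyphosis$Kyphosis == "present", 1, 0)
summary(glm(y ~ Age + Number + Start, data = kyphosis, family = quasibinomial))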

A More Flexible Ljung-Box Test in SAS

The Ljung-Box test is an important diagnostic to check whether residuals from a time series model are independently distributed. In the SAS / ETS module, it is easy to perform the Ljung-Box test with the ARIMA procedure. However, test outputs are only provided at Lags 6, 12, 18, and so on, which cannot be changed by any option.

data one;
  do i = 1 to 100;
    x = uniform(1);
	output;
  end;
run;

proc arima data = one;
  identify var = x whitenoise = ignoremiss;
run;
quit;
/*
                            Autocorrelation Check for White Noise

 To        Chi-             Pr >
Lag      Square     DF     ChiSq    --------------------Autocorrelations--------------------
  6        5.49      6    0.4832     0.051    -0.132     0.076    -0.024    -0.146     0.064
 12        6.78     12    0.8719     0.050     0.076    -0.046    -0.025    -0.016    -0.018
 18       10.43     18    0.9169     0.104    -0.053     0.063     0.038    -0.085    -0.065
 24       21.51     24    0.6083     0.007     0.178     0.113    -0.046     0.180     0.079
*/

The SAS macro below is a more flexible way to perform the Ljung-Box test for any number of lags. As shown in the output, test results at Lags 6 and 12 are identical to the ones directly from the ARIMA procedure.

%macro LBtest(data = , var = , lags = 4);
***********************************************************;
* SAS MACRO PERFORMING LJUNG-BOX TEST FOR INDEPENDENCE    *;
* ======================================================= *;
* INPUT PARAMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  VAR  : THE TIME SERIES TO TEST FOR INDEPENDENCE        *;
*  LAGS : THE NUMBER OF LAGS BEING TESTED                 *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

%local nlag; 

data _1 (keep = &var);
  set &data end = eof;
  if eof then do;
    call execute('%let nlag = '||put(_n_ - 1, 8.)||';');
  end;
run;

proc arima data = _last_;
  identify var = &var nlag = &nlag outcov = _2 noprint;
run;
quit;

%do i = 1 %to &lags;
  data _3;
    set _2;
	where lag > 0 and lag <= &i;
  run;

  proc sql noprint;
    create table
	  _4 as
	select
      sum(corr * corr / n) * (&nlag + 1) * (&nlag + 3) as _chisq,
	  1 - probchi(calculated _chisq, &i.)              as _p_chisq,
	  &i                                               as _df
	from
	  _last_;
  quit;

  %if &i = 1 %then %do;
  data _5;
    set _4;
  run;
  %end;
  %else %do;
  data _5;
    set _5 _4;
  run;
  %end;
%end;

title;
proc report data = _5 spacing = 1 headline nowindows split = "*";
  column(" * LJUNG-BOX TEST FOR WHITE NOISE *
           * H0: RESIDUALS ARE INDEPENDENTLY DISTRIBUTED UPTO LAG &lags * "
          _chisq _df _p_chisq);
  define _chisq   / "CHI-SQUARE" width = 20 format = 15.10;
  define _df      / "DF"         width = 10 order;
  define _p_chisq / "P-VALUE"    width = 20 format = 15.10;
run;

%mend LBtest;

%LBtest(data = one, var = x, lags = 12);

/*
             LJUNG-BOX TEST FOR WHITE NOISE
 H0: RESIDUALS ARE INDEPENDENTLY DISTRIBUTED UPTO LAG 12

           CHI-SQUARE         DF              P-VALUE
 ------------------------------------------------------
         0.2644425904          1         0.6070843322
         2.0812769288          2         0.3532290858
         2.6839655476          3         0.4429590625
         2.7428168168          4         0.6017432831
         5.0425834917          5         0.4107053939
         5.4851972398          6         0.4832476224
         5.7586229652          7         0.5681994829
         6.4067856029          8         0.6017645131
         6.6410385135          9         0.6744356312
         6.7142471241         10         0.7521182318
         6.7427585395         11         0.8195164211
         6.7783018413         12         0.8719097622
*/
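
For reference, the same statistic is available in R through Box.test(); the snippet below is only an analogous illustration on a fresh uniform series, so the numbers will differ from the SAS output above.

set.seed(1)
x <- runif(100)
Box.test(x, lag = 6, type = "Ljung-Box")
Box.test(x, lag = 12, type = "Ljung-Box")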

Estimating Quasi-Poisson Regression with GLIMMIX in SAS

When modeling the frequency measure in the operational risk with regressions, most modelers often prefer Poisson or Negative Binomial regressions as best practices in the industry. However, as an alternative approach, Quasi-Poisson regression provides a more flexible model estimation routine with at least two benefits. First of all, Quasi-Poisson regression is able to address both over-dispersion and under-dispersion by assuming that the variance is a function of the mean such that VAR(Y|X) = Theta * MEAN(Y|X), where Theta > 1 for the over-dispersion and Theta < 1 for the under-dispersion. Secondly, estimated coefficients with Quasi-Poisson regression are identical to the ones with Standard Poisson regression, which is considered the prevailing practice in the industry.

While Quasi-Poisson regression can be easily estimated with glm() in the R language, its estimation in SAS is not very straightforward. Luckily, with the GLIMMIX procedure, we can estimate Quasi-Poisson regression by directly specifying the functional relationship between the variance and the mean and making no distributional assumption in the MODEL statement, as demonstrated below.


proc glimmix data = credit_count;
  model MAJORDRG = AGE ACADMOS MINORDRG OWNRENT / link = log solution;
  _variance_ = _mu_;
  random _residual_;
run;
  
/*
              Model Information
 
Data Set                     WORK.CREDIT_COUNT
Response Variable            MAJORDRG        
Response Distribution        Unknown         
Link Function                Log             
Variance Function            _mu_             
Variance Matrix              Diagonal        
Estimation Technique         Quasi-Likelihood
Degrees of Freedom Method    Residual        
 
              Fit Statistics
 
-2 Log Quasi-Likelihood           19125.57
Quasi-AIC  (smaller is better)    19135.57
Quasi-AICC (smaller is better)    19135.58
Quasi-BIC  (smaller is better)    19173.10
Quasi-CAIC (smaller is better)    19178.10
Quasi-HQIC (smaller is better)    19148.09
Pearson Chi-Square                51932.87
Pearson Chi-Square / DF               3.86
 
                       Parameter Estimates
                         Standard
Effect       Estimate       Error       DF    t Value    Pr > |t|
 
Intercept     -1.3793     0.08613    13439     -16.01      <.0001
AGE           0.01039    0.002682    13439       3.88      0.0001
ACADMOS      0.001532    0.000385    13439       3.98      <.0001
MINORDRG       0.4611     0.01348    13439      34.22      <.0001
OWNRENT       -0.1994     0.05568    13439      -3.58      0.0003
Residual       3.8643           .        .        .         .   
*/

For comparison purposes, we also estimated a Quasi-Poisson regression in R, which shows essentially identical statistical results.


summary(glm(MAJORDRG ~ AGE + ACADMOS + MINORDRG + OWNRENT, data = credit_count, family = quasipoisson(link = "log")))
  
#               Estimate Std. Error t value Pr(>|t|)   
# (Intercept) -1.3793249  0.0861324 -16.014  < 2e-16 ***
# AGE          0.0103949  0.0026823   3.875 0.000107 ***
# ACADMOS      0.0015322  0.0003847   3.983 6.84e-05 ***
# MINORDRG     0.4611297  0.0134770  34.216  < 2e-16 ***
# OWNRENT     -0.1993933  0.0556757  -3.581 0.000343 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for quasipoisson family taken to be 3.864409)
# 
#     Null deviance: 24954  on 13443  degrees of freedom
# Residual deviance: 22048  on 13439  degrees of freedom
# AIC: NA

SAS Macro for Engle-Granger Co-integration Test

In the coursework of time series analysis, we’ve been taught that a time series regression of Y on X could be valid only when both X and Y are stationary, due to the so-called “spurious regression problem”. However, one exception holds that if X and Y, albeit non-stationary, share a common trend such that their trends cancel each other out, then X and Y are co-integrated and the regression of Y on X is valid. As a result, it is important to test the co-integration between X and Y.

Following the definition of co-integration, it is straightforward to formulate a procedure for the co-integration test. First of all, construct a linear combination between Y and X such that e = Y – (a + b * X). Secondly, test whether e is stationary with the ADF test. If e is stationary, then X and Y are co-integrated. This two-stage procedure is also called the Engle-Granger co-integration test.
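
For illustration, below is a minimal R sketch of the same two-step logic on simulated data, using adf.test() from the tseries package; note that the formal Engle-Granger test applies different critical values to the residual-based ADF statistic.

library(tseries)

set.seed(1)
z <- cumsum(rnorm(200))          # a shared stochastic trend
x <- z + rnorm(200)
y <- 1 + 2 * x + rnorm(200)      # y and x are co-integrated by construction
e <- residuals(lm(y ~ x))        # step 1: the linear combination e = y - (a + b * x)
adf.test(e)                      # step 2: a small p-value suggests e is stationary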

Below is a SAS macro implementing Engle-Granger co-integration test to show the long-term relationship between GDP and other macro-economic variables, e.g. Personal Consumption and Personal Disposable Income.

SAS Macro

%macro eg_coint(data = , y = , xs = );
*********************************************************************;
* THIS SAS MACRO IMPLEMENTS THE ENGLE-GRANGER COINTEGRATION TEST IN *;
* A BATCH MODE TO PROCESS MANY TIME SERIES                          *;
*********************************************************************;
* INPUT PARAMETERS:                                                 *;
*   DATA: AN INPUT SAS DATASET                                      *;
*   Y   : A DEPENDENT VARIABLE IN THE COINTEGRATION REGRESSION      *;
*   XS  : A LIST OF INDEPENDENT VARIABLES IN THE COINTEGRATION      *;
*         REGRESSION                                                *;
*********************************************************************;
* AUTHOR: WENSUI.LIU@53.COM                                         *;
*********************************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*";

%local sig loop;

%let sig = 0.1;

%let loop = 1;

%do %while (%scan(&xs, &loop) ne %str());

  %let x = %scan(&xs, &loop);

  ods listing close;
  ods output FitStatistics = _fit;
  proc reg data = &data;
    model &y = &x;
    output out = _1 residual = r;
  run;
  quit;

  proc sql noprint;
    select cvalue2 into :r2 from _fit where upcase(label2) = "R-SQUARE";
  quit;

  proc arima data = _1;
    ods output stationaritytests = _adf1 (where = (upcase(type) = "ZERO MEAN" and lags = 1) drop = rho probrho fvalue probf);
    identify var = r stationarity = (adf = 1);
  run;
  quit;
  ods listing;

  %if &loop = 1 %then %do;
    data _adf;
      format vars $32. lterm_r2 best12. flg_coint $3.;
      set _adf1 (drop = type lags);
      vars = upcase("&x");
      lterm_r2 = &r2;
      if probtau < &sig then flg_coint = "YES";
      else flg_coint = "NO";
    run;
  %end;
  %else %do;
    data _adf;
      set _adf _adf1 (in = new drop = type lags);
      if new then do;
        vars = upcase("&x");
        lterm_r2 = &r2;
        if probtau < &sig then flg_coint = "YES";
          else flg_coint = "NO";
      end;
    run;
  %end;
    
  %let loop = %eval(&loop + 1);
%end;

proc sort data = _last_;
  by descending flg_coint probtau;
run;

proc report data = _last_ box spacing = 1 split = "/" nowd;
  COLUMN("ENGLE-GRANGER COINTEGRATION TEST BETWEEN %UPCASE(&Y) AND EACH VARIABLE BELOW/ "
         vars lterm_r2 flg_coint tau probtau);
  define vars      / "VARIABLES"                      width = 35;
  define lterm_r2  / "LONG-RUN/R-SQUARED"             width = 15 format =  9.4 center;
  define flg_coint / "COINTEGRATION/FLAG"             width = 15 center;
  define tau       / "TAU STATISTIC/FOR ADF TEST"     width = 20 format = 15.4;
  define probtau   / "P-VALUE FOR/ADF TEST"           width = 15 format =  9.4 center;
run;

%mend eg_coint;

%eg_coint(data = sashelp.citiqtr, y = gdp, xs = gyd gc);

SAS Output

----------------------------------------------------------------------------------------------------------
|                  ENGLE-GRANGER COINTEGRATION TEST BETWEEN GDP AND EACH VARIABLE BELOW                  |
|                                                                                                        |
|                                       LONG-RUN      COINTEGRATION         TAU STATISTIC   P-VALUE FOR  |
|VARIABLES                              R-SQUARED         FLAG               FOR ADF TEST    ADF TEST    |
|--------------------------------------------------------------------------------------------------------|
|GC                                 |      0.9985   |      YES      |             -2.8651|      0.0051   |
|-----------------------------------+---------------+---------------+--------------------+---------------|
|GYD                                |      0.9976   |      YES      |             -1.7793|      0.0715   |
----------------------------------------------------------------------------------------------------------

From the output, it is interesting to see that GDP in the U.S. shows a stronger long-run relationship with Personal Consumption than with Personal Disposable Income.

SAS Macro to Test Stationarity in Batch

To determine whether a time series is stationary or has a unit root, three methods can be used:

A. The most intuitive way, which is also sufficient in most cases, is to eyeball the ACF (Autocorrelation Function) plot of the time series. The ACF pattern with a fast decay might imply a stationary series.
B. Statistical tests for Unit Roots, e.g. ADF (Augmented Dickey–Fuller) or PP (Phillips–Perron) test, could be employed as well. With the Null Hypothesis of Unit Root, a statistically significant outcome might suggest a stationary series.
C. In addition to the aforementioned tests for unit roots, statistical tests for stationarity, e.g. the KPSS (Kwiatkowski–Phillips–Schmidt–Shin) test, might be a useful complement as well. With the Null Hypothesis of stationarity, a statistically insignificant outcome might suggest a stationary series. (A short R illustration of all three follows this list.)
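
Using the tseries package on a simulated stationary AR(1) series, the sketch below is only meant to illustrate the logic of A, B, and C.

library(tseries)

set.seed(1)
x <- arima.sim(list(ar = 0.5), n = 200)   # a stationary AR(1) series
acf(x)          # A: a fast-decaying ACF pattern suggests stationarity
adf.test(x)     # B: H0 is a unit root; a small p-value suggests stationarity
pp.test(x)      # B: same null hypothesis and interpretation as the ADF test
kpss.test(x)    # C: H0 is stationarity; an insignificant outcome is consistent with it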

By testing both the unit root and stationarity, the analyst should be able to have a better understanding about the data nature of a specific time series.

The SAS macro below is a convenient wrapper of stationarity tests for many time series in the production environment. (Please note that this macro only works for SAS 9.2 or above.)

%macro stationary(data = , vars =);
***********************************************************;
* THIS SAS MACRO IS TO DO STATIONARITY TESTS FOR MANY     *;
* TIME SERIES                                             *;
* ------------------------------------------------------- *;
* INPUT PARAMETERS:                                       *;
*   DATA: A INPUT SAS DATASET                             *;
*   VARS: A LIST OF TIME SERIES                           *;
* ------------------------------------------------------- *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*";

%local sig loop;

%let sig = 0.1;

%let loop = 1;

%do %while (%scan(&vars, &loop) ne %str());

  %let x = %scan(&vars, &loop);

  proc sql noprint;
    select int(12 * ((count(&x) / 100) ** 0.25)) into :nlag1 from &data;

    select int(max(1, (count(&x) ** 0.5) / 5)) into :nlag2 from &data;
  quit;

  ods listing close;
  ods output kpss = _kpss (drop = model lags rename = (prob = probeta))
             adf = _adf  (drop = model lags rho probrho fstat probf rename = (tau = adf_tau probtau = adf_probtau))
             philperron = _pp  (drop = model lags rho probrho rename = (tau = pp_tau probtau = pp_probtau));
  proc autoreg data = &data;
    model &x = / noint stationarity = (adf = &nlag1, phillips = &nlag2, kpss = (kernel = nw lag = &nlag1));
  run;
  quit;
  ods listing;

  proc sql noprint;
    create table
      _1 as 
    select
      upcase("&x")           as vars length = 32,
      upcase(_adf.type)      as type,
      _adf.adf_tau,
      _adf.adf_probtau,
      _pp.pp_tau,
      _pp.pp_probtau,
      _kpss.eta,
      _kpss.probeta,
      case 
        when _adf.adf_probtau < &sig or _pp.pp_probtau < &sig or _kpss.probeta > &sig then "*"
        else " "
      end                    as _flg,
      &loop                  as _i,
      monotonic()            as _j
    from 
      _adf inner join _pp on _adf.type = _pp.type inner join _kpss on _adf.type = _kpss.type;
  quit;

  %if &loop = 1 %then %do;
    data _result;
      set _1;
    run;
  %end;
  %else %do;
    proc append base = _result data = _1;
    run;
  %end;

  proc datasets library = work nolist;
    delete _1 _adf _pp _kpss / memtype = data;
  quit;

  %let loop = %eval(&loop + 1);
%end;

proc sort data = _result;
  by _i _j;
run;

proc report data = _result box spacing = 1 split = "/" nowd;
  column("STATISTICAL TESTS FOR STATIONARITY/ "
         vars type adf_tau adf_probtau pp_tau pp_probtau eta probeta _flg);
  define vars        / "VARIABLES/ "                  width = 20 group order order = data;
  define type        / "TYPE/ "                       width = 15 order order = data;
  define adf_tau     / "ADF TEST/FOR/UNIT ROOT"       width = 10 format = 8.2;
  define adf_probtau / "P-VALUE/FOR/ADF TEST"         width = 10 format = 8.4 center;
  define pp_tau      / "PP TEST/FOR/UNIT ROOT"        width = 10 format = 8.2;
  define pp_probtau  / "P-VALUE/FOR/PP TEST"          width = 10 format = 8.4 center;
  define eta         / "KPSS TEST/FOR/STATIONARY"     width = 10 format = 8.2;
  define probeta     / "P-VALUE/FOR/KPSS TEST"        width = 10 format = 8.4 center;
  define _flg        / "STATIONARY/FLAG"              width = 10 center;
run;

%mend stationary;

SAS Macro for Jarque-Bera Normality Test

%macro jarque_bera(data = , var = );
**********************************************;
* SAS macro to do Jarque-Bera Normality Test *;
* ------------------------------------------ *;
* wensui.liu@53.com                          *;
**********************************************;
options mprint mlogic symbolgen nodate;

ods listing close;
ods output moments = m1;
proc univariate data = &data normal;
  var &var.;
run;

proc sql noprint;
  select nvalue1 into :n from m1 where upcase(compress(label1, ' ')) = 'N';
  select put(nvalue1, best32.) into :s from m1 where upcase(compress(label1, ' ')) = 'SKEWNESS';
  select put(nvalue2, best32.) into :k from m1 where upcase(compress(label2, ' ')) = 'KURTOSIS';
quit;

data _temp_;
  jb = ((&s) ** 2 + (&k) ** 2 / 4) / 6 * &n;
  pvalue = 1 - probchi(jb, 2);
  put jb pvalue;
run;

ods listing;
proc print data = _last_ noobs;
run;

%mend jarque_bera;
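
For a quick cross-check of the same statistic, the snippet below applies jarque.bera.test() from the R tseries package to a simulated normal sample.

library(tseries)

set.seed(1)
jarque.bera.test(rnorm(1000))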

Estimating Time Series Models for Count Outcomes with SAS

In SAS, there is no out-of-the-box procedure to estimate time series models for count outcomes similar to the one shown here (https://statcompute.wordpress.com/2015/03/31/modeling-count-time-series-with-tscount-package). However, as long as we understand the likelihood function of the Poisson distribution, it is straightforward to estimate such a model with PROC MODEL in the ETS module.

Below is a demonstration of how to estimate a Poisson time series model with the identity link function. As shown, the parameter estimates with related inferences are extremely close to the ones estimated with the tsglm() function in the R tscount package.

data polio;
  idx + 1;
  input y @@;
datalines;
0  1  0  0  1  3  9  2  3  5  3  5  2  2  0  1  0  1  3  3  2  1  1  5  0
3  1  0  1  4  0  0  1  6 14  1  1  0  0  1  1  1  1  0  1  0  1  0  1  0
1  0  1  0  1  0  1  0  0  2  0  1  0  1  0  0  1  2  0  0  1  2  0  3  1
1  0  2  0  4  0  2  1  1  1  1  0  1  1  0  2  1  3  1  2  4  0  0  0  1
0  1  0  2  2  4  2  3  3  0  0  2  7  8  2  4  1  1  2  4  0  1  1  1  3
0  0  0  0  1  0  1  1  0  0  0  0  0  1  2  0  2  0  0  0  1  0  1  0  1
0  2  0  0  1  2  0  1  0  0  0  1  2  1  0  1  3  6
;
run;

proc model data = polio;
  parms b0 = 0.5 b1 = 0.1 b2 = 0.1;
  yhat = b0 + b1 * zlag1(y) + b2 * zlag1(yhat);
  y = yhat;
  lk = exp(-yhat) * (yhat ** y) / fact(y);
  ll = -log(lk);
  errormodel y ~ general(ll);
  fit y / fiml converge = 1e-8;
run;

/* OUTPUT:
            Nonlinear Liklhood Summary of Residual Errors

                  DF      DF                                        Adj
Equation       Model   Error        SSE        MSE   R-Square      R-Sq
y                  3     165      532.6     3.2277     0.0901    0.0791

           Nonlinear Liklhood Parameter Estimates

                              Approx                  Approx
Parameter       Estimate     Std Err    t Value     Pr > |t|
b0              0.606313      0.1680       3.61       0.0004
b1              0.349495      0.0690       5.06       <.0001
b2              0.206877      0.1397       1.48       0.1405

Number of Observations       Statistics for System

Used               168    Log Likelihood    -278.6615
Missing              0
*/

Modeling Count Time Series with tscount Package

The example below shows how to estimate a simple univariate Poisson time series model with the tscount package. While the model estimation is straightforward and yields very similar parameter estimates to the ones generated with the acp package (https://statcompute.wordpress.com/2015/03/29/autoregressive-conditional-poisson-model-i), the prediction mechanism is a bit tricky.

1) For the in-sample and the 1-step-ahead predictions:

yhat_[t] = beta0 + beta1 * y_[t – 1] + beta2 * yhat_[t – 1]

2) For the out-of-sample predictions with the observed Y unavailable:

yhat_[t] = beta0 + beta1 * yhat_[t – 1] + beta2 * yhat_[t – 1]

library(tscount)
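# NOTE: cnt is assumed to be a data frame holding the count series y, e.g. the polio counts used in the SAS example above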

mdl <- tsglm(cnt$y, model = list(past_obs = 1, past_mean = 1), distr = "poisson")
summary(mdl)
# tsglm(ts = cnt$y, model = list(past_obs = 1, past_mean = 1), 
#     distr = "poisson")
#
# Coefficients:
#              Estimate  Std. Error
# (Intercept)     0.632      0.1774
# beta_1          0.350      0.0687
# alpha_1         0.184      0.1455
# Standard errors obtained by normal approximation.
#
# Link function: identity 
# Distribution family: poisson 
# Number of coefficients: 3 
# Log-likelihood: -279.2738 
# AIC: 564.5476 
# BIC: 573.9195 

### in-sample prediction ###
cnt$yhat <- mdl$fitted.values
tail(cnt, 3)
#     y      yhat
# 166 1 0.8637023
# 167 3 1.1404714
# 168 6 1.8918651

### manually check ###
beta <- mdl$coefficients
pv167 <- beta[1] + beta[2] * cnt$y[166] + beta[3] * cnt$yhat[166] 
#  1.140471
pv168 <- beta[1] + beta[2] * cnt$y[167] + beta[3] * cnt$yhat[167] 
#  1.891865 

### out-of-sample prediction ###
oot <- predict(mdl, n.ahead = 3)
# [1] 3.080667 2.276211 1.846767

### manually check ###
ov2 <- beta[1] + beta[2] * oot[[1]][1] + beta[3] * oot[[1]][1] 
#  2.276211
ov3 <- beta[1] + beta[2] * oot[[1]][2] + beta[3] * oot[[1]][2] 
#  1.846767

Autoregressive Conditional Poisson Model – I

Modeling the time series of count outcomes is of interest in operational risk when forecasting the frequency of losses. Below is an example showing how to estimate a simple ACP(1, 1) model, i.e. Autoregressive Conditional Poisson, without covariates using the acp package.

library(acp)

### acp(1, 1) without covariates ###
mdl <- acp(y ~ -1, data = cnt)
summary(mdl)
# acp.formula(formula = y ~ -1, data = cnt)
#
#   Estimate   StdErr t.value   p.value    
# a 0.632670 0.169027  3.7430 0.0002507 ***
# b 0.349642 0.067414  5.1865 6.213e-07 ***
# c 0.184509 0.134154  1.3753 0.1708881    

### generate predictions ###
f <- predict(mdl)
pred <- data.frame(yhat = f, cnt)
tail(pred, 5)
#          yhat y
# 164 1.5396921 1
# 165 1.2663993 0
# 166 0.8663321 1
# 167 1.1421586 3
# 168 1.8923355 6

### calculate predictions manually ###
pv167 <- mdl$coef[1] + mdl$coef[2] * pred$y[166] + mdl$coef[3] * pred$yhat[166] 
# [1] 1.142159

pv168 <- mdl$coef[1] + mdl$coef[2] * pred$y[167] + mdl$coef[3] * pred$yhat[167] 
# [1] 1.892336

plot.ts(pred, main = "Predictions")


Model Segmentation with Cubist

Cubist is a tree-based model with an OLS regression attached to each terminal node and is somewhat similar to the mob() function in the party package (https://statcompute.wordpress.com/2014/10/26/model-segmentation-with-recursive-partitioning). Below is a demonstration of the cubist() model with the classic Boston housing data.

pkgs <- c('MASS', 'Cubist', 'caret')
lapply(pkgs, require, character.only = T)

data(Boston)
X <- Boston[, 1:13]
Y <- log(Boston[, 14])

### TRAIN THE MODEL ###
mdl <- cubist(x = X, y = Y, control = cubistControl(unbiased = TRUE,  label = "log_medv", seed = 2015, rules = 5))
summary(mdl)
#  Rule 1: [94 cases, mean 2.568824, range 1.609438 to 3.314186, est err 0.180985]
#
#    if
#	nox > 0.671
#    then
#	log_medv = 1.107315 + 0.588 dis + 2.92 nox - 0.0287 lstat - 0.2 rm
#	           - 0.0065 crim
#
#  Rule 2: [39 cases, mean 2.701933, range 1.94591 to 3.314186, est err 0.202473]
#
#    if
#	nox <= 0.671
#	lstat > 19.01
#    then
#	log_medv = 3.935974 - 1.68 nox - 0.0076 lstat + 0.0035 rad - 0.00017 tax
#	           - 0.013 dis - 0.0029 crim + 0.034 rm - 0.011 ptratio
#	           + 0.00015 black + 0.0003 zn
#
#  Rule 3: [200 cases, mean 2.951007, range 2.116256 to 3.589059, est err 0.100825]
#
#    if
#	rm <= 6.232
#	dis > 1.8773
#    then
#	log_medv = 2.791381 + 0.152 rm - 0.0147 lstat + 0.00085 black
#	           - 0.031 dis - 0.027 ptratio - 0.0017 age + 0.0031 rad
#	           - 0.00013 tax - 0.0025 crim - 0.12 nox + 0.0002 zn
#
#  Rule 4: [37 cases, mean 3.038195, range 2.341806 to 3.912023, est err 0.184200]
#
#    if
#	dis <= 1.8773
#	lstat <= 19.01
#    then
#	log_medv = 5.668421 - 1.187 dis - 0.0469 lstat - 0.0122 crim
#
#  Rule 5: [220 cases, mean 3.292121, range 2.261763 to 3.912023, est err 0.093716]
#
#    if
#	rm > 6.232
#	lstat <= 19.01
#    then
#	log_medv = 2.419507 - 0.033 lstat + 0.238 rm - 0.0089 crim + 0.0082 rad
#	           - 0.029 dis - 0.00035 tax + 0.0006 black - 0.024 ptratio
#	           - 0.0006 age - 0.12 nox + 0.0002 zn
#
# Evaluation on training data (506 cases):
#
#    Average  |error|           0.100444
#    Relative |error|               0.33
#    Correlation coefficient        0.94
#
#	Attribute usage:
#	  Conds  Model
#
#	   71%    94%    rm
#	   50%   100%    lstat
#	   40%   100%    dis
#	   23%    94%    nox
#	         100%    crim
#	          78%    zn
#	          78%    rad
#	          78%    tax
#	          78%    ptratio
#	          78%    black
#	          71%    age

### VARIABLE IMPORTANCE ###
varImp(mdl)
#        Overall
# rm         82.5
# lstat      75.0
# dis        70.0
# nox        58.5
# crim       50.0
# zn         39.0
# rad        39.0
# tax        39.0
# ptratio    39.0
# black      39.0
# age        35.5
# indus       0.0
# chas        0.0
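
As a small usage note, scoring with the fitted Cubist model goes through the usual predict() interface; below is a minimal sketch on the training data from the example above, purely for illustration.

yhat <- predict(mdl, newdata = X)   # rule-based predictions on the log(medv) scale
cor(yhat, Y) ^ 2                    # a rough in-sample R-squared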

Flexible Beta Modeling

library(betareg)
library(sas7bdat)

df1 <- read.sas7bdat('lgd.sas7bdat')
df2 <- df1[df1$y < 1, ]

fml <- as.formula('y ~ x2 + x3 + x4 + x5 + x6 | x3 + x4 | x1 + x2')
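# NOTE: the three parts of the formula are read as mean model | precision model |
# concomitant (for betamix) or partitioning (for betatree) variables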

### LATENT-CLASS BETA REGRESSION: AIC = -565 ###
mdl1 <- betamix(fml, data = df2, k = 2, FLXcontrol = list(iter.max = 500, minprior = 0.1))
print(mdl1)
#betamix(formula = fml, data = df2, k = 2, FLXcontrol = list(iter.max = 500, 
#    minprior = 0.1))
#
#Cluster sizes:
#  1   2 
#157 959 

summary(mdl1, which = 'concomitant')
#            Estimate Std. Error z value Pr(>|z|)   
#(Intercept) -1.35153    0.41988 -3.2188 0.001287 **
#x1           2.92537    1.13046  2.5878 0.009660 **
#x2           2.82809    1.42139  1.9897 0.046628 * 

summary(mdl1)
#$Comp.1$mean
#              Estimate Std. Error z value  Pr(>|z|)    
#(Intercept) -0.8963228  1.0385545 -0.8630 0.3881108    
#x2           3.1769062  0.6582108  4.8266 1.389e-06 ***
#x3          -0.0520060  0.0743714 -0.6993 0.4843805    
#x4           4.9642998  1.4204071  3.4950 0.0004741 ***
#x5           0.0021647  0.0022659  0.9554 0.3393987    
#x6           0.0248573  0.0062982  3.9467 7.922e-05 ***
#
#$Comp.1$precision
#            Estimate Std. Error z value  Pr(>|z|)    
#(Intercept) -5.37817    1.44817 -3.7138 0.0002042 ***
#x3           0.45009    0.10094  4.4589 8.239e-06 ***
#x4           3.06969    1.41450  2.1702 0.0299948 *  
#
#$Comp.2
#$Comp.2$mean
#              Estimate Std. Error z value  Pr(>|z|)    
#(Intercept) -1.8737088  0.3888454 -4.8186 1.445e-06 ***
#x2          -0.6318086  0.1892501 -3.3385 0.0008424 ***
#x3           0.1786425  0.0265428  6.7303 1.693e-11 ***
#x4           2.0646272  0.5256002  3.9281 8.561e-05 ***
#x5          -0.0064821  0.0014053 -4.6127 3.974e-06 ***
#x6           0.0018828  0.0022873  0.8231 0.4104318    
#
#$Comp.2$precision
#            Estimate Std. Error z value Pr(>|z|)   
#(Intercept) 1.092403   0.616974  1.7706 0.076630 . 
#x3          0.017330   0.040024  0.4330 0.665029   
#x4          2.138812   0.717702  2.9801 0.002882 **


### BETA REGRESSION TREE: AIC = -578 ###
mdl2 <- betatree(fml, data = df2, minsplit = 100)
print(mdl2)
#1) x2 <= 0.08584895; criterion = 1, statistic = 154.716
#  2)*  weights = 121 
#Terminal node model
#betaReg fit with coefficients:
#      (Intercept)                 x2                 x3                 x4  
#         3.307359          -2.854351          -0.262815          -2.414481  
#               x5                 x6  (phi)_(Intercept)           (phi)_x3  
#        -0.007555           0.030346           1.003767          -0.002907  
#         (phi)_x4  
#         2.528602  
#
#1) x2 > 0.08584895
#  3)*  weights = 995 
#Terminal node model
#betaReg fit with coefficients:
#      (Intercept)                 x2                 x3                 x4  
#        -2.134931          -0.194830           0.168136           2.811077  
#               x5                 x6  (phi)_(Intercept)           (phi)_x3  
#        -0.002070           0.004677          -1.018102           0.151778  
#         (phi)_x4  
#         2.142995  

sctest(mdl2, node = 1)
#                x1       x2
#statistic 113.4781 154.7165
#p.value     0.0000   0.0000

summary(mdl2)
#$`2`
#
#Coefficients (mean model with logit link):
#             Estimate Std. Error z value Pr(>|z|)    
#(Intercept)  3.307359   1.091289   3.031 0.002440 ** 
#x2          -2.854351   3.644882  -0.783 0.433561    
#x3          -0.262815   0.074716  -3.518 0.000436 ***
#x4          -2.414481   1.785447  -1.352 0.176276    
#x5          -0.007555   0.002788  -2.710 0.006738 ** 
#x6           0.030346   0.006833   4.441 8.96e-06 ***
#
#Phi coefficients (precision model with log link):
#             Estimate Std. Error z value Pr(>|z|)
#(Intercept)  1.003767   1.353496   0.742    0.458
#x3          -0.002907   0.090816  -0.032    0.974
#x4           2.528602   2.344241   1.079    0.281

#$`3`
#
#Coefficients (mean model with logit link):
#             Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -2.134931   0.337784  -6.320 2.61e-10 ***
#x2          -0.194830   0.144062  -1.352  0.17625    
#x3           0.168136   0.022521   7.466 8.28e-14 ***
#x4           2.811077   0.387788   7.249 4.20e-13 ***
#x5          -0.002070   0.001136  -1.822  0.06848 .  
#x6           0.004677   0.001770   2.643  0.00822 ** 
#
#Phi coefficients (precision model with log link):
#            Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -1.01810    0.46575  -2.186 0.028821 *  
#x3           0.15178    0.03057   4.965 6.88e-07 ***
#x4           2.14300    0.56979   3.761 0.000169 ***

Estimating a Beta Regression with The Variable Dispersion in R

pkgs <- c('sas7bdat', 'betareg', 'lmtest')
lapply(pkgs, require, character.only = T)

df1 <- read.sas7bdat("lgd.sas7bdat")
df2 <- df1[which(df1$y < 1), ]

xvar <- paste("x", 1:7, sep = '', collapse = " + ")
fml1 <- as.formula(paste("y ~ ", xvar))
fml2 <- as.formula(paste("y ~ ", xvar, "|", xvar))

# FIT A BETA MODEL WITH THE FIXED PHI
beta1 <- betareg(fml1, data = df2)
summary(beta1)

# Coefficients (mean model with logit link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.500242   0.329670  -4.551 5.35e-06 ***
# x1           0.007516   0.026020   0.289 0.772680    
# x2           0.429756   0.135899   3.162 0.001565 ** 
# x3           0.099202   0.022285   4.452 8.53e-06 ***
# x4           2.465055   0.415657   5.931 3.02e-09 ***
# x5          -0.003687   0.001070  -3.446 0.000568 ***
# x6           0.007181   0.001821   3.943 8.06e-05 ***
# x7           0.128796   0.186715   0.690 0.490319    
#
# Phi coefficients (precision model with identity link):
#       Estimate Std. Error z value Pr(>|z|)    
# (phi)   3.6868     0.1421   25.95   <2e-16 ***

# FIT A BETA MODEL WITH THE VARIABLE PHI
beta2 <- betareg(fml2, data = df2)
summary(beta2)

# Coefficients (mean model with logit link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.996661   0.336445  -5.935 2.95e-09 ***
# x1           0.007033   0.026809   0.262 0.793072    
# x2           0.371098   0.135186   2.745 0.006049 ** 
# x3           0.133356   0.022624   5.894 3.76e-09 ***
# x4           2.951245   0.401493   7.351 1.97e-13 ***
# x5          -0.003475   0.001119  -3.105 0.001902 ** 
# x6           0.006528   0.001884   3.466 0.000529 ***
# x7           0.100274   0.190915   0.525 0.599424    
#
# Phi coefficients (precision model with log link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -0.454399   0.452302  -1.005 0.315072    
# x1           0.009119   0.035659   0.256 0.798150    
# x2           0.611049   0.188225   3.246 0.001169 ** 
# x3           0.092073   0.030678   3.001 0.002689 ** 
# x4           2.248399   0.579440   3.880 0.000104 ***
# x5          -0.002188   0.001455  -1.504 0.132704    
# x6          -0.000317   0.002519  -0.126 0.899847    
# x7          -0.166226   0.256199  -0.649 0.516457    

# LIKELIHOOD RATIO TEST TO COMPARE BOTH BETA MODELS
lrtest(beta1, beta2) 

# Likelihood ratio test
#
# Model 1: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7
# Model 2: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 | x1 + x2 + x3 + x4 + x5 + x6 + x7
#   #Df LogLik Df Chisq Pr(>Chisq)    
# 1   9 231.55                        
# 2  16 257.24  7 51.38  7.735e-09 ***

Marginal Effects in Two-Part Fractional Models

As shown in “Two-Part Fractional Model” posted on 09/25/2012, sometimes it might be beneficial to model fractional outcomes in the range of [0, 1] with composite models, e.g. a two-part model, especially when there are a non-trivial number of boundary outcomes. However, the marginal effect of X in a two-part model is not as straightforward to calculate as in a one-part model shown in “Marginal Effects in Tobit Models” posted on 10/06/2012.

In the demonstration below, I will show how to calculate the marginal effect of X in a two-part model with a logic similar to the McDonald and Moffitt decomposition.
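
To be specific, the first part P1 = 1 / (1 + exp(-logit_xb)) gives the probability of a nonzero outcome and the second part P2 = 1 / (1 + exp(-nls_xb)) gives the conditional mean of a nonzero outcome, so that E(Y | X) = P1 * P2 and the marginal effect of each X decomposes into two pieces:

d E(Y | X) / d X = [exp(logit_xb) / (1 + exp(logit_xb)) ^ 2] * B1 * P2 + [exp(nls_xb) / (1 + exp(nls_xb)) ^ 2] * B2 * P1

where B1 and B2 are the coefficients of X in the two parts. This is exactly what the data step following the NLMIXED estimation computes for each of the seven covariates.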

proc nlmixed data = one tech = congra maxiter = 1000;
  parms b10 = -9.3586 b11 = -0.0595 b12 =  1.7644 b13 =  0.5994 b14 = -2.5496
        b15 = -0.0007 b16 = -0.0011 b17 = -1.6359
        b20 =  0.3401 b21 =  0.0274 b22 =  0.1437 b23 =  0.0229 b24 =  0.4656
        b25 =  0.0011 b26 =  0.0021 b27 =  0.1977  s  =  0.2149;
  logit_xb = b10 + b11 * x1 + b12 * x2 + b13 * x3 + b14 * x4 +
             b15 * x5 + b16 * x6 + b17 * x7;
  nls_xb = b20 + b21 * x1 + b22 * x2 + b23 * x3 + b24 * x4 +
           b25 * x5 + b26 * x6 + b27 * x7;
  p1 = 1 / (1 + exp(-logit_xb));
  p2 = 1 / (1 + exp(-nls_xb));
  if y = 0 then ll = log(1 - p1);
  else ll = log(p1) + log(pdf('normal', y, p2, s));
  model y ~ general(ll);
  predict logit_xb out = out_1 (rename = (pred = part1_xb) keep = _id_ pred y);
  predict p1       out = out_2 (rename = (pred = part1_p)  keep = _id_ pred);
  predict nls_xb   out = out_3 (rename = (pred = part2_xb) keep = _id_ pred);
  predict p2       out = out_4 (rename = (pred = part2_p)  keep = _id_ pred);
run;

data out;
  merge out_1 out_2 out_3 out_4;
  by _id_;
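
  /* the multipliers hard-coded below are the estimated coefficients of x1 - x7 from the
     joint nlmixed fit (b11 - b17 for the logit part and b21 - b27 for the positive part),
     and exp(xb) / ((1 + exp(xb)) ** 2) is the derivative of the logistic link wrt xb */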

  margin1_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -0.05948) * part2_p;
  margin1_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.01115) * part1_p;
  x1_margin = margin1_part1 + margin1_part2;

  margin2_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * 1.7645) * part2_p;
  margin2_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.4363) * part1_p;
  x2_margin = margin2_part1 + margin2_part2;

  margin3_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * 0.5994) * part2_p;
  margin3_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.1139) * part1_p;
  x3_margin = margin3_part1 + margin3_part2;

  margin4_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -2.5496) * part2_p;
  margin4_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -2.8755) * part1_p;
  x4_margin = margin4_part1 + margin4_part2;

  margin5_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -0.00071) * part2_p;
  margin5_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * 0.004091) * part1_p;
  x5_margin = margin5_part1 + margin5_part2;

  margin6_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -0.00109) * part2_p;
  margin6_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.00839) * part1_p;
  x6_margin = margin6_part1 + margin6_part2;

  margin7_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -1.6359) * part2_p;
  margin7_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.1666) * part1_p;
  x7_margin = margin7_part1 + margin7_part2;
run;

proc means data = _last_ mean;
  var x:;
run;
/*
Variable                 Mean

x1_margin          -0.0039520
x2_margin           0.0739847
x3_margin           0.0270673
x4_margin          -0.3045967
x5_margin         0.000191015
x6_margin        -0.000533998
x7_margin          -0.1007960
*/
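
For readers who prefer R, below is a minimal sketch of the same calculation under a hypothetical data frame df with the outcome y in [0, 1] and covariates x1 - x7. The two parts are fitted with glm() merely as a stand-in for the NLMIXED fit above, a binomial logit for P(y > 0) and a quasi-binomial logit for the positive outcomes, and the average marginal effect of x2 is then the sample mean of the sum of the two weighted derivatives of the logistic link.

# A MINIMAL R SKETCH OF THE TWO-PART MARGINAL EFFECT (HYPOTHETICAL DATA FRAME df)
part1 <- glm(I(y > 0) ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = df, family = binomial)
part2 <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = df, subset = y > 0,
             family = quasibinomial(link = "logit"))
xb1 <- predict(part1, newdata = df, type = "link")
xb2 <- predict(part2, newdata = df, type = "link")
# dlogis(xb) = exp(xb) / (1 + exp(xb)) ^ 2 IS THE DERIVATIVE OF THE LOGISTIC LINK
x2_margin <- dlogis(xb1) * coef(part1)["x2"] * plogis(xb2) +
             dlogis(xb2) * coef(part2)["x2"] * plogis(xb1)
mean(x2_margin)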

Two-Part Fractional Model

1. Two-step estimation method:
– SAS code

data one;
  set data.sme;
  _id_ + 1;
  if y = 0 then y2 = 0;
  else y2 = 1;
run;

proc logistic data = one desc;
  model y2 = x1 - x7;
run;

proc nlmixed data = one tech = congra;
  where y2 = 1;
  parms b0 =  2.14 b1 = -0.01 b2 = -0.55 b3 = -0.14 b4 = -3.47
        b5 =  0.01 b6 = -0.01 b7 = -0.19 s = 0.1;
  _xb_ = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 +
         b5 * x5 + b6 * x6 + b7 * x7;
  _mu_ = 1 / (1 + exp(-_xb_));
  lh = pdf('normal', y, _mu_, s);
  ll = log(lh);
  model y ~ general(ll);
  predict ll out = out1(keep = _id_ pred rename = (pred = ln_like1));
run;

– Outputs

*** output for logit component ***
         Model Fit Statistics
                             Intercept
              Intercept            and
Criterion          Only     Covariates
AIC            4997.648       4066.398
SC             5004.043       4117.551
-2 Log L       4995.648       4050.398

             Analysis of Maximum Likelihood Estimates
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1     -9.3586      0.4341      464.8713        <.0001
x1            1     -0.0595      0.0341        3.0445        0.0810
x2            1      1.7644      0.1836       92.3757        <.0001
x3            1      0.5994      0.0302      392.7404        <.0001
x4            1     -2.5496      0.5084       25.1488        <.0001
x5            1    -0.00071     0.00128        0.3094        0.5780
x6            1    -0.00109     0.00267        0.1659        0.6838
x7            1     -1.6359      0.2443       44.8427        <.0001

*** output for the positive component ***
             Fit Statistics
-2 Log Likelihood                 -264.5
AIC (smaller is better)           -246.5
AICC (smaller is better)          -246.3
BIC (smaller is better)           -201.4

                 Parameter Estimates
                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|
b0            1.7947     0.3401   1116      5.28     <.0001
b1          -0.01216    0.02737   1116     -0.44     0.6568
b2           -0.4306     0.1437   1116     -3.00     0.0028
b3           -0.1157    0.02287   1116     -5.06     <.0001
b4           -2.9267     0.4656   1116     -6.29     <.0001
b5          0.004092   0.001074   1116      3.81     0.0001
b6          -0.00838   0.002110   1116     -3.97     <.0001
b7           -0.1697     0.1977   1116     -0.86     0.3909
s             0.2149   0.004549   1116     47.24     <.0001

2. Joint estimation method:
– SAS code

proc nlmixed data = one tech = congra maxiter = 1000;
  parms b10 = -9.3586 b11 = -0.0595 b12 =  1.7644 b13 =  0.5994 b14 = -2.5496
        b15 = -0.0007 b16 = -0.0011 b17 = -1.6359
        b20 =  0.3401 b21 =  0.0274 b22 =  0.1437 b23 =  0.0229 b24 =  0.4656
        b25 =  0.0011 b26 =  0.0021 b27 =  0.1977  s  =  0.2149;
  logit_xb = b10 + b11 * x1 + b12 * x2 + b13 * x3 + b14 * x4 +
             b15 * x5 + b16 * x6 + b17 * x7;
  nls_xb = b20 + b21 * x1 + b22 * x2 + b23 * x3 + b24 * x4 +
           b25 * x5 + b26 * x6 + b27 * x7;
  p1 = 1 / (1 + exp(-logit_xb));
  if y = 0 then ll = log(1 - p1);
  else ll = log(p1) + log(pdf('normal', y, 1 / (1 + exp(-nls_xb)), s));
  model y ~ general(ll);
run;

– Outputs

             Fit Statistics
-2 Log Likelihood                 3785.9
AIC (smaller is better)           3819.9
AICC (smaller is better)          3820.0
BIC (smaller is better)           3928.6

                 Parameter Estimates
                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|
b10          -9.3586     0.4341   4421    -21.56     <.0001
b11         -0.05948    0.03408   4421     -1.75     0.0810
b12           1.7645     0.1836   4421      9.61     <.0001
b13           0.5994    0.03024   4421     19.82     <.0001
b14          -2.5496     0.5084   4421     -5.01     <.0001
b15         -0.00071   0.001276   4421     -0.56     0.5784
b16         -0.00109   0.002673   4421     -0.41     0.6836
b17          -1.6359     0.2443   4421     -6.70     <.0001
b20           1.7633     0.3398   4421      5.19     <.0001
b21         -0.01115    0.02730   4421     -0.41     0.6830
b22          -0.4363     0.1437   4421     -3.04     0.0024
b23          -0.1139    0.02285   4421     -4.98     <.0001
b24          -2.8755     0.4643   4421     -6.19     <.0001
b25         0.004091   0.001074   4421      3.81     0.0001
b26         -0.00839   0.002109   4421     -3.98     <.0001
b27          -0.1666     0.1975   4421     -0.84     0.3991
s             0.2149   0.004550   4421     47.24     <.0001

As shown above, both estimation methods give similar parameter estimates. In addition, the sum of the log likelihoods from the two models in the 2-step estimation is exactly equal to the log likelihood of the joint estimation model, e.g. -2 log L: 4050.4 - 264.5 = 3785.9, as expected since the logit part and the positive part of the joint likelihood share no parameters.
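
As a cross-check in R, the two-step estimation can be sketched as below under a hypothetical data frame df with the fractional outcome y and covariates x1 - x7: a logistic regression for the zero / nonzero indicator and a nonlinear least squares fit with a logit mean for the positive outcomes, which corresponds to the normal likelihood used in the NLMIXED step. Because the two parts share no parameters, adding their log likelihoods should reproduce the joint fit up to rounding.

# A MINIMAL R SKETCH OF THE TWO-STEP ESTIMATION (HYPOTHETICAL DATA FRAME df)
part1 <- glm(I(y > 0) ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = df, family = binomial)
# STARTING VALUES BELOW ARE PLACEHOLDERS AND MAY NEED TUNING FOR A GIVEN DATASET
part2 <- nls(y ~ 1 / (1 + exp(-(b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 +
                                b5 * x5 + b6 * x6 + b7 * x7))),
             data = subset(df, y > 0),
             start = list(b0 = 0, b1 = 0, b2 = 0, b3 = 0, b4 = 0,
                          b5 = 0, b6 = 0, b7 = 0))
# THE SUM OF THE TWO LOG LIKELIHOODS SHOULD MATCH THE JOINT ESTIMATION
as.numeric(logLik(part1)) + as.numeric(logLik(part2))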