Archive for November 2016
Modeling the frequency is one of the most important aspects in operational risk models. In the previous post (https://statcompute.wordpress.com/2016/05/13/more-flexible-approaches-to-model-frequency), the importance of flexible modeling approaches for both under-dispersion and over-dispersion has been discussed.
In addition to the quasi-poisson regression, three flexible frequency modeling techniques, including generalized poisson, double poisson, and Conway-Maxwell poisson, with their implementations in R should also be demonstrated below. While the example is specifically related to the over-dispersed data simulated with the negative binomial distributional assumption, these approaches can be generalized to the under-dispersed data as well given their flexibility. However, as demonstrated below, the calculation of parameters for these modeling approaches is not straight-forward.
Over-Dispersed Data Simulation
> set.seed(1) > ### SIMULATE NEG. BINOMIAL WITH MEAN(X) = MU AND VAR(X) = MU + MU ^ 2 / THETA > df <- data.frame(y = MASS::rnegbin(1000, mu = 10, theta = 5)) > ### DATA MEAN > mean(df$y)  9.77 > ### DATA VARIANCE > var(df$y)  30.93003003
Generalized Poisson Regression
> library(VGAM) > gpois <- vglm(y ~ 1, data = df, family = genpoisson) > gpois.theta <- exp(coef(gpois)) > gpois.lambda <- (exp(coef(gpois)) - 1) / (exp(coef(gpois)) + 1) > ### ESTIMATE MEAN = THETA / (1 - LAMBDA) > gpois.theta / (1 - gpois.lambda) (Intercept):2 9.77 > ### ESTIMATE VARIANCE = THETA / ((1 - LAMBDA) ^ 3) > gpois.theta / ((1 - gpois.lambda) ^ 3) (Intercept):2 31.45359991
Double Poisson Regression
> ### DOUBLE POISSON > library(gamlss) > dpois <- gamlss(y ~ 1, data = df, family = DPO, control = gamlss.control(n.cyc = 100)) > ### ESTIMATE MEAN > dpois.mu <- exp(dpois$mu.coefficients) > dpois.mu (Intercept) 9.848457877 > ### ESTIMATE VARIANCE = MU * SIGMA > dpois.sigma <- exp(dpois$sigma.coefficients) > dpois.mu * dpois.sigma (Intercept) 28.29229702
Conway-Maxwell Poisson Regression
> ### CONWAY-MAXWELL POISSON > library(CompGLM) > cpois <- glm.comp(y ~ 1, data = df) > cpois.lambda <- exp(cpois$beta) > cpois.nu <- exp(cpois$zeta) > ### ESTIMATE MEAN = LAMBDA ^ (1 / NU) - (NU - 1) / (2 * NU) > cpois.lambda ^ (1 / cpois.nu) - (cpois.nu - 1) / (2 * cpois.nu) (Intercept) 9.66575376 > ### ESTIMATE VARIANCE = LAMBDA ** (1 / NU) / NU > cpois.lambda ^ (1 / cpois.nu) / cpois.nu (Intercept) 29.69861239
The severity measure in operational loss models has an empirical distribution with positive values and a long tail to the far right. To estimate regression models for severity measures with such data characteristics, we can consider several candidate distributions, such as Lognormal, Gamma, inverse Gaussian, and so on. A statistical approach is called for to choose the appropriate estimator with a correct distributional assumption. The modified Park test is designed to fill the gap.
For any GLM model, a general relationship between the variance and the mean can be described as below:
var(y | x) = alpha * [E(y | x)] ^ lambda
- With lambda = 0, it is suggested that the relationship between the variance and the mean is orthogonal. In this case, a Gaussian distributional assumption should be considered.
- With lambda = 1, it is suggestion that the variance is proportional to the mean. In this case, a Poisson-like distribution assumption should be considered.
- With lambda = 2, it is suggested that the variance is quadratic to the mean. In this case, a Gamma distributional assumption should be considered.
- With lambda = 3, it is suggested that the variance is cubic to the mean. In this case, an Inverse Gaussian distributional assumption should be considered.
Without the loss of generality, the aforementioned logic can be further formulated as below given E(y | x) = yhat for an arbitrary estimator. As mentioned by Manning and Mullahy (2001), a Gamma estimator can be considered a natural baseline estimator.
var(y | x) = alpha * [E(y | x)] ^ lambda
–> (y – yhat) ^ 2 = alpha * [yhat] ^ lambda
–> log(y – yhat) ^ 2 = log(alpha) + lambda * log(yhat)
With the above formulation, there are two ways to construct the statistical test for lambda, which is the so-called “modified Park test”.
In the OLS regression setting, the log of squared residuals from the baseline estimator can be regression on a constant and the log of predicted values from the baseline estimator, e.g. a Gamma regression.
proc reg data = data; model ln_r2 = ln_yhat; park_test: test ln_yhat = 2; run;
In the demonstrated example, we want to test the null hypothesis if the coefficient of ln_yhat is statistically different from 2, which suggests a Gamma distributional assumption.
Alternatively, in the GLM setting, the squared residuals from the baseline estimator can be regressed on a constant and the log of predicted values from the baseline estimator. In this specific GLM, the Gamma distribution and the log() link function should be employed.
proc nlmixed data = data; parms b0 = 1 b2 = 2 scale = 10; mu = exp(b0 + b1 * x); b = mu / scale; model r2 ~ gamma(scale, b); contrast 'park test' b1 - 2; run;
Similarly, if the null hypothesis that the coefficient of ln_yhat minus 2 is not statistically different from 0 cannot be rejected, then the Gamma distributional assumption is valid based on the modified Park test.
In several previous posts, I’ve shown how to estimate severity models under the various distributional assumptions, including Lognormal, Gamma, and Inverse Gaussian. However, I am not satisfied with the fact that the supporting domain of aforementioned distributions doesn’t include the value at ZERO.
Today, I had spent some time on looking into another interesting distribution, namely Pareto Type II distribution, and the possibility of estimating the regression model. The Pareto Type II distribution, which is also called Lomax distribution, is a special case of the Pareto distribution such that its supporting domain starts at ZERO (>= 0) with a long tail to the right, making it a good candidate for severity or loss distributions. This distribution can be described by 2 parameters, a scale parameter “Lambda” and a shape parameter “Alpha” such that prob(y) = Alpha / Lambda * (1 + y / Lambda) ^ (-(1 + Alpha)) with the mean E(y) = Lambda / (Alpha – 1) for Alpha > 1 and Var(y) = Lambda ^ 2 * Alpha / [(Alpha – 1) ^ 2 * (Alpha – 2)] for Alpha > 2.
With the re-parameterization, Alpha and Lambda can be further expressed in terms of E(y) = mu and Var(y) = sigma2 such that Alpha = 2 * sigma2 / (sigma2 – mu ^ 2) and Lambda = mu * ((sigma2 + mu ^ 2) / (sigma2 – mu ^ 2)). Below is an example showing how to estimate the mean and the variance by using the likelihood function of Lomax distribution with SAS / NLMIXED procedure.
data test; do i = 1 to 100; y = exp(rannor(1)); output; end; run; proc nlmixed data = test tech = trureg; parms _c_ = 0 ln_sigma2 = 1; mu = exp(_c_); sigma2 = exp(ln_sigma2); alpha = 2 * sigma2 / (sigma2 - mu ** 2); lambda = mu * ((sigma2 + mu ** 2) / (sigma2 - mu ** 2)); lh = alpha / lambda * ( 1 + y/ lambda) ** (-(alpha + 1)); ll = log(lh); model y ~ general(ll); predict mu out = pred (rename = (pred = mu)); run; proc means data = pred; var mu y; run;
With the above setting, it is very doable to estimate a regression model with the Lomax distributional assumption. However, in order to make it useful in production, I still need to find out an effective way to ensure the estimation convergence after including co-variates in the model.