## Archive for the ‘**SAS**’ Category

## Double Poisson Regression in SAS

In the previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models), I’ve shown how to estimate the double Poisson (DP) regression in R with the gamlss package. The hurdle of estimating DP regression is the calculation of a normalizing constant in the DP density function, which can be calculated either by the sum of an infinite series or by a closed form approximation. In the example below, I will show how to estimate DP regression in SAS with the GLIMMIX procedure.

First of all, I will show how to estimate DP regression by using the exact DP density function. In this case, we will approximate the normalizing constant by computing a partial sum of the infinite series, as highlighted below.

data poi; do n = 1 to 5000; x1 = ranuni(1); x2 = ranuni(2); x3 = ranuni(3); y = ranpoi(4, exp(1 * x1 - 2 * x2 + 3 * x3)); output; end; run; proc glimmix data = poi; nloptions tech = quanew update = bfgs maxiter = 1000; model y = x1 x2 x3 / link = log solution; theta = exp(_phi_); _variance_ = _mu_ / theta; p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta; p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta); f = (theta ** 0.5) * ((exp(-_mu_)) ** theta); do i = 1 to 100; f = f + (theta ** 0.5) * ((exp(-i) * (i ** i) / fact(i)) ** (1 - theta)) * ((exp(-_mu_) * (_mu_ ** i) / fact(i)) ** theta); end; k = 1 / f; prob = k * (theta ** 0.5) * p_y * p_u; if log(prob) ~= . then _logl_ = log(prob); run;

Next, I will show the same estimation routine by using the closed form approximation.

proc glimmix data = poi; nloptions tech = quanew update = bfgs maxiter = 1000; model y = x1 x2 x3 / link = log solution; theta = exp(_phi_); _variance_ = _mu_ / theta; p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta; p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta); k = 1 / (1 + (1 - theta) / (12 * theta * _mu_) * (1 + 1 / (theta * _mu_))); prob = k * (theta ** 0.5) * p_y * p_u; if log(prob) ~= . then _logl_ = log(prob); run;

While the first approach is more accurate by closely following the DP density function, the second approach is more efficient with a significantly lower computing cost. However, both are much faster than the corresponding R function gamlss().

## SAS Macro Calculating Goodness-of-Fit Statistics for Quantile Regression

As shown by Fu and Wu in their presentation (https://www.casact.org/education/rpm/2010/handouts/CL1-Fu.pdf), the quantile regression is an appealing approach to model severity measures with high volatilities due to its statistical characteristics, including the robustness to extreme values and no distributional assumptions. Curti and Migueis also pointed out in a research paper (https://www.federalreserve.gov/econresdata/feds/2016/files/2016002r1pap.pdf) that the operational loss is more sensitive to macro-economic drivers at the tail, making the quantile regression an ideal model to capture such relationships.

While the quantile regression can be conveniently estimated in SAS with the QUANTREG procedure, the standard SAS output doesn’t provide goodness-of-fit (GoF) statistics. More importantly, it is noted that the underlying rationale of calculating GoF in a quantile regression is very different from the ones employed in OLS or GLM regressions. For instance, the most popular R-square is not applicable in the quantile regression anymore. Instead, a statistic called “R1” should be used. In addition, AIC and BIC are also defined differently in the quantile regression.

Below is a SAS macro showing how to calculate GoF statistics, including R1 and various information criterion, for a quantile regression.

%macro quant_gof(data = , y = , x = , tau = 0.5); ***********************************************************; * THE MACRO CALCULATES GOODNESS-OF-FIT STATISTICS FOR *; * QUANTILE REGRESSION *; * ------------------------------------------------------- *; * REFERENCE: *; * GOODNESS OF FIT AND RELATED INFERENCE PROCESSES FOR *; * QUANTILE REGRESSION, KOENKER AND MACHADO, 1999 *; ***********************************************************; options nodate nocenter; title; * UNRESTRICTED QUANTILE REGRESSION *; ods select ParameterEstimates ObjFunction; ods output ParameterEstimates = _est; proc quantreg data = &data ci = resampling(nrep = 500); model &y = &x / quantile = &tau nosummary nodiag seed = 1; output out = _full p = _p; run; * RESTRICTED QUANTILE REGRESSION *; ods select none; proc quantreg data = &data ci = none; model &y = / quantile = &tau nosummary nodiag; output out = _null p = _p; run; ods select all; proc sql noprint; select sum(df) into :p from _est; quit; proc iml; use _full; read all var {&y _p} into A; close _full; use _null; read all var {&y _p} into B; close _null; * DEFINE A FUNCTION CALCULATING THE SUM OF ABSOLUTE DEVIATIONS *; start loss(x); r = x[, 1] - x[, 2]; z = j(nrow(r), 1, 0); l = sum(&tau * (r <> z) + (1 - &tau) * (-r <> z)); return(l); finish; r1 = 1 - loss(A) / loss(B); adj_r1 = 1 - ((nrow(A) - 1) * loss(A)) / ((nrow(A) - &p) * loss(B)); aic = 2 * nrow(A) * log(loss(A) / nrow(A)) + 2 * &p; aicc = 2 * nrow(A) * log(loss(A) / nrow(A)) + 2 * &p * nrow(A) / (nrow(A) - &p - 1); bic = 2 * nrow(A) * log(loss(A) / nrow(A)) + &p * log(nrow(A)); l = {"R1" "ADJUSTED R1" "AIC" "AICC" "BIC"}; v = r1 // adj_r1 // aic // aicc // bic; print v[rowname = l format = 20.8 label = "Fit Statistics"]; quit; %mend quant_gof;

## Estimating Conway-Maxwell-Poisson Regression in SAS

Conway-Maxwell-Poisson (CMP) regression is a flexible way to model frequency outcomes with both under-dispersion and over-dispersion. In SAS, CMP regression can be estimated with COUNTREG procedure directly or with NLMIXED procedure by specifying the likelihood function. However, the use of NLMIXED procedure is extremely cumbersome in that we need to estimate a standard Poisson regression and then use estimated parameters as initial values parameter estimates for the CMP regression.

In the example below, we will show how to employ GLIMMIX procedure to estimate a CMP regression by providing both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / link = log solution; _nu = 1 / exp(_phi_); _variance_ = (1 / _nu) / ((_mu_) ** (1 / _nu)); _z = 0; do i = 0 to 100; _z = _z + (_mu_ ** i) / fact(i) ** _nu; end; _prob = (_mu_ ** majordrg) / (fact(majordrg) ** _nu) * (_z ** (-1)); _logl_ = log(_prob); run;

Since the scale parameter **_phi_** is strictly above 0, the function **1 / exp(_phi_)** in the line #3 is to ensure the **Nu** parameter bounded between 0 and 1.

In addition, the DO loop is to calculate the normalization constant **Z** such that the PMF would sum up to 1. As there is no closed form for the calculation of **Z**, we need to calculate it numerically at the cost of a longer computing time.

Other implicit advantages of GLIMMIX procedure over NLMIXED procedure include the unnecessity to provide initiate values of parameter estimates and a shorter computing time.

## Modeling Generalized Poisson Regression in SAS

The Generalized Poisson (GP) regression is a very flexible statistical model for count outcomes in that it can accommodate both over-dispersion and under-dispersion, which makes it a very practical modeling approach in real-world problems and is considered a serious contender for the Quasi-Poisson regression.

Prob(Y) = Alpha / Y! * (Alpha + Xi * Y) ^ (Y – 1) * EXP(-Alpha – Xi * Y)

E(Y) = Mu = Alpha / (1 – Xi)

Var(Y) = Mu / (1 – Xi) ^ 2

While the GP regression can be conveniently estimated with HMM procedure in SAS, I’d always like to dive a little deeper into its model specification and likelihood function to have a better understanding. For instance, there is a slight difference in GP model outcomes between HMM procedure in SAS and VGAM package in R. After looking into the detail, I then realized that the difference is solely due to the different parameterization.

Basically, there are three steps for estimating a GP regression with NLMIXED procedure. Due to the complexity of GP likelihood function and its convergence process, it is always a good practice to estimate a baseline Standard Poisson regression as a starting point and then to output its parameter estimates into a table, e.g. _EST as shown below.

ods output ParameterEstimates = _est; proc genmod data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / dist = poisson link = log; run;

After acquiring parameter estimates from a Standard Poisson regression, we can use them to construct initiate values of parameter estimates for the Generalized Poisson regression. In the code snippet below, we used SQL procedure to create 2 macro variables that we are going to use in the final model estimation of GP regression.

proc sql noprint; select "_"||compress(upcase(parameter), ' ')||" = "||compress(put(estimate, 10.2), ' ') into :_parm separated by ' ' from _est; select case when upcase(parameter) = 'INTERCEPT' then "_"||compress(upcase(parameter), ' ') else "_"||compress(upcase(parameter), ' ')||" * "||compress(upcase(parameter), ' ') end into :_xb separated by ' + ' from _est where upcase(parameter) ~= 'SCALE'; quit; /* %put &_parm; _INTERCEPT = -1.38 _AGE = 0.01 _ACADMOS = 0.00 _MINORDRG = 0.46 _OWNRENT = -0.20 _SCALE = 1.00 %put &_xb; _INTERCEPT + _AGE * AGE + _ACADMOS * ACADMOS + _MINORDRG * MINORDRG + _OWNRENT * OWNRENT */

In the last step, we used the NLMIXED procedure to estimate the GP regression by specifying its log likelihood function that would generate identical model results as with HMM procedure. It is worth mentioning that the expected mean _mu = exp(x * beta) in SAS and the term exp(x * beta) refers to the _alpha parameter in R. Therefore, the intercept would be different between SAS and R, primarily due to different ways of parameterization with the identical statistical logic.

proc nlmixed data = mylib.credit_count; parms &_parm.; _xb = &_xb.; _xi = 1 - exp(-_scale); _mu = exp(_xb); _alpha = _mu * (1 - _xi); _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg); ll = log(_prob); model majordrg ~ general(ll); run;

In addition to HMM and NLMIXED procedures, GLIMMIX can also be employed to estimate the GP regression, as shown below. In this case, we need to specify both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / link = log solution; _xi = 1 - 1 / exp(_phi_); _variance_ = _mu_ / (1 - _xi) ** 2; _alpha = _mu_ * (1 - _xi); _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg); _logl_ = log(_prob); run;

## Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has a probability function shown as below

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In the formulation, the scale parameter **0 < a < y** and the shape parameter **k > 1 **.

The positive lower bound of Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of ABA operational risk consortium data is $10,000 and any loss event below the threshold value would be not reported, which might add the complexity in the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response ** y` = severity – threshold ** to accommodate the threshold value such that the supporting domain of y` could start from 0 and that the Gamma, Inverse Gaussian, or Lognormal regression can still be applicable. However, under the distributional assumption of Type-I Pareto with a known lower end, we do not need to shift the severity measure anymore but model it directly based on the probability function.

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound ** a = 2 ** by using the **VGAM** package.

library(VGAM) set.seed(2017) n <- 200 a <- 2 x <- runif(n) k <- exp(1 + 5 * x) pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x) fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE) summary(fit) # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) 1.0322 0.1363 7.574 3.61e-14 *** # x 4.9815 0.2463 20.229 < 2e-16 *** AIC(fit) # -644.458 BIC(fit) # -637.8614

The SAS code below estimating the Type-I Pareto regression provides almost identical model estimation.

proc nlmixed data = pdata; parms b0 = 0.1 b1 = 0.1; k = exp(b0 + b1 * x); a = 2; lh = k * (a ** k) / (y ** (k + 1)); ll = log(lh); model y ~ general(ll); run; /* Fit Statistics -2 Log Likelihood -648.5 AIC (smaller is better) -644.5 AICC (smaller is better) -644.4 BIC (smaller is better) -637.9 Parameter Estimate Standard DF t Value Pr > |t| Error b0 1.0322 0.1385 200 7.45 <.0001 b1 4.9815 0.2518 200 19.78 <.0001 */

At last, it is worth pointing out that the conditional mean of Type-I Pareto response is not equal to ** exp(x * beta) ** but ** a * k / (k – 1) ** with ** k = exp(x * beta) **. Therefore, the conditional mean only exists when ** k > 1 **, which might cause numerical issues in the model estimation.

## Pregibon Test for Goodness of Link in SAS

When estimating generalized linear models for binary outcomes, we often choose the logit link function by default and seldom consider other alternatives such as probit or cloglog. The Pregibon test (Pregibon, 1980) provides a mean to check the goodness of link with a simple logic outlined below.

1. First of all, we can estimate the regression model with the hypothesized link function, e.g. logit;

2. After the model estimation, we calculate yhat and yhat ^ 2 and then estimate a secondary regression with the identical response variable Y and link function but with yhat and yhat ^ 2 as model predictors (with the intercept).

3. If the link function is correctly specified, then the t-value of yaht ^2 should be insignificant.

The SAS macro shown below is the implementation of Pregibon test in the context of logistic regressions. However, the same idea can be generalized to any GLM.

%macro pregibon(data = , y = , x = ); ***********************************************************; * SAS MACRO PERFORMING PREGIBON TEST FOR GOODNESS OF LINK *; * ======================================================= *; * INPUT PAREMETERS: *; * DATA : INPUT SAS DATA TABLE *; * Y : THE DEPENDENT VARIABLE WITH 0 / 1 VALUES *; * X : MODEL PREDICTORS *; * ======================================================= *; * AUTHOR: WENSUI.LIU@53.COM *; ***********************************************************; options mprint mlogic nocenter; %let links = logit probit cloglog; %let loop = 1; proc sql noprint; select n(&data) - 3 into :df from &data; quit; %do %while (%scan(&links, &loop) ne %str()); %let link = %scan(&links, &loop); proc logistic data = &data noprint desc; model &y = &x / link = &link; score data = &data out = _out1; run; data _out2; set _out1(rename = (p_1 = p1)); p2 = p1 * p1; run; ods listing close; ods output ParameterEstimates = _parm; proc logistic data = _out2 desc; model &y = p1 p2 / link = &link ; run; ods listing; %if &loop = 1 %then %do; data _parm1; format link $10.; set _parm(where = (variable = "p2")); link = upcase("&link"); run; %end; %else %do; data _parm1; set _parm1 _parm(where = (variable = "p2") in = new); if new then link = upcase("&link"); run; %end; data _parm2(drop = variable); set _parm1; _t = estimate / stderr; _df = &df; _p = (1 - probt(abs(_t), _df)) * 2; run; %let loop = %eval(&loop + 1); %end; title; proc report data = _last_ spacing = 1 headline nowindows split = "*"; column(" * PREGIBON TEST FOR GOODNESS OF LINK * H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY * " link _t _df _p); define link / "LINK FUNCTION" width = 15 order order = data; define _t / "T-VALUE" width = 15 format = 12.4; define _df / "DF" width = 10; define _p / "P-VALUE" width = 15 format = 12.4; run; %mend;

After applying the macro to the kyphosis data (https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html), we can see that both logit and probit can be considered appropriate link functions in this specific case and cloglog might not be a good choice.

PREGIBON TEST FOR GOODNESS OF LINK H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY LINK FUNCTION T-VALUE DF P-VALUE ----------------------------------------------------------- LOGIT -1.6825 78 0.0965 PROBIT -1.7940 78 0.0767 CLOGLOG -2.3632 78 0.0206

## Modified Park Test in SAS

The severity measure in operational loss models has an empirical distribution with positive values and a long tail to the far right. To estimate regression models for severity measures with such data characteristics, we can consider several candidate distributions, such as Lognormal, Gamma, inverse Gaussian, and so on. A statistical approach is called for to choose the appropriate estimator with a correct distributional assumption. The modified Park test is designed to fill the gap.

For any GLM model, a general relationship between the variance and the mean can be described as below:

**var(y | x) = alpha * [E(y | x)] ^ lambda**

- With lambda = 0, it is suggested that the relationship between the variance and the mean is orthogonal. In this case, a Gaussian distributional assumption should be considered.
- With lambda = 1, it is suggestion that the variance is proportional to the mean. In this case, a Poisson-like distribution assumption should be considered.
- With lambda = 2, it is suggested that the variance is quadratic to the mean. In this case, a Gamma distributional assumption should be considered.
- With lambda = 3, it is suggested that the variance is cubic to the mean. In this case, an Inverse Gaussian distributional assumption should be considered.

Without the loss of generality, the aforementioned logic can be further formulated as below given E(y | x) = yhat for an arbitrary estimator. As mentioned by Manning and Mullahy (2001), a Gamma estimator can be considered a natural baseline estimator.

**var(y | x) = alpha * [E(y | x)] ^ lambda**

** –> (y – yhat) ^ 2 = alpha * [yhat] ^ lambda**

** –> log(y – yhat) ^ 2 = log(alpha) + lambda * log(yhat)**

With the above formulation, there are two ways to construct the statistical test for lambda, which is the so-called “modified Park test”.

In the OLS regression setting, the log of squared residuals from the baseline estimator can be regression on a constant and the log of predicted values from the baseline estimator, e.g. a Gamma regression.

proc reg data = data; model ln_r2 = ln_yhat; park_test: test ln_yhat = 2; run;

In the demonstrated example, we want to test the null hypothesis if the coefficient of ln_yhat is statistically different from 2, which suggests a Gamma distributional assumption.

Alternatively, in the GLM setting, the squared residuals from the baseline estimator can be regressed on a constant and the log of predicted values from the baseline estimator. In this specific GLM, the Gamma distribution and the log() link function should be employed.

proc nlmixed data = data; parms b0 = 1 b2 = 2 scale = 10; mu = exp(b0 + b1 * x); b = mu / scale; model r2 ~ gamma(scale, b); contrast 'park test' b1 - 2; run;

Similarly, if the null hypothesis that the coefficient of ln_yhat minus 2 is not statistically different from 0 cannot be rejected, then the Gamma distributional assumption is valid based on the modified Park test.