## Posts Tagged ‘**SAS**’

## Granular Monotonic Binning in SAS

In the post (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), it is shown how to do a finer monotonic binning with isotonic regression in R.

Below is a SAS macro implementing the monotonic binning with the same idea of isotonic regression. This macro is more efficient than the one shown in (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) without iterative binning and is also able to significantly increase the binning granularity.

%macro monobin(data = , y = , x = ); options mprint mlogic; data _data_ (keep = _x _y); set &data; where &y in (0, 1) and &x ~= .; _y = &y; _x = &x; run; proc transreg data = _last_ noprint; model identity(_y) = monotone(_x); output out = _tmp1 tip = _t; run; proc summary data = _last_ nway; class _t_x; output out = _data_ (drop = _freq_ _type_) mean(_y) = _rate; run; proc sort data = _last_; by _rate; run; data _tmp2; set _last_; by _rate; _idx = _n_; if _rate = 0 then _idx = _idx + 1; if _rate = 1 then _idx = _idx - 1; run; proc sql noprint; create table _tmp3 as select a.*, b._idx from _tmp1 as a inner join _tmp2 as b on a._t_x = b._t_x; create table _tmp4 as select a._idx, min(a._x) as _min_x, max(a._x) as _max_x, sum(a._y) as _bads, count(a._y) as _freq, mean(a._y) as _rate, sum(a._y) / b.bads as _bpct, sum(1 - a._y) / (b.freq - b.bads) as _gpct, log(calculated _bpct / calculated _gpct) as _woe, (calculated _bpct - calculated _gpct) * calculated _woe as _iv from _tmp3 as a, (select count(*) as freq, sum(_y) as bads from _tmp3) as b group by a._idx; quit; title "Monotonic WoE Binning for %upcase(%trim(&x))"; proc print data = _last_ label noobs; var _min_x _max_x _bads _freq _rate _woe _iv; label _min_x = "Lower" _max_x = "Upper" _bads = "#Bads" _freq = "#Freq" _rate = "BadRate" _woe = "WoE" _iv = "IV"; sum _bads _freq _iv; run; title; %mend monobin;

Below is the sample output for LTV, showing an identical binning scheme to the one generated by the R isobin() function.

## Double Poisson Regression in SAS

In the previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models), I’ve shown how to estimate the double Poisson (DP) regression in R with the gamlss package. The hurdle of estimating DP regression is the calculation of a normalizing constant in the DP density function, which can be calculated either by the sum of an infinite series or by a closed form approximation. In the example below, I will show how to estimate DP regression in SAS with the GLIMMIX procedure.

First of all, I will show how to estimate DP regression by using the exact DP density function. In this case, we will approximate the normalizing constant by computing a partial sum of the infinite series, as highlighted below.

data poi; do n = 1 to 5000; x1 = ranuni(1); x2 = ranuni(2); x3 = ranuni(3); y = ranpoi(4, exp(1 * x1 - 2 * x2 + 3 * x3)); output; end; run; proc glimmix data = poi; nloptions tech = quanew update = bfgs maxiter = 1000; model y = x1 x2 x3 / link = log solution; theta = exp(_phi_); _variance_ = _mu_ / theta; p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta; p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta); f = (theta ** 0.5) * ((exp(-_mu_)) ** theta); do i = 1 to 100; f = f + (theta ** 0.5) * ((exp(-i) * (i ** i) / fact(i)) ** (1 - theta)) * ((exp(-_mu_) * (_mu_ ** i) / fact(i)) ** theta); end; k = 1 / f; prob = k * (theta ** 0.5) * p_y * p_u; if log(prob) ~= . then _logl_ = log(prob); run;

Next, I will show the same estimation routine by using the closed form approximation.

proc glimmix data = poi; nloptions tech = quanew update = bfgs maxiter = 1000; model y = x1 x2 x3 / link = log solution; theta = exp(_phi_); _variance_ = _mu_ / theta; p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta; p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta); k = 1 / (1 + (1 - theta) / (12 * theta * _mu_) * (1 + 1 / (theta * _mu_))); prob = k * (theta ** 0.5) * p_y * p_u; if log(prob) ~= . then _logl_ = log(prob); run;

While the first approach is more accurate by closely following the DP density function, the second approach is more efficient with a significantly lower computing cost. However, both are much faster than the corresponding R function gamlss().

## SAS Macro Calculating Goodness-of-Fit Statistics for Quantile Regression

As shown by Fu and Wu in their presentation (https://www.casact.org/education/rpm/2010/handouts/CL1-Fu.pdf), the quantile regression is an appealing approach to model severity measures with high volatilities due to its statistical characteristics, including the robustness to extreme values and no distributional assumptions. Curti and Migueis also pointed out in a research paper (https://www.federalreserve.gov/econresdata/feds/2016/files/2016002r1pap.pdf) that the operational loss is more sensitive to macro-economic drivers at the tail, making the quantile regression an ideal model to capture such relationships.

While the quantile regression can be conveniently estimated in SAS with the QUANTREG procedure, the standard SAS output doesn’t provide goodness-of-fit (GoF) statistics. More importantly, it is noted that the underlying rationale of calculating GoF in a quantile regression is very different from the ones employed in OLS or GLM regressions. For instance, the most popular R-square is not applicable in the quantile regression anymore. Instead, a statistic called “R1” should be used. In addition, AIC and BIC are also defined differently in the quantile regression.

Below is a SAS macro showing how to calculate GoF statistics, including R1 and various information criterion, for a quantile regression.

%macro quant_gof(data = , y = , x = , tau = 0.5); ***********************************************************; * THE MACRO CALCULATES GOODNESS-OF-FIT STATISTICS FOR *; * QUANTILE REGRESSION *; * ------------------------------------------------------- *; * REFERENCE: *; * GOODNESS OF FIT AND RELATED INFERENCE PROCESSES FOR *; * QUANTILE REGRESSION, KOENKER AND MACHADO, 1999 *; ***********************************************************; options nodate nocenter; title; * UNRESTRICTED QUANTILE REGRESSION *; ods select ParameterEstimates ObjFunction; ods output ParameterEstimates = _est; proc quantreg data = &data ci = resampling(nrep = 500); model &y = &x / quantile = &tau nosummary nodiag seed = 1; output out = _full p = _p; run; * RESTRICTED QUANTILE REGRESSION *; ods select none; proc quantreg data = &data ci = none; model &y = / quantile = &tau nosummary nodiag; output out = _null p = _p; run; ods select all; proc sql noprint; select sum(df) into :p from _est; quit; proc iml; use _full; read all var {&y _p} into A; close _full; use _null; read all var {&y _p} into B; close _null; * DEFINE A FUNCTION CALCULATING THE SUM OF ABSOLUTE DEVIATIONS *; start loss(x); r = x[, 1] - x[, 2]; z = j(nrow(r), 1, 0); l = sum(&tau * (r <> z) + (1 - &tau) * (-r <> z)); return(l); finish; r1 = 1 - loss(A) / loss(B); adj_r1 = 1 - ((nrow(A) - 1) * loss(A)) / ((nrow(A) - &p) * loss(B)); aic = 2 * nrow(A) * log(loss(A) / nrow(A)) + 2 * &p; aicc = 2 * nrow(A) * log(loss(A) / nrow(A)) + 2 * &p * nrow(A) / (nrow(A) - &p - 1); bic = 2 * nrow(A) * log(loss(A) / nrow(A)) + &p * log(nrow(A)); l = {"R1" "ADJUSTED R1" "AIC" "AICC" "BIC"}; v = r1 // adj_r1 // aic // aicc // bic; print v[rowname = l format = 20.8 label = "Fit Statistics"]; quit; %mend quant_gof;

## Modeling Generalized Poisson Regression in SAS

The Generalized Poisson (GP) regression is a very flexible statistical model for count outcomes in that it can accommodate both over-dispersion and under-dispersion, which makes it a very practical modeling approach in real-world problems and is considered a serious contender for the Quasi-Poisson regression.

Prob(Y) = Alpha / Y! * (Alpha + Xi * Y) ^ (Y – 1) * EXP(-Alpha – Xi * Y)

E(Y) = Mu = Alpha / (1 – Xi)

Var(Y) = Mu / (1 – Xi) ^ 2

While the GP regression can be conveniently estimated with HMM procedure in SAS, I’d always like to dive a little deeper into its model specification and likelihood function to have a better understanding. For instance, there is a slight difference in GP model outcomes between HMM procedure in SAS and VGAM package in R. After looking into the detail, I then realized that the difference is solely due to the different parameterization.

Basically, there are three steps for estimating a GP regression with NLMIXED procedure. Due to the complexity of GP likelihood function and its convergence process, it is always a good practice to estimate a baseline Standard Poisson regression as a starting point and then to output its parameter estimates into a table, e.g. _EST as shown below.

ods output ParameterEstimates = _est; proc genmod data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / dist = poisson link = log; run;

After acquiring parameter estimates from a Standard Poisson regression, we can use them to construct initiate values of parameter estimates for the Generalized Poisson regression. In the code snippet below, we used SQL procedure to create 2 macro variables that we are going to use in the final model estimation of GP regression.

proc sql noprint; select "_"||compress(upcase(parameter), ' ')||" = "||compress(put(estimate, 10.2), ' ') into :_parm separated by ' ' from _est; select case when upcase(parameter) = 'INTERCEPT' then "_"||compress(upcase(parameter), ' ') else "_"||compress(upcase(parameter), ' ')||" * "||compress(upcase(parameter), ' ') end into :_xb separated by ' + ' from _est where upcase(parameter) ~= 'SCALE'; quit; /* %put &_parm; _INTERCEPT = -1.38 _AGE = 0.01 _ACADMOS = 0.00 _MINORDRG = 0.46 _OWNRENT = -0.20 _SCALE = 1.00 %put &_xb; _INTERCEPT + _AGE * AGE + _ACADMOS * ACADMOS + _MINORDRG * MINORDRG + _OWNRENT * OWNRENT */

In the last step, we used the NLMIXED procedure to estimate the GP regression by specifying its log likelihood function that would generate identical model results as with HMM procedure. It is worth mentioning that the expected mean _mu = exp(x * beta) in SAS and the term exp(x * beta) refers to the _alpha parameter in R. Therefore, the intercept would be different between SAS and R, primarily due to different ways of parameterization with the identical statistical logic.

proc nlmixed data = mylib.credit_count; parms &_parm.; _xb = &_xb.; _xi = 1 - exp(-_scale); _mu = exp(_xb); _alpha = _mu * (1 - _xi); _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg); ll = log(_prob); model majordrg ~ general(ll); run;

In addition to HMM and NLMIXED procedures, GLIMMIX can also be employed to estimate the GP regression, as shown below. In this case, we need to specify both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count; model majordrg = age acadmos minordrg ownrent / link = log solution; _xi = 1 - 1 / exp(_phi_); _variance_ = _mu_ / (1 - _xi) ** 2; _alpha = _mu_ * (1 - _xi); _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg); _logl_ = log(_prob); run;

## Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has a probability function shown as below

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In the formulation, the scale parameter **0 < a < y** and the shape parameter **k > 1 **.

The positive lower bound of Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of ABA operational risk consortium data is $10,000 and any loss event below the threshold value would be not reported, which might add the complexity in the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response ** y` = severity – threshold ** to accommodate the threshold value such that the supporting domain of y` could start from 0 and that the Gamma, Inverse Gaussian, or Lognormal regression can still be applicable. However, under the distributional assumption of Type-I Pareto with a known lower end, we do not need to shift the severity measure anymore but model it directly based on the probability function.

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound ** a = 2 ** by using the **VGAM** package.

library(VGAM) set.seed(2017) n <- 200 a <- 2 x <- runif(n) k <- exp(1 + 5 * x) pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x) fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE) summary(fit) # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) 1.0322 0.1363 7.574 3.61e-14 *** # x 4.9815 0.2463 20.229 < 2e-16 *** AIC(fit) # -644.458 BIC(fit) # -637.8614

The SAS code below estimating the Type-I Pareto regression provides almost identical model estimation.

proc nlmixed data = pdata; parms b0 = 0.1 b1 = 0.1; k = exp(b0 + b1 * x); a = 2; lh = k * (a ** k) / (y ** (k + 1)); ll = log(lh); model y ~ general(ll); run; /* Fit Statistics -2 Log Likelihood -648.5 AIC (smaller is better) -644.5 AICC (smaller is better) -644.4 BIC (smaller is better) -637.9 Parameter Estimate Standard DF t Value Pr > |t| Error b0 1.0322 0.1385 200 7.45 <.0001 b1 4.9815 0.2518 200 19.78 <.0001 */

At last, it is worth pointing out that the conditional mean of Type-I Pareto response is not equal to ** exp(x * beta) ** but ** a * k / (k – 1) ** with ** k = exp(x * beta) **. Therefore, the conditional mean only exists when ** k > 1 **, which might cause numerical issues in the model estimation.

## Pregibon Test for Goodness of Link in SAS

When estimating generalized linear models for binary outcomes, we often choose the logit link function by default and seldom consider other alternatives such as probit or cloglog. The Pregibon test (Pregibon, 1980) provides a mean to check the goodness of link with a simple logic outlined below.

1. First of all, we can estimate the regression model with the hypothesized link function, e.g. logit;

2. After the model estimation, we calculate yhat and yhat ^ 2 and then estimate a secondary regression with the identical response variable Y and link function but with yhat and yhat ^ 2 as model predictors (with the intercept).

3. If the link function is correctly specified, then the t-value of yaht ^2 should be insignificant.

The SAS macro shown below is the implementation of Pregibon test in the context of logistic regressions. However, the same idea can be generalized to any GLM.

%macro pregibon(data = , y = , x = ); ***********************************************************; * SAS MACRO PERFORMING PREGIBON TEST FOR GOODNESS OF LINK *; * ======================================================= *; * INPUT PAREMETERS: *; * DATA : INPUT SAS DATA TABLE *; * Y : THE DEPENDENT VARIABLE WITH 0 / 1 VALUES *; * X : MODEL PREDICTORS *; * ======================================================= *; * AUTHOR: WENSUI.LIU@53.COM *; ***********************************************************; options mprint mlogic nocenter; %let links = logit probit cloglog; %let loop = 1; proc sql noprint; select n(&data) - 3 into :df from &data; quit; %do %while (%scan(&links, &loop) ne %str()); %let link = %scan(&links, &loop); proc logistic data = &data noprint desc; model &y = &x / link = &link; score data = &data out = _out1; run; data _out2; set _out1(rename = (p_1 = p1)); p2 = p1 * p1; run; ods listing close; ods output ParameterEstimates = _parm; proc logistic data = _out2 desc; model &y = p1 p2 / link = &link ; run; ods listing; %if &loop = 1 %then %do; data _parm1; format link $10.; set _parm(where = (variable = "p2")); link = upcase("&link"); run; %end; %else %do; data _parm1; set _parm1 _parm(where = (variable = "p2") in = new); if new then link = upcase("&link"); run; %end; data _parm2(drop = variable); set _parm1; _t = estimate / stderr; _df = &df; _p = (1 - probt(abs(_t), _df)) * 2; run; %let loop = %eval(&loop + 1); %end; title; proc report data = _last_ spacing = 1 headline nowindows split = "*"; column(" * PREGIBON TEST FOR GOODNESS OF LINK * H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY * " link _t _df _p); define link / "LINK FUNCTION" width = 15 order order = data; define _t / "T-VALUE" width = 15 format = 12.4; define _df / "DF" width = 10; define _p / "P-VALUE" width = 15 format = 12.4; run; %mend;

After applying the macro to the kyphosis data (https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html), we can see that both logit and probit can be considered appropriate link functions in this specific case and cloglog might not be a good choice.

PREGIBON TEST FOR GOODNESS OF LINK H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY LINK FUNCTION T-VALUE DF P-VALUE ----------------------------------------------------------- LOGIT -1.6825 78 0.0965 PROBIT -1.7940 78 0.0767 CLOGLOG -2.3632 78 0.0206

## Parameter Estimation of Pareto Type II Distribution with NLMIXED in SAS

In several previous posts, I’ve shown how to estimate severity models under the various distributional assumptions, including Lognormal, Gamma, and Inverse Gaussian. However, I am not satisfied with the fact that the supporting domain of aforementioned distributions doesn’t include the value at ZERO.

Today, I had spent some time on looking into another interesting distribution, namely Pareto Type II distribution, and the possibility of estimating the regression model. The Pareto Type II distribution, which is also called Lomax distribution, is a special case of the Pareto distribution such that its supporting domain starts at ZERO (>= 0) with a long tail to the right, making it a good candidate for severity or loss distributions. This distribution can be described by 2 parameters, a scale parameter “Lambda” and a shape parameter “Alpha” such that prob(y) = Alpha / Lambda * (1 + y / Lambda) ^ (-(1 + Alpha)) with the mean E(y) = Lambda / (Alpha – 1) for Alpha > 1 and Var(y) = Lambda ^ 2 * Alpha / [(Alpha – 1) ^ 2 * (Alpha – 2)] for Alpha > 2.

With the re-parameterization, Alpha and Lambda can be further expressed in terms of E(y) = mu and Var(y) = sigma2 such that Alpha = 2 * sigma2 / (sigma2 – mu ^ 2) and Lambda = mu * ((sigma2 + mu ^ 2) / (sigma2 – mu ^ 2)). Below is an example showing how to estimate the mean and the variance by using the likelihood function of Lomax distribution with SAS / NLMIXED procedure.

data test; do i = 1 to 100; y = exp(rannor(1)); output; end; run; proc nlmixed data = test tech = trureg; parms _c_ = 0 ln_sigma2 = 1; mu = exp(_c_); sigma2 = exp(ln_sigma2); alpha = 2 * sigma2 / (sigma2 - mu ** 2); lambda = mu * ((sigma2 + mu ** 2) / (sigma2 - mu ** 2)); lh = alpha / lambda * ( 1 + y/ lambda) ** (-(alpha + 1)); ll = log(lh); model y ~ general(ll); predict mu out = pred (rename = (pred = mu)); run; proc means data = pred; var mu y; run;

With the above setting, it is very doable to estimate a regression model with the Lomax distributional assumption. However, in order to make it useful in production, I still need to find out an effective way to ensure the estimation convergence after including co-variates in the model.