# Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

## SAS Implementation of ZAGA Models

In the previous post (https://statcompute.wordpress.com/2017/09/17/model-non-negative-numeric-outcomes-with-zeros/), I gave a brief introduction to the ZAGA (Zero-Adjusted Gamma) model, which provides a very flexible approach to modeling non-negative numeric responses. Today, I will show how to implement the ZAGA model in SAS, where it can be estimated either jointly or in two steps.

In SAS, the FMM procedure provides a very convenient interface to estimate the ZAGA model in one simple step. As shown below, there are two model statements: the first estimates a Gamma sub-model for the positive outcomes, and the second, with a constant distribution, represents the point-mass at zero. The subsequent probmodel statement is then employed to estimate the probability of a record being positive.

```
data ds;
set "/folders/myfolders/autoclaim" (keep = clm_amt bluebook npolicy clm_freq5 mvr_pts income);
where income ~= .;
clm_flg = (clm_amt > 0);
run;

proc fmm data = ds tech = trureg;
model clm_amt = bluebook npolicy / dist = gamma;
model clm_amt = / dist = constant;
probmodel clm_freq5 mvr_pts income;
run;
```

An alternative way to develop a ZAGA model is in two steps: first estimate a logistic regression that separates the point-mass at zero from the positive outcomes, and then estimate a Gamma regression with the positive outcomes only, as illustrated below. The two-step approach is more intuitive and, more importantly, is easier to implement because it avoids the convergence issues that can arise in the FMM or NLMIXED procedure.

```
proc logistic data = ds desc;
model clm_flg = clm_freq5 mvr_pts income;
run;

proc genmod data = ds;
where clm_flg = 1;
model clm_amt = bluebook npolicy / link = log dist = gamma;
run;
```
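To score a record with the two-step model, the two parts are combined as E[Y] = Pr(Y > 0) * E[Y | Y > 0]. The following is a minimal sketch of that combination in Python, used here only as a calculator outside the SAS workflow. The function name and the linear-predictor values are hypothetical and are not estimates from the autoclaim data.

```python
import math

def zaga_expected_value(p_logit_xb, gamma_log_xb):
    """E[Y] = Pr(Y > 0) * E[Y | Y > 0] for a two-part (zero-adjusted Gamma) model."""
    prob_positive = 1.0 / (1.0 + math.exp(-p_logit_xb))  # inverse logit from the logistic step
    mean_positive = math.exp(gamma_log_xb)               # inverse log link from the Gamma step
    return prob_positive * mean_positive

# Hypothetical linear predictors for one record (illustration only).
expected_claim = zaga_expected_value(p_logit_xb=-1.0, gamma_log_xb=8.5)
print(round(expected_claim, 2))
```

The first argument corresponds to the linear predictor from the logistic step and the second to the linear predictor from the Gamma step with the log link.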

Written by statcompute

May 19, 2018 at 8:58 pm

Posted in SAS, Statistical Models, Statistics


## Estimating Parameters of A Hyper-Poisson Distribution in SAS

Similar to COM-Poisson, Double-Poisson, and Generalized Poisson distributions discussed in my previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models/), the Hyper-Poisson distribution is another extension of the standard Poisson and is able to accommodate both under-dispersion and over-dispersion that are common in real-world problems. Given the complexity of parameterization and computation, the Hyper-Poisson is somewhat under-investigated. To the best of my knowledge, there is no off-the-shelf computing routine in SAS for the Hyper-Poisson distribution, and there is only an R function, available at http://www4.ujaen.es/~ajsaez/hp.fit.r, written by A.J. Sáez-Castillo and A. Conde-Sánchez (2013).

The SAS code presented below is the starting point of my attempt at the Hyper-Poisson and its potential applications. The purpose is to replicate the calculation result shown in Table 6 of “On the Hyper-Poisson Distribution and its Generalization with Applications” by Bayo H. Lawal (2017) (http://www.journalrepository.org/media/journals/BJMCS_6/2017/Mar/Lawal2132017BJMCS32184.pdf). As a result, the parameterization employed in my SAS code closely follows Bayo H. Lawal (2017) instead of A.J. Sáez-Castillo and A. Conde-Sánchez (2013).

```
data d1;
input y n @@;
datalines;
0 121 1 85 2 19 3 1 4 0 5 0 6 1
;
run;

data df;
set d1;
where n > 0;
do i = 1 to n;
output;
end;
run;

proc nlmixed data = df;
parms lambda = 1 beta = 1;
theta = 1;
do k = 1 to 100;
theta = theta + gamma(beta) * (lambda ** k) / gamma(beta + k);
end;
prob = (gamma(beta) / gamma(beta + y)) * ((lambda ** y) / theta);
ll = log(prob);
model y ~ general(ll);
run;

/*
                     Standard
Parameter  Estimate     Error    DF  t Value  Pr > |t|   Alpha
lambda       0.3752    0.1178   227     3.19    0.0016    0.05
beta         0.5552    0.2266   227     2.45    0.0150    0.05
*/

```

As shown, the estimated Lambda = 0.3752 and the estimated Beta = 0.5552 are identical to what is presented in the paper. The next step is to explore applications in frequency modeling as well as its value in business cases.
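The likelihood coded in the NLMIXED step can be checked numerically outside SAS. The sketch below, written in Python purely as a sanity check, rebuilds the same truncated normalizing series and confirms two properties: the probabilities sum to one at the reported estimates, and the distribution reduces to the standard Poisson when Beta = 1. The function name and the default of 100 series terms are assumptions that mirror the loop bound in the SAS code.

```python
import math

def hyper_poisson_pmf(y, lam, beta, terms=100):
    """Hyper-Poisson probability with the normalizing series truncated at `terms`."""
    # theta = sum_{k=0}^{terms} Gamma(beta) * lam**k / Gamma(beta + k), in log space
    theta = sum(
        math.exp(math.lgamma(beta) - math.lgamma(beta + k) + k * math.log(lam))
        for k in range(terms + 1)
    )
    log_num = math.lgamma(beta) - math.lgamma(beta + y) + y * math.log(lam)
    return math.exp(log_num) / theta

# Estimates reported in the NLMIXED output above.
lam_hat, beta_hat = 0.3752, 0.5552
total = sum(hyper_poisson_pmf(y, lam_hat, beta_hat) for y in range(101))
print(total)  # should be very close to 1

# With beta = 1 the distribution collapses to the standard Poisson.
poisson_p2 = math.exp(-1.5) * 1.5 ** 2 / math.factorial(2)
print(abs(hyper_poisson_pmf(2, 1.5, 1.0) - poisson_p2))
```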

Written by statcompute

February 4, 2018 at 3:22 pm


## Granular Monotonic Binning in SAS

In an earlier post (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), I showed how to do a finer monotonic binning with isotonic regression in R.

Below is a SAS macro implementing the monotonic binning with the same idea of isotonic regression. Because it does not rely on iterative binning, this macro is more efficient than the one shown in (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development), and it is also able to significantly increase the binning granularity.

```
%macro monobin(data = , y = , x = );
options mprint mlogic;

data _data_ (keep = _x _y);
set &data;
where &y in (0, 1) and &x ~= .;
_y = &y;
_x = &x;
run;

proc transreg data = _last_ noprint;
model identity(_y) = monotone(_x);
output out = _tmp1 tip = _t;
run;

proc summary data = _last_ nway;
class _t_x;
output out = _data_ (drop = _freq_ _type_) mean(_y) = _rate;
run;

proc sort data = _last_;
by _rate;
run;

data _tmp2;
set _last_;
by _rate;
_idx = _n_;
if _rate = 0 then _idx = _idx + 1;
if _rate = 1 then _idx = _idx - 1;
run;

proc sql noprint;
create table
_tmp3 as
select
a.*,
b._idx
from
_tmp1 as a inner join _tmp2 as b
on
a._t_x = b._t_x;

create table
_tmp4 as
select
a._idx,
min(a._x)                                               as _min_x,
max(a._x)                                               as _max_x,
sum(a._y)                                               as _bads,
count(a._y)                                             as _freq,
mean(a._y)                                              as _rate,
sum(a._y) / b.bads                                      as _bpct,
sum(1 - a._y) / (b.freq - b.bads)                       as _gpct,
log(calculated _bpct / calculated _gpct)                as _woe,
(calculated _bpct - calculated _gpct) * calculated _woe as _iv
from
_tmp3 as a, (select count(*) as freq, sum(_y) as bads from _tmp3) as b
group by
a._idx;
quit;

title "Monotonic WoE Binning for %upcase(%trim(&x))";
proc print data = _last_ label noobs;
var _min_x _max_x _bads _freq _rate _woe _iv;
label
_min_x = "Lower"
_max_x = "Upper"
_freq  = "#Freq"
_woe   = "WoE"
_iv    = "IV";
sum _bads _freq _iv;
run;
title;

%mend monobin;
```
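The WoE and IV columns produced by the final SQL step follow directly from the share of bads and the share of goods in each bin. As a rough illustration of that arithmetic only, and not of the isotonic binning itself, here is a small Python sketch. The function name and the bin counts are hypothetical.

```python
import math

def woe_iv(bins):
    """bins: list of (bads, goods) counts per bin; returns per-bin (woe, iv) tuples."""
    total_bads = sum(b for b, _ in bins)
    total_goods = sum(g for _, g in bins)
    out = []
    for bads, goods in bins:
        bpct = bads / total_bads      # share of all bads falling in this bin
        gpct = goods / total_goods    # share of all goods falling in this bin
        woe = math.log(bpct / gpct)
        out.append((woe, (bpct - gpct) * woe))
    return out

# Hypothetical counts for three bins (illustration only).
result = woe_iv([(10, 90), (20, 80), (40, 60)])
for woe, iv in result:
    print(round(woe, 4), round(iv, 4))
```

A bin with no bads or no goods would make the logarithm undefined, which is why the macro shifts the index of bins with a rate of 0 or 1 so that they merge with a neighbor.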

Below is the sample output for LTV, showing an identical binning scheme to the one generated by the R isobin() function.

Written by statcompute

September 24, 2017 at 11:00 pm

## Double Poisson Regression in SAS

In the previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models), I’ve shown how to estimate the double Poisson (DP) regression in R with the gamlss package. The hurdle of estimating DP regression is the calculation of a normalizing constant in the DP density function, which can be calculated either by the sum of an infinite series or by a closed form approximation. In the example below, I will show how to estimate DP regression in SAS with the GLIMMIX procedure.

First of all, I will show how to estimate DP regression by using the exact DP density function. In this case, we approximate the normalizing constant by computing a partial sum of the infinite series, which is what the DO loop accumulating f over i = 1 to 100 does in the code below.

```
data poi;
do n = 1 to 5000;
x1 = ranuni(1);
x2 = ranuni(2);
x3 = ranuni(3);
y = ranpoi(4, exp(1 * x1 - 2 * x2 + 3 * x3));
output;
end;
run;

proc glimmix data = poi;
nloptions tech = quanew update = bfgs maxiter = 1000;
model y = x1 x2 x3 / link = log solution;
theta = exp(_phi_);
_variance_ = _mu_ / theta;
p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta;
p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta);
f = (theta ** 0.5) * ((exp(-_mu_)) ** theta);
do i = 1 to 100;
f = f + (theta ** 0.5) * ((exp(-i) * (i ** i) / fact(i)) ** (1 - theta)) * ((exp(-_mu_) * (_mu_ ** i) / fact(i)) ** theta);
end;
k = 1 / f;
prob = k * (theta ** 0.5) * p_y * p_u;
if log(prob) ~= . then _logl_ = log(prob);
run;
```

Next, I will show the same estimation routine by using the closed form approximation.

```
proc glimmix data = poi;
nloptions tech = quanew update = bfgs maxiter = 1000;
model y = x1 x2 x3 / link = log solution;
theta = exp(_phi_);
_variance_ = _mu_ / theta;
p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta;
p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta);
k = 1 / (1 + (1 - theta) / (12 * theta * _mu_) * (1 + 1 / (theta * _mu_)));
prob = k * (theta ** 0.5) * p_y * p_u;
if log(prob) ~= . then _logl_ = log(prob);
run;
```

While the first approach is more accurate by closely following the DP density function, the second approach is more efficient with a significantly lower computing cost. However, both are much faster than the corresponding R function gamlss().
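To get a feel for how close the two treatments of the normalizing constant are, the quantity 1 / k can be computed both ways for a single pair of values. The Python sketch below is only a numerical illustration of the two formulas used in the GLIMMIX code. The function names and the values of mu and theta are hypothetical.

```python
import math

def log_poisson_pmf(y, mu):
    """log of exp(-mu) * mu**y / y!, with the convention 0 * log(0) = 0."""
    y_log_mu = y * math.log(mu) if y > 0 else 0.0
    return -mu + y_log_mu - math.lgamma(y + 1)

def dp_inverse_constant_exact(mu, theta, terms=100):
    """Partial sum of the unnormalized double Poisson density over y = 0..terms."""
    total = 0.0
    for y in range(terms + 1):
        log_p_u = theta * log_poisson_pmf(y, mu)        # Poisson(mu) part
        log_p_y = (1 - theta) * log_poisson_pmf(y, y)   # Poisson(y) part
        total += math.sqrt(theta) * math.exp(log_p_u + log_p_y)
    return total

def dp_inverse_constant_approx(mu, theta):
    """Closed-form approximation to the same quantity."""
    return 1 + (1 - theta) / (12 * theta * mu) * (1 + 1 / (theta * mu))

mu, theta = 10.0, 0.8  # hypothetical values for illustration
exact = dp_inverse_constant_exact(mu, theta)
approx = dp_inverse_constant_approx(mu, theta)
print(exact, approx, abs(exact - approx))
```

When theta = 1, both functions return 1, which corresponds to the standard Poisson case with no adjustment.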

Written by statcompute

April 20, 2017 at 12:48 am

## SAS Macro Calculating Goodness-of-Fit Statistics for Quantile Regression

As shown by Fu and Wu in their presentation (https://www.casact.org/education/rpm/2010/handouts/CL1-Fu.pdf), the quantile regression is an appealing approach to model severity measures with high volatilities due to its statistical characteristics, including the robustness to extreme values and no distributional assumptions. Curti and Migueis also pointed out in a research paper (https://www.federalreserve.gov/econresdata/feds/2016/files/2016002r1pap.pdf) that the operational loss is more sensitive to macro-economic drivers at the tail, making the quantile regression an ideal model to capture such relationships.

While the quantile regression can be conveniently estimated in SAS with the QUANTREG procedure, the standard SAS output doesn’t provide goodness-of-fit (GoF) statistics. More importantly, it is noted that the underlying rationale of calculating GoF in a quantile regression is very different from the ones employed in OLS or GLM regressions. For instance, the most popular R-square is not applicable in the quantile regression anymore. Instead, a statistic called “R1” should be used. In addition, AIC and BIC are also defined differently in the quantile regression.

Below is a SAS macro showing how to calculate GoF statistics, including R1 and various information criteria, for a quantile regression.

```
%macro quant_gof(data = , y = , x = , tau = 0.5);
***********************************************************;
* THE MACRO CALCULATES GOODNESS-OF-FIT STATISTICS FOR     *;
* QUANTILE REGRESSION                                     *;
* ------------------------------------------------------- *;
* REFERENCE:                                              *;
*  GOODNESS OF FIT AND RELATED INFERENCE PROCESSES FOR    *;
*  QUANTILE REGRESSION, KOENKER AND MACHADO, 1999         *;
***********************************************************;

options nodate nocenter;
title;

* UNRESTRICTED QUANTILE REGRESSION *;
ods select ParameterEstimates ObjFunction;
ods output ParameterEstimates = _est;
proc quantreg data = &data ci = resampling(nrep = 500);
model &y = &x / quantile = &tau nosummary nodiag seed = 1;
output out = _full p = _p;
run;

* RESTRICTED QUANTILE REGRESSION *;
ods select none;
proc quantreg data = &data ci = none;
model &y = / quantile = &tau nosummary nodiag;
output out = _null p = _p;
run;
ods select all;

proc sql noprint;
select sum(df) into :p from _est;
quit;

proc iml;
use _full;
read all var {&y _p} into A;
close _full;

use _null;
read all var {&y _p} into B;
close _null;

* DEFINE A FUNCTION CALCULATING THE SUM OF ABSOLUTE DEVIATIONS *;
start loss(x);
r = x[, 1] - x[, 2];
z = j(nrow(r), 1, 0);
l = sum(&tau * (r <> z) + (1 - &tau) * (-r <> z));
return(l);
finish;

r1 = 1 - loss(A) / loss(B);
adj_r1 = 1 - ((nrow(A) - 1) * loss(A)) / ((nrow(A) - &p) * loss(B));
aic = 2 * nrow(A) * log(loss(A) / nrow(A)) + 2 * &p;
aicc = 2 * nrow(A) * log(loss(A) / nrow(A)) + 2 * &p * nrow(A) / (nrow(A) - &p - 1);
bic = 2 * nrow(A) * log(loss(A) / nrow(A)) + &p * log(nrow(A));

l = {"R1" "ADJUSTED R1" "AIC" "AICC" "BIC"};
v = r1 // adj_r1 // aic // aicc // bic;
print v[rowname = l format = 20.8 label = "Fit Statistics"];
quit;

%mend quant_gof;
```
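The loss function defined in the IML step is the usual quantile "check" loss, and R1 compares that loss between the full model and the intercept-only model. The arithmetic is sketched below in Python on a tiny made-up example. The data, the predictions, and the function names are all hypothetical, and the predictions are taken as given rather than estimated.

```python
import math

def check_loss(y, pred, tau):
    """Sum of asymmetrically weighted absolute residuals (the quantile check loss)."""
    total = 0.0
    for obs, fit in zip(y, pred):
        r = obs - fit
        total += tau * max(r, 0.0) + (1 - tau) * max(-r, 0.0)
    return total

def r1(y, pred_full, pred_null, tau):
    """R1 = 1 - loss(full model) / loss(intercept-only model)."""
    return 1 - check_loss(y, pred_full, tau) / check_loss(y, pred_null, tau)

def quantile_aic(y, pred, tau, p):
    """AIC as defined in the macro: 2 * n * log(loss / n) + 2 * p."""
    n = len(y)
    return 2 * n * math.log(check_loss(y, pred, tau) / n) + 2 * p

# Made-up data: pred_null is the sample median, pred_full is an arbitrary better fit.
y = [1.0, 2.0, 3.0, 4.0, 10.0]
pred_full = [1.5, 2.0, 3.0, 4.0, 8.0]
pred_null = [3.0] * 5
print(r1(y, pred_full, pred_null, tau=0.5))
print(quantile_aic(y, pred_full, tau=0.5, p=2))
```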

Written by statcompute

April 15, 2017 at 8:24 pm

## Modeling Generalized Poisson Regression in SAS

The Generalized Poisson (GP) regression is a very flexible statistical model for count outcomes in that it can accommodate both over-dispersion and under-dispersion. This makes it a very practical modeling approach in real-world problems and a serious contender to the Quasi-Poisson regression.

$$\Pr(Y = y) = \frac{\alpha}{y!} \, (\alpha + \xi y)^{y - 1} \exp(-\alpha - \xi y)$$

$$E(Y) = \mu = \frac{\alpha}{1 - \xi}$$

$$\mathrm{Var}(Y) = \frac{\mu}{(1 - \xi)^2}$$

While the GP regression can be conveniently estimated with the FMM procedure in SAS, I always like to dive a little deeper into its model specification and likelihood function to gain a better understanding. For instance, there is a slight difference in GP model outcomes between the FMM procedure in SAS and the VGAM package in R. After looking into the details, I realized that the difference is solely due to the different parameterization.

Basically, there are three steps for estimating a GP regression with NLMIXED procedure. Due to the complexity of GP likelihood function and its convergence process, it is always a good practice to estimate a baseline Standard Poisson regression as a starting point and then to output its parameter estimates into a table, e.g. _EST as shown below.

```
ods output ParameterEstimates = _est;
proc genmod data = mylib.credit_count;
model majordrg = age acadmos minordrg ownrent / dist = poisson link = log;
run;
```

After acquiring parameter estimates from a Standard Poisson regression, we can use them to construct initial values of parameter estimates for the Generalized Poisson regression. In the code snippet below, we use the SQL procedure to create two macro variables that are used in the final model estimation of the GP regression.

```
proc sql noprint;
select
"_"||compress(upcase(parameter), ' ')||" = "||compress(put(estimate, 10.2), ' ')
into
:_parm separated by ' '
from
_est;

select
case
when upcase(parameter) = 'INTERCEPT' then "_"||compress(upcase(parameter), ' ')
else "_"||compress(upcase(parameter), ' ')||" * "||compress(upcase(parameter), ' ')
end
into
:_xb separated by ' + '
from
_est
where
upcase(parameter) ~= 'SCALE';
quit;

/*
%put &_parm;
_INTERCEPT = -1.38 _AGE = 0.01 _ACADMOS = 0.00 _MINORDRG = 0.46 _OWNRENT = -0.20 _SCALE = 1.00

%put &_xb;
_INTERCEPT + _AGE * AGE + _ACADMOS * ACADMOS + _MINORDRG * MINORDRG + _OWNRENT * OWNRENT
*/
```

In the last step, we used the NLMIXED procedure to estimate the GP regression by specifying its log likelihood function, which generates model results identical to those from the FMM procedure. It is worth mentioning that the expected mean _mu = exp(x * beta) in SAS, whereas the term exp(x * beta) refers to the _alpha parameter in R. Therefore, the intercept would differ between SAS and R, purely because of the different parameterization of the same statistical logic.

```
proc nlmixed data = mylib.credit_count;
parms &_parm.;
_xb = &_xb.;
_xi = 1 - exp(-_scale);
_mu = exp(_xb);
_alpha = _mu * (1 - _xi);
_prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg);
ll = log(_prob);
model majordrg ~ general(ll);
run;
```

In addition to the FMM and NLMIXED procedures, GLIMMIX can also be employed to estimate the GP regression, as shown below. In this case, we need to specify both the log likelihood function and the variance function in terms of the expected mean.

```
proc glimmix data = mylib.credit_count;
model majordrg = age acadmos minordrg ownrent / link = log solution;
_xi = 1 - 1 / exp(_phi_);
_variance_ = _mu_ / (1 - _xi) ** 2;
_alpha = _mu_ * (1 - _xi);
_prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg);
_logl_ = log(_prob);
run;
```
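The density and the two moments stated at the top of this post can be verified numerically under the mean parameterization used in the NLMIXED and GLIMMIX code, where _alpha = _mu * (1 - _xi). The Python sketch below sums the probabilities over a long range of counts and compares the resulting mean and variance with the stated formulas. The function name and the values of mu and xi are hypothetical, and the check assumes 0 <= xi < 1.

```python
import math

def gp_pmf(y, mu, xi):
    """Generalized Poisson probability in the mean parameterization, for 0 <= xi < 1."""
    alpha = mu * (1 - xi)
    log_p = (math.log(alpha) + (y - 1) * math.log(alpha + xi * y)
             - alpha - xi * y - math.lgamma(y + 1))
    return math.exp(log_p)

mu, xi = 2.0, 0.3  # hypothetical values for illustration
probs = [gp_pmf(y, mu, xi) for y in range(300)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(total, mean, var, mu / (1 - xi) ** 2)
```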

Written by statcompute

March 11, 2017 at 3:01 pm


## Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has the probability density function shown below

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In the formulation, the scale parameter satisfies 0 < a <= y and the shape parameter satisfies k > 0, with the mean being finite only when k > 1.

The positive lower bound of the Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of ABA operational risk consortium data is \$10,000, and any loss event below the threshold would not be reported, which might complicate the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response y` = severity - threshold to accommodate the threshold value, such that the support of y` starts from 0 and the Gamma, Inverse Gaussian, or Lognormal regression is still applicable. However, under the distributional assumption of Type-I Pareto with a known lower end, we no longer need to shift the severity measure and can model it directly based on the probability density function.

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound a = 2 by using the VGAM package.

```
library(VGAM)
set.seed(2017)
n <- 200
a <- 2
x <- runif(n)
k <- exp(1 + 5 * x)
pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x)
fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE)
summary(fit)
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   1.0322     0.1363   7.574 3.61e-14 ***
# x             4.9815     0.2463  20.229  < 2e-16 ***
AIC(fit)
#  -644.458
BIC(fit)
#  -637.8614
```

The SAS code below, which estimates the Type-I Pareto regression, produces almost identical estimates.

```
proc nlmixed data = pdata;
parms b0 = 0.1 b1 = 0.1;
k = exp(b0 + b1 * x);
a = 2;
lh = k * (a ** k) / (y ** (k + 1));
ll = log(lh);
model y ~ general(ll);
run;
/*
Fit Statistics
-2 Log Likelihood               -648.5
AIC (smaller is better)         -644.5
AICC (smaller is better)        -644.4
BIC (smaller is better)         -637.9

                     Standard
Parameter Estimate   Error      DF    t Value   Pr > |t|
b0        1.0322     0.1385     200    7.45     <.0001
b1        4.9815     0.2518     200   19.78     <.0001
*/
```

Finally, it is worth pointing out that the conditional mean of the Type-I Pareto response is not equal to exp(x * beta) but to a * k / (k - 1) with k = exp(x * beta). Therefore, the conditional mean only exists when k > 1, which might cause numerical issues in the model estimation.
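As a small illustration of that last point, the conditional mean implied by the fitted coefficients can be evaluated at a few values of x. The Python sketch below is a calculator only. The function names are hypothetical, the coefficients are the estimates reported above, and returning infinity when k <= 1 is a convention chosen here to flag the case in which the mean does not exist.

```python
import math

def pareto_loglik(y, a, k):
    """Log density of the Type-I Pareto: log(k) + k * log(a) - (k + 1) * log(y), for y >= a."""
    return math.log(k) + k * math.log(a) - (k + 1) * math.log(y)

def pareto_conditional_mean(x, b0, b1, a):
    """Mean a * k / (k - 1) with k = exp(b0 + b1 * x); infinite when k <= 1."""
    k = math.exp(b0 + b1 * x)
    return a * k / (k - 1) if k > 1 else math.inf

# Estimates reported in the output above, with the known lower bound a = 2.
b0, b1, a = 1.0322, 4.9815, 2.0
for x in (0.0, 0.5, 1.0):
    print(x, pareto_conditional_mean(x, b0, b1, a))
```

As x increases, k grows quickly and the conditional mean approaches the lower bound a from above.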

Written by statcompute

December 11, 2016 at 5:12 pm
