Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for the ‘SAS’ Category

Monotonic WoE Binning for LGD Models

While the monotonic binning algorithm has been widely used in scorecard and PD (Probability of Default) model developments, a similar idea can be generalized to LGD (Loss Given Default) models. In the post below, two SAS macros performing monotonic binning for LGD are demonstrated.

The first one tends to generate relatively coarse bins based on iterative grouping and therefore requires a longer computing time.


%macro lgd_bin1(data = , y = , x = );

%let maxbin = 20;

data _tmp1 (keep = x y);
  set &data;
  y = min(1, max(0, &y));
  x = &x;
run;

proc sql noprint;
  select
    count(distinct x) into :xflg
  from
    _last_;
quit;

%let nbin = %sysfunc(min(&maxbin, &xflg));

%if &nbin > 2 %then %do;
  %do j = &nbin %to 2 %by -1;
    proc rank data = _tmp1 groups = &j out = _data_ (keep = x rank y);
      var x;
      ranks rank;
    run;

    proc summary data = _last_ nway;
      class rank;
      output out = _tmp2 (drop = _type_ rename = (_freq_ = freq))
      sum(y) = bads  mean(y) = bad_rate 
      min(x) = minx  max(x)  = maxx;
    run;

    proc sql noprint;
      select
        case when min(bad_rate) > 0 then 1 else 0 end into :minflg
      from
        _tmp2;
 
      select
        case when max(bad_rate) < 1 then 1 else 0 end into :maxflg
      from
        _tmp2;              
    quit;

    %if &minflg = 1 & &maxflg = 1 %then %do;
      proc corr data = _tmp2 spearman noprint outs = _corr;
        var minx;
        with bad_rate;
      run;
      
      proc sql noprint;
        select
          case when abs(minx) = 1 then 1 else 0 end into :cor
        from
          _corr
        where
          _type_ = 'CORR';
      quit;
 
      %if &cor = 1 %then %goto loopout;
    %end;
  %end;
%end;

%loopout:

proc sql noprint;
create table
  _tmp3 as
select
  a.rank + 1                                           as bin,
  a.minx                                               as minx,
  a.maxx                                               as maxx,
  a.freq                                               as freq,
  a.freq / b.freq                                      as dist,
  a.bad_rate                                           as avg_lgd,
  a.bads / b.bads                                      as bpct,
  (a.freq - a.bads) / (b.freq - b.bads)                as gpct,
  log(calculated bpct / calculated gpct)               as woe,
  (calculated bpct - calculated gpct) * calculated woe as iv 
from
  _tmp2 as a, (select sum(freq) as freq, sum(bads) as bads from _tmp2) as b;
quit;

proc print data = _last_ noobs label;
  var minx maxx freq dist avg_lgd woe;
  format freq comma8. dist percent10.2;
  label
    minx    = "Lower Limit"
    maxx    = "Upper Limit"
    freq    = "Freq"
    dist    = "Dist"
    avg_lgd = "Average LGD"
    woe     = "WoE";
  sum freq dist;
run; 

%mend lgd_bin1;

The second one can generate much finer bins based on the idea of isotonic regression and is computationally more efficient.


%macro lgd_bin2(data = , y = , x = );

data _data_ (keep = x y);
  set &data;
  y = min(1, max(0, &y));
  x = &x;
run;

* fit a monotone (isotonic) transformation of x against y *;
proc transreg data = _last_ noprint;
  model identity(y) = monotone(x);
  output out = _tmp1 tip = _t;
run;
 
proc summary data = _last_ nway;
  class _tx;
  output out = _data_ (drop = _freq_ _type_) mean(y) = lgd;
run;

proc sort data = _last_;
  by lgd;
run;
 
* merge bins with an average lgd of 0 or 1 into a neighboring bin to keep the woe finite *;
data _tmp2;
  set _last_;
  by lgd;
  _idx = _n_;
  if lgd = 0 then _idx = _idx + 1;
  if lgd = 1 then _idx = _idx - 1;
run;

proc sql noprint;
create table
  _tmp3 as
select
  a.*,
  b._idx
from
  _tmp1 as a inner join _tmp2 as b
on
  a._tx = b._tx;

create table
  _tmp4 as
select
  min(a.x)                                             as minx,
  max(a.x)                                             as maxx,
  sum(a.y)                                             as bads,
  count(a.y)                                           as freq,
  count(a.y) / b.freq                                  as dist,
  mean(a.y)                                            as avg_lgd,
  sum(a.y) / b.bads                                    as bpct,
  sum(1 - a.y) / (b.freq - b.bads)                     as gpct,
  log(calculated bpct / calculated gpct)               as woe,
  (calculated bpct - calculated gpct) * calculated woe as iv
from
  _tmp3 as a, (select count(*) as freq, sum(y) as bads from _tmp3) as b
group by
  a._idx;
quit;

proc print data = _last_ noobs label;
  var minx maxx freq dist avg_lgd woe; 
  format freq comma8. dist percent10.2;
  label
    minx    = "Lower Limit"
    maxx    = "Upper Limit"
    freq    = "Freq"
    dist    = "Dist"
    avg_lgd = "Average LGD"
    woe     = "WoE";
  sum freq dist;
run; 

%mend lgd_bin2;
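As a hypothetical usage sketch (the dataset and variable names below are placeholders, not from the original post), both macros can be invoked on the same variable once the data have been imported, e.g. into a SAS dataset named work.lgd_data with the realized loss rate in a column named lgd and the predictor of interest in a column named ltv:

%lgd_bin1(data = work.lgd_data, y = lgd, x = ltv);
%lgd_bin2(data = work.lgd_data, y = lgd, x = ltv);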

Below is the output comparison between the two macros with the testing data downloaded from http://www.creditriskanalytics.net/datasets-private.html. Should you have any feedback, please feel free to leave me a message.

[Figure: output comparison between lgd_bin1 and lgd_bin2]


Written by statcompute

September 30, 2017 at 5:50 pm

Posted in CCAR, Credit Risk, SAS, Statistical Models, Statistics


Granular Monotonic Binning in SAS

In a previous post (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), it was shown how to do finer monotonic binning with isotonic regression in R.

Below is a SAS macro implementing monotonic binning based on the same idea of isotonic regression. Because it does not rely on iterative binning, this macro is more efficient than the one shown in an earlier post (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development) and is also able to significantly increase the binning granularity.

%macro monobin(data = , y = , x = );
options mprint mlogic;

data _data_ (keep = _x _y);
  set &data;
  where &y in (0, 1) and &x ~= .;
  _y = &y;
  _x = &x;
run;

* fit a monotone (isotonic) transformation of _x against the binary outcome _y *;
proc transreg data = _last_ noprint;
  model identity(_y) = monotone(_x);
  output out = _tmp1 tip = _t;
run;

proc summary data = _last_ nway;
  class _t_x;
  output out = _data_ (drop = _freq_ _type_) mean(_y) = _rate;
run;

proc sort data = _last_;
  by _rate;
run;

* merge bins with a bad rate of 0 or 1 into a neighboring bin to avoid infinite woe values *;
data _tmp2;
  set _last_;
  by _rate;
  _idx = _n_;
  if _rate = 0 then _idx = _idx + 1;
  if _rate = 1 then _idx = _idx - 1;
run;
  
proc sql noprint;
create table
  _tmp3 as
select
  a.*,
  b._idx
from
  _tmp1 as a inner join _tmp2 as b
on
  a._t_x = b._t_x;
  
create table
  _tmp4 as
select
  a._idx,
  min(a._x)                                               as _min_x,
  max(a._x)                                               as _max_x,
  sum(a._y)                                               as _bads,
  count(a._y)                                             as _freq,
  mean(a._y)                                              as _rate,
  sum(a._y) / b.bads                                      as _bpct,
  sum(1 - a._y) / (b.freq - b.bads)                       as _gpct,
  log(calculated _bpct / calculated _gpct)                as _woe,
  (calculated _bpct - calculated _gpct) * calculated _woe as _iv
from 
  _tmp3 as a, (select count(*) as freq, sum(_y) as bads from _tmp3) as b
group by
  a._idx;
quit;

title "Monotonic WoE Binning for %upcase(%trim(&x))";
proc print data = _last_ label noobs;
  var _min_x _max_x _bads _freq _rate _woe _iv;
  label
    _min_x = "Lower"
    _max_x = "Upper"
    _bads  = "#Bads"
    _freq  = "#Freq"
    _rate  = "BadRate"
    _woe   = "WoE"
    _iv    = "IV";
  sum _bads _freq _iv;
run;
title;

%mend monobin;
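As a hypothetical usage sketch (the dataset and variable names below are placeholders), the macro would be invoked as:

%monobin(data = work.accepts, y = bad, x = ltv);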

Below is the sample output for LTV, showing a binning scheme identical to the one generated by the R isobin() function.

[Figure: sample binning output for LTV]

Written by statcompute

September 24, 2017 at 11:00 pm

Double Poisson Regression in SAS

In the previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models), I’ve shown how to estimate the double Poisson (DP) regression in R with the gamlss package. The hurdle in estimating the DP regression is the calculation of the normalizing constant in the DP density function, which can be obtained either as the sum of an infinite series or by a closed-form approximation. In the example below, I will show how to estimate the DP regression in SAS with the GLIMMIX procedure.

First of all, I will show how to estimate DP regression by using the exact DP density function. In this case, we will approximate the normalizing constant by computing a partial sum of the infinite series, as highlighted below.
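For reference, the DP density implied by the GLIMMIX programming statements below can be written as follows, where theta corresponds to exp(_phi_) in the code and c(mu, theta) is the normalizing constant (the quantity k in the code):

f(y \mid \mu, \theta) \;=\; c(\mu, \theta)\,\theta^{1/2} \left(\frac{e^{-\mu}\mu^{y}}{y!}\right)^{\theta} \left(\frac{e^{-y}y^{y}}{y!}\right)^{1-\theta},
\qquad
\frac{1}{c(\mu, \theta)} \;=\; \sum_{y=0}^{\infty} \theta^{1/2} \left(\frac{e^{-\mu}\mu^{y}}{y!}\right)^{\theta} \left(\frac{e^{-y}y^{y}}{y!}\right)^{1-\theta}
\;\approx\; 1 + \frac{1-\theta}{12\,\theta\mu}\left(1 + \frac{1}{\theta\mu}\right).

The first program below truncates the infinite series at 100 terms, while the second program uses the closed-form approximation on the right-hand side.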

data poi;
  do n = 1 to 5000;
    x1 = ranuni(1);
    x2 = ranuni(2);
    x3 = ranuni(3);
    y = ranpoi(4, exp(1 * x1 - 2 * x2 + 3 * x3));
    output;
  end;
run;

proc glimmix data = poi;
  nloptions tech = quanew update = bfgs maxiter = 1000;
  model y = x1 x2 x3 / link = log solution;
  theta = exp(_phi_);
  _variance_ = _mu_ / theta;
  p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta;
  p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta);
  f = (theta ** 0.5) * ((exp(-_mu_)) ** theta);  
  do i = 1 to 100;
    f = f + (theta ** 0.5) * ((exp(-i) * (i ** i) / fact(i)) ** (1 - theta)) * ((exp(-_mu_) * (_mu_ ** i) / fact(i)) ** theta);
  end;
  k = 1 / f;
  prob = k * (theta ** 0.5) * p_y * p_u;
  if log(prob) ~= . then _logl_ = log(prob);
run;

Next, I will show the same estimation routine by using the closed form approximation.

proc glimmix data = poi;
  nloptions tech = quanew update = bfgs maxiter = 1000;
  model y = x1 x2 x3 / link = log solution;
  theta = exp(_phi_);
  _variance_ = _mu_ / theta;
  p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta;
  p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta);
  k = 1 / (1 + (1 - theta) / (12 * theta * _mu_) * (1 + 1 / (theta * _mu_)));
  prob = k * (theta ** 0.5) * p_y * p_u;
  if log(prob) ~= . then _logl_ = log(prob);
run;

While the first approach is more accurate in that it follows the DP density function more closely, the second approach is more efficient with a significantly lower computing cost. However, both are much faster than the corresponding R function gamlss().

Written by statcompute

April 20, 2017 at 12:48 am

SAS Macro Calculating Goodness-of-Fit Statistics for Quantile Regression

As shown by Fu and Wu in their presentation (https://www.casact.org/education/rpm/2010/handouts/CL1-Fu.pdf), the quantile regression is an appealing approach to modeling severity measures with high volatility due to its statistical characteristics, including robustness to extreme values and the absence of distributional assumptions. Curti and Migueis also pointed out in a research paper (https://www.federalreserve.gov/econresdata/feds/2016/files/2016002r1pap.pdf) that the operational loss is more sensitive to macro-economic drivers at the tail, making the quantile regression an ideal model to capture such relationships.

While the quantile regression can be conveniently estimated in SAS with the QUANTREG procedure, the standard SAS output doesn’t provide goodness-of-fit (GoF) statistics. More importantly, the underlying rationale for calculating GoF in a quantile regression is very different from the one employed in OLS or GLM regressions. For instance, the popular R-square is no longer applicable in the quantile regression; instead, a statistic called “R1” should be used. In addition, AIC and BIC are also defined differently in the quantile regression.
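As a sketch of the underlying rationale (following Koenker and Machado, 1999, and consistent with the IML code in the macro below), let rho_tau denote the check loss and V(tau) the minimized sum of check losses; R1 then compares the unrestricted fit against the intercept-only fit:

\rho_{\tau}(r) \;=\; \tau\,\max(r, 0) + (1-\tau)\,\max(-r, 0),
\qquad
\hat{V}(\tau) \;=\; \sum_{i} \rho_{\tau}\!\left(y_i - \hat{y}_i(\tau)\right),
\qquad
R^{1}(\tau) \;=\; 1 - \frac{\hat{V}_{\mathrm{full}}(\tau)}{\hat{V}_{\mathrm{null}}(\tau)}.

The information criteria in the macro are then formed by replacing the usual -2 log-likelihood term with 2n log(V_full(tau) / n).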

Below is a SAS macro showing how to calculate GoF statistics, including R1 and various information criteria, for a quantile regression.

%macro quant_gof(data = , y = , x = , tau = 0.5);
***********************************************************;
* THE MACRO CALCULATES GOODNESS-OF-FIT STATISTICS FOR     *;
* QUANTILE REGRESSION                                     *;
* ------------------------------------------------------- *;
* REFERENCE:                                              *;
*  GOODNESS OF FIT AND RELATED INFERENCE PROCESSES FOR    *;
*  QUANTILE REGRESSION, KOENKER AND MACHADO, 1999         *;
***********************************************************;

options nodate nocenter;
title;

* UNRESTRICTED QUANTILE REGRESSION *;
ods select ParameterEstimates ObjFunction;
ods output ParameterEstimates = _est;
proc quantreg data = &data ci = resampling(nrep = 500);
  model &y = &x / quantile = &tau nosummary nodiag seed = 1;
  output out = _full p = _p;
run;

* RESTRICTED QUANTILE REGRESSION *;
ods select none;
proc quantreg data = &data ci = none;
  model &y = / quantile = &tau nosummary nodiag;
  output out = _null p = _p;
run;
ods select all; 

proc sql noprint;
  select sum(df) into :p from _est;
quit;

proc iml;
  use _full;
  read all var {&y _p} into A;
  close _full;

  use _null;
  read all var {&y _p} into B;
  close _null;

  * DEFINE A FUNCTION CALCULATING THE SUM OF ABSOLUTE DEVIATIONS *;
  start loss(x);
    r = x[, 1] - x[, 2];
    z = j(nrow(r), 1, 0);
    l = sum(&tau * (r <> z) + (1 - &tau) * (-r <> z));
    return(l);
  finish;
  
  r1 = 1 - loss(A) / loss(B);
  adj_r1 = 1 - ((nrow(A) - 1) * loss(A)) / ((nrow(A) - &p) * loss(B));
  aic = 2 * nrow(A) * log(loss(A) / nrow(A)) + 2 * &p;
  aicc = 2 * nrow(A) * log(loss(A) / nrow(A)) + 2 * &p * nrow(A) / (nrow(A) - &p - 1);
  bic = 2 * nrow(A) * log(loss(A) / nrow(A)) + &p * log(nrow(A));
  
  l = {"R1" "ADJUSTED R1" "AIC" "AICC" "BIC"};
  v = r1 // adj_r1 // aic // aicc // bic;
  print v[rowname = l format = 20.8 label = "Fit Statistics"];
quit;

%mend quant_gof;
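As a quick usage sketch (not part of the original post), the macro can be exercised against the SASHELP.CARS dataset that ships with SAS, e.g. by modeling the median of MSRP on Horsepower and Weight:

%quant_gof(data = sashelp.cars, y = msrp, x = horsepower weight, tau = 0.5);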

Written by statcompute

April 15, 2017 at 8:24 pm

Estimating Conway-Maxwell-Poisson Regression in SAS

Conway-Maxwell-Poisson (CMP) regression is a flexible way to model frequency outcomes with both under-dispersion and over-dispersion. In SAS, CMP regression can be estimated with the COUNTREG procedure directly or with the NLMIXED procedure by specifying the likelihood function. However, the use of the NLMIXED procedure is extremely cumbersome in that we need to estimate a standard Poisson regression first and then use its estimated parameters as initial values of parameter estimates for the CMP regression.

In the example below, we will show how to employ the GLIMMIX procedure to estimate a CMP regression by providing both the log likelihood function and the variance function in terms of the expected mean.
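For reference, the CMP probability mass function implied by the programming statements below is, with mu denoting the quantity computed as _mu_ and nu = 1 / exp(_phi_):

P(Y = y) \;=\; \frac{\mu^{y}}{(y!)^{\nu}} \cdot \frac{1}{Z(\mu, \nu)},
\qquad
Z(\mu, \nu) \;=\; \sum_{i=0}^{\infty} \frac{\mu^{i}}{(i!)^{\nu}},

which reduces to the standard Poisson PMF when nu = 1. The DO loop in the code truncates the infinite sum for Z at 100 terms.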

proc glimmix data = mylib.credit_count;
  model majordrg = age acadmos minordrg ownrent / link = log solution;
  _nu =  1 / exp(_phi_);
  _variance_ = (1 / _nu) / ((_mu_) ** (1 / _nu));
  _z = 0;
  do i = 0 to 100;
    _z = _z + (_mu_ ** i) / fact(i) ** _nu;
  end;
  _prob = (_mu_ ** majordrg) / (fact(majordrg) ** _nu) * (_z ** (-1));
  _logl_ = log(_prob);
run;

Since the scale parameter _phi_ is strictly above 0, the expression 1 / exp(_phi_) in the third line of the code ensures that the Nu parameter is bounded between 0 and 1.

In addition, the DO loop calculates the normalization constant Z such that the PMF sums up to 1. As there is no closed form for Z, we need to approximate it numerically at the cost of a longer computing time.

Other implicit advantages of the GLIMMIX procedure over the NLMIXED procedure include that there is no need to provide initial values of parameter estimates and that the computing time is shorter.

Written by statcompute

March 26, 2017 at 5:09 pm

Modeling Generalized Poisson Regression in SAS

The Generalized Poisson (GP) regression is a very flexible statistical model for count outcomes in that it can accommodate both over-dispersion and under-dispersion, which makes it a very practical modeling approach for real-world problems and a serious contender to the Quasi-Poisson regression. Its probability function, mean, and variance are given below.

Prob(Y) = Alpha / Y! * (Alpha + Xi * Y) ^ (Y - 1) * EXP(-Alpha - Xi * Y)
E(Y) = Mu = Alpha / (1 - Xi)
Var(Y) = Mu / (1 - Xi) ^ 2

While the GP regression can be conveniently estimated with the FMM procedure in SAS, I’d always like to dive a little deeper into its model specification and likelihood function to gain a better understanding. For instance, there is a slight difference in GP model outcomes between the FMM procedure in SAS and the VGAM package in R. After looking into the details, I realized that the difference is solely due to the different parameterizations.

Basically, there are three steps for estimating a GP regression with the NLMIXED procedure. Due to the complexity of the GP likelihood function and its convergence process, it is always a good practice to estimate a baseline standard Poisson regression as a starting point and then output its parameter estimates into a table, e.g. _EST as shown below.

ods output ParameterEstimates = _est;
proc genmod data = mylib.credit_count;
  model majordrg = age acadmos minordrg ownrent / dist = poisson link = log;
run;

After acquiring parameter estimates from the standard Poisson regression, we can use them to construct initial values of parameter estimates for the Generalized Poisson regression. In the code snippet below, we use the SQL procedure to create two macro variables that we are going to use in the final model estimation of the GP regression.

proc sql noprint;
select
  "_"||compress(upcase(parameter), ' ')||" = "||compress(put(estimate, 10.2), ' ')
into
  :_parm separated by ' '
from  
  _est;
  
select
  case 
    when upcase(parameter) = 'INTERCEPT' then "_"||compress(upcase(parameter), ' ')
    else "_"||compress(upcase(parameter), ' ')||" * "||compress(upcase(parameter), ' ')
  end
into
  :_xb separated by ' + '    
from  
  _est
where
  upcase(parameter) ~= 'SCALE';  
quit;

/*
%put &_parm;
_INTERCEPT = -1.38 _AGE = 0.01 _ACADMOS = 0.00 _MINORDRG = 0.46 _OWNRENT = -0.20 _SCALE = 1.00

%put &_xb;
 _INTERCEPT + _AGE * AGE + _ACADMOS * ACADMOS + _MINORDRG * MINORDRG + _OWNRENT * OWNRENT
*/

In the last step, we use the NLMIXED procedure to estimate the GP regression by specifying its log likelihood function, which generates model results identical to those from the FMM procedure. It is worth mentioning that, in SAS, the expected mean is _mu = exp(x * beta), whereas in R the term exp(x * beta) refers to the _alpha parameter. Therefore, the intercept would be different between SAS and R, primarily due to different ways of parameterization under the identical statistical logic.

proc nlmixed data = mylib.credit_count;
  parms &_parm.;
  _xb = &_xb.;
  _xi = 1 - exp(-_scale);
  _mu = exp(_xb);  
  _alpha = _mu * (1 - _xi);
  _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg);
  ll = log(_prob);
  model majordrg ~ general(ll);
run;

In addition to the FMM and NLMIXED procedures, GLIMMIX can also be employed to estimate the GP regression, as shown below. In this case, we need to specify both the log likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count;
  model majordrg = age acadmos minordrg ownrent / link = log solution;
  _xi = 1 - 1 / exp(_phi_);
  _variance_ = _mu_ / (1 - _xi) ** 2;
  _alpha = _mu_ * (1 - _xi);
  _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg);  
  _logl_ = log(_prob);
run;

Written by statcompute

March 11, 2017 at 3:01 pm

Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has the probability density function shown below.

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In this formulation, the scale parameter satisfies 0 < a < y and the shape parameter satisfies k > 1.

The positive lower bound of the Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of ABA operational risk consortium data is $10,000, and any loss event below the threshold value would not be reported, which might add complexity to the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response y' = severity - threshold to accommodate the threshold value, such that the support of y' starts from 0 and the Gamma, Inverse Gaussian, or Lognormal regression is still applicable. However, under the distributional assumption of a Type-I Pareto with a known lower bound, we do not need to shift the severity measure anymore but can model it directly based on the probability function.
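For contrast, a minimal sketch of the shifted-response workaround described above might look like the following, assuming a hypothetical dataset work.losses with the loss amount in a column named severity, two illustrative predictors x1 and x2, and a known reporting threshold of 10,000:

data _shifted;
  set work.losses;
  * shift the severity by the reporting threshold so that the support starts near 0 *;
  y_shift = severity - 10000;
run;

proc genmod data = _shifted;
  * a gamma regression on the shifted response, which must remain strictly positive *;
  model y_shift = x1 x2 / dist = gamma link = log;
run;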

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound a = 2 by using the VGAM package.

library(VGAM)
set.seed(2017)
n <- 200
a <- 2
x <- runif(n)
k <- exp(1 + 5 * x)
pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x)
fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE)
summary(fit)
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   1.0322     0.1363   7.574 3.61e-14 ***
# x             4.9815     0.2463  20.229  < 2e-16 ***
AIC(fit)
#  -644.458
BIC(fit)
#  -637.8614

The SAS code below, estimating the same Type-I Pareto regression with the NLMIXED procedure, provides almost identical model estimates.

proc nlmixed data = pdata;
  parms b0 = 0.1 b1 = 0.1;
  k = exp(b0 + b1 * x);
  a = 2;
  lh = k * (a ** k) / (y ** (k + 1));
  ll = log(lh);
  model y ~ general(ll);
run;
/*
Fit Statistics
-2 Log Likelihood               -648.5
AIC (smaller is better)         -644.5
AICC (smaller is better)        -644.4
BIC (smaller is better)         -637.9

Parameter  Estimate   Standard Error   DF   t Value   Pr > |t|
b0         1.0322     0.1385          200      7.45     <.0001
b1         4.9815     0.2518          200     19.78     <.0001
*/

Lastly, it is worth pointing out that the conditional mean of the Type-I Pareto response is not equal to exp(x * beta) but a * k / (k - 1) with k = exp(x * beta). Therefore, the conditional mean only exists when k > 1, which might cause numerical issues in the model estimation.

Written by statcompute

December 11, 2016 at 5:12 pm