Fetching Data From SAS Dataset to Lua Table

data one;
  array c{2} $ _temporary_ ("A", "B");
  do i = 1 to dim(c);
    x = c[i];
    do j = 1 to 2;
      y = round(rannor(1), 0.0001);
      output;
    end;
  end;
run;

proc lua;
submit;
  -- OPEN SAS DATASET FOR READING --
  local dsid = sas.open("work.one", "i")

  -- CREATING AN EMPTY LUA TABLE --
  local list = {}

  -- LOOP THROUGH OBSERVATIONS IN SAS DATASET --
  for obs in sas.rows(dsid) do
    local dict = {}

    -- LOOP THROUGH VARIABLES IN EACH OBSERVATION --
    for var in sas.vars(dsid) do
      dict[var.name] = obs[var.name]
    end

    -- INSERT EACH RECORD INTO LUA TABLE --
    table.insert(list, dict)

    -- CLOSE SAS DATASET AFTER THE LAST RECORD --
    if #list == sas.nobs(dsid) then
      sas.close(dsid)
    end
  end

  -- PRINT OUT LUA TABLE --
  for i = 1, #list do
    print(string.rep("*", 5).." RECORD: "..i.." "..string.rep("*", 5))
    for key, value in pairs(list[i]) do
      print(key.." --> "..type(value).." --> "..value)
    end
    print("\n")
  end
  -- WRITE LUA TABLE INTO NEW SAS DATASET --
  new_ds = "work.two"
  sas.write_ds(list, new_ds)

  -- SUBMITTING SAS CODE --
  sas.submit([[proc print data = @ds@ noobs; run;]], {ds = new_ds})

endsubmit;
run;

*** OUTPUT SHOWN IN THE LOG ***
***** RECORD: 1 *****
y --> number --> 1.8048
j --> number --> 1
i --> number --> 1
x --> string --> A


***** RECORD: 2 *****
y --> number --> -0.0799
j --> number --> 2
i --> number --> 1
x --> string --> A


***** RECORD: 3 *****
y --> number --> 0.3966
j --> number --> 1
i --> number --> 2
x --> string --> B


***** RECORD: 4 *****
y --> number --> -1.0833
j --> number --> 2
i --> number --> 2
x --> string --> B

Convert SAS Dataset to Dictionary List

from sas7bdat import SAS7BDAT

with SAS7BDAT("Downloads/accepts.sas7bdat") as f:
  lst = [dict(zip(f.column_names, row)) for row in list(f)[1:]]

col = ["app_id", "bureau_score", "ltv", "tot_derog", "tot_income", "bad"]

for i in range(5):
  print({k: lst[i].get(k) for k in col})

#{'tot_income': 4800.0, 'ltv': 109.0, 'app_id': 1001.0, 'bureau_score': 747.0, 'bad': 0.0, 'tot_derog': 6.0}
#{'tot_income': 5833.33, 'ltv': 97.0, 'app_id': 1002.0, 'bureau_score': 744.0, 'bad': 0.0, 'tot_derog': 0.0}
#{'tot_income': 2308.33, 'ltv': 105.0, 'app_id': 1003.0, 'bureau_score': 667.0, 'bad': 0.0, 'tot_derog': 0.0}
#{'tot_income': 4083.33, 'ltv': 78.0, 'app_id': 1005.0, 'bureau_score': 648.0, 'bad': 1.0, 'tot_derog': 2.0}
#{'tot_income': 5783.33, 'ltv': 100.0, 'app_id': 1006.0, 'bureau_score': 649.0, 'bad': 0.0, 'tot_derog': 2.0}

Two-Stage Estimation of Switching Regression

The switching regression is an extension of the Heckit model. Also known as the type-V Tobit model, it assumes a multivariate normal distribution for three latent variables Y1*, Y2*, and Y3* such that
A. Y1 = 1 for Y1* > 0 and Y1 = 0 for Y1* <= 0;
B. Y2 = Y2* for Y1 = 1 and Y2 = 0 for Y1 = 0;
C. Y3 = Y3* for Y1 = 0 and Y3 = 0 for Y1 = 1.
Therefore, Y2 and Y3 would not be observable at the same time.

In SAS, the switching regression can be implemented with the QLIM procedure, as shown in the SAS/ETS documentation (http://support.sas.com/documentation/cdl/en/etsug/63939/HTML/default/viewer.htm#etsug_qlim_sect039.htm). However, when using the QLIM procedure in practice, I sometimes find that the MLE might not converge, given the complexity of the likelihood function. The example below demonstrates a two-stage estimation approach using the simple LOGISTIC and REG procedures. The benefits of the two-stage estimation are twofold. First of all, it is extremely easy to implement in practice. Secondly, when the MLE is preferred, the estimated parameters from the two-stage approach can be used as initial values in the optimization to help the MLE converge.
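For reference, the selection-correction term computed in the d3 data step further below is the inverse Mills ratio implied by the first-stage probit:

imr = pdf('normal', xb) / cdf('normal', xb) for y1 = 1
imr = pdf('normal', xb) / (1 - cdf('normal', xb)) for y1 = 0

where xb is the probit linear predictor (the xbeta output from PROC LOGISTIC). Adding this term to each second-stage regression corrects for the sample selection in the corresponding regime.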

data d1;
  keep y1 y2 y3 x1 x2;
  do i = 1 to 500;
    x1 = rannor(1);
    x2 = rannor(1);
    u1 = rannor(1);
    u2 = rannor(1);
    u3 = rannor(1);
    y1l = 1 + 2 * x1 + 3 * x2 + u1;
    y2l = 1 + 2 * x1 + u1 * 0.2 + u2;
    y3l = 1 - 2 * x2 + u1 * 0.1 - u2 * 0.5 + u3 * 0.5;
    if y1l > 0 then y1 = 1;
    else y1 = 0;
    if y1l > 0 then y2 = y2l;
    else y2 = 0;
    if y1l <= 0 then y3 = y3l;
    else y3 = 0;
    output;
  end;
run;

*** 1-STEP MLE ***;
proc qlim data = d1;
  model y1 = x1 x2 / discrete;
  model y2 = x1 / select(y1 = 1);
  model y3 = x2 / select(y1 = 0);
run;

/*
Parameter      DF      Estimate        Standard     t Value   Approx
                                       Error                  P-Value
y2.Intercept   1       0.931225        0.080241     11.61     <.0001
y2.x1          1       1.970194        0.06801      28.97     <.0001
_Sigma.y2      1       1.050489        0.042064     24.97     <.0001
y3.Intercept   1       0.936837        0.09473       9.89     <.0001
y3.x2          1      -2.043977        0.071986    -28.39     <.0001
_Sigma.y3      1       0.710451        0.037412     18.99     <.0001
y1.Intercept   1       1.040852        0.127171      8.18     <.0001
y1.x1          1       1.900394        0.19335       9.83     <.0001
y1.x2          1       2.590489        0.257989     10.04     <.0001
_Rho.y1.y2     1       0.147923        0.2156        0.69     0.4927
_Rho.y1.y3     1       0.324967        0.166508      1.95     0.051
*/

*** 2-STAGE APPROACH ***;
proc logistic data = d1 desc;
  model y1 = x1 x2 / link = probit;
  output out = d2 xbeta = xb;
run;
/*
Parameter  DF  Estimate  Standard  Wald          P-Value
                         Error     Chi-Square
Intercept  1   1.0406    0.1296     64.5117      <.0001
x1         1   1.8982    0.1973     92.5614      <.0001
x2         1   2.6223    0.2603    101.47        <.0001
*/

data d3;
  set d2;
  if y1 = 1 then imr = pdf('normal', xb) / cdf('normal', xb);
  else imr = pdf('normal', xb) / (1 - cdf('normal', xb));
run;

proc reg data = d3 plots = none;
  where y1 = 1;
  model y2 = x1 imr;
run;
/*
Variable   DF  Parameter  Standard  t Value  P-Value
               Estimate   Error
Intercept  1   0.94043    0.0766    12.28    <.0001
x1         1   1.96494    0.06689   29.38    <.0001
imr        1   0.11476    0.20048    0.57    0.5674
*/

proc reg data = d3 plots = none;
  where y1 = 0;
  model y3 = x2 imr;
run;
/*
Variable   DF  Parameter  Standard  t Value  P-Value
               Estimate   Error
Intercept  1    0.92982   0.09493     9.79   <.0001
x2         1   -2.04808   0.07194   -28.47   <.0001
imr        1   -0.21852   0.1244     -1.76   0.0807
*/

*** SET INITIAL VALUES IN MLE BASED ON TWO-STAGE OUTCOMES ***;
proc qlim data = d1;
  init y1.intercept = 1.0406  y1.x1 = 1.8982      y1.x2 = 2.6223
       y2.intercept = 0.9404  y2.x1 = 1.9649  _sigma.y2 = 1.0539
       y3.intercept = 0.9298  y3.x2 = -2.048  _sigma.y3 = 0.7070;
  model y1 = x1 x2 / discrete;
  model y2 = x1 / select(y1 = 1);
  model y3 = x2 / select(y1 = 0);
run;

Writing Wrapper in SAS

When it comes to writing wrappers around data steps and procedures in SAS, SAS macros might still be the primary choice for most SASors. In the example below, I am going to show a few alternatives to accomplish such a task.

First of all, let’s generate a toy SAS dataset as below. Based on different categories of the variable X, we are going to calculate different statistics for the variable Y. For instance, we want the minimum with X = “A”, the median with X = “B”, and the maximum with X = “C”.

data one (keep = x y); 
  array a{3} $ _temporary_ ("A" "B" "C");
  do i = 1 to dim(a);
    x = a[i];
    do j = 1 to 10; 
      y = rannor(1);
      output;
    end;
  end;
run;

/* Expected Output:
x        y       stat  
A    -1.08332    MIN  
B    -0.51915    MEDIAN  
C     1.61438    MAX    */

Writing a SAS macro to get the job done is straightforward, albeit with lots of “&” and “%” signs that could be a little confusing for new SASors, as shown below.

%macro wrap;
%local list stat i x s;
%let list = A B C;
%let stat = MIN MEDIAN MAX;
%let i = 1;
%do %while (%scan(&list, &i) ne %str());
  %let x = %scan(&list, &i);
  %let s = %scan(&stat, &i);
  proc summary data = one nway; where x = "&x"; class x; output out = tmp(drop = _freq_ _type_) &s.(y) = ; run;
  %if &i = 1 %then %do;
    data two1; set tmp; format stat $6.; stat = "&s"; run;
  %end;
  %else %do;
    data two1; set two1 tmp (in = tmp); if tmp then stat = "&s."; run;
  %end;
  %let i = %eval(&i + 1); 
%end;
%mend wrap;

%wrap;

Other than using the SAS macro, Data _Null_ might be considered another old-fashioned way to accomplish the same task by utilizing the generic data flow embedded in the data step and the Call Execute routine. The most challenging piece might be parsing the script for data steps and procedures. The benefit over the SAS macro is that the code runs instantaneously without the need to compile the macro.

data _null_;
  array list{3} $ _temporary_ ("A" "B" "C");
  array stat{3} $ _temporary_ ("MIN" "MEDIAN" "MAX");
  do i = 1 to dim(list);
    call execute(cats(
      'proc summary data = one nway; class x; where x = "', list[i], cat('"; output out = tmp(drop = _type_ _freq_) ', stat[i]), '(y) = ; run;'
    ));
    if i = 1 then do;
      call execute(cats(
        'data two2; set tmp; format stat $6.; stat = "', stat[i], '"; run;'
      ));
    end;
    else do;
      call execute(cats(
        'data two2; set two2 tmp (in = tmp); if tmp then stat = "', stat[i], '"; run;'
      ));
    end;
  end;
run;

If we’d like to look for something more inspiring, the IML procedure might be another option for SASors who feel more comfortable with other programming languages, such as R or Python. The only caveat is that we need to convert values in IML into macro variables that can be consumed by the SAS code within the SUBMIT block.

proc iml;
  list = {'A', 'B', 'C'};
  stat = {'MIN', 'MEDIAN', 'MAX'};
  do i = 1 to nrow(list);
    call symputx("x", list[i]);
    call symputx("s", stat[i]);
    submit;
      proc summary data = one nway; class x; where x = "&x."; output out = tmp(drop = _type_ _freq_) &s.(y) = ; run;
    endsubmit;
    if i = 1 then do;
      submit;
        data two3; set tmp; format stat $6.; stat = "&s."; run;
      endsubmit;
    end;
    else do;
      submit;
        data two3; set two3 tmp (in = tmp); if tmp then stat = "&s."; run;
      endsubmit;
    end;
  end;
quit;

The last option that I’d like to demonstrate is based on the LUA procedure that is relatively new in Base SAS. The logic flow of the Proc LUA implementation looks similar to that of the Proc IML implementation shown above. However, passing values and tables in and out of generic SAS data steps and procedures is much more intuitive, making Proc LUA a perfect wrapper to bind other SAS functionalities together.

proc lua;
  submit;
    local list = {'A', 'B', 'C'}
    local stat = {'MIN', 'MEDIAN', 'MAX'}
    for i, item in ipairs(list) do
      local x = list[i]
      local s = stat[i]
      sas.submit[[
        proc summary data = one nway; class x; where x = "@x@"; output out = tmp(drop = _type_ _freq_) @s@(y) = ; run;
      ]]
      if i == 1 then
        sas.submit[[
          data two4; set tmp; format stat $6.; stat = "@s@"; run;
        ]]
      else
        sas.submit[[
          data two4; set two4 tmp (in = tmp); if tmp then stat = "@s@"; run;
        ]]
      end
    end
  endsubmit;
run;

SAS Implementation of ZAGA Models

In the previous post (https://statcompute.wordpress.com/2017/09/17/model-non-negative-numeric-outcomes-with-zeros/), I gave a brief introduction to the ZAGA (Zero-Adjusted Gamma) model, which provides a very flexible approach to model non-negative numeric responses. Today, I will show how to implement the ZAGA model in SAS, either jointly in one step or in two steps.

In SAS, the FMM procedure provides a very convenient interface to estimate the ZAGA model in one simple step. As shown below, there are two MODEL statements: the first estimates a Gamma sub-model for the positive outcomes and the second separates the point mass at zero from the positive values. The subsequent PROBMODEL statement is then employed to estimate the probability of a record being positive.


data ds;
  set "/folders/myfolders/autoclaim" (keep = clm_amt bluebook npolicy clm_freq5 mvr_pts income);
  where income ~= .;
  clm_flg = (clm_amt > 0);
run;

proc fmm data = ds tech = trureg;
  model clm_amt = bluebook npolicy / dist = gamma;
  model clm_amt = / dist = constant;
  probmodel clm_freq5 mvr_pts income;
run;

An alternative way to develop a ZAGA model in two steps is to estimate a logistic regression first, separating the point mass at zero from the positive values, and then to estimate a Gamma regression with the positive outcomes only, as illustrated below. The two-step approach is more intuitive to understand and, more importantly, is easier to implement without the convergence issues sometimes encountered in the FMM or NLMIXED procedures.


proc logistic data = ds desc;
  model clm_flg = clm_freq5 mvr_pts income;
run;

proc genmod data = ds;
  where clm_flg = 1;
  model clm_amt = bluebook npolicy / link = log dist = gamma;
run;
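If a single expected value is needed from the two-step fit, the two sub-models can be combined through E(Y) = P(Y > 0) * E(Y | Y > 0). Below is a minimal scoring sketch along that line; the item store, output datasets, and variable names (gmod, s1, s2, p_pos, e_clm_amt) are illustrative assumptions rather than part of the original example.

proc logistic data = ds desc noprint;
  model clm_flg = clm_freq5 mvr_pts income;
  * score the probability of a positive claim (illustrative output name p_pos);
  score data = ds out = s1 (rename = (p_1 = p_pos));
run;

proc genmod data = ds;
  where clm_flg = 1;
  model clm_amt = bluebook npolicy / link = log dist = gamma;
  * save the gamma fit so that it can be scored on the full data;
  store gmod;
run;

proc plm restore = gmod noprint;
  score data = s1 out = s2 / ilink;
run;

data zaga_pred;
  set s2;
  * illustrative ZAGA mean: probability of a claim times the conditional gamma mean;
  e_clm_amt = p_pos * predicted;
run;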

Estimating Parameters of A Hyper-Poisson Distribution in SAS

Similar to the COM-Poisson, Double-Poisson, and Generalized Poisson distributions discussed in my previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models/), the Hyper-Poisson distribution is another extension of the standard Poisson and is able to accommodate both under-dispersion and over-dispersion that are common in real-world problems. Given the complexity of its parameterization and computation, the Hyper-Poisson is somewhat under-investigated. To the best of my knowledge, there is no off-the-shelf computing routine in SAS for the Hyper-Poisson distribution and only an R function available at http://www4.ujaen.es/~ajsaez/hp.fit.r written by A.J. Sáez-Castillo and A. Conde-Sánchez (2013).

The SAS code presented below is the starting point of my attempt at the Hyper-Poisson and its potential applications. The purpose is to replicate the calculation results shown in Table 6 of “On the Hyper-Poisson Distribution and its Generalization with Applications” by Bayo H. Lawal (2017) (http://www.journalrepository.org/media/journals/BJMCS_6/2017/Mar/Lawal2132017BJMCS32184.pdf). As a result, the parameterization employed in my SAS code closely follows Bayo H. Lawal (2017) instead of A.J. Sáez-Castillo and A. Conde-Sánchez (2013).
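For reference, the likelihood coded in the NLMIXED step below is the hyper-Poisson probability mass function in this parameterization:

P(Y = y) = (Gamma(beta) / Gamma(beta + y)) * (lambda ** y) / theta
theta = sum over k >= 0 of Gamma(beta) * (lambda ** k) / Gamma(beta + k)

where the normalizing constant theta is truncated at k = 100 in the code. With beta = 1, the distribution reduces to the standard Poisson.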


data d1;
  input y n @@;
datalines;
0 121 1 85 2 19 3 1 4 0 5 0 6 1
;
run;

data df;
  set d1;
  where n > 0;
  do i = 1 to n;
    output;
  end;
run;

proc nlmixed data = df;
  parms lambda = 1 beta = 1;
  theta = 1;
  do k = 1 to 100;
    theta = theta + gamma(beta) * (lambda ** k) / gamma(beta + k);
  end;
  prob = (gamma(beta) / gamma(beta + y)) * ((lambda ** y) / theta);
  ll = log(prob);
  model y ~ general(ll);
run;

/*
                     Standard
Parameter  Estimate     Error    DF  t Value  Pr > |t|   Alpha
lambda       0.3752    0.1178   227     3.19    0.0016    0.05
beta         0.5552    0.2266   227     2.45    0.0150    0.05 
*/

As shown, the estimated Lambda = 0.3752 and the estimated Beta = 0.5552 are identical to what is presented in the paper. The next step is to explore applications in frequency modeling as well as its value in business cases.


Monotonic WoE Binning for LGD Models

While the monotonic binning algorithm has been widely used in scorecard and PD (Probability of Default) model developments, a similar idea can be generalized to LGD (Loss Given Default) models. In the post below, two SAS macros performing the monotonic binning for LGD are demonstrated.

The first one tends to generate relatively coarse bins based on iterative grouping, which requires a longer computing time.


%macro lgd_bin1(data = , y = , x = );

%let maxbin = 20;

data _tmp1 (keep = x y);
  set &data;
  y = min(1, max(0, &y));
  x = &x;
run;

proc sql noprint;
  select
    count(distinct x) into :xflg
  from
    _last_;
quit;

%let nbin = %sysfunc(min(&maxbin, &xflg));

%if &nbin > 2 %then %do;
  %do j = &nbin %to 2 %by -1;
    proc rank data = _tmp1 groups = &j out = _data_ (keep = x rank y);
      var x;
      ranks rank;
    run;

    proc summary data = _last_ nway;
      class rank;
      output out = _tmp2 (drop = _type_ rename = (_freq_ = freq))
      sum(y) = bads  mean(y) = bad_rate 
      min(x) = minx  max(x)  = maxx;
    run;

    proc sql noprint;
      select
        case when min(bad_rate) > 0 then 1 else 0 end into :minflg
      from
        _tmp2;
 
      select
        case when max(bad_rate) < 1 then 1 else 0 end into :maxflg
      from
        _tmp2;              
    quit;

    %if &minflg = 1 & &maxflg = 1 %then %do;
      proc corr data = _tmp2 spearman noprint outs = _corr;
        var minx;
        with bad_rate;
      run;
      
      proc sql noprint;
        select
          case when abs(minx) = 1 then 1 else 0 end into :cor
        from
          _corr
        where
          _type_ = 'CORR';
      quit;
 
      %if &cor = 1 %then %goto loopout;
    %end;
  %end;
%end;

%loopout:

proc sql noprint;
create table
  _tmp3 as
select
  a.rank + 1                                           as bin,
  a.minx                                               as minx,
  a.maxx                                               as maxx,
  a.freq                                               as freq,
  a.freq / b.freq                                      as dist,
  a.bad_rate                                           as avg_lgd,
  a.bads / b.bads                                      as bpct,
  (a.freq - a.bads) / (b.freq - b.bads)                as gpct,
  log(calculated bpct / calculated gpct)               as woe,
  (calculated bpct - calculated gpct) * calculated woe as iv 
from
  _tmp2 as a, (select sum(freq) as freq, sum(bads) as bads from _tmp2) as b;
quit;

proc print data = _last_ noobs label;
  var minx maxx freq dist avg_lgd woe;
  format freq comma8. dist percent10.2;
  label
    minx    = "Lower Limit"
    maxx    = "Upper Limit"
    freq    = "Freq"
    dist    = "Dist"
    avg_lgd = "Average LGD"
    woe     = "WoE";
  sum freq dist;
run; 

%mend lgd_bin1;

The second one can generate much finer bins based on the idea of isotonic regressions and is more computationally efficient.


%macro lgd_bin2(data = , y = , x = );

data _data_ (keep = x y);
  set &data;
  y = min(1, max(0, &y));
  x = &x;
run;

proc transreg data = _last_ noprint;
  model identity(y) = monotone(x);
  output out = _tmp1 tip = _t;
run;
 
proc summary data = _last_ nway;
  class _tx;
  output out = _data_ (drop = _freq_ _type_) mean(y) = lgd;
run;

proc sort data = _last_;
  by lgd;
run;
 
data _tmp2;
  set _last_;
  by lgd;
  _idx = _n_;
  if lgd = 0 then _idx = _idx + 1;
  if lgd = 1 then _idx = _idx - 1;
run;

proc sql noprint;
create table
  _tmp3 as
select
  a.*,
  b._idx
from
  _tmp1 as a inner join _tmp2 as b
on
  a._tx = b._tx;

create table
  _tmp4 as
select
  min(a.x)                                             as minx,
  max(a.x)                                             as maxx,
  sum(a.y)                                             as bads,
  count(a.y)                                           as freq,
  count(a.y) / b.freq                                  as dist,
  mean(a.y)                                            as avg_lgd,
  sum(a.y) / b.bads                                    as bpct,
  sum(1 - a.y) / (b.freq - b.bads)                     as gpct,
  log(calculated bpct / calculated gpct)               as woe,
  (calculated bpct - calculated gpct) * calculated woe as iv
from
  _tmp3 as a, (select count(*) as freq, sum(y) as bads from _tmp3) as b
group by
  a._idx;
quit;

proc print data = _last_ noobs label;
  var minx maxx freq dist avg_lgd woe; 
  format freq comma8. dist percent10.2;
  label
    minx    = "Lower Limit"
    maxx    = "Upper Limit"
    freq    = "Freq"
    dist    = "Dist"
    avg_lgd = "Average LGD"
    woe     = "WoE";
  sum freq dist;
run; 

%mend lgd_bin2;
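Both macros share the same interface. A call would look like the following, where the dataset and variable names are hypothetical placeholders rather than the actual testing data:

* hypothetical call - replace lgd_sample, lgd, and ltv with the real dataset and variables;
%lgd_bin1(data = lgd_sample, y = lgd, x = ltv);
%lgd_bin2(data = lgd_sample, y = lgd, x = ltv);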

Below is the output comparison between the two macros with the testing data downloaded from http://www.creditriskanalytics.net/datasets-private.html. Should you have any feedback, please feel free to leave me a message.

[Figure: output comparison between the two LGD binning macros]

Granular Monotonic Binning in SAS

In the post (https://statcompute.wordpress.com/2017/06/15/finer-monotonic-binning-based-on-isotonic-regression), it is shown how to do a finer monotonic binning with isotonic regression in R.

Below is a SAS macro implementing the monotonic binning with the same idea of isotonic regression. This macro is more efficient than the one shown in (https://statcompute.wordpress.com/2012/06/10/a-sas-macro-implementing-monotonic-woe-transformation-in-scorecard-development), as it does not require iterative binning, and it is also able to significantly increase the binning granularity.

%macro monobin(data = , y = , x = );
options mprint mlogic;

data _data_ (keep = _x _y);
  set &data;
  where &y in (0, 1) and &x ~= .;
  _y = &y;
  _x = &x;
run;

proc transreg data = _last_ noprint;
  model identity(_y) = monotone(_x);
  output out = _tmp1 tip = _t;
run;

proc summary data = _last_ nway;
  class _t_x;
  output out = _data_ (drop = _freq_ _type_) mean(_y) = _rate;
run;

proc sort data = _last_;
  by _rate;
run;

data _tmp2;
  set _last_;
  by _rate;
  _idx = _n_;
  if _rate = 0 then _idx = _idx + 1;
  if _rate = 1 then _idx = _idx - 1;
run;
  
proc sql noprint;
create table
  _tmp3 as
select
  a.*,
  b._idx
from
  _tmp1 as a inner join _tmp2 as b
on
  a._t_x = b._t_x;
  
create table
  _tmp4 as
select
  a._idx,
  min(a._x)                                               as _min_x,
  max(a._x)                                               as _max_x,
  sum(a._y)                                               as _bads,
  count(a._y)                                             as _freq,
  mean(a._y)                                              as _rate,
  sum(a._y) / b.bads                                      as _bpct,
  sum(1 - a._y) / (b.freq - b.bads)                       as _gpct,
  log(calculated _bpct / calculated _gpct)                as _woe,
  (calculated _bpct - calculated _gpct) * calculated _woe as _iv
from 
  _tmp3 as a, (select count(*) as freq, sum(_y) as bads from _tmp3) as b
group by
  a._idx;
quit;

title "Monotonic WoE Binning for %upcase(%trim(&x))";
proc print data = _last_ label noobs;
  var _min_x _max_x _bads _freq _rate _woe _iv;
  label
    _min_x = "Lower"
    _max_x = "Upper"
    _bads  = "#Bads"
    _freq  = "#Freq"
    _rate  = "BadRate"
    _woe   = "WoE"
    _iv    = "IV";
  sum _bads _freq _iv;
run;
title;

%mend monobin;
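As an illustration, assuming the accepts data referenced earlier in these notes (with the 0/1 outcome bad and the numeric attribute ltv), the macro would be invoked as below; the dataset reference is an assumption rather than part of the original post.

* assumed call based on the accepts data shown earlier;
%monobin(data = accepts, y = bad, x = ltv);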

Below is the sample output for LTV, showing an identical binning scheme to the one generated by the R isobin() function.

[Figure: sample monotonic binning output for LTV]

Double Poisson Regression in SAS

In the previous post (https://statcompute.wordpress.com/2016/11/27/more-about-flexible-frequency-models), I’ve shown how to estimate the double Poisson (DP) regression in R with the gamlss package. The hurdle in estimating the DP regression is the calculation of a normalizing constant in the DP density function, which can be computed either as the sum of an infinite series or by a closed-form approximation. In the example below, I will show how to estimate the DP regression in SAS with the GLIMMIX procedure.

First of all, I will show how to estimate DP regression by using the exact DP density function. In this case, we will approximate the normalizing constant by computing a partial sum of the infinite series, as highlighted below.
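For reference, the density coded in the GLIMMIX steps below is the double Poisson probability function, which can be written approximately as

P(Y = y) ~= c * (theta ** 0.5) * ((exp(-y) * y ** y / y!) ** (1 - theta)) * ((exp(-mu) * mu ** y / y!) ** theta)
1 / c ~= 1 + (1 - theta) / (12 * theta * mu) * (1 + 1 / (theta * mu))

where theta is the dispersion parameter (theta = exp(_phi_) in the code). The first approach accumulates 1 / c as a partial sum of the series, while the second approach uses the closed-form approximation above.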

data poi;
  do n = 1 to 5000;
    x1 = ranuni(1);
    x2 = ranuni(2);
    x3 = ranuni(3);
    y = ranpoi(4, exp(1 * x1 - 2 * x2 + 3 * x3));
    output;
  end;
run;

proc glimmix data = poi;
  nloptions tech = quanew update = bfgs maxiter = 1000;
  model y = x1 x2 x3 / link = log solution;
  theta = exp(_phi_);
  _variance_ = _mu_ / theta;
  p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta;
  p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta);
  f = (theta ** 0.5) * ((exp(-_mu_)) ** theta);  
  do i = 1 to 100;
    f = f + (theta ** 0.5) * ((exp(-i) * (i ** i) / fact(i)) ** (1 - theta)) * ((exp(-_mu_) * (_mu_ ** i) / fact(i)) ** theta);
  end;
  k = 1 / f;
  prob = k * (theta ** 0.5) * p_y * p_u;
  if log(prob) ~= . then _logl_ = log(prob);
run;

Next, I will show the same estimation routine by using the closed form approximation.

proc glimmix data = poi;
  nloptions tech = quanew update = bfgs maxiter = 1000;
  model y = x1 x2 x3 / link = log solution;
  theta = exp(_phi_);
  _variance_ = _mu_ / theta;
  p_u = (exp(-_mu_) * (_mu_ ** y) / fact(y)) ** theta;
  p_y = (exp(-y) * (y ** y) / fact(y)) ** (1 - theta);
  k = 1 / (1 + (1 - theta) / (12 * theta * _mu_) * (1 + 1 / (theta * _mu_)));
  prob = k * (theta ** 0.5) * p_y * p_u;
  if log(prob) ~= . then _logl_ = log(prob);
run;

While the first approach is more accurate by closely following the DP density function, the second approach is more efficient with a significantly lower computing cost. However, both are much faster than the corresponding R function gamlss().

SAS Macro Calculating Goodness-of-Fit Statistics for Quantile Regression

As shown by Fu and Wu in their presentation (https://www.casact.org/education/rpm/2010/handouts/CL1-Fu.pdf), the quantile regression is an appealing approach to model severity measures with high volatilities due to its statistical characteristics, including the robustness to extreme values and no distributional assumptions. Curti and Migueis also pointed out in a research paper (https://www.federalreserve.gov/econresdata/feds/2016/files/2016002r1pap.pdf) that the operational loss is more sensitive to macro-economic drivers at the tail, making the quantile regression an ideal model to capture such relationships.

While the quantile regression can be conveniently estimated in SAS with the QUANTREG procedure, the standard SAS output doesn’t provide goodness-of-fit (GoF) statistics. More importantly, the underlying rationale for calculating GoF in a quantile regression is very different from the one employed in OLS or GLM regressions. For instance, the most popular R-squared is not applicable to the quantile regression anymore. Instead, a statistic called “R1” should be used. In addition, AIC and BIC are also defined differently in the quantile regression.
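Specifically, following Koenker and Machado (1999), R1 compares the check-function loss of the fitted model against that of an intercept-only (restricted) model:

R1(tau) = 1 - sum(rho_tau(y - yhat_full)) / sum(rho_tau(y - yhat_null))
rho_tau(u) = u * (tau - I(u < 0))

where yhat_full and yhat_null are the fitted quantiles from the unrestricted and the intercept-only quantile regressions. The loss() module in the IML step of the macro below implements the check function rho_tau.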

Below is a SAS macro showing how to calculate GoF statistics, including R1 and various information criteria, for a quantile regression.

%macro quant_gof(data = , y = , x = , tau = 0.5);
***********************************************************;
* THE MACRO CALCULATES GOODNESS-OF-FIT STATISTICS FOR     *;
* QUANTILE REGRESSION                                     *;
* ------------------------------------------------------- *;
* REFERENCE:                                              *;
*  GOODNESS OF FIT AND RELATED INFERENCE PROCESSES FOR    *;
*  QUANTILE REGRESSION, KOENKER AND MACHADO, 1999         *;
***********************************************************;

options nodate nocenter;
title;

* UNRESTRICTED QUANTILE REGRESSION *;
ods select ParameterEstimates ObjFunction;
ods output ParameterEstimates = _est;
proc quantreg data = &data ci = resampling(nrep = 500);
  model &y = &x / quantile = &tau nosummary nodiag seed = 1;
  output out = _full p = _p;
run;

* RESTRICTED QUANTILE REGRESSION *;
ods select none;
proc quantreg data = &data ci = none;
  model &y = / quantile = &tau nosummary nodiag;
  output out = _null p = _p;
run;
ods select all; 

proc sql noprint;
  select sum(df) into :p from _est;
quit;

proc iml;
  use _full;
  read all var {&y _p} into A;
  close _full;

  use _null;
  read all var {&y _p} into B;
  close _null;

  * DEFINE A FUNCTION CALCULATING THE SUM OF ABSOLUTE DEVIATIONS *;
  start loss(x);
    r = x[, 1] - x[, 2];
    z = j(nrow(r), 1, 0);
    l = sum(&tau * (r <> z) + (1 - &tau) * (-r <> z));
    return(l);
  finish;
  
  r1 = 1 - loss(A) / loss(B);
  adj_r1 = 1 - ((nrow(A) - 1) * loss(A)) / ((nrow(A) - &p) * loss(B));
  aic = 2 * nrow(A) * log(loss(A) / nrow(A)) + 2 * &p;
  aicc = 2 * nrow(A) * log(loss(A) / nrow(A)) + 2 * &p * nrow(A) / (nrow(A) - &p - 1);
  bic = 2 * nrow(A) * log(loss(A) / nrow(A)) + &p * log(nrow(A));
  
  l = {"R1" "ADJUSTED R1" "AIC" "AICC" "BIC"};
  v = r1 // adj_r1 // aic // aicc // bic;
  print v[rowname = l format = 20.8 label = "Fit Statistics"];
quit;

%mend quant_gof;
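As a hypothetical usage example, with placeholder dataset and variable names (any continuous outcome and numeric predictors would work):

* hypothetical call at the 75th percentile;
%quant_gof(data = severity_data, y = loss_amt, x = driver1 driver2, tau = 0.75);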

Modeling Generalized Poisson Regression in SAS

The Generalized Poisson (GP) regression is a very flexible statistical model for count outcomes in that it can accommodate both over-dispersion and under-dispersion, which makes it a very practical modeling approach in real-world problems and a serious contender to the Quasi-Poisson regression.

Prob(Y) = Alpha / Y! * (Alpha + Xi * Y) ^ (Y - 1) * EXP(-Alpha - Xi * Y)
E(Y) = Mu = Alpha / (1 - Xi)
Var(Y) = Mu / (1 - Xi) ^ 2
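Note that inverting the mean relationship above gives Alpha = Mu * (1 - Xi), which is exactly how the NLMIXED and GLIMMIX steps below recover _alpha from _mu and _xi.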

While the GP regression can be conveniently estimated with the HMM procedure in SAS, I’d always like to dive a little deeper into its model specification and likelihood function to gain a better understanding. For instance, there is a slight difference in GP model outcomes between the HMM procedure in SAS and the VGAM package in R. After looking into the details, I realized that the difference is solely due to different parameterizations.

Basically, there are three steps for estimating a GP regression with the NLMIXED procedure. Due to the complexity of the GP likelihood function and its convergence process, it is always a good practice to estimate a baseline standard Poisson regression as a starting point and then to output its parameter estimates into a table, e.g. _EST as shown below.

ods output ParameterEstimates = _est;
proc genmod data = mylib.credit_count;
  model majordrg = age acadmos minordrg ownrent / dist = poisson link = log;
run;

After acquiring parameter estimates from the standard Poisson regression, we can use them to construct initial values of parameter estimates for the Generalized Poisson regression. In the code snippet below, the SQL procedure is used to create two macro variables that we are going to use in the final model estimation of the GP regression.

proc sql noprint;
select
  "_"||compress(upcase(parameter), ' ')||" = "||compress(put(estimate, 10.2), ' ')
into
  :_parm separated by ' '
from  
  _est;
  
select
  case 
    when upcase(parameter) = 'INTERCEPT' then "_"||compress(upcase(parameter), ' ')
    else "_"||compress(upcase(parameter), ' ')||" * "||compress(upcase(parameter), ' ')
  end
into
  :_xb separated by ' + '    
from  
  _est
where
  upcase(parameter) ~= 'SCALE';  
quit;

/*
%put &_parm;
_INTERCEPT = -1.38 _AGE = 0.01 _ACADMOS = 0.00 _MINORDRG = 0.46 _OWNRENT = -0.20 _SCALE = 1.00

%put &_xb;
 _INTERCEPT + _AGE * AGE + _ACADMOS * ACADMOS + _MINORDRG * MINORDRG + _OWNRENT * OWNRENT
*/

In the last step, we use the NLMIXED procedure to estimate the GP regression by specifying its log-likelihood function, which generates identical model results as the HMM procedure. It is worth mentioning that the expected mean is _mu = exp(x * beta) in SAS, whereas the term exp(x * beta) refers to the _alpha parameter in R. Therefore, the intercept would be different between SAS and R, primarily due to different ways of parameterization with the identical statistical logic.

proc nlmixed data = mylib.credit_count;
  parms &_parm.;
  _xb = &_xb.;
  _xi = 1 - exp(-_scale);
  _mu = exp(_xb);  
  _alpha = _mu * (1 - _xi);
  _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg);
  ll = log(_prob);
  model majordrg ~ general(ll);
run;

In addition to the HMM and NLMIXED procedures, GLIMMIX can also be employed to estimate the GP regression, as shown below. In this case, we need to specify both the log-likelihood function and the variance function in terms of the expected mean.

proc glimmix data = mylib.credit_count;
  model majordrg = age acadmos minordrg ownrent / link = log solution;
  _xi = 1 - 1 / exp(_phi_);
  _variance_ = _mu_ / (1 - _xi) ** 2;
  _alpha = _mu_ * (1 - _xi);
  _prob = _alpha / fact(majordrg) * (_alpha + _xi * majordrg) ** (majordrg - 1) * exp(- _alpha - _xi * majordrg);  
  _logl_ = log(_prob);
run;

Estimate Regression with (Type-I) Pareto Response

The Type-I Pareto distribution has a probability density function shown below

f(y; a, k) = k * (a ^ k) / (y ^ (k + 1))

In this formulation, a is the scale parameter with 0 < a < y and k is the shape parameter with k > 1.

The positive lower bound of the Type-I Pareto distribution is particularly appealing in modeling the severity measure in that there is usually a reporting threshold for operational loss events. For instance, the reporting threshold of the ABA operational risk consortium data is $10,000 and any loss event below the threshold value would not be reported, which might add complexity to the severity model estimation.

In practice, instead of modeling the severity measure directly, we might model the shifted response y' = severity - threshold to accommodate the threshold value such that the supporting domain of y' could start from 0 and the Gamma, Inverse Gaussian, or Lognormal regression would still be applicable. However, under the distributional assumption of Type-I Pareto with a known lower bound, we do not need to shift the severity measure anymore but can model it directly based on the probability function.

Below is the R code snippet showing how to estimate a regression model for the Pareto response with the lower bound a = 2 by using the VGAM package.

library(VGAM)
set.seed(2017)
n <- 200
a <- 2
x <- runif(n)
k <- exp(1 + 5 * x)
pdata <- data.frame(y = rpareto(n = n, scale = a, shape = k), x = x)
fit <- vglm(y ~ x, paretoff(scale = a), data = pdata, trace = TRUE)
summary(fit)
# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)
# (Intercept)   1.0322     0.1363   7.574 3.61e-14 ***
# x             4.9815     0.2463  20.229  < 2e-16 ***
AIC(fit)
#  -644.458
BIC(fit)
#  -637.8614

The SAS code below, estimating the Type-I Pareto regression, provides almost identical model estimates.

proc nlmixed data = pdata;
  parms b0 = 0.1 b1 = 0.1;
  k = exp(b0 + b1 * x);
  a = 2;
  lh = k * (a ** k) / (y ** (k + 1));
  ll = log(lh);
  model y ~ general(ll);
run;
/*
Fit Statistics
-2 Log Likelihood               -648.5
AIC (smaller is better)         -644.5
AICC (smaller is better)        -644.4
BIC (smaller is better)         -637.9

Parameter Estimate   Standard   DF    t Value   Pr > |t|
                     Error 
b0        1.0322     0.1385     200    7.45     <.0001 	
b1        4.9815     0.2518     200   19.78     <.0001 	
*/

Lastly, it is worth pointing out that the conditional mean of the Type-I Pareto response is not equal to exp(x * beta) but rather a * k / (k - 1) with k = exp(x * beta). Therefore, the conditional mean only exists when k > 1, which might cause numerical issues in the model estimation.

Pregibon Test for Goodness of Link in SAS

When estimating generalized linear models for binary outcomes, we often choose the logit link function by default and seldom consider other alternatives such as probit or cloglog. The Pregibon test (Pregibon, 1980) provides a means to check the goodness of link with a simple logic outlined below.

1. First of all, we estimate the regression model with the hypothesized link function, e.g. logit;
2. After the model estimation, we calculate yhat and yhat ^ 2 and then estimate a secondary regression with the identical response variable Y and link function but with yhat and yhat ^ 2 as model predictors (with the intercept);
3. If the link function is correctly specified, then the t-value of yhat ^ 2 should be insignificant, as formalized right after this list.
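In other words, with yhat denoting the fitted value from the first model, the second-stage model is g(E[Y]) = b0 + b1 * yhat + b2 * yhat ^ 2, and the test examines H0: b2 = 0. In the macro below, yhat is the scored probability p1 and yhat ^ 2 is p2.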

The SAS macro shown below is the implementation of Pregibon test in the context of logistic regressions. However, the same idea can be generalized to any GLM.

%macro pregibon(data = , y = , x = );
***********************************************************;
* SAS MACRO PERFORMING PREGIBON TEST FOR GOODNESS OF LINK *;
* ======================================================= *;
* INPUT PAREMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  Y    : THE DEPENDENT VARIABLE WITH 0 / 1 VALUES        *;
*  X    : MODEL PREDICTORS                                *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;
options mprint mlogic nocenter;

%let links = logit probit cloglog;
%let loop = 1;

proc sql noprint;
  select count(*) - 3 into :df from &data;
quit; 

%do %while (%scan(&links, &loop) ne %str());

  %let link = %scan(&links, &loop);
  
  proc logistic data = &data noprint desc;
    model &y = &x / link = &link;
    score data = &data out = _out1;
  run;
  
  data _out2;
    set _out1(rename = (p_1 = p1));
    p2 = p1 * p1;
  run;
  
  ods listing close;
  ods output ParameterEstimates = _parm;  
  proc logistic data = _out2 desc;
    model &y = p1 p2 /  link = &link ;
  run;
  ods listing;
    
  %if &loop = 1 %then %do;
    data _parm1;
      format link $10.;
      set _parm(where = (variable = "p2"));
      link = upcase("&link");
    run;
  %end;
  %else %do;
    data _parm1;
      set _parm1 _parm(where = (variable = "p2") in = new);
      if new then link = upcase("&link");
    run;
  %end;
  
  data _parm2(drop = variable);
    set _parm1;
    _t = estimate / stderr;
    _df = &df;
    _p = (1 - probt(abs(_t), _df)) * 2;
  run;
  
  %let loop = %eval(&loop + 1);

%end;

title;
proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * PREGIBON TEST FOR GOODNESS OF LINK
           * H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY * "
         link _t _df _p);
  define link / "LINK FUNCTION" width = 15 order order = data;          
  define _t   / "T-VALUE"       width = 15 format = 12.4;
  define _df  / "DF"            width = 10;
  define _p   / "P-VALUE"       width = 15 format = 12.4;
run;

%mend;

After applying the macro to the kyphosis data (https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/kyphosis.html), we can see that both logit and probit can be considered appropriate link functions in this specific case and cloglog might not be a good choice.

             PREGIBON TEST FOR GOODNESS OF LINK
        H0: THE LINK FUNCTION IS SPECIFIED CORRECTLY

 LINK FUNCTION           T-VALUE         DF         P-VALUE
-----------------------------------------------------------
 LOGIT                   -1.6825         78          0.0965
 PROBIT                  -1.7940         78          0.0767
 CLOGLOG                 -2.3632         78          0.0206

Modified Park Test in SAS

The severity measure in operational loss models has an empirical distribution with positive values and a long tail to the far right. To estimate regression models for severity measures with such data characteristics, we can consider several candidate distributions, such as Lognormal, Gamma, Inverse Gaussian, and so on. A statistical approach is called for to choose the estimator with the correct distributional assumption. The modified Park test is designed to fill this gap.

For any GLM model, a general relationship between the variance and the mean can be described as below:

var(y | x) = alpha * [E(y | x)] ^ lambda

  • With lambda = 0, it is suggested that the variance is unrelated to the mean. In this case, a Gaussian distributional assumption should be considered.
  • With lambda = 1, it is suggested that the variance is proportional to the mean. In this case, a Poisson-like distributional assumption should be considered.
  • With lambda = 2, it is suggested that the variance is quadratic in the mean. In this case, a Gamma distributional assumption should be considered.
  • With lambda = 3, it is suggested that the variance is cubic in the mean. In this case, an Inverse Gaussian distributional assumption should be considered.

Without loss of generality, the aforementioned logic can be further formulated as below, given E(y | x) = yhat for an arbitrary estimator. As mentioned by Manning and Mullahy (2001), a Gamma estimator can be considered a natural baseline estimator.

var(y | x) = alpha * [E(y | x)] ^ lambda
--> (y - yhat) ^ 2 = alpha * [yhat] ^ lambda
--> log[(y - yhat) ^ 2] = log(alpha) + lambda * log(yhat)

With the above formulation, there are two ways to construct the statistical test for lambda, which is the so-called “modified Park test”.

In the OLS regression setting, the log of the squared residuals from the baseline estimator can be regressed on a constant and the log of the predicted values from the baseline estimator, e.g. a Gamma regression.
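For instance, the variables ln_r2 and ln_yhat used in the snippet below could be prepared from a baseline Gamma regression as in the following sketch; the dataset and variable names (raw, severity, x1, x2) are illustrative assumptions.

* assumed baseline Gamma regression - severity, x1, and x2 are placeholder names;
proc genmod data = raw;
  model severity = x1 x2 / dist = gamma link = log;
  output out = data p = yhat;
run;

data data;
  set data;
  r2 = (severity - yhat) ** 2;   * squared residual;
  ln_r2 = log(r2);               * log of the squared residual;
  ln_yhat = log(yhat);           * log of the predicted value;
run;

In the NLMIXED snippet further below, the predictor x plays the role of ln_yhat and r2 is the squared residual.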

proc reg data = data;
  model ln_r2 = ln_yhat;
  park_test: test ln_yhat = 2;
run;

In the demonstrated example, we want to test the null hypothesis that the coefficient of ln_yhat is equal to 2, which would suggest a Gamma distributional assumption.

Alternatively, in the GLM setting, the squared residuals from the baseline estimator can be regressed on a constant and the log of predicted values from the baseline estimator. In this specific GLM, the Gamma distribution and the log() link function should be employed.

proc nlmixed data = data;
  parms b0 = 1 b1 = 2 scale = 10;
  mu = exp(b0 + b1 * x);
  b = mu / scale;
  model r2 ~ gamma(scale, b);
  contrast 'park test' b1 - 2;
run;

Similarly, if the null hypothesis that the coefficient minus 2 is not statistically different from 0 cannot be rejected, then the Gamma distributional assumption is valid based on the modified Park test.

Parameter Estimation of Pareto Type II Distribution with NLMIXED in SAS

In several previous posts, I’ve shown how to estimate severity models under various distributional assumptions, including Lognormal, Gamma, and Inverse Gaussian. However, I am not satisfied with the fact that the supporting domain of the aforementioned distributions doesn’t include the value at ZERO.

Today, I spent some time looking into another interesting distribution, namely the Pareto Type II distribution, and the possibility of estimating the corresponding regression model. The Pareto Type II distribution, which is also called the Lomax distribution, is a special case of the Pareto distribution such that its supporting domain starts at ZERO (>= 0) with a long tail to the right, making it a good candidate for severity or loss distributions. This distribution can be described by two parameters, a scale parameter “Lambda” and a shape parameter “Alpha”, such that prob(y) = Alpha / Lambda * (1 + y / Lambda) ^ (-(1 + Alpha)) with the mean E(y) = Lambda / (Alpha - 1) for Alpha > 1 and Var(y) = Lambda ^ 2 * Alpha / [(Alpha - 1) ^ 2 * (Alpha - 2)] for Alpha > 2.

With this re-parameterization, Alpha and Lambda can be further expressed in terms of E(y) = mu and Var(y) = sigma2 such that Alpha = 2 * sigma2 / (sigma2 - mu ^ 2) and Lambda = mu * ((sigma2 + mu ^ 2) / (sigma2 - mu ^ 2)). Below is an example showing how to estimate the mean and the variance by using the likelihood function of the Lomax distribution with the SAS NLMIXED procedure.

data test;
  do i = 1 to 100;
    y = exp(rannor(1));
    output;
  end;
run;

proc nlmixed data = test tech = trureg;
  parms _c_ = 0 ln_sigma2 = 1;
  mu = exp(_c_);
  sigma2 = exp(ln_sigma2);
  alpha = 2 * sigma2 / (sigma2 - mu ** 2);
  lambda = mu * ((sigma2 + mu ** 2) / (sigma2 - mu ** 2));
  lh = alpha / lambda * ( 1 + y/ lambda) ** (-(alpha + 1));
  ll = log(lh);
  model y ~ general(ll);
  predict mu out = pred (rename = (pred = mu));
run;  

proc means data = pred;
  var mu y;
run;

With the above setting, it is very doable to estimate a regression model with the Lomax distributional assumption. However, in order to make it useful in production, I still need to find an effective way to ensure the estimation converges after including covariates in the model.

Test Drive Proc Lua – Convert SAS Table to 2-Dimension Lua Table

data one (drop = i);
  array a x1 x2 x3 x4 x5;
  do i = 1 to 5;
    do over a;
      a = ranuni(i);
    end;
    output;
  end;
run;

proc lua;
submit;
  local ds = sas.open("one")
  local tbl = {}
  for var in sas.vars(ds) do
    tbl[var.name] = {}
  end 
  
  while sas.next(ds) do
    for i, v in pairs(tbl) do
      table.insert(tbl[i], sas.get_value(ds, i))
    end
  end
  sas.close(ds) 

  for i, item in pairs(tbl) do
    print(i, table.concat(item, " "))
  end
  
endsubmit;
run;

Copas Test for Overfitting in SAS

Overfitting is a concern for overly complex models. When a model suffers from overfitting, it tends to over-explain the training data and can’t generalize well in the out-of-sample (OOS) prediction. Many statistical measures, such as the adjusted R-squared and various information criteria, have been developed to guard against overfitting. However, these statistics are more suggestive than conclusive.

To test the null hypothesis of no overfitting, the Copas statistic is a convenient statistical measure to detect overfitting and is based upon the fact that the conditional expectation of a response, e.g. E(Y|Y_oos), can be expressed as a linear function of its out-of-sample prediction Y_oos. For a model without the overfitting problem, E(Y|Y_oos) and Y_oos should be equal. In his research work, Copas also showed that this method can be generalized to the entire GLM family.

The implementation routine of the Copas test is outlined below.
- First of all, given a testing data sample, we generate the out-of-sample prediction Y_oos, which could be derived from multiple approaches, such as n-fold, split-sample, or leave-one-out.
- Next, we fit a simple OLS regression between the observed Y and the out-of-sample prediction Y_oos such that Y = B0 + B1 * Y_oos.
- If the null hypothesis B0 = 0 and B1 = 1 is not rejected, then there is no concern about overfitting.

Below is a SAS implementation of the Copas test for a Poisson regression based on LOO predictions, which can be easily generalized to other cases with a few tweaks.

%macro copas(data = , y = , x = );
*************************************************;
*         COPAS TEST FOR OVERFITTING            *;
* ============================================= *;
* INPUT PARAMETERS:                             *;
*  DATA: A SAS DATASET INCLUDING BOTH DEPENDENT *;
*        AND INDEPENDENT VARIABLES              *; 
*  Y   : THE DEPENDENT VARIABLE                 *;
*  X   : A LIST OF INDEPENDENT VARIABLES        *;
* ============================================= *;
* Reference:                                    *;
* Measuring Overfitting and Mispecification in  *;
* Nonlinear Models                              *;
*************************************************;
options mprint mlogic symbolgen;

data _1;
  set &data;
  _id = _n_;
  keep _id &x &y;
run;  

proc sql noprint;
  select count(*) into :cnt from _1;
quit;  

%do i = 1 %to &cnt;
ods select none;
proc genmod data = _1;
  where _id ~= &i;
  model &y = &x / dist = poisson link = log;
  store _est;
run;  
ods select all;

proc plm source = _est noprint;
  score data = _1(where = (_id = &i)) out = _2 / ilink;
run;

%if &i = 1 %then %do;
  data _3;
    set _2;
  run;
%end;
%else %do;
  proc append base = _3 data = _2;
  run;
%end;

%end;

title "H0: No Overfitting (B0 = 0 and B1 = 1)";
ods select testanova;
proc reg data = _3;
  Copas_Test: model &y = predicted;
  Copas_Statistic: test intercept = 0, predicted = 1;
run;
quit;

%mend;
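As a hypothetical example of calling the macro with the credit_count data used elsewhere in these notes (keep in mind that the leave-one-out loop refits the model once per record and can therefore be slow on large datasets):

* hypothetical call with a count outcome;
%copas(data = mylib.credit_count, y = majordrg, x = age acadmos minordrg ownrent);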

SAS Macro Calculating Mutual Information

In statistics, various correlation measures, either Spearman or Pearson, have been used to gauge the dependence between two data vectors under the linear or monotonic assumption. Mutual Information (MI) is an alternative widely used in information theory and is considered a more general measure of the dependence between two vectors. More specifically, MI quantifies how much information two vectors, regardless of their actual values, might share based on their joint and marginal probability distribution functions.
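For two vectors X and Y with joint distribution p(x, y) and marginals p(x) and p(y), the quantities computed by the macro below are

MI(X, Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y)))
NMI(X, Y) = MI(X, Y) / sqrt(H(X) * H(Y))

where H(X) and H(Y) are the marginal entropies. The macro evaluates these on the empirical distribution, treating each record as carrying probability 1 / N.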

Below is a SAS macro implementing MI and normalized MI by mimicking the Python functions mutual_info_score() and normalized_mutual_info_score(). Although MI is used to evaluate the cluster analysis performance in the sklearn package, it can also be a useful tool for feature selection in the context of machine learning and statistical modeling.

%macro mutual(data = , x = , y = );
***********************************************************;
* SAS MACRO CALCULATING MUTUAL INFORMATION AND ITS        *;
* NORMALIZED VARIANT BETWEEN TWO VECTORS BY MIMICKING     *;
* SKLEARN.METRICS.NORMALIZED_MUTUAL_INFO_SCORE()          *;
* SKLEARN.METRICS.MUTUAL_INFO_SCORE() IN PYTHON           *;
* ======================================================= *;
* INPUT PAREMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  X    : FIRST INPUT VECTOR                              *;
*  Y    : SECOND INPUT VECTOR                             *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

data _1;
  set &data;
  where &x ~= . and &y ~= .;
  _id = _n_;
run;

proc sql;
create table
  _2 as
select
  _id,
  &x,
  &y,
  1 / (select count(*) from _1) as _p_xy
from
  _1;

create table
  _3 as
select
  _id,
  &x         as _x,
  sum(_p_xy) as _p_x,
  sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_x
from 
  _2
group by
  &x;

create table
  _4 as
select
  _id,
  &y         as _y,
  sum(_p_xy) as _p_y,
  sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_y
from 
  _2
group by
  &y;

create table
  _5 as
select
  a.*,
  b._p_x,
  b._h_x,
  c._p_y,
  c._h_y,
  a._p_xy * log(a._p_xy / (b._p_x * c._p_y)) as mutual
from
  _2 as a, _3 as b, _4 as c
where
  a._id = b._id and a._id = c._id;

select
  sum(mutual) as MI format = 12.8,
  case 
    when sum(mutual) = 0 then 0
    else sum(mutual) / (sum(_h_x) * sum(_h_y)) ** 0.5 
  end as NMI format = 12.8
from
  _5;
quit;

%mend mutual;
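A hypothetical call, again using the accepts data from earlier as a placeholder:

* hypothetical call - any two numeric vectors would work;
%mutual(data = accepts, x = bureau_score, y = bad);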

Scorecard Development with Data from Multiple Sources

This week, one of my friends asked me a very interesting and practical question about scorecard development. The model development data were collected from multiple independent sources with various data sizes, heterogeneous risk profiles, and different bad rates. While the performance statistics seem satisfactory on the model training dataset, the model doesn’t generalize well with new accounts that might come from an unknown source. The situation is very common in a consulting company where a risk or marketing model is sometimes developed with data from multiple organizations.

To better understand the issue, I simulated a dataset consisting of two groups. In the dataset, x0 and x1 govern the group segmentation, while x2 and x3 define the bad definition. It is important to point out that the group information “grp” is only known in the model development sample but is unknown in the production population. Therefore, the variable “grp”, albeit predictive, cannot be explicitly used in the model estimation.

data one;
  do i = 1 to 100000;
    x0 = ranuni(0);
    x1 = ranuni(1);
    x2 = ranuni(2);
    x3 = ranuni(3);
    if 1 + x0 * 2 + x1 * 4 + rannor(1) > 5 then do;
      grp = 1;
      if x2 * 2 + x3 * 4 + rannor(2) > 5 then bad = 1;
    	else bad = 0;
    end;
    else do;
      grp = 0;
      if x2 * 4 + x3 * 2 + rannor(3) > 4 then bad = 1;
    	else bad = 0;
    end;
    output;
  end;
run;

Our first approach is to use all variables x0 - x3 to build a logistic regression and then evaluate the model both in aggregate and by group.

proc logistic data = one desc noprint;
  model bad = x0 x1 x2 x3;
  score data = one out = mdl1 (rename = (p_1 = score1));
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1
                                MAXIMUM KS = 59.5763 AT SCORE POINT 0.2281
               ( AUC STATISTICS = 0.8800, GINI COEFFICIENT = 0.7599, DIVERGENCE = 2.6802 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6800     0.9699       2,057      7,943     10,000   79.43%      79.43%    33.81%      33.81%
   |      0.4679     0.6799       4,444      5,556     10,000   55.56%      67.50%    23.65%      57.46%
   |      0.3094     0.4679       6,133      3,867     10,000   38.67%      57.89%    16.46%      73.92%
   |      0.1947     0.3094       7,319      2,681     10,000   26.81%      50.12%    11.41%      85.33%
   |      0.1181     0.1946       8,364      1,636     10,000   16.36%      43.37%     6.96%      92.29%
   |      0.0690     0.1181       9,044        956     10,000    9.56%      37.73%     4.07%      96.36%
   |      0.0389     0.0690       9,477        523     10,000    5.23%      33.09%     2.23%      98.59%
   |      0.0201     0.0389       9,752        248     10,000    2.48%      29.26%     1.06%      99.64%
   V      0.0085     0.0201       9,925         75     10,000    0.75%      26.09%     0.32%      99.96%
 GOOD     0.0005     0.0085       9,991          9     10,000    0.09%      23.49%     0.04%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9699      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.0327 AT SCORE POINT 0.2457
               ( AUC STATISTICS = 0.8872, GINI COEFFICIENT = 0.7744, DIVERGENCE = 2.8605 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7086     0.9699       1,051      6,162      7,213   85.43%      85.43%    30.51%      30.51%
   |      0.5019     0.7086       2,452      4,762      7,214   66.01%      75.72%    23.58%      54.10%
   |      0.3407     0.5019       3,710      3,504      7,214   48.57%      66.67%    17.35%      71.45%
   |      0.2195     0.3406       4,696      2,517      7,213   34.90%      58.73%    12.46%      83.91%
   |      0.1347     0.2195       5,650      1,564      7,214   21.68%      51.32%     7.74%      91.66%
   |      0.0792     0.1347       6,295        919      7,214   12.74%      44.89%     4.55%      96.21%
   |      0.0452     0.0792       6,737        476      7,213    6.60%      39.42%     2.36%      98.56%
   |      0.0234     0.0452       7,000        214      7,214    2.97%      34.86%     1.06%      99.62%
   V      0.0099     0.0234       7,150         64      7,214    0.89%      31.09%     0.32%      99.94%
 GOOD     0.0007     0.0099       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0007     0.9699      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 1))
                                MAXIMUM KS = 53.0942 AT SCORE POINT 0.2290
               ( AUC STATISTICS = 0.8486, GINI COEFFICIENT = 0.6973, DIVERGENCE = 2.0251 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5863     0.9413       1,351      1,435      2,786   51.51%      51.51%    43.48%      43.48%
   |      0.3713     0.5862       2,136        651      2,787   23.36%      37.43%    19.73%      63.21%
   |      0.2299     0.3712       2,340        446      2,786   16.01%      30.29%    13.52%      76.73%
   |      0.1419     0.2298       2,525        262      2,787    9.40%      25.07%     7.94%      84.67%
   |      0.0832     0.1419       2,584        202      2,786    7.25%      21.50%     6.12%      90.79%
   |      0.0480     0.0832       2,643        144      2,787    5.17%      18.78%     4.36%      95.15%
   |      0.0270     0.0480       2,682        104      2,786    3.73%      16.63%     3.15%      98.30%
   |      0.0140     0.0270       2,741         46      2,787    1.65%      14.76%     1.39%      99.70%
   V      0.0058     0.0140       2,776         10      2,786    0.36%      13.16%     0.30%     100.00%
 GOOD     0.0005     0.0058       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0005     0.9413      24,564      3,300     27,864

As shown in the output above, while the overall model performance looks acceptable, the model doesn't generalize well to the smaller 2nd group. Although the overall KS could be as high as 60, the KS for the 2nd group is merely 53. The reason is that the overall model fit is heavily influenced by the larger 1st group, and the estimated model is therefore biased toward the risk profile of the 1st group.

To alleviate the bias in the first model, we could first introduce a look-alike model driven by x0 – x1 to profile the group membership and then build two separate risk models with x2 – x3 only, one for the 1st group and one for the 2nd group. As a result, the final predicted probability is the composite of all three sub-models, as shown below. The model evaluation is also provided for comparison with the first model.

proc logistic data = one desc noprint;
  where grp = 0;
  model bad = x2 x3;
  score data = one out = mdl20(rename = (p_1 = p_10));
run;

proc logistic data = one desc noprint;
  where grp = 1;
  model bad = x2 x3;
  score data = one out = mdl21(rename = (p_1 = p_11));
run;

proc logistic data = one desc noprint;
  model grp = x0 x1;
  score data = one out = seg;
run;

data mdl2;
  merge seg mdl20 mdl21;
  by i;
  score2 = p_10 * (1 - p_1) + p_11 * p_1;
run;

                            GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2
                                MAXIMUM KS = 60.6234 AT SCORE POINT 0.2469
               ( AUC STATISTICS = 0.8858, GINI COEFFICIENT = 0.7715, DIVERGENCE = 2.8434 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.6877     0.9677       2,011      7,989     10,000   79.89%      79.89%    34.00%      34.00%
   |      0.4749     0.6876       4,300      5,700     10,000   57.00%      68.45%    24.26%      58.27%
   |      0.3125     0.4748       6,036      3,964     10,000   39.64%      58.84%    16.87%      75.14%
   |      0.1932     0.3124       7,451      2,549     10,000   25.49%      50.51%    10.85%      85.99%
   |      0.1142     0.1932       8,379      1,621     10,000   16.21%      43.65%     6.90%      92.89%
   |      0.0646     0.1142       9,055        945     10,000    9.45%      37.95%     4.02%      96.91%
   |      0.0345     0.0646       9,533        467     10,000    4.67%      33.19%     1.99%      98.90%
   |      0.0166     0.0345       9,800        200     10,000    2.00%      29.29%     0.85%      99.75%
   V      0.0062     0.0166       9,946         54     10,000    0.54%      26.10%     0.23%      99.98%
 GOOD     0.0001     0.0062       9,995          5     10,000    0.05%      23.49%     0.02%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9677      76,506     23,494    100,000

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 0))
                                MAXIMUM KS = 61.1591 AT SCORE POINT 0.2458
               ( AUC STATISTICS = 0.8880, GINI COEFFICIENT = 0.7759, DIVERGENCE = 2.9130 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.7221     0.9677       1,075      6,138      7,213   85.10%      85.10%    30.40%      30.40%
   |      0.5208     0.7221       2,436      4,778      7,214   66.23%      75.66%    23.66%      54.06%
   |      0.3533     0.5208       3,670      3,544      7,214   49.13%      66.82%    17.55%      71.61%
   |      0.2219     0.3532       4,726      2,487      7,213   34.48%      58.73%    12.32%      83.92%
   |      0.1309     0.2219       5,617      1,597      7,214   22.14%      51.41%     7.91%      91.83%
   |      0.0731     0.1309       6,294        920      7,214   12.75%      44.97%     4.56%      96.39%
   |      0.0387     0.0731       6,762        451      7,213    6.25%      39.44%     2.23%      98.62%
   |      0.0189     0.0387       7,009        205      7,214    2.84%      34.86%     1.02%      99.63%
   V      0.0074     0.0189       7,152         62      7,214    0.86%      31.09%     0.31%      99.94%
 GOOD     0.0002     0.0073       7,201         12      7,213    0.17%      27.99%     0.06%     100.00%
       ========== ========== ========== ========== ==========
          0.0002     0.9677      51,942     20,194     72,136

                  GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 1))
                                MAXIMUM KS = 57.6788 AT SCORE POINT 0.1979
               ( AUC STATISTICS = 0.8717, GINI COEFFICIENT = 0.7434, DIVERGENCE = 2.4317 )

          MIN        MAX           GOOD        BAD      TOTAL    BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #    RATE      BAD RATE  PERCENT      PERCENT
 --------------------------------------------------------------------------------------------------------
  BAD     0.5559     0.9553       1,343      1,443      2,786   51.79%      51.79%    43.73%      43.73%
   |      0.3528     0.5559       2,001        786      2,787   28.20%      40.00%    23.82%      67.55%
   |      0.2213     0.3528       2,364        422      2,786   15.15%      31.71%    12.79%      80.33%
   |      0.1372     0.2213       2,513        274      2,787    9.83%      26.24%     8.30%      88.64%
   |      0.0840     0.1372       2,588        198      2,786    7.11%      22.42%     6.00%      94.64%
   |      0.0484     0.0840       2,683        104      2,787    3.73%      19.30%     3.15%      97.79%
   |      0.0256     0.0483       2,729         57      2,786    2.05%      16.84%     1.73%      99.52%
   |      0.0118     0.0256       2,776         11      2,787    0.39%      14.78%     0.33%      99.85%
   V      0.0040     0.0118       2,781          5      2,786    0.18%      13.16%     0.15%     100.00%
 GOOD     0.0001     0.0040       2,786          0      2,786    0.00%      11.84%     0.00%     100.00%
       ========== ========== ========== ========== ==========
          0.0001     0.9553      24,564      3,300     27,864

Comparing KS statistics from the two modeling approaches, we can see that, while the performance of the 2nd approach on the overall sample is only slightly better than that of the 1st approach, the KS on the smaller 2nd group, i.e. grp = 1, increases from 53 to 58, an improvement of roughly 8.6%. While the example covers only two groups, it is straightforward to generalize the scheme to more than two groups, as sketched below.
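
For reference, below is a minimal sketch of how the same scheme might look with three groups. It is not part of the original example and assumes that grp takes the values 0, 1, and 2 and that the SCORE statement of the multinomial (generalized logit) segment model names the group probabilities P_0 through P_2.

proc logistic data = one noprint;
  model grp = x0 x1 / link = glogit;
  *** RENAME GROUP PROBABILITIES TO AVOID CLASHING WITH P_1 FROM THE RISK MODELS ***;
  score data = one out = seg (rename = (p_0 = pg_0 p_1 = pg_1 p_2 = pg_2));
run;

%macro grp_mdl;
%do g = 0 %to 2;
proc logistic data = one desc noprint;
  where grp = &g;
  model bad = x2 x3;
  score data = one out = mdl2&g (keep = i p_1 rename = (p_1 = p_1&g));
run;
%end;
%mend grp_mdl;
%grp_mdl;

data mdl2;
  merge seg mdl20 mdl21 mdl22;
  by i;
  *** WEIGHT EACH GROUP-LEVEL RISK SCORE BY THE CORRESPONDING GROUP PROBABILITY ***;
  score2 = p_10 * pg_0 + p_11 * pg_1 + p_12 * pg_2;
run;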

Duplicate Breusch-Godfrey Test Logic in SAS Autoreg Procedure

Since it appears that SAS and R might give slightly different B-G test results, I spent a couple of hours duplicating the logic of the B-G test implemented in the SAS AUTOREG procedure. In essence, the test regresses the model residuals on the original regressors plus the lagged residuals and uses N * R-squared from this auxiliary regression as a Chi-square statistic with the degrees of freedom equal to the testing order. The resulting SAS macro should give my team more flexibility to perform the B-G test in CCAR 2017 model developments in cases where models will not be estimated with the AUTOREG procedure.

B-G Test with Proc Autoreg

data one;
  do i = 1 to 100;
    x1 = uniform(1);
    x2 = uniform(2);
    r  = normal(3) * 0.5;
    y = 10 + 8 * x1 + 6 * x2 + r;
    output;
  end;
run;

proc autoreg data = one;
  model y = x1 x2 / godfrey = 4;
run;
quit;

/*
Godfrey's Serial Correlation Test

Alternative            LM    Pr > LM
AR(1)              0.2976     0.5854
AR(2)              1.5919     0.4512
AR(3)              1.7168     0.6332
AR(4)              1.7839     0.7754
*/

Home-brew SAS Macro

%macro bgtest(data = , r = , x = , order = 4);
options nocenter nonumber nodate mprint mlogic symbolgen
        formchar = "|----|+|---+=|-/\<>*";

proc sql noprint;
select
  mean(&r) format = 12.8 into :m
from
  &data;
quit;

data _1 (drop = _i);
  set &data (keep = &r &x);
  %do i = 1 %to &order;
    _lag&i._&r = lag&i.(&r);
  %end;
  _i + 1;
  _index = _i - &order;
  array _l _lag:;
  do over _l;
    if _l = . then _l = &m;
  end;
run;
 
proc reg data = _last_ noprint;
  model &r =  &x _lag:;
  output out = _2 p = rhat;
run;
 
proc sql noprint;  
create table
  _result as
select
  sum((rhat - &m) ** 2) / sum((&r - &m) ** 2)  as _r2,
  (select count(*) from _2) * calculated _r2   as _chisq,
  1 - probchi(calculated _chisq, &order.)      as _p_chisq,
  &order                                       as _df
from
  _2;
quit;

title;
proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
           * H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO &order * "
          _chisq _df _p_chisq);
  define _chisq   / "CHI-SQUARE" width = 20 format = 15.10;
  define _df      / "DF"         width = 10;
  define _p_chisq / "P-VALUE"    width = 20 format = 15.10;
run;
%mend bgtest;

proc reg data = one noprint;
  model y = x1 x2;
  output out = two r = r2;
run;
quit;

data _null_;
  do i = 1 to 4;
    call execute('%bgtest(data = two, r = r2, x = x1 x2, order = '||put(i, 2.)||');');
  end;
run;

/*
       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 1
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         0.2976458421          1         0.5853620441

       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 2
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         1.5918785412          2         0.4511572771

       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 3
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         1.7167785901          3         0.6332099963

       BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
 H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 4
           CHI-SQUARE         DF              P-VALUE
 -------------------------------------------------------
         1.7839349922          4         0.7754201982
*/

Calculating ACF with Data Step Only

In SAS/ETS, it is trivial to calculate the ACF of a time series with the ARIMA procedure. The downside is that, in addition to the ACF, you will get more output than necessary without seeing the underlying mechanism. The SAS macro below is a clean routine written with simple data steps and SQL, showing step by step how to calculate the ACF and generating nothing but a table with the ACF and the related lag, without using the SAS/ETS module at all. It is easy to write a wrapper around this macro for any further analysis.

%macro acf(data = , var = , out = acf);
***********************************************************;
* SAS MACRO CALCULATING AUTOCORRELATION FUNCTION WITH     *;
* DATA STEP ONLY                                          *;
* ======================================================= *;
* INPUT PARAMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  VAR  : THE TIME SERIES TO CALCULATE THE ACF FOR        *;
* ======================================================= *;
* OUTPUT:                                                 *;
*  OUT : AN OUTPUT SAS DATA TABLE WITH ACF AND LAG        *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

%local nobs;
data _1 (keep = &var);
  set &data end = eof;
  if eof then do;
    call execute('%let nobs = '||put(_n_, 8.)||';');
  end;
run;

proc sql noprint;
  select mean(&var) into :mean_x from _last_;
quit;

%do i = 1 %to %eval(&nobs - 1);

  data _2(keep = _:);
    set _1;
    _x = &var;
    _lag = lag&i.(_x);
  run;

  proc sql ;
  create table
    _3 as
  select
    (_x - &mean_x) ** 2               as _den,
    (_x - &mean_x) * (_lag - &mean_x) as _num
  from
    _last_;

  create table
    _4 as
  select
    &i                    as lag,
    sum(_num) / sum(_den) as acf
  from
    _last_;

  %if &i = 1 %then %do;
  create table 
    &out as
  select
    *
  from
    _4;
  %end;
  %else %do;
  insert into &out
  select
    *
  from
    _4;
  %end;

  drop table _2, _3, _4;
  quit;
%end;

%mend acf;
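
As a quick illustration, the call below runs the macro against a small simulated series; the dataset and variable names are arbitrary placeholders.

*** A MINIMAL USAGE SKETCH WITH A SIMULATED SERIES (PLACEHOLDER NAMES) ***;
data ts;
  do t = 1 to 50;
    x = rannor(1);
    output;
  end;
run;

%acf(data = ts, var = x, out = acf_out);

proc print data = acf_out noobs;
run;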

Estimate Quasi-Binomial Model with GENMOD Procedure in SAS

In my previous post (https://statcompute.wordpress.com/2015/11/01/quasi-binomial-model-in-sas/), I've shown why Quasi-Binomial models are of interest to financial practitioners and how to estimate them with the GLIMMIX procedure in SAS.

In the demonstration below, I will show an example of how to estimate a Quasi-Binomial model with the GENMOD procedure by specifying the VARIANCE and DEVIANCE statements. While the model estimation is a lot faster with GENMOD than with GLIMMIX in terms of CPU time, additional steps are necessary to ensure the correct statistical inference, namely rescaling the standard errors by the square root of the Pearson dispersion and basing the inference on the t distribution with the residual degrees of freedom.

ods listing close;
ods output modelfit = fit;
ods output parameterestimates = parm1;
proc genmod data = kyphosis;
  model y = age number start / link = logit noscale;
  variance v = _mean_ * (1 - _mean_);
  deviance d = (-2) * log((_mean_ ** _resp_) * ((1 - _mean_) ** (1 - _resp_)));
run;
ods listing;

proc sql noprint;
select
  distinct valuedf format = 12.8, df format = 8.0 into :disp, :df
from
  fit
where 
  index(criterion, "Pearson") > 0;

select
  value format = 12.4 into :ll
from
  fit
where
  criterion = "Log Likelihood";

select
  sum(df) into :k
from
  parm1;
quit;

%let aic = %sysevalf((-&ll + &k) * 2); 
%let bic = %sysevalf(-&ll * 2 + &k * %sysfunc(log(&df + &k)));

data parm2 (keep = parameter estimate stderr2 df t_value p_value);
  set parm1;
  where df > 0;

  *** SCALE THE STANDARD ERROR BY THE SQUARE ROOT OF THE PEARSON DISPERSION ***;
  stderr2 = stderr * (&disp ** 0.5);
  df = &df;
  t_value = estimate / stderr2;
  p_value = (1 - probt(abs(t_value), &df)) * 2;
run;

title;
proc report data = parm2 spacing = 1 headline nowindows split = "*";
  column(" Parameter Estimate of Quasi-Binomial Model "
         parameter estimate stderr2 t_value df p_value);
  compute before _page_;
    line @5 "Fit Statistics";
	line " ";
	line @3 "Quasi Log Likelihood: %sysfunc(round(&ll, 0.01))";
	line @3 "Quasi-AIC (smaller is better): %sysfunc(round(&aic, 0.01))";
	line @3 "Quasi-BIC (smaller is better): %sysfunc(round(&bic, 0.01))";
	line @3 "(Dispersion Parameter for Quasi-Binomial is %sysfunc(round(&disp, 0.0001)))";
	line " ";
  endcomp;
  define parameter / "Parameter" width = 10 order order = data;
  define estimate  / "Estimate" width = 10 format = 10.4;
  define stderr2   / "Std Err"  width = 10 format = 10.4;
  define t_value   / "T Value"  width = 10 format = 10.2;
  define df        / "DF"       width = 5  format = 4.0;
  define p_value   / "Pr > |t|" width = 10 format = 10.4;
run;
quit;
  
/*
    Fit Statistics

  Quasi Log Likelihood: -30.69
  Quasi-AIC (smaller is better): 69.38
  Quasi-BIC (smaller is better): 78.96
  (Dispersion Parameter for Quasi-Binomial is 0.9132)

          Parameter Estimate of Quasi-Binomial Model
 Parameter    Estimate    Std Err    T Value    DF   Pr > |t|
 ------------------------------------------------------------
 Intercept     -2.0369     1.3853      -1.47    77     0.1455
 Age            0.0109     0.0062       1.77    77     0.0800
 Number         0.4106     0.2149       1.91    77     0.0598
 Start         -0.2065     0.0647      -3.19    77     0.0020
*/

A More Flexible Ljung-Box Test in SAS

The Ljung-Box test is an important diagnostic to check whether residuals from a time series model are independently distributed. In the SAS/ETS module, it is easy to perform the Ljung-Box test with the ARIMA procedure. However, test outputs are only provided for lags 6, 12, 18, and so on, which cannot be changed by any option.

data one;
  do i = 1 to 100;
    x = uniform(1);
	output;
  end;
run;

proc arima data = one;
  identify var = x whitenoise = ignoremiss;
run;
quit;
/*
                            Autocorrelation Check for White Noise

 To        Chi-             Pr >
Lag      Square     DF     ChiSq    --------------------Autocorrelations--------------------
  6        5.49      6    0.4832     0.051    -0.132     0.076    -0.024    -0.146     0.064
 12        6.78     12    0.8719     0.050     0.076    -0.046    -0.025    -0.016    -0.018
 18       10.43     18    0.9169     0.104    -0.053     0.063     0.038    -0.085    -0.065
 24       21.51     24    0.6083     0.007     0.178     0.113    -0.046     0.180     0.079
*/

The SAS macro below is a more flexible way to perform the Ljung-Box test for any number of lags. As shown in the output, test results for lags 6 and 12 are identical to the ones directly from the ARIMA procedure.

%macro LBtest(data = , var = , lags = 4);
***********************************************************;
* SAS MACRO PERFORMING LJUNG-BOX TEST FOR INDEPENDENCE    *;
* ======================================================= *;
* INPUT PARAMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  VAR  : THE TIME SERIES TO TEST FOR INDEPENDENCE        *;
*  LAGS : THE NUMBER OF LAGS BEING TESTED                 *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

%local nlag; 

data _1 (keep = &var);
  set &data end = eof;
  if eof then do;
    call execute('%let nlag = '||put(_n_ - 1, 8.)||';');
  end;
run;

proc arima data = _last_;
  identify var = &var nlag = &nlag outcov = _2 noprint;
run;
quit;

%do i = 1 %to &lags;
  data _3;
    set _2;
	where lag > 0 and lag <= &i;
  run;

  proc sql noprint;
    create table
	  _4 as
	select
      sum(corr * corr / n) * (&nlag + 1) * (&nlag + 3) as _chisq,
	  1 - probchi(calculated _chisq, &i.)              as _p_chisq,
	  &i                                               as _df
	from
	  _last_;
  quit;

  %if &i = 1 %then %do;
  data _5;
    set _4;
  run;
  %end;
  %else %do;
  data _5;
    set _5 _4;
  run;
  %end;
%end;

title;
proc report data = _5 spacing = 1 headline nowindows split = "*";
  column(" * LJUNG-BOX TEST FOR WHITE NOISE *
           * H0: RESIDUALS ARE INDEPENDENTLY DISTRIBUTED UP TO LAG &lags * "
          _chisq _df _p_chisq);
  define _chisq   / "CHI-SQUARE" width = 20 format = 15.10;
  define _df      / "DF"         width = 10 order;
  define _p_chisq / "P-VALUE"    width = 20 format = 15.10;
run;

%mend LBtest;

%LBtest(data = one, var = x, lags = 12);

/*
             LJUNG-BOX TEST FOR WHITE NOISE
 H0: RESIDUALS ARE INDEPENDENTLY DISTRIBUTED UP TO LAG 12

           CHI-SQUARE         DF              P-VALUE
 ------------------------------------------------------
         0.2644425904          1         0.6070843322
         2.0812769288          2         0.3532290858
         2.6839655476          3         0.4429590625
         2.7428168168          4         0.6017432831
         5.0425834917          5         0.4107053939
         5.4851972398          6         0.4832476224
         5.7586229652          7         0.5681994829
         6.4067856029          8         0.6017645131
         6.6410385135          9         0.6744356312
         6.7142471241         10         0.7521182318
         6.7427585395         11         0.8195164211
         6.7783018413         12         0.8719097622
*/

SAS Macro Performing Breusch–Godfrey Test for Serial Correlation

%macro bgtest(data = , r = , x = , order = 1);
********************************************************************;
* SAS MACRO PERFORMING BREUSCH-GODFREY TEST FOR SERIAL CORRELATION *;
* BY FOLLOWING THE LOGIC OF BGTEST() IN R LMTEST PACKAGE           *;
* ================================================================ *;
* INPUT PARAMETERS:                                                *;
*  DATA  : INPUT SAS DATA TABLE                                    *;
*  R     : RESIDUALS TO TEST SERIAL CORRELATION                    *;
*  X     : INDEPENDENT VARIABLES IN THE ORIGINAL REGRESSION MODEL  *;
*  ORDER : THE ORDER OF SERIAL CORRELATION                         *;
* ================================================================ *;
* AUTHOR: WENSUI.LIU@53.COM                                        *;
********************************************************************;

data _1 (drop = _i);
  set &data (keep = &r &x);
  %do i = 1 %to &order;
    _lag&i._&r = lag&i.(&r);
  %end;
  _i + 1;
  _index = _i - &order;
  if _index > 0 then output;
run;

ods listing close;
proc reg data = _last_;
  model &r = &x _lag:;
  output out = _2 p = yhat;
run;

ods listing;
proc sql noprint;
create table
  _result as
select
  (select count(*) from _2) * sum(yhat ** 2) / sum(&r ** 2)   as _chisq,
  1 - probchi(calculated _chisq, &order.)                     as _p_chisq,
  &order                                                      as _df
from
  _2;
quit;

title;
proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * BREUSCH-GODFREY TEST FOR SERIAL CORRELATION
           * H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO &order * "
          _chisq _df _p_chisq);
  define _chisq   / "CHI-SQUARE" width = 20 format = 15.10;
  define _df      / "DF"         width = 10;
  define _p_chisq / "P-VALUE"    width = 20 format = 15.10;
run;

%mend bgtest;
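
The calling convention is identical to the home-brew version shown earlier. Below is a minimal usage sketch, assuming a dataset TWO that contains the OLS residuals R2 along with the regressors X1 and X2, as produced by the PROC REG step in the earlier example.

*** TEST SERIAL CORRELATION UP TO THE 4TH ORDER (ASSUMED INPUT FROM THE EARLIER PROC REG STEP) ***;
%bgtest(data = two, r = r2, x = x1 x2, order = 4);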

SAS Macro Calculating LOO Predictions for GLM

Last year, I wrote an R function calculating the Leave-One-Out (LOO) prediction in GLM (https://statcompute.wordpress.com/2015/12/13/calculate-leave-one-out-prediction-for-glm). Below is a SAS macro implementing the same function with Gamma regression as the example. However, it is trivial to extend it to models with other distributional assumptions.

%macro loo_gamma(data = , y = , x = , out = , out_var = _loo);
**********************************************************************;
* SAS MACRO CALCULATING LEAVE-ONE-OUT PREDICTIONS WITH THE TRAINING  *;
* SAMPLE AND PRESENTING DISTRIBUTIONS OF LOO PARAMETER ESTIMATES     *;
* ================================================================== *;
* INPUT PARAMETERS:                                                  *;
*  DATA   : INPUT SAS DATA TABLE                                     *;
*  Y      : DEPENDENT VARIABLE IN THE GAMMA MODEL                    *;
*  X      : NUMERIC INDEPENDENT VARIABLES IN THE MODEL               *;
*  OUT    : OUTPUT SAS DATA TABLE WITH LOO PREDICTIONS               *;
*  OUT_VAR: VARIABLE NAME OF LOO PREDICTIONS                         *;
* ================================================================== *;
* OUTPUTS:                                                           *;
* 1. A TABLE SHOWING DISTRIBUTIONS OF LOO PARAMETER ESTIMATES        *;
* 2. A SAS DATA TABLE WITH LOO PREDICTIONS                           *;
**********************************************************************;
options nocenter nonumber nodate mprint mlogic symbolgen;

data _1;
  retain &x &y;
  set &data (keep = &x &y);
  where &y ~= .;
  Intercept = 1;
  _i + 1;
run;

data _2;
  set _1 (keep = _i &x Intercept);
  array _x Intercept &x;
  do over _x;
    _name = upcase(vname(_x));
    _value = _x;
    output;
  end;
run;

proc sql noprint;
  select max(_i) into :nobs from _1;
quit;

%do i = 1 %to &nobs;

data _3;
  set _1;
  where _i ~= &i;
run;

ods listing close;
ods output ParameterEstimates = _est;
proc glimmix data = _last_;
  model &y = &x / solution dist = gamma link = log; 
run; 
ods listing;

proc sql;
create table
  _pred1 as
select
  a._i                  as _i,
  upcase(a._name)       as _name,
  a._value              as _value,
  b.estimate            as estimate,
  a._value * b.estimate as _xb
from
  _2 as a, _est as b
where
  a._i = &i and upcase(a._name) = upcase(b.effect); quit;

%if &i = 1 %then %do;
  data _pred2;
    set _pred1;
  run;
%end;
%else %do;
  data _pred2;
    set _pred2 _pred1;
  run;
%end;

%end;

proc summary data = _pred2 nway;
  class _name;
  output out = _eff (drop = _freq_ _type_)
  min(estimate)    = beta_min
  p5(estimate)     = beta_p05
  p10(estimate)    = beta_p10
  median(estimate) = beta_med
  p90(estimate)    = beta_p90
  p95(estimate)    = beta_p95
  max(estimate)    = beta_max
  mean(estimate)   = beta_avg
  stddev(estimate) = beta_std;
run;

title;
proc report data = _eff spacing = 1 headline nowindows split = "*";
  column(" * DISTRIBUTIONS OF LEAVE-ONE-OUT COEFFICIENTS *
             ESTIMATED FROM GAMMA REGRESSIONS * "
          _name beta_:);
  where upcase(_name) ~= 'INTERCEPT';
  define _name    / "BETA"    width = 20;
  define beta_min / "MIN"     width = 10 format = 10.4;
  define beta_p05 / '5%ILE'   width = 10 format = 10.4;
  define beta_p10 / '10%ILE'  width = 10 format = 10.4;
  define beta_med / 'MEDIAN'  width = 10 format = 10.4;
  define beta_p90 / '90%ILE'  width = 10 format = 10.4;
  define beta_p95 / '95%ILE'  width = 10 format = 10.4;
  define beta_max / "MAX"     width = 10 format = 10.4;
  define beta_avg / "AVERAGE" width = 10 format = 10.4;
  define beta_std / "STD DEV" width = 10 format = 10.4; 
run;

proc sql;
create table
  &out as
select
  a.*,
  b.out_var      as _xb,
  exp(b.out_var) as &out_var
from
  _1 (drop = intercept) as a,
  (select _i, sum(_xb) as out_var from _pred2 group by _i) as b where
  a._i = b._i;
quit;

%mend loo_gamma;
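
Below is a minimal usage sketch; the dataset SAMPLE and the variable names are placeholders only and should be replaced with the actual training data.

*** A MINIMAL USAGE SKETCH (PLACEHOLDER DATASET AND VARIABLE NAMES) ***;
%loo_gamma(data = sample, y = loss, x = x1 x2 x3, out = loo_pred, out_var = _loo);

*** COMPARE OBSERVED VALUES WITH LEAVE-ONE-OUT PREDICTIONS ***;
proc means data = loo_pred n mean;
  var loss _loo;
run;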

Quasi-Binomial Model in SAS

Similar to quasi-Poisson regressions, quasi-binomial regressions try to address the excessive variance by including a dispersion parameter. In addition to addressing over-dispersion, quasi-binomial regressions also demonstrate extra value in other areas, such as LGD model development in credit risk modeling, due to their flexible distributional assumption.

Measuring the ratio between NCO and GCO, LGD could take any value in the range [0, 1], and there is currently no unanimous consensus on its distributional assumption in the industry. An advantage of quasi-binomial regression is that it makes no assumption of a specific distribution but merely specifies the conditional mean for a given model response. The trade-off, as a result, is the lack of likelihood-based measures such as AIC and BIC.

Below is a demonstration of how to estimate a quasi-binomial model with the GLIMMIX procedure in SAS.

proc glimmix data = _last_;
  model y = age number start / link = logit solution;
  _variance_ = _mu_ * (1-_mu_);
  random _residual_;
run;  
/*
              Model Information
Data Set                     WORK.KYPHOSIS   
Response Variable            y               
Response Distribution        Unknown         
Link Function                Logit           
Variance Function            _mu_ * (1-_mu_) 
Variance Matrix              Diagonal        
Estimation Technique         Quasi-Likelihood
Degrees of Freedom Method    Residual        

                       Parameter Estimates 
                         Standard
Effect       Estimate       Error       DF    t Value    Pr > |t|
Intercept     -2.0369      1.3853       77      -1.47      0.1455
age           0.01093    0.006160       77       1.77      0.0800
number         0.4106      0.2149       77       1.91      0.0598
start         -0.2065     0.06470       77      -3.19      0.0020
Residual       0.9132           .        .        .         .    
*/

For comparison purposes, the same model is also estimated with the R glm() function, showing identical outputs.

summary(glm(data = kyphosis, Kyphosis ~ ., family = quasibinomial))
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)   
#(Intercept) -2.03693    1.38527  -1.470  0.14552   
#Age          0.01093    0.00616   1.774  0.07996 . 
#Number       0.41060    0.21489   1.911  0.05975 . 
#Start       -0.20651    0.06470  -3.192  0.00205 **
#---
#(Dispersion parameter for quasibinomial family taken to be 0.913249)

Estimating Quasi-Poisson Regression with GLIMMIX in SAS

When modeling the frequency measure in operational risk with regressions, most modelers prefer Poisson or Negative Binomial regressions as best practices in the industry. However, as an alternative approach, Quasi-Poisson regression provides a more flexible model estimation routine with at least two benefits. First of all, Quasi-Poisson regression is able to address both over-dispersion and under-dispersion by assuming that the variance is a function of the mean such that VAR(Y|X) = Theta * MEAN(Y|X), where Theta > 1 indicates over-dispersion and Theta < 1 indicates under-dispersion. Secondly, estimated coefficients with Quasi-Poisson regression are identical to the ones with the standard Poisson regression, which is considered the prevailing practice in the industry.

While Quasi-Poisson regression can be easily estimated with glm() in the R language, its estimation in SAS is not very straightforward. Luckily, with the GLIMMIX procedure, we can estimate Quasi-Poisson regression by directly specifying the functional relationship between the variance and the mean and making no distributional assumption in the MODEL statement, as demonstrated below.


proc glimmix data = credit_count;
  model MAJORDRG = AGE ACADMOS MINORDRG OWNRENT / link = log solution;
  _variance_ = _mu_;
  random _residual_;
run;
  
/*
              Model Information
 
Data Set                     WORK.CREDIT_COUNT
Response Variable            MAJORDRG        
Response Distribution        Unknown         
Link Function                Log             
Variance Function            _mu_             
Variance Matrix              Diagonal        
Estimation Technique         Quasi-Likelihood
Degrees of Freedom Method    Residual        
 
              Fit Statistics
 
-2 Log Quasi-Likelihood           19125.57
Quasi-AIC  (smaller is better)    19135.57
Quasi-AICC (smaller is better)    19135.58
Quasi-BIC  (smaller is better)    19173.10
Quasi-CAIC (smaller is better)    19178.10
Quasi-HQIC (smaller is better)    19148.09
Pearson Chi-Square                51932.87
Pearson Chi-Square / DF               3.86
 
                       Parameter Estimates
                         Standard
Effect       Estimate       Error       DF    t Value    Pr > |t|
 
Intercept     -1.3793     0.08613    13439     -16.01      <.0001
AGE           0.01039    0.002682    13439       3.88      0.0001
ACADMOS      0.001532    0.000385    13439       3.98      <.0001
MINORDRG       0.4611     0.01348    13439      34.22      <.0001
OWNRENT       -0.1994     0.05568    13439      -3.58      0.0003
Residual       3.8643           .        .        .         .   
*/

For comparison purposes, we also estimated a Quasi-Poisson regression in R, showing completely identical statistical results.


summary(glm(MAJORDRG ~ AGE + ACADMOS + MINORDRG + OWNRENT, data = credit_count, family = quasipoisson(link = "log")))
  
#               Estimate Std. Error t value Pr(>|t|)   
# (Intercept) -1.3793249  0.0861324 -16.014  < 2e-16 ***
# AGE          0.0103949  0.0026823   3.875 0.000107 ***
# ACADMOS      0.0015322  0.0003847   3.983 6.84e-05 ***
# MINORDRG     0.4611297  0.0134770  34.216  < 2e-16 ***
# OWNRENT     -0.1993933  0.0556757  -3.581 0.000343 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for quasipoisson family taken to be 3.864409)
# 
#     Null deviance: 24954  on 13443  degrees of freedom
# Residual deviance: 22048  on 13439  degrees of freedom
# AIC: NA

SAS Macro for Engle-Granger Co-integration Test

In the coursework of time series analysis, we've been taught that a time series regression of Y on X could be valid only when both X and Y are stationary, due to the so-called "spurious regression problem". However, there is one exception: if X and Y, albeit non-stationary, share a common trend such that their trends cancel each other out, then X and Y are co-integrated and the regression of Y on X is valid. As a result, it is important to test the co-integration between X and Y.

Following the definition of co-integration, it is straightforward to formulate a procedure for the co-integration test. First of all, construct a linear combination between Y and X such that e = Y – (a + b * X). Secondly, test if e is stationary with the ADF test. If e is stationary, then X and Y are co-integrated. This two-stage procedure is also called the Engle-Granger co-integration test.

Below is a SAS macro implementing the Engle-Granger co-integration test to show the long-run relationship between GDP and other macro-economic variables, e.g. Personal Consumption and Personal Disposable Income.

SAS Macro

%macro eg_coint(data = , y = , xs = );
*********************************************************************;
* THIS SAS MACRO IMPLEMENTS THE ENGLE-GRANGER COINTEGRATION TEST IN *;
* A BATCH MODE TO PROCESS MANY TIME SERIES                          *;
*********************************************************************;
* INPUT PARAMETERS:                                                 *;
*   DATA: AN INPUT SAS DATASET                                      *;
*   Y   : A DEPENDENT VARIABLE IN THE COINTEGRATION REGRESSION      *;
*   XS  : A LIST OF INDEPENDENT VARIABLES IN THE COINTEGRATION      *;
*         REGRESSION                                                *;
*********************************************************************;
* AUTHOR: WENSUI.LIU@53.COM                                         *;
*********************************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*";

%local sig loop;

%let sig = 0.1;

%let loop = 1;

%do %while (%scan(&xs, &loop) ne %str());

  %let x = %scan(&xs, &loop);

  ods listing close;
  ods output FitStatistics = _fit;
  proc reg data = &data;
    model &y = &x;
    output out = _1 residual = r;
  run;
  quit;

  proc sql noprint;
    select cvalue2 into :r2 from _fit where upcase(label2) = "R-SQUARE";
  quit;

  proc arima data = _1;
    ods output stationaritytests = _adf1 (where = (upcase(type) = "ZERO MEAN" and lags = 1) drop = rho probrho fvalue probf);
    identify var = r stationarity = (adf = 1);
  run;
  quit;
  ods listing;

  %if &loop = 1 %then %do;
    data _adf;
      format vars $32. lterm_r2 best12. flg_coint $3.;
      set _adf1 (drop = type lags);
      vars = upcase("&x");
      lterm_r2 = &r2;
      if probtau < &sig then flg_coint = "YES";
      else flg_coint = "NO";
    run;
  %end;
  %else %do;
    data _adf;
      set _adf _adf1 (in = new drop = type lags);
      if new then do;
        vars = upcase("&x");
        lterm_r2 = &r2;
        if probtau < &sig then flg_coint = "YES";
          else flg_coint = "NO";
      end;
    run;
  %end;
    
  %let loop = %eval(&loop + 1);
%end;

proc sort data = _last_;
  by descending flg_coint probtau;
run;

proc report data = _last_ box spacing = 1 split = "/" nowd;
  COLUMN("ENGLE-GRANGER COINTEGRATION TEST BETWEEN %UPCASE(&Y) AND EACH VARIABLE BELOW/ "
         vars lterm_r2 flg_coint tau probtau);
  define vars      / "VARIABLES"                      width = 35;
  define lterm_r2  / "LONG-RUN/R-SQUARED"             width = 15 format =  9.4 center;
  define flg_coint / "COINTEGRATION/FLAG"             width = 15 center;
  define tau       / "TAU STATISTIC/FOR ADF TEST"     width = 20 format = 15.4;
  define probtau   / "P-VALUE FOR/ADF TEST"           width = 15 format =  9.4 center;
run;

%mend eg_coint;

%eg_coint(data = sashelp.citiqtr, y = gdp, xs = gyd gc);

SAS Output

----------------------------------------------------------------------------------------------------------
|                  ENGLE-GRANGER COINTEGRATION TEST BETWEEN GDP AND EACH VARIABLE BELOW                  |
|                                                                                                        |
|                                       LONG-RUN      COINTEGRATION         TAU STATISTIC   P-VALUE FOR  |
|VARIABLES                              R-SQUARED         FLAG               FOR ADF TEST    ADF TEST    |
|--------------------------------------------------------------------------------------------------------|
|GC                                 |      0.9985   |      YES      |             -2.8651|      0.0051   |
|-----------------------------------+---------------+---------------+--------------------+---------------|
|GYD                                |      0.9976   |      YES      |             -1.7793|      0.0715   |
----------------------------------------------------------------------------------------------------------

From the output, it is interesting to see that GDP in the U.S. is driven more by Personal Consumption than by Personal Disposable Income.

SAS Macro to Test Stationarity in Batch

To determine if a time series is stationary or has a unit root, three methods can be used:

A. The most intuitive way, which is also sufficient in most cases, is to eyeball the ACF (Autocorrelation Function) plot of the time series. The ACF pattern with a fast decay might imply a stationary series.
B. Statistical tests for Unit Roots, e.g. ADF (Augmented Dickey–Fuller) or PP (Phillips–Perron) test, could be employed as well. With the Null Hypothesis of Unit Root, a statistically significant outcome might suggest a stationary series.
C. In addition to the aforementioned tests for Unit Roots, statistical tests for stationarity, e.g. the KPSS (Kwiatkowski–Phillips–Schmidt–Shin) test, might be a useful complement as well. With the Null Hypothesis of stationarity, a statistically insignificant outcome might suggest a stationary series.

By testing both the unit root and stationarity, the analyst should be able to have a better understanding about the data nature of a specific time series.

The SAS macro below is a convenient wrapper of stationarity tests for many time series in the production environment. (Please note that this macro only works for SAS 9.2 or above.)

%macro stationary(data = , vars =);
***********************************************************;
* THIS SAS MACRO IS TO DO STATIONARITY TESTS FOR MANY     *;
* TIME SERIES                                             *;
* ------------------------------------------------------- *;
* INPUT PARAMETERS:                                       *;
*   DATA: AN INPUT SAS DATASET                            *;
*   VARS: A LIST OF TIME SERIES                           *;
* ------------------------------------------------------- *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*";

%local sig loop;

%let sig = 0.1;

%let loop = 1;

%do %while (%scan(&vars, &loop) ne %str());

  %let x = %scan(&vars, &loop);

  proc sql noprint;
    select int(12 * ((count(&x) / 100) ** 0.25)) into :nlag1 from &data;

    select int(max(1, (count(&x) ** 0.5) / 5)) into :nlag2 from &data;
  quit;

  ods listing close;
  ods output kpss = _kpss (drop = model lags rename = (prob = probeta))
             adf = _adf  (drop = model lags rho probrho fstat probf rename = (tau = adf_tau probtau = adf_probtau))
             philperron = _pp  (drop = model lags rho probrho rename = (tau = pp_tau probtau = pp_probtau));
  proc autoreg data = &data;
    model &x = / noint stationarity = (adf = &nlag1, phillips = &nlag2, kpss = (kernel = nw lag = &nlag1));
  run;
  quit;
  ods listing;

  proc sql noprint;
    create table
      _1 as 
    select
      upcase("&x")           as vars length = 32,
      upcase(_adf.type)      as type,
      _adf.adf_tau,
      _adf.adf_probtau,
      _pp.pp_tau,
      _pp.pp_probtau,
      _kpss.eta,
      _kpss.probeta,
      case 
        when _adf.adf_probtau < &sig or _pp.pp_probtau < &sig or _kpss.probeta > &sig then "*"
        else " "
      end                    as _flg,
      &loop                  as _i,
      monotonic()            as _j
    from 
      _adf inner join _pp on _adf.type = _pp.type inner join _kpss on _adf.type = _kpss.type;
  quit;

  %if &loop = 1 %then %do;
    data _result;
      set _1;
    run;
  %end;
  %else %do;
    proc append base = _result data = _1;
    run;
  %end;

  proc datasets library = work nolist;
    delete _1 _adf _pp _kpss / memtype = data;
  quit;

  %let loop = %eval(&loop + 1);
%end;

proc sort data = _result;
  by _i _j;
run;

proc report data = _result box spacing = 1 split = "/" nowd;
  column("STATISTICAL TESTS FOR STATIONARITY/ "
         vars type adf_tau adf_probtau pp_tau pp_probtau eta probeta _flg);
  define vars        / "VARIABLES/ "                  width = 20 group order order = data;
  define type        / "TYPE/ "                       width = 15 order order = data;
  define adf_tau     / "ADF TEST/FOR/UNIT ROOT"       width = 10 format = 8.2;
  define adf_probtau / "P-VALUE/FOR/ADF TEST"         width = 10 format = 8.4 center;
  define pp_tau      / "PP TEST/FOR/UNIT ROOT"        width = 10 format = 8.2;
  define pp_probtau  / "P-VALUE/FOR/PP TEST"          width = 10 format = 8.4 center;
  define eta         / "KPSS TEST/FOR/STATIONARY"     width = 10 format = 8.2;
  define probeta     / "P-VALUE/FOR/KPSS TEST"        width = 10 format = 8.4 center;
  define _flg        / "STATIONARY/FLAG"              width = 10 center;
run;

%mend stationary;
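
For instance, the macro could be called against the same SASHELP series used in the co-integration example above; the variable list is only illustrative.

*** A MINIMAL USAGE SKETCH WITH THE SASHELP SERIES USED IN THE COINTEGRATION EXAMPLE ***;
%stationary(data = sashelp.citiqtr, vars = gdp gc gyd);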

Efficiency of Concatenating Two Large Tables in SAS

In SAS, there are 4 approaches to concatenate two datasets:
1. SET statement;
2. APPEND procedure (or the APPEND statement in the DATASETS procedure);
3. UNION ALL in the SQL procedure;
4. INSERT INTO in the SQL procedure.
Below is an efficiency comparison among these 4 approaches, showing that the APPEND procedure, followed by the SET statement, is the most efficient in terms of CPU time.

data one two;
  do i = 1 to 2000000;
    x = put(i, best12.);
    if i le 1000 then output one;
    else output two;
  end;
run;

data test1;
  set one two;
run;

data test2;
  set one;
run;

proc append base = test2 data = two;
run;

proc sql;
  create table
    test3 as
  select
    *
  from
    one

  union all

  select
    *
  from
    two;
quit;

data test4;
  set one;
run;

proc sql;
  insert into
    test4
  select
    *
  from
    two;
quit;

SAS Macro for Jarque-Bera Normality Test

%macro jarque_bera(data = , var = );
**********************************************;
* SAS macro to do Jarque-Bera Normality Test *;
* ------------------------------------------ *;
* wensui.liu@53.com                          *;
**********************************************;
options mprint mlogic symbolgen nodate;

ods listing close;
ods output moments = m1;
proc univariate data = &data normal;
  var &var.;
run;

proc sql noprint;
  select nvalue1 into :n from m1 where upcase(compress(label1, ' ')) = 'N';
  select put(nvalue1, best32.) into :s from m1 where upcase(compress(label1, ' ')) = 'SKEWNESS';
  select put(nvalue2, best32.) into :k from m1 where upcase(compress(label2, ' ')) = 'KURTOSIS';
quit;

data _temp_;
  jb = ((&s) ** 2 + (&k) ** 2 / 4) / 6 * &n;
  pvalue = 1 - probchi(jb, 2);
  put jb pvalue;
run;

ods listing;
proc print data = _last_ noobs;
run;

%mend jarque_bera;
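
The macro forms the JB statistic from the skewness and (excess) kurtosis reported by PROC UNIVARIATE, i.e. jb = n * (S ** 2 + K ** 2 / 4) / 6, and compares it against a Chi-square distribution with 2 degrees of freedom. Below is a minimal usage sketch with a simulated normal sample; the dataset and variable names are placeholders.

*** A MINIMAL USAGE SKETCH WITH A SIMULATED NORMAL SAMPLE (PLACEHOLDER NAMES) ***;
data normal_test;
  do i = 1 to 500;
    x = rannor(1);
    output;
  end;
run;

%jarque_bera(data = normal_test, var = x);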

Estimating Time Series Models for Count Outcomes with SAS

In SAS, there is no out-of-the-box procedure to estimate time series models for count outcomes similar to the one shown here (https://statcompute.wordpress.com/2015/03/31/modeling-count-time-series-with-tscount-package). However, as long as we understand the likelihood function of the Poisson distribution, it is straightforward to estimate a time series model with PROC MODEL in the ETS module.

Below is a demonstration of how to estimate a Poisson time series model with the identity link function. As shown, the parameter estimates and related inferences are extremely close to the ones estimated with tscount() in R.

data polio;
  idx + 1;
  input y @@;
datalines;
0  1  0  0  1  3  9  2  3  5  3  5  2  2  0  1  0  1  3  3  2  1  1  5  0
3  1  0  1  4  0  0  1  6 14  1  1  0  0  1  1  1  1  0  1  0  1  0  1  0
1  0  1  0  1  0  1  0  0  2  0  1  0  1  0  0  1  2  0  0  1  2  0  3  1
1  0  2  0  4  0  2  1  1  1  1  0  1  1  0  2  1  3  1  2  4  0  0  0  1
0  1  0  2  2  4  2  3  3  0  0  2  7  8  2  4  1  1  2  4  0  1  1  1  3
0  0  0  0  1  0  1  1  0  0  0  0  0  1  2  0  2  0  0  0  1  0  1  0  1
0  2  0  0  1  2  0  1  0  0  0  1  2  1  0  1  3  6
;
run;

proc model data = polio;
  parms b0 = 0.5 b1 = 0.1 b2 = 0.1;
  yhat = b0 + b1 * zlag1(y) + b2 * zlag1(yhat);
  y = yhat;
  lk = exp(-yhat) * (yhat ** y) / fact(y);
  ll = -log(lk);
  errormodel y ~ general(ll);
  fit y / fiml converge = 1e-8;
run;

/* OUTPUT:
            Nonlinear Liklhood Summary of Residual Errors

                  DF      DF                                        Adj
Equation       Model   Error        SSE        MSE   R-Square      R-Sq
y                  3     165      532.6     3.2277     0.0901    0.0791

           Nonlinear Liklhood Parameter Estimates

                              Approx                  Approx
Parameter       Estimate     Std Err    t Value     Pr > |t|
b0              0.606313      0.1680       3.61       0.0004
b1              0.349495      0.0690       5.06       <.0001
b2              0.206877      0.1397       1.48       0.1405

Number of Observations       Statistics for System

Used               168    Log Likelihood    -278.6615
Missing              0
*/

SAS Macro Aligning The Logit Variable to A Scorecard with Specific PDO and Base Point

%macro align_score(data = , y = , logit = , pdo = , base_point = , base_odds = , min_point = 100, max_point = 900); 
***********************************************************; 
* THE MACRO IS TO ALIGN A LOGIT VARIABLE TO A SCORE WITH  *; 
* SPECIFIC PDO, BASE POINT, AND BASE ODDS                 *; 
* ======================================================= *; 
* PARAMETERS:                                             *;
*  DATA      : INPUT SAS DATA TABLE                       *; 
*  Y         : PERFORMANCE VARIABLE WITH 0/1 VALUE        *; 
*  LOGIT     : A LOGIT VARIABLE TO BE ALIGNED FROM        *; 
*  PDO       : PDO OF THE SCORE ALIGNED TO                *; 
*  BASE_POINT: BASE POINT OF THE SCORE ALIGNED TO         *; 
*  BASE_ODDS : ODDS AT BASE POINT OF THE SCORE ALIGNED TO *; 
*  MIN_POINT : LOWER END OF SCORE POINT, 100 BY DEFAULT   *; 
*  MAX_POINT : UPPER END OF SCORE POINT, 900 BY DEFAULT   *; 
* ======================================================= *; 
* OUTPUTS:                                                *; 
*  ALIGN_FORMULA.SAS                                      *; 
*  A SAS CODE WITH THE FORMULA TO ALIGN THE LOGIT FIELD   *; 
*  TO A SPECIFIC SCORE TOGETHER WITH THE STATISTICAL      *; 
*  SUMMARY OF ALIGN_SCORE                                 *;                                    
***********************************************************; 
 
options nocenter nonumber nodate mprint mlogic symbolgen 
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*";
 
%local b0 b1; 
 
data _tmp1 (keep = &y &logit);
  set &data;
  where &y in (0, 1) and &logit ~= .;
run; 
 
ods listing close; 
ods output ParameterEstimates = _est1 (keep = variable estimate); 
proc logistic data = _last_ desc; 
  model &y = &logit;
run; 
ods listing; 
 
data _null_; 
  set _last_; 
   
  if _n_ = 1 then do; 
    b = - (estimate + (log(&base_odds) - (log(2) / &pdo) * &base_point)) / (log(2) / &pdo);
    call symput('b0', put(b, 15.8)); 
  end; 
  else do; 
    b = estimate / (log(2) / &pdo);
    call symput('b1', put(b, 15.8)); 
  end; 
run; 
 
filename formula "ALIGN_FORMULA.SAS"; 
 
data _null_; 
  file formula; 
 
  put @3 3 * "*" " SCORE ALIGNMENT FORMULA OF %upcase(&logit) " 3 * "*" ";";
  put;
  put @3 "ALIGN_SCORE = MAX(MIN(ROUND((%trim(&b0)) - (%trim(&b1)) * %upcase(&logit), 1), &max_point), &min_point);";
  put; 
run; 
 
data _tmp2; 
  set _tmp1; 
  %inc formula; 
run; 
 
proc summary data = _last_ nway; 
  class &y;
  output out = _tmp3(drop = _type_ _freq_) 
  min(align_score) = min_scr max(align_score) = max_scr 
  median(align_score) = mdn_scr; 
run; 
 
data _null_; 
  set _last_; 
  file formula mod; 
 
  if _n_ = 1 then do; 
    put @3 3 * "*" " STATISTICAL SUMMARY OF ALIGN_SCORE BY INPUT DATA: " 3 * "*" ";"; 
  end; 
  put @3 "* WHEN %upcase(&y) = " &y ": MIN(SCORE) = " min_scr " MEDIAN(SCORE) = " mdn_scr " MAX(SCORE) = " max_scr "*;";
run; 
 
proc datasets library = work nolist; 
  delete _: (mt = data); 
run; 
quit; 
 
***********************************************************; 
*                     END OF THE MACRO                    *; 
***********************************************************;  
%mend align_score;
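
Below is a minimal usage sketch; the dataset name and the alignment parameters, e.g. a PDO of 20 and a base point of 650 at 15 to 1 odds, are purely hypothetical.

*** A MINIMAL USAGE SKETCH (HYPOTHETICAL DATASET AND ALIGNMENT PARAMETERS) ***;
%align_score(data = scored, y = bad, logit = logit_score, pdo = 20, base_point = 650, base_odds = 15);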

Faster Random Sampling with Replacement in SAS

Most SAS users like to use the SURVEYSELECT procedure to do random sampling. However, when it comes to big datasets, the efficiency of the SURVEYSELECT procedure seems pretty low. As a result, I normally prefer to use the data step to do the sampling.

While simple random sampling without replacement is trivial and can be easily accomplished by generating a random number from the uniform distribution, random sampling with replacement doesn't seem as straightforward with the data step. In the demo below, I will show how to do sampling with replacement with both the SURVEYSELECT procedure and the data step and compare their efficiencies.

First of all, I will artificially simulate a data set with 10 million rows.

data one;
  do i = 1 to 10000000;
  output;
  end;
run;

Secondly, I will wrap the SURVEYSELECT procedure into a macro to do sampling with replacement. With this method, it took more than 20 seconds of CPU time to get the work done, even after subtracting the ~1 second of simulation time.

%macro urs1(indata = , outdata = );
options mprint;

proc sql noprint;
  select put(count(*), 10.) into :n from &indata;
quit;

proc surveyselect data = &indata out = &outdata n = &n method = urs seed = 2013;
run;

proc freq data = &outdata;
  tables numberhits;
run;

%mend urs1;

%urs1(indata = one, outdata = two);
/*
      real time           30.32 seconds
      cpu time            22.54 seconds

                                       Cumulative    Cumulative
NumberHits    Frequency     Percent     Frequency      Percent
---------------------------------------------------------------
         1     3686249       58.25       3686249        58.25
         2     1843585       29.13       5529834        87.38
         3      611396        9.66       6141230        97.04
         4      151910        2.40       6293140        99.44
         5       30159        0.48       6323299        99.91
         6        4763        0.08       6328062        99.99
         7         641        0.01       6328703       100.00
         8          98        0.00       6328801       100.00
         9          11        0.00       6328812       100.00
        10           1        0.00       6328813       100.00
*/

At last, let's take a look at how to accomplish the same task with a simple data step. The real trick here is to understand the statistical nature of the Poisson distribution: when n records are drawn with replacement from a dataset of n rows, the number of times each record is selected follows a Binomial(n, 1/n) distribution, which is well approximated by a Poisson distribution with mean 1 for a large n. As shown below, while delivering a very similar result, this approach only consumes roughly a quarter of the CPU time. This efficiency gain would be particularly attractive when we need to apply complex machine learning algorithms, e.g. bagging, to big data problems.

%macro urs2(indata = , outdata = );
options mprint;

data &outdata;
  set &indata;
  numberhits = ranpoi(2013, 1);
  if numberhits > 0 then output;
run;

proc freq data = &outdata;
  tables numberhits;
run;

%mend urs2;

%urs2(indata = one, outdata = two);
/*
      real time           13.42 seconds
      cpu time            6.60 seconds

                                       Cumulative    Cumulative
numberhits    Frequency     Percent     Frequency      Percent
---------------------------------------------------------------
         1     3677134       58.18       3677134        58.18
         2     1840742       29.13       5517876        87.31
         3      612487        9.69       6130363        97.00
         4      152895        2.42       6283258        99.42
         5       30643        0.48       6313901        99.90
         6        5180        0.08       6319081        99.99
         7         732        0.01       6319813       100.00
         8          92        0.00       6319905       100.00
         9          12        0.00       6319917       100.00
        10           2        0.00       6319919       100.00
*/

Calculating Marginal Effects in Zero-Inflated Beta Model

libname data 'c:\projects\sgf14\data';

ods output parameterestimates = _parms;
proc nlmixed data = data.full tech = trureg alpha = 0.01;
  parms a0 = 0  a1 = 0  a2 = 0  a3 = 0  a4 = 0  a5 = 0  a6 = 0  a7 = 0
        b0 = 0  b1 = 0  b2 = 0  b3 = 0  b4 = 0  b5 = 0  b6 = 0  b7 = 0 
        c0 = 1  c1 = 0  c2 = 0  c3 = 0  c4 = 0  c5 = 0  c6 = 0  c7 = 0;
  xa = a0 + a1 * x1 + a2 * x2 + a3 * x3 + a4 * x4 + a5 * x5 + a6 * x6 + a7 * x7;
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5 + b6 * x6 + b7 * x7;
  xc = c0 + c1 * x1 + c2 * x2 + c3 * x3 + c4 * x4 + c5 * x5 + c6 * x6 + c7 * x7;
  mu_xa = 1 / (1 + exp(-xa));
  mu_xb = 1 / (1 + exp(-xb));
  phi = exp(xc);
  w = mu_xb * phi;
  t = (1 - mu_xb) * phi;
  if y = 0 then lh = 1 - mu_xa;
  else lh = mu_xa * (gamma(w + t) / (gamma(w) * gamma(t)) * (y ** (w - 1)) * ((1 - y) ** (t - 1)));
  ll = log(lh);
  model y ~ general(ll);
  *** calculate components for marginal effects ***;
  _mfx_a = exp(xa) / ((1 + exp(xa)) ** 2);
  _mfx_b = exp(xb) / ((1 + exp(xb)) ** 2);
  _p_a = 1 / (1 + exp(-xa));
  _p_b = 1 / (1 + exp(-xb));
  predict _mfx_a out = _marg1 (rename = (pred = _mfx_a) keep = id pred y);
  predict _p_a   out = _marg2 (rename = (pred = _p_a)   keep = id pred);
  predict _mfx_b out = _marg3 (rename = (pred = _mfx_b) keep = id pred);
  predict _p_b   out = _marg4 (rename = (pred = _p_b)   keep = id pred);
run;

data _null_;
  set _parms;
  call symput(parameter, put(estimate, 20.15));
run;

data _marg5;
  merge _marg1 _marg2 _marg3 _marg4;
  by id;

  _marg_x1 = _mfx_a * &a1 * _p_b + _mfx_b * &b1 * _p_a;
  _marg_x2 = _mfx_a * &a2 * _p_b + _mfx_b * &b2 * _p_a;
  _marg_x3 = _mfx_a * &a3 * _p_b + _mfx_b * &b3 * _p_a;
  _marg_x4 = _mfx_a * &a4 * _p_b + _mfx_b * &b4 * _p_a;
  _marg_x5 = _mfx_a * &a5 * _p_b + _mfx_b * &b5 * _p_a;
  _marg_x6 = _mfx_a * &a6 * _p_b + _mfx_b * &b6 * _p_a;
  _marg_x7 = _mfx_a * &a7 * _p_b + _mfx_b * &b7 * _p_a;
run;

proc means data = _marg5 mean;
  var _marg_x:;
run;
/*
Variable            Mean
------------------------
_marg_x1      -0.0037445
_marg_x2       0.0783118
_marg_x3       0.0261884
_marg_x4      -0.3105482
_marg_x5     0.000156693
_marg_x6    -0.000430756
_marg_x7      -0.0977589
------------------------
*/

Calculate Predicted Values for Composite Models with NLMIXED

After estimating a statistical model, we often need to use the estimated model specification to calculate predicted values on a separate hold-out dataset to evaluate the model performance. In R or in Python's StatsModels, it is trivial to calculate predicted values by passing the new data through the model object. In SAS, however, a similar task might become a bit complicated, especially when we want to calculate predicted values of a composite model, e.g. a Zero-Inflated Beta model, estimated through the NLMIXED procedure. Most of the time, we would need to parse the model specification into open SAS code or into a SAS dataset that can be used by the SCORE procedure, neither of which is easy.

Today, I'd like to introduce a small trick that allows us to calculate predicted values with the NLMIXED procedure on the fly and that can be extremely handy when we are interested in the prediction only. Below is the high-level procedure.

1) First of all, we combine, e.g. union, both the development and the testing datasets into one table and use a variable to flag all observations from the development dataset, e.g. deve_flg = 1.
2) Secondly, we feed the whole dataset into the NLMIXED procedure. However, we only specify the likelihood function for observations from the development dataset. For observations from the testing dataset, we force the likelihood function to equal 0 (zero) so that they do not contribute to the estimation.
3) At last, we use the PREDICT statement in the NLMIXED procedure to calculate predicted values of interest for every observation, including those from the testing dataset.

Please note that since we artificially inflate the sample size of the estimation sample, some statistical measures that depend on the number of observations, e.g. BIC, might not be accurate. However, since the log likelihood function is still correct, it is trivial to calculate BIC manually.
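
As a minimal sketch, the manual calculation is nothing more than BIC = -2 * logL + k * log(n) with n taken from the development sample only. The log likelihood below is a placeholder and should be replaced with the value reported by NLMIXED; the parameter count and the development sample size refer to the example that follows.

*** MANUAL BIC CALCULATION (LOG LIKELIHOOD IS A PLACEHOLDER) ***;
data _null_;
  logl = -9999;   * log likelihood reported by nlmixed              *;
  k    = 24;      * number of estimated parameters, e.g. a0 - c7    *;
  n    = 2641;    * number of observations in the development data  *;
  bic  = -2 * logl + k * log(n);
  put 'MANUAL BIC = ' bic best12.;
run;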

Below is an example of how to calculate predicted values of a Zero-Inflated Beta Model. In this composite model, there are three sets of parameters, two for the mean components and one for the variance. Therefore, it could be extremely cumbersome to calculate predicted values without a scheme to do so automatically.

libname data 'c:\projects\data';

*** COMBINE DEVELOPMENT AND TEST DATASETS ***;
data full;
  set data.deve (in = a) data.test (in = b);
  *** FLAG OUT THE DEVELOPMENT SAMPLE ***;
  if a then deve_flg = 1;
  if b then deve_flg = 0;
run;

proc nlmixed data = full tech = trureg;
  parms a0 = 0  a1 = 0  a2 = 0  a3 = 0  a4 = 0  a5 = 0  a6 = 0  a7 = 0
        b0 = 0  b1 = 0  b2 = 0  b3 = 0  b4 = 0  b5 = 0  b6 = 0  b7 = 0
        c0 = 1  c1 = 0  c2 = 0  c3 = 0  c4 = 0  c5 = 0  c6 = 0  c7 = 0;
  xa = a0 + a1 * x1 + a2 * x2 + a3 * x3 + a4 * x4 + a5 * x5 + a6 * x6 + a7 * x7;
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5 + b6 * x6 + b7 * x7;
  xc = c0 + c1 * x1 + c2 * x2 + c3 * x3 + c4 * x4 + c5 * x5 + c6 * x6 + c7 * x7;
  mu_xa = 1 / (1 + exp(-xa));
  mu_xb = 1 / (1 + exp(-xb));
  phi = exp(xc);
  w = mu_xb * phi;
  t = (1 - mu_xb) * phi;
  *** SPECIFY LIKELIHOOD FUNCTION FOR ALL CASES WITH DEVE. FLAG ***;
  if deve_flg = 1 then do;
    if y = 0 then lh = 1 - mu_xa;
    else lh = mu_xa * (gamma(w + t) / (gamma(w) * gamma(t)) * (y ** (w - 1)) * ((1 - y) ** (t - 1)));
    ll = log(lh);
  end;
  *** FORCE LIKELIHOOD FUNCTION = 0 FOR ALL CASES WITHOUT DEVE. FLAG ***;
  else ll = 0;
  model y ~ general(ll);
  mu = mu_xa * mu_xb;
  predict mu out = pred (rename = (pred = mu));
run;

proc means data = pred mean n;
  class deve_flg;
  var y mu;
run;
/* output:
                   N
    deve_flg     Obs    Variable    Label                      Mean       N
---------------------------------------------------------------------------
           0    1780    y                                 0.0892984    1780
                        mu          Predicted Value       0.0932779    1780

           1    2641    y                                 0.0918661    2641
                        mu          Predicted Value       0.0919383    2641
---------------------------------------------------------------------------
*/

Dispersion Models

Last week, I read an interesting article, “Dispersion Models in Regression Analysis” by Peter Song (http://www.pakjs.com/journals/25%284%29/25%284%299.pdf), which describes a class of models more general than the classic generalized linear models based on error distributions.

A dispersion model can be defined by two parameters, a location parameter mu and a dispersion parameter sigma ^ 2, and has a very general probability function formulated as:
p(y, mu, sigma ^ 2) = {2 * pi * sigma ^ 2 * V(.)} ^ -0.5 * exp{-1 / (2 * sigma ^ 2) * D(.)}
where the variance function V(.) and the deviance function D(.) vary by distribution. For instance, in a Poisson model,
D(.) = 2 * (y * log(y / mu) - y + mu)
V(.) = mu
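
It is worth noting that, in the likelihood evaluation below, the normalizing term of this density follows the saddlepoint approximation and evaluates the unit variance function at the observed y rather than at mu, which is why the code sets v = y. In the Poisson case, the density used in the estimation can therefore be written as
p(y, mu, sigma ^ 2) = {2 * pi * sigma ^ 2 * y} ^ -0.5 * exp{-(y * log(y / mu) - y + mu) / sigma ^ 2}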

Below is a piece of SAS code estimating a Poisson model with both the classic error distribution assumption and the dispersion assumption.

data one;
  do i = 1 to 1000;
    x = ranuni(i);
    y = ranpoi(i, exp(2 + x * 2 + rannor(1) * 0.1));
    output;
  end;
run;

*** fit a poisson model with classic GLM ***;
proc nlmixed data = one tech = trureg;
  parms b0 = 0 b1 = 0;
  mu = exp(b0 + b1 * x);
  ll = -mu + y * log(mu) - log(fact(y));
  model y ~ general(ll);
run;
/*
             Fit Statistics
-2 Log Likelihood                 6118.0
AIC (smaller is better)           6122.0
AICC (smaller is better)          6122.0
BIC (smaller is better)           6131.8

                                             Parameter Estimates
                         Standard
Parameter    Estimate       Error      DF    t Value    Pr > |t|     Alpha       Lower       Upper    Gradient
b0             2.0024     0.01757    1000     113.95      <.0001      0.05      1.9679      2.0369    5.746E-9
b1             1.9883     0.02518    1000      78.96      <.0001      0.05      1.9388      2.0377    1.773E-9
*/

*** fit a poisson model with dispersion probability ***;
*** proposed by Jorgensen in 1987                   ***;
proc nlmixed data = one tech = trureg;
  parms b0 = 0 b1 = 0 s2 = 1;
  mu = exp(b0 + b1 * x);
  d  = 2 * (y * log(y / mu) - y + mu);
  v  = y;
  lh = (2 * constant('pi') * s2 * v) **  (-0.5) * exp(-(2 * s2) ** (-1) * d);
  ll = log(lh);
  model y ~ general(ll);
run;
/*
             Fit Statistics
-2 Log Likelihood                 6066.2
AIC (smaller is better)           6072.2
AICC (smaller is better)          6072.2
BIC (smaller is better)           6086.9

                                             Parameter Estimates
                         Standard
Parameter    Estimate       Error      DF    t Value    Pr > |t|     Alpha       Lower       Upper    Gradient
b0             2.0024     0.02015    1000      99.37      <.0001      0.05      1.9629      2.0420    2.675E-6
b1             1.9883     0.02888    1000      68.86      <.0001      0.05      1.9316      2.0449    1.903E-6
s2             1.3150     0.05881    1000      22.36      <.0001      0.05      1.1996      1.4304    -0.00002
*/

Please note that although both methods yield the same parameter estimates, there are slight differences in the standard errors and therefore in the t-values. In addition, despite one additional parameter estimated in the model, AIC / BIC are even lower in the dispersion model.

V-Fold Cross Validation to Pick GRNN Smoothing Parameter

On 06/23, I posted two SAS macros implementing GRNN (https://statcompute.wordpress.com/2013/06/23/prototyping-a-general-regression-neural-network-with-sas). However, in order to use these macros in a production environment, we still need a scheme to automatically choose the optimal value of the smoothing parameter. In practice, v-fold or holdout cross validation is often used to accomplish this task. Below is a SAS macro implementing v-fold cross validation to automatically select an optimal value of the smoothing parameter based on the highest K-S statistic in a binary classification case.

%macro grnn_cv(data = , y = , x = , v = , sigmas = );
********************************************************;
* THIS MACRO IS TO DO THE V-FOLD CROSS VALIDATION TO   *;
* PICK THE OPTIMAL VALUE OF SMOOTHING PARAMETER IN A   *;
* BINARY CLASSIFICATION PROBLEM                        *;
*------------------------------------------------------*;
* INPUT PARAMETERS:                                    *;
*  DATA  : INPUT SAS DATASET                           *;
*  X     : A LIST OF PREDICTORS IN THE NUMERIC FORMAT  *;
*  Y     : A RESPONSE VARIABLE IN THE NUMERIC FORMAT   *;
*  V     : NUMBER OF FOLDS FOR CROSS-VALIDATION        *; 
*  SIGMAS: A LIST OF SIGMA VALUES TO TEST              *;
*------------------------------------------------------*;
* OUTPUT:                                              *;
*  SAS PRINT-OUT OF CROSS VALIDATION RESULT IN KS      *;
*  STATISTICS                                          *;
*------------------------------------------------------*;
* AUTHOR:                                              *;
*  WENSUI.LIU@53.COM                                   *;
********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen         
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*"; 

data _data_;
  set &data (keep = &x &y);
  where &y ~= .;
  array _x_ &x;
  _miss_ = 0;
  do _i_ = 1 to dim(_x_);
    if _x_[_i_] = . then _miss_ = 1; 
  end;
  _rand_ = ranuni(1);
  if _miss_ = 0 then output;
run;

proc rank data = _last_ out = _cv1 groups = &v;
  var _rand_;
  ranks _rank_;
run;

%local i;
%let i = 1;

%inc "grnn_learn.sas";
%inc "grnn_pred.sas";

%do %while (%scan(&sigmas, &i, " ") ne %str());
%let sigma = %scan(&sigmas, &i, " ");

  %do j = 0 %to %eval(&v - 1);
  %put &sigma | &i | &j;
   
  data _cv2 _cv3;
    set _cv1;
    if _rank_ ~= &j then output _cv2;
    else output _cv3;
  run;
  
  %grnn_learn(data = _cv2, x = &x, y = &y, sigma = &sigma, nn_out = _grnn);
 
  %grnn_pred(data = _cv3, x = &x, nn_in = _grnn, id = _rand_, out = _pred);

  proc sql;
  create table
    _cv4 as
  select
    a.&y as _y_,
    b._pred_  
  from
    _cv3 as a inner join _pred as b on a._rand_ = b._id_;
  quit;

  %if &j = 0 %then %do;
  data _cv5;
    set _cv4;
  run;
  %end;
  %else %do;
  data _cv5;
    set _cv5 _cv4;
  run;
  %end;

  %end;

ods listing close;
ods output kolsmir2stats = _ks1;
proc npar1way wilcoxon edf data = _cv5;
  class _y_;
  var _pred_;
run;
ods listing;

data _ks2 (keep = sigma ks);
  set _ks1;
  if _n_ = 1 then do;
    sigma = &sigma;
    ks = nvalue2 * 100;
    output;
  end;
run;

%if &i = 1 %then %do;
data _ks3;
  set _ks2;
run;
%end;
%else %do;
data _ks3;
  set _ks3 _ks2;
run;
%end;

%let i = %eval(&i + 1); 
%end;

title "&v._fold cross validation outcomes";
proc print data = _ks3 noobs;
run;

********************************************************;
*              END OF THE MACRO                        *;
********************************************************;
%mend grnn_cv;
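
Below is a hypothetical call of the macro for illustration only; the dataset and variable names are made up and should be replaced with your own, e.g. a binary response bad and numeric predictors x1 - x5, with four candidate sigma values evaluated through 5-fold cross validation.

*** A HYPOTHETICAL CALL FOR ILLUSTRATION ONLY ***;
%grnn_cv(data = work.sample, y = bad, x = x1 x2 x3 x4 x5, v = 5, sigmas = 0.3 0.4 0.5 0.6);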

Prototyping A General Regression Neural Network with SAS

The last time I read the paper “A General Regression Neural Network” by Donald Specht was exactly 10 years ago when I was in graduate school. After reading it again this week, I decided to code it out with SAS macros and make this excellent idea available to the SAS community.

The prototype of GRNN consists of 2 SAS macros, %grnn_learn() for the training of a GRNN and %grnn_pred() for the prediction with a GRNN. The famous Boston Housing dataset is used to test these two macros, with the result compared against the outcome from the R implementation below. In this exercise, it is assumed that the smoothing parameter SIGMA is known and equal to 0.55 in order to simplify the case.

pkgs <- c('MASS', 'doParallel', 'foreach', 'grnn')
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 8)

data(Boston)
X <- Boston[-14]
st.X <- scale(X)
Y <- Boston[14]
boston <- data.frame(st.X, Y)

pred_grnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    data.frame(pred = guess(nn, as.matrix(i)), i, row.names = NULL)
  }
}

grnn <- smooth(learn(boston, variable.column = ncol(boston)), sigma = 0.55)
pred_grnn <- pred_grnn(boston[, -ncol(boston)], grnn)
head(pred_grnn$pred, n = 10)
# [1] 24.61559 23.22232 32.29610 32.57700 33.29552 26.73482 21.46017 20.96827
# [9] 16.55537 20.25247

The first SAS macro to train a GRNN is %grnn_learn() shown below. The purpose of this macro is to store the whole specification of a GRNN in a SAS dataset after a simple one-pass training with the development data. Please note that, motivated by the idea of MongoDB, I use a key-value paired scheme to store the information of a GRNN.

libname data '';

data data.boston;
  infile 'housing.data';
  input x1 - x13 y;
run;

%macro grnn_learn(data = , x = , y = , sigma = , nn_out = );
options mprint mlogic nocenter;
********************************************************;
* THIS MACRO IS TO TRAIN A GENERAL REGRESSION NEURAL   *;
* NETWORK (SPECHT, 1991) AND STORE THE SPECIFICATION   *;
*------------------------------------------------------*;
* INPUT PARAMETERS:                                    *;
*  DATA  : INPUT SAS DATASET                           *;
*  X     : A LIST OF PREDICTORS IN THE NUMERIC FORMAT  *;
*  Y     : A RESPONSE VARIABLE IN THE NUMERIC FORMAT   *;
*  SIGMA : THE SMOOTH PARAMETER FOR GRNN               *;
*  NN_OUT: OUTPUT SAS DATASET CONTAINING THE GRNN      *;
*          SPECIFICATION                               *;
*------------------------------------------------------*;
* AUTHOR:                                              *;
*  WENSUI.LIU@53.COM                                   *;
********************************************************;

data _tmp1;
  set &data (keep = &x &y);
  where &y ~= .;
  array _x_ &x;
  _miss_ = 0;
  do _i_ = 1 to dim(_x_);
    if _x_[_i_] = . then _miss_ = 1; 
  end;
  if _miss_ = 0 then output;
run;

proc summary data = _tmp1;
  output out = _avg_ (drop = _type_ _freq_)
  mean(&x) = ;
run;

proc summary data = _tmp1;
  output out = _std_ (drop = _type_ _freq_)
  std(&x) = ;
run;

proc standard data = _tmp1 mean = 0 std = 1 out = _data_;
  var &x;
run;

data &nn_out (keep = _neuron_ _key_ _value_);
  set _last_ end = eof;
  _neuron_ + 1;
  length _key_ $32;
  array _a_ &y &x;
  do _i_ = 1 to dim(_a_);
    if _i_ = 1 then _key_ = '_Y_';
    else _key_ = upcase(vname(_a_[_i_]));
    _value_ = _a_[_i_];
    output;
  end; 
  if eof then do;
    _neuron_ = 0;
    _key_  = "_SIGMA_";
    _value_  = &sigma;
    output;
    set _avg_;
    array _b_ &x;
    do _i_ = 1 to dim(_b_);
      _neuron_ = -1;
      _key_ = upcase(vname(_b_[_i_]));
      _value_ = _b_[_i_];
      output;
    end;
    set _std_;
    array _c_ &x;
    do _i_ = 1 to dim(_c_);
      _neuron_ = -2;
      _key_ = upcase(vname(_c_[_i_]));
      _value_ = _c_[_i_];
      output;
    end;
  end;
run;

proc datasets library = work;
  delete _: / memtype = data;
run;
quit;

********************************************************;
*              END OF THE MACRO                        *;
********************************************************;
%mend grnn_learn;

%grnn_learn(data = data.boston, x = x1 - x13, y = y, sigma = 0.55, nn_out = data.grnn);

proc print data = data.grnn (obs = 10) noobs;
run;
/* SAS PRINTOUT OF GRNN DATA:
_neuron_    _key_     _value_
    1        _Y_      24.0000
    1        X1       -0.4194
    1        X2        0.2845
    1        X3       -1.2866
    1        X4       -0.2723
    1        X5       -0.1441
    1        X6        0.4133
    1        X7       -0.1199
    1        X8        0.1401
    1        X9       -0.9819
*/

After the training of a GRNN, the macro %grnn_pred() would be used to generate predicted values from a test dataset with all predictors. As shown in the print-out, the first 10 predicted values are identical to those generated with R.
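
Under the hood, the macro implements the standard GRNN estimator. For a new point x, each stored neuron contributes a Gaussian kernel weight based on its squared Euclidean distance d_i ** 2 from the standardized x, and the prediction is the kernel-weighted average of the stored responses, which is exactly what the last two SQL queries in the macro compute:
yhat(x) = sum_i [y_i * exp(-d_i ** 2 / (2 * sigma ** 2))] / sum_i [exp(-d_i ** 2 / (2 * sigma ** 2))]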

libname data '';

%macro grnn_pred(data = , x = , id = NA, nn_in = , out = grnn_pred);
options mprint mlogic nocenter;
********************************************************;
* THIS MACRO IS TO GENERATE PREDICTED VALUES BASED ON  *;
* THE SPECIFICATION OF GRNN CREATED BY THE %GRNN_LEARN *;
* MACRO                                                *;
*------------------------------------------------------*;
* INPUT PARAMETERS:                                    *;
*  DATA : INPUT SAS DATASET                            *;
*  X    : A LIST OF PREDICTORS IN THE NUMERIC FORMAT   *;
*  ID   : AN ID VARIABLE (OPTIONAL)                    *;
*  NN_IN: INPUT SAS DATASET CONTAINING THE GRNN        *;
*         SPECIFICATION GENERATED FROM %GRNN_LEARN     *;
*  OUT  : OUTPUT SAS DATASET WITH GRNN PREDICTIONS     *;
*------------------------------------------------------*;
* AUTHOR:                                              *;
*  WENSUI.LIU@53.COM                                   *;
********************************************************;

data _data_;
  set &data;
  array _x_ &x;
  _miss_ = 0;
  do _i_ = 1 to dim(_x_);
    if _x_[_i_] = . then _miss_ = 1;
  end;
  if _miss_ = 0 then output;
run;

data _data_;
  set _last_ (drop = _miss_);
  %if &id = NA %then %do;
  _id_ + 1;
  %end;
  %else %do;
  _id_ = &id;
  %end;
run;

proc sort data = _last_ sortsize = max nodupkey;
  by _id_;
run;

data _data_ (keep = _id_ _key_ _value_);
  set _last_;
  array _x_ &x;
  length _key_ $32;
  do _i_ = 1 to dim(_x_);
    _key_ = upcase(vname(_x_[_i_]));
    _value_ = _x_[_i_];
    output;
  end;
run;

proc sql noprint;
select _value_ ** 2 into :s2 from &nn_in where _neuron_ = 0;

create table
  _last_ as 
select
  a._id_,
  a._key_,
  (a._value_ - b._value_) / c._value_ as _value_
from
  _last_ as a,
  &nn_in as b,
  &nn_in as c
where
  compress(a._key_, ' ') = compress(b._key_, ' ') and
  compress(a._key_, ' ') = compress(c._key_, ' ') and
  b._neuron_ = -1                                 and
  c._neuron_ = -2;

create table
  _last_ as
select
  a._id_,
  b._neuron_,
  sum((a._value_ - b._value_) ** 2) as d2,
  mean(c._value_)                   as y,
  exp(-(calculated d2) / (2 * &s2)) as exp
from
  _last_  as a,
  &nn_in as b,
  &nn_in as c
where
  compress(a._key_, ' ') = compress(b._key_, ' ') and
  b._neuron_ = c._neuron_                         and
  b._neuron_ > 0                                  and
  c._key_ = '_Y_'
group by
  a._id_, b._neuron_;

create table
  _last_ as
select
  a._id_,
  sum(a.y * a.exp / b.sum_exp) as _pred_
from
  _last_ as a inner join (select _id_, sum(exp) as sum_exp from _last_ group by _id_) as b
on
  a._id_ = b._id_
group by
  a._id_;
quit;

proc sort data = _last_ out = &out sortsize = max;
  by _id_;
run;

********************************************************;
*              END OF THE MACRO                        *;
********************************************************;
%mend grnn_pred;

%grnn_pred(data = data.boston, x = x1 - x13, nn_in = data.grnn);

proc print data = grnn_pred (obs = 10) noobs;
run;
/* SAS PRINTOUT:
_id_     _pred_
  1     24.6156
  2     23.2223
  3     32.2961
  4     32.5770
  5     33.2955
  6     26.7348
  7     21.4602
  8     20.9683
  9     16.5554
 10     20.2525
*/

After the development of these two macros, I also compared the predictive performance between GRNN and OLS regression. It turns out that GRNN consistently outperforms OLS regression across a wide range of SIGMA values. With a reasonable choice of the SIGMA value, even a GRNN developed with 10% of the whole Boston Housing dataset is able to generalize well and yield an R^2 > 0.8 on the remaining 90% of the data.

Estimating Composite Models for Count Outcomes with FMM Procedure

Once upon a time when I learned SAS, it was still version 6.X. As an old dog with 10+ years of experience in SAS, I’ve been trying my best to keep up with new tricks in each release of SAS. In SAS 9.3, my favorite new feature in SAS/STAT is the FMM procedure for estimating finite mixture models.

In 2008 when I drafted “Count Data Models in SAS®” (www2.sas.com/proceedings/forum2008/371-2008.pdf), it was pretty cumbersome to specify the log likelihood function of a composite model for count outcomes, e.g. the hurdle Poisson or zero-inflated Poisson model. However, with the availability of the FMM procedure, estimating composite models has never been easier.

In the demonstration below, I am going to show a side-by-side comparison of how to estimate three types of composite models for count outcomes, namely the hurdle Poisson, zero-inflated Poisson, and finite mixture Poisson models, with the FMM and NLMIXED procedures respectively. As shown, both procedures produce identical model estimates. The likelihood functions implemented in each case are summarized below.
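
With p0 denoting the probability of the degenerate zero component in the first two models, p the mixing probability in the finite mixture model, and mu, mu1, and mu2 the Poisson means, the likelihood functions coded in the NLMIXED steps are:
Hurdle Poisson        : P(y = 0) = p0 and P(y = k > 0) = (1 - p0) * exp(-mu) * mu ** k / ((1 - exp(-mu)) * k!)
Zero-Inflated Poisson : P(y = 0) = p0 + (1 - p0) * exp(-mu) and P(y = k > 0) = (1 - p0) * exp(-mu) * mu ** k / k!
Finite Mixture Poisson: P(y = k) = p * exp(-mu1) * mu1 ** k / k! + (1 - p) * exp(-mu2) * mu2 ** k / k!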

Hurdle Poisson Model

*** HURDLE POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = truncpoisson;
  model majordrg = / dist = constant;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8201.0
AIC  (smaller is better)      8221.0
AICC (smaller is better)      8221.0
BIC  (smaller is better)      8293.5

Parameter Estimates for 'Truncated Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.0706    0.3081    -6.72    <.0001
        1  AGE         0.01796  0.005482     3.28    0.0011
        1  ACADMOS    0.000852  0.000700     1.22    0.2240
        1  MINORDRG     0.1739   0.03441     5.05    <.0001
        1  LOGSPEND     0.1229   0.04219     2.91    0.0036

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -4.2309      0.1808     -23.40      <.0001
AGE           0.01694    0.003323       5.10      <.0001
ACADMOS      0.002240    0.000492       4.55      <.0001
MINORDRG       0.7653     0.03842      19.92      <.0001
LOGSPEND       0.2301     0.02683       8.58      <.0001
*/

*** HURDLE POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -4 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -2 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0	B2_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  p0 = 1 / (1 + exp_eta1);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  if majordrg = 0 then _prob_ = p0;
  else _prob_ = (1 - p0) * exp(-exp_eta2) * (exp_eta2 ** majordrg) / ((1 - exp(-exp_eta2)) * fact(majordrg));
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8201.0
AIC (smaller is better)           8221.0
AICC (smaller is better)          8221.0
BIC (smaller is better)           8293.5

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -4.2309     0.1808    1E4    -23.40     <.0001
B1_age          0.01694   0.003323    1E4      5.10     <.0001
B1_acadmos     0.002240   0.000492    1E4      4.55     <.0001
B1_minordrg      0.7653    0.03842    1E4     19.92     <.0001
B1_logspend      0.2301    0.02683    1E4      8.58     <.0001
============
B2_intercept    -2.0706     0.3081    1E4     -6.72     <.0001
B2_age          0.01796   0.005482    1E4      3.28     0.0011
B2_acadmos     0.000852   0.000700    1E4      1.22     0.2240
B2_minordrg      0.1739    0.03441    1E4      5.05     <.0001
B2_logspend      0.1229    0.04219    1E4      2.91     0.0036
*/

Zero-Inflated Poisson Model

*** ZERO-INFLATED POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = poisson;
  model majordrg = / dist = constant;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8147.9
AIC  (smaller is better)      8167.9
AICC (smaller is better)      8167.9
BIC  (smaller is better)      8240.5

Parameter Estimates for 'Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.2780    0.3002    -7.59    <.0001
        1  AGE         0.01956  0.006019     3.25    0.0012
        1  ACADMOS    0.000249  0.000668     0.37    0.7093
        1  MINORDRG     0.1176   0.02711     4.34    <.0001
        1  LOGSPEND     0.1644   0.03531     4.66    <.0001

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -1.9111      0.4170      -4.58      <.0001
AGE          -0.00082    0.008406      -0.10      0.9218
ACADMOS      0.002934    0.001085       2.70      0.0068
MINORDRG       1.4424      0.1361      10.59      <.0001
LOGSPEND      0.09562     0.05080       1.88      0.0598
*/

*** ZERO-INFLATED POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -2 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -2 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0	B2_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  p0 = 1 / (1 + exp_eta1);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  if majordrg = 0 then _prob_ = p0 + (1 - p0) * exp(-exp_eta2);
  else _prob_ = (1 - p0) * exp(-exp_eta2) * (exp_eta2 ** majordrg) / fact(majordrg);
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8147.9
AIC (smaller is better)           8167.9
AICC (smaller is better)          8167.9
BIC (smaller is better)           8240.5

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -1.9111     0.4170    1E4     -4.58     <.0001
B1_age         -0.00082   0.008406    1E4     -0.10     0.9219
B1_acadmos     0.002934   0.001085    1E4      2.70     0.0068
B1_minordrg      1.4424     0.1361    1E4     10.59     <.0001
B1_logspend     0.09562    0.05080    1E4      1.88     0.0598
============
B2_intercept    -2.2780     0.3002    1E4     -7.59     <.0001
B2_age          0.01956   0.006019    1E4      3.25     0.0012
B2_acadmos     0.000249   0.000668    1E4      0.37     0.7093
B2_minordrg      0.1176    0.02711    1E4      4.34     <.0001
B2_logspend      0.1644    0.03531    1E4      4.66     <.0001
*/

Two-Class Finite Mixture Poisson Model

*** TWO-CLASS FINITE MIXTURE POISSON MODEL WITH FMM PROCEDURE ***;
proc fmm data = tmp1 tech = trureg;
  model majordrg = age acadmos minordrg logspend / dist = poisson k = 2;
  probmodel age acadmos minordrg logspend;
run;
/*
Fit Statistics

-2 Log Likelihood             8136.8
AIC  (smaller is better)      8166.8
AICC (smaller is better)      8166.9
BIC  (smaller is better)      8275.7

Parameter Estimates for 'Poisson' Model
 
                                Standard
Component  Effect     Estimate     Error  z Value  Pr > |z|

        1  Intercept   -2.4449    0.3497    -6.99    <.0001
        1  AGE         0.02214  0.006628     3.34    0.0008
        1  ACADMOS    0.000529  0.000770     0.69    0.4920
        1  MINORDRG    0.05054   0.04015     1.26    0.2081
        1  LOGSPEND     0.2140   0.04127     5.18    <.0001
        2  Intercept   -8.0935    1.5915    -5.09    <.0001
        2  AGE         0.01150   0.01294     0.89    0.3742
        2  ACADMOS    0.004567  0.002055     2.22    0.0263
        2  MINORDRG     0.2638    0.6770     0.39    0.6968
        2  LOGSPEND     0.6826    0.2203     3.10    0.0019

Parameter Estimates for Mixing Probabilities
 
                         Standard
Effect       Estimate       Error    z Value    Pr > |z|

Intercept     -1.4275      0.5278      -2.70      0.0068
AGE          -0.00277     0.01011      -0.27      0.7844
ACADMOS      0.001614    0.001440       1.12      0.2623
MINORDRG       1.5865      0.1791       8.86      <.0001
LOGSPEND     -0.06949     0.07436      -0.93      0.3501
*/

*** TWO-CLASS FINITE MIXTURE POISSON MODEL WITH NLMIXED PROCEDURE ***;
proc nlmixed data = tmp1 tech = trureg maxit = 500;
  parms B1_intercept = -2 B1_age = 0 B1_acadmos = 0 B1_minordrg = 0 B1_logspend = 0
        B2_intercept = -8 B2_age = 0 B2_acadmos = 0 B2_minordrg = 0 B2_logspend = 0
		B3_intercept = -1 B3_age = 0 B3_acadmos = 0 B3_minordrg = 0 B3_logspend = 0;

  eta1 = B1_intercept + B1_age * age + B1_acadmos * acadmos + B1_minordrg * minordrg + B1_logspend * logspend;
  exp_eta1 = exp(eta1);
  prob1 = exp(-exp_eta1) * exp_eta1 ** majordrg / fact(majordrg);
  eta2 = B2_intercept + B2_age * age + B2_acadmos * acadmos + B2_minordrg * minordrg + B2_logspend * logspend;
  exp_eta2 = exp(eta2);
  prob2 = exp(-exp_eta2) * exp_eta2 ** majordrg / fact(majordrg);
  eta3 = B3_intercept + B3_age * age + B3_acadmos * acadmos + B3_minordrg * minordrg + B3_logspend * logspend;
  exp_eta3 = exp(eta3);
  p = exp_eta3 / (1 + exp_eta3);
  _prob_ = p * prob1 + (1 - p) * prob2;
  ll = log(_prob_);
  model majordrg ~ general(ll);
run;
/*
Fit Statistics

-2 Log Likelihood                 8136.8
AIC (smaller is better)           8166.8
AICC (smaller is better)          8166.9
BIC (smaller is better)           8275.7

Parameter Estimates
 
                          Standard
Parameter      Estimate      Error     DF   t Value   Pr > |t|

B1_intercept    -2.4449     0.3497    1E4     -6.99     <.0001
B1_age          0.02214   0.006628    1E4      3.34     0.0008
B1_acadmos     0.000529   0.000770    1E4      0.69     0.4920
B1_minordrg     0.05054    0.04015    1E4      1.26     0.2081
B1_logspend      0.2140    0.04127    1E4      5.18     <.0001
============
B2_intercept    -8.0935     1.5916    1E4     -5.09     <.0001
B2_age          0.01150    0.01294    1E4      0.89     0.3742
B2_acadmos     0.004567   0.002055    1E4      2.22     0.0263
B2_minordrg      0.2638     0.6770    1E4      0.39     0.6968
B2_logspend      0.6826     0.2203    1E4      3.10     0.0020
============
B3_intercept    -1.4275     0.5278    1E4     -2.70     0.0068
B3_age         -0.00277    0.01011    1E4     -0.27     0.7844
B3_acadmos     0.001614   0.001440    1E4      1.12     0.2623
B3_minordrg      1.5865     0.1791    1E4      8.86     <.0001
B3_logspend    -0.06949    0.07436    1E4     -0.93     0.3501
*/

A SAS Macro for Scorecard Evaluation with Weights

On 09/28/2012, I posted a SAS macro evaluating the scorecard performance, e.g. KS / AUC statistics (https://statcompute.wordpress.com/2012/09/28/a-sas-macro-for-scorecard-performance-evaluation). However, that macro is not generic enough to handle cases with a weighting variable. In a recent project that I am working on, there is a weight variable attached to each credit applicant due to the reject inference. Therefore, there is no excuse for me to keep putting off another SAS macro that can handle a weight variable with any positive values in the scorecard evaluation. Below is a quick draft of the macro. You might have to tweak it a little to suit your needs in production.
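
For clarity, the separation measures computed by the macro are the weighted analogues of the usual statistics. At each score point, the cumulative percentages of goods and bads are accumulated with the weights instead of the raw counts, and then
KS  = max over score s of |cum. weighted good%(s) - cum. weighted bad%(s)| * 100
AUC = sum over adjacent score points of (good%(s2) - good%(s1)) * (bad%(s2) + bad%(s1)) / 2
i.e. the trapezoidal rule applied to the weighted cumulative good / bad distributions, which is what the _tmp3 and _tmp5 steps in the macro calculate.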

%macro wt_auc(data = , score = , y = , weight = );
***********************************************************;
* THE MACRO IS TO EVALUATE THE SEPARATION POWER OF A      *;
* SCORECARD WITH WEIGHTS                                  *;
* ------------------------------------------------------- *;
* PARAMETERS:                                             *;
*  DATA  : INPUT DATASET                                  *;
*  SCORE : SCORECARD VARIABLE                             *;
*  Y     : RESPONSE VARIABLE IN (0, 1)                    *;
*  WEIGHT: WEIGHT VARIABLE WITH POSITIVE VALUES           *; 
* ------------------------------------------------------- *;
* OUTPUTS:                                                *;
*  A SUMMARY TABLE WITH KS AND AUC STATISTICS             *;
* ------------------------------------------------------- *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;
options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

data _tmp1 (keep = &score &y &weight);
  set &data;
  where &score ~= . and &y in (0, 1) and &weight >= 0;
run;

*** CAPTURE THE DIRECTION OF SCORE ***;
ods listing close;
ods output spearmancorr = _cor;
proc corr data = _tmp1 spearman;
  var &y;
  with &score;
run;
ods listing;

data _null_;
  set _cor;
  if &y >= 0 then do;
    call symput('desc', 'descending');
  end;
  else do;
    call symput('desc', ' ');
  end;
run;

proc sql noprint;
create table
  _tmp2 as    
select
  &score                                         as _scr,
  sum(&y)                                        as _bcnt,
  count(*)                                       as _cnt,
  sum(case when &y = 1 then &weight else 0 end)  as _bwt,
  sum(case when &y = 0 then &weight else 0 end)  as _gwt
from
  _tmp1
group by
  &score;

select
  sum(_bwt) into :bsum
from
  _tmp2;
  
select
  sum(_gwt) into :gsum
from
  _tmp2;

select
  sum(_cnt) into :cnt
from
  _tmp2;    
quit;
%put &cnt;

proc sort data = _tmp2;
  by &desc _scr;
run;

data _tmp3;
  set _tmp2;
  by &desc _scr;
  retain _gcum _bcum _cntcum;
  _gcum + _gwt;
  _bcum + _bwt;
  _cntcum + _cnt;
  _gpct = _gcum / &gsum;
  _bpct = _bcum / &bsum;
  _ks   = abs(_gpct - _bpct) * 100;
  _rank = int(_cntcum / ceil(&cnt / 10)) + 1;
run;

proc sort data = _tmp3 sortsize = max;
  by _gpct _bpct;
run;

data _tmp4;
  set _tmp3;
  by _gpct _bpct;
  if last._gpct then do;
    _idx + 1;
    output;
  end;
run;

proc sql noprint;
create table
  _tmp5 as
select
  a._gcum as gcum,
  (b._gpct - a._gpct) * (b._bpct + a._bpct) / 2 as dx
from
  _tmp4 as a, _tmp4 as b
where
  a._idx + 1 = b._idx;

select
  sum(dx) format = 15.8 into :AUC
from
  _tmp5;

select
  max(_ks) format = 15.8 into :KS_STAT
from
  _tmp3;

select
  _scr format = 6.2 into :KS_SCORE
from
  _tmp3
where
  _ks = (select max(_ks) from _tmp3);

create table
  _tmp6 as
select
  _rank                       as rank,
  min(_scr)                   as min_scr,
  max(_scr)                   as max_scr,
  sum(_cnt)                   as cnt,
  sum(_gwt + _bwt)            as wt,
  sum(_gwt)                   as gwt,
  sum(_bwt)                   as bwt,
  sum(_bwt) / calculated wt   as bad_rate
from
  _tmp3
group by
  _rank;    
quit;  

proc report data = _last_ spacing = 1 split = "/" headline nowd;
  column("GOOD BAD SEPARATION REPORT FOR %upcase(%trim(&score)) IN DATA %upcase(%trim(&data))/
          MAXIMUM KS = %trim(&ks_stat) AT SCORE POINT %trim(&ks_score) and AUC STATISTICS = %trim(&auc)/ /"
          rank min_scr max_scr cnt wt gwt bwt bad_rate);
  define rank       / noprint order order = data;
  define min_scr    / "MIN/SCORE"             width = 10 format = 9.2        analysis min center;
  define max_scr    / "MAX/SCORE"             width = 10 format = 9.2        analysis max center;
  define cnt        / "RAW/COUNT"             width = 10 format = comma9.    analysis sum;
  define wt         / "WEIGHTED/SUM"          width = 15 format = comma14.2  analysis sum;
  define gwt        / "WEIGHTED/GOODS"        width = 15 format = comma14.2  analysis sum;
  define bwt        / "WEIGHTED/BADS"         width = 15 format = comma14.2  analysis sum;
  define bad_rate   / "BAD/RATE"              width = 10 format = percent9.2 order center;
  rbreak after / summarize dol skip;
run;

proc datasets library = work nolist;
  delete _tmp: / memtype = data;
run;
quit;

***********************************************************;
*                     END OF THE MACRO                    *;
***********************************************************;
%mend wt_auc;

In the first demo below, a weight variable with fractional values is tested.

*** TEST CASE OF FRACTIONAL WEIGHTS ***;
data one;
  set data.accepts;
  weight = ranuni(1);
run;

%wt_auc(data = one, score = bureau_score, y = bad, weight = weight);
/*
                   GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA ONE
       MAXIMUM KS = 34.89711721 AT SCORE POINT 678.00 and AUC STATISTICS = 0.73521009

    MIN        MAX            RAW        WEIGHTED        WEIGHTED        WEIGHTED    BAD
   SCORE      SCORE         COUNT             SUM           GOODS            BADS    RATE
 -------------------------------------------------------------------------------------------
    443.00     619.00         539          276.29          153.16          123.13   44.56%
    620.00     644.00         551          273.89          175.00           98.89   36.11%
    645.00     660.00         544          263.06          176.88           86.18   32.76%
    661.00     676.00         555          277.26          219.88           57.38   20.70%
    677.00     692.00         572          287.45          230.41           57.04   19.84%
    693.00     707.00         510          251.51          208.25           43.26   17.20%
    708.00     724.00         576          276.31          243.89           32.42   11.73%
    725.00     746.00         566          285.53          262.73           22.80    7.98%
    747.00     772.00         563          285.58          268.95           16.62    5.82%
    773.00     848.00         546          272.40          264.34            8.06    2.96%
 ========== ========== ========== =============== =============== ===============
    443.00     848.00       5,522        2,749.28        2,203.49          545.79
*/

In the second demo, a weight variable with positive integers is also tested.

*** TEST CASE OF INTEGER WEIGHTS ***;
data two;
  set data.accepts;
  weight = rand("poisson", 20);
run;

%wt_auc(data = two, score = bureau_score, y = bad, weight = weight);
/*
                   GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA TWO
       MAXIMUM KS = 35.58884479 AT SCORE POINT 679.00 and AUC STATISTICS = 0.73725030

    MIN        MAX            RAW        WEIGHTED        WEIGHTED        WEIGHTED    BAD
   SCORE      SCORE         COUNT             SUM           GOODS            BADS    RATE
 -------------------------------------------------------------------------------------------
    443.00     619.00         539       10,753.00        6,023.00        4,730.00   43.99%
    620.00     644.00         551       11,019.00        6,897.00        4,122.00   37.41%
    645.00     660.00         544       10,917.00        7,479.00        3,438.00   31.49%
    661.00     676.00         555       11,168.00        8,664.00        2,504.00   22.42%
    677.00     692.00         572       11,525.00        9,283.00        2,242.00   19.45%
    693.00     707.00         510       10,226.00        8,594.00        1,632.00   15.96%
    708.00     724.00         576       11,497.00       10,117.00        1,380.00   12.00%
    725.00     746.00         566       11,331.00       10,453.00          878.00    7.75%
    747.00     772.00         563       11,282.00       10,636.00          646.00    5.73%
    773.00     848.00         546       10,893.00       10,598.00          295.00    2.71%
 ========== ========== ========== =============== =============== ===============
    443.00     848.00       5,522      110,611.00       88,744.00       21,867.00
*/

As shown, the macro works well in both cases. Please feel free to let me know if it helps in your cases.

A SAS Macro Calculating PDO

In the development of credit scorecards, the model developer will usually scale the predicted probability of default / delinquency into a range of discrete score points for the purpose of operational convenience. While there are multiple ways to perform the scaling, the most popular one in the credit risk arena is to scale the predicted probability logarithmically such that the odds, i.e. the ratio between goods and bads, will be doubled / halved after an increase / decrease of a certain number of score points, which is also called PDO (points to double the odds) in the industry.
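
Under this logarithmic scaling, the score can be written as
score = offset + factor * log(odds)
factor = pdo / log(2)
offset = ref_score - factor * log(ref_odds)
so that adding pdo points to the score doubles the odds. For instance, with odds of 20 at a reference score of 680 and a PDO of 45, as in the example later in this section, factor = 45 / log(2) = 64.92 and offset = 680 - 64.92 * log(20) = 485.51, and a score of 725 then corresponds to odds of 40.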

In practice, the PDO over time carries strong implications in credit policies and strategies and is often used as an effectiveness measure for a scorecard. Chances are that a scorecard could still maintain a satisfactory predictiveness, e.g. K-S statistic or AUC, but lose its effectiveness, e.g. through an increase in PDO. For instance, if a credit strategy was originally designed to take effect at credit score = 700, it might require credit score = 720 later after the increase in PDO, which would significantly discount the effectiveness of such a strategy. As a result, it is critically important to monitor the PDO of a scorecard in the production environment.

Below is a SAS macro to calculate the PDO of a scorecard.

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

%macro get_pdo(data = , score = , y = , wt = NONE, ref_score = , target_odds = , target_pdo = );
**********************************************************************;
* THIS MACRO IS TO CALCULATE OBSERVED ODDS AND PDO FOR ANY SCORECARD *;
* AND COMPUTE ALIGNMENT BETAS TO REACH TARGET ODDS AND PDO           *;
* ------------------------------------------------------------------ *;
* PARAMETERS:                                                        *;
*  DATA       : INPUT DATASET                                        *;
*  SCORE      : SCORECARD VARIABLE                                   *;
*  Y          : RESPONSE VARIABLE IN (0, 1)                          *;
*  WT         : WEIGHT VARIABLE IN POSITIVE INTEGER                  *;
*  REF_SCORE  : REFERENCE SCORE POINT FOR TARGET ODDS AND PDO        *;
*  TARGET_ODDS: TARGET ODDS AT REFERENCE SCORE OF SCORECARD          *;
*  TARGET_PDO : TARGET POINTS TO DOUBLE ODDS OF SCORECARD            *; 
* ------------------------------------------------------------------ *;
* OUTPUTS:                                                           *;
*  REPORT : PDO REPORT WITH THE CALIBRATION FORMULA IN HTML FORMAT   *;
* ------------------------------------------------------------------ *;
* AUTHOR: WENSUI.LIU@53.COM                                          *;
**********************************************************************;

options nonumber nodate orientation = landscape nocenter;  

*** CHECK IF THERE IS WEIGHT VARIABLE ***;
%if %upcase(&wt) = NONE %then %do;
  data _tmp1 (keep = &y &score _wt);
    set &data;
    where &y in (1, 0)  and
          &score ~= .;
    _wt = 1;
    &score = round(&score., 1);
  run;
%end;
%else %do;
  data _tmp1 (keep = &y &score _wt);
    set &data;
    where &y in (1, 0)        and
          &score ~= .         and
          round(&wt., 1) > 0;
    _wt = round(&wt., 1);
    &score = round(&score., 1);
  run;
%end;

proc logistic data = _tmp1 desc outest = _est1 noprint;
  model &y = &score;
  freq _wt;
run;

proc sql noprint;
  select round(min(&score), 0.01) into :min_score from _tmp1;

  select round(max(&score), 0.01) into :max_score from _tmp1;
quit;

data _est2;
  set _est1 (keep = intercept &score rename = (&score = slope));

  adjust_beta0 = &ref_score - (&target_pdo * log(&target_odds) / log(2)) - intercept * &target_pdo / log(2);
  adjust_beta1 = -1 * (&target_pdo * slope / log(2));

  do i = -5 to 5;
    old_pdo = round(-log(2) / slope, 0.01);
    old_ref = &ref_score + (i) * old_pdo;
    old_odd = exp(-(slope * old_ref + intercept)); 
    if old_ref >= &min_score and old_ref <= &max_score then output; 
  end;
run;

data _tmp2;
  set _tmp1;
  
  if _n_ = 1 then do;
    set _est2(obs = 1);
  end;

  adjusted = adjust_beta0 + adjust_beta1 * &score;
run;

proc logistic data = _tmp2 desc noprint outest = _est3;
  model &y = adjusted;
  freq _wt;
run;

data _est4;
  set _est3 (keep = intercept adjusted rename = (adjusted = slope));

  adjust_beta0 = &ref_score - (&target_pdo * log(&target_odds) / log(2)) - intercept * &target_pdo / log(2);
  adjust_beta1 = -1 * (&target_pdo * slope / log(2));

  do i = -5 to 5;
    new_pdo = round(-log(2) / slope, 0.01);
    new_ref = &ref_score + (i) * new_pdo;
    new_odd = exp(-(slope * new_ref + intercept)); 
    if new_ref >= &min_score and new_ref <= &max_score then output;
  end;
run;
 
proc sql noprint;
create table
  _final as
select
  &target_pdo            as target_pdo,
  &target_odds           as target_odds, 
  a.old_pdo              as pdo1,
  a.old_ref              as ref1,
  a.old_odd              as odd1,
  log(a.old_odd)         as ln_odd1,
  a.adjust_beta0         as adjust_beta0, 
  a.adjust_beta1         as adjust_beta1,
  b.new_pdo              as pdo2,
  b.new_ref              as ref2,
  b.new_odd              as odd2,
  log(b.new_odd)         as ln_odd2
from
  _est2 as a inner join _est4 as b
on
  a.i = b.i;

select round(pdo1, 1) into :pdo1 from _final;

select put(max(pdo1 / pdo2 - 1, 0), percent10.2) into :compare from _final;

select case when pdo1 > pdo2 then 1 else 0 end into :flag from _final;

select put(adjust_beta0, 12.8) into :beta0 from _final;

select put(adjust_beta1, 12.8) into :beta1 from _final;
quit;

%put &compare;
ods html file = "%upcase(%trim(&score))_PDO_SUMMARY.html" style = sasweb;
title;
proc report data  = _final box spacing = 1 split = "/" 
  style(header) = [font_face = "courier new"] style(column) = [font_face = "courier new"]
  style(lines) = [font_face = "courier new" font_size = 2] style(report) = [font_face = "courier new"];

  column("/SUMMARY OF POINTS TO DOUBLE ODDS FOR %upcase(&score) WEIGHTED BY %upcase(&wt) IN DATA %upcase(&data)
          /( TARGET PDO = &target_pdo, TARGET ODDS = &target_odds AT REFERENCE SCORE &ref_score ) / "
         pdo1 ref1 odd1 ln_odd1 pdo2 ref2 odd2 ln_odd2);

  define pdo1    / "OBSERVED/SCORE PDO"   width = 10 format = 4.   center;
  define ref1    / "OBSERVED/REF. SCORE"  width = 15 format = 5.   center order order = data;
  define odd1    / "OBSERVED/ODDS"        width = 15 format = 14.4 center;
  define ln_odd1 / "OBSERVED/LOG ODDS"    width = 15 format = 8.2  center;
  define pdo2    / "ADJUSTED/SCORE PDO"   width = 10 format = 4.   center;
  define ref2    / "ADJUSTED/REF. SCORE"  width = 15 format = 5.   center;
  define odd2    / "ADJUSTED/ODDS"        width = 15 format = 14.4 center;  
  define ln_odd2 / "ADJUSTED/LOG ODDS"    width = 15 format = 8.2  center;

  compute after;
  %if &flag = 1 %then %do;
    line @15 "THE SCORE ODDS IS DETERIORATED BY %trim(&compare).";
    line @15 "CALIBRATION FORMULA: ADJUSTED SCORE = %trim(&beta0) + %trim(&beta1) * %trim(%upcase(&score)).";
  %end;
  %else %do;
    line @25 "THERE IS NO DETERIORATION IN THE SCORE ODDS."; 
  %end;
  endcomp;
run;
ods html close;

*************************************************;
*              END OF THE MACRO                 *;
*************************************************; 
%mend get_pdo;

In the following example, I called the macro to calculate the PDO of bureau_score in a test data. The parameter inputs “ref_score”, “target_odds”, and “target_pdo” could be acquired from the development or the benchmark sample. As shown in the output, the observed PDO is higher than the target PDO, implying a 2.93% deterioration in the score effectiveness. In addition to the PDO calculation, the SAS macro also generated a calibration formula to bring the bureau_score back to the target PDO.

data tmp1;
  set data.accepts;
  where bureau_score ~= .;
run;

%get_pdo(data = tmp1, score = bureau_score, y = bad, wt = weight, ref_score = 680, target_odds = 20, target_pdo = 45);

/*
 -----------------------------------------------------------------------------------------------------------------------
 |                  SUMMARY OF POINTS TO DOUBLE ODDS FOR BUREAU_SCORE WEIGHTED BY WEIGHT IN DATA TMP1                  |
 |                            ( TARGET PDO = 45, TARGET ODDS = 20 AT REFERENCE SCORE 680 )                             |
 |                                                                                                                     |
 | OBSERVED     OBSERVED        OBSERVED        OBSERVED      ADJUSTED     ADJUSTED        ADJUSTED        ADJUSTED    |
 |SCORE PDO    REF. SCORE         ODDS          LOG ODDS     SCORE PDO    REF. SCORE         ODDS          LOG ODDS    |
 |---------------------------------------------------------------------------------------------------------------------| 
 |     46   |       448     |        0.6404 |      -0.45    |     45   |       455     |        0.6250 |      -0.47    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       495     |        1.2809 |       0.25    |     45   |       500     |        1.2500 |       0.22    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       541     |        2.5618 |       0.94    |     45   |       545     |        2.5000 |       0.92    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       587     |        5.1239 |       1.63    |     45   |       590     |        5.0000 |       1.61    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       634     |       10.2483 |       2.33    |     45   |       635     |       10.0000 |       2.30    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       680     |       20.4976 |       3.02    |     45   |       680     |       20.0000 |       3.00    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       726     |       40.9971 |       3.71    |     45   |       725     |       40.0000 |       3.69    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       773     |       81.9981 |       4.41    |     45   |       770     |       80.0000 |       4.38    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     46   |       819     |      164.0038 |       5.10    |     45   |       815     |      160.0000 |       5.08    | 
 |---------------------------------------------------------------------------------------------------------------------| 
 |              THE SCORE ODDS IS DETERIORATED BY 2.93%.                                                               |            
 |              CALIBRATION FORMULA: ADJUSTED SCORE = 20.92914838 + 0.97156812 * BUREAU_SCORE.                         |            
 ----------------------------------------------------------------------------------------------------------------------- 
*/

In the second example, I used the calibration formula from above to generate an adjust_score from the bureau_score and then calculated the PDO again. As shown in the output, after the calibration, the observed PDO of adjust_score becomes equivalent to the target PDO of bureau_score. Every time the score increases by 45 points, e.g. from 680 to 725, the odds are doubled as anticipated, e.g. from 20 to 40.

data tmp2;
  set tmp1;
  adjust_score = 20.92914838 + 0.97156812 * bureau_score;
run;

%get_pdo(data = tmp2, score = adjust_score, y = bad, wt = weight, ref_score = 680, target_odds = 20, target_pdo = 45);

/*
 -----------------------------------------------------------------------------------------------------------------------
 |                  SUMMARY OF POINTS TO DOUBLE ODDS FOR ADJUST_SCORE WEIGHTED BY WEIGHT IN DATA TMP2                  |
 |                            ( TARGET PDO = 45, TARGET ODDS = 20 AT REFERENCE SCORE 680 )                             |
 |                                                                                                                     |
 | OBSERVED     OBSERVED        OBSERVED        OBSERVED      ADJUSTED     ADJUSTED        ADJUSTED        ADJUSTED    |
 |SCORE PDO    REF. SCORE         ODDS          LOG ODDS     SCORE PDO    REF. SCORE         ODDS          LOG ODDS    |
 |---------------------------------------------------------------------------------------------------------------------| 
 |     45   |       455     |        0.6249 |      -0.47    |     45   |       455     |        0.6250 |      -0.47    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       500     |        1.2498 |       0.22    |     45   |       500     |        1.2500 |       0.22    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       545     |        2.4996 |       0.92    |     45   |       545     |        2.5000 |       0.92    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       590     |        4.9991 |       1.61    |     45   |       590     |        5.0000 |       1.61    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       635     |        9.9980 |       2.30    |     45   |       635     |       10.0000 |       2.30    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       680     |       19.9956 |       3.00    |     45   |       680     |       20.0000 |       3.00    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       725     |       39.9905 |       3.69    |     45   |       725     |       40.0000 |       3.69    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       770     |       79.9796 |       4.38    |     45   |       770     |       80.0000 |       4.38    | 
 |----------+---------------+---------------+---------------+----------+---------------+---------------+---------------| 
 |     45   |       815     |      159.9565 |       5.07    |     45   |       815     |      160.0000 |       5.08    | 
 |---------------------------------------------------------------------------------------------------------------------| 
 |                        THERE IS NO DETERIORATION IN THE SCORE ODDS.                                                 |
 ----------------------------------------------------------------------------------------------------------------------- 
*/

Disaggregating Annual Losses into Each Quarter

In loss forecasting, it is often necessary to disaggregate annual losses into quarters. The simplest method to convert a low frequency time series into a high frequency one is interpolation, such as the one implemented in the EXPAND procedure of SAS/ETS. In the example below, there is a series of annual loss projections from 2013 through 2016. An interpolation by the natural spline is used to convert the annual losses into quarterly ones.
SAS Code:

data annual;
  input loss year mmddyy8.;
  format year mmddyy8.;
datalines;
19270175 12/31/13
18043897 12/31/14
17111193 12/31/15
17011107 12/31/16
;
run;

proc expand data = annual out = quarterly from = year to = quarter;
  id year;
  convert loss / observed = total method = spline(natural);
run;

proc sql;
select 
  year(year) as year, 
  sum(case when qtr(year) = 1 then loss else 0 end) as qtr1,
  sum(case when qtr(year) = 2 then loss else 0 end) as qtr2,
  sum(case when qtr(year) = 3 then loss else 0 end) as qtr3,
  sum(case when qtr(year) = 4 then loss else 0 end) as qtr4,
  sum(loss) as total
from
  quarterly
group by
  calculated year;
quit;

Output:

    year      qtr1      qtr2      qtr3      qtr4     total

    2013   4868536   4844486   4818223   4738931  19270175
    2014   4560049   4535549   4510106   4438194  18043897
    2015   4279674   4276480   4287373   4267666  17111193
    2016   4215505   4220260   4279095   4296247  17011107

While the mathematical interpolation is easy to implement, it might be difficult to justify and interpret from the business standpoint. In reality, there might be an assumption that the loss trend follows the movement of the macro-economy. Therefore, it might be advantageous to disaggregate annual losses into quarterly ones with the inclusion of one or more economic indicators. This approach can be implemented with the tempdisagg package in R. Below is a demo with the same loss data used above; however, the disaggregation of annual losses is now accomplished based upon a macro-economic indicator.
R Code:

library(tempdisagg)

loss <- c(19270175, 18043897, 17111193, 17011107)
loss.a <- ts(loss, frequency = 1, start = 2013)

econ <- c(7.74, 7.67, 7.62, 7.48, 7.32, 7.11, 6.88, 6.63, 6.41, 6.26, 6.12, 6.01, 5.93, 5.83, 5.72, 5.59)
econ.q <- ts(econ, frequency = 4, start = 2013)

summary(mdl <- td(loss.a ~ econ.q))
print(predict(mdl))

Output:

Call:
td(formula = loss.a ~ econ.q)

Residuals:
Time Series:
Start = 2013
End = 2016
Frequency = 1
[1]  199753 -234384 -199257  233888

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2416610     359064   6.730   0.0214 *
econ.q        308226      53724   5.737   0.0291 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

'chow-lin-maxlog' disaggregation with 'sum' conversion
4 low-freq. obs. converted to 16 high-freq. obs.
Adjusted R-squared: 0.9141      AR1-Parameter:     0 (truncated)
        Qtr1    Qtr2    Qtr3    Qtr4
2013 4852219 4830643 4815232 4772080
2014 4614230 4549503 4478611 4401554
2015 4342526 4296292 4253140 4219235
2016 4302864 4272041 4238136 4198067

In practice, if a simple and flexible solution is desired without the need for interpretation, then the mathematical interpolation might be a good choice. On the other hand, if there is a strong belief that the macro-economy drives the loss trend, then the regression-based method implemented in the tempdisagg package might be preferred. However, in our example, both methods generate very similar results.

How to Construct Piecewise Linear Spline in SAS

options nocenter;

data tmp1;
  do i = 1 to 5000;
    x = ranuni(1);
    y = x + rannor(1) * 0.5;
    if x >= 0.3 then y = y + 6 * (x - 0.3);
    if x >= 0.6 then y = y - 10 * (x - 0.6);
    output;
  end;
run;
 
*** Manually Construct Piecewise Spline ***;
data tmp2;
  set tmp1;
  x1 = x;
  x2 = max(x - 0.3, 0);
  x3 = max(x - 0.6, 0);
run;

proc reg data = tmp2;
  model y = x1 - x3;
run;  
quit;
/*                   Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        0.02627        0.02432       1.08      0.2801
x1            1        0.81570        0.11576       7.05      <.0001
x2            1        6.29682        0.18477      34.08      <.0001
x3            1      -10.19025        0.14870     -68.53      <.0001
*/
    
*** Automatically Construct Piecewise Spline ***;
proc transreg data = tmp1 ss2;
  model identity(y) = pspline(x / knots = 0.3 0.6 degree = 1);
run;
/*                                   Type II
                                      Sum of       Mean
Variable        DF    Coefficient    Squares     Square    F Value    Pr > F    Label

Intercept        1       0.026272       0.28       0.28       1.17    0.2801    Intercept
Pspline.x_1      1       0.815702      12.12      12.12      49.65    <.0001    x 1      
Pspline.x_2      1       6.296817     283.50     283.50    1161.34    <.0001    x 2      
Pspline.x_3      1     -10.190247    1146.47    1146.47    4696.43    <.0001    x 3      
*/
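
For reference, the manual construction of the truncated-power basis above can also be sketched outside of SAS, e.g. with numpy; the snippet below is a minimal illustration of the same idea and is not part of the original SAS output.

import numpy as np

np.random.seed(1)
x = np.random.uniform(size = 5000)
y = x + np.random.normal(size = 5000) * 0.5 \
      + 6 * np.maximum(x - 0.3, 0) - 10 * np.maximum(x - 0.6, 0)

# piecewise linear spline basis with knots at 0.3 and 0.6
X = np.column_stack([np.ones_like(x), x, np.maximum(x - 0.3, 0), np.maximum(x - 0.6, 0)])

# least squares fit returns the intercept and the three spline coefficients
coef = np.linalg.lstsq(X, y, rcond = None)[0]
print(coef)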

Passing Many Parameter Values into A Macro Iteratively

In a production environment, we often need to call a SAS macro multiple times and pass a different parameter value into the macro in each call. For instance, we might have a dummy macro with only one parameter, as shown below.

%macro silly(parm = );
  %put ==> &parm;
%mend silly;

There might be a situation where we need to call the above macro hundreds of times and pass a different value to the macro parameter in each call. For example, we might need to pass each of the 1,000 values generated below into the above dummy macro.

data list;
  do i = 0 to 999;
    parm = "var"||put(i, z3.);
    output;
  end;
run;

The standard way to accomplish this task is to parse the values one by one from a long string holding the entire list and then to loop through the macro 1,000 times, as shown below. However, this approach is not only cumbersome to code but also computationally expensive.

*** method 1 ***;
%macro loop;

proc sql noprint;
  select parm into: parm separated by ' ' from list;
quit;

%let i = 1;
%do %while (%scan(&parm, &i) ne %str());
  %let var = %scan(&parm, &i);
  *** loop through each value on the list ***;
  %silly(parm = &var);
  %let i = %eval(&i + 1);
%end;

%mend loop;

%loop;

A sleeker way is to take advantage of the internal iteration scheme of the SAS data step and to call the macro iteratively with the CALL EXECUTE() routine, as demonstrated below. The code snippet is short and simple. More importantly, the run time of this new approach is approximately 4 to 5 times shorter than that of the standard method.

*** method 2 ***;
data _null_;
  set list;
  call execute('%silly(parm = '||parm||')');
run;

Lagrange Multiplier (LM) Test for Over-Dispersion

While Poisson regression is often used as a baseline model for count data, its assumption of equi-dispersion is too restrictive for many empirical applications. In practice, the variance of observed count data usually exceeds the mean, namely over-dispersion, due to unobserved heterogeneity and/or excess zeros. Similar to the consequences of heteroskedasticity in linear regression, over-dispersion in a Poisson regression leads to deflated standard errors of parameter estimates and therefore inflated t-statistics. After developing a Poisson regression, it is always sound practice to run an additional analysis for over-dispersion.

Below is a SAS macro to test for over-dispersion based upon the Lagrange Multiplier (LM) test introduced by William Greene (2002) in his famous “Econometric Analysis”. The statistic follows the chi-square distribution with 1 degree of freedom. The null hypothesis implies equi-dispersion in outcomes from the tested Poisson model.

%macro lm(data = , y = , pred_y = );
***************************************************;
* This macro is to test the over-dispersion based *;
* on outcomes from a poisson model                *;
*                            -- wensui.liu@53.com *;
***************************************************;
* parameters:                                     *;
*  data  : the input dataset                      *;
*  y     : observed count outcome                 *;
*  pred_y: predicted outcome from poisson model   *;
***************************************************;
* reference:                                      *;
*  w. greene (2002), econometric analysis         *;
***************************************************;

proc iml;
  use &data;
  read all var {&y} into y;
  read all var {&pred_y} into lambda;
  close &data;

  e = (y - lambda);
  n = nrow(y);
  ybar = y`[, :];
  LM = (e` * e - n * ybar) ** 2 / (2 * lambda` * lambda);
  Pvalue = 1 - probchi(LM, 1);
  title 'LM TEST FOR OVER-DISPERSION';
  print LM Pvalue;
  title;
quit;

***************************************************;
*                 end of macro                    *;
***************************************************;
%mend lm;

Next, a use case of the aforementioned LM test is demonstrated. First of all, a vector of Poisson outcomes is simulated with 10% excess zeros and therefore over-dispersion.

*** SIMULATE A POISSON VECTOR WITH EXCESSIVE ZEROS ***;
data one;
  do i = 1 to 1000;
    x = ranuni(i);
    if i <= 900 then y = ranpoi(i, exp(x * 2));
    else y = 0;
    output;
  end;
run;

A Poisson regression is estimated with the simulated count outcomes including excess zeros. After the predicted values are calculated, the LM test is used to test for over-dispersion. As shown below, the null hypothesis of equi-dispersion is rejected with an LM statistic of 31.18.

*** TEST DISPERSION WITH EXCESSIVE ZEROS ***;
ods listing close;
proc genmod data = one;
  model y =  x / dist = poisson;
  output out = out1 p = predicted;
run;
ods listing;

%lm(data = out1, y = y, pred_y = predicted);
/*
LM TEST FOR OVER-DISPERSION

       LM    PVALUE

31.182978 2.3482E-8
*/

Another Poisson regression is estimated with the simulated count outcomes excluding the 10% excess zeros. As expected, with outcomes from this newly estimated Poisson model, the null hypothesis of equi-dispersion is not rejected.

*** TEST DISPERSION WITHOUT EXCESSIVE ZEROS ***;
ods listing close;
proc genmod data = one;
  where i <= 900;
  model y =  x / dist = poisson;
  output out = out2 p = predicted;
run;
ods listing;

%lm(data = out2, y = y, pred_y = predicted);
/*
LM TEST FOR OVER-DISPERSION

       LM    PVALUE

 0.052131 0.8193959
*/
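
For readers outside of SAS, the LM statistic above is simple enough to cross-check with numpy. The sketch below assumes that y holds the observed counts and lam holds the predicted means from the Poisson model; it mirrors the IML calculation in the macro.

import numpy as np
from scipy.stats import chi2

def lm_test(y, lam):
  # LM = (e'e - n * ybar) ** 2 / (2 * lambda'lambda), chi-square with 1 degree of freedom
  e = y - lam
  n = len(y)
  lm = (e @ e - n * y.mean()) ** 2 / (2 * lam @ lam)
  return lm, chi2.sf(lm, 1)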

How to Score Outcomes from Count Models

When calculating the prediction from a count model, many people like to use the expected mean directly. However, from the business standpoint, it might be more appealing to calculate the probability of a specific count outcome. For instance, in retail banking, it is often of interest to know the probability of an account having one or more delinquencies and then to convert this probability to a certain score point. A widely accepted practice is to develop a logistic regression predicting the delinquent account, e.g. Y = 1 for delinquencies >= 1. However, it is also possible to develop a count model, e.g. negative binomial, predicting the number of delinquencies and then to estimate the probability of one or more delinquencies given the expected mean.

In the demonstration below, a scoring scheme for count models is shown. From the output, it is clear that the predictiveness of the negative binomial model is comparable to that of the logistic model in terms of KS and ROC statistics.

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

libname data 'C:\Users\liuwensui\projects\data';

%include 'C:\Users\liuwensui\projects\code\ks_macro.sas';

data tmp1;
  set data.credit_count;
  if majordrg = 0 then bad = 0;
  else bad = 1;
run;
    
proc logistic data = tmp1 desc;
  model bad = AGE ACADMOS ADEPCNT MINORDRG OWNRENT EXP_INC;
  score data = tmp1 out = logit_out1(rename = (p_1 = logit_prob1));
run;

proc genmod data = tmp1;
  model majordrg = AGE ACADMOS ADEPCNT MINORDRG OWNRENT EXP_INC / dist = nb;
  output out = nb_out1 p = yhat;
run;

data nb_out1;
  set nb_out1;
  nb_prob1 = 1 - pdf('negbinomial', 0, (1 / 4.0362) / (Yhat + (1 / 4.0362)), (1 / 4.0362));
run;

%separation(data = logit_out1, score = logit_prob1, y = bad);
/*
                            GOOD BAD SEPARATION REPORT FOR LOGIT_PROB1 IN DATA LOGIT_OUT1                           
                                     MAXIMUM KS = 35.5049 AT SCORE POINT 0.1773                                     
                     ( AUC STATISTICS = 0.7373, GINI COEFFICIENT = 0.4747, DIVERGENCE = 0.6511 )                    
                                                                                                                    
          MIN        MAX           GOOD        BAD      TOTAL               BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #       ODDS    RATE      BAD RATE  PERCENT      PERCENT
 -------------------------------------------------------------------------------------------------------------------
  BAD     0.3369     0.9998         557        787      1,344       0.71   58.56%      58.56%    30.73%      30.73% 
   |      0.2157     0.3369         944        401      1,345       2.35   29.81%      44.18%    15.66%      46.39% 
   |      0.1802     0.2157       1,039        305      1,344       3.41   22.69%      37.02%    11.91%      58.30% 
   |      0.1619     0.1802       1,099        246      1,345       4.47   18.29%      32.34%     9.61%      67.90% 
   |      0.1489     0.1619       1,124        220      1,344       5.11   16.37%      29.14%     8.59%      76.49% 
   |      0.1383     0.1489       1,171        174      1,345       6.73   12.94%      26.44%     6.79%      83.29% 
   |      0.1255     0.1383       1,213        131      1,344       9.26    9.75%      24.06%     5.12%      88.40% 
   |      0.1109     0.1255       1,254         91      1,345      13.78    6.77%      21.89%     3.55%      91.96% 
   V      0.0885     0.1109       1,246         98      1,344      12.71    7.29%      20.27%     3.83%      95.78% 
 GOOD     0.0001     0.0885       1,236        108      1,344      11.44    8.04%      19.05%     4.22%     100.00% 
       ========== ========== ========== ========== ==========                                                       
          0.0001     0.9998      10,883      2,561     13,444                                                       
*/
    
%separation(data = nb_out1, score = nb_prob1, y = bad);
/*
                               GOOD BAD SEPARATION REPORT FOR NB_PROB1 IN DATA NB_OUT1                              
                                     MAXIMUM KS = 35.8127 AT SCORE POINT 0.2095                                     
                     ( AUC STATISTICS = 0.7344, GINI COEFFICIENT = 0.4687, DIVERGENCE = 0.7021 )                    
                                                                                                                    
          MIN        MAX           GOOD        BAD      TOTAL               BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #       ODDS    RATE      BAD RATE  PERCENT      PERCENT
 -------------------------------------------------------------------------------------------------------------------
  BAD     0.2929     0.8804         561        783      1,344       0.72   58.26%      58.26%    30.57%      30.57% 
   |      0.2367     0.2929         944        401      1,345       2.35   29.81%      44.03%    15.66%      46.23% 
   |      0.2117     0.2367       1,025        319      1,344       3.21   23.74%      37.27%    12.46%      58.69% 
   |      0.1947     0.2117       1,106        239      1,345       4.63   17.77%      32.39%     9.33%      68.02% 
   |      0.1813     0.1947       1,131        213      1,344       5.31   15.85%      29.08%     8.32%      76.34% 
   |      0.1675     0.1813       1,191        154      1,345       7.73   11.45%      26.14%     6.01%      82.35% 
   |      0.1508     0.1675       1,208        136      1,344       8.88   10.12%      23.86%     5.31%      87.66% 
   |      0.1298     0.1508       1,247         98      1,345      12.72    7.29%      21.78%     3.83%      91.49% 
   V      0.0978     0.1297       1,242        102      1,344      12.18    7.59%      20.21%     3.98%      95.47% 
 GOOD     0.0000     0.0978       1,228        116      1,344      10.59    8.63%      19.05%     4.53%     100.00% 
       ========== ========== ========== ========== ==========                                                       
          0.0000     0.8804      10,883      2,561     13,444                                                       
*/
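
As a footnote, the conversion from the expected mean to the probability of one or more delinquencies can be cross-checked outside of SAS. The sketch below assumes the same dispersion estimate 1 / 4.0362 reported by PROC GENMOD and mirrors 1 - pdf('negbinomial', 0, k / (yhat + k), k) in the data step above, using the closed form P(Y = 0) = (k / (k + mu)) ** k for a negative binomial with mean mu and size k.

def prob_one_or_more(mu, k = 1 / 4.0362):
  # P(Y >= 1) = 1 - P(Y = 0) under a negative binomial with mean mu and size k
  return 1 - (k / (k + mu)) ** k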

A SAS Macro for Breusch-Pagan Test

In SAS, the Breusch-Pagan test for heteroscedasticity in a linear regression can be conducted with the MODEL procedure in SAS/ETS, as shown in the code snippet below.

data one;
  do i = 1 to 100;
    x1 = uniform(1);
    x2 = uniform(2);
    r  = normal(1) * 0.1;
    if x2 > 0.5 then r = r * 2;
    y = 10 + 8 * x1 + 6 * x2 + r;
    output;
  end;
run;

proc model data = one;
  parms b0 b1 b2;
  y = b0 + b1 * x1 + b2 * x2;
  fit y / breusch = (1 x1 x2);
run;
/*
                                 Heteroscedasticity Test
Equation        Test               Statistic     DF    Pr > ChiSq    Variables

y               Breusch-Pagan          10.44      2        0.0054    1, x1, x2           
*/

However, in a forecasting model that I have recently been working on, I find it inconvenient to run PROC MODEL every time I want to do a Breusch-Pagan test. I would rather have a more generic solution that is not tied to a specific SAS module or procedure and that only needs a minimal set of inputs instead of a fully specified model. As a result, I drafted a simple SAS macro for the Breusch-Pagan test, which gives results identical to those from the MODEL procedure. Hopefully, others might find this macro useful as well.

proc reg data = one;
  model y = x1 x2;
  output out = two r = resid;
run;

%macro hetero_bp(r = , x = , data = );
***********************************************************;
* THE SAS MACRO IS TO CALCULATE THE BREUSCH-PAGAN TEST    *;
* FOR HETEROSKEDASTICITY                                  *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  R   : RESIDUAL VALUES FROM A LINEAR REGRESSION         *;
*  X   : A LIST OF NUMERIC VARIABLES TO MODEL ERROR       *;
*        VARIANCE IN THE BREUSCH-PAGAN TEST               *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;
    
data _data_(keep = r2 &x);
  set &data;
  where &r ~= .;
  r2 = &r ** 2;
run;

ods output nobs = _nobs_;
ods output anova = _anova_;
ods output fitstatistics = _fits_;
ods listing close;
proc reg data = _last_;
  model r2 = &x;
run;
ods listing;

proc sql noprint;
  select distinct NObsUsed into :n from _nobs_;
  select df into :df from _anova_ where upcase(compress(source, ' ')) = 'MODEL';
  select nvalue2 into :r2 from _fits_ where upcase(compress(label2, ' ')) = 'R-SQUARE';
quit;
%put &r2;

data _result_;
  chi_square = &r2 * &n;
  df         = &df;
  p_value    = 1 - probchi(chi_square, df);
run;

proc report data = _last_ spacing = 1 headline nowindows split = "*";
  column(" * BREUSCH-PAGEN TEST FOR HETEROSKEDASTICITY * "
          chi_square df p_value);
  define chi_square  / "CHI-SQUARE"  width = 15;
  define df          / "DF"          width = 5;
  define p_value     / "P-VALUE"     width = 15 format = 12.8;
run;

proc datasets library = work;
  delete _: / memtype = data;
run;  
      
%mend hetero_bp;

%hetero_bp(r = resid, x = x1 x2, data = two);
/*
 BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY
                                          
      CHI-SQUARE    DF         P-VALUE    
 -----------------------------------------
         10.4389     2      0.00541030    
*/
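
To see what the macro is doing under the hood, the same n * R-squared statistic can be sketched with numpy by regressing the squared residuals on the heteroskedasticity drivers. The snippet below is an illustration only, assuming resid is a 1-D array of residuals and X is a 2-D array of the driver variables.

import numpy as np
from scipy.stats import chi2

def breusch_pagan(resid, X):
  # regress squared residuals on [1, X] and form the LM statistic n * R-squared
  r2 = resid ** 2
  Z = np.column_stack([np.ones(len(r2)), X])
  fitted = Z @ np.linalg.lstsq(Z, r2, rcond = None)[0]
  rsq = 1 - ((r2 - fitted) ** 2).sum() / ((r2 - r2.mean()) ** 2).sum()
  lm = len(r2) * rsq
  return lm, chi2.sf(lm, X.shape[1])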

Removing Records by Duplicate Values

Removing records from a data table based on duplicate values in one or more columns is a simple but important data cleaning technique. Below is an example of how to accomplish this task in SAS, R, and Python respectively.

SAS Example

data _data_;
  input label $ value;
datalines;
A     4
B     3
C     6
B     3
B     1
A     2
A     4
A     4
;
run;

proc sort data = _last_;
  by label value;
run;

data _data_;
  set _last_;
  by label;
  if first.label then output;
run;

proc print data = _last_ noobs;
run;

/* OUTPUT:
label    value
  A        2  
  B        1  
  C        6 
*/

R Example

> # INPUT DATA INTO THE CONSOLE
> df <- read.table(header = T, text = '
+  label value
+      A     4
+      B     3
+      C     6
+      B     3
+      B     1
+      A     2
+      A     4
+      A     4
+ ')
> # SORT DATA FRAME BY COLUMNS
> df2 <- df[order(df$label, df$value), ]
> print(df2)
  label value
6     A     2
1     A     4
7     A     4
8     A     4
5     B     1
2     B     3
4     B     3
3     C     6
> # DEDUP RECORDS
> df3 <- df2[!duplicated(df2$label), ]
> print(df3)
  label value
6     A     2
5     B     1
3     C     6

Python Example

In [1]: import pandas as pd

In [2]: # INPUT DATA INTO DATAFRAME

In [3]: df = pd.DataFrame({'label': ['A', 'B', 'C'] + ['B'] * 2 + ['A'] * 3, 'value': [4, 3, 6, 3, 1, 2, 4, 4]})

In [4]: # SORT DATA BY COLUMNS

In [5]: df2 = df.sort(['label', 'value'])

In [6]: print(df2)
  label  value
5     A      2
0     A      4
6     A      4
7     A      4
4     B      1
1     B      3
3     B      3
2     C      6

In [7]: # DEDUP RECORDS

In [8]: df3 = df2.drop_duplicates(['label'])

In [9]: print(df3)
  label  value
5     A      2
4     B      1
2     C      6
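
One caveat for the Python example above: the DataFrame.sort() method comes from an older pandas release. In more recent versions of pandas, the equivalent calls would be sort_values() and drop_duplicates(), e.g.:

import pandas as pd

df = pd.DataFrame({'label': ['A', 'B', 'C'] + ['B'] * 2 + ['A'] * 3,
                   'value': [4, 3, 6, 3, 1, 2, 4, 4]})

# sort by columns and keep the first record within each label
df3 = df.sort_values(['label', 'value']).drop_duplicates(['label'])
print(df3)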

Composite Conditional Mean and Variance Modeling in Time Series

In time series analysis, it is often necessary to model the conditional mean and the conditional variance simultaneously, a practice known as composite modeling. For instance, while the conditional mean is an AR(1) model, the conditional variance can be a GARCH(1, 1) model.

In the SAS/ETS module, it is convenient to build such composite models with the AUTOREG procedure if the conditional mean specification is as simple as the one shown below.

data garch1;
  lu = 0;
  lh = 0;
  do i = 1 to 5000;
    x = ranuni(1);
    h = 0.3 + 0.4 * lu ** 2 + 0.5 * lh;
    u = sqrt(h) * rannor(1);
    y = 1 + 3 * x + u;
    lu = u;
    lh = h;
    output;
  end;
run;

proc autoreg data = _last_;
  model y = x / garch = (p = 1, q = 1);
run;
/*
                                    Standard                 Approx
Variable        DF     Estimate        Error    t Value    Pr > |t|

Intercept        1       1.0125       0.0316      32.07      <.0001
x                1       2.9332       0.0536      54.72      <.0001
ARCH0            1       0.2886       0.0256      11.28      <.0001
ARCH1            1       0.3881       0.0239      16.22      <.0001
GARCH1           1       0.5040       0.0239      21.10      <.0001
*/

However, when the conditional mean has a more complex structure, then the MODEL procedure should be used instead of AUTOREG. Below is a good example showing the flexibility of the MODEL procedure. In the demonstration, the conditional mean is an ARMA(1, 1) model and the conditional variance is a GARCH(1, 1) model.

data garch2;
  lu = 0;
  lh = 0;
  ly = 0;
  do i = 1 to 5000;
    x = ranuni(1);
    h = 0.3 + 0.4 * lu ** 2 + 0.5 * lh;
    u = sqrt(h) * rannor(1);
    y = 1 + 3 * x + 0.6 * (ly - 1) + u - 0.7 * lu;
    lu = u;
    lh = h;
    ly = y;
    output;
  end;
run;

proc model data = _last_;
  parms mu x_beta ar1 ma1 arch0 arch1 garch1;
  y = mu + x_beta * x + ar1 * zlag1(y - mu) + ma1 * zlag1(resid.y);
  h.y = arch0 + arch1 * xlag(resid.y ** 2, mse.y) +
        garch1 * xlag(h.y, mse.y);
  fit y / method = marquardt fiml;
run;
/*
                              Approx                  Approx
Parameter       Estimate     Std Err    t Value     Pr > |t|

mu              0.953905      0.0673      14.18       <.0001
x_beta           2.92509      0.0485      60.30       <.0001
ar1             0.613025     0.00819      74.89       <.0001
ma1             0.700154      0.0126      55.49       <.0001
arch0           0.288948      0.0257      11.26       <.0001
arch1           0.387436      0.0238      16.28       <.0001
garch1          0.504588      0.0237      21.26       <.0001
*/

Marginal Effects in Two-Part Fractional Models

As shown in “Two-Part Fractional Model” posted on 09/25/2012, it might sometimes be beneficial to model fractional outcomes in the range of [0, 1] with composite models, e.g. a two-part model, especially when there is a non-trivial number of boundary outcomes. However, the marginal effect of X in a two-part model is not as straightforward to calculate as in the one-part model shown in “Marginal Effects in Tobit Models” posted on 10/06/2012.

In the demonstration below, I will show how to calculate the marginal effect of X in a two-part model with a logic similar to the McDonald and Moffitt decomposition.

proc nlmixed data = one tech = congra maxiter = 1000;
  parms b10 = -9.3586 b11 = -0.0595 b12 =  1.7644 b13 =  0.5994 b14 = -2.5496
        b15 = -0.0007 b16 = -0.0011 b17 = -1.6359
        b20 =  0.3401 b21 =  0.0274 b22 =  0.1437 b23 =  0.0229 b24 =  0.4656
        b25 =  0.0011 b26 =  0.0021 b27 =  0.1977  s  =  0.2149;
  logit_xb = b10 + b11 * x1 + b12 * x2 + b13 * x3 + b14 * x4 +
             b15 * x5 + b16 * x6 + b17 * x7;
  nls_xb = b20 + b21 * x1 + b22 * x2 + b23 * x3 + b24 * x4 +
           b25 * x5 + b26 * x6 + b27 * x7;
  p1 = 1 / (1 + exp(-logit_xb));
  p2 = 1 / (1 + exp(-nls_xb));
  if y = 0 then ll = log(1 - p1);
  else ll = log(p1) + log(pdf('normal', y, p2, s));
  model y ~ general(ll);
  predict logit_xb out = out_1 (rename = (pred = part1_xb) keep = _id_ pred y);
  predict p1       out = out_2 (rename = (pred = part1_p)  keep = _id_ pred);
  predict nls_xb   out = out_3 (rename = (pred = part2_xb) keep = _id_ pred);
  predict p2       out = out_4 (rename = (pred = part2_p)  keep = _id_ pred);
run;

data out;
  merge out_1 out_2 out_3 out_4;
  by _id_;

  margin1_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -0.05948) * part2_p;
  margin1_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.01115) * part1_p;
  x1_margin = margin1_part1 + margin1_part2;

  margin2_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * 1.7645) * part2_p;
  margin2_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.4363) * part1_p;
  x2_margin = margin2_part1 + margin2_part2;

  margin3_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * 0.5994) * part2_p;
  margin3_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.1139) * part1_p;
  x3_margin = margin3_part1 + margin3_part2;

  margin4_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -2.5496) * part2_p;
  margin4_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -2.8755) * part1_p;
  x4_margin = margin4_part1 + margin4_part2;

  margin5_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -0.00071) * part2_p;
  margin5_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * 0.004091) * part1_p;
  x5_margin = margin5_part1 + margin5_part2;

  margin6_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -0.00109) * part2_p;
  margin6_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.00839) * part1_p;
  x6_margin = margin6_part1 + margin6_part2;

  margin7_part1 = (exp(part1_xb) / ((1 + exp(part1_xb)) ** 2) * -1.6359) * part2_p;
  margin7_part2 = (exp(part2_xb) / ((1 + exp(part2_xb)) ** 2) * -0.1666) * part1_p;
  x7_margin = margin7_part1 + margin7_part2;
run;

proc means data = _last_ mean;
  var x:;
run;
/*
Variable                 Mean

x1_margin          -0.0039520
x2_margin           0.0739847
x3_margin           0.0270673
x4_margin          -0.3045967
x5_margin         0.000191015
x6_margin        -0.000533998
x7_margin          -0.1007960
*/
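
In compact form, the marginal effect of the i-th predictor in the two-part model is the sum of the two pieces computed in the data step above: the logistic density of each linear predictor times the corresponding coefficient, weighted by the predicted value from the other part. Below is a small numpy sketch of that formula; the function and variable names are mine, not from the original code.

import numpy as np

def logistic_pdf(xb):
  return np.exp(xb) / (1 + np.exp(xb)) ** 2

def two_part_margin(part1_xb, part2_xb, b1_i, b2_i):
  # part 1: change in P(y > 0) scaled by the expected positive outcome
  # part 2: change in the expected positive outcome scaled by P(y > 0)
  p1 = 1 / (1 + np.exp(-part1_xb))
  p2 = 1 / (1 + np.exp(-part2_xb))
  return logistic_pdf(part1_xb) * b1_i * p2 + logistic_pdf(part2_xb) * b2_i * p1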

Modeling Heteroscedasticity Directly in NLS

****** nonlinear least square regression without heteroscedasticity ******;
proc nlmixed data = data.sme tech = congra;
  where y > 0  and y < 1;
  parms b0 =  1.78 b1 = -0.01 b2 = -0.43 b3 = -0.11 b4 = -2.93
        b5 =  0.01 b6 = -0.01 b7 = -0.17; 
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 +
       b5 * x5 + b6 * x6 + b7 * x7;
  mu = 1 / (1 + exp(-xb));
  lh = pdf('normal', y, mu, s);
  ll = log(lh);
  model y ~ general(ll);
run;
/*
             Fit Statistics
-2 Log Likelihood                 -264.5
AIC (smaller is better)           -246.5
AICC (smaller is better)          -246.3
BIC (smaller is better)           -201.4

                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|    
b0            1.7813     0.3400   1116      5.24     <.0001   
b1          -0.01203    0.02735   1116     -0.44     0.6602  
b2           -0.4305     0.1437   1116     -3.00     0.0028    
b3           -0.1147    0.02287   1116     -5.02     <.0001   
b4           -2.9302     0.4657   1116     -6.29     <.0001    
b5          0.004095   0.001074   1116      3.81     0.0001   
b6          -0.00839   0.002110   1116     -3.98     <.0001    
b7           -0.1710     0.1977   1116     -0.87     0.3871   
s             0.2149   0.004549   1116     47.24     <.0001  
*/
 
****** nonlinear least square regression with heteroscedasticity ******;
proc nlmixed data = data.sme tech = congra;
  where y > 0 and y < 1;
  parms b0 =  1.78 b1 = -0.01 b2 = -0.43 b3 = -0.11 b4 = -2.93
        b5 =  0.01 b6 = -0.01 b7 = -0.17 s  =  0.21
        a1 = -0.13 a2 = -15.62 a3 = 0.09 a4 = -1.27 a5 = 0.01
        a6 = -0.02 a7 =  0.47;
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 +
       b5 * x5 + b6 * x6 + b7 * x7;
  xa = a1 * x1 + a2 * x2 + a3 * x3 + a4 * x4 +
       a5 * x5 + a6 * x6 + a7 * x7;
  mu = 1 / (1 + exp(-xb));
  si = (s ** 2 * (1 + exp(xa))) ** 0.5;       
  lh = pdf('normal', y, mu, si);
  ll = log(lh);
  model y ~ general(ll);
run;
/*
             Fit Statistics
-2 Log Likelihood                 -325.9
AIC (smaller is better)           -293.9
AICC (smaller is better)          -293.4
BIC (smaller is better)           -213.6

                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|    
b0            2.0343     0.3336   1116      6.10     <.0001     
b1          0.003764    0.02408   1116      0.16     0.8758    
b2          -0.08544     0.1501   1116     -0.57     0.5693     
b3           -0.1495    0.02263   1116     -6.61     <.0001    
b4           -2.6251     0.4379   1116     -6.00     <.0001    
b5          0.003331   0.001115   1116      2.99     0.0029    
b6          -0.00644   0.001989   1116     -3.24     0.0012     
b7           -0.1836     0.1938   1116     -0.95     0.3436    
s             0.1944   0.005067   1116     38.35     <.0001    
a1           -0.1266     0.3389   1116     -0.37     0.7088     
a2          -15.6169     5.2424   1116     -2.98     0.0030     
a3           0.09074    0.03282   1116      2.76     0.0058     
a4           -1.2681     3.8044   1116     -0.33     0.7389    
a5          0.007411   0.005267   1116      1.41     0.1597    
a6          -0.01738    0.01527   1116     -1.14     0.2550    
a7            0.4805     1.1232   1116      0.43     0.6689    
*/

Marginal Effects in Tobit Models

proc qlim data = data.sme;
  model y = x1 - x7;
  endogenous y ~ censored(lb = 0 ub = 1);
  output out = out1 marginal;
run;
/*
                                 Standard                 Approx
Parameter        Estimate           Error    t Value    Pr > |t|
Intercept       -2.204123        0.118473     -18.60      <.0001
x1              -0.015086        0.008345      -1.81      0.0707
x2               0.376830        0.048772       7.73      <.0001
x3               0.141672        0.008032      17.64      <.0001
x4              -0.813496        0.133564      -6.09      <.0001
x5            0.000036437        0.000320       0.11      0.9094
x6              -0.001152        0.000704      -1.64      0.1016
x7              -0.392152        0.060902      -6.44      <.0001
_Sigma           0.497938        0.012319      40.42      <.0001
*/

proc means data = out1 mean;
  var meff_x:;
run;
/* Auto Calculation:
Variable    Label                                 Mean
------------------------------------------------------
Meff_x1     Marginal effect of x1 on y      -0.0036988
Meff_x2     Marginal effect of x2 on y       0.0923919
Meff_x3     Marginal effect of x3 on y       0.0347354
Meff_x4     Marginal effect of x4 on y      -0.1994545
Meff_x5     Marginal effect of x5 on y    8.9337756E-6
Meff_x6     Marginal effect of x6 on y    -0.000282493
Meff_x7     Marginal effect of x7 on y      -0.0961485
*/

data one;
  set data.sme;
  
  _xb_ = -2.204123 + x1 * -0.015086 + x2 * 0.376830 + x3 * 0.141672 + x4 * -0.813496 + 
         x5 * 0.000036437 + x6 * -0.001152 + x7 * -0.392152;
  _phi_lb = probnorm((0 - _xb_) / 0.497938);
  _phi_ub = probnorm((1 - _xb_) / 0.497938);
  _pdf_lb = pdf('normal', (0 - _xb_) / 0.497938);
  _pdf_ub = pdf('normal', (1 - _xb_) / 0.497938);
  _imr = (_pdf_lb - _pdf_ub) / (_phi_ub - _phi_lb);
  _margin_x1 = (_phi_ub - _phi_lb) * -0.015086;
  _margin_x2 = (_phi_ub - _phi_lb) * 0.376830;
  _margin_x3 = (_phi_ub - _phi_lb) * 0.141672;
  _margin_x4 = (_phi_ub - _phi_lb) * -0.813496;
  _margin_x5 = (_phi_ub - _phi_lb) * 0.000036437;
  _margin_x6 = (_phi_ub - _phi_lb) * -0.001152;
  _margin_x7 = (_phi_ub - _phi_lb) * -0.392152;
run;

proc means data = one mean;
  var _margin_x:;
run; 
/* Manual Calculation:
Variable              Mean
--------------------------
_margin_x1      -0.0036988
_margin_x2       0.0923924
_margin_x3       0.0347356
_margin_x4      -0.1994555
_margin_x5    8.9337391E-6
_margin_x6    -0.000282451
_margin_x7      -0.0961491
*/
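
The manual calculation above follows the standard result that, for a Tobit model censored at 0 and 1, the marginal effect on the observed outcome is the coefficient scaled by the probability of being uncensored, i.e. PHI((1 - xb) / sigma) - PHI((0 - xb) / sigma). A short scipy sketch of the same scaling, using the sigma estimate 0.497938 from the QLIM output, is given below for illustration.

from scipy.stats import norm

def tobit_margin(xb, beta_i, sigma = 0.497938):
  # probability of the latent outcome falling strictly inside the censoring bounds [0, 1]
  p_uncensored = norm.cdf((1 - xb) / sigma) - norm.cdf((0 - xb) / sigma)
  return p_uncensored * beta_i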

Marginal Effects (on Binary Outcome)

In regression models, the marginal effect of an explanatory variable X is the partial derivative of the prediction with respect to X and measures the expected change in the response variable per unit change in X with the other explanatory variables held constant. In the interpretation of a regression model, presenting marginal effects often brings more information than looking at coefficients alone. Below, I will use two types of regression models for binary outcomes, i.e. logit and probit, to show that although the coefficients estimated from the same set of Xs might differ substantially between the two models, the marginal effects of each X in both models actually look very similar.

As shown below, parameter estimates from logit and probit look very different due to their different model specifications and assumptions. As a result, it is not possible to compare the effect and sensitivity of each predictor across the two models directly from the coefficients.

proc qlim data = one;
  model bad = bureau_score ltv / discrete(d = logit);
  output out = out1 marginal;
run;
/* logit estimates
                                    Standard                 Approx
Parameter           Estimate           Error    t Value    Pr > |t|
Intercept           7.080229        0.506910      13.97      <.0001
bureau_score       -0.016705        0.000735     -22.74      <.0001
ltv                 0.028055        0.002361      11.88      <.0001
*/

proc qlim data = one;
  model bad = bureau_score ltv / discrete(d = probit);
  output out = out2 marginal;
run;
/* probit estimates
                                    Standard                 Approx
Parameter           Estimate           Error    t Value    Pr > |t|
Intercept           4.023515        0.285587      14.09      <.0001
bureau_score       -0.009500        0.000403     -23.56      <.0001
ltv                 0.015690        0.001316      11.93      <.0001
*/

However, are these two models really so different from each other? Comparing marginal effects instead of parameter estimates might be able to bring us more useful information.

proc means data = out1 mean;
  var meff_p2_:;
run;
/* marginal effects from logit
Variable                            Mean
----------------------------------------
Meff_P2_bureau_score          -0.0022705
Meff_P2_ltv                    0.0038132
----------------------------------------
*/

proc means data = out2 mean;
  var meff_p2_:;
run;
/* marginal effects from probit
Variable                            Mean
----------------------------------------
Meff_P2_bureau_score          -0.0022553
Meff_P2_ltv                    0.0037249
----------------------------------------
*/

It turns out that the marginal effects of each predictor from the two models are reasonably close.

Although it is easy to calculate marginal effects with the SAS QLIM procedure, it might still be better to understand the underlying math and then compute them yourself with SAS data steps. Below is a demo on how to manually calculate the marginal effects of a logit model following the formula:
MF_x_i = EXP(XB) / ((1 + EXP(XB)) ^ 2) * beta_i for the ith predictor.

proc logistic data = one desc;
  model bad = bureau_score ltv;
  output out = out3 xbeta = xb;
run;
/* model estimates:
                                  Standard          Wald
Parameter       DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept        1      7.0802      0.5069      195.0857        <.0001
bureau_score     1     -0.0167    0.000735      516.8737        <.0001
ltv              1      0.0281     0.00236      141.1962        <.0001
*/

data out3;
  set out3;
  margin_bureau_score = exp(xb) / ((1 + exp(xb)) ** 2) * (-0.0167);
  margin_ltv = exp(xb) / ((1 + exp(xb)) ** 2) * (0.0281);
run;

proc means data = out3 mean;
  var margin_bureau_score margin_ltv;
run;
/* manual calculated marginal effects:
Variable                       Mean
-----------------------------------
margin_bureau_score      -0.0022698
margin_ltv                0.0038193
-----------------------------------
*/
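
The same average marginal effect can also be reproduced in a couple of lines with numpy, assuming xb is an array holding the linear predictor from the fitted logit model; this is simply a vectorized version of the data step above.

import numpy as np

def avg_logit_margin(xb, beta_i):
  # average of exp(xb) / (1 + exp(xb)) ** 2 * beta_i across all observations
  return np.mean(np.exp(xb) / (1 + np.exp(xb)) ** 2 * beta_i)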

A SAS Macro for Scorecard Performance Evaluation

%macro separation(data = , score = , y = );
***********************************************************;
* THE MACRO IS TO EVALUATE THE SEPARATION POWER OF A      *;
* SCORECARD                                               *; 
* ------------------------------------------------------- *;
* PARAMETERS:                                             *;
*  DATA : INPUT DATASET                                   *;
*  SCORE: SCORECARD VARIABLE                              *;
*  Y    : RESPONSE VARIABLE IN (0, 1)                     *;
* ------------------------------------------------------- *;
* OUTPUTS:                                                *;
*  SEPARATION_REPORT.TXT                                  *;
*  A SEPARATION SUMMARY REPORT IN TXT FORMAT              *;
*  NAMED AS THE ABOVE WITH PREDICTIVE MEASURES INCLUDING  *;
*  KS, AUC, GINI, AND DIVERGENCE                          *;
* ------------------------------------------------------- *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nonumber nodate orientation = landscape linesize = 160 nocenter 
        formchar = "|----|+|---+=|-/\<>*" formdlim=' ' ls = 150; 

*** DEFAULT GROUP NUMBER FOR REPORT ***;
%let grp = 10;

data _tmp1 (keep = &y &score);
  set &data;
  where &y in (1, 0)  and &score ~= .;
run;

filename lst_out temp;

proc printto new print = lst_out;
run;

*** CONDUCT NON-PARAMETRIC TESTS ***; 
ods output wilcoxonscores = _wx;
ods output kolsmir2stats = _ks;
proc npar1way wilcoxon edf data = _tmp1;
  class &y;
  var &score;
run;

proc printto;
run;

proc sort data = _wx;
  by class;
run;

*** CALCULATE ROC AND GINI ***;
data _null_;
  set _wx end = eof;
  by class;

  array a{2, 3} _temporary_;
  if _n_ = 1 then do;
    a[1, 1] = n;
    a[1, 2] = sumofscores;
    a[1, 3] = expectedsum;
  end;
  else do;
    a[2, 1] = n;
  end;
  if eof then do;
    auc  = (a[1, 2] - a[1, 3]) / (a[1, 1] * a[2, 1])  + 0.5;
    if auc <= 0.5 then auc = 1 - auc;
    gini = 2 * (auc - 0.5);  
    call execute('%let auc = '||put(auc, 10.4)||';');
    call execute('%let gini = '||put(gini, 10.4)||';');
  end;
run;

*** CALCULATE KS ***;
data _null_;
  set _ks;

  if _n_ = 1 then do;
    ks = nvalue2 * 100;
    call execute('%let ks = '||put(ks, 10.4)||';');
  end;
run;

*** CAPTURE SCORE POINT FOR MAX KS ***;
data _null_;
  infile lst_out;
  input @" at Maximum = " ks_score;
  output;
  call execute('%let ks_score = '||put(ks_score, 10.4)||';');
  stop;
run;

proc summary data = _tmp1 nway;
  class &y;
  output out = _data_ (drop = _type_ _freq_)
  mean(&score.) = mean var(&score.) = variance;
run;

*** CALCULATE DIVERGENCE ***;
data _null_;
  set _last_ end = eof;
  array a{2, 2} _temporary_;
  if _n_ = 1 then do;
    a[1, 1] = mean;
    a[1, 2] = variance;
  end;
  else do;
    a[2, 1] = mean;
    a[2, 2] = variance;
  end;
  if eof then do;
    divergence = (a[1, 1] - a[2, 1]) ** 2 / ((a[1, 2] + a[2, 2]) / 2);
    call execute('%let divergence = '||put(divergence, 10.4)||';'); 
  end; 
run;

*** CAPTURE THE DIRECTION OF SCORE ***;
ods listing close;
ods output spearmancorr = _cor;
proc corr data = _tmp1 spearman;
  var &y;
  with &score;
run;
ods listing;
 
data _null_;
  set _cor;
  if &y >= 0 then do;
    call symput('desc', 'descending');
  end;
  else do;
    call symput('desc', ' ');
  end;
run;
%put &desc;

proc rank data = _tmp1 out = _tmp2 groups = &grp ties = low;
  var &score;
  ranks rank;
run;

proc summary data = _last_ nway;
  class rank;
  output out = _data_ (drop = _type_ rename = (_freq_ = freq))
  min(&score) = min_score max(&score) = max_score
  sum(&y) = bads;
run;

proc sql noprint;
  select sum(bads) into :bcnt from _last_;
  select sum(freq) - sum(bads) into :gcnt from _last_;
quit;

proc sort data = _last_ (drop = rank);
  by &desc min_score;
run;

data _data_;
  set _last_;
  by &desc min_score;

  i + 1; 
  percent = i / 100; 
  good  = freq - bads;
  odds  = good / bads;

  hit_rate = bads / freq;
  retain cum_bads cum_freq;
  cum_bads + bads;
  cum_freq + freq;
  cum_hit_rate = cum_bads / cum_freq;

  cat_rate = bads / &bcnt;
  retain cum_cat_rate;
  cum_cat_rate + cat_rate; 

  format symbol $4.;
  if i = 1 then symbol = 'BAD';
  else if i = &grp - 1 then symbol = 'V';
  else if i = &grp then symbol = 'GOOD';
  else symbol = '|';
run;

proc printto print = "%upcase(%trim(&score))_SEPARATION_REPORT.TXT" new;
run;

proc report data = _last_ spacing = 1 split = "/" headline nowd;
  column("GOOD BAD SEPARATION REPORT FOR %upcase(%trim(&score)) IN DATA %upcase(%trim(&data))/
          MAXIMUM KS = %trim(&ks) AT SCORE POINT %trim(&ks_score)/   
          ( AUC STATISTICS = %trim(&auc), GINI COEFFICIENT = %trim(&gini), DIVERGENCE = %trim(&divergence) )/ /"
         percent symbol min_score max_score good bads freq odds hit_rate cum_hit_rate cat_rate cum_cat_rate);

  define percent      / noprint order order = data;
  define symbol       / "" center               width = 5 center;
  define min_score    / "MIN/SCORE"             width = 10 format = 9.4        analysis min center;
  define max_score    / "MAX/SCORE"             width = 10 format = 9.4        analysis max center;
  define good         / "GOOD/#"                width = 10 format = comma9.    analysis sum;
  define bads         / "BAD/#"                 width = 10 format = comma9.    analysis sum;
  define freq         / "TOTAL/#"               width = 10 format = comma9.    analysis sum;
  define odds         / "ODDS"                  width = 10 format = 8.2        order;
  define hit_rate     / "BAD/RATE"              width = 10 format = percent9.2 order center;
  define cum_hit_rate / "CUMULATIVE/BAD RATE"   width = 10 format = percent9.2 order;
  define cat_rate     / "BAD/PERCENT"           width = 10 format = percent9.2 order center;
  define cum_cat_rate / "CUMU. BAD/PERCENT"     width = 10 format = percent9.2 order; 

  rbreak after / summarize dol skip;
run; 

proc printto;
run;

***********************************************************;
*                     END OF THE MACRO                    *;
***********************************************************; 
%mend separation;

libname data 'C:\Documents and Settings\liuwensui\Desktop\fraction_models\test';

%separation(data = data.accepts, score = bureau_score, y = bad);

Sample Output:

                          GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA DATA.ACCEPTS                          
                                    MAXIMUM KS = 35.5477 AT SCORE POINT 677.0000                                    
                     ( AUC STATISTICS = 0.7389, GINI COEFFICIENT = 0.4778, DIVERGENCE = 0.8027 )                    
                                                                                                                    
          MIN        MAX           GOOD        BAD      TOTAL               BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #       ODDS    RATE      BAD RATE  PERCENT      PERCENT
 -------------------------------------------------------------------------------------------------------------------
  BAD   443.0000   620.0000         310        252        562       1.23   44.84%      44.84%    23.10%      23.10% 
   |    621.0000   645.0000         365        201        566       1.82   35.51%      40.16%    18.42%      41.52% 
   |    646.0000   661.0000         359        173        532       2.08   32.52%      37.71%    15.86%      57.38% 
   |    662.0000   677.0000         441        125        566       3.53   22.08%      33.74%    11.46%      68.84% 
   |    678.0000   692.0000         436         99        535       4.40   18.50%      30.79%     9.07%      77.91% 
   |    693.0000   708.0000         469         89        558       5.27   15.95%      28.29%     8.16%      86.07% 
   |    709.0000   725.0000         492         66        558       7.45   11.83%      25.92%     6.05%      92.12% 
   |    726.0000   747.0000         520         42        562      12.38    7.47%      23.59%     3.85%      95.97% 
   V    748.0000   772.0000         507         30        537      16.90    5.59%      21.64%     2.75%      98.72% 
 GOOD   773.0000   848.0000         532         14        546      38.00    2.56%      19.76%     1.28%     100.00% 
       ========== ========== ========== ========== ==========                                                       
        443.0000   848.0000       4,431      1,091      5,522 
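
For a quick sanity check of the headline statistics outside of SAS, the KS, AUC, and Gini can be computed with scikit-learn in a few lines. The sketch below is an illustration only, assuming y and score are arrays holding the response and the scorecard variable.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def separation_stats(y, score):
  fpr, tpr, _ = roc_curve(y, score)
  ks = np.max(np.abs(tpr - fpr)) * 100
  auc = roc_auc_score(y, score)
  auc = max(auc, 1 - auc)      # mirror the macro's handling of reversed scores
  gini = 2 * auc - 1
  return ks, auc, gini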

Two-Part Fractional Model

1. Two-step estimation method:
– SAS code

data one;
  set data.sme;
  _id_ + 1;
  if y = 0 then y2 = 0;
  else y2 = 1;
run;

proc logistic data = one desc;
  model y2 = x1 - x7;
run;

proc nlmixed data = one tech = congra;
  where y2 = 1;
  parms b0 =  2.14 b1 = -0.01 b2 = -0.55 b3 = -0.14 b4 = -3.47
        b5 =  0.01 b6 = -0.01 b7 = -0.19 s = 0.1;
  _xb_ = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 +
         b5 * x5 + b6 * x6 + b7 * x7;
  _mu_ = 1 / (1 + exp(-_xb_));
  lh = pdf('normal', y, _mu_, s);
  ll = log(lh);
  model y ~ general(ll);
  predict ll out = out1(keep = _id_ pred rename = (pred = ln_like1));
run;

– Outputs

*** output for logit component ***
         Model Fit Statistics
                             Intercept
              Intercept            and
Criterion          Only     Covariates
AIC            4997.648       4066.398
SC             5004.043       4117.551
-2 Log L       4995.648       4050.398

             Analysis of Maximum Likelihood Estimates
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1     -9.3586      0.4341      464.8713        <.0001
x1            1     -0.0595      0.0341        3.0445        0.0810
x2            1      1.7644      0.1836       92.3757        <.0001
x3            1      0.5994      0.0302      392.7404        <.0001
x4            1     -2.5496      0.5084       25.1488        <.0001
x5            1    -0.00071     0.00128        0.3094        0.5780
x6            1    -0.00109     0.00267        0.1659        0.6838
x7            1     -1.6359      0.2443       44.8427        <.0001

*** output for the positive component ***
             Fit Statistics
-2 Log Likelihood                 -264.5
AIC (smaller is better)           -246.5
AICC (smaller is better)          -246.3
BIC (smaller is better)           -201.4

                 Parameter Estimates
                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|
b0            1.7947     0.3401   1116      5.28     <.0001
b1          -0.01216    0.02737   1116     -0.44     0.6568
b2           -0.4306     0.1437   1116     -3.00     0.0028
b3           -0.1157    0.02287   1116     -5.06     <.0001
b4           -2.9267     0.4656   1116     -6.29     <.0001
b5          0.004092   0.001074   1116      3.81     0.0001
b6          -0.00838   0.002110   1116     -3.97     <.0001
b7           -0.1697     0.1977   1116     -0.86     0.3909
s             0.2149   0.004549   1116     47.24     <.0001

2. Joint estimation method:
– SAS code

proc nlmixed data = one tech = congra maxiter = 1000;
  parms b10 = -9.3586 b11 = -0.0595 b12 =  1.7644 b13 =  0.5994 b14 = -2.5496
        b15 = -0.0007 b16 = -0.0011 b17 = -1.6359
        b20 =  0.3401 b21 =  0.0274 b22 =  0.1437 b23 =  0.0229 b24 =  0.4656
        b25 =  0.0011 b26 =  0.0021 b27 =  0.1977  s  =  0.2149;
  logit_xb = b10 + b11 * x1 + b12 * x2 + b13 * x3 + b14 * x4 +
             b15 * x5 + b16 * x6 + b17 * x7;
  nls_xb = b20 + b21 * x1 + b22 * x2 + b23 * x3 + b24 * x4 +
           b25 * x5 + b26 * x6 + b27 * x7;
  p1 = 1 / (1 + exp(-logit_xb));
  if y = 0 then ll = log(1 - p1);
  else ll = log(p1) + log(pdf('normal', y, 1 / (1 + exp(-nls_xb)), s));
  model y ~ general(ll);
run;

– Outputs

             Fit Statistics
-2 Log Likelihood                 3785.9
AIC (smaller is better)           3819.9
AICC (smaller is better)          3820.0
BIC (smaller is better)           3928.6

                 Parameter Estimates
                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|
b10          -9.3586     0.4341   4421    -21.56     <.0001
b11         -0.05948    0.03408   4421     -1.75     0.0810
b12           1.7645     0.1836   4421      9.61     <.0001
b13           0.5994    0.03024   4421     19.82     <.0001
b14          -2.5496     0.5084   4421     -5.01     <.0001
b15         -0.00071   0.001276   4421     -0.56     0.5784
b16         -0.00109   0.002673   4421     -0.41     0.6836
b17          -1.6359     0.2443   4421     -6.70     <.0001
b20           1.7633     0.3398   4421      5.19     <.0001
b21         -0.01115    0.02730   4421     -0.41     0.6830
b22          -0.4363     0.1437   4421     -3.04     0.0024
b23          -0.1139    0.02285   4421     -4.98     <.0001
b24          -2.8755     0.4643   4421     -6.19     <.0001
b25         0.004091   0.001074   4421      3.81     0.0001
b26         -0.00839   0.002109   4421     -3.98     <.0001
b27          -0.1666     0.1975   4421     -0.84     0.3991
s             0.2149   0.004550   4421     47.24     <.0001

As shown above, both estimation methods give similar parameter estimates. In addition, the summation of the log likelihoods from the two-step estimation models is exactly equal to the log likelihood from the joint estimation model, e.g. the -2 log likelihood from the joint model (3785.9) equals the sum of the -2 log likelihoods from the logit component (4050.4) and the positive component (-264.5).

A Distribution-Free Alternative to Vuong Test

The Vuong test has been widely used in nonnested model selection under the normality assumption. While the Vuong test determines whether the average log-likelihood ratio is statistically different from zero, the distribution-free test proposed by Clarke determines whether or not the median log-likelihood ratio is statistically different from zero.

Below is a SAS macro to implement the Clarke test.

%macro clarke(data = , ll1 = , q1 = , ll2 = , q2 = );
***********************************************************;
* THE SAS MACRO IS TO PERFORM AN ALTERNATIVE TO VUONG     *;
* TEST, DISTRIBUTION-FREE CLARKE TEST, FOR THE MODEL      *;
* COMPARISON.                                             *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  LL1 : LOG LIKELIHOOD FUNCTION OF THE MODEL 1           *;
*  Q1  : DEGREE OF FREEDOM OF THE MODEL 1                 *;
*  LL2 : LOG LIKELIHOOD FUNCTION OF THE MODEL 2           *;
*  Q2  : DEGREE OF FREEDOM OF THE MODEL 2                 *;
* ======================================================= *;
* REFERENCE:                                              *;
*  A SIMPLE DISTRIBUTION-FREE TEST FOR NONNESTED MODEL    *;
*  SELECTION, KEVIN CLARKE, 2007                          *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options mprint mlogic formchar = "|----|+|---+=|-/\<>*" nocenter nonumber nodate;

data _tmp1;
  set &data;
  where &ll1 ~= 0 and &ll2 ~= 0;
run;

proc sql noprint;
  select count(*) into :nobs from _tmp1;
quit;

%let schwarz = %sysevalf((&q1 - &q2) * %sysfunc(log(&nobs)) / &nobs);

data _tmp2;
  set _tmp1;
  z = &ll1 - &ll2 - &schwarz;
  b1 = 0;
  b2 = 0;
  if z > 0 then b1 = 1;
  if z < 0 then b2 = 1;
run;

proc sql;
  create table
    _tmp3 as
  select 
    cdf("binomial", count(*) - sum(b1), 0.5, count(*))             as p1,
    cdf("binomial", count(*) - sum(b2), 0.5, count(*))             as p2,
    min(1, cdf("binomial", count(*) - sum(b2), 0.5, count(*)) * 2) as p3
  from 
    _tmp2;
quit;

proc report data = _tmp3 spacing = 1 split = "*" headline nowindows;
  column("Null Hypothesis: MDL1 = MDL2" p1 p2 p3);
  define p1  / "MDL1 > MDL2"  width = 15 format = 12.8 order;
  define p2  / "MDL1 < MDL2"  width = 15 format = 12.8;
  define p3  / "MDL1 != MDL2" width = 15 format = 12.8;
run;

%mend clarke;
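
The heart of the macro is a simple sign test: after adjusting each pointwise log-likelihood ratio with the Schwarz correction, the number of positive differences is compared against a Binomial(n, 0.5) reference. A numpy / scipy sketch of the same calculation is shown below; the function and variable names are hypothetical, and ll1 and ll2 are arrays of pointwise log likelihoods from the two models.

import numpy as np
from scipy.stats import binom

def clarke_test(ll1, ll2, q1, q2):
  # pointwise log-likelihood ratios adjusted by the Schwarz (BIC-type) correction
  n = len(ll1)
  z = ll1 - ll2 - (q1 - q2) * np.log(n) / n
  n_pos = np.sum(z > 0)
  n_neg = np.sum(z < 0)
  p_mdl1_better = binom.cdf(n - n_pos, n, 0.5)
  p_mdl2_better = binom.cdf(n - n_neg, n, 0.5)
  p_two_sided = min(1, 2 * p_mdl2_better)
  return p_mdl1_better, p_mdl2_better, p_two_sided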

WoE Macro for LGD Model

WoE (Weight of Evidence) transformation has been widely used in scorecard or PD (Probability of Default) model development with a binary response variable in retail banking. In SAS Global Forum 2012, Anthony and Naeem proposed an innovative method to employ the WoE transformation in LGD (Loss Given Default) model development with the response in the range [0, 1]. Specifically, if an account has LGD = 0.25, then this account is duplicated 100 times, of which 25 instances are assigned Y = 1 and the remaining 75 instances are assigned Y = 0. As a result, 1 case with the unity-interval response variable is converted to 100 cases with the binary response variable. After this conversion, a WoE transformation algorithm designed for the binary response can be directly applied to the new dataset with the converted LGD.

Following the logic described by Anthony and Naeem, I did a test run with a WoE SAS macro that I drafted myself and got the following output. In the tested data set, there are 36,543 cases in total with the unity-interval response.

                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X

   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   001          9.5300         84.7200     406000  11.11%    158036  38.93%     -0.1726     0.0033     1.8904
   002         84.7300         91.9100     406800  11.13%    160530  39.46%     -0.1501     0.0025     3.5410
   003         91.9132         97.3100     405300  11.09%    168589  41.60%     -0.0615     0.0004     4.2202
   004         97.3146        101.8100     406000  11.11%    170773  42.06%     -0.0424     0.0002     4.6894
   005        101.8162        105.2221     406100  11.11%    172397  42.45%     -0.0264     0.0001     4.9821
   006        105.2223        110.1642     406000  11.11%    174480  42.98%     -0.0050     0.0000     5.0376
   007        110.1700        115.7000     406600  11.13%    183253  45.07%      0.0800     0.0007     4.1431
   008        115.7029        123.0886     405500  11.10%    188026  46.37%      0.1324     0.0020     2.6630
   009        123.1100        150.0000     406000  11.11%    198842  48.98%      0.2369     0.0063     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 3654300, # BADs(Y=1) = 1574926, OVERALL BAD RATE = 43.0979%, MAX. KS = 5.0376, INFO. VALUE = 0.0154.                                    
--------------------------------------------------------------------------------------------------------------                                        

As shown in the above output, the monotonic WoE binning, information value, and KS statistic come out nicely, with the only exception that the frequency has been inflated by 100 times. From a practical view, the increase in the data size will inevitably lead to an increase in the computing cost, making this binary conversion potentially inapplicable to large data.
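
Before diving into the macro, the gist of the modification can be illustrated with a toy pandas snippet: instead of duplicating each record 100 times, treat the LGD value itself as the weighted count of "bads" and 1 - LGD as the weighted count of "goods" within each bin. The sketch below uses hypothetical column names and ignores the monotonic binning logic handled by the macro.

import numpy as np
import pandas as pd

def woe_by_bin(df, bin_col, y_col):
  # sum of y acts as weighted "bads" and sum of (1 - y) as weighted "goods"
  grp = df.groupby(bin_col)[y_col].agg(['sum', 'count'])
  grp['goods'] = grp['count'] - grp['sum']
  woe = np.log((grp['sum'] / grp['sum'].sum()) / (grp['goods'] / grp['goods'].sum()))
  return woe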

After a few tweaks, I drafted a modified version of the WoE SAS macro suitable for LGD models with the unity-interval response, as given below.

%macro lgd_nwoe(data = , y = , x = );

***********************************************************;
* THE SAS MACRO IS TO PERFORM UNIVARIATE IMPORTANCE RANK  *;
* ORDER AND MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION   *;
* FOR NUMERIC ATTRIBUTES IN PRE-MODELING DATA PROCESSING  *;
* FOR LGD MODELS. OUTPUTS ARE SUPPOSED TO BE USED IN A    *;
* FRACTIONAL MODEL WITH LOGIT LINK FUNCTION               *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  Y   : CONTINUOUS RESPONSE VARIABLE IN [0, 1] RANGE     *;
*  X   : A LIST OF NUMERIC ATTRIBUTES                     *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  NUM_WOE.WOE: A FILE OF WOE TRANSFORMATION RECODING     *;
*  NUM_WOE.FMT: A FILE OF BINNING FORMAT                  *;
*  NUM_WOE.PUT: A FILE OF PUT STATEMENTS FOR *.FMT FILE   *;
*  NUM_WOE.SUM: A FILE WITH PREDICTABILITY SUMMARY        *;
*  NUM_WOE.OUT: A FILE WITH STATISTICAL DETAILS           *;
*  NUM_WOE.IMP: A FILE OF MISSING IMPUTATION RECODING     *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

*** DEFAULT PARAMETERS ***;

%local maxbin miniv bignum mincor;

%let maxbin = 20;

%let miniv  = 0;

%let mincor = 1;

%let bignum = 1e300;

***********************************************************;
***         DO NOT CHANGE CODES BELOW THIS LINE         ***;
***********************************************************;

*** DEFAULT OUTPUT FILES ***;

* WOE RECODING FILE                     *;
filename woefile "LGD_NWOE.WOE";

* FORMAT FOR BINNING                    *;
filename fmtfile "LGD_NWOE.FMT";

* PUT STATEMENT TO USE FORMAT           *;
filename binfile "LGD_NWOE.PUT";

* KS SUMMARY                            *;
filename sumfile "LGD_NWOE.SUM";
 
* STATISTICAL SUMMARY FOR EACH VARIABLE *;
filename outfile "LGD_NWOE.OUT";

* IMPUTE RECODING FILE                  *;
filename impfile "LGD_NWOE.IMP";

*** A MACRO TO DELETE FILE ***;
%macro dfile(file = );
  data _null_;
    rc = fdelete("&file");
    if rc = 0 then do;
      put @1 50 * "+";
      put "THE EXISTING OUTPUT FILE HAS BEEN DELETED.";
      put @1 50 * "+";
    end;
  run;
%mend dfile;

*** CLEAN UP FILES ***;
%dfile(file = woefile);

%dfile(file = fmtfile);

%dfile(file = binfile);

%dfile(file = sumfile);

%dfile(file = outfile);

%dfile(file = impfile);

*** PARSING THE STRING OF NUMERIC PREDICTORS ***;
ods listing close;
ods output position = _pos1;
proc contents data = &data varnum;
run;

proc sql noprint;
  select
    upcase(variable) into :x2 separated by ' '
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;


  select
    count(variable) into :xcnt
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;
quit;

data _tmp1;
  retain &x2 &y &y.2;
  set &data;
  where &Y >= 0 and &y <= 1;
  &y.2 = 1 - &y;
  keep &x2 &y &y.2;
run;

ods output position = _pos2;
proc contents data = _tmp1 varnum;
run;

*** LOOP THROUGH EACH PREDICTOR ***;
%do i = 1 %to &xcnt;
    
  proc sql noprint;
    select
      upcase(variable) into :var
    from
      _pos2
    where
      num= &i;

    select
      count(distinct &var) into :xflg
    from
      _tmp1
    where
      &var ~= .;
  quit;

  proc summary data = _tmp1 nway;
    output out  = _med(drop = _type_ _freq_)
    median(&var) = med nmiss(&var) = mis;
  run;
  
  proc sql;
    select
      med into :median
    from
      _med;

    select
      mis into :nmiss
    from
      _med;

    select 
      case when count(&y) = sum(&y) then 1 else 0 end into :mis_flg1
    from
      _tmp1
    where
      &var = .;

    select
      case when sum(&y) = 0 then 1 else 0 end into :mis_flg2
    from
      _tmp1
    where
      &var = .;
  quit;

  %let nbin = %sysfunc(min(&maxbin, &xflg));

  *** CHECK IF # OF DISTINCT VALUES > 1 ***;
  %if &xflg > 1 %then %do;

    *** IMPUTE MISS VALUE WHEN WOE CANNOT BE CALCULATED ***;
    %if &mis_flg1 = 1 | &mis_flg2 = 1 %then %do;
      data _null_;
        file impfile mod;
        put " ";
        put @3 "*** MEDIAN IMPUTATION OF %TRIM(%UPCASE(&VAR)) (NMISS = %trim(&nmiss)) ***;";
        put @3 "IF %TRIM(%UPCASE(&VAR)) = . THEN %TRIM(%UPCASE(&VAR)) = &MEDIAN;";
      run;

      data _tmp1;
        set _tmp1;
        if &var = . then &var = &median;
      run; 
    %end;      
      
    *** LOOP THROUGH ALL # OF BINS ***;
    %do j = &nbin %to 2 %by -1;
      proc rank data = _tmp1 groups = &j out = _tmp2(keep = &y &var rank &y.2);
        var &var;
        ranks rank;
      run;

      proc summary data = _tmp2 nway missing;
        class rank;
        output out = _tmp3(drop = _type_ rename = (_freq_ = freq))
        sum(&y)   = bad    mean(&y)  = bad_rate mean(&y.2) = good_rate
        min(&var) = minx   max(&var) = maxx;
      run;

      *** CREATE FLAGS FOR MULTIPLE CRITERION ***;
      proc sql noprint;
        select
          case when min(bad_rate) > 0 then 1 else 0 end into :minflg
        from
          _tmp3;

        select
          case when max(bad_rate) < 1 then 1 else 0 end into :maxflg
        from
          _tmp3;              
      quit;

      *** CHECK IF SPEARMAN CORRELATION = 1 ***;
      %if &minflg = 1 & &maxflg = 1 %then %do;
        ods output spearmancorr = _corr(rename = (minx = cor));
        proc corr data = _tmp3 spearman;
          var minx;
          with bad_rate;
        run;

        proc sql noprint;
          select
            case when abs(cor) >= &mincor then 1 else 0 end into :cor
          from
            _corr;
        quit;

        *** IF SPEARMAN CORR = 1 THEN BREAK THE LOOP ***;
        %if &cor = 1 %then %goto loopout;
      %end;
      %else %if &nbin = 2 %then %goto exit;
    %end;

    %loopout:

    *** CALCULATE STATISTICAL SUMMARY ***;
    
    proc sql noprint;
      select 
        sum(freq) into :freq
      from
        _tmp3;
      
      select
        sum(bad) into :bad
      from
        _tmp3;

      select
        sum(bad) / sum(freq) into :lgd
      from
        _tmp3;
    quit;

    proc sort data = _tmp3 sortsize = max;
      by rank;
    run;

    data _tmp4;
      retain bin minx maxx bad freq pct bad_rate good_rate;
      set _tmp3 end = eof;
      by rank;

      if rank = . then bin = 0;
      else do;
        retain b 0;
        bin + 1;
      end;
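      *** NOTE: IN THIS LGD VERSION, "BAD" IS THE SUM OF THE RESPONSE WITHIN THE BIN AND ***;
      *** "GOOD" IS FREQ - BAD, SO WOE = LOG(BIN SHARE OF BADS / BIN SHARE OF GOODS) AND ***;
      *** IV ACCUMULATES (BPCT - GPCT) * WOE ACROSS BINS                                 ***;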
  
      pct  = freq / &freq;
      bpct = bad / &bad;
      gpct = (freq - bad) / (&freq - &bad);
      woe  = log(bpct / gpct);
      iv   = (bpct - gpct) * woe;
      
      retain cum_bpct cum_gpct;
      cum_bpct + bpct;
      cum_gpct + gpct;
      ks = abs(cum_gpct - cum_bpct) * 100;
      
      retain iv_sum ks_max;
      iv_sum + iv;
      ks_max = max(ks_max, ks);
      if eof then do;
        call symput("bin", put(bin, 4.));
        call symput("ks", put(ks_max, 10.4));
        call symput("iv", put(iv_sum, 10.4));
      end;

      keep bin minx maxx bad freq pct bad_rate
           gpct bpct woe iv cum_gpct cum_bpct ks;
    run;

    *** REPORT STATISTICAL SUMMARY ***;
    proc printto print = outfile;
    run;

    title;
    ods listing;
    proc report data = _tmp4 spacing = 1 split = "*" headline nowindows;
      column(" * MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(%trim(&var))"
             bin minx maxx freq pct bad_rate woe iv ks);

      define bin      /"BIN#"         width = 5  format = z3. order order = data;
      define minx     /"LOWER LIMIT"  width = 15 format = 14.4;
      define maxx     /"UPPER LIMIT"  width = 15 format = 14.4;
      define freq     /"#FREQ"        width = 10 format = 9.;
      define pct      /"DISTRIBUTION" width = 15 format = percent14.4;
      define bad_rate /"MEAN LGD"     width = 15 format = percent14.4;
      define woe      /"WOE"          width = 15 format = 14.8;
      define iv       /"INFO. VALUE"  width = 15 format = 14.8;
      define ks       /"KS"           width = 10 format = 9.4;
      compute after;
        line @1 125 * "-";
        line @10 "# TOTAL = %trim(&freq), AVERAGE LGD = %trim(&lgd), "
                 "MAX. KS = %trim(&ks), INFO. VALUE = %trim(&iv).";
        line @1 125 * "-";    
      endcomp;
    run;
    ods listing close;

    proc printto;
    run;

    proc sql noprint;
      select
        case when sum(iv) >= &miniv then 1 else 0 end into :ivflg
      from
        _tmp4;
    quit;

    *** OUTPUT RECODING FILES IF IV >= &miniv BY DEFAULT ***;
    %if &ivflg = 1 %then %do;
      data _tmp5;
        length upper $20 lower $20;
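        * "maxx" here still holds the prior observation because variables read with SET *;
        * are retained, so the lower bound of each bin is carried over from the upper   *;
        * bound of the previous bin before the override for the first bin below         *;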
        lower = compress(put(maxx, 20.4), ' ');

        set _tmp4 end = eof;
        upper = compress(put(maxx, 20.4), ' ');
        if bin = 1 then lower = "-%trim(&bignum)";
        if eof then upper = "%trim(&bignum)";
        w%trim(&var) = compress(put(woe, 12.8), ' ');
      run;

      *** OUTPUT WOE RECODE FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file woefile mod;

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3  "if %trim(&var) = . then w%trim(&var) = " w%trim(&var) ";";
        end;
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3 "if " lower "< %trim(&var) <= " upper
                 "then w%trim(&var) = " w%trim(&var) ";";
        end;
        if _n_ > 1 then do;
          put @5 "else if " lower "< %trim(&var) <= " upper
                 "then w%trim(&var) = " w%trim(&var) ";";
        end;
        if eof then do;
          put @5 "else w%trim(&var) = 0;";
        end;
      run;

      *** OUTPUT BINNING FORMAT FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file fmtfile mod;

        if bin = 1 then lower = "LOW";
        if eof then upper = "HIGH";

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 ". " @40 " = '" bin: z3.
              @47 ". MISSINGS'";
        end;

            
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
              @5 "BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 lower @15 " - " upper  @40 " = '" bin: z3.
              @47 "." + 1 lower "- " upper "'";
        end;

        if _n_ > 1 then do;
          put @5 lower @15 "<- " upper @40 " = '" bin: z3.
              @47 "." + 1 lower "<- " upper "'";
        end;
        if eof then do;
          put @5 "OTHER" @40 " = '999. OTHERS';";
        end;
      run;

      *** OUTPUT BINNING RECODE FILE ***;
      data _null_;
        file binfile mod;
        put " ";
        put @3 "*** BINNING RECODE of %trim(&var) ***;";
        put @3 "c%trim(&var) = put(%trim(&var), %trim(&var)_fmt.);";
      run;

      *** SAVE SUMMARY OF EACH VARIABLE INTO A TABLE ***;
      %if %sysfunc(exist(work._result)) %then %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          if _n_ = 1 then do;
            variable = "%trim(&var)";
            bin      = &bin;
            ks       = &ks;
            iv       = &iv;
            output;
          end;
          set _result;
          output;
        run;
      %end;
      %else %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          variable = "%trim(&var)";
          bin      = &bin;
          ks       = &ks;
          iv       = &iv;
        run;        
      %end;
    %end;

    %exit:

    *** CLEAN UP TEMPORARY TABLES ***;
    proc datasets library = work nolist;
      delete _tmp2 - _tmp5 _corr / memtype = data;
    run;
    quit;
  %end;    
%end;

*** SORT VARIABLES BY IV AND KS AND OUTPUT RESULTS ***;
proc sort data = _result sortsize = max;
  by descending iv descending ks;
run;

data _null_;
  set _result end = eof;
  file sumfile;

  if _n_ = 1 then do;
    put @1 80 * "-";
    put @1  "| RANK" @10 "| VARIABLE RANKED BY IV" @45 "| # BINS"
        @55 "|  KS"  @66 "| INFO. VALUE" @80 "|";
    put @1 80 * "-";
  end;
  put @1  "| " @4  _n_ z3. @10 "| " @12 variable @45 "| " @50 bin
      @55 "| " @57 ks      @66 "| " @69 iv       @80 "|";
  if eof then do;
    put @1 80 * "-";
  end;
run;

proc datasets library = work nolist;
  delete _result (mt = data);
run;
quit;

*********************************************************;
*                   END OF MACRO                        *;
*********************************************************;

%mend lgd_nwoe;
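
With hypothetical names, a call to this macro would look like the line below, where lgd_sample stands for an input table holding a unity-interval response named lgd and the numeric attribute x shown in the output that follows.

%lgd_nwoe(data = lgd_sample, y = lgd, x = x);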

With this macro, I tested the same data again but without the binary conversion proposed by Anthony and Naeem. The output is also pasted below for comparison.

                                                                                                                            
                                 MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X

  BIN#     LOWER LIMIT     UPPER LIMIT      #FREQ    DISTRIBUTION        MEAN LGD             WOE     INFO. VALUE         KS
 ---------------------------------------------------------------------------------------------------------------------------
   001          9.5300         84.7200       4060       11.1102%        38.9246%      -0.17266201      0.00326518     1.8911
   002         84.7300         91.9100       4068       11.1321%        39.4665%      -0.14992613      0.00247205     3.5399
   003         91.9132         97.3100       4053       11.0910%        41.5962%      -0.06155066      0.00041827     4.2195
   004         97.3146        101.8100       4060       11.1102%        42.0504%      -0.04288377      0.00020368     4.6945
   005        101.8162        105.2221       4061       11.1129%        42.4538%      -0.02634960      0.00007701     4.9867
   006        105.2223        110.1642       4060       11.1102%        42.9896%      -0.00445666      0.00000221     5.0362
   007        110.1700        115.7000       4066       11.1266%        45.0653%       0.07978905      0.00071190     4.1440
   008        115.7029        123.0886       4055       11.0965%        46.3687%       0.13231366      0.00195768     2.6644
   009        123.1100        150.0000       4060       11.1102%        48.9800%       0.23701619      0.00631510     0.0000
-----------------------------------------------------------------------------------------------------------------------------
         # TOTAL = 36543, AVERAGE LGD = 0.430988, MAX. KS = 5.0362, INFO. VALUE = 0.0154.                                    
-----------------------------------------------------------------------------------------------------------------------------

The comparison shows that the information from the two implementations is almost identical. For instance, the information value is 0.0154 in both cases. However, the frequency in the second output reflects the correct sample size, i.e. 36,543. As a result, the computation is more efficient and takes much less time. Considering that hundreds of variables usually need to be looped through this WoE transformation one by one, the approach without duplicating cases and converting to a binary outcome makes more practical sense.

A SAS Macro for Bootstrap Aggregating (Bagging)

Proposed by Breiman (1996), bagging is an acronym for “bootstrap aggregating” and is a machine learning method that improves prediction accuracy by simply averaging the predictions from multiple classifiers, each developed on a bootstrapped sample drawn from the original training set.
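
In notation, if p_b(x) denotes the prediction of the b-th tree for an observation x and B is the number of bootstrapped trees, the bagged prediction is simply the average

\hat{p}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} p_b(x)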

Beyond its statistical elegance, bagging is attractive in modern machine learning in that the construction of each decision tree in the bag is computationally efficient and completely independent of the others, which makes bagging ideal for parallel computing.

The SAS macro demonstrated below is an attempt to test the bagging algorithm on a consumer banking dataset.

%macro bagging(data = , y = , numx = , catx = , ntrees = 50);
***********************************************************;
* THIS SAS MACRO IS AN ATTEMPT TO IMPLEMENT BAGGING       *;
* PROPOSED BY LEO BREIMAN (1996)                          *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA   : INPUT SAS DATA TABLE                          *;
*  Y      : RESPONSE VARIABLE WITH 0/1 VALUE              *;
*  NUMX   : A LIST OF NUMERIC ATTRIBUTES                  *;
*  CATX   : A LIST OF CATEGORICAL ATTRIBUTES              *;
*  NTREES : # OF TREES TO DO THE BAGGING                  *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  1. A SAS CATALOG FILE NAMED "TREEFILES" IN THE WORKING *;
*     DIRECTORY CONTAINING ALL SCORING FILES IN BAGGING   *;
*  2. A LST FILE SHOWING KS STATISTICS OF THE BAGGING     *;
*     CLASSIFIER AND EACH TREE CLASSIFIER                 *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM, LOSS FORECASTING & RISK MODELING    *;
***********************************************************;

options mprint mlogic nocenter nodate nonumber;

*** a random seed value subject to change ***;
%let seed = 20110613;

*** assign a library to the working folder ***;
libname _path '';

*** generate a series of random seeds ***;
data _null_;
  do i = 1 to &ntrees;
    random = put(ranuni(&seed) * (10 ** 8), 8.);
    name   = compress("random"||put(i, 3.), ' ');
    call symput(name, random);
  end;
run;    

*** clean up catalog files in the library ***;
proc datasets library = _path nolist;
  delete TreeFiles tmp / memtype = catalog;
run;
quit;

proc sql noprint;
  select count(*) into :nobs from &data where &y in (1, 0);
quit;

data _tmp1 (keep = &y &numx &catx _id_);
  set &data;
  _id_ + 1;
run;
  
%do i = 1 %to &ntrees;
  %put &&random&i;

  *** generate bootstrap samples for bagging ***;
  proc surveyselect data = _tmp1 method = urs n = &nobs seed = &&random&i
    out = sample&i(rename = (NumberHits = _hits)) noprint;
  run;
  
  *** generate data mining datasets for sas e-miner ***;
  proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i;
    class &y &catx;
    var &numx;
    target &y;
    freq _hits;
  run;

  *** create a sas temporary catalog to contain sas output ***;
  filename out_tree catalog "_path.tmp.out_tree.source";

  *** create decision tree mimicking CART ***;
  proc split data = db_sample&i dmdbcat = cl_sample&i
    criterion    = gini
    assess       = impurity
    maxbranch    = 2
    splitsize    = 100
    subtree      = assessment
    exhaustive   = 0 
    nsurrs       = 0;
    code file    = out_tree;
    input &numx   / level = interval;
    input &catx   / level = nominal;
    target &y     / level = binary;
    freq _hits;
  run;  

  *** create a permanent sas catalog to contain all tree outputs ***;
  filename in_tree catalog "_path.TreeFiles.tree&i..source";

  data _null_;
    infile out_tree;
    input;
    file in_tree;
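    * copy the generated scoring code into the permanent catalog, skipping its first 3 lines *;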
    if _n_ > 3 then put _infile_;
  run;

  *** score the original data by each tree output file ***;
  data _score&i (keep = p_&y.1 p_&y.0 &y _id_);
    set _tmp1;
    %include in_tree;
  run;

  *** calculate KS stat ***;
  proc printto new print = lst_out;
  run;

  ods output kolsmir2stats = _kstmp(where = (label1 = 'KS'));
  proc npar1way wilcoxon edf data = _score&i;
    class &y.;
    var p_&y.1;
  run;

  proc printto;
  run;

  %if &i = 1 %then %do;
    data _tmp2;
      set _score&i;
    run;

    data _ks;
      set _kstmp (keep = nvalue2);
      tree_id = &i;
      seed    = &&random&i;
      ks      = round(nvalue2 * 100, 0.0001);
    run;
  %end;    
  %else %do;
    data _tmp2;
      set _tmp2 _score&i;
    run;

    data _ks;
      set _ks _kstmp(in = a keep = nvalue2);
      if a then do;
        tree_id = &i;
        seed    = &&random&i;
        ks      = round(nvalue2 * 100, 0.0001);
      end;
    run;
  %end;    

%end;

*** aggregate predictions from all trees in the bag ***;
proc summary data = _tmp2 nway;
  class _id_;
  output out = _tmp3(drop = _type_ rename = (_freq_ = freq))
  mean(p_&y.1) =  mean(p_&y.0) =  mean(&y) = ;
run;

*** calculate bagging KS stat ***;
proc printto new print = lst_out;
run;

ods output kolsmir2stats = _kstmp(where = (label1 = 'KS'));
proc npar1way wilcoxon edf data = _tmp3;
  class &y;
  var p_&y.1;
run;

proc printto;
run;

data _ks;
  set _ks _kstmp (in = a keep = nvalue2);
  if a then do;
    tree_id = 0;
    seed    = &seed;
    ks      = round(nvalue2 * 100, 0.0001);
  end;
run;

proc sort data = _ks;
  by tree_id;
run;

proc sql noprint;
  select max(ks) into :max_ks from _ks where tree_id > 0;
  
  select min(ks) into :min_ks from _ks where tree_id > 0;

  select ks into :bag_ks from _ks where tree_id = 0;
quit;

*** summarize the performance of bagging classifier and each tree in the bag ***;
title "MAX KS = &max_ks, MIN KS = &min_ks, BAGGING KS = &bag_ks";
proc print data = _ks noobs;
  var tree_id seed ks;
run;
title;

proc datasets library = _path nolist;
  delete tmp / memtype = catalog;
run;
quit;

%mend bagging;

%let x1 = tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt
          tot_rev_line rev_util bureau_score ltv tot_income;

%let x2 = purpose;

libname data 'D:\SAS_CODE\bagging';

%bagging(data = data.accepts, y = bad, numx = &x1, catx = &x2, ntrees = 10);

The table below shows the result of bagging estimated from 10 bootstrap samples. As seen, the bagging prediction outperforms the prediction from the best single decision tree by at least 10% (KS = 47.9446 for bagging versus 41.9205 for the best tree, roughly a 14% relative improvement), and this advantage of bagging has been very robust across multiple iterations and experiments.

MAX KS =  41.9205, MIN KS =  37.9653, BAGGING KS =  47.9446

tree_id      seed         ks
    0      20110613    47.9446
    1      66117721    38.0739
    2      73612659    41.9205
    3      88775645    37.9653
    4      76989116    39.7305
    5      78326288    41.8533
    6      67052887    39.7698
    7       1826834    38.9471
    8      47292499    39.2977
    9      39078123    40.2813
   10      15798916    40.6123

A SAS Macro Implementing Bi-variate Granger Causality Test

In loss forecasting, it is often of interest to know: 1) whether one time series, e.g. a macro-economic variable, is useful for predicting another, e.g. portfolio loss; and 2) how many lags contribute such predictive power. Granger (1969) proposed that a time series X is said to Granger-cause Y if lagged values of X provide statistically significant information for predicting the future values of Y.
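
Concretely, for a given lag order p, the bi-variate test implemented in the macro below compares a restricted autoregression of Y on its own p lags against a full model that also includes p lags of X, and flags Granger causality when the lagged-X coefficients are jointly significant:

Y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i Y_{t-i} + \varepsilon_t   (restricted)

Y_t = \beta_0 + \sum_{i=1}^{p} \beta_i Y_{t-i} + \sum_{i=1}^{p} \gamma_i X_{t-i} + u_t   (full)

F = \frac{(SSE_r - SSE_f) / p}{SSE_f / (n - 2p - 1)} \sim F(p, n - 2p - 1), \qquad \chi^2 = \frac{n (SSE_r - SSE_f)}{SSE_f} \sim \chi^2(p)

where SSE_r and SSE_f are the residual sums of squares from the restricted and full regressions, corresponding to f_test1 and chisq_test1 in the macro.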

The SAS macro below shows how to implement the Granger causality test in a bi-variate setting.

%macro causal(data = , y = , drivers = , max_lags = );
***********************************************************;
* THIS SAS MACRO IS AN IMPLEMENTATION OF BI-VARIATE       *;
* GRANGER CAUSALITY TEST PROPOSED BY GRANGER (1969)       *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA     : INPUT SAS DATA TABLE                        *;
*  Y        : A CONTINUOUS TIME SERIES RESPONSE VARIABLE  *;
*  DRIVERS  : A LIST OF TIME SERIES PREDICTORS            *;
*  MAX_LAGS : MAX # OF LAGS TO SEARCH FOR CAUSAL          *;
*             RELATIONSHIPS                               *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM, LOSS FORECASTING & RISK MODELING    *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen orientation = landscape 
        ls = 150 formchar = "|----|+|---+=|-/\<>*";

%macro granger(data = , dep = , indep = , nlag = );

%let lag_dep = ;
%let lag_indep = ;

data _tmp1;
  set &data (keep = &dep &indep);

  %do i = 1 %to &nlag;
  lag&i._&dep = lag&i.(&dep);
  lag&i._&indep = lag&i.(&indep);

  %let lag_dep = &lag_dep lag&i._&dep;
  %let lag_indep = &lag_indep lag&i._&indep;
  %end;
run;

proc corr data = _tmp1 noprint outp = _corr1(rename = (&dep = value) where = (_type_ = 'CORR')) nosimple;
  var &dep;
  with lag&nlag._&indep;
run;

proc corr data = _tmp1 noprint outp = _corr2(rename = (&dep = value) where = (_type_ = 'CORR')) nosimple;
  var &dep;
  with lag&nlag._&indep;
  partial lag&nlag._&dep;
run;

proc reg data = _tmp1 noprint;
  model &dep = &lag_dep;
  output out = _rest1 r = rest_e;
run;

proc reg data = _tmp1 noprint;
  model &dep = &lag_dep &lag_indep;
  output out = _full1 r = full_e;
run;

proc sql noprint;
  select sum(full_e * full_e) into :full_sse1 from _full1;

  select sum(rest_e * rest_e) into :rest_sse1 from _rest1;

  select count(*) into :n from _full1;

  select value into :cor1 from _corr1;

  select value into :cor2 from _corr2;
quit;

data _result;
  format dep $20. ind $20.;
  dep   = "&dep";
  ind   = "%upcase(&indep)";
  nlag = &nlag;

  corr1 = &cor1;
  corr2 = &cor2;

  f_test1  = ((&rest_sse1 - &full_sse1) / &nlag) / (&full_sse1 / (&n -  2 * &nlag - 1));
  p_ftest1 = 1 - probf(f_test1, &nlag, &n -  2 * &nlag - 1);

  chisq_test1 = (&n * (&rest_sse1 - &full_sse1)) / &full_sse1;
  p_chisq1    = 1 - probchi(chisq_test1, &nlag);

  format flag1 $3.;
  if max(p_ftest1, p_chisq1) < 0.01 then flag1 = "***";
  else if max(p_ftest1, p_chisq1) < 0.05 then flag1 = " **";
  else if max(p_ftest1, p_chisq1) < 0.1 then flag1 = "  *";
  else flag1 = "   ";
run;

%mend granger;

data _in1;
  set &data (keep = &y &drivers);
run;

%let var_loop = 1;

%do %while (%scan(&drivers, &var_loop) ne %str());

  %let driver = %scan(&drivers, &var_loop);

  %do lag_loop = 1 %to &max_lags;

    %granger(data = _in1, dep = &y, indep = &driver, nlag = &lag_loop);

    %if &var_loop = 1 & &lag_loop = 1 %then %do;
      data _final;
        set _result;
      run;
    %end;
    %else %do;
      data _final;
        set _final _result;
      run;
    %end;
  %end;

  %let var_loop = %eval(&var_loop + 1);
%end;

title;
proc report data  = _last_ box spacing = 1 split = "/" nowd;
  column("GRANGER CAUSALITY TEST FOR %UPCASE(&y) UPTO &MAX_LAGS LAGS"
         ind nlag corr1 corr2 f_test1 chisq_test1 flag1);

  define ind         / "DRIVERS"                width = 20 center group order order = data;
  define nlag        / "LAG"                    width = 3  format = 3.   center order order = data;
  define corr1       / "PEARSON/CORRELATION"    width = 12 format = 8.4  center;
  define corr2       / "PARTIAL/CORRELATION"    width = 12 format = 8.4  center;
  define f_test1     / "CAUSAL/F-STAT"          width = 12 format = 10.4 center;
  define chisq_test1 / "CAUSAL/CHISQ-STAT"      width = 12 format = 10.4 center;
  define flag1       / "CAUSAL/FLAG"            width = 8  right;
run;

%mend causal;

%causal(data = sashelp.citimon, y = RTRR, drivers = CCIUTC LHUR FSPCON, max_lags = 6);

Based upon the output shown below, we can tentatively conclude that LHUR (unemployment rate) three months earlier might be helpful in predicting RTRR (retail sales).

 ---------------------------------------------------------------------------------------
 |                     GRANGER CAUSALITY TEST FOR RTRR UPTO 6 LAGS                     |
 |                           PEARSON      PARTIAL       CAUSAL       CAUSAL      CAUSAL|
 |      DRIVERS        LAG CORRELATION  CORRELATION     F-STAT     CHISQ-STAT      FLAG|
 |-------------------------------------------------------------------------------------| 
 |       CCIUTC       |  1|    0.9852  |    0.1374  |     2.7339 |     2.7917 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  2|    0.9842  |    0.1114  |     0.7660 |     1.5867 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  3|    0.9834  |    0.0778  |     0.8186 |     2.5803 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  4|    0.9829  |    0.1047  |     0.7308 |     3.1165 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  5|    0.9825  |    0.0926  |     0.7771 |     4.2043 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  6|    0.9819  |    0.0868  |     0.7085 |     4.6695 |        | 
 |--------------------+---+------------+------------+------------+------------+--------| 
 |        LHUR        |  1|   -0.7236  |    0.0011  |     0.0000 |     0.0000 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  2|   -0.7250  |    0.0364  |     1.4136 |     2.9282 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  3|   -0.7268  |    0.0759  |     2.4246 |     7.6428 |       *| 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  4|   -0.7293  |    0.0751  |     2.1621 |     9.2208 |       *| 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  5|   -0.7312  |    0.1045  |     2.1148 |    11.4422 |       *| 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  6|   -0.7326  |    0.1365  |     1.9614 |    12.9277 |       *| 
 |--------------------+---+------------+------------+------------+------------+--------| 
 |       FSPCON       |  1|    0.9484  |    0.0431  |     0.2631 |     0.2687 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  2|    0.9481  |    0.0029  |     0.6758 |     1.3998 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  3|    0.9483  |   -0.0266  |     0.4383 |     1.3817 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  4|    0.9484  |   -0.0360  |     0.9219 |     3.9315 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  5|    0.9494  |   -0.0793  |     0.9008 |     4.8739 |        | 
 |                    |---+------------+------------+------------+------------+--------| 
 |                    |  6|    0.9492  |   -0.0999  |     0.9167 |     6.0421 |        | 
 --------------------------------------------------------------------------------------- 

Bumping: A Stochastic Search for the Best Model

Breiman (1996) showed how to use the bootstrap sampling technique to improve prediction accuracy in the bagging algorithm, which has shown successful use cases in subset selection and decision trees. However, a major drawback of bagging is that it destroys the simple structure of the original model. For instance, a bag of decision trees can no longer be presented as a single tree. As a result, bagging improves prediction accuracy at the cost of interpretability.

Tibshirani and Knight (1997) proposed another use case of bootstrap sampling, named bumping. In the bumping algorithm, bootstrap sampling is used to estimate candidate models with the purpose of doing a stochastic search for the best single model throughout the model space. As such, the simple structure of the original model, such as the one presented in a decision tree, is well preserved.
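
In other words, each bootstrapped sample yields one candidate model, every candidate is then evaluated against the original (un-resampled) data, and only the single best candidate is retained. In the macro below, the selection criterion is the KS statistic, i.e.

b^{*} = \arg\max_{b \in \{1, \ldots, B\}} KS_b

where KS_b is obtained by scoring the original data with the b-th candidate tree.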

The SAS macro below shows how to implement the bumping algorithm.

%macro bumping(data = , y = , numx = , catx = , ntrees = 100);
***********************************************************;
* THIS SAS MACRO IS AN ATTEMPT TO IMPLEMENT BUMPING       *;
* PROPOSED BY TIBSHIRANI AND KNIGHT (1997)                *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA   : INPUT SAS DATA TABLE                          *;
*  Y      : RESPONSE VARIABLE WITH 0/1 VALUE              *;
*  NUMX   : A LIST OF NUMERIC ATTRIBUTES                  *;
*  CATX   : A LIST OF CATEGORICAL ATTRIBUTES              *;
*  NTREES : # OF TREES TO DO THE BUMPING SEARCH           *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  BESTTREE.TXT: A TEXT FILE USED TO SCORE THE BEST TREE  *;
*                THROUGH THE BUMPING SEARCH               *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options mprint mlogic nocenter nodate nonumber;

*** a random seed value subject to change ***;
%let seed = 1;

data _null_;
  do i = 1 to &ntrees;
    random = put(ranuni(&seed) * (10 ** 8), 8.);
    name   = compress("random"||put(i, 3.), ' ');
    call symput(name, random);
  end;
run;    

proc datasets library = data nolist;
  delete catalog / memtype = catalog;
run;
quit;

proc sql noprint;
  select count(*) into :nobs from &data where &y in (1, 0);
quit;

%do i = 1 %to &ntrees;
  %put &&random&i;

  proc surveyselect data = &data method = urs n = &nobs seed = &&random&i
    out = sample&i(rename = (NumberHits = _hits)) noprint;
  run;
  
  proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i;
    class &y &catx;
    var &numx;
    target &y;
    freq _hits;
  run;

  filename out_tree catalog "data.catalog.out_tree.source";
  
  proc split data = db_sample&i dmdbcat = cl_sample&i
    criterion    = gini
    assess       = impurity
    maxbranch    = 2
    splitsize    = 100
    subtree      = assessment
    exhaustive   = 0 
    nsurrs       = 0;
    code file    = out_tree;
    input &numx   / level = interval;
    input &catx   / level = nominal;
    target &y     / level = binary;
    freq _hits;
  run;  

  filename in_tree catalog "data.catalog.tree&i..source";

  data _null_;
    infile out_tree;
    input;
    file in_tree;
    if _n_ > 3 then put _infile_;
  run;

  data _tmp1(keep = p_&y.1 p_&y.0 &y);
    set &data;
    %include in_tree;
  run;

  proc printto new print = lst_out;
  run;

  ods output kolsmir2stats = _kstmp(where = (label1 = 'KS'));
  proc npar1way wilcoxon edf data = _tmp1;
    class &y;
    var p_&y.1;
  run;

  proc printto;
  run;

  %if &i = 1 %then %do;
    data _ks;
      set _kstmp (keep = nvalue2);
      tree_id = &i;
      seed    = &&random&i;
      ks      = round(nvalue2 * 100, 0.0001);
    run;
  %end;    
  %else %do;
    data _ks;
      set _ks _kstmp(in = a keep = nvalue2);
      if a then do;
        tree_id = &i;
        seed    = &&random&i;
        ks      = round(nvalue2 * 100, 0.0001);
      end;
    run;
  %end;  
%end;

proc sql noprint;
  select max(ks) into :ks from _ks;
  
  select tree_id into :best from _ks where round(ks, 0.0001) = round(&ks., 0.0001);
quit;

filename best catalog "data.catalog.tree%trim(&best).source";
filename output "BestTree.txt";

data _null_;
  infile best;
  input;
  file output;
  if _n_ = 1 then do;
    put " ******************************************************; ";
    put " ***** BEST TREE: TREE %trim(&best) WITH KS = &KS *****; ";
    put " ******************************************************; ";    
  end;
  put _infile_;
run;

data _out;
  set _ks;

  if round(ks, 0.0001) = round(&ks., 0.0001) then flag = '***';
run;

proc print data = _out noobs;
  var tree_id seed ks flag;
run;

%mend bumping;

libname data 'D:\SAS_CODE\bagging';

%let x1 = tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt
            tot_rev_line rev_util bureau_score ltv tot_income;

%let x2 = purpose;

%bumping(data = data.accepts, y = bad, numx = &x1, catx = &x2, ntrees = 50);

The table below shows the result of a stochastic search for the best decision tree out of 50 trees estimated from bootstrapped samples. In the result table, the best tree is flagged with “***”. With the related seed value, any bootstrap sample and its decision tree can be replicated, as sketched after the table.

tree_id      seed         ks      flag
    1      18496257    41.1210        
    2      97008872    41.2568        
    3      39982431    39.2714        
    4      25939865    38.7901        
    5      92160258    40.9343        
    6      96927735    40.6441        
    7      54297917    41.2460        
    8      53169172    40.5881        
    9       4979403    40.9662        
   10       6656655    41.2006        
   11      81931857    41.1540        
   12      52387052    40.1930        
   13      85339431    36.7912        
   14       6718458    39.9277        
   15      95702386    39.0264        
   16      29719396    39.8790        
   17      27261179    40.1256        
   18      68992963    40.7699        
   19      97676486    37.7472        
   20      22650752    39.6255        
   21      68823655    40.3759        
   22      41276387    41.2282        
   23      55855411    41.5945    *** 
   24      28722561    40.6127        
   25      47578931    40.2973        
   26      84498698    38.6929        
   27      63452412    41.0329        
   28      59036467    39.1822        
   29      58258153    40.5223        
   30      37701337    40.2190        
   31      72836156    40.1872        
   32      50660353    38.5086        
   33      93121359    39.9043        
   34      92912005    40.0265        
   35      58966034    38.8403        
   36      29722285    39.7879        
   37      39104243    38.4006        
   38      47242918    39.5534        
   39      67952575    39.2817        
   40      16808835    40.4024        
   41      16652610    40.5237        
   42      87110489    39.9251        
   43      29878953    39.6106        
   44      93464176    40.5942        
   45      90047083    40.4422        
   46      56878347    40.6057        
   47       4954566    39.7689        
   48      13558826    38.7292        
   49      51131788    41.0891        
   50      43320456    41.0566        
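
As a minimal sketch, and assuming the libname data from the example above is still assigned, the bootstrap sample behind the winning tree 23 could be replicated by re-running the same PROC SURVEYSELECT step used inside the macro with the seed 55855411 reported in the table:

*** REPLICATE THE BOOTSTRAP SAMPLE BEHIND THE BEST TREE (TREE 23, SEED = 55855411) ***;
proc sql noprint;
  select count(*) into :nobs from data.accepts where bad in (1, 0);
quit;

proc surveyselect data = data.accepts method = urs n = &nobs seed = 55855411
  out = sample23(rename = (NumberHits = _hits)) noprint;
run;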

Finally, the text file used to score the best tree is attached below, in which you can see the structure of the best decision tree selected out of the 50 trees.

 ******************************************************; 
 ***** BEST TREE: TREE 23 WITH KS =  41.5945 *****; 
 ******************************************************; 
 
 ******         LENGTHS OF NEW CHARACTER VARIABLES         ******;
 LENGTH I_bad  $   12; 
 LENGTH _WARN_  $    4; 
 
 ******              LABELS FOR NEW VARIABLES              ******;
 LABEL _NODE_  = 'Node' ;
 LABEL _LEAF_  = 'Leaf' ;
 LABEL P_bad0  = 'Predicted: bad=0' ;
 LABEL P_bad1  = 'Predicted: bad=1' ;
 LABEL I_bad  = 'Into: bad' ;
 LABEL U_bad  = 'Unnormalized Into: bad' ;
 LABEL _WARN_  = 'Warnings' ;
 
 
 ******      TEMPORARY VARIABLES FOR FORMATTED VALUES      ******;
 LENGTH _ARBFMT_2 $     12; DROP _ARBFMT_2; 
 _ARBFMT_2 = ' '; /* Initialize to avoid warning. */
 LENGTH _ARBFMT_15 $      5; DROP _ARBFMT_15; 
 _ARBFMT_15 = ' '; /* Initialize to avoid warning. */
 
 
 ******             ASSIGN OBSERVATION TO NODE             ******;
 IF  NOT MISSING(bureau_score ) AND 
                  662.5 <= bureau_score  THEN DO;
   IF  NOT MISSING(bureau_score ) AND 
                    721.5 <= bureau_score  THEN DO;
     IF  NOT MISSING(tot_derog ) AND 
       tot_derog  <                  3.5 THEN DO;
       IF  NOT MISSING(ltv ) AND 
                         99.5 <= ltv  THEN DO;
         IF  NOT MISSING(tot_rev_line ) AND 
           tot_rev_line  <                16671 THEN DO;
           _ARBFMT_15 = PUT( purpose , $5.);
            %DMNORMIP( _ARBFMT_15);
           IF _ARBFMT_15 IN ('LEASE' ) THEN DO;
             _NODE_  =                   62;
             _LEAF_  =                   29;
             P_bad0  =     0.77884615384615;
             P_bad1  =     0.22115384615384;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   63;
             _LEAF_  =                   30;
             P_bad0  =     0.91479820627802;
             P_bad1  =     0.08520179372197;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         ELSE DO;
           IF  NOT MISSING(ltv ) AND 
                            131.5 <= ltv  THEN DO;
             _NODE_  =                   65;
             _LEAF_  =                   32;
             P_bad0  =     0.79166666666666;
             P_bad1  =     0.20833333333333;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   64;
             _LEAF_  =                   31;
             P_bad0  =     0.96962025316455;
             P_bad1  =     0.03037974683544;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       ELSE DO;
         IF  NOT MISSING(tot_open_tr ) AND 
           tot_open_tr  <                  1.5 THEN DO;
           _NODE_  =                   42;
           _LEAF_  =                   26;
           P_bad0  =                  0.9;
           P_bad1  =                  0.1;
           I_bad  = '0' ;
           U_bad  =                    0;
           END;
         ELSE DO;
           IF  NOT MISSING(tot_derog ) AND 
                              1.5 <= tot_derog  THEN DO;
             _NODE_  =                   61;
             _LEAF_  =                   28;
             P_bad0  =      0.9047619047619;
             P_bad1  =     0.09523809523809;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   60;
             _LEAF_  =                   27;
             P_bad0  =     0.98779134295227;
             P_bad1  =     0.01220865704772;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       END;
     ELSE DO;
       _NODE_  =                   15;
       _LEAF_  =                   33;
       P_bad0  =     0.75925925925925;
       P_bad1  =     0.24074074074074;
       I_bad  = '0' ;
       U_bad  =                    0;
       END;
     END;
   ELSE DO;
     IF  NOT MISSING(tot_rev_line ) AND 
                     6453.5 <= tot_rev_line  THEN DO;
       IF  NOT MISSING(ltv ) AND 
                        122.5 <= ltv  THEN DO;
         _NODE_  =                   27;
         _LEAF_  =                   25;
         P_bad0  =      0.7471264367816;
         P_bad1  =     0.25287356321839;
         I_bad  = '0' ;
         U_bad  =                    0;
         END;
       ELSE DO;
         IF  NOT MISSING(tot_derog ) AND 
                            9.5 <= tot_derog  THEN DO;
           _NODE_  =                   41;
           _LEAF_  =                   24;
           P_bad0  =     0.53846153846153;
           P_bad1  =     0.46153846153846;
           I_bad  = '0' ;
           U_bad  =                    0;
           END;
         ELSE DO;
           IF  NOT MISSING(tot_rev_line ) AND 
             tot_rev_line  <                16694 THEN DO;
             _NODE_  =                   58;
             _LEAF_  =                   22;
             P_bad0  =     0.84235294117647;
             P_bad1  =     0.15764705882352;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   59;
             _LEAF_  =                   23;
             P_bad0  =     0.92375366568914;
             P_bad1  =     0.07624633431085;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       END;
     ELSE DO;
       IF  NOT MISSING(ltv ) AND 
         ltv  <                133.5 THEN DO;
         IF  NOT MISSING(tot_income ) AND 
           tot_income  <               2377.2 THEN DO;
           IF  NOT MISSING(age_oldest_tr ) AND 
             age_oldest_tr  <                   57 THEN DO;
             _NODE_  =                   54;
             _LEAF_  =                   17;
             P_bad0  =                 0.75;
             P_bad1  =                 0.25;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   55;
             _LEAF_  =                   18;
             P_bad0  =     0.93150684931506;
             P_bad1  =     0.06849315068493;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         ELSE DO;
           IF  NOT MISSING(ltv ) AND 
             ltv  <                 94.5 THEN DO;
             _NODE_  =                   56;
             _LEAF_  =                   19;
             P_bad0  =     0.86301369863013;
             P_bad1  =     0.13698630136986;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   57;
             _LEAF_  =                   20;
             P_bad0  =     0.67766497461928;
             P_bad1  =     0.32233502538071;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       ELSE DO;
         _NODE_  =                   25;
         _LEAF_  =                   21;
         P_bad0  =      0.4090909090909;
         P_bad1  =     0.59090909090909;
         I_bad  = '1' ;
         U_bad  =                    1;
         END;
       END;
     END;
   END;
 ELSE DO;
   IF  NOT MISSING(ltv ) AND 
     ltv  <                 97.5 THEN DO;
     IF  NOT MISSING(bureau_score ) AND 
       bureau_score  <                639.5 THEN DO;
       IF  NOT MISSING(tot_open_tr ) AND 
                          3.5 <= tot_open_tr  THEN DO;
         IF  NOT MISSING(tot_income ) AND 
           tot_income  <             2604.165 THEN DO;
           _NODE_  =                   32;
           _LEAF_  =                    3;
           P_bad0  =     0.54237288135593;
           P_bad1  =     0.45762711864406;
           I_bad  = '0' ;
           U_bad  =                    0;
           END;
         ELSE DO;
           IF  NOT MISSING(tot_income ) AND 
                             7375 <= tot_income  THEN DO;
             _NODE_  =                   47;
             _LEAF_  =                    5;
             P_bad0  =     0.57575757575757;
             P_bad1  =     0.42424242424242;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   46;
             _LEAF_  =                    4;
             P_bad0  =     0.81102362204724;
             P_bad1  =     0.18897637795275;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       ELSE DO;
         IF  NOT MISSING(tot_rev_line ) AND 
           tot_rev_line  <                 2460 THEN DO;
           _NODE_  =                   30;
           _LEAF_  =                    1;
           P_bad0  =     0.69411764705882;
           P_bad1  =     0.30588235294117;
           I_bad  = '0' ;
           U_bad  =                    0;
           END;
         ELSE DO;
           _NODE_  =                   31;
           _LEAF_  =                    2;
           P_bad0  =     0.34343434343434;
           P_bad1  =     0.65656565656565;
           I_bad  = '1' ;
           U_bad  =                    1;
           END;
         END;
       END;
     ELSE DO;
       IF  NOT MISSING(tot_income ) AND 
         tot_income  <             9291.835 THEN DO;
         IF  NOT MISSING(tot_tr ) AND 
                           13.5 <= tot_tr  THEN DO;
           _NODE_  =                   35;
           _LEAF_  =                    8;
           P_bad0  =     0.94039735099337;
           P_bad1  =     0.05960264900662;
           I_bad  = '0' ;
           U_bad  =                    0;
           END;
         ELSE DO;
           IF  NOT MISSING(bureau_score ) AND 
             bureau_score  <                646.5 THEN DO;
             _NODE_  =                   48;
             _LEAF_  =                    6;
             P_bad0  =                    1;
             P_bad1  =                    0;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   49;
             _LEAF_  =                    7;
             P_bad0  =     0.73604060913705;
             P_bad1  =     0.26395939086294;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       ELSE DO;
         _NODE_  =                   19;
         _LEAF_  =                    9;
         P_bad0  =     0.35714285714285;
         P_bad1  =     0.64285714285714;
         I_bad  = '1' ;
         U_bad  =                    1;
         END;
       END;
     END;
   ELSE DO;
     IF  NOT MISSING(tot_rev_line ) AND 
       tot_rev_line  <               1218.5 THEN DO;
       IF  NOT MISSING(age_oldest_tr ) AND 
                          115 <= age_oldest_tr  THEN DO;
         _NODE_  =                   21;
         _LEAF_  =                   11;
         P_bad0  =     0.60273972602739;
         P_bad1  =      0.3972602739726;
         I_bad  = '0' ;
         U_bad  =                    0;
         END;
       ELSE DO;
         _NODE_  =                   20;
         _LEAF_  =                   10;
         P_bad0  =     0.28776978417266;
         P_bad1  =     0.71223021582733;
         I_bad  = '1' ;
         U_bad  =                    1;
         END;
       END;
     ELSE DO;
       IF  NOT MISSING(bureau_score ) AND 
         bureau_score  <                  566 THEN DO;
         _NODE_  =                   22;
         _LEAF_  =                   12;
         P_bad0  =     0.15384615384615;
         P_bad1  =     0.84615384615384;
         I_bad  = '1' ;
         U_bad  =                    1;
         END;
       ELSE DO;
         IF  NOT MISSING(tot_rev_line ) AND 
                        13717.5 <= tot_rev_line  THEN DO;
           IF  NOT MISSING(tot_rev_debt ) AND 
             tot_rev_debt  <                11884 THEN DO;
             _NODE_  =                   52;
             _LEAF_  =                   15;
             P_bad0  =     0.85869565217391;
             P_bad1  =     0.14130434782608;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   53;
             _LEAF_  =                   16;
             P_bad0  =     0.65972222222222;
             P_bad1  =     0.34027777777777;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         ELSE DO;
           IF  NOT MISSING(ltv ) AND 
             ltv  <                 99.5 THEN DO;
             _NODE_  =                   50;
             _LEAF_  =                   13;
             P_bad0  =     0.41489361702127;
             P_bad1  =     0.58510638297872;
             I_bad  = '1' ;
             U_bad  =                    1;
             END;
           ELSE DO;
             _NODE_  =                   51;
             _LEAF_  =                   14;
             P_bad0  =     0.61769352290679;
             P_bad1  =      0.3823064770932;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       END;
     END;
   END;
 
 ****************************************************************;
 ******          END OF DECISION TREE SCORING CODE         ******;
 ****************************************************************;

A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development

This SAS macro was specifically designed for model developers to do univariate variable importance ranking and monotonic weight of evidence (WOE) transformation for potentially hundreds of predictors in scorecard development. Please feel free to use or distribute it at your own risk. I would really appreciate it if you could share your success stories of using this macro in your model development with me.
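
For instance, with the consumer banking data set used in the bagging example above, a call might look like the line below, where &x1 refers to the list of numeric attributes defined earlier.

%num_woe(data = data.accepts, y = bad, x = &x1);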

%macro num_woe(data = , y = , x = );
***********************************************************;
* THE SAS MACRO IS TO PERFORM UNIVARIATE IMPORTANCE RANK  *;
* ORDER AND MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION   *;
* FOR NUMERIC ATTRIBUTES IN PRE-MODELING DATA PROCESSING  *;
* (IT IS RECOMMENDED TO RUN THIS MACRO IN THE BATCH MODE) *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  Y   : RESPONSE VARIABLE WITH 0/1 VALUE                 *;
*  X   : A LIST OF NUMERIC ATTRIBUTES                     *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  MONO_WOE.WOE: A FILE OF WOE TRANSFORMATION RECODING    *;
*  MONO_WOE.FMT: A FILE OF BINNING FORMAT                 *;
*  MONO_WOE.PUT: A FILE OF PUT STATEMENTS FOR *.FMT FILE  *;
*  MONO_WOE.SUM: A FILE WITH PREDICTABILITY SUMMARY       *;
*  MONO_WOE.OUT: A FILE WITH STATISTICAL DETAILS          *;
*  MONO_WOE.IMP: A FILE OF MISSING IMPUTATION RECODING    *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150;

*** DEFAULT PARAMETERS ***;

%local maxbin minbad miniv bignum;

%let maxbin = 100;

%let minbad = 50;

%let miniv  = 0.03;

%let bignum = 1e300;

***********************************************************;
***         DO NOT CHANGE CODES BELOW THIS LINE         ***;
***********************************************************;

*** DEFAULT OUTPUT FILES ***;

* WOE RECODING FILE                     *;
filename woefile "MONO_WOE.WOE";

* FORMAT FOR BINNING                    *;
filename fmtfile "MONO_WOE.FMT";

* PUT STATEMENT TO USE FORMAT           *;
filename binfile "MONO_WOE.PUT";

* KS SUMMARY                            *;
filename sumfile "MONO_WOE.SUM";
 
* STATISTICAL SUMMARY FOR EACH VARIABLE *;
filename outfile "MONO_WOE.OUT";

* IMPUTE RECODING FILE                  *;
filename impfile "MONO_WOE.IMP";

*** A MACRO TO DELETE FILE ***;
%macro dfile(file = );
  data _null_;
    rc = fdelete("&file");
    if rc = 0 then do;
      put @1 50 * "+";
      put "THE EXISTING OUTPUT FILE HAS BEEN DELETED.";
      put @1 50 * "+";
    end;
  run;
%mend dfile;

*** CLEAN UP FILES ***;
%dfile(file = woefile);

%dfile(file = fmtfile);

%dfile(file = binfile);

%dfile(file = sumfile);

%dfile(file = outfile);

%dfile(file = impfile);

*** PARSING THE STRING OF NUMERIC PREDICTORS ***;
ods listing close;
ods output position = _pos1;
proc contents data = &data varnum;
run;

proc sql noprint;
  select
    upcase(variable) into :x2 separated by ' '
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;


  select
    count(variable) into :xcnt
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;
quit;

data _tmp1;
  retain &x2 &y;
  set &data;
  where &Y in (1, 0);
  keep &x2 &y;
run;

ods output position = _pos2;
proc contents data = _tmp1 varnum;
run;

*** LOOP THROUGH EACH PREDICTOR ***;
%do i = 1 %to &xcnt;
    
  proc sql noprint;
    select
      upcase(variable) into :var
    from
      _pos2
    where
      num= &i;

    select
      count(distinct &var) into :xflg
    from
      _tmp1
    where
      &var ~= .;
  quit;

  proc summary data = _tmp1 nway;
    output out  = _med(drop = _type_ _freq_)
    median(&var) = med nmiss(&var) = mis;
  run;
  
  proc sql;
    select
      med into :median
    from
      _med;

    select
      mis into :nmiss
    from
      _med;

    select 
      case when count(&y) = sum(&y) then 1 else 0 end into :mis_flg1
    from
      _tmp1
    where
      &var = .;

    select
      case when sum(&y) = 0 then 1 else 0 end into :mis_flg2
    from
      _tmp1
    where
      &var = .;
  quit;

  %let nbin = %sysfunc(min(&maxbin, &xflg));

  *** CHECK IF THE NUMBER OF DISTINCT VALUES > 1 ***;
  %if &xflg > 1 %then %do;

    *** IMPUTE MISS VALUE WHEN WOE CANNOT BE CALCULATED ***;
    %if &mis_flg1 = 1 | &mis_flg2 = 1 %then %do;
      data _null_;
        file impfile mod;
        put " ";
        put @3 "*** MEDIAN IMPUTATION OF %TRIM(%UPCASE(&VAR)) (NMISS = %trim(&nmiss)) ***;";
        put @3 "IF %TRIM(%UPCASE(&VAR)) = . THEN %TRIM(%UPCASE(&VAR)) = &MEDIAN;";
      run;

      data _tmp1;
        set _tmp1;
        if &var = . then &var = &median;
      run; 
    %end;      
      
    *** LOOP THROUGH THE NUMBER OF BINS ***;
    %do j = &nbin %to 2 %by -1;
      proc rank data = _tmp1 groups = &j out = _tmp2(keep = &y &var rank);
        var &var;
        ranks rank;
      run;

      proc summary data = _tmp2 nway missing;
        class rank;
        output out = _tmp3(drop = _type_ rename = (_freq_ = freq))
        sum(&y)   = bad    mean(&y)  = bad_rate
        min(&var) = minx   max(&var) = maxx;
      run;

      *** CREATE FLAGS FOR MULTIPLE CRITERION ***;
      proc sql noprint;
        select
          case when min(bad) >= &minbad then 1 else 0 end into :badflg
        from
          _tmp3;

        select
          case when min(bad_rate) > 0 then 1 else 0 end into :minflg
        from
          _tmp3;

        select
          case when max(bad_rate) < 1 then 1 else 0 end into :maxflg
        from
          _tmp3;              
      quit;

      *** CHECK IF SPEARMAN CORRELATION = 1 ***;
      %if &badflg = 1 & &minflg = 1 & &maxflg = 1 %then %do;
        ods output spearmancorr = _corr(rename = (minx = cor));
        proc corr data = _tmp3 spearman;
          var minx;
          with bad_rate;
        run;

        proc sql noprint;
          select
            case when abs(cor) = 1 then 1 else 0 end into :cor
          from
            _corr;
        quit;

        *** IF SPEARMAN CORR = 1 THEN BREAK THE LOOP ***;
        %if &cor = 1 %then %goto loopout;
      %end;
      %else %if &j = 2 %then %goto exit;
    %end;

    %loopout:
    
    *** CALCULATE STATISTICAL SUMMARY ***;
    proc sql noprint;
      select 
        sum(freq) into :freq
      from
        _tmp3;

      select
        sum(bad) into :bad
      from
        _tmp3;
    quit;

    proc sort data = _tmp3 sortsize = max;
      by rank;
    run;

    data _tmp4;
      retain bin minx maxx bad freq pct bad_rate;
      set _tmp3 end = eof;
      by rank;

      if rank = . then bin = 0;
      else do;
        retain b 0;
        bin + 1;
      end;
  
      pct  = freq / &freq;
      bpct = bad / &bad;
      gpct = (freq - bad) / (&freq - &bad);
      woe  = log(bpct / gpct);
      iv   = (bpct - gpct) * woe;

      retain cum_bpct cum_gpct;
      cum_bpct + bpct;
      cum_gpct + gpct;
      ks = abs(cum_gpct - cum_bpct) * 100;

      retain iv_sum ks_max;
      iv_sum + iv;
      ks_max = max(ks_max, ks);
      if eof then do;
        call symput("bin", put(bin, 4.));
        call symput("ks", put(ks_max, 10.4));
        call symput("iv", put(iv_sum, 10.4));
      end;

      keep bin minx maxx bad freq pct bad_rate
           gpct bpct woe iv cum_gpct cum_bpct ks;
    run;

    *** REPORT STATISTICAL SUMMARY ***;
    proc printto print = outfile;
    run;

    title;
    ods listing;
    proc report data = _tmp4 spacing = 1 split = "*" headline nowindows;
      column(" * MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(%trim(&var))"
             bin minx maxx freq pct bad bad_rate woe iv ks);

      define bin      /"BIN*LEVEL"   width = 5  format = z3. order order = data;
      define minx     /"LOWER*LIMIT" width = 15 format = 14.4;
      define maxx     /"UPPER*LIMIT" width = 15 format = 14.4;
      define bad      /"#BADS*(Y=1)" width = 8  format = 7.;
      define freq     /"#FREQ"       width = 10 format = 9.;
      define pct      /"PERCENT"     width = 8  format = percent8.2;
      define bad_rate /"BAD*RATE"    width = 8  format = percent8.2;
      define woe      /"WOE"         width = 10 format = 9.4;
      define iv       /"INFO.*VALUE" width = 10 format = 9.4;
      define ks       /"KS"          width = 10 format = 9.4;
      compute after;
        line @1 110 * "-";
        line @5 "# TOTAL = %trim(&freq), # BADs(Y=1) = %trim(&bad), "
                "OVERALL BAD RATE = %trim(%sysfunc(round(&bad / &freq * 100, 0.0001)))%, "
                "MAX. KS = %trim(&ks), INFO. VALUE = %trim(&iv).";
        line @1 110 * "-";    
      endcomp;
    run;
    ods listing close;

    proc printto;
    run;

    proc sql noprint;
      select
        case when sum(iv) >= &miniv then 1 else 0 end into :ivflg
      from
        _tmp4;
    quit;

    *** OUTPUT RECODING FILES IF IV >= &miniv BY DEFAULT ***;
    %if &ivflg = 1 %then %do;
      data _tmp5;
        length upper $20 lower $20;
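        * NOTE: MAXX IN THE NEXT STATEMENT STILL HOLDS THE VALUE RETAINED FROM *;
        * THE PREVIOUS RECORD BECAUSE THE ASSIGNMENT EXECUTES BEFORE THE SET   *;
        * STATEMENT READS A NEW ROW, GIVING THE LOWER LIMIT OF THE CURRENT BIN *;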
        lower = compress(put(maxx, 20.4), ' ');

        set _tmp4 end = eof;
        upper = compress(put(maxx, 20.4), ' ');
        if bin = 1 then lower = "-%trim(&bignum)";
        if eof then upper = "%trim(&bignum)";
        w%trim(&var) = compress(put(woe, 12.8), ' ');
      run;

      *** OUTPUT WOE RECODE FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file woefile mod;

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3  "if %trim(&var) = . then w%trim(&var) = " + 1 w%trim(&var) ";";
        end;
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3 "if " + 1 lower " < %trim(&var) <= " upper
                 " then w%trim(&var) = " + 1 w%trim(&var) ";";
        end;
        if _n_ > 1 then do;
          put @5 "else if " + 1 lower " < %trim(&var) <= " upper
                 " then w%trim(&var) = " + 1 w%trim(&var) ";";
        end;
        if eof then do;
          put @5 "else w%trim(&var) = 0;";
        end;
      run;

      *** OUTPUT BINNING FORMAT FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file fmtfile mod;

        if bin = 1 then lower = "LOW";
        if eof then upper = "HIGH";

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 ". " @40 " = '" bin: z3.
                 ". MISSINGS'";
        end;

            
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
              @5 "BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 lower @15 " - " + 1 upper  @40 " = '" bin: z3.
                 ". " + 1 lower " - " + 1 upper "'";
        end;

        if _n_ > 1 then do;
          put @5 lower @15 "<- " + 1 upper @40 " = '" bin: z3.
                 ". " + 1 lower "<- " + 1 upper "'";
        end;
        if eof then do;
          put @5 "OTHER" @40 " = '999 .  OTHERS';";
        end;
      run;

      *** OUTPUT BINNING RECODE FILE ***;
      data _null_;
        file binfile mod;
        put " ";
        put @3 "*** BINNING RECODE of %trim(&var) ***;";
        put @3 "c%trim(&var) = put(%trim(&var), %trim(&var)_fmt.);";
      run;

      *** SAVE SUMMARY OF EACH VARIABLE INTO A TABLE ***;
      %if %sysfunc(exist(work._result)) %then %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          if _n_ = 1 then do;
            variable = "%trim(&var)";
            bin      = &bin;
            ks       = &ks;
            iv       = &iv;
            output;
          end;
          set _result;
          output;
        run;
      %end;
      %else %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          variable = "%trim(&var)";
          bin      = &bin;
          ks       = &ks;
          iv       = &iv;
        run;        
      %end;
    %end;

    %exit:

    *** CLEAN UP TEMPORARY TABLES ***;
    proc datasets library = work nolist;
      delete _tmp2 - _tmp5 _corr / memtype = data;
    run;
    quit;
  %end;    
%end;

*** SORT VARIABLES BY KS AND OUTPUT RESULTS ***;
proc sort data = _result sortsize = max;
  by descending ks descending iv;
run;

data _null_;
  set _result end = eof;
  file sumfile;

  if _n_ = 1 then do;
    put @1 80 * "-";
    put @1  "| RANK" @10 "| VARIABLE RANKED BY KS" @45 "| # BINS"
        @55 "|  KS"  @66 "| INFO. VALUE" @80 "|";
    put @1 80 * "-";
  end;
  put @1  "| " @4  _n_ z3. @10 "| " @12 variable @45 "| " @50 bin
      @55 "| " @57 ks      @66 "| " @69 iv       @80 "|";
  if eof then do;
    put @1 80 * "-";
  end;
run;

proc datasets library = work nolist;
  delete _result (mt = data);
run;
quit;

*********************************************************;
*           END OF NUM_WOE MACRO                        *;
*********************************************************;
%mend num_woe;

libname data 'D:\SAS_CODE\woe';

%let x = 
tot_derog
tot_tr
age_oldest_tr
tot_open_tr
tot_rev_tr
tot_rev_debt
tot_rev_line
rev_util
bureau_score
ltv
tot_income
;

%num_woe(data = data.accepts, y = bad, x = &x);

The macro above automatically generates six standard output files, each serving a different purpose in the scorecard development process.

1) “MONO_WOE.WOE” is a file of WOE transformation recoding.

 
  *** WOE RECODE OF TOT_DEROG (KS = 20.0442, IV = 0.2480) ***;
  if TOT_DEROG = . then wTOT_DEROG =  0.64159782 ;
    else if  -1e300  < TOT_DEROG <= 0.0000  then wTOT_DEROG =  -0.55591373 ;
    else if  0.0000  < TOT_DEROG <= 2.0000  then wTOT_DEROG =  0.14404414 ;
    else if  2.0000  < TOT_DEROG <= 4.0000  then wTOT_DEROG =  0.50783799 ;
    else if  4.0000  < TOT_DEROG <= 1e300  then wTOT_DEROG =  0.64256014 ;
    else wTOT_DEROG = 0;
 
  *** WOE RECODE OF TOT_TR (KS = 16.8344, IV = 0.1307) ***;
  if TOT_TR = . then wTOT_TR =  0.64159782 ;
    else if  -1e300  < TOT_TR <= 7.0000  then wTOT_TR =  0.40925900 ;
    else if  7.0000  < TOT_TR <= 12.0000  then wTOT_TR =  0.26386662 ;
    else if  12.0000  < TOT_TR <= 18.0000  then wTOT_TR =  -0.13512611 ;
    else if  18.0000  < TOT_TR <= 25.0000  then wTOT_TR =  -0.40608173 ;
    else if  25.0000  < TOT_TR <= 1e300  then wTOT_TR =  -0.42369090 ;
    else wTOT_TR = 0;
 
  *** WOE RECODE OF AGE_OLDEST_TR (KS = 19.6163, IV = 0.2495) ***;
  if AGE_OLDEST_TR = . then wAGE_OLDEST_TR =  0.66280002 ;
    else if  -1e300  < AGE_OLDEST_TR <= 46.0000  then wAGE_OLDEST_TR =  0.66914925 ;
    else if  46.0000  < AGE_OLDEST_TR <= 77.0000  then wAGE_OLDEST_TR =  0.36328349 ;
    else if  77.0000  < AGE_OLDEST_TR <= 114.0000  then wAGE_OLDEST_TR =  0.15812827 ;
    else if  114.0000  < AGE_OLDEST_TR <= 137.0000  then wAGE_OLDEST_TR =  0.01844301 ;
    else if  137.0000  < AGE_OLDEST_TR <= 164.0000  then wAGE_OLDEST_TR =  -0.04100445 ;
    else if  164.0000  < AGE_OLDEST_TR <= 204.0000  then wAGE_OLDEST_TR =  -0.32667232 ;
    else if  204.0000  < AGE_OLDEST_TR <= 275.0000  then wAGE_OLDEST_TR =  -0.79931317 ;
    else if  275.0000  < AGE_OLDEST_TR <= 1e300  then wAGE_OLDEST_TR =  -0.89926463 ;
    else wAGE_OLDEST_TR = 0;
 
  *** WOE RECODE OF TOT_REV_TR (KS = 9.0779, IV = 0.0757) ***;
  if TOT_REV_TR = . then wTOT_REV_TR =  0.69097090 ;
    else if  -1e300  < TOT_REV_TR <= 1.0000  then wTOT_REV_TR =  0.00269270 ;
    else if  1.0000  < TOT_REV_TR <= 3.0000  then wTOT_REV_TR =  -0.14477602 ;
    else if  3.0000  < TOT_REV_TR <= 1e300  then wTOT_REV_TR =  -0.15200275 ;
    else wTOT_REV_TR = 0;
 
  *** WOE RECODE OF TOT_REV_DEBT (KS = 8.5317, IV = 0.0629) ***;
  if TOT_REV_DEBT = . then wTOT_REV_DEBT =  0.68160936 ;
    else if  -1e300  < TOT_REV_DEBT <= 3009.0000  then wTOT_REV_DEBT =  0.04044249 ;
    else if  3009.0000  < TOT_REV_DEBT <= 1e300  then wTOT_REV_DEBT =  -0.19723686 ;
    else wTOT_REV_DEBT = 0;
 
  *** WOE RECODE OF TOT_REV_LINE (KS = 25.5174, IV = 0.3970) ***;
  if TOT_REV_LINE = . then wTOT_REV_LINE =  0.68160936 ;
    else if  -1e300  < TOT_REV_LINE <= 1477.0000  then wTOT_REV_LINE =  0.73834416 ;
    else if  1477.0000  < TOT_REV_LINE <= 4042.0000  then wTOT_REV_LINE =  0.34923628 ;
    else if  4042.0000  < TOT_REV_LINE <= 8350.0000  then wTOT_REV_LINE =  0.11656236 ;
    else if  8350.0000  < TOT_REV_LINE <= 14095.0000  then wTOT_REV_LINE =  0.03996934 ;
    else if  14095.0000  < TOT_REV_LINE <= 23419.0000  then wTOT_REV_LINE =  -0.49492745 ;
    else if  23419.0000  < TOT_REV_LINE <= 38259.0000  then wTOT_REV_LINE =  -0.94090721 ;
    else if  38259.0000  < TOT_REV_LINE <= 1e300  then wTOT_REV_LINE =  -1.22174118 ;
    else wTOT_REV_LINE = 0;
 
  *** WOE RECODE OF REV_UTIL (KS = 14.3262, IV = 0.0834) ***;
  if  -1e300  < REV_UTIL <= 29.0000  then wREV_UTIL =  -0.31721190 ;
    else if  29.0000  < REV_UTIL <= 1e300  then wREV_UTIL =  0.26459777 ;
    else wREV_UTIL = 0;
 
  *** WOE RECODE OF BUREAU_SCORE (KS = 34.1481, IV = 0.7251) ***;
  if BUREAU_SCORE = . then wBUREAU_SCORE =  0.66280002 ;
    else if  -1e300  < BUREAU_SCORE <= 653.0000  then wBUREAU_SCORE =  0.93490359 ;
    else if  653.0000  < BUREAU_SCORE <= 692.0000  then wBUREAU_SCORE =  0.07762676 ;
    else if  692.0000  < BUREAU_SCORE <= 735.0000  then wBUREAU_SCORE =  -0.58254635 ;
    else if  735.0000  < BUREAU_SCORE <= 1e300  then wBUREAU_SCORE =  -1.61790566 ;
    else wBUREAU_SCORE = 0;
 
  *** WOE RECODE OF LTV (KS = 16.3484, IV = 0.1625) ***;
  if  -1e300  < LTV <= 82.0000  then wLTV =  -0.84674934 ;
    else if  82.0000  < LTV <= 91.0000  then wLTV =  -0.43163689 ;
    else if  91.0000  < LTV <= 97.0000  then wLTV =  -0.14361551 ;
    else if  97.0000  < LTV <= 101.0000  then wLTV =  0.08606320 ;
    else if  101.0000  < LTV <= 107.0000  then wLTV =  0.18554122 ;
    else if  107.0000  < LTV <= 115.0000  then wLTV =  0.22405397 ;
    else if  115.0000  < LTV <= 1e300  then wLTV =  0.51906325 ;
    else wLTV = 0;

2) “MONO_WOE.FMT” is a file of binning format definitions.

 
  *** BINNING FORMAT OF TOT_DEROG (KS = 20.0442, IV = 0.2480) ***;
  value TOT_DEROG_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  0.0000                = '001 .  LOW <-  0.0000 '
    0.0000    <-  2.0000                = '002 .  0.0000 <-  2.0000 '
    2.0000    <-  4.0000                = '003 .  2.0000 <-  4.0000 '
    4.0000    <-  HIGH                  = '004 .  4.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_TR (KS = 16.8344, IV = 0.1307) ***;
  value TOT_TR_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  7.0000                = '001 .  LOW <-  7.0000 '
    7.0000    <-  12.0000               = '002 .  7.0000 <-  12.0000 '
    12.0000   <-  18.0000               = '003 .  12.0000 <-  18.0000 '
    18.0000   <-  25.0000               = '004 .  18.0000 <-  25.0000 '
    25.0000   <-  HIGH                  = '005 .  25.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF AGE_OLDEST_TR (KS = 19.6163, IV = 0.2495) ***;
  value AGE_OLDEST_TR_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  46.0000               = '001 .  LOW <-  46.0000 '
    46.0000   <-  77.0000               = '002 .  46.0000 <-  77.0000 '
    77.0000   <-  114.0000              = '003 .  77.0000 <-  114.0000 '
    114.0000  <-  137.0000              = '004 .  114.0000 <-  137.0000 '
    137.0000  <-  164.0000              = '005 .  137.0000 <-  164.0000 '
    164.0000  <-  204.0000              = '006 .  164.0000 <-  204.0000 '
    204.0000  <-  275.0000              = '007 .  204.0000 <-  275.0000 '
    275.0000  <-  HIGH                  = '008 .  275.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_REV_TR (KS = 9.0779, IV = 0.0757) ***;
  value TOT_REV_TR_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  1.0000                = '001 .  LOW <-  1.0000 '
    1.0000    <-  3.0000                = '002 .  1.0000 <-  3.0000 '
    3.0000    <-  HIGH                  = '003 .  3.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_REV_DEBT (KS = 8.5317, IV = 0.0629) ***;
  value TOT_REV_DEBT_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  3009.0000             = '001 .  LOW <-  3009.0000 '
    3009.0000 <-  HIGH                  = '002 .  3009.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_REV_LINE (KS = 25.5174, IV = 0.3970) ***;
  value TOT_REV_LINE_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  1477.0000             = '001 .  LOW <-  1477.0000 '
    1477.0000 <-  4042.0000             = '002 .  1477.0000 <-  4042.0000 '
    4042.0000 <-  8350.0000             = '003 .  4042.0000 <-  8350.0000 '
    8350.0000 <-  14095.0000            = '004 .  8350.0000 <-  14095.0000 '
    14095.0000<-  23419.0000            = '005 .  14095.0000 <-  23419.0000 '
    23419.0000<-  38259.0000            = '006 .  23419.0000 <-  38259.0000 '
    38259.0000<-  HIGH                  = '007 .  38259.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  **BINNING FORMAT OF REV_UTIL (KS = 14.3262, IV = 0.0834) ***;
  value REV_UTIL_fmt
    LOW        -  29.0000               = '001 .  LOW  -  29.0000 '
    29.0000   <-  HIGH                  = '002 .  29.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF BUREAU_SCORE (KS = 34.1481, IV = 0.7251) ***;
  value BUREAU_SCORE_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  653.0000              = '001 .  LOW <-  653.0000 '
    653.0000  <-  692.0000              = '002 .  653.0000 <-  692.0000 '
    692.0000  <-  735.0000              = '003 .  692.0000 <-  735.0000 '
    735.0000  <-  HIGH                  = '004 .  735.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  **BINNING FORMAT OF LTV (KS = 16.3484, IV = 0.1625) ***;
  value LTV_fmt
    LOW        -  82.0000               = '001 .  LOW  -  82.0000 '
    82.0000   <-  91.0000               = '002 .  82.0000 <-  91.0000 '
    91.0000   <-  97.0000               = '003 .  91.0000 <-  97.0000 '
    97.0000   <-  101.0000              = '004 .  97.0000 <-  101.0000 '
    101.0000  <-  107.0000              = '005 .  101.0000 <-  107.0000 '
    107.0000  <-  115.0000              = '006 .  107.0000 <-  115.0000 '
    115.0000  <-  HIGH                  = '007 .  115.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';

3) “MONO_WOE.PUT” is a file of “put” statements that apply the binning formats defined in the *.FMT file above.

  *** BINNING RECODE of TOT_DEROG ***;
  cTOT_DEROG = put(TOT_DEROG, TOT_DEROG_fmt.);

  *** BINNING RECODE of TOT_TR ***;
  cTOT_TR = put(TOT_TR, TOT_TR_fmt.);

  *** BINNING RECODE of AGE_OLDEST_TR ***;
  cAGE_OLDEST_TR = put(AGE_OLDEST_TR, AGE_OLDEST_TR_fmt.);

  *** BINNING RECODE of TOT_REV_TR ***;
  cTOT_REV_TR = put(TOT_REV_TR, TOT_REV_TR_fmt.);

  *** BINNING RECODE of TOT_REV_DEBT ***;
  cTOT_REV_DEBT = put(TOT_REV_DEBT, TOT_REV_DEBT_fmt.);

  *** BINNING RECODE of TOT_REV_LINE ***;
  cTOT_REV_LINE = put(TOT_REV_LINE, TOT_REV_LINE_fmt.);

  *** BINNING RECODE of REV_UTIL ***;
  cREV_UTIL = put(REV_UTIL, REV_UTIL_fmt.);

  *** BINNING RECODE of BUREAU_SCORE ***;
  cBUREAU_SCORE = put(BUREAU_SCORE, BUREAU_SCORE_fmt.);

  *** BINNING RECODE of LTV ***;
  cLTV = put(LTV, LTV_fmt.);

4) “MONO_WOE.SUM” is a file summarizing the predictability of all numeric variables in terms of KS statistics and Information Values.

--------------------------------------------------------------------------------
| RANK   | VARIABLE RANKED BY KS            | # BINS  |  KS      | INFO. VALUE |
--------------------------------------------------------------------------------
|  001   | BUREAU_SCORE                     |    4    | 34.1481  |  0.7251     |
|  002   | TOT_REV_LINE                     |    7    | 25.5174  |  0.3970     |
|  003   | TOT_DEROG                        |    4    | 20.0442  |  0.2480     |
|  004   | AGE_OLDEST_TR                    |    8    | 19.6163  |  0.2495     |
|  005   | TOT_TR                           |    5    | 16.8344  |  0.1307     |
|  006   | LTV                              |    7    | 16.3484  |  0.1625     |
|  007   | REV_UTIL                         |    2    | 14.3262  |  0.0834     |
|  008   | TOT_REV_TR                       |    3    | 9.0779   |  0.0757     |
|  009   | TOT_REV_DEBT                     |    2    | 8.5317   |  0.0629     |
--------------------------------------------------------------------------------

5) “MONO_WOE.OUT” is a file providing statistical summaries of all numeric variables.

                          MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_DEROG                          
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            213   3.65%        70  32.86%      0.6416     0.0178     2.7716
   001          0.0000          0.0000       2850  48.83%       367  12.88%     -0.5559     0.1268    20.0442
   002          1.0000          2.0000       1369  23.45%       314  22.94%      0.1440     0.0051    16.5222
   003          3.0000          4.0000        587  10.06%       176  29.98%      0.5078     0.0298    10.6623
   004          5.0000         32.0000        818  14.01%       269  32.89%      0.6426     0.0685     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 20.0442, INFO. VALUE = 0.2480.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                            MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_TR                           
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            213   3.65%        70  32.86%      0.6416     0.0178     2.7716
   001          0.0000          7.0000       1159  19.86%       324  27.96%      0.4093     0.0372    11.8701
   002          8.0000         12.0000       1019  17.46%       256  25.12%      0.2639     0.0131    16.8344
   003         13.0000         18.0000       1170  20.04%       215  18.38%     -0.1351     0.0035    14.2335
   004         19.0000         25.0000       1126  19.29%       165  14.65%     -0.4061     0.0281     7.3227
   005         26.0000         77.0000       1150  19.70%       166  14.43%     -0.4237     0.0310     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 16.8344, INFO. VALUE = 0.1307.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                        MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR AGE_OLDEST_TR                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            216   3.70%        72  33.33%      0.6628     0.0193     2.9173
   001          1.0000         46.0000        708  12.13%       237  33.47%      0.6691     0.0647    12.5847
   002         47.0000         77.0000        699  11.98%       189  27.04%      0.3633     0.0175    17.3983
   003         78.0000        114.0000        703  12.04%       163  23.19%      0.1581     0.0032    19.3917
   004        115.0000        137.0000        707  12.11%       147  20.79%      0.0184     0.0000    19.6163
   005        138.0000        164.0000        706  12.10%       140  19.83%     -0.0410     0.0002    19.1263
   006        165.0000        204.0000        689  11.80%       108  15.67%     -0.3267     0.0114    15.6376
   007        205.0000        275.0000        703  12.04%        73  10.38%     -0.7993     0.0597     8.1666
   008        276.0000        588.0000        706  12.10%        67   9.49%     -0.8993     0.0734     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 19.6163, INFO. VALUE = 0.2495.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_OPEN_TR                         
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .           1416  24.26%       354  25.00%      0.2573     0.0173     6.7157
   001          0.0000          4.0000       1815  31.09%       353  19.45%     -0.0651     0.0013     4.7289
   002          5.0000          6.0000       1179  20.20%       226  19.17%     -0.0831     0.0014     3.0908
   003          7.0000         26.0000       1427  24.45%       263  18.43%     -0.1315     0.0041     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 6.7157, INFO. VALUE = 0.0240.                                            
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                          MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_REV_TR                         
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            636  10.90%       216  33.96%      0.6910     0.0623     9.0104
   001          0.0000          1.0000       1461  25.03%       300  20.53%      0.0027     0.0000     9.0779
   002          2.0000          3.0000       2002  34.30%       365  18.23%     -0.1448     0.0069     4.3237
   003          4.0000         24.0000       1738  29.78%       315  18.12%     -0.1520     0.0066     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 9.0779, INFO. VALUE = 0.0757.                                            
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_REV_DEBT                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            477   8.17%       161  33.75%      0.6816     0.0453     6.6527
   001          0.0000       3009.0000       2680  45.91%       567  21.16%      0.0404     0.0008     8.5317
   002       3010.0000      96260.0000       2680  45.91%       468  17.46%     -0.1972     0.0168     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 8.5317, INFO. VALUE = 0.0629.                                            
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_REV_LINE                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            477   8.17%       161  33.75%      0.6816     0.0453     6.6527
   001          0.0000       1477.0000        765  13.11%       268  35.03%      0.7383     0.0864    18.3518
   002       1481.0000       4042.0000        766  13.12%       205  26.76%      0.3492     0.0176    23.4043
   003       4044.0000       8350.0000        766  13.12%       172  22.45%      0.1166     0.0018    24.9867
   004       8360.0000      14095.0000        766  13.12%       162  21.15%      0.0400     0.0002    25.5174
   005      14100.0000      23419.0000        766  13.12%       104  13.58%     -0.4949     0.0276    19.9488
   006      23427.0000      38259.0000        766  13.12%        70   9.14%     -0.9409     0.0860    10.8049
   007      38300.0000     205395.0000        765  13.11%        54   7.06%     -1.2217     0.1320     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 25.5174, INFO. VALUE = 0.3970.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                           MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR REV_UTIL                          
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   001          0.0000         29.0000       2905  49.77%       459  15.80%     -0.3172     0.0454    14.3262
   002         30.0000        100.0000       2932  50.23%       737  25.14%      0.2646     0.0379     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 14.3262, INFO. VALUE = 0.0834.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR BUREAU_SCORE                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .            315   5.40%       105  33.33%      0.6628     0.0282     4.2544
   001        443.0000        653.0000       1393  23.86%       552  39.63%      0.9349     0.2621    32.2871
   002        654.0000        692.0000       1368  23.44%       298  21.78%      0.0776     0.0014    34.1481
   003        693.0000        735.0000       1383  23.69%       174  12.58%     -0.5825     0.0670    22.6462
   004        736.0000        848.0000       1378  23.61%        67   4.86%     -1.6179     0.3664     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 34.1481, INFO. VALUE = 0.7251.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                             MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR LTV                             
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   001          0.0000         82.0000        814  13.95%        81   9.95%     -0.8467     0.0764     9.0214
   002         83.0000         91.0000        837  14.34%       120  14.34%     -0.4316     0.0234    14.4372
   003         92.0000         97.0000        811  13.89%       148  18.25%     -0.1436     0.0027    16.3484
   004         98.0000        101.0000        830  14.22%       182  21.93%      0.0861     0.0011    15.0935
   005        102.0000        107.0000        870  14.90%       206  23.68%      0.1855     0.0054    12.1767
   006        108.0000        115.0000        808  13.84%       197  24.38%      0.2241     0.0074     8.8704
   007        116.0000        176.0000        867  14.85%       262  30.22%      0.5191     0.0460     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 16.3484, INFO. VALUE = 0.1625.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                          MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_INCOME                         
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 --------------------------------------------------------------------------------------------------------------
   000           .               .              5   0.09%         1  20.00%     -0.0303     0.0000     0.0026
   001          0.0000       3397.0000       2913  49.91%       669  22.97%      0.1457     0.0111     7.5822
   002       3400.0000    8147166.6600       2919  50.01%       526  18.02%     -0.1591     0.0121     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 7.5822, INFO. VALUE = 0.0231.                                            
--------------------------------------------------------------------------------------------------------------                                        

6) “MONO_WOE.IMP” is a file of median-imputation statements for the cases where the number of bads or goods among missing values is not sufficient to calculate the WOE.

  *** MEDIAN IMPUTATION OF LTV (NMISS = 1) ***;
  IF LTV = . THEN LTV =      100;
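
For illustration, below is a minimal sketch (assumed, not part of the macro output) of how these generated files might be applied to a scoring data set: the *.FMT file is compiled with PROC FORMAT first, and the remaining files are pulled in with %include statements. The output data set name is hypothetical.

*** COMPILE THE BINNING FORMATS GENERATED BY THE MACRO ***;
proc format;
  %include "MONO_WOE.FMT";
run;

*** APPLY IMPUTATION, WOE RECODING, AND BINNING TO A SCORING DATA SET ***;
data data.accepts_woe;
  set data.accepts;
  %include "MONO_WOE.IMP";
  %include "MONO_WOE.WOE";
  %include "MONO_WOE.PUT";
run;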

Information Criteria and Vuong Test

When it comes to model selection between two non-nested models, an information criterion, e.g. AIC or BIC, is often used, and the model with the lower value is preferred.

However, even with AIC or BIC, we are still unable to answer the question of whether model A is significantly better than model B in a probabilistic sense. Proposed by Quang Vuong (1989), the Vuong test declares a model better when its individual log likelihoods are significantly higher than those of its rival. A demonstration of the Vuong test is given below.

First of all, two models for proportional outcomes, namely TOBIT regression and NLS (Nonlinear Least Squares) regression, are estimated below with information criteria, e.g. AIC and BIC, calculated and the likelihood of each individual record computed. As shown in the following output, NLS regression has a lower BIC and therefore might be considered a “better” model.

Next, with the likelihood of each individual record from both models, the Vuong statistic is calculated with the formulation given below

Vuong statistic = [LR(model1, model2) - C] / sqrt(N * V) ~ N(0, 1)

LR(…) is the sum of the individual log likelihood ratios between the two models. “C” is a correction term for the difference in DF (Degrees of Freedom) between the two models, e.g. the BIC-type correction (DF1 - DF2) / 2 * LOG(N). “N” is the number of records. “V” is the variance of the individual log likelihood ratios between the two models. Vuong demonstrated that this statistic is asymptotically distributed as a standard normal N(0, 1). As a result, model 1 is better when the Vuong statistic > 1.96, and model 2 is better when the Vuong statistic < -1.96.
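
Below is a minimal sketch of this calculation in SAS, assuming a table _lik that contains the individual log likelihoods ll1 and ll2 from the two models; the table, variable, and macro variable names are illustrative and do not come from the models above.

%let df1 = 5;   * NUMBER OF ESTIMATED PARAMETERS IN MODEL 1 (PLACEHOLDER) *;
%let df2 = 5;   * NUMBER OF ESTIMATED PARAMETERS IN MODEL 2 (PLACEHOLDER) *;

data _null_;
  set _lik end = eof;
  retain lr 0 lr2 0;
  z = ll1 - ll2;                                * INDIVIDUAL LOG LIKELIHOOD RATIO *;
  lr + z;
  lr2 + z * z;
  if eof then do;
    n = _n_;
    c = (&df1 - &df2) / 2 * log(n);             * BIC-TYPE CORRECTION FOR THE DF DIFFERENCE *;
    v = lr2 / n - (lr / n) ** 2;                * VARIANCE OF INDIVIDUAL LOG LIKELIHOOD RATIOS *;
    vuong = (lr - c) / sqrt(n * v);
    put "VUONG STATISTIC = " vuong 10.4;
  end;
run;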

As shown in the output, although model 2, i.e. NLS regression, is preferred by a lower BIC, the Vuong statistic doesn’t show evidence that NLS regression is significantly better than TOBIT regression but indicates instead that both models are equally close to the true model.

Decision Stump with the Implementation in SAS

A decision stump is a naively simple but effective rule-based supervised learning algorithm similar to CART (Classification & Regression Tree). In essence, the stump is a one-level decision tree consisting of two terminal nodes.

Albeit simple, the decision stump has proven useful in many settings. For instance, as a weak classifier, it has been shown to be an excellent base learner in ensemble learning algorithms such as bagging and boosting. Moreover, a single decision stump can also be employed for feature screening in predictive modeling and for cut-point searching on continuous features. The following is an example showing the SAS implementation as well as the predictive power of a decision stump.

First of all, a testing data set is simulated with 1 binary response variable Y and 3 continuous features X1 - X3, in which X1 is the feature most related to Y with a single cut-point at 5, X2 is also related to Y but with 2 different cut-points at 1.5 and 7.5, and X3 is pure noise.
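
The simulation code is not reproduced here; a minimal sketch under the assumptions just described (the data set name, sample size, and weight variable are my own) could look as follows.

data one;
  do i = 1 to 5000;
    x1 = ranuni(1) * 10;
    x2 = ranuni(2) * 10;
    x3 = ranuni(3) * 10;                                  * PURE NOISE *;
    xb = 2 * (x1 > 5) + 1 * (x2 > 1.5 and x2 < 7.5) - 1.5; * CUT-POINTS AT 5, 1.5, AND 7.5 *;
    y  = (ranuni(4) < 1 / (1 + exp(-xb)));                * BINARY RESPONSE *;
    w  = 1;                                               * EQUAL WEIGHT FOR EACH RECORD *;
    output;
  end;
  drop i xb;
run;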

The SAS macro below shows how to program a single decision stump. This macro is then used to search for the simulated cut-point in each continuous feature.

%macro stump(data = , w = , y = , xlist = );

%local i x;

%let i = 1;

proc sql;
create table _out
  (
  variable   char(32),
  gt_value   num,
  gini       num
  );
quit;

%do %while (%scan(&xlist, &i) ne %str());  
  %let x = %scan(&xlist, &i);
  
  data _tmp1(keep = &w &y &x);
    set &data;
    where &y in (0, 1);
  run;

  proc sql;
    create table
      _tmp2 as
    select
      b.&x                                                          as gt_value,
      sum(case when a.&x <= b.&x then &w * &y else 0 end) / 
      sum(case when a.&x <= b.&x then &w else 0 end)                as p1_1,
      sum(case when a.&x >  b.&x then &w * &y else 0 end) / 
      sum(case when a.&x >  b.&x then &w else 0 end)                as p1_2,
      sum(case when a.&x <= b.&x then 1 else 0 end) / count(*)      as ppn1,
      sum(case when a.&x >  b.&x then 1 else 0 end) / count(*)      as ppn2,
      2 * calculated p1_1 * (1 - calculated p1_1) * calculated ppn1 + 
      2 * calculated p1_2 * (1 - calculated p1_2) * calculated ppn2 as gini
    from
      _tmp1 as a,
      (select distinct &x from _tmp1) as b
    group by
      b.&x;

    insert into _out
    select
      "&x",
      gt_value,
      gini
    from
      _tmp2
    having
      gini = min(gini);

    drop table _tmp1;
  quit;

  %let i = %eval(&i + 1); 
%end;

proc sort data = _out;
  by gini;
run;

proc report data = _out box spacing = 1 split = "*" nowd;
  column("DECISION STUMP SUMMARY"
         variable gt_value gini);
  define variable / "VARIABLE"                     width = 30 center;
  define gt_value / "CUTOFF VALUE*(GREATER THAN)"  width = 15 center;
  define gini     / "GINI"                         width = 10 center format = 9.4;
run;

%mend stump;
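
With the simulated data sketched earlier, the macro might be invoked as follows; the data set and variable names are the ones assumed in that sketch.

%stump(data = one, w = w, y = y, xlist = x1 x2 x3);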

As shown in the table below, the decision stump did a fairly good job both in identifying predictive features and in locating cut-points. Both related features, X1 and X2, have been identified and correctly ranked in terms of their association with Y. For X1, the cut-point located is 4.97, extremely close to 5. For X2, the cut-point located is 7.46, close enough to one of the two simulated cut-points.

Modeling Rates and Proportions in SAS – 8

7. FRACTIONAL LOGIT MODEL

Different from all models introduced previously, which assume specific distributional families for the proportional outcomes of interest, the fractional logit model proposed by Papke and Wooldridge (1996) is a quasi-likelihood method that does not specify the full distribution but only requires the conditional mean to be correctly specified for consistent parameter estimates. Under the assumption E(Y|X) = G(X`B) = 1 / (1 + EXP(-X`B)), the fractional logit has a likelihood function identical to the one for a Bernoulli distribution such that

F(Y) = (G(X`B) ** Y) * ((1 - G(X`B)) ** (1 - Y)) with 0 <= Y <= 1

Based upon the above formulation, parameter estimates are calculated in the same manner as in the binary logistic regression by maximizing the log likelihood.

In SAS, the most convenient way to implement the fractional logit model is with the GLIMMIX procedure. In addition, we can also use the NLMIXED procedure by explicitly specifying the likelihood function shown above.
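
For instance, a minimal NLMIXED sketch with the Bernoulli-type quasi log likelihood specified explicitly might look like the following, assuming a data set ds with the proportional outcome y and two covariates x1 and x2 (these names are mine, not from the original study).

proc nlmixed data = ds;
  parms b0 = 0 b1 = 0 b2 = 0;
  xb = b0 + b1 * x1 + b2 * x2;
  mu = 1 / (1 + exp(-xb));                      * CONDITIONAL MEAN G(X`B) *;
  ll = y * log(mu) + (1 - y) * log(1 - mu);     * BERNOULLI-TYPE QUASI LOG LIKELIHOOD *;
  model y ~ general(ll);
run;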

Modeling Rates and Proportions in SAS – 7

6. SIMPLEX MODEL

Dispersion models proposed by Jorgensen (1997) can be considered a more general class than the Generalized Linear Models of McCullagh and Nelder (1989) and include a dispersion parameter describing the distributional shape. The simplex model developed by Barndorff-Nielsen and Jorgensen (1991) is a special dispersion model and is useful for modeling proportional outcomes. A simplex model has the density function given by
    F(Y) = (2 * pi * sigma ^ 2 * (Y * (1 - Y)) ^ 3) ^ (-0.5) * EXP((-1 / (2 * sigma ^ 2)) * d(Y; Mu))
where d(Y; Mu) = (Y - Mu) ^ 2 / (Y * (1 - Y) * Mu ^ 2 * (1 - Mu) ^ 2) is the unit deviance function.

Similar to the Beta model, a simplex model also consists of 2 components. The first is a sub-model describing the expected mean Mu. Since 0 < Mu < 1, the logit link function can be used to specify the relationship between the expected mean and covariates X such that LOG(Mu / (1 - Mu)) = X`B. The second is a sub-model describing the pattern of the dispersion parameter sigma ^ 2, also by a set of covariates Z, such that LOG(sigma ^ 2) = Z`G. Due to the similar parameterization of the Beta model and the Simplex model, the performance of these 2 models is often compared. However, it is still an open question which model outperforms its competitor.

Similar to the case of the Beta model, there is no out-of-the-box procedure in SAS to estimate the simplex model. However, following its density function, we are able to estimate the simplex model with the NLMIXED procedure, as sketched below.
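
The original code is not reproduced here; a minimal sketch, assuming a data set ds with the outcome y strictly inside (0, 1), a single covariate x1 in the mean sub-model, and an intercept-only dispersion sub-model, could look as follows.

proc nlmixed data = ds tech = trureg;
  parms b0 = 0 b1 = 0 g0 = 0;
  xb = b0 + b1 * x1;
  mu = 1 / (1 + exp(-xb));                      * LOGIT LINK FOR THE EXPECTED MEAN *;
  s2 = exp(g0);                                 * LOG LINK FOR THE DISPERSION PARAMETER *;
  d  = (y - mu) ** 2 / (y * (1 - y) * mu ** 2 * (1 - mu) ** 2);
  ll = -0.5 * log(2 * constant('pi') * s2 * (y * (1 - y)) ** 3) - d / (2 * s2);
  model y ~ general(ll);
run;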

Modeling Rates and Proportions in SAS – 6

5. BETA REGRESSION

Beta regression is a flexible modeling technique based upon the 2-parameter beta distribution and can be employed to model any dependent variable that is continuous and bounded by 2 known endpoints, e.g. 0 and 1 in our context. Assuming that Y follows a standard beta distribution defined in the interval (0, 1) with 2 shape parameters W and T, the density function can be specified as
F(Y) = Gamma(W + T) / (Gamma(W) * Gamma(T)) * Y ^ (W - 1) * (1 - Y) ^ (T - 1)
In the above function, while T is pulling the density toward 0, W is pulling the density toward 1. Without loss of generality, W and T can be re-parameterized and translated into 2 other parameters, namely the location parameter Mu and the dispersion parameter Phi, such that W = Mu * Phi and T = Phi * (1 - Mu), where Mu is the expected mean and Phi is another parameter governing the variance such that sigma ^ 2 = Mu * (1 - Mu) / (1 + Phi).

Within the framework of generalized linear models (GLM), Mu and Phi can be modeled separately with 2 potentially overlapping or identical sets of covariates X and Z: a location sub-model for Mu and a dispersion sub-model for Phi. Since the expected mean Mu is bounded by 0 and 1, a natural choice of the link function for the location sub-model is the logit such that LOG(Mu / (1 - Mu)) = X`B. Given the strictly positive nature of Phi, a log link seems appropriate such that LOG(Phi) = -Z`G, in which the negative sign is only for ease of interpretation, so that a positive G represents a positive impact on the variance.

SAS does not provide an out-of-the-box procedure to estimate Beta regression. While the GLIMMIX procedure is claimed to accommodate Beta modeling, it can only estimate a simple form of Beta regression without the dispersion sub-model. However, with the density function of the Beta distribution, it is extremely easy to fit Beta regression with the NLMIXED procedure by specifying the log likelihood function. In addition, for data of a relatively small size, Beta regression estimated with the NLMIXED procedure converges very well when initial values of parameter estimates are set equal to the parameters from the TOBIT model in the previous section.
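
A minimal NLMIXED sketch under the above parameterization, assuming a data set ds with the outcome y strictly inside (0, 1), a single covariate x1 in the location sub-model, and an intercept-only dispersion sub-model, is given below; the names are illustrative, not the original specification.

proc nlmixed data = ds tech = trureg;
  parms b0 = 0 b1 = 0 g0 = 0;
  xb  = b0 + b1 * x1;
  mu  = 1 / (1 + exp(-xb));                     * LOCATION SUB-MODEL: LOGIT LINK *;
  phi = exp(-g0);                               * DISPERSION SUB-MODEL: LOG(PHI) = -G0 *;
  w   = mu * phi;
  t   = (1 - mu) * phi;
  ll  = lgamma(w + t) - lgamma(w) - lgamma(t) + (w - 1) * log(y) + (t - 1) * log(1 - y);
  model y ~ general(ll);
run;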

Modeling Rates and Proportions in SAS – 5

4. TOBIT MODEL

The Tobit model can be considered a special case of the naïve OLS regression with the dependent variable censored and therefore observable only in a certain interval, which is (0, 1) for rates and proportions in my study. Specifically, this class of models assumes a latent variable Y* such that

Y* = X`B + e, where e ~ N(0, sigma ^ 2) and Y = Min(1, Max(0, Y*))

As a result, the representation of rates and proportions, Y, bounded by (0, 1) can be considered the observable part of a normally distributed variable Y* ~ N(X`B, sigma ^ 2) on the real line. However, a fundamental argument against this censoring assumption is that the reason that values outside the boundary of [0, 1] are not observable is not because they are censored but because they are not defined. Hence, the censored normal distribution might not be an appropriate assumption for percentages and proportions.

In SAS, the most convenient way to estimate a Tobit model is the QLIM procedure in the SAS/ETS module. However, in order to clearly illustrate the log likelihood function of the Tobit model, we’d like to stick to the NLMIXED procedure and estimate the Tobit model with maximum likelihood estimation. The maximum likelihood estimator for a Tobit model assumes that the errors are normal and homoscedastic and would otherwise be inconsistent. In the previous section, it is shown that heteroscedasticity is present due to the nature of rate and proportion outcomes. As a result, the simultaneous estimation of a variance model is also necessary to account for heteroscedasticity through the function VAR(e) = sigma ^ 2 * (1 + EXP(Z`G)). Therefore, there are 2 components in our Tobit model specification, a mean sub-model and a variance sub-model. As illustrated in the output below, a couple of independent variables, e.g. X1 and X2, are significant in both sub-models, showing that the conditional variance is not independent of the mean in proportion outcomes.
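
A minimal NLMIXED sketch of this specification, assuming a data set ds with the outcome y in [0, 1] and two covariates x1 and x2 entering both the mean and the variance sub-models (these names are my own assumptions, not the original setup), could look as follows.

proc nlmixed data = ds tech = trureg;
  parms b0 = 0 b1 = 0 b2 = 0 g1 = 0 g2 = 0 s = 1;
  xb = b0 + b1 * x1 + b2 * x2;                  * MEAN SUB-MODEL *;
  s2 = s ** 2 * (1 + exp(g1 * x1 + g2 * x2));   * VARIANCE SUB-MODEL *;
  sd = sqrt(s2);
  if y <= 0 then ll = log(probnorm((0 - xb) / sd));            * CENSORED AT 0 *;
  else if y >= 1 then ll = log(1 - probnorm((1 - xb) / sd));   * CENSORED AT 1 *;
  else ll = log(pdf('normal', y, xb, sd));                     * UNCENSORED IN (0, 1) *;
  model y ~ general(ll);
run;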

Modeling Rates and Proportions in SAS – 4

3. NLS REGRESSION

NLS (Nonlinear Least Squares) regression is a modeling technique similar to the OLS regression described in the previous section, aiming to model rates and proportions in the (0, 1) interval. With the NLS specification, it is assumed that

Y = 1 / (1 + EXP(-X`B)) + e, where e ~ N(0, sigma ^ 2)

Therefore, Y is assumed to be normally distributed as N(1 / (1 + EXP(-X`B)), sigma ^ 2). Because NLS regression directly models the conditional mean of Y instead of the conditional mean of LN(Y / (1 - Y)), an obvious advantage is that it doesn’t impose restrictive distributional assumptions, e.g. Y following an additive logistic normal distribution, that are vital to recover E(Y | X). However, also due to the assumption Y ~ N(1 / (1 + EXP(-X`B)), sigma ^ 2), NLS regression is inevitably subject to the criticism of failing to address unequal variance, e.g. heteroscedasticity, often present in rate and proportion outcomes.

In SAS, a generic implementation of NLS regression can be done with the NLIN procedure. However, due to its nonlinear nature, the estimation of NLS regression might suffer from convergence problems, e.g. long computing time or failure to converge. In this case, it is a good strategy to use the estimated coefficients from the OLS regression described in the prior section as starting points for the coefficients to be estimated in NLS.
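
A minimal NLIN sketch, assuming a data set ds with the outcome y and two covariates x1 and x2 (names are illustrative), is given below; in practice the PARMS values could be replaced with the OLS estimates as starting points.

proc nlin data = ds;
  parms b0 = 0 b1 = 0 b2 = 0;                   * STARTING VALUES, E.G. OLS ESTIMATES *;
  model y = 1 / (1 + exp(-(b0 + b1 * x1 + b2 * x2)));
run;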

Again, for demonstration purposes, both the NLIN and NLMIXED procedures in SAS are used to estimate NLS in our study. It is found that the convergence of NLS estimation in PROC NLIN is fast and reliable. An interesting observation after comparing OLS and NLS estimates is that the estimated coefficients and t-statistics for significant variables from both models are close to each other in terms of direction and magnitude, which might be viewed as heuristic evidence justifying our modeling strategy of using OLS coefficients as starting points for the coefficients to be estimated in NLS.