Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for September 2012

Marginal Effects (on Binary Outcome)

In regression models, the marginal effect of an explanatory variable X is the partial derivative of the prediction with respect to X and measures the expected change in the response variable per unit change in X with the other explanatory variables held constant. In the interpretation of a regression model, presenting marginal effects often conveys more information than looking at coefficients alone. Below, I will use 2 types of regression models for binary outcomes, i.e. logit and probit, to show that although coefficients estimated from the same set of Xs might differ substantially between the 2 models, the marginal effects of each X in both models actually look very similar.

As shown below, parameter estimates from logit and probit look very different due to the different model specifications and distributional assumptions. As a result, it is not meaningful to compare the effect and sensitivity of each predictor across the 2 models directly from their coefficients.
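
As a side note (a standard result, not something shown in the original post), the marginal effect in both models is simply the coefficient scaled by the density of the link distribution evaluated at the linear predictor XB:

logit : d P(Y = 1) / d x_i = EXP(XB) / ((1 + EXP(XB)) ^ 2) * beta_i
probit: d P(Y = 1) / d x_i = phi(XB) * beta_i

where phi(.) denotes the standard normal density. Logit coefficients are typically about 1.6 to 1.8 times larger than the corresponding probit coefficients, while the logistic density is smaller than the normal density by roughly the same factor, which is why the averaged marginal effects reported below end up so close.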

proc qlim data = one;
  model bad = bureau_score ltv / discrete(d = logit);
  output out = out1 marginal;
run;
/* logit estimates
                                    Standard                 Approx
Parameter           Estimate           Error    t Value    Pr > |t|
Intercept           7.080229        0.506910      13.97      <.0001
bureau_score       -0.016705        0.000735     -22.74      <.0001
ltv                 0.028055        0.002361      11.88      <.0001
*/

proc qlim data = one;
  model bad = bureau_score ltv / discrete(d = probit);
  output out = out2 marginal;
run;
/* probit estimates
                                    Standard                 Approx
Parameter           Estimate           Error    t Value    Pr > |t|
Intercept           4.023515        0.285587      14.09      <.0001
bureau_score       -0.009500        0.000403     -23.56      <.0001
ltv                 0.015690        0.001316      11.93      <.0001
*/

However, are these 2 models really so different from each other? Comparing marginal effects instead of parameter estimates might be able to give us more useful information.

proc means data = out1 mean;
  var meff_p2_:;
run;
/* marginal effects from logit
Variable                            Mean
----------------------------------------
Meff_P2_bureau_score          -0.0022705
Meff_P2_ltv                    0.0038132
----------------------------------------
*/

proc means data = out2 mean;
  var meff_p2_:;
run;
/* marginal effects from probit
Variable                            Mean
----------------------------------------
Meff_P2_bureau_score          -0.0022553
Meff_P2_ltv                    0.0037249
----------------------------------------
*/

It turns out that the marginal effects of each predictor from the two models are reasonably close.

Although it is easy to calculate marginal effects with the SAS QLIM procedure, it might still be better to understand the underlying math and then compute them yourself with SAS data steps. Below is a demo on how to manually calculate marginal effects of a logit model following the formulation:
MF_x_i = EXP(XB) / ((1 + EXP(XB)) ^ 2) * beta_i for the ith predictor,
where XB is the linear predictor and beta_i is the estimated coefficient of the ith predictor.

proc logistic data = one desc;
  model bad = bureau_score ltv;
  output out = out3 xbeta = xb;
run;
/* model estimates:
                                  Standard          Wald
Parameter       DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept        1      7.0802      0.5069      195.0857        <.0001
bureau_score     1     -0.0167    0.000735      516.8737        <.0001
ltv              1      0.0281     0.00236      141.1962        <.0001
*/

data out3;
  set out3;
  margin_bureau_score = exp(xb) / ((1 + exp(xb)) ** 2) * (-0.0167);
  margin_ltv = exp(xb) / ((1 + exp(xb)) ** 2) * (0.0281);
run;

proc means data = out3 mean;
  var margin_bureau_score margin_ltv;
run;
/* manual calculated marginal effects:
Variable                       Mean
-----------------------------------
margin_bureau_score      -0.0022698
margin_ltv                0.0038193
-----------------------------------
*/
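
The same manual exercise can be repeated for the probit model by replacing the logistic density with the standard normal density. The sketch below is my own illustration rather than code from the original post (the data set name out4 is a placeholder); it plugs in the probit estimates reported earlier, and its means should land close to the Meff_P2_ values produced by PROC QLIM.

data out4;
  set one;
  /* probit linear predictor from the estimates above */
  xb = 4.023515 - 0.009500 * bureau_score + 0.015690 * ltv;
  /* marginal effect = standard normal density at xb times beta_i */
  margin_bureau_score = pdf('normal', xb) * (-0.009500);
  margin_ltv = pdf('normal', xb) * (0.015690);
run;

proc means data = out4 mean;
  var margin_bureau_score margin_ltv;
run;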

Written by statcompute

September 30, 2012 at 4:51 pm

A SAS Macro for Scorecard Performance Evaluation

%macro separation(data = , score = , y = );
***********************************************************;
* THE MACRO IS TO EVALUATE THE SEPARATION POWER OF A      *;
* SCORECARD                                               *; 
* ------------------------------------------------------- *;
* PARAMETERS:                                             *;
*  DATA : INPUT DATASET                                   *;
*  SCORE: SCORECARD VARIABLE                              *;
*  Y    : RESPONSE VARIABLE IN (0, 1)                     *;
* ------------------------------------------------------- *;
* OUTPUTS:                                                *;
*  SEPARATION_REPORT.TXT                                  *;
*  A SEPARATION SUMMARY REPORT IN TXT FORMAT              *;
*  NAMED AS THE ABOVE WITH PREDICTIVE MEASURES INCLUDING  *;
*  KS, AUC, GINI, AND DIVERGENCE                          *;
* ------------------------------------------------------- *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nonumber nodate orientation = landscape linesize = 160 nocenter 
        formchar = "|----|+|---+=|-/\<>*" formdlim=' ' ls = 150; 

*** DEFAULT GROUP NUMBER FOR REPORT ***;
%let grp = 10;

data _tmp1 (keep = &y &score);
  set &data;
  where &y in (1, 0)  and &score ~= .;
run;

filename lst_out temp;

proc printto new print = lst_out;
run;

*** CONDUCT NON-PARAMETRIC TESTS ***; 
ods output wilcoxonscores = _wx;
ods output kolsmir2stats = _ks;
proc npar1way wilcoxon edf data = _tmp1;
  class &y;
  var &score;
run;

proc printto;
run;

proc sort data = _wx;
  by class;
run;

*** CALCULATE ROC AND GINI ***;
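*** NOTE: AUC IS DERIVED FROM THE WILCOXON RANK-SUM OUTPUT AS ***;
*** AUC = (SUM OF SCORES - EXPECTED SUM) / (N1 * N0) + 0.5    ***;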
data _null_;
  set _wx end = eof;
  by class;

  array a{2, 3} _temporary_;
  if _n_ = 1 then do;
    a[1, 1] = n;
    a[1, 2] = sumofscores;
    a[1, 3] = expectedsum;
  end;
  else do;
    a[2, 1] = n;
  end;
  if eof then do;
    auc  = (a[1, 2] - a[1, 3]) / (a[1, 1] * a[2, 1])  + 0.5;
    if auc <= 0.5 then auc = 1 - auc;
    gini = 2 * (auc - 0.5);  
    call execute('%let auc = '||put(auc, 10.4)||';');
    call execute('%let gini = '||put(gini, 10.4)||';');
  end;
run;

*** CALCULATE KS ***;
data _null_;
  set _ks;

  if _n_ = 1 then do;
    ks = nvalue2 * 100;
    call execute('%let ks = '||put(ks, 10.4)||';');
  end;
run;

*** CAPTURE SCORE POINT FOR MAX KS ***;
data _null_;
  infile lst_out;
  input @" at Maximum = " ks_score;
  output;
  call execute('%let ks_score = '||put(ks_score, 10.4)||';');
  stop;
run;

proc summary data = _tmp1 nway;
  class &y;
  output out = _data_ (drop = _type_ _freq_)
  mean(&score.) = mean var(&score.) = variance;
run;

*** CALCULATE DIVERGENCE ***;
data _null_;
  set _last_ end = eof;
  array a{2, 2} _temporary_;
  if _n_ = 1 then do;
    a[1, 1] = mean;
    a[1, 2] = variance;
  end;
  else do;
    a[2, 1] = mean;
    a[2, 2] = variance;
  end;
  if eof then do;
    divergence = (a[1, 1] - a[2, 1]) ** 2 / ((a[1, 2] + a[2, 2]) / 2);
    call execute('%let divergence = '||put(divergence, 10.4)||';'); 
  end; 
run;

*** CAPTURE THE DIRECTION OF SCORE ***;
ods listing close;
ods output spearmancorr = _cor;
proc corr data = _tmp1 spearman;
  var &y;
  with &score;
run;
ods listing;
 
data _null_;
  set _cor;
  if &y >= 0 then do;
    call symput('desc', 'descending');
  end;
  else do;
    call symput('desc', ' ');
  end;
run;
%put &desc;

proc rank data = _tmp1 out = _tmp2 groups = &grp ties = low;
  var &score;
  ranks rank;
run;

proc summary data = _last_ nway;
  class rank;
  output out = _data_ (drop = _type_ rename = (_freq_ = freq))
  min(&score) = min_score max(&score) = max_score
  sum(&y) = bads;
run;

proc sql noprint;
  select sum(bads) into :bcnt from _last_;
  select sum(freq) - sum(bads) into :gcnt from _last_;
quit;

proc sort data = _last_ (drop = rank);
  by &desc min_score;
run;

data _data_;
  set _last_;
  by &desc min_score;

  i + 1; 
  percent = i / 100; 
  good  = freq - bads;
  odds  = good / bads;

  hit_rate = bads / freq;
  retain cum_bads cum_freq;
  cum_bads + bads;
  cum_freq + freq;
  cum_hit_rate = cum_bads / cum_freq;

  cat_rate = bads / &bcnt;
  retain cum_cat_rate;
  cum_cat_rate + cat_rate; 

  format symbol $4.;
  if i = 1 then symbol = 'BAD';
  else if i = &grp - 1 then symbol = 'V';
  else if i = &grp then symbol = 'GOOD';
  else symbol = '|';
run;

proc printto print = "%upcase(%trim(&score))_SEPARATION_REPORT.TXT" new;
run;

proc report data = _last_ spacing = 1 split = "/" headline nowd;
  column("GOOD BAD SEPARATION REPORT FOR %upcase(%trim(&score)) IN DATA %upcase(%trim(&data))/
          MAXIMUM KS = %trim(&ks) AT SCORE POINT %trim(&ks_score)/   
          ( AUC STATISTICS = %trim(&auc), GINI COEFFICIENT = %trim(&gini), DIVERGENCE = %trim(&divergence) )/ /"
         percent symbol min_score max_score good bads freq odds hit_rate cum_hit_rate cat_rate cum_cat_rate);

  define percent      / noprint order order = data;
  define symbol       / "" center               width = 5 center;
  define min_score    / "MIN/SCORE"             width = 10 format = 9.4        analysis min center;
  define max_score    / "MAX/SCORE"             width = 10 format = 9.4        analysis max center;
  define good         / "GOOD/#"                width = 10 format = comma9.    analysis sum;
  define bads         / "BAD/#"                 width = 10 format = comma9.    analysis sum;
  define freq         / "TOTAL/#"               width = 10 format = comma9.    analysis sum;
  define odds         / "ODDS"                  width = 10 format = 8.2        order;
  define hit_rate     / "BAD/RATE"              width = 10 format = percent9.2 order center;
  define cum_hit_rate / "CUMULATIVE/BAD RATE"   width = 10 format = percent9.2 order;
  define cat_rate     / "BAD/PERCENT"           width = 10 format = percent9.2 order center;
  define cum_cat_rate / "CUMU. BAD/PERCENT"     width = 10 format = percent9.2 order; 

  rbreak after / summarize dol skip;
run; 

proc printto;
run;

***********************************************************;
*                     END OF THE MACRO                    *;
***********************************************************; 
%mend separation;

libname data 'C:\Documents and Settings\liuwensui\Desktop\fraction_models\test';

%separation(data = data.accepts, score = bureau_score, y = bad);

Sample Output:

                          GOOD BAD SEPARATION REPORT FOR BUREAU_SCORE IN DATA DATA.ACCEPTS                          
                                    MAXIMUM KS = 35.5477 AT SCORE POINT 677.0000                                    
                     ( AUC STATISTICS = 0.7389, GINI COEFFICIENT = 0.4778, DIVERGENCE = 0.8027 )                    
                                                                                                                    
          MIN        MAX           GOOD        BAD      TOTAL               BAD     CUMULATIVE    BAD      CUMU. BAD
         SCORE      SCORE             #          #          #       ODDS    RATE      BAD RATE  PERCENT      PERCENT
 -------------------------------------------------------------------------------------------------------------------
  BAD   443.0000   620.0000         310        252        562       1.23   44.84%      44.84%    23.10%      23.10% 
   |    621.0000   645.0000         365        201        566       1.82   35.51%      40.16%    18.42%      41.52% 
   |    646.0000   661.0000         359        173        532       2.08   32.52%      37.71%    15.86%      57.38% 
   |    662.0000   677.0000         441        125        566       3.53   22.08%      33.74%    11.46%      68.84% 
   |    678.0000   692.0000         436         99        535       4.40   18.50%      30.79%     9.07%      77.91% 
   |    693.0000   708.0000         469         89        558       5.27   15.95%      28.29%     8.16%      86.07% 
   |    709.0000   725.0000         492         66        558       7.45   11.83%      25.92%     6.05%      92.12% 
   |    726.0000   747.0000         520         42        562      12.38    7.47%      23.59%     3.85%      95.97% 
   V    748.0000   772.0000         507         30        537      16.90    5.59%      21.64%     2.75%      98.72% 
 GOOD   773.0000   848.0000         532         14        546      38.00    2.56%      19.76%     1.28%     100.00% 
       ========== ========== ========== ========== ==========                                                       
        443.0000   848.0000       4,431      1,091      5,522 

Written by statcompute

September 28, 2012 at 11:42 pm

Two-Part Fractional Model

1. Two-step estimation method:
– SAS code

data one;
  set data.sme;
  _id_ + 1;
  if y = 0 then y2 = 0;
  else y2 = 1;
run;

proc logistic data = one desc;
  model y2 = x1 - x7;
run;

proc nlmixed data = one tech = congra;
  where y2 = 1;
  parms b0 =  2.14 b1 = -0.01 b2 = -0.55 b3 = -0.14 b4 = -3.47
        b5 =  0.01 b6 = -0.01 b7 = -0.19 s = 0.1;
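  /* the positive responses are modeled with a logistic mean and a gaussian
     error term, i.e. nonlinear least squares on the fractional outcome */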
  _xb_ = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 +
         b5 * x5 + b6 * x6 + b7 * x7;
  _mu_ = 1 / (1 + exp(-_xb_));
  lh = pdf('normal', y, _mu_, s);
  ll = log(lh);
  model y ~ general(ll);
  predict ll out = out1(keep = _id_ pred rename = (pred = ln_like1));
run;

– Outputs

*** output for logit component ***
         Model Fit Statistics
                             Intercept
              Intercept            and
Criterion          Only     Covariates
AIC            4997.648       4066.398
SC             5004.043       4117.551
-2 Log L       4995.648       4050.398

             Analysis of Maximum Likelihood Estimates
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1     -9.3586      0.4341      464.8713        <.0001
x1            1     -0.0595      0.0341        3.0445        0.0810
x2            1      1.7644      0.1836       92.3757        <.0001
x3            1      0.5994      0.0302      392.7404        <.0001
x4            1     -2.5496      0.5084       25.1488        <.0001
x5            1    -0.00071     0.00128        0.3094        0.5780
x6            1    -0.00109     0.00267        0.1659        0.6838
x7            1     -1.6359      0.2443       44.8427        <.0001

*** output for the positive component ***
             Fit Statistics
-2 Log Likelihood                 -264.5
AIC (smaller is better)           -246.5
AICC (smaller is better)          -246.3
BIC (smaller is better)           -201.4

                 Parameter Estimates
                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|
b0            1.7947     0.3401   1116      5.28     <.0001
b1          -0.01216    0.02737   1116     -0.44     0.6568
b2           -0.4306     0.1437   1116     -3.00     0.0028
b3           -0.1157    0.02287   1116     -5.06     <.0001
b4           -2.9267     0.4656   1116     -6.29     <.0001
b5          0.004092   0.001074   1116      3.81     0.0001
b6          -0.00838   0.002110   1116     -3.97     <.0001
b7           -0.1697     0.1977   1116     -0.86     0.3909
s             0.2149   0.004549   1116     47.24     <.0001

2. Joint estimation method:
– SAS code

proc nlmixed data = one tech = congra maxiter = 1000;
  parms b10 = -9.3586 b11 = -0.0595 b12 =  1.7644 b13 =  0.5994 b14 = -2.5496
        b15 = -0.0007 b16 = -0.0011 b17 = -1.6359
        b20 =  0.3401 b21 =  0.0274 b22 =  0.1437 b23 =  0.0229 b24 =  0.4656
        b25 =  0.0011 b26 =  0.0021 b27 =  0.1977  s  =  0.2149;
  logit_xb = b10 + b11 * x1 + b12 * x2 + b13 * x3 + b14 * x4 +
             b15 * x5 + b16 * x6 + b17 * x7;
  nls_xb = b20 + b21 * x1 + b22 * x2 + b23 * x3 + b24 * x4 +
           b25 * x5 + b26 * x6 + b27 * x7;
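  /* two-part likelihood: the logit piece separates y = 0 from y > 0, and the
     positive part is a gaussian density around a logistic mean */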
  p1 = 1 / (1 + exp(-logit_xb));
  if y = 0 then ll = log(1 - p1);
  else ll = log(p1) + log(pdf('normal', y, 1 / (1 + exp(-nls_xb)), s));
  model y ~ general(ll);
run;

– Outputs

             Fit Statistics
-2 Log Likelihood                 3785.9
AIC (smaller is better)           3819.9
AICC (smaller is better)          3820.0
BIC (smaller is better)           3928.6

                 Parameter Estimates
                       Standard
Parameter   Estimate      Error     DF   t Value   Pr > |t|
b10          -9.3586     0.4341   4421    -21.56     <.0001
b11         -0.05948    0.03408   4421     -1.75     0.0810
b12           1.7645     0.1836   4421      9.61     <.0001
b13           0.5994    0.03024   4421     19.82     <.0001
b14          -2.5496     0.5084   4421     -5.01     <.0001
b15         -0.00071   0.001276   4421     -0.56     0.5784
b16         -0.00109   0.002673   4421     -0.41     0.6836
b17          -1.6359     0.2443   4421     -6.70     <.0001
b20           1.7633     0.3398   4421      5.19     <.0001
b21         -0.01115    0.02730   4421     -0.41     0.6830
b22          -0.4363     0.1437   4421     -3.04     0.0024
b23          -0.1139    0.02285   4421     -4.98     <.0001
b24          -2.8755     0.4643   4421     -6.19     <.0001
b25         0.004091   0.001074   4421      3.81     0.0001
b26         -0.00839   0.002109   4421     -3.98     <.0001
b27          -0.1666     0.1975   4421     -0.84     0.3991
s             0.2149   0.004550   4421     47.24     <.0001

As shown above, both estimation methods give similar parameter estimates. In addition, the sum of the log likelihoods from the two models in the two-step estimation is exactly equal to the log likelihood from the joint estimation: in terms of -2 Log Likelihood, 4050.4 from the logit component plus -264.5 from the positive component gives 3785.9, matching the joint model.

Written by statcompute

September 25, 2012 at 11:08 pm

Posted in Econometrics, SAS, Statistical Models

A Distribution-Free Alternative to Vuong Test

The Vuong test has been widely used in nonnested model selection under the normality assumption. While the Vuong test determines whether the average log-likelihood ratio is statistically different from zero, the distribution-free test proposed by Clarke determines whether the median log-likelihood ratio is statistically different from zero. In practice, the Clarke test is a paired sign test: after applying a Schwarz correction for the difference in degrees of freedom, it counts how many observation-level log-likelihood differences are positive and compares that count against a Binomial(n, 0.5) distribution.

Below is a SAS macro to implement the Clarke test.

%macro clarke(data = , ll1 = , q1 = , ll2 = , q2 = );
***********************************************************;
* THE SAS MACRO IS TO PERFORM AN ALTERNATIVE TO VUONG     *;
* TEST, DISTRIBUTION-FREE CLARKE TEST, FOR THE MODEL      *;
* COMPARISON.                                             *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  LL1 : LOG LIKELIHOOD FUNCTION OF THE MODEL 1           *;
*  Q1  : DEGREE OF FREEDOM OF THE MODEL 1                 *;
*  LL2 : LOG LIKELIHOOD FUNCTION OF THE MODEL 2           *;
*  Q2  : DEGREE OF FREEDOM OF THE MODEL 2                 *;
* ======================================================= *;
* REFERENCE:                                              *;
*  A SIMPLE DISTRIBUTION-FREE TEST FOR NONNESTED MODEL    *;
*  SELECTION, KEVIN CLARKE, 2007                          *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options mprint mlogic formchar = "|----|+|---+=|-/\<>*" nocenter nonumber nodate;

data _tmp1;
  set &data;
  where &ll1 ~= 0 and &ll2 ~= 0;
run;

proc sql noprint;
  select count(*) into :nobs from _tmp1;
quit;

%let schwarz = %sysevalf((&q1 - &q2) * %sysfunc(log(&nobs)) / &nobs);

data _tmp2;
  set _tmp1;
  z = &ll1 - &ll2 - &schwarz;
  b1 = 0;
  b2 = 0;
  if z > 0 then b1 = 1;
  if z < 0 then b2 = 1;
run;

proc sql;
  create table
    _tmp3 as
  select 
    cdf("binomial", count(*) - sum(b1), 0.5, count(*))             as p1,
    cdf("binomial", count(*) - sum(b2), 0.5, count(*))             as p2,
    min(1, cdf("binomial", count(*) - sum(b2), 0.5, count(*)) * 2) as p3
  from 
    _tmp2;
quit;

proc report data = _tmp3 spacing = 1 split = "*" headline nowindows;
  column("Null Hypothesis: MDL1 = MDL2" p1 p2 p3);
  define p1  / "MDL1 > MDL2"  width = 15 format = 12.8 order;
  define p2  / "MDL1 < MDL2"  width = 15 format = 12.8;
  define p3  / "MDL1 != MDL2" width = 15 format = 12.8;
run;

%mend clarke;
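
A call would look something like the line below. The data set and variable names are placeholders rather than anything from the original post; ll1 and ll2 are expected to hold the observation-level log likelihoods of the two competing models (e.g. saved with PREDICT statements in PROC NLMIXED), and q1 / q2 give the numbers of estimated parameters in each model.

*** a hypothetical call (names are placeholders) ***;
%clarke(data = model_lls, ll1 = ln_like1, q1 = 9, ll2 = ln_like2, q2 = 8);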

Written by statcompute

September 24, 2012 at 12:49 am

WoE Macro for LGD Model

WoE (Weight of Evidence) transformation has been widely used in scorecard or PD (Probability of Default) model development with a binary response variable in retail banking. In SAS Global Forum 2012, Anthony and Naeem proposed an innovative method to employ WoE transformation in LGD (Loss Given Default) model development with the response in the range [0, 1]. Specifically, if an account has LGD = 0.25, then this account will be duplicated 100 times, of which 25 instances will be assigned Y = 1 and the remaining 75 instances will be assigned Y = 0. As a result, 1 case with the unity-interval response variable is converted into 100 cases with a binary response variable. After this conversion, a WoE transformation algorithm designed for the binary response can be applied directly to the new dataset with the converted LGD.
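
A minimal sketch of this conversion is given below. It is my own illustration rather than code from the paper, and the data set and variable names (lgd_sample, lgd, y) are placeholders.

data binary_expanded;
  set lgd_sample;
  /* expand each record into 100 rows; round(lgd * 100) of them get y = 1 */
  do i = 1 to 100;
    if i <= round(lgd * 100) then y = 1;
    else y = 0;
    output;
  end;
  drop i;
run;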

Following the logic described by Anthony and Naeem, I did a test run with a WoE SAS macro that I drafted myself and got the following output. In the tested data set, there are in total 36,543 cases with the unity-interval response.

                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X

   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   001          9.5300         84.7200     406000  11.11%    158036  38.93%     -0.1726     0.0033     1.8904
   002         84.7300         91.9100     406800  11.13%    160530  39.46%     -0.1501     0.0025     3.5410
   003         91.9132         97.3100     405300  11.09%    168589  41.60%     -0.0615     0.0004     4.2202
   004         97.3146        101.8100     406000  11.11%    170773  42.06%     -0.0424     0.0002     4.6894
   005        101.8162        105.2221     406100  11.11%    172397  42.45%     -0.0264     0.0001     4.9821
   006        105.2223        110.1642     406000  11.11%    174480  42.98%     -0.0050     0.0000     5.0376
   007        110.1700        115.7000     406600  11.13%    183253  45.07%      0.0800     0.0007     4.1431
   008        115.7029        123.0886     405500  11.10%    188026  46.37%      0.1324     0.0020     2.6630
   009        123.1100        150.0000     406000  11.11%    198842  48.98%      0.2369     0.0063     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 3654300, # BADs(Y=1) = 1574926, OVERALL BAD RATE = 43.0979%, MAX. KS = 5.0376, INFO. VALUE = 0.0154.                                    
--------------------------------------------------------------------------------------------------------------                                        

As shown in the above output, the monotonic WoE binning, information value, and KS statistic all come out nicely, with the only exception that the frequency has been inflated by 100 times. From a practical point of view, the increase in data size will inevitably lead to an increase in computing cost, making this binary conversion potentially inapplicable to large data sets.

After a few tweaks, I drafted a modified version of the WoE SAS macro suitable for LGD models with a unity-interval response, as given below.

%macro lgd_nwoe(data = , y = , x = );

***********************************************************;
* THE SAS MACRO IS TO PERFORM UNIVARIATE IMPORTANCE RANK  *;
* ORDER AND MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION   *;
* FOR NUMERIC ATTRIBUTES IN PRE-MODELING DATA PROCESSING  *;
* FOR LGD MODELS. OUTPUTS ARE SUPPOSED TO BE USED IN A    *;
* FRACTIONAL MODEL WITH LOGIT LINK FUNCTION               *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  Y   : CONTINUOUS RESPONSE VARIABLE IN [0, 1] RANGE     *;
*  X   : A LIST OF NUMERIC ATTRIBUTES                     *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  NUM_WOE.WOE: A FILE OF WOE TRANSFORMATION RECODING     *;
*  NUM_WOE.FMT: A FILE OF BINNING FORMAT                  *;
*  NUM_WOE.PUT: A FILE OF PUT STATEMENTS FOR *.FMT FILE   *;
*  NUM_WOE.SUM: A FILE WITH PREDICTABILITY SUMMARY        *;
*  NUM_WOE.OUT: A FILE WITH STATISTICAL DETAILS           *;
*  NUM_WOE.IMP: A FILE OF MISSING IMPUTATION RECODING     *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

*** DEFAULT PARAMETERS ***;

%local maxbin miniv bignum mincor;

%let maxbin = 20;

%let miniv  = 0;

%let mincor = 1;

%let bignum = 1e300;

***********************************************************;
***         DO NOT CHANGE CODES BELOW THIS LINE         ***;
***********************************************************;

*** DEFAULT OUTPUT FILES ***;

* WOE RECODING FILE                     *;
filename woefile "LGD_NWOE.WOE";

* FORMAT FOR BINNING                    *;
filename fmtfile "LGD_NWOE.FMT";

* PUT STATEMENT TO USE FORMAT           *;
filename binfile "LGD_NWOE.PUT";

* KS SUMMARY                            *;
filename sumfile "LGD_NWOE.SUM";
 
* STATISTICAL SUMMARY FOR EACH VARIABLE *;
filename outfile "LGD_NWOE.OUT";

* IMPUTE RECODING FILE                  *;
filename impfile "LGD_NWOE.IMP";

*** A MACRO TO DELETE FILE ***;
%macro dfile(file = );
  data _null_;
    rc = fdelete("&file");
    if rc = 0 then do;
      put @1 50 * "+";
      put "THE EXISTED OUTPUT FILE HAS BEEN DELETED.";
      put @1 50 * "+";
    end;
  run;
%mend dfile;

*** CLEAN UP FILES ***;
%dfile(file = woefile);

%dfile(file = fmtfile);

%dfile(file = binfile);

%dfile(file = sumfile);

%dfile(file = outfile);

%dfile(file = impfile);

*** PARSING THE STRING OF NUMERIC PREDICTORS ***;
ods listing close;
ods output position = _pos1;
proc contents data = &data varnum;
run;

proc sql noprint;
  select
    upcase(variable) into :x2 separated by ' '
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;


  select
    count(variable) into :xcnt
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;
quit;

data _tmp1;
  retain &x2 &y &y.2;
  set &data;
  where &Y >= 0 and &y <= 1;
  &y.2 = 1 - &y;
  keep &x2 &y &y.2;
run;

ods output position = _pos2;
proc contents data = _tmp1 varnum;
run;

*** LOOP THROUGH EACH PREDICTOR ***;
%do i = 1 %to &xcnt;
    
  proc sql noprint;
    select
      upcase(variable) into :var
    from
      _pos2
    where
      num= &i;

    select
      count(distinct &var) into :xflg
    from
      _tmp1
    where
      &var ~= .;
  quit;

  proc summary data = _tmp1 nway;
    output out  = _med(drop = _type_ _freq_)
    median(&var) = med nmiss(&var) = mis;
  run;
  
  proc sql;
    select
      med into :median
    from
      _med;

    select
      mis into :nmiss
    from
      _med;

    select 
      case when count(&y) = sum(&y) then 1 else 0 end into :mis_flg1
    from
      _tmp1
    where
      &var = .;

    select
      case when sum(&y) = 0 then 1 else 0 end into :mis_flg2
    from
      _tmp1
    where
      &var = .;
  quit;

  %let nbin = %sysfunc(min(&maxbin, &xflg));

  *** CHECK IF # OF DISTINCT VALUES > 1 ***;
  %if &xflg > 1 %then %do;

    *** IMPUTE MISS VALUE WHEN WOE CANNOT BE CALCULATED ***;
    %if &mis_flg1 = 1 | &mis_flg2 = 1 %then %do;
      data _null_;
        file impfile mod;
        put " ";
        put @3 "*** MEDIAN IMPUTATION OF %TRIM(%UPCASE(&VAR)) (NMISS = %trim(&nmiss)) ***;";
        put @3 "IF %TRIM(%UPCASE(&VAR)) = . THEN %TRIM(%UPCASE(&VAR)) = &MEDIAN;";
      run;

      data _tmp1;
        set _tmp1;
        if &var = . then &var = &median;
      run; 
    %end;      
      
    *** LOOP THROUGH ALL # OF BINS ***;
    %do j = &nbin %to 2 %by -1;
      proc rank data = _tmp1 groups = &j out = _tmp2(keep = &y &var rank &y.2);
        var &var;
        ranks rank;
      run;

      proc summary data = _tmp2 nway missing;
        class rank;
        output out = _tmp3(drop = _type_ rename = (_freq_ = freq))
        sum(&y)   = bad    mean(&y)  = bad_rate mean(&y.2) = good_rate
        min(&var) = minx   max(&var) = maxx;
      run;

      *** CREATE FLAGS FOR MULTIPLE CRITERION ***;
      proc sql noprint;
        select
          case when min(bad_rate) > 0 then 1 else 0 end into :minflg
        from
          _tmp3;

        select
          case when max(bad_rate) < 1 then 1 else 0 end into :maxflg
        from
          _tmp3;              
      quit;

      *** CHECK IF SPEARMAN CORRELATION = 1 ***;
      %if &minflg = 1 & &maxflg = 1 %then %do;
        ods output spearmancorr = _corr(rename = (minx = cor));
        proc corr data = _tmp3 spearman;
          var minx;
          with bad_rate;
        run;

        proc sql noprint;
          select
            case when abs(cor) >= &mincor then 1 else 0 end into :cor
          from
            _corr;
        quit;

        *** IF SPEARMAN CORR = 1 THEN BREAK THE LOOP ***;
        %if &cor = 1 %then %goto loopout;
      %end;
      %else %if &nbin = 2 %then %goto exit;
    %end;

    %loopout:

    *** CALCULATE STATISTICAL SUMMARY ***;
    
    proc sql noprint;
      select 
        sum(freq) into :freq
      from
        _tmp3;
      
      select
        sum(bad) into :bad
      from
        _tmp3;

      select
        sum(bad) / sum(freq) into :lgd
      from
        _tmp3;
    quit;

    proc sort data = _tmp3 sortsize = max;
      by rank;
    run;

    data _tmp4;
      retain bin minx maxx bad freq pct bad_rate good_rate;
      set _tmp3 end = eof;
      by rank;

      if rank = . then bin = 0;
      else do;
        retain b 0;
        bin + 1;
      end;
  
      pct  = freq / &freq;
      bpct = bad / &bad;
      gpct = (freq - bad) / (&freq - &bad);
      woe  = log(bpct / gpct);
      iv   = (bpct - gpct) * woe;
      
      retain cum_bpct cum_gpct;
      cum_bpct + bpct;
      cum_gpct + gpct;
      ks = abs(cum_gpct - cum_bpct) * 100;
      
      retain iv_sum ks_max;
      iv_sum + iv;
      ks_max = max(ks_max, ks);
      if eof then do;
        call symput("bin", put(bin, 4.));
        call symput("ks", put(ks_max, 10.4));
        call symput("iv", put(iv_sum, 10.4));
      end;

      keep bin minx maxx bad freq pct bad_rate
           gpct bpct woe iv cum_gpct cum_bpct ks;
    run;

    *** REPORT STATISTICAL SUMMARY ***;
    proc printto print = outfile;
    run;

    title;
    ods listing;
    proc report data = _tmp4 spacing = 1 split = "*" headline nowindows;
      column(" * MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(%trim(&var))"
             bin minx maxx freq pct bad_rate woe iv ks);

      define bin      /"BIN#"         width = 5  format = z3. order order = data;
      define minx     /"LOWER LIMIT"  width = 15 format = 14.4;
      define maxx     /"UPPER LIMIT"  width = 15 format = 14.4;
      define freq     /"#FREQ"        width = 10 format = 9.;
      define pct      /"DISTRIBUTION" width = 15 format = percent14.4;
      define bad_rate /"MEAN LGD"     width = 15 format = percent14.4;
      define woe      /"WOE"          width = 15 format = 14.8;
      define iv       /"INFO. VALUE"  width = 15 format = 14.8;
      define ks       /"KS"           width = 10 format = 9.4;
      compute after;
        line @1 125 * "-";
        line @10 "# TOTAL = %trim(&freq), AVERAGE LGD = %trim(&lgd), "
                 "MAX. KS = %trim(&ks), INFO. VALUE = %trim(&iv).";
        line @1 125 * "-";    
      endcomp;
    run;
    ods listing close;

    proc printto;
    run;

    proc sql noprint;
      select
        case when sum(iv) >= &miniv then 1 else 0 end into :ivflg
      from
        _tmp4;
    quit;

    *** OUTPUT RECODING FILES IF IV >= &miniv BY DEFAULT ***;
    %if &ivflg = 1 %then %do;
      data _tmp5;
        length upper $20 lower $20;
        lower = compress(put(maxx, 20.4), ' ');

        set _tmp4 end = eof;
        upper = compress(put(maxx, 20.4), ' ');
        if bin = 1 then lower = "-%trim(&bignum)";
        if eof then upper = "%trim(&bignum)";
        w%trim(&var) = compress(put(woe, 12.8), ' ');
      run;

      *** OUTPUT WOE RECODE FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file woefile mod;

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3  "if %trim(&var) = . then w%trim(&var) = " w%trim(&var) ";";
        end;
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3 "if " lower "< %trim(&var) <= " upper
                 "then w%trim(&var) = " w%trim(&var) ";";
        end;
        if _n_ > 1 then do;
          put @5 "else if " lower "< %trim(&var) <= " upper
                 "then w%trim(&var) = " w%trim(&var) ";";
        end;
        if eof then do;
          put @5 "else w%trim(&var) = 0;";
        end;
      run;

      *** OUTPUT BINNING FORMAT FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file fmtfile mod;

        if bin = 1 then lower = "LOW";
        if eof then upper = "HIGH";

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 ". " @40 " = '" bin: z3.
              @47 ". MISSINGS'";
        end;

            
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
              @5 "BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 lower @15 " - " upper  @40 " = '" bin: z3.
              @47 "." + 1 lower "- " upper "'";
        end;

        if _n_ > 1 then do;
          put @5 lower @15 "<- " upper @40 " = '" bin: z3.
              @47 "." + 1 lower "<- " upper "'";
        end;
        if eof then do;
          put @5 "OTHER" @40 " = '999. OTHERS';";
        end;
      run;

      *** OUTPUT BINNING RECODE FILE ***;
      data _null_;
        file binfile mod;
        put " ";
        put @3 "*** BINNING RECODE of %trim(&var) ***;";
        put @3 "c%trim(&var) = put(%trim(&var), %trim(&var)_fmt.);";
      run;

      *** SAVE SUMMARY OF EACH VARIABLE INTO A TABLE ***;
      %if %sysfunc(exist(work._result)) %then %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          if _n_ = 1 then do;
            variable = "%trim(&var)";
            bin      = &bin;
            ks       = &ks;
            iv       = &iv;
            output;
          end;
          set _result;
          output;
        run;
      %end;
      %else %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          variable = "%trim(&var)";
          bin      = &bin;
          ks       = &ks;
          iv       = &iv;
        run;        
      %end;
    %end;

    %exit:

    *** CLEAN UP TEMPORARY TABLES ***;
    proc datasets library = work nolist;
      delete _tmp2 - _tmp5 _corr / memtype = data;
    run;
    quit;
  %end;    
%end;

*** SORT VARIABLES BY KS AND OUTPUT RESULTS ***;
proc sort data = _result sortsize = max;
  by descending iv descending ks;
run;

data _null_;
  set _result end = eof;
  file sumfile;

  if _n_ = 1 then do;
    put @1 80 * "-";
    put @1  "| RANK" @10 "| VARIABLE RANKED BY IV" @45 "| # BINS"
        @55 "|  KS"  @66 "| INFO. VALUE" @80 "|";
    put @1 80 * "-";
  end;
  put @1  "| " @4  _n_ z3. @10 "| " @12 variable @45 "| " @50 bin
      @55 "| " @57 ks      @66 "| " @69 iv       @80 "|";
  if eof then do;
    put @1 80 * "-";
  end;
run;

proc datasets library = work nolist;
  delete _result (mt = data);
run;
quit;

*********************************************************;
*                   END OF MACRO                        *;
*********************************************************;

%mend lgd_nwoe;
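
A call follows the same pattern as the %separation macro earlier. The data set and variable names below are placeholders rather than the actual test data.

*** a hypothetical call (data set and variable names are placeholders) ***;
%lgd_nwoe(data = data.lgd_sample, y = lgd, x = x1 x2 x3);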

With this macro, I tested the same data again but without the binary conversion proposed by Anthony and Naeem. The output is also pasted below for comparison.

                                                                                                                            
                                 MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X

  BIN#     LOWER LIMIT     UPPER LIMIT      #FREQ    DISTRIBUTION        MEAN LGD             WOE     INFO. VALUE         KS
 ---------------------------------------------------------------------------------------------------------------------------
   001          9.5300         84.7200       4060       11.1102%        38.9246%      -0.17266201      0.00326518     1.8911
   002         84.7300         91.9100       4068       11.1321%        39.4665%      -0.14992613      0.00247205     3.5399
   003         91.9132         97.3100       4053       11.0910%        41.5962%      -0.06155066      0.00041827     4.2195
   004         97.3146        101.8100       4060       11.1102%        42.0504%      -0.04288377      0.00020368     4.6945
   005        101.8162        105.2221       4061       11.1129%        42.4538%      -0.02634960      0.00007701     4.9867
   006        105.2223        110.1642       4060       11.1102%        42.9896%      -0.00445666      0.00000221     5.0362
   007        110.1700        115.7000       4066       11.1266%        45.0653%       0.07978905      0.00071190     4.1440
   008        115.7029        123.0886       4055       11.0965%        46.3687%       0.13231366      0.00195768     2.6644
   009        123.1100        150.0000       4060       11.1102%        48.9800%       0.23701619      0.00631510     0.0000
-----------------------------------------------------------------------------------------------------------------------------
         # TOTAL = 36543, AVERAGE LGD = 0.430988, MAX. KS = 5.0362, INFO. VALUE = 0.0154.                                    
-----------------------------------------------------------------------------------------------------------------------------

The comparison shows that most of the information from the 2 implementations is almost identical. For instance, the information value is 0.0154 in both cases. However, the frequency in the second output reflects the correct sample size, i.e. 36,543. As a result, the computation is more efficient and the run time is much shorter. Considering that usually hundreds of variables need to be looped through this WoE transformation one by one, the approach without duplicating cases and converting to a binary response might make more practical sense.

Written by statcompute

September 8, 2012 at 12:33 am