Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

WoE Macro for LGD Model

WoE (Weight of Evidence) transformation has been widely used in scorecard or PD (Probability of Default) model development with a binary response variable in the retail banking. In SAS Global Forum 2012, Anthony and Naeem proposed an innovative method to employ WoE transformation in LGD (Loss Given Default) model development with the response in the range [0, 1]. Specifically, if an account with LGD = 0.25, then this account will be duplicated 100 times, of which 25 instances will be assigned Y = 1 and the rest 75 instances will be assigned Y = 0. As a result, 1 case with the unity-interval response variable has been converted to 100 cases with the binary response variable. After this conversion, WoE transformation algorithm designed for the binary response can be directly applied to the new dataset with the converted LGD.

Following the logic described by Anthony and Naeem, I did a test run with WoE SAS macro that I drafted by myself and got the following output. In the tested data set, there are totally 36,543 cases with the unity-interval response.

                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X

   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   001          9.5300         84.7200     406000  11.11%    158036  38.93%     -0.1726     0.0033     1.8904
   002         84.7300         91.9100     406800  11.13%    160530  39.46%     -0.1501     0.0025     3.5410
   003         91.9132         97.3100     405300  11.09%    168589  41.60%     -0.0615     0.0004     4.2202
   004         97.3146        101.8100     406000  11.11%    170773  42.06%     -0.0424     0.0002     4.6894
   005        101.8162        105.2221     406100  11.11%    172397  42.45%     -0.0264     0.0001     4.9821
   006        105.2223        110.1642     406000  11.11%    174480  42.98%     -0.0050     0.0000     5.0376
   007        110.1700        115.7000     406600  11.13%    183253  45.07%      0.0800     0.0007     4.1431
   008        115.7029        123.0886     405500  11.10%    188026  46.37%      0.1324     0.0020     2.6630
   009        123.1100        150.0000     406000  11.11%    198842  48.98%      0.2369     0.0063     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 3654300, # BADs(Y=1) = 1574926, OVERALL BAD RATE = 43.0979%, MAX. KS = 5.0376, INFO. VALUE = 0.0154.                                    
--------------------------------------------------------------------------------------------------------------                                        

As shown in the above output, monotonic WoE binning, information value, and KS statistic come out nicely with the only exception that the frequency has been increased by 100 times. From a practical view, the increase in the data size will inevitably lead o the increase in the computing cost, making this binary conversion potentially inapplicable to large data.

After a few tweaks, I drafted a modified version of WoE SAS macro suitable for LGD model with the unity-interval response, as given below.

%macro lgd_nwoe(data = , y = , x = );

***********************************************************;
* THE SAS MACRO IS TO PERFORM UNIVARIATE IMPORTANCE RANK  *;
* ORDER AND MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION   *;
* FOR NUMERIC ATTRIBUTES IN PRE-MODELING DATA PROCESSING  *;
* FOR LGD MODELS. OUTPUTS ARE SUPPOSED TO BE USED IN A    *;
* FRACTIONAL MODEL WITH LOGIT LINK FUNCTION               *;
* ======================================================= *;
* PAMAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  Y   : CONTINUOUS RESPONSE VARIABLE IN [0, 1] RANGE     *;
*  X   : A LIST OF NUMERIC ATTRIBUTES                     *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  NUM_WOE.WOE: A FILE OF WOE TRANSFORMATION RECODING     *;
*  NUM_WOE.FMT: A FILE OF BINNING FORMAT                  *;
*  NUM_WOE.PUT: A FILE OF PUT STATEMENTS FOR *.FMT FILE   *;
*  NUM_WOE.SUM: A FILE WITH PREDICTABILITY SUMMARY        *;
*  NUM_WOE.OUT: A FILE WITH STATISTICAL DETAILS           *;
*  NUM_WOE.IMP: A FILE OF MISSING IMPUTATION RECODING     *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 125 formchar = "|----|+|---+=|-/\<>*";

*** DEFAULT PARAMETERS ***;

%local maxbin miniv bignum mincor;

%let maxbin = 20;

%let miniv  = 0;

%let mincor = 1;

%let bignum = 1e300;

***********************************************************;
***         DO NOT CHANGE CODES BELOW THIS LINE         ***;
***********************************************************;

*** DEFAULT OUTPUT FILES ***;

* WOE RECODING FILE                     *;
filename woefile "LGD_NWOE.WOE";

* FORMAT FOR BINNING                    *;
filename fmtfile "LGD_NWOE.FMT";

* PUT STATEMENT TO USE FORMAT           *;
filename binfile "LGD_NWOE.PUT";

* KS SUMMARY                            *;
filename sumfile "LGD_NWOE.SUM";
 
* STATISTICAL SUMMARY FOR EACH VARIABLE *;
filename outfile "LGD_NWOE.OUT";

* IMPUTE RECODING FILE                  *;
filename impfile "LGD_NWOE.IMP";

*** A MACRO TO DELETE FILE ***;
%macro dfile(file = );
  data _null_;
    rc = fdelete("&file");
    if rc = 0 then do;
      put @1 50 * "+";
      put "THE EXISTED OUTPUT FILE HAS BEEN DELETED.";
      put @1 50 * "+";
    end;
  run;
%mend dfile;

*** CLEAN UP FILES ***;
%dfile(file = woefile);

%dfile(file = fmtfile);

%dfile(file = binfile);

%dfile(file = sumfile);

%dfile(file = outfile);

%dfile(file = impfile);

*** PARSING THE STRING OF NUMERIC PREDICTORS ***;
ods listing close;
ods output position = _pos1;
proc contents data = &data varnum;
run;

proc sql noprint;
  select
    upcase(variable) into :x2 separated by ' '
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;


  select
    count(variable) into :xcnt
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;
quit;

data _tmp1;
  retain &x2 &y &y.2;
  set &data;
  where &Y >= 0 and &y <= 1;
  &y.2 = 1 - &y;
  keep &x2 &y &y.2;
run;

ods output position = _pos2;
proc contents data = _tmp1 varnum;
run;

*** LOOP THROUGH EACH PREDICTOR ***;
%do i = 1 %to &xcnt;
    
  proc sql noprint;
    select
      upcase(variable) into :var
    from
      _pos2
    where
      num= &i;

    select
      count(distinct &var) into :xflg
    from
      _tmp1
    where
      &var ~= .;
  quit;

  proc summary data = _tmp1 nway;
    output out  = _med(drop = _type_ _freq_)
    median(&var) = med nmiss(&var) = mis;
  run;
  
  proc sql;
    select
      med into :median
    from
      _med;

    select
      mis into :nmiss
    from
      _med;

    select 
      case when count(&y) = sum(&y) then 1 else 0 end into :mis_flg1
    from
      _tmp1
    where
      &var = .;

    select
      case when sum(&y) = 0 then 1 else 0 end into :mis_flg2
    from
      _tmp1
    where
      &var = .;
  quit;

  %let nbin = %sysfunc(min(&maxbin, &xflg));

  *** CHECK IF # OF DISTINCT VALUES > 1 ***;
  %if &xflg > 1 %then %do;

    *** IMPUTE MISS VALUE WHEN WOE CANNOT BE CALCULATED ***;
    %if &mis_flg1 = 1 | &mis_flg2 = 1 %then %do;
      data _null_;
        file impfile mod;
        put " ";
        put @3 "*** MEDIAN IMPUTATION OF %TRIM(%UPCASE(&VAR)) (NMISS = %trim(&nmiss)) ***;";
        put @3 "IF %TRIM(%UPCASE(&VAR)) = . THEN %TRIM(%UPCASE(&VAR)) = &MEDIAN;";
      run;

      data _tmp1;
        set _tmp1;
        if &var = . then &var = &median;
      run; 
    %end;      
      
    *** LOOP THROUGH ALL # OF BINS ***;
    %do j = &nbin %to 2 %by -1;
      proc rank data = _tmp1 groups = &j out = _tmp2(keep = &y &var rank &y.2);
        var &var;
        ranks rank;
      run;

      proc summary data = _tmp2 nway missing;
        class rank;
        output out = _tmp3(drop = _type_ rename = (_freq_ = freq))
        sum(&y)   = bad    mean(&y)  = bad_rate mean(&y.2) = good_rate
        min(&var) = minx   max(&var) = maxx;
      run;

      *** CREATE FLAGS FOR MULTIPLE CRITERION ***;
      proc sql noprint;
        select
          case when min(bad_rate) > 0 then 1 else 0 end into :minflg
        from
          _tmp3;

        select
          case when max(bad_rate) < 1 then 1 else 0 end into :maxflg
        from
          _tmp3;              
      quit;

      *** CHECK IF SPEARMAN CORRELATION = 1 ***;
      %if &minflg = 1 & &maxflg = 1 %then %do;
        ods output spearmancorr = _corr(rename = (minx = cor));
        proc corr data = _tmp3 spearman;
          var minx;
          with bad_rate;
        run;

        proc sql noprint;
          select
            case when abs(cor) >= &mincor then 1 else 0 end into :cor
          from
            _corr;
        quit;

        *** IF SPEARMAN CORR = 1 THEN BREAK THE LOOP ***;
        %if &cor = 1 %then %goto loopout;
      %end;
      %else %if &nbin = 2 %then %goto exit;
    %end;

    %loopout:

    *** CALCULATE STATISTICAL SUMMARY ***;
    
    proc sql noprint;
      select 
        sum(freq) into :freq
      from
        _tmp3;
      
      select
        sum(bad) into :bad
      from
        _tmp3;

      select
        sum(bad) / sum(freq) into :lgd
      from
        _tmp3;
    quit;

    proc sort data = _tmp3 sortsize = max;
      by rank;
    run;

    data _tmp4;
      retain bin minx maxx bad freq pct bad_rate good_rate;
      set _tmp3 end = eof;
      by rank;

      if rank = . then bin = 0;
      else do;
        retain b 0;
        bin + 1;
      end;
  
      pct  = freq / &freq;
      bpct = bad / &bad;
      gpct = (freq - bad) / (&freq - &bad);
      woe  = log(bpct / gpct);
      iv   = (bpct - gpct) * woe;
      
      retain cum_bpct cum_gpct;
      cum_bpct + bpct;
      cum_gpct + gpct;
      ks = abs(cum_gpct - cum_bpct) * 100;
      
      retain iv_sum ks_max;
      iv_sum + iv;
      ks_max = max(ks_max, ks);
      if eof then do;
        call symput("bin", put(bin, 4.));
        call symput("ks", put(ks_max, 10.4));
        call symput("iv", put(iv_sum, 10.4));
      end;

      keep bin minx maxx bad freq pct bad_rate
           gpct bpct woe iv cum_gpct cum_bpct ks;
    run;

    *** REPORT STATISTICAL SUMMARY ***;
    proc printto print = outfile;
    run;

    title;
    ods listing;
    proc report data = _tmp4 spacing = 1 split = "*" headline nowindows;
      column(" * MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(%trim(&var))"
             bin minx maxx freq pct bad_rate woe iv ks);

      define bin      /"BIN#"         width = 5  format = z3. order order = data;
      define minx     /"LOWER LIMIT"  width = 15 format = 14.4;
      define maxx     /"UPPER LIMIT"  width = 15 format = 14.4;
      define freq     /"#FREQ"        width = 10 format = 9.;
      define pct      /"DISTRIBUTION" width = 15 format = percent14.4;
      define bad_rate /"MEAN LGD"     width = 15 format = percent14.4;
      define woe      /"WOE"          width = 15 format = 14.8;
      define iv       /"INFO. VALUE"  width = 15 format = 14.8;
      define ks       /"KS"           width = 10 format = 9.4;
      compute after;
        line @1 125 * "-";
        line @10 "# TOTAL = %trim(&freq), AVERAGE LGD = %trim(&lgd), "
                 "MAX. KS = %trim(&ks), INFO. VALUE = %trim(&iv).";
        line @1 125 * "-";    
      endcomp;
    run;
    ods listing close;

    proc printto;
    run;

    proc sql noprint;
      select
        case when sum(iv) >= &miniv then 1 else 0 end into :ivflg
      from
        _tmp4;
    quit;

    *** OUTPUT RECODING FILES IF IV >= &miniv BY DEFAULT ***;
    %if &ivflg = 1 %then %do;
      data _tmp5;
        length upper $20 lower $20;
        lower = compress(put(maxx, 20.4), ' ');

        set _tmp4 end = eof;
        upper = compress(put(maxx, 20.4), ' ');
        if bin = 1 then lower = "-%trim(&bignum)";
        if eof then upper = "%trim(&bignum)";
        w%trim(&var) = compress(put(woe, 12.8), ' ');
      run;

      *** OUTPUT WOE RECODE FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file woefile mod;

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3  "if %trim(&var) = . then w%trim(&var) = " w%trim(&var) ";";
        end;
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3 "if " lower "< %trim(&var) <= " upper
                 "then w%trim(&var) = " w%trim(&var) ";";
        end;
        if _n_ > 1 then do;
          put @5 "else if " lower "< %trim(&var) <= " upper
                 "then w%trim(&var) = " w%trim(&var) ";";
        end;
        if eof then do;
          put @5 "else w%trim(&var) = 0;";
        end;
      run;

      *** OUTPUT BINNING FORMAT FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file fmtfile mod;

        if bin = 1 then lower = "LOW";
        if eof then upper = "HIGH";

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 ". " @40 " = '" bin: z3.
              @47 ". MISSINGS'";
        end;

            
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
              @5 "BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 lower @15 " - " upper  @40 " = '" bin: z3.
              @47 "." + 1 lower "- " upper "'";
        end;

        if _n_ > 1 then do;
          put @5 lower @15 "<- " upper @40 " = '" bin: z3.
              @47 "." + 1 lower "<- " upper "'";
        end;
        if eof then do;
          put @5 "OTHER" @40 " = '999. OTHERS';";
        end;
      run;

      *** OUTPUT BINNING RECODE FILE ***;
      data _null_;
        file binfile mod;
        put " ";
        put @3 "*** BINNING RECODE of %trim(&var) ***;";
        put @3 "c%trim(&var) = put(%trim(&var), %trim(&var)_fmt.);";
      run;

      *** SAVE SUMMARY OF EACH VARIABLE INTO A TABLE ***;
      %if %sysfunc(exist(work._result)) %then %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          if _n_ = 1 then do;
            variable = "%trim(&var)";
            bin      = &bin;
            ks       = &ks;
            iv       = &iv;
            output;
          end;
          set _result;
          output;
        run;
      %end;
      %else %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          variable = "%trim(&var)";
          bin      = &bin;
          ks       = &ks;
          iv       = &iv;
        run;        
      %end;
    %end;

    %exit:

    *** CLEAN UP TEMPORARY TABLES ***;
    proc datasets library = work nolist;
      delete _tmp2 - _tmp5 _corr / memtype = data;
    run;
    quit;
  %end;    
%end;

*** SORT VARIABLES BY KS AND OUTPUT RESULTS ***;
proc sort data = _result sortsize = max;
  by descending iv descending ks;
run;

data _null_;
  set _result end = eof;
  file sumfile;

  if _n_ = 1 then do;
    put @1 80 * "-";
    put @1  "| RANK" @10 "| VARIABLE RANKED BY IV" @45 "| # BINS"
        @55 "|  KS"  @66 "| INFO. VALUE" @80 "|";
    put @1 80 * "-";
  end;
  put @1  "| " @4  _n_ z3. @10 "| " @12 variable @45 "| " @50 bin
      @55 "| " @57 ks      @66 "| " @69 iv       @80 "|";
  if eof then do;
    put @1 80 * "-";
  end;
run;

proc datasets library = work nolist;
  delete _result (mt = data);
run;
quit;

*********************************************************;
*                   END OF MACRO                        *;
*********************************************************;

%mend lgd_nwoe;

With this macro, I tested on the same data again but without the binary conversion proposed Anthony and Naeem. The output is also pasted below for the comparison purpose.

                                                                                                                            
                                 MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR X

  BIN#     LOWER LIMIT     UPPER LIMIT      #FREQ    DISTRIBUTION        MEAN LGD             WOE     INFO. VALUE         KS
 ---------------------------------------------------------------------------------------------------------------------------
   001          9.5300         84.7200       4060       11.1102%        38.9246%      -0.17266201      0.00326518     1.8911
   002         84.7300         91.9100       4068       11.1321%        39.4665%      -0.14992613      0.00247205     3.5399
   003         91.9132         97.3100       4053       11.0910%        41.5962%      -0.06155066      0.00041827     4.2195
   004         97.3146        101.8100       4060       11.1102%        42.0504%      -0.04288377      0.00020368     4.6945
   005        101.8162        105.2221       4061       11.1129%        42.4538%      -0.02634960      0.00007701     4.9867
   006        105.2223        110.1642       4060       11.1102%        42.9896%      -0.00445666      0.00000221     5.0362
   007        110.1700        115.7000       4066       11.1266%        45.0653%       0.07978905      0.00071190     4.1440
   008        115.7029        123.0886       4055       11.0965%        46.3687%       0.13231366      0.00195768     2.6644
   009        123.1100        150.0000       4060       11.1102%        48.9800%       0.23701619      0.00631510     0.0000
-----------------------------------------------------------------------------------------------------------------------------
         # TOTAL = 36543, AVERAGE LGD = 0.430988, MAX. KS = 5.0362, INFO. VALUE = 0.0154.                                    
-----------------------------------------------------------------------------------------------------------------------------

The comparison shows that most information from 2 implementations are almost identical. For instance, information value = 0.0154 in both cases. However, the frequency in the second output reflects the correct sample size, e.g 36,543. As a result, the computation is more efficient and the time consumed is much shorter. Considering the fact that usually hundreds of variables should be looped through this WoE transformation one by one, the approach without duplicating cases and converting to binary might make more practical sense.

Advertisements

Written by statcompute

September 8, 2012 at 12:33 am

%d bloggers like this: