Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for June 2012

Bumping: A Stochastic Search for the Best Model

Breiman (1996) showed how bootstrap sampling can improve prediction accuracy through the bagging algorithm, which has been applied successfully to subset selection and decision trees. However, a major drawback of bagging is that it destroys the simple structure of the original model. For instance, a bagged ensemble of decision trees can no longer be presented as a single tree. As a result, bagging improves prediction accuracy at the cost of interpretability.

Tibshirani and Knight (1997) proposed another use of bootstrap sampling, named bumping. In the bumping algorithm, candidate models are estimated on bootstrap samples and each one is then scored on the original data, which amounts to a stochastic search for the best single model across the model space. As such, the simple structure of the original model, such as that of a decision tree, is preserved.
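
To make the search concrete, here is a minimal Python sketch of the idea. It is not a translation of the SAS macro in this post. It assumes a toy one-split classifier (`fit_stump`) in place of a full decision tree and uses the misclassification rate on the original data as the selection criterion, whereas the macro uses the KS statistic. All function names and the toy data are made up for illustration.

```python
import random

def fit_stump(sample):
    """Fit a one-split classifier on (x, y) pairs; return (threshold, left, right)."""
    xs = sorted({x for x, _ in sample})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x < t else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    if best is None:
        # Only one distinct x in the sample: predict the majority class everywhere.
        major = int(2 * sum(y for _, y in sample) >= len(sample))
        return (xs[0], major, major)
    return best[1:]

def error_rate(model, data):
    """Share of observations in `data` that the stump misclassifies."""
    t, left, right = model
    return sum((left if x < t else right) != y for x, y in data) / len(data)

def bump(data, n_models=50, seed=1):
    """Fit one stump per bootstrap sample and keep the one that does best on `data`."""
    rng = random.Random(seed)
    best_model, best_err = None, float("inf")
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # draw with replacement
        model = fit_stump(sample)
        err = error_rate(model, data)  # always scored on the ORIGINAL data
        if err < best_err:
            best_model, best_err = model, err
    return best_model, best_err

# Hypothetical toy data: y switches from 0 to 1 at x = 5.
data = [(x, int(x >= 5)) for x in range(10)]
model, err = bump(data, n_models=20, seed=1)
print(model, err)
```

The result is still a single stump, so it can be read and scored exactly like a model fitted on the original data; only the way it was found is stochastic.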

The SAS macro below shows how to implement the bumping algorithm.

%macro bumping(data = , y = , numx = , catx = x, ntrees = 100);
***********************************************************;
* THIS SAS MACRO IS AN ATTEMPT TO IMPLEMENT BUMPING       *;
* PROPOSED BY TIBSHIRANI AND KNIGHT (1997)                *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA   : INPUT SAS DATA TABLE                          *;
*  Y      : RESPONSE VARIABLE WITH 0/1 VALUE              *;
*  NUMX   : A LIST OF NUMERIC ATTRIBUTES                  *;
*  CATX   : A LIST OF CATEGORICAL ATTRIBUTES              *;
*  NTREES : # OF TREES TO DO THE BUMPING SEARCH           *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  BESTTREE.TXT: A TEXT FILE USED TO SCORE THE BEST TREE  *;
*                THROUGH THE BUMPING SEARCH               *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options mprint mlogic nocenter nodate nonumber;

*** a random seed value subject to change ***;
%let seed = 1;

data _null_;
  do i = 1 to &ntrees;
    random = put(ranuni(&seed) * (10 ** 8), 8.);
    name   = compress("random"||put(i, 3.), ' ');
    call symput(name, random);
  end;
run;    

proc datasets library = data nolist;
  delete catalog / memtype = catalog;
run;
quit;

proc sql noprint;
  select count(*) into :nobs from &data where &y in (1, 0);
quit;

%do i = 1 %to &ntrees;
  %put &&random&i;

  proc surveyselect data = &data method = urs n = &nobs seed = &&random&i
    out = sample&i(rename = (NumberHits = _hits)) noprint;
  run;
  
  proc dmdb data = sample&i out = db_sample&i dmdbcat = cl_sample&i;
    class &y &catx;
    var &numx;
    target &y;
    freq _hits;
  run;

  filename out_tree catalog "data.catalog.out_tree.source";
  
  proc split data = db_sample&i dmdbcat = cl_sample&i
    criterion    = gini
    assess       = impurity
    maxbranch    = 2
    splitsize    = 100
    subtree      = assessment
    exhaustive   = 0 
    nsurrs       = 0;
    code file    = out_tree;
    input &numx   / level = interval;
    input &catx   / level = nominal;
    target &y     / level = binary;
    freq _hits;
  run;  

  filename in_tree catalog "data.catalog.tree&i..source";

  data _null_;
    infile out_tree;
    input;
    file in_tree;
    if _n_ > 3 then put _infile_;
  run;

  data _tmp1(keep = p_&y.1 p_&y.0 &y);
    set &data;
    %include in_tree;
  run;

  proc printto new print = lst_out;
  run;

  ods output kolsmir2stats = _kstmp(where = (label1 = 'KS'));
  proc npar1way wilcoxon edf data = _tmp1;
    class &y;
    var p_&y.1;
  run;

  proc printto;
  run;

  %if &i = 1 %then %do;
    data _ks;
      set _kstmp (keep = nvalue2);
      tree_id = &i;
      seed    = &&random&i;
      ks      = round(nvalue2 * 100, 0.0001);
    run;
  %end;    
  %else %do;
    data _ks;
      set _ks _kstmp(in = a keep = nvalue2);
      if a then do;
        tree_id = &i;
        seed    = &&random&i;
        ks      = round(nvalue2 * 100, 0.0001);
      end;
    run;
  %end;  
%end;

proc sql noprint;
  select max(ks) into :ks from _ks;
  
  select tree_id into :best from _ks where round(ks, 0.0001) = round(&ks., 0.0001);
quit;

filename best catalog "data.catalog.tree%trim(&best).source";
filename output "BestTree.txt";

data _null_;
  infile best;
  input;
  file output;
  if _n_ = 1 then do;
    put " ******************************************************; ";
    put " ***** BEST TREE: TREE %trim(&best) WITH KS = &KS *****; ";
    put " ******************************************************; ";    
  end;
  put _infile_;
run;

data _out;
  set _ks;

  if round(ks, 0.0001) = round(&ks., 0.0001) then flag = '***';
run;

proc print data = _out noobs;
  var tree_id seed ks flag;
run;

%mend bumping;

libname data 'D:\SAS_CODE\bagging';

%let x1 = tot_derog tot_tr age_oldest_tr tot_open_tr tot_rev_tr tot_rev_debt
            tot_rev_line rev_util bureau_score ltv tot_income;

%let x2 = purpose;

%bumping(data = data.accepts, y = bad, numx = &x1, catx = &x2, ntrees = 50);

The table below shows the result of a stochastic search for the best decision tree among 50 trees estimated from bootstrap samples. Each tree is scored on the original data by its KS statistic (multiplied by 100), and the best tree is flagged with “***”. Given the seed values listed, each bootstrap sample and its decision tree can be reproduced.

tree_id      seed         ks      flag
    1      18496257    41.1210        
    2      97008872    41.2568        
    3      39982431    39.2714        
    4      25939865    38.7901        
    5      92160258    40.9343        
    6      96927735    40.6441        
    7      54297917    41.2460        
    8      53169172    40.5881        
    9       4979403    40.9662        
   10       6656655    41.2006        
   11      81931857    41.1540        
   12      52387052    40.1930        
   13      85339431    36.7912        
   14       6718458    39.9277        
   15      95702386    39.0264        
   16      29719396    39.8790        
   17      27261179    40.1256        
   18      68992963    40.7699        
   19      97676486    37.7472        
   20      22650752    39.6255        
   21      68823655    40.3759        
   22      41276387    41.2282        
   23      55855411    41.5945    *** 
   24      28722561    40.6127        
   25      47578931    40.2973        
   26      84498698    38.6929        
   27      63452412    41.0329        
   28      59036467    39.1822        
   29      58258153    40.5223        
   30      37701337    40.2190        
   31      72836156    40.1872        
   32      50660353    38.5086        
   33      93121359    39.9043        
   34      92912005    40.0265        
   35      58966034    38.8403        
   36      29722285    39.7879        
   37      39104243    38.4006        
   38      47242918    39.5534        
   39      67952575    39.2817        
   40      16808835    40.4024        
   41      16652610    40.5237        
   42      87110489    39.9251        
   43      29878953    39.6106        
   44      93464176    40.5942        
   45      90047083    40.4422        
   46      56878347    40.6057        
   47       4954566    39.7689        
   48      13558826    38.7292        
   49      51131788    41.0891        
   50      43320456    41.0566        
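
For readers who want to see what the KS column measures, the following Python sketch computes a two-sample Kolmogorov-Smirnov statistic as the largest gap between the empirical score distributions of the two classes, scaled by 100 as in the table. It is an illustration under that assumption rather than a reproduction of PROC NPAR1WAY, and `ks_statistic` is a hypothetical helper name. The three KS values in the final lines are copied from the table above.

```python
def ks_statistic(scores, labels):
    """Return 100 * max |F1(s) - F0(s)| over all observed scores s."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    if n1 == 0 or n0 == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels))
    cum1 = cum0 = 0
    d = 0.0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Consume every observation tied at this score before comparing the CDFs.
        while i < len(pairs) and pairs[i][0] == s:
            if pairs[i][1] == 1:
                cum1 += 1
            else:
                cum0 += 1
            i += 1
        d = max(d, abs(cum1 / n1 - cum0 / n0))
    return 100.0 * d

# Selecting the winner is then a plain maximum over the candidate trees.
ks_by_tree = {1: 41.1210, 2: 41.2568, 23: 41.5945}
best_tree = max(ks_by_tree, key=ks_by_tree.get)
print(best_tree, ks_by_tree[best_tree])
```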

Finally, the text file used to score the best tree is shown below. It exposes the structure of the best decision tree selected from the 50 trees.

 ******************************************************; 
 ***** BEST TREE: TREE 23 WITH KS =  41.5945 *****; 
 ******************************************************; 
 
 ******         LENGTHS OF NEW CHARACTER VARIABLES         ******;
 LENGTH I_bad  $   12; 
 LENGTH _WARN_  $    4; 
 
 ******              LABELS FOR NEW VARIABLES              ******;
 LABEL _NODE_  = 'Node' ;
 LABEL _LEAF_  = 'Leaf' ;
 LABEL P_bad0  = 'Predicted: bad=0' ;
 LABEL P_bad1  = 'Predicted: bad=1' ;
 LABEL I_bad  = 'Into: bad' ;
 LABEL U_bad  = 'Unnormalized Into: bad' ;
 LABEL _WARN_  = 'Warnings' ;
 
 
 ******      TEMPORARY VARIABLES FOR FORMATTED VALUES      ******;
 LENGTH _ARBFMT_2 $     12; DROP _ARBFMT_2; 
 _ARBFMT_2 = ' '; /* Initialize to avoid warning. */
 LENGTH _ARBFMT_15 $      5; DROP _ARBFMT_15; 
 _ARBFMT_15 = ' '; /* Initialize to avoid warning. */
 
 
 ******             ASSIGN OBSERVATION TO NODE             ******;
 IF  NOT MISSING(bureau_score ) AND 
                  662.5 <= bureau_score  THEN DO;
   IF  NOT MISSING(bureau_score ) AND 
                    721.5 <= bureau_score  THEN DO;
     IF  NOT MISSING(tot_derog ) AND 
       tot_derog  <                  3.5 THEN DO;
       IF  NOT MISSING(ltv ) AND 
                         99.5 <= ltv  THEN DO;
         IF  NOT MISSING(tot_rev_line ) AND 
           tot_rev_line  <                16671 THEN DO;
           _ARBFMT_15 = PUT( purpose , $5.);
            %DMNORMIP( _ARBFMT_15);
           IF _ARBFMT_15 IN ('LEASE' ) THEN DO;
             _NODE_  =                   62;
             _LEAF_  =                   29;
             P_bad0  =     0.77884615384615;
             P_bad1  =     0.22115384615384;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   63;
             _LEAF_  =                   30;
             P_bad0  =     0.91479820627802;
             P_bad1  =     0.08520179372197;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         ELSE DO;
           IF  NOT MISSING(ltv ) AND 
                            131.5 <= ltv  THEN DO;
             _NODE_  =                   65;
             _LEAF_  =                   32;
             P_bad0  =     0.79166666666666;
             P_bad1  =     0.20833333333333;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   64;
             _LEAF_  =                   31;
             P_bad0  =     0.96962025316455;
             P_bad1  =     0.03037974683544;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       ELSE DO;
         IF  NOT MISSING(tot_open_tr ) AND 
           tot_open_tr  <                  1.5 THEN DO;
           _NODE_  =                   42;
           _LEAF_  =                   26;
           P_bad0  =                  0.9;
           P_bad1  =                  0.1;
           I_bad  = '0' ;
           U_bad  =                    0;
           END;
         ELSE DO;
           IF  NOT MISSING(tot_derog ) AND 
                              1.5 <= tot_derog  THEN DO;
             _NODE_  =                   61;
             _LEAF_  =                   28;
             P_bad0  =      0.9047619047619;
             P_bad1  =     0.09523809523809;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   60;
             _LEAF_  =                   27;
             P_bad0  =     0.98779134295227;
             P_bad1  =     0.01220865704772;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       END;
     ELSE DO;
       _NODE_  =                   15;
       _LEAF_  =                   33;
       P_bad0  =     0.75925925925925;
       P_bad1  =     0.24074074074074;
       I_bad  = '0' ;
       U_bad  =                    0;
       END;
     END;
   ELSE DO;
     IF  NOT MISSING(tot_rev_line ) AND 
                     6453.5 <= tot_rev_line  THEN DO;
       IF  NOT MISSING(ltv ) AND 
                        122.5 <= ltv  THEN DO;
         _NODE_  =                   27;
         _LEAF_  =                   25;
         P_bad0  =      0.7471264367816;
         P_bad1  =     0.25287356321839;
         I_bad  = '0' ;
         U_bad  =                    0;
         END;
       ELSE DO;
         IF  NOT MISSING(tot_derog ) AND 
                            9.5 <= tot_derog  THEN DO;
           _NODE_  =                   41;
           _LEAF_  =                   24;
           P_bad0  =     0.53846153846153;
           P_bad1  =     0.46153846153846;
           I_bad  = '0' ;
           U_bad  =                    0;
           END;
         ELSE DO;
           IF  NOT MISSING(tot_rev_line ) AND 
             tot_rev_line  <                16694 THEN DO;
             _NODE_  =                   58;
             _LEAF_  =                   22;
             P_bad0  =     0.84235294117647;
             P_bad1  =     0.15764705882352;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   59;
             _LEAF_  =                   23;
             P_bad0  =     0.92375366568914;
             P_bad1  =     0.07624633431085;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       END;
     ELSE DO;
       IF  NOT MISSING(ltv ) AND 
         ltv  <                133.5 THEN DO;
         IF  NOT MISSING(tot_income ) AND 
           tot_income  <               2377.2 THEN DO;
           IF  NOT MISSING(age_oldest_tr ) AND 
             age_oldest_tr  <                   57 THEN DO;
             _NODE_  =                   54;
             _LEAF_  =                   17;
             P_bad0  =                 0.75;
             P_bad1  =                 0.25;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   55;
             _LEAF_  =                   18;
             P_bad0  =     0.93150684931506;
             P_bad1  =     0.06849315068493;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         ELSE DO;
           IF  NOT MISSING(ltv ) AND 
             ltv  <                 94.5 THEN DO;
             _NODE_  =                   56;
             _LEAF_  =                   19;
             P_bad0  =     0.86301369863013;
             P_bad1  =     0.13698630136986;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   57;
             _LEAF_  =                   20;
             P_bad0  =     0.67766497461928;
             P_bad1  =     0.32233502538071;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       ELSE DO;
         _NODE_  =                   25;
         _LEAF_  =                   21;
         P_bad0  =      0.4090909090909;
         P_bad1  =     0.59090909090909;
         I_bad  = '1' ;
         U_bad  =                    1;
         END;
       END;
     END;
   END;
 ELSE DO;
   IF  NOT MISSING(ltv ) AND 
     ltv  <                 97.5 THEN DO;
     IF  NOT MISSING(bureau_score ) AND 
       bureau_score  <                639.5 THEN DO;
       IF  NOT MISSING(tot_open_tr ) AND 
                          3.5 <= tot_open_tr  THEN DO;
         IF  NOT MISSING(tot_income ) AND 
           tot_income  <             2604.165 THEN DO;
           _NODE_  =                   32;
           _LEAF_  =                    3;
           P_bad0  =     0.54237288135593;
           P_bad1  =     0.45762711864406;
           I_bad  = '0' ;
           U_bad  =                    0;
           END;
         ELSE DO;
           IF  NOT MISSING(tot_income ) AND 
                             7375 <= tot_income  THEN DO;
             _NODE_  =                   47;
             _LEAF_  =                    5;
             P_bad0  =     0.57575757575757;
             P_bad1  =     0.42424242424242;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   46;
             _LEAF_  =                    4;
             P_bad0  =     0.81102362204724;
             P_bad1  =     0.18897637795275;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       ELSE DO;
         IF  NOT MISSING(tot_rev_line ) AND 
           tot_rev_line  <                 2460 THEN DO;
           _NODE_  =                   30;
           _LEAF_  =                    1;
           P_bad0  =     0.69411764705882;
           P_bad1  =     0.30588235294117;
           I_bad  = '0' ;
           U_bad  =                    0;
           END;
         ELSE DO;
           _NODE_  =                   31;
           _LEAF_  =                    2;
           P_bad0  =     0.34343434343434;
           P_bad1  =     0.65656565656565;
           I_bad  = '1' ;
           U_bad  =                    1;
           END;
         END;
       END;
     ELSE DO;
       IF  NOT MISSING(tot_income ) AND 
         tot_income  <             9291.835 THEN DO;
         IF  NOT MISSING(tot_tr ) AND 
                           13.5 <= tot_tr  THEN DO;
           _NODE_  =                   35;
           _LEAF_  =                    8;
           P_bad0  =     0.94039735099337;
           P_bad1  =     0.05960264900662;
           I_bad  = '0' ;
           U_bad  =                    0;
           END;
         ELSE DO;
           IF  NOT MISSING(bureau_score ) AND 
             bureau_score  <                646.5 THEN DO;
             _NODE_  =                   48;
             _LEAF_  =                    6;
             P_bad0  =                    1;
             P_bad1  =                    0;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   49;
             _LEAF_  =                    7;
             P_bad0  =     0.73604060913705;
             P_bad1  =     0.26395939086294;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       ELSE DO;
         _NODE_  =                   19;
         _LEAF_  =                    9;
         P_bad0  =     0.35714285714285;
         P_bad1  =     0.64285714285714;
         I_bad  = '1' ;
         U_bad  =                    1;
         END;
       END;
     END;
   ELSE DO;
     IF  NOT MISSING(tot_rev_line ) AND 
       tot_rev_line  <               1218.5 THEN DO;
       IF  NOT MISSING(age_oldest_tr ) AND 
                          115 <= age_oldest_tr  THEN DO;
         _NODE_  =                   21;
         _LEAF_  =                   11;
         P_bad0  =     0.60273972602739;
         P_bad1  =      0.3972602739726;
         I_bad  = '0' ;
         U_bad  =                    0;
         END;
       ELSE DO;
         _NODE_  =                   20;
         _LEAF_  =                   10;
         P_bad0  =     0.28776978417266;
         P_bad1  =     0.71223021582733;
         I_bad  = '1' ;
         U_bad  =                    1;
         END;
       END;
     ELSE DO;
       IF  NOT MISSING(bureau_score ) AND 
         bureau_score  <                  566 THEN DO;
         _NODE_  =                   22;
         _LEAF_  =                   12;
         P_bad0  =     0.15384615384615;
         P_bad1  =     0.84615384615384;
         I_bad  = '1' ;
         U_bad  =                    1;
         END;
       ELSE DO;
         IF  NOT MISSING(tot_rev_line ) AND 
                        13717.5 <= tot_rev_line  THEN DO;
           IF  NOT MISSING(tot_rev_debt ) AND 
             tot_rev_debt  <                11884 THEN DO;
             _NODE_  =                   52;
             _LEAF_  =                   15;
             P_bad0  =     0.85869565217391;
             P_bad1  =     0.14130434782608;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           ELSE DO;
             _NODE_  =                   53;
             _LEAF_  =                   16;
             P_bad0  =     0.65972222222222;
             P_bad1  =     0.34027777777777;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         ELSE DO;
           IF  NOT MISSING(ltv ) AND 
             ltv  <                 99.5 THEN DO;
             _NODE_  =                   50;
             _LEAF_  =                   13;
             P_bad0  =     0.41489361702127;
             P_bad1  =     0.58510638297872;
             I_bad  = '1' ;
             U_bad  =                    1;
             END;
           ELSE DO;
             _NODE_  =                   51;
             _LEAF_  =                   14;
             P_bad0  =     0.61769352290679;
             P_bad1  =      0.3823064770932;
             I_bad  = '0' ;
             U_bad  =                    0;
             END;
           END;
         END;
       END;
     END;
   END;
 
 ****************************************************************;
 ******          END OF DECISION TREE SCORING CODE         ******;
 ****************************************************************;

Written by statcompute

June 30, 2012 at 1:32 pm

A SAS Macro Implementing Monotonic WOE Transformation in Scorecard Development

This SAS macro is designed for model developers to perform univariate variable importance ranking and monotonic weight of evidence (WOE) transformation for potentially hundreds of predictors in scorecard development. Please feel free to use or distribute it at your own risk. I would really appreciate it if you could share with me your success stories of using this macro in model development.
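
As a rough illustration of the quantities involved, the Python sketch below computes the bad rate, WOE and information value per bin using the usual definitions, WOE = log(% of bads / % of goods) and IV = (% of bads - % of goods) * WOE, and checks whether the bad rate is strictly monotonic across bins. The bin counts and the helper names `woe_table` and `is_monotonic` are made up for the example. The macro itself searches for the binning with PROC RANK and a Spearman correlation check.

```python
import math

def woe_table(bins):
    """bins: list of (freq, bad) per bin, ordered by the predictor's value."""
    total = sum(freq for freq, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    total_good = total - total_bad
    rows = []
    for freq, bad in bins:
        if bad == 0 or bad == freq:
            # WOE is undefined when a bin holds only goods or only bads.
            raise ValueError("every bin needs at least one bad and one good")
        bpct = bad / total_bad             # share of all bads in this bin
        gpct = (freq - bad) / total_good   # share of all goods in this bin
        woe = math.log(bpct / gpct)
        rows.append({"bad_rate": bad / freq, "woe": woe, "iv": (bpct - gpct) * woe})
    return rows

def is_monotonic(values):
    """True when the sequence is strictly increasing or strictly decreasing."""
    increasing = all(a < b for a, b in zip(values, values[1:]))
    decreasing = all(a > b for a, b in zip(values, values[1:]))
    return increasing or decreasing

# Hypothetical bins of 100 accounts each, ordered from low to high predictor values.
rows = woe_table([(100, 40), (100, 25), (100, 10)])
print(is_monotonic([r["bad_rate"] for r in rows]))
print(sum(r["iv"] for r in rows))
```

With tie-free bad rates, a strictly monotonic sequence corresponds to a Spearman correlation of +1 or -1 between bin order and bad rate, which is the stopping condition the macro tests for.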

%macro num_woe(data = , y = , x = );
***********************************************************;
* THE SAS MACRO IS TO PERFORM UNIVARIATE IMPORTANCE RANK  *;
* ORDER AND MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION   *;
* FOR NUMERIC ATTRIBUTES IN PRE-MODELING DATA PROCESSING  *;
* (IT IS RECOMMENDED TO RUN THIS MACRO IN THE BATCH MODE) *;
* ======================================================= *;
* PARAMETERS:                                             *;
*  DATA: INPUT SAS DATA TABLE                             *;
*  Y   : RESPONSE VARIABLE WITH 0/1 VALUE                 *;
*  X   : A LIST OF NUMERIC ATTRIBUTES                     *;
* ======================================================= *;
* OUTPUTS:                                                *;
*  MONO_WOE.WOE: A FILE OF WOE TRANSFORMATION RECODING    *;
*  MONO_WOE.FMT: A FILE OF BINNING FORMAT                 *;
*  MONO_WOE.PUT: A FILE OF PUT STATEMENTS FOR *.FMT FILE  *;
*  MONO_WOE.SUM: A FILE WITH PREDICTABILITY SUMMARY       *;
*  MONO_WOE.OUT: A FILE WITH STATISTICAL DETAILS          *;
*  MONO_WOE.IMP: A FILE OF MISSING IMPUTATION RECODING    *;
* ======================================================= *;
* CONTACT:                                                *;
*  WENSUI.LIU@53.COM                                      *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150;

*** DEFAULT PARAMETERS ***;

%local maxbin minbad miniv bignum;

%let maxbin = 100;

%let minbad = 50;

%let miniv  = 0.03;

%let bignum = 1e300;

***********************************************************;
***         DO NOT CHANGE CODES BELOW THIS LINE         ***;
***********************************************************;

*** DEFAULT OUTPUT FILES ***;

* WOE RECODING FILE                     *;
filename woefile "MONO_WOE.WOE";

* FORMAT FOR BINNING                    *;
filename fmtfile "MONO_WOE.FMT";

* PUT STATEMENT TO USE FORMAT           *;
filename binfile "MONO_WOE.PUT";

* KS SUMMARY                            *;
filename sumfile "MONO_WOE.SUM";
 
* STATISTICAL SUMMARY FOR EACH VARIABLE *;
filename outfile "MONO_WOE.OUT";

* IMPUTE RECODING FILE                  *;
filename impfile "MONO_WOE.IMP";

*** A MACRO TO DELETE FILE ***;
%macro dfile(file = );
  data _null_;
    rc = fdelete("&file");
    if rc = 0 then do;
      put @1 50 * "+";
      put "THE EXISTING OUTPUT FILE HAS BEEN DELETED.";
      put @1 50 * "+";
    end;
  run;
%mend dfile;

*** CLEAN UP FILES ***;
%dfile(file = woefile);

%dfile(file = fmtfile);

%dfile(file = binfile);

%dfile(file = sumfile);

%dfile(file = outfile);

%dfile(file = impfile);

*** PARSING THE STRING OF NUMERIC PREDICTORS ***;
ods listing close;
ods output position = _pos1;
proc contents data = &data varnum;
run;

proc sql noprint;
  select
    upcase(variable) into :x2 separated by ' '
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;


  select
    count(variable) into :xcnt
  from
    _pos1
  where
    compress(upcase(type), ' ') = 'NUM' and
    index("%upcase(%sysfunc(compbl(&x)))", compress(upcase(variable), ' ')) > 0;
quit;

data _tmp1;
  retain &x2 &y;
  set &data;
  where &Y in (1, 0);
  keep &x2 &y;
run;

ods output position = _pos2;
proc contents data = _tmp1 varnum;
run;

*** LOOP THROUGH EACH PREDICTOR ***;
%do i = 1 %to &xcnt;
    
  proc sql noprint;
    select
      upcase(variable) into :var
    from
      _pos2
    where
      num= &i;

    select
      count(distinct &var) into :xflg
    from
      _tmp1
    where
      &var ~= .;
  quit;

  proc summary data = _tmp1 nway;
    output out  = _med(drop = _type_ _freq_)
    median(&var) = med nmiss(&var) = mis;
  run;
  
  proc sql noprint;
    select
      med into :median
    from
      _med;

    select
      mis into :nmiss
    from
      _med;

    select 
      case when count(&y) = sum(&y) then 1 else 0 end into :mis_flg1
    from
      _tmp1
    where
      &var = .;

    select
      case when sum(&y) = 0 then 1 else 0 end into :mis_flg2
    from
      _tmp1
    where
      &var = .;
  quit;

  %let nbin = %sysfunc(min(&maxbin, &xflg));

  *** CHECK IF THE NUMBER OF DISTINCT VALUES > 1 ***;
  %if &xflg > 1 %then %do;

    *** IMPUTE MISSING VALUES WHEN WOE CANNOT BE CALCULATED ***;
    %if &mis_flg1 = 1 | &mis_flg2 = 1 %then %do;
      data _null_;
        file impfile mod;
        put " ";
        put @3 "*** MEDIAN IMPUTATION OF %TRIM(%UPCASE(&VAR)) (NMISS = %trim(&nmiss)) ***;";
        put @3 "IF %TRIM(%UPCASE(&VAR)) = . THEN %TRIM(%UPCASE(&VAR)) = &MEDIAN;";
      run;

      data _tmp1;
        set _tmp1;
        if &var = . then &var = &median;
      run; 
    %end;      
      
    *** LOOP THROUGH THE NUMBER OF BINS ***;
    %do j = &nbin %to 2 %by -1;
      proc rank data = _tmp1 groups = &j out = _tmp2(keep = &y &var rank);
        var &var;
        ranks rank;
      run;

      proc summary data = _tmp2 nway missing;
        class rank;
        output out = _tmp3(drop = _type_ rename = (_freq_ = freq))
        sum(&y)   = bad    mean(&y)  = bad_rate
        min(&var) = minx   max(&var) = maxx;
      run;

      *** CREATE FLAGS FOR MULTIPLE CRITERIA ***;
      proc sql noprint;
        select
          case when min(bad) >= &minbad then 1 else 0 end into :badflg
        from
          _tmp3;

        select
          case when min(bad_rate) > 0 then 1 else 0 end into :minflg
        from
          _tmp3;

        select
          case when max(bad_rate) < 1 then 1 else 0 end into :maxflg
        from
          _tmp3;              
      quit;

      *** CHECK IF SPEARMAN CORRELATION = 1 ***;
      %if &badflg = 1 & &minflg = 1 & &maxflg = 1 %then %do;
        ods output spearmancorr = _corr(rename = (minx = cor));
        proc corr data = _tmp3 spearman;
          var minx;
          with bad_rate;
        run;

        proc sql noprint;
          select
            case when abs(cor) = 1 then 1 else 0 end into :cor
          from
            _corr;
        quit;

        *** IF SPEARMAN CORR = 1 THEN BREAK THE LOOP ***;
        %if &cor = 1 %then %goto loopout;
      %end;
      %else %if &j = 2 %then %goto exit;
    %end;

    %loopout:
    
    *** CALCULATE STATISTICAL SUMMARY ***;
    proc sql noprint;
      select 
        sum(freq) into :freq
      from
        _tmp3;

      select
        sum(bad) into :bad
      from
        _tmp3;
    quit;

    proc sort data = _tmp3 sortsize = max;
      by rank;
    run;

    data _tmp4;
      retain bin minx maxx bad freq pct bad_rate;
      set _tmp3 end = eof;
      by rank;

      if rank = . then bin = 0;
      else do;
        retain b 0;
        bin + 1;
      end;
  
      pct  = freq / &freq;
      bpct = bad / &bad;
      gpct = (freq - bad) / (&freq - &bad);
      woe  = log(bpct / gpct);
      iv   = (bpct - gpct) * woe;

      retain cum_bpct cum_gpct;
      cum_bpct + bpct;
      cum_gpct + gpct;
      ks = abs(cum_gpct - cum_bpct) * 100;

      retain iv_sum ks_max;
      iv_sum + iv;
      ks_max = max(ks_max, ks);
      if eof then do;
        call symput("bin", put(bin, 4.));
        call symput("ks", put(ks_max, 10.4));
        call symput("iv", put(iv_sum, 10.4));
      end;

      keep bin minx maxx bad freq pct bad_rate
           gpct bpct woe iv cum_gpct cum_bpct ks;
    run;

    *** REPORT STATISTICAL SUMMARY ***;
    proc printto print = outfile;
    run;

    title;
    ods listing;
    proc report data = _tmp4 spacing = 1 split = "*" headline nowindows;
      column(" * MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR %upcase(%trim(&var))"
             bin minx maxx freq pct bad bad_rate woe iv ks);

      define bin      /"BIN*LEVEL"   width = 5  format = z3. order order = data;
      define minx     /"LOWER*LIMIT" width = 15 format = 14.4;
      define maxx     /"UPPER*LIMIT" width = 15 format = 14.4;
      define bad      /"#BADS*(Y=1)" width = 8  format = 7.;
      define freq     /"#FREQ"       width = 10 format = 9.;
      define pct      /"PERCENT"     width = 8  format = percent8.2;
      define bad_rate /"BAD*RATE"    width = 8  format = percent8.2;
      define woe      /"WOE"         width = 10 format = 9.4;
      define iv       /"INFO.*VALUE" width = 10 format = 9.4;
      define ks       /"KS"          width = 10 format = 9.4;
      compute after;
        line @1 110 * "-";
        line @5 "# TOTAL = %trim(&freq), # BADs(Y=1) = %trim(&bad), "
                "OVERALL BAD RATE = %trim(%sysfunc(round(&bad / &freq * 100, 0.0001)))%, "
                "MAX. KS = %trim(&ks), INFO. VALUE = %trim(&iv).";
        line @1 110 * "-";    
      endcomp;
    run;
    ods listing close;

    proc printto;
    run;

    proc sql noprint;
      select
        case when sum(iv) >= &miniv then 1 else 0 end into :ivflg
      from
        _tmp4;
    quit;

    *** OUTPUT RECODING FILES IF IV >= &miniv BY DEFAULT ***;
    %if &ivflg = 1 %then %do;
      data _tmp5;
        length upper $20 lower $20;
        * LOWER IS ASSIGNED BEFORE THE SET STATEMENT, SO IT HOLDS MAXX OF THE PREVIOUS BIN *;
        lower = compress(put(maxx, 20.4), ' ');

        set _tmp4 end = eof;
        upper = compress(put(maxx, 20.4), ' ');
        if bin = 1 then lower = "-%trim(&bignum)";
        if eof then upper = "%trim(&bignum)";
        w%trim(&var) = compress(put(woe, 12.8), ' ');
      run;

      *** OUTPUT WOE RECODE FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file woefile mod;

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3  "if %trim(&var) = . then w%trim(&var) = " + 1 w%trim(&var) ";";
        end;
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " WOE RECODE OF %upcase(%trim(&var)) (KS = %trim(&ks), IV = %trim(&iv))"
                 + 1 3 * "*" ";";
          put @3 "if " + 1 lower " < %trim(&var) <= " upper
                 " then w%trim(&var) = " + 1 w%trim(&var) ";";
        end;
        if _n_ > 1 then do;
          put @5 "else if " + 1 lower " < %trim(&var) <= " upper
                 " then w%trim(&var) = " + 1 w%trim(&var) ";";
        end;
        if eof then do;
          put @5 "else w%trim(&var) = 0;";
        end;
      run;

      *** OUTPUT BINNING FORMAT FILE ***;
      data _null_;
        set _tmp5 end = eof;
        file fmtfile mod;

        if bin = 1 then lower = "LOW";
        if eof then upper = "HIGH";

        if bin = 0 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 ". " @40 " = '" bin: z3.
                 ". MISSINGS'";
        end;

            
        if bin = 1 and _n_ = 1 then do;
          put " ";
          put @3 3 * "*"
                 " BINNING FORMAT OF %trim(&var) (KS = %trim(&ks), IV = %trim(&IV))"
              + 1 3 * "*" ";";
          put @3 "value %trim(&var)_fmt";
          put @5 lower @15 " - " + 1 upper  @40 " = '" bin: z3.
                 ". " + 1 lower " - " + 1 upper "'";
        end;

        if _n_ > 1 then do;
          put @5 lower @15 "<- " + 1 upper @40 " = '" bin: z3.
                 ". " + 1 lower "<- " + 1 upper "'";
        end;
        if eof then do;
          put @5 "OTHER" @40 " = '999 .  OTHERS';";
        end;
      run;

      *** OUTPUT BINNING RECODE FILE ***;
      data _null_;
        file binfile mod;
        put " ";
        put @3 "*** BINNING RECODE of %trim(&var) ***;";
        put @3 "c%trim(&var) = put(%trim(&var), %trim(&var)_fmt.);";
      run;

      *** SAVE SUMMARY OF EACH VARIABLE INTO A TABLE ***;
      %if %sysfunc(exist(work._result)) %then %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          if _n_ = 1 then do;
            variable = "%trim(&var)";
            bin      = &bin;
            ks       = &ks;
            iv       = &iv;
            output;
          end;
          set _result;
          output;
        run;
      %end;
      %else %do;
        data _result;
          format variable $32. bin 3. ks 10.4 iv 10.4;
          variable = "%trim(&var)";
          bin      = &bin;
          ks       = &ks;
          iv       = &iv;
        run;        
      %end;
    %end;

    %exit:

    *** CLEAN UP TEMPORARY TABLES ***;
    proc datasets library = work nolist;
      delete _tmp2 - _tmp5 _corr / memtype = data;
    run;
    quit;
  %end;    
%end;

*** SORT VARIABLES BY KS AND OUTPUT RESULTS ***;
proc sort data = _result sortsize = max;
  by descending ks descending iv;
run;

data _null_;
  set _result end = eof;
  file sumfile;

  if _n_ = 1 then do;
    put @1 80 * "-";
    put @1  "| RANK" @10 "| VARIABLE RANKED BY KS" @45 "| # BINS"
        @55 "|  KS"  @66 "| INFO. VALUE" @80 "|";
    put @1 80 * "-";
  end;
  put @1  "| " @4  _n_ z3. @10 "| " @12 variable @45 "| " @50 bin
      @55 "| " @57 ks      @66 "| " @69 iv       @80 "|";
  if eof then do;
    put @1 80 * "-";
  end;
run;

proc datasets library = work nolist;
  delete _result (mt = data);
run;
quit;

*********************************************************;
*           END OF NUM_WOE MACRO                        *;
*********************************************************;
%mend num_woe;

libname data 'D:\SAS_CODE\woe';

%let x = 
tot_derog
tot_tr
age_oldest_tr
tot_open_tr
tot_rev_tr
tot_rev_debt
tot_rev_line
rev_util
bureau_score
ltv
tot_income
;

%num_woe(data = data.accepts, y = bad, x = &x);

The macro above automatically generates 6 standard output files, each serving a different purpose in the process of scorecard development.

1) “MONO_WOE.WOE” is a file with the recoding statements for the WOE transformation.

 
  *** WOE RECODE OF TOT_DEROG (KS = 20.0442, IV = 0.2480) ***;
  if TOT_DEROG = . then wTOT_DEROG =  0.64159782 ;
    else if  -1e300  < TOT_DEROG <= 0.0000  then wTOT_DEROG =  -0.55591373 ;
    else if  0.0000  < TOT_DEROG <= 2.0000  then wTOT_DEROG =  0.14404414 ;
    else if  2.0000  < TOT_DEROG <= 4.0000  then wTOT_DEROG =  0.50783799 ;
    else if  4.0000  < TOT_DEROG <= 1e300  then wTOT_DEROG =  0.64256014 ;
    else wTOT_DEROG = 0;
 
  *** WOE RECODE OF TOT_TR (KS = 16.8344, IV = 0.1307) ***;
  if TOT_TR = . then wTOT_TR =  0.64159782 ;
    else if  -1e300  < TOT_TR <= 7.0000  then wTOT_TR =  0.40925900 ;
    else if  7.0000  < TOT_TR <= 12.0000  then wTOT_TR =  0.26386662 ;
    else if  12.0000  < TOT_TR <= 18.0000  then wTOT_TR =  -0.13512611 ;
    else if  18.0000  < TOT_TR <= 25.0000  then wTOT_TR =  -0.40608173 ;
    else if  25.0000  < TOT_TR <= 1e300  then wTOT_TR =  -0.42369090 ;
    else wTOT_TR = 0;
 
  *** WOE RECODE OF AGE_OLDEST_TR (KS = 19.6163, IV = 0.2495) ***;
  if AGE_OLDEST_TR = . then wAGE_OLDEST_TR =  0.66280002 ;
    else if  -1e300  < AGE_OLDEST_TR <= 46.0000  then wAGE_OLDEST_TR =  0.66914925 ;
    else if  46.0000  < AGE_OLDEST_TR <= 77.0000  then wAGE_OLDEST_TR =  0.36328349 ;
    else if  77.0000  < AGE_OLDEST_TR <= 114.0000  then wAGE_OLDEST_TR =  0.15812827 ;
    else if  114.0000  < AGE_OLDEST_TR <= 137.0000  then wAGE_OLDEST_TR =  0.01844301 ;
    else if  137.0000  < AGE_OLDEST_TR <= 164.0000  then wAGE_OLDEST_TR =  -0.04100445 ;
    else if  164.0000  < AGE_OLDEST_TR <= 204.0000  then wAGE_OLDEST_TR =  -0.32667232 ;
    else if  204.0000  < AGE_OLDEST_TR <= 275.0000  then wAGE_OLDEST_TR =  -0.79931317 ;
    else if  275.0000  < AGE_OLDEST_TR <= 1e300  then wAGE_OLDEST_TR =  -0.89926463 ;
    else wAGE_OLDEST_TR = 0;
 
  *** WOE RECODE OF TOT_REV_TR (KS = 9.0779, IV = 0.0757) ***;
  if TOT_REV_TR = . then wTOT_REV_TR =  0.69097090 ;
    else if  -1e300  < TOT_REV_TR <= 1.0000  then wTOT_REV_TR =  0.00269270 ;
    else if  1.0000  < TOT_REV_TR <= 3.0000  then wTOT_REV_TR =  -0.14477602 ;
    else if  3.0000  < TOT_REV_TR <= 1e300  then wTOT_REV_TR =  -0.15200275 ;
    else wTOT_REV_TR = 0;
 
  *** WOE RECODE OF TOT_REV_DEBT (KS = 8.5317, IV = 0.0629) ***;
  if TOT_REV_DEBT = . then wTOT_REV_DEBT =  0.68160936 ;
    else if  -1e300  < TOT_REV_DEBT <= 3009.0000  then wTOT_REV_DEBT =  0.04044249 ;
    else if  3009.0000  < TOT_REV_DEBT <= 1e300  then wTOT_REV_DEBT =  -0.19723686 ;
    else wTOT_REV_DEBT = 0;
 
  *** WOE RECODE OF TOT_REV_LINE (KS = 25.5174, IV = 0.3970) ***;
  if TOT_REV_LINE = . then wTOT_REV_LINE =  0.68160936 ;
    else if  -1e300  < TOT_REV_LINE <= 1477.0000  then wTOT_REV_LINE =  0.73834416 ;
    else if  1477.0000  < TOT_REV_LINE <= 4042.0000  then wTOT_REV_LINE =  0.34923628 ;
    else if  4042.0000  < TOT_REV_LINE <= 8350.0000  then wTOT_REV_LINE =  0.11656236 ;
    else if  8350.0000  < TOT_REV_LINE <= 14095.0000  then wTOT_REV_LINE =  0.03996934 ;
    else if  14095.0000  < TOT_REV_LINE <= 23419.0000  then wTOT_REV_LINE =  -0.49492745 ;
    else if  23419.0000  < TOT_REV_LINE <= 38259.0000  then wTOT_REV_LINE =  -0.94090721 ;
    else if  38259.0000  < TOT_REV_LINE <= 1e300  then wTOT_REV_LINE =  -1.22174118 ;
    else wTOT_REV_LINE = 0;
 
  *** WOE RECODE OF REV_UTIL (KS = 14.3262, IV = 0.0834) ***;
  if  -1e300  < REV_UTIL <= 29.0000  then wREV_UTIL =  -0.31721190 ;
    else if  29.0000  < REV_UTIL <= 1e300  then wREV_UTIL =  0.26459777 ;
    else wREV_UTIL = 0;
 
  *** WOE RECODE OF BUREAU_SCORE (KS = 34.1481, IV = 0.7251) ***;
  if BUREAU_SCORE = . then wBUREAU_SCORE =  0.66280002 ;
    else if  -1e300  < BUREAU_SCORE <= 653.0000  then wBUREAU_SCORE =  0.93490359 ;
    else if  653.0000  < BUREAU_SCORE <= 692.0000  then wBUREAU_SCORE =  0.07762676 ;
    else if  692.0000  < BUREAU_SCORE <= 735.0000  then wBUREAU_SCORE =  -0.58254635 ;
    else if  735.0000  < BUREAU_SCORE <= 1e300  then wBUREAU_SCORE =  -1.61790566 ;
    else wBUREAU_SCORE = 0;
 
  *** WOE RECODE OF LTV (KS = 16.3484, IV = 0.1625) ***;
  if  -1e300  < LTV <= 82.0000  then wLTV =  -0.84674934 ;
    else if  82.0000  < LTV <= 91.0000  then wLTV =  -0.43163689 ;
    else if  91.0000  < LTV <= 97.0000  then wLTV =  -0.14361551 ;
    else if  97.0000  < LTV <= 101.0000  then wLTV =  0.08606320 ;
    else if  101.0000  < LTV <= 107.0000  then wLTV =  0.18554122 ;
    else if  107.0000  < LTV <= 115.0000  then wLTV =  0.22405397 ;
    else if  115.0000  < LTV <= 1e300  then wLTV =  0.51906325 ;
    else wLTV = 0;

2) “MONO_WOE.FMT” is a file with the binning formats.

 
  *** BINNING FORMAT OF TOT_DEROG (KS = 20.0442, IV = 0.2480) ***;
  value TOT_DEROG_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  0.0000                = '001 .  LOW <-  0.0000 '
    0.0000    <-  2.0000                = '002 .  0.0000 <-  2.0000 '
    2.0000    <-  4.0000                = '003 .  2.0000 <-  4.0000 '
    4.0000    <-  HIGH                  = '004 .  4.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_TR (KS = 16.8344, IV = 0.1307) ***;
  value TOT_TR_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  7.0000                = '001 .  LOW <-  7.0000 '
    7.0000    <-  12.0000               = '002 .  7.0000 <-  12.0000 '
    12.0000   <-  18.0000               = '003 .  12.0000 <-  18.0000 '
    18.0000   <-  25.0000               = '004 .  18.0000 <-  25.0000 '
    25.0000   <-  HIGH                  = '005 .  25.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF AGE_OLDEST_TR (KS = 19.6163, IV = 0.2495) ***;
  value AGE_OLDEST_TR_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  46.0000               = '001 .  LOW <-  46.0000 '
    46.0000   <-  77.0000               = '002 .  46.0000 <-  77.0000 '
    77.0000   <-  114.0000              = '003 .  77.0000 <-  114.0000 '
    114.0000  <-  137.0000              = '004 .  114.0000 <-  137.0000 '
    137.0000  <-  164.0000              = '005 .  137.0000 <-  164.0000 '
    164.0000  <-  204.0000              = '006 .  164.0000 <-  204.0000 '
    204.0000  <-  275.0000              = '007 .  204.0000 <-  275.0000 '
    275.0000  <-  HIGH                  = '008 .  275.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_REV_TR (KS = 9.0779, IV = 0.0757) ***;
  value TOT_REV_TR_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  1.0000                = '001 .  LOW <-  1.0000 '
    1.0000    <-  3.0000                = '002 .  1.0000 <-  3.0000 '
    3.0000    <-  HIGH                  = '003 .  3.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_REV_DEBT (KS = 8.5317, IV = 0.0629) ***;
  value TOT_REV_DEBT_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  3009.0000             = '001 .  LOW <-  3009.0000 '
    3009.0000 <-  HIGH                  = '002 .  3009.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF TOT_REV_LINE (KS = 25.5174, IV = 0.3970) ***;
  value TOT_REV_LINE_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  1477.0000             = '001 .  LOW <-  1477.0000 '
    1477.0000 <-  4042.0000             = '002 .  1477.0000 <-  4042.0000 '
    4042.0000 <-  8350.0000             = '003 .  4042.0000 <-  8350.0000 '
    8350.0000 <-  14095.0000            = '004 .  8350.0000 <-  14095.0000 '
    14095.0000<-  23419.0000            = '005 .  14095.0000 <-  23419.0000 '
    23419.0000<-  38259.0000            = '006 .  23419.0000 <-  38259.0000 '
    38259.0000<-  HIGH                  = '007 .  38259.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  **BINNING FORMAT OF REV_UTIL (KS = 14.3262, IV = 0.0834) ***;
  value REV_UTIL_fmt
    LOW        -  29.0000               = '001 .  LOW  -  29.0000 '
    29.0000   <-  HIGH                  = '002 .  29.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  *** BINNING FORMAT OF BUREAU_SCORE (KS = 34.1481, IV = 0.7251) ***;
  value BUREAU_SCORE_fmt
    .                                   = '000 . MISSINGS'
    LOW       <-  653.0000              = '001 .  LOW <-  653.0000 '
    653.0000  <-  692.0000              = '002 .  653.0000 <-  692.0000 '
    692.0000  <-  735.0000              = '003 .  692.0000 <-  735.0000 '
    735.0000  <-  HIGH                  = '004 .  735.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';
 
  **BINNING FORMAT OF LTV (KS = 16.3484, IV = 0.1625) ***;
  value LTV_fmt
    LOW        -  82.0000               = '001 .  LOW  -  82.0000 '
    82.0000   <-  91.0000               = '002 .  82.0000 <-  91.0000 '
    91.0000   <-  97.0000               = '003 .  91.0000 <-  97.0000 '
    97.0000   <-  101.0000              = '004 .  97.0000 <-  101.0000 '
    101.0000  <-  107.0000              = '005 .  101.0000 <-  107.0000 '
    107.0000  <-  115.0000              = '006 .  107.0000 <-  115.0000 '
    115.0000  <-  HIGH                  = '007 .  115.0000 <-  HIGH '
    OTHER                               = '999 .  OTHERS';

3) “MONO_WOE.PUT” is a file with the “put” statements that apply the formats in the above *.FMT file.

  *** BINNING RECODE of TOT_DEROG ***;
  cTOT_DEROG = put(TOT_DEROG, TOT_DEROG_fmt.);

  *** BINNING RECODE of TOT_TR ***;
  cTOT_TR = put(TOT_TR, TOT_TR_fmt.);

  *** BINNING RECODE of AGE_OLDEST_TR ***;
  cAGE_OLDEST_TR = put(AGE_OLDEST_TR, AGE_OLDEST_TR_fmt.);

  *** BINNING RECODE of TOT_REV_TR ***;
  cTOT_REV_TR = put(TOT_REV_TR, TOT_REV_TR_fmt.);

  *** BINNING RECODE of TOT_REV_DEBT ***;
  cTOT_REV_DEBT = put(TOT_REV_DEBT, TOT_REV_DEBT_fmt.);

  *** BINNING RECODE of TOT_REV_LINE ***;
  cTOT_REV_LINE = put(TOT_REV_LINE, TOT_REV_LINE_fmt.);

  *** BINNING RECODE of REV_UTIL ***;
  cREV_UTIL = put(REV_UTIL, REV_UTIL_fmt.);

  *** BINNING RECODE of BUREAU_SCORE ***;
  cBUREAU_SCORE = put(BUREAU_SCORE, BUREAU_SCORE_fmt.);

  *** BINNING RECODE of LTV ***;
  cLTV = put(LTV, LTV_fmt.);

4) “MONO_WOE.SUM” is a file summarizing the predictive power of all numeric variables in terms of KS statistics and Information Values.

--------------------------------------------------------------------------------
| RANK   | VARIABLE RANKED BY KS            | # BINS  |  KS      | INFO. VALUE |
--------------------------------------------------------------------------------
|  001   | BUREAU_SCORE                     |    4    | 34.1481  |  0.7251     |
|  002   | TOT_REV_LINE                     |    7    | 25.5174  |  0.3970     |
|  003   | TOT_DEROG                        |    4    | 20.0442  |  0.2480     |
|  004   | AGE_OLDEST_TR                    |    8    | 19.6163  |  0.2495     |
|  005   | TOT_TR                           |    5    | 16.8344  |  0.1307     |
|  006   | LTV                              |    7    | 16.3484  |  0.1625     |
|  007   | REV_UTIL                         |    2    | 14.3262  |  0.0834     |
|  008   | TOT_REV_TR                       |    3    | 9.0779   |  0.0757     |
|  009   | TOT_REV_DEBT                     |    2    | 8.5317   |  0.0629     |
--------------------------------------------------------------------------------

5) “MONO_WOE.OUT” is a file providing statistical summaries of all numeric variables.

                          MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_DEROG                          
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   000           .               .            213   3.65%        70  32.86%      0.6416     0.0178     2.7716
   001          0.0000          0.0000       2850  48.83%       367  12.88%     -0.5559     0.1268    20.0442
   002          1.0000          2.0000       1369  23.45%       314  22.94%      0.1440     0.0051    16.5222
   003          3.0000          4.0000        587  10.06%       176  29.98%      0.5078     0.0298    10.6623
   004          5.0000         32.0000        818  14.01%       269  32.89%      0.6426     0.0685     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 20.0442, INFO. VALUE = 0.2480.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                            MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_TR                           
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   000           .               .            213   3.65%        70  32.86%      0.6416     0.0178     2.7716
   001          0.0000          7.0000       1159  19.86%       324  27.96%      0.4093     0.0372    11.8701
   002          8.0000         12.0000       1019  17.46%       256  25.12%      0.2639     0.0131    16.8344
   003         13.0000         18.0000       1170  20.04%       215  18.38%     -0.1351     0.0035    14.2335
   004         19.0000         25.0000       1126  19.29%       165  14.65%     -0.4061     0.0281     7.3227
   005         26.0000         77.0000       1150  19.70%       166  14.43%     -0.4237     0.0310     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 16.8344, INFO. VALUE = 0.1307.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                        MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR AGE_OLDEST_TR                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   000           .               .            216   3.70%        72  33.33%      0.6628     0.0193     2.9173
   001          1.0000         46.0000        708  12.13%       237  33.47%      0.6691     0.0647    12.5847
   002         47.0000         77.0000        699  11.98%       189  27.04%      0.3633     0.0175    17.3983
   003         78.0000        114.0000        703  12.04%       163  23.19%      0.1581     0.0032    19.3917
   004        115.0000        137.0000        707  12.11%       147  20.79%      0.0184     0.0000    19.6163
   005        138.0000        164.0000        706  12.10%       140  19.83%     -0.0410     0.0002    19.1263
   006        165.0000        204.0000        689  11.80%       108  15.67%     -0.3267     0.0114    15.6376
   007        205.0000        275.0000        703  12.04%        73  10.38%     -0.7993     0.0597     8.1666
   008        276.0000        588.0000        706  12.10%        67   9.49%     -0.8993     0.0734     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 19.6163, INFO. VALUE = 0.2495.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_OPEN_TR                         
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   000           .               .           1416  24.26%       354  25.00%      0.2573     0.0173     6.7157
   001          0.0000          4.0000       1815  31.09%       353  19.45%     -0.0651     0.0013     4.7289
   002          5.0000          6.0000       1179  20.20%       226  19.17%     -0.0831     0.0014     3.0908
   003          7.0000         26.0000       1427  24.45%       263  18.43%     -0.1315     0.0041     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 6.7157, INFO. VALUE = 0.0240.                                            
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                          MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_REV_TR                         
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   000           .               .            636  10.90%       216  33.96%      0.6910     0.0623     9.0104
   001          0.0000          1.0000       1461  25.03%       300  20.53%      0.0027     0.0000     9.0779
   002          2.0000          3.0000       2002  34.30%       365  18.23%     -0.1448     0.0069     4.3237
   003          4.0000         24.0000       1738  29.78%       315  18.12%     -0.1520     0.0066     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 9.0779, INFO. VALUE = 0.0757.                                            
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_REV_DEBT                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   000           .               .            477   8.17%       161  33.75%      0.6816     0.0453     6.6527
   001          0.0000       3009.0000       2680  45.91%       567  21.16%      0.0404     0.0008     8.5317
   002       3010.0000      96260.0000       2680  45.91%       468  17.46%     -0.1972     0.0168     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 8.5317, INFO. VALUE = 0.0629.                                            
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_REV_LINE                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   000           .               .            477   8.17%       161  33.75%      0.6816     0.0453     6.6527
   001          0.0000       1477.0000        765  13.11%       268  35.03%      0.7383     0.0864    18.3518
   002       1481.0000       4042.0000        766  13.12%       205  26.76%      0.3492     0.0176    23.4043
   003       4044.0000       8350.0000        766  13.12%       172  22.45%      0.1166     0.0018    24.9867
   004       8360.0000      14095.0000        766  13.12%       162  21.15%      0.0400     0.0002    25.5174
   005      14100.0000      23419.0000        766  13.12%       104  13.58%     -0.4949     0.0276    19.9488
   006      23427.0000      38259.0000        766  13.12%        70   9.14%     -0.9409     0.0860    10.8049
   007      38300.0000     205395.0000        765  13.11%        54   7.06%     -1.2217     0.1320     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 25.5174, INFO. VALUE = 0.3970.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                           MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR REV_UTIL                          
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   001          0.0000         29.0000       2905  49.77%       459  15.80%     -0.3172     0.0454    14.3262
   002         30.0000        100.0000       2932  50.23%       737  25.14%      0.2646     0.0379     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 14.3262, INFO. VALUE = 0.0834.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                         MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR BUREAU_SCORE                        
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   000           .               .            315   5.40%       105  33.33%      0.6628     0.0282     4.2544
   001        443.0000        653.0000       1393  23.86%       552  39.63%      0.9349     0.2621    32.2871
   002        654.0000        692.0000       1368  23.44%       298  21.78%      0.0776     0.0014    34.1481
   003        693.0000        735.0000       1383  23.69%       174  12.58%     -0.5825     0.0670    22.6462
   004        736.0000        848.0000       1378  23.61%        67   4.86%     -1.6179     0.3664     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 34.1481, INFO. VALUE = 0.7251.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                             MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR LTV                             
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   001          0.0000         82.0000        814  13.95%        81   9.95%     -0.8467     0.0764     9.0214
   002         83.0000         91.0000        837  14.34%       120  14.34%     -0.4316     0.0234    14.4372
   003         92.0000         97.0000        811  13.89%       148  18.25%     -0.1436     0.0027    16.3484
   004         98.0000        101.0000        830  14.22%       182  21.93%      0.0861     0.0011    15.0935
   005        102.0000        107.0000        870  14.90%       206  23.68%      0.1855     0.0054    12.1767
   006        108.0000        115.0000        808  13.84%       197  24.38%      0.2241     0.0074     8.8704
   007        116.0000        176.0000        867  14.85%       262  30.22%      0.5191     0.0460     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 16.3484, INFO. VALUE = 0.1625.                                           
--------------------------------------------------------------------------------------------------------------                                        
                                                                                                             
                          MONOTONIC WEIGHT OF EVIDENCE TRANSFORMATION FOR TOT_INCOME                         
   BIN           LOWER           UPPER                        #BADS      BAD                 INFO.           
 LEVEL           LIMIT           LIMIT      #FREQ  PERCENT    (Y=1)     RATE        WOE      VALUE         KS
 ------------------------------------------------------------------------------------------------------------
   000           .               .              5   0.09%         1  20.00%     -0.0303     0.0000     0.0026
   001          0.0000       3397.0000       2913  49.91%       669  22.97%      0.1457     0.0111     7.5822
   002       3400.0000    8147166.6600       2919  50.01%       526  18.02%     -0.1591     0.0121     0.0000
--------------------------------------------------------------------------------------------------------------                                        
    # TOTAL = 5837, # BADs(Y=1) = 1196, OVERALL BAD RATE = 20.49%, MAX. KS = 7.5822, INFO. VALUE = 0.0231.                                            
--------------------------------------------------------------------------------------------------------------                                        

6) “MONO_WOE.IMP” is a file that imputes missing values when the number of bads or goods among the missing values is too small to calculate WOE.

  *** MEDIAN IMPUTATION OF LTV (NMISS = 1) ***;
  IF LTV = . THEN LTV =      100;
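
As a cross-check on the first seven-bin table above, the WOE, INFO. VALUE and KS columns can be recomputed from the #FREQ and #BADS counts. The sketch below is in Python rather than SAS. It assumes WOE = ln(share of bads / share of goods), information value = (share of bads - share of goods) * WOE, and KS = 100 * |cumulative share of bads - cumulative share of goods|. These formulas are inferred from the printed numbers, and the function name woe_table is hypothetical.

```python
import math

# (#FREQ, #BADS) per bin, copied from the first seven-bin table above
bins = [(814, 81), (837, 120), (811, 148), (830, 182),
        (870, 206), (808, 197), (867, 262)]

def woe_table(bins):
    """Return a list of (woe, info_value, ks) tuples, one per bin."""
    total_bad = sum(b for _, b in bins)
    total_good = sum(n - b for n, b in bins)
    rows, cum_bad, cum_good = [], 0.0, 0.0
    for n, b in bins:
        pct_bad = b / total_bad          # share of all bads in this bin
        pct_good = (n - b) / total_good  # share of all goods in this bin
        woe = math.log(pct_bad / pct_good)
        iv = (pct_bad - pct_good) * woe
        cum_bad += pct_bad
        cum_good += pct_good
        ks = abs(cum_bad - cum_good) * 100
        rows.append((woe, iv, ks))
    return rows

rows = woe_table(bins)
total_iv = sum(iv for _, iv, _ in rows)
max_ks = max(ks for _, _, ks in rows)
```

Under these assumptions, rows, total_iv and max_ks should agree with the printed table up to rounding.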

Written by statcompute

June 10, 2012 at 2:15 am

Posted in SAS, Scorecard

Generalized Regression Neural Networks and the Implementation with Matlab

A Generalized Regression Neural Network (GRNN) is a special case of a Radial Basis Network (RBN). Compared with competitors such as the standard feedforward neural network, a GRNN has several advantages. First of all, its structure is relatively simple and static, with 2 layers, namely the pattern and summation layers. Once an input goes through each unit in the pattern layer, the relationship between the input and the response is “memorized” and stored in the unit. As a result, the number of units in the pattern layer is equal to the number of observations in the training sample. In each pattern unit, a Gaussian PDF is applied to the network input such that

Theta = EXP[-0.5 * (X - u)' * (X - u) / (Sigma ^ 2)]

where Theta is the output from a pattern unit, X is the input, u is the training vector stored in the unit, and Sigma is a positive constant known as the “spread” or “smoothing parameter”. Once Theta is computed, it is passed to the summation layer to calculate Y|X = SUM(Y * Theta) / SUM(Theta), where Y|X is the prediction conditional on X and Y is the response in the training sample. In addition to the above, other benefits of GRNN claimed by Specht (1991) include:

1) The network is able to learn from the training data by “1-pass” training in a fraction of the time it takes to train standard feedforward networks.

2) The spread, Sigma, is the only free parameter in the network, and it can often be identified by V-fold or split-sample cross validation.

3) Unlike standard feedforward networks, GRNN estimation is always able to converge to a global solution and won’t be trapped by a local minimum.
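
The pattern and summation layers described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the Matlab toolbox implementation. The function name grnn_predict and the toy data are made up for the example, and the scaling of sigma follows the formula in the text rather than any particular library.

```python
import math

def grnn_predict(train_X, train_Y, x, sigma):
    """Predict Y|x as a Theta-weighted average of the training responses."""
    thetas = []
    for u in train_X:
        # Pattern layer: Theta = exp(-0.5 * (x - u)'(x - u) / sigma^2)
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, u))
        thetas.append(math.exp(-0.5 * sq_dist / sigma ** 2))
    # Summation layer: SUM(Y * Theta) / SUM(Theta)
    return sum(y * t for y, t in zip(train_Y, thetas)) / sum(thetas)

# Toy training sample (hypothetical values): one pattern unit per observation
train_X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
train_Y = [0.0, 1.0, 2.0]

pred_mid = grnn_predict(train_X, train_Y, [1.0, 1.0], sigma=0.5)
pred_edge = grnn_predict(train_X, train_Y, [0.0, 0.0], sigma=0.1)
```

There is no iterative fitting step: "training" only stores the sample, which is why the spread is the only quantity left to tune. A small sigma makes the prediction follow the nearest stored observation, while a large sigma pulls it toward the overall mean of the training responses.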

As for the implementation of GRNN, in my limited experience Matlab might be the best computing engine in terms of ease of use and speed. The demo below shows how to use Matlab to develop a GRNN and to identify an optimal value of Sigma using split-sample cross validation.

load credit

Y = transpose(data(:, 2));
[n, m] = size(Y);
train_index = 2:2:m;

% SPLIT THE RESPONSE VECTOR INTO TRAINING AND TESTING
train_Y = Y(train_index);
test_Y = Y;
test_Y(train_index) = [];

% SPLIT X MATRIX INTO TRAINING AND TESTING
X = transpose(data(:, 3:10));
train_X = X(:, train_index);
test_X = X;
test_X(:, train_index) = [];

% STANDARDIZE X MATRIX IN TRAINING SET
[train_X2, map] = mapstd(train_X);

% STANDARDIZE X MATRIX IN TESTING SET
test_X2 = mapstd('apply', test_X, map);

% CHECK THAT VARIANCE IS 1 IN TRAINING SET AND CLOSE TO 1 IN TESTING SET
var(transpose(train_X2))
var(transpose(test_X2))

% TESTING DIFFERENT SPREAD OF RADIAL BASIS FUNCTION
spread = 1:0.02:2;
perf = zeros(size(spread));
for k = 1:numel(spread)
  % TRAIN A GRNN
  grnn = newgrnn(train_X2, train_Y, spread(k));

  % CALCULATE THE PREDICTION FOR TESTING SET
  test_P = sim(grnn, test_X2);

  % COLLECT THE PERFORMANCE
  perf(k) = sse(test_Y - test_P);
end

plot(spread, perf, '-ro');

The plot below is generated by the Matlab program. As shown, SSE reaches its minimum when Sigma is between 1.3 and 1.4, which indicates a reasonable range for the optimal spread value.
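
For readers without the Matlab toolbox, the same split-sample search can be sketched in plain Python. This is only an illustration under stated assumptions. The data are a made-up one-dimensional example (not the credit data), the names grnn_predict and search_spread are hypothetical, and sigma is scaled as in the formula in the text, so the values are not directly comparable with the spread passed to newgrnn.

```python
import math

def grnn_predict(train_X, train_Y, x, sigma):
    """Theta-weighted average of training responses, as in the GRNN formula."""
    thetas = [math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, u)) / sigma ** 2)
              for u in train_X]
    return sum(y * t for y, t in zip(train_Y, thetas)) / sum(thetas)

def search_spread(train_X, train_Y, test_X, test_Y, spreads):
    """Return (best_spread, results), where results holds (spread, sse) pairs."""
    results = []
    for s in spreads:
        # Sum of squared errors on the hold-out sample for this spread
        sse = sum((y - grnn_predict(train_X, train_Y, x, s)) ** 2
                  for x, y in zip(test_X, test_Y))
        results.append((s, sse))
    best_spread = min(results, key=lambda r: r[1])[0]
    return best_spread, results

# Made-up data: y = x^2 on a grid from 0.0 to 2.0
xs = [[i / 10.0] for i in range(21)]
ys = [x[0] ** 2 for x in xs]

# Every second observation goes to training, the rest to testing
train_X, train_Y = xs[1::2], ys[1::2]
test_X, test_Y = xs[0::2], ys[0::2]

spreads = [0.05 + 0.05 * k for k in range(20)]  # 0.05, 0.10, ..., 1.00
best, results = search_spread(train_X, train_Y, test_X, test_Y, spreads)
```

The inputs here are not standardized because there is a single predictor. With several predictors on different scales, the standardization step from the Matlab demo would be needed before computing distances.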

Written by statcompute

June 3, 2012 at 2:25 am