Yet Another Blog in Statistical Computing

Decision Stump with the Implementation in SAS

A decision stump is a deliberately simple but effective rule-based supervised learning algorithm in the same family as CART (Classification and Regression Trees). Unlike a full tree, a stump is a 1-level decision tree: a single split on a single feature, producing 2 terminal nodes.

Simple as it is, the decision stump has been used successfully in several settings. As a weak classifier, it has proved to be an effective base learner in ensemble learning algorithms such as bagging and boosting. A single decision stump can also be employed for feature screening in predictive modeling and for cut-point searching in continuous features. The following example shows a SAS implementation and illustrates the predictive power of a decision stump.

First, a test dataset is simulated with 1 binary response variable Y and 3 continuous features, X1 to X3. X1 is the feature most strongly related to Y, with a single cut-point at 5. X2 is also related to Y, but with 2 different cut-points, at 1.5 and 7.5. X3 is pure noise.
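The simulation code itself is not shown in this post, so the data step below is only a minimal sketch of one way to generate data with the structure described above. The dataset name `one`, the weight column `w`, the seed, the sample size, the uniform range of 0 to 10, and the step sizes in the event probability are all assumptions made for illustration.

```sas
/* Hypothetical simulation: names, seed, sample size, and       */
/* probabilities are illustrative assumptions, not the original */
data one;
  call streaminit(2012);
  do i = 1 to 2000;
    w  = 1;                        /* equal record weights           */
    x1 = 10 * rand("uniform");     /* related to Y, cut-point at 5   */
    x2 = 10 * rand("uniform");     /* related to Y, cuts 1.5 and 7.5 */
    x3 = 10 * rand("uniform");     /* pure noise, never enters p     */
    /* event probability as a step function of x1 and x2 */
    p  = 0.15 + 0.45 * (x1 > 5) + 0.15 * (x2 > 1.5) + 0.15 * (x2 > 7.5);
    y  = rand("bernoulli", p);
    output;
  end;
  drop i p;
run;
```

The largest step is placed on X1 so that it is the most predictive feature, while X3 never enters the probability and so carries no signal.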

The SAS macro below shows how to program a single decision stump. It is then used to search for the simulated cut-point in each continuous feature.
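For reference, the split criterion that the macro minimizes over candidate cut-points can be written out as follows. The symbols c, p_L, p_R, \pi_L, and \pi_R are notation introduced here for illustration; they correspond to gt_value, p1_1, p1_2, ppn1, and ppn2 in the code.

```latex
% Weighted event rate in each node for a candidate cut-point c
p_L(c) = \frac{\sum_{i:\, x_i \le c} w_i y_i}{\sum_{i:\, x_i \le c} w_i},
\qquad
p_R(c) = \frac{\sum_{i:\, x_i > c} w_i y_i}{\sum_{i:\, x_i > c} w_i}

% Share of records in each node (by count, as in the macro)
\pi_L(c) = \frac{\#\{i : x_i \le c\}}{n},
\qquad
\pi_R(c) = \frac{\#\{i : x_i > c\}}{n}

% Split impurity: node-level Gini 2p(1-p), averaged over the two nodes
G(c) = 2\, p_L(c)\,\bigl(1 - p_L(c)\bigr)\,\pi_L(c)
     + 2\, p_R(c)\,\bigl(1 - p_R(c)\bigr)\,\pi_R(c)
```

A perfect split, with p_L = 0 and p_R = 1, gives G = 0, while a split that leaves both nodes at an event rate of 0.5 gives G = 0.5, so lower values indicate a more predictive cut-point.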

%macro stump(data = , w = , y = , xlist = );

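/* declare the loop variables as local BEFORE assigning them, */
/* so that a global i or x is never overwritten               */
%local i x;

%let i = 1;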

proc sql;
create table _out
  (
  variable   char(32),
  gt_value   num,
  gini       num
  );
quit;

%do %while (%scan(&xlist, &i) ne %str());  
  %let x = %scan(&xlist, &i);
  
  data _tmp1(keep = &w &y &x);
    set &data;
    where &y in (0, 1);
  run;

  proc sql;
    create table
      _tmp2 as
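    select
      /* candidate cut-point: each distinct value of the feature */
      b.&x                                                          as gt_value,
      /* weighted event rate in the left node (x <= cut-point) */
      sum(case when a.&x <= b.&x then &w * &y else 0 end) / 
      sum(case when a.&x <= b.&x then &w else 0 end)                as p1_1,
      /* weighted event rate in the right node (x > cut-point) */
      sum(case when a.&x >  b.&x then &w * &y else 0 end) / 
      sum(case when a.&x >  b.&x then &w else 0 end)                as p1_2,
      /* unweighted share of records falling into each node */
      sum(case when a.&x <= b.&x then 1 else 0 end) / count(*)      as ppn1,
      sum(case when a.&x >  b.&x then 1 else 0 end) / count(*)      as ppn2,
      /* split impurity: node Gini 2p(1 - p), averaged by node share */
      2 * calculated p1_1 * (1 - calculated p1_1) * calculated ppn1 + 
      2 * calculated p1_2 * (1 - calculated p1_2) * calculated ppn2 as gini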
    from
      _tmp1 as a,
      (select distinct &x from _tmp1) as b
    group by
      b.&x;

    insert into _out
    select
      "&x",
      gt_value,
      gini
    from
      _tmp2
    having
      gini = min(gini);

    /* clean up both work tables so nothing is left behind after the last feature */
    drop table _tmp1, _tmp2;
  quit;

  %let i = %eval(&i + 1); 
%end;

proc sort data = _out;
  by gini;
run;

proc report data = _out box spacing = 1 split = "*" nowd;
  column("DECISION STUMP SUMMARY"
         variable gt_value gini);
  define variable / "VARIABLE"                     width = 30 center;
  define gt_value / "CUTOFF VALUE*(GREATER THAN)"  width = 15 center;
  define gini     / "GINI"                         width = 10 center format = 9.4;
run;

%mend stump;
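A call to the macro might look like the sketch below. It assumes a dataset named `one` that holds the response `y`, the features `x1` to `x3`, and a weight column `w` set to 1 for every record; these names are hypothetical, since the post does not show the original call.

```sas
/* Hypothetical invocation: dataset and column names are assumptions */
%stump(data = one, w = w, y = y, xlist = x1 x2 x3);
```

The macro reports, for each feature in xlist, the cut-point with the lowest Gini, sorted so that the most predictive feature comes first. Because each candidate cut-point is evaluated through a Cartesian join, run time grows roughly with the number of records times the number of distinct feature values.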

As shown in the table below, the decision stump did a fairly good job of both identifying predictive features and locating cut-points. Both related features, X1 and X2, were identified and correctly ranked by strength of association with Y. For X1, the cut-point located is 4.97, very close to the simulated value of 5. For X2, the cut-point located is 7.46, close to 1 of the 2 simulated cut-points (7.5); a single stump makes only 1 split per feature, so it cannot recover both.
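Once a cut-point has been located, it can be applied back to the data as a simple rule. The sketch below cross-tabulates the X1 rule at 4.97 against Y; the dataset name `one`, the work table `_check`, and the flag `x1_gt` are hypothetical names used only for illustration.

```sas
/* Hypothetical check: dataset and variable names are assumptions */
data _check;
  set one;
  x1_gt = (x1 > 4.97);   /* 1 if above the located cut-point, else 0 */
run;

proc freq data = _check;
  /* row percentages show the event rate of Y in each terminal node */
  tables x1_gt * y / nocol nopercent;
run;
```

The row percentages in the resulting table are the event rates in the two terminal nodes, which is a quick way to see how well the rule separates the classes.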

Written by statcompute

May 19, 2012 at 12:36 am