Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

SAS Macro Calculating Mutual Information

In statistics, various correlation functions, either Spearman or Pearson, have been used to measure the dependence between two data vectors under the linear or monotonic assumption. Mutual Information (MI) is an alternative widely used in Information Theory and is considered a more general measurement of the dependence between two vectors. More specifically, MI quantifies how much information two vectors, regardless of their actual values, might share based on their joint and marginal probability distribution functions.

Below is a sas macro implementing MI and Normalized MI by mimicking functions in Python, e.g. mutual_info_score() and normalized_mutual_info_score(). Although MI is used to evaluate the cluster analysis performance in sklearn package, it can also be used as an useful tool for Feature Selection in the context of Machine Learning and Statistical Modeling.

%macro mutual(data = , x = , y = );
***********************************************************;
* SAS MACRO CALCULATING MUTUAL INFORMATION AND ITS        *;
* NORMALIZED VARIANT BETWEEN TWO VECTORS BY MIMICKING     *;
* SKLEARN.METRICS.NORMALIZED_MUTUAL_INFO_SCORE()          *;
* SKLEARN.METRICS.MUTUAL_INFO_SCORE() IN PYTHON           *;
* ======================================================= *;
* INPUT PAREMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  X    : FIRST INPUT VECTOR                              *;
*  Y    : SECOND INPUT VECTOR                             *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

data _1;
  set &data;
  where &x ~= . and &y ~= .;
  _id = _n_;
run;

proc sql;
create table
  _2 as
select
  _id,
  &x,
  &y,
  1 / (select count(*) from _1) as _p_xy
from
  _1;

create table
  _3 as
select
  _id,
  &x         as _x,
  sum(_p_xy) as _p_x,
  sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_x
from 
  _2
group by
  &x;

create table
  _4 as
select
  _id,
  &y         as _y,
  sum(_p_xy) as _p_y,
  sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_y
from 
  _2
group by
  &y;

create table
  _5 as
select
  a.*,
  b._p_x,
  b._h_x,
  c._p_y,
  c._h_y,
  a._p_xy * log(a._p_xy / (b._p_x * c._p_y)) as mutual
from
  _2 as a, _3 as b, _4 as c
where
  a._id = b._id and a._id = c._id;

select
  sum(mutual) as MI format = 12.8,
  case 
    when sum(mutual) = 0 then 0
    else sum(mutual) / (sum(_h_x) * sum(_h_y)) ** 0.5 
  end as NMI format = 12.8
from
  _5;
quit;

%mend mutual;
Advertisements

Written by statcompute

August 7, 2016 at 5:18 pm

%d bloggers like this: