I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

SAS Macro Calculating Mutual Information

In statistics, various correlation functions, either Spearman or Pearson, have been used to measure the dependence between two data vectors under the linear or monotonic assumption. Mutual Information (MI) is an alternative widely used in Information Theory and is considered a more general measurement of the dependence between two vectors. More specifically, MI quantifies how much information two vectors, regardless of their actual values, might share based on their joint and marginal probability distribution functions.

Below is a sas macro implementing MI and Normalized MI by mimicking functions in Python, e.g. mutual_info_score() and normalized_mutual_info_score(). Although MI is used to evaluate the cluster analysis performance in sklearn package, it can also be used as an useful tool for Feature Selection in the context of Machine Learning and Statistical Modeling.

```%macro mutual(data = , x = , y = );
***********************************************************;
* SAS MACRO CALCULATING MUTUAL INFORMATION AND ITS        *;
* NORMALIZED VARIANT BETWEEN TWO VECTORS BY MIMICKING     *;
* SKLEARN.METRICS.NORMALIZED_MUTUAL_INFO_SCORE()          *;
* SKLEARN.METRICS.MUTUAL_INFO_SCORE() IN PYTHON           *;
* ======================================================= *;
* INPUT PAREMETERS:                                       *;
*  DATA : INPUT SAS DATA TABLE                            *;
*  X    : FIRST INPUT VECTOR                              *;
*  Y    : SECOND INPUT VECTOR                             *;
* ======================================================= *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

data _1;
set &data;
where &x ~= . and &y ~= .;
_id = _n_;
run;

proc sql;
create table
_2 as
select
_id,
&x,
&y,
1 / (select count(*) from _1) as _p_xy
from
_1;

create table
_3 as
select
_id,
&x         as _x,
sum(_p_xy) as _p_x,
sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_x
from
_2
group by
&x;

create table
_4 as
select
_id,
&y         as _y,
sum(_p_xy) as _p_y,
sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_y
from
_2
group by
&y;

create table
_5 as
select
a.*,
b._p_x,
b._h_x,
c._p_y,
c._h_y,
a._p_xy * log(a._p_xy / (b._p_x * c._p_y)) as mutual
from
_2 as a, _3 as b, _4 as c
where
a._id = b._id and a._id = c._id;

select
sum(mutual) as MI format = 12.8,
case
when sum(mutual) = 0 then 0
else sum(mutual) / (sum(_h_x) * sum(_h_y)) ** 0.5
end as NMI format = 12.8
from
_5;
quit;

%mend mutual;
```