## SAS Macro Calculating Mutual Information

In statistics, various correlation functions, either Spearman or Pearson, have been used to measure the dependence between two data vectors under the linear or monotonic assumption. Mutual Information (MI) is an alternative widely used in Information Theory and is considered a more general measurement of the dependence between two vectors. More specifically, MI quantifies how much information two vectors, regardless of their actual values, might share based on their joint and marginal probability distribution functions.

Below is a sas macro implementing MI and Normalized MI by mimicking functions in Python, e.g. mutual_info_score() and normalized_mutual_info_score(). Although MI is used to evaluate the cluster analysis performance in sklearn package, it can also be used as an useful tool for Feature Selection in the context of Machine Learning and Statistical Modeling.

%macro mutual(data = , x = , y = ); ***********************************************************; * SAS MACRO CALCULATING MUTUAL INFORMATION AND ITS *; * NORMALIZED VARIANT BETWEEN TWO VECTORS BY MIMICKING *; * SKLEARN.METRICS.NORMALIZED_MUTUAL_INFO_SCORE() *; * SKLEARN.METRICS.MUTUAL_INFO_SCORE() IN PYTHON *; * ======================================================= *; * INPUT PAREMETERS: *; * DATA : INPUT SAS DATA TABLE *; * X : FIRST INPUT VECTOR *; * Y : SECOND INPUT VECTOR *; * ======================================================= *; * AUTHOR: WENSUI.LIU@53.COM *; ***********************************************************; data _1; set &data; where &x ~= . and &y ~= .; _id = _n_; run; proc sql; create table _2 as select _id, &x, &y, 1 / (select count(*) from _1) as _p_xy from _1; create table _3 as select _id, &x as _x, sum(_p_xy) as _p_x, sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_x from _2 group by &x; create table _4 as select _id, &y as _y, sum(_p_xy) as _p_y, sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_y from _2 group by &y; create table _5 as select a.*, b._p_x, b._h_x, c._p_y, c._h_y, a._p_xy * log(a._p_xy / (b._p_x * c._p_y)) as mutual from _2 as a, _3 as b, _4 as c where a._id = b._id and a._id = c._id; select sum(mutual) as MI format = 12.8, case when sum(mutual) = 0 then 0 else sum(mutual) / (sum(_h_x) * sum(_h_y)) ** 0.5 end as NMI format = 12.8 from _5; quit; %mend mutual;