## Archive for **August 2016**

## Copas Test for Overfitting in SAS

Overfitting is a concern for overly complex models. When a model suffers from the overfitting, it will tend to over-explain the model training data and canâ€™t generalize well in the out-of-sample (OOS) prediction. Many statistical measures, such as Adjusted R-squared and various Information criterion, have been developed to guard against the overfitting. However, these statistics are more suggestive than conclusive.

To test the null hypothesis of no overfitting, the Copas statistic is a convenient statistical measure to detect the overfitting and is based upon the fact that the conditional expectation of a response, e.g. E(Y|Y_oos), can be expressed as a linear function of its out-of-sample prediction Y_oos. For a model without the overfitting problem, E(Y|Y_oos) and Y_oos should be equal. In his research work, Copas also showed that this method can be generalized to the entire GLM family.

The implementation routine of Copas test is outlined as below.

– First of all, given a testing data sample, we generate the out-of-sample prediction, which could be derived from multiple approaches, such as n-fold, split-sample, or leave-one-out.

– Next, we fit a simple OLS regression between the observed Y and the out-of-sample prediction Y_hat such that Y = B0 + B1 * Y_hat.

– If the null hypothesis B0 = 0 and B1 = 1 is not rejected, then there is no concern about the overfitting.

Below is the SAS implementation of Copas test for Poisson regression based on LOO predictions and can be easily generalized to other cases with a few tweaks.

%macro copas(data = , y = , x = ); *************************************************; * COPAS TEST FOR OVERFITTING *; * ============================================= *; * INPUT PARAMETERS: *; * DATA: A SAS DATASET INCLUDING BOTH DEPENDENT *; * AND INDEPENDENT VARIABLES *; * Y : THE DEPENDENT VARIABLE *; * X : A LIST OF INDEPENDENT VARIABLES *; * ============================================= *; * Reference: *; * Measuring Overfitting and Mispecification in *; * Nonlinear Models *; *************************************************; options mprint mlogic symbolgen; data _1; set &data; _id = _n_; keep _id &x &y; run; proc sql noprint; select count(*) into :cnt from _1; quit; %do i = 1 %to &cnt; ods select none; proc genmod data = _1; where _id ~= &i; model &y = &x / dist = poisson link = log; store _est; run; ods select all; proc plm source = _est noprint; score data = _1(where = (_id = &i)) out = _2 / ilink; run; %if &i = 1 %then %do; data _3; set _2; run; %end; %else %do; proc append base = _3 data = _2; run; %end; %end; title "H0: No Overfitting (B0 = 0 and B1 = 1)"; ods select testanova; proc reg data = _3; Copas_Test: model &y = predicted; Copas_Statistic: test intercept = 0, predicted = 1; run; quit; %mend;

## SAS Macro Calculating Mutual Information

In statistics, various correlation functions, either Spearman or Pearson, have been used to measure the dependence between two data vectors under the linear or monotonic assumption. Mutual Information (MI) is an alternative widely used in Information Theory and is considered a more general measurement of the dependence between two vectors. More specifically, MI quantifies how much information two vectors, regardless of their actual values, might share based on their joint and marginal probability distribution functions.

Below is a sas macro implementing MI and Normalized MI by mimicking functions in Python, e.g. mutual_info_score() and normalized_mutual_info_score(). Although MI is used to evaluate the cluster analysis performance in sklearn package, it can also be used as an useful tool for Feature Selection in the context of Machine Learning and Statistical Modeling.

%macro mutual(data = , x = , y = ); ***********************************************************; * SAS MACRO CALCULATING MUTUAL INFORMATION AND ITS *; * NORMALIZED VARIANT BETWEEN TWO VECTORS BY MIMICKING *; * SKLEARN.METRICS.NORMALIZED_MUTUAL_INFO_SCORE() *; * SKLEARN.METRICS.MUTUAL_INFO_SCORE() IN PYTHON *; * ======================================================= *; * INPUT PAREMETERS: *; * DATA : INPUT SAS DATA TABLE *; * X : FIRST INPUT VECTOR *; * Y : SECOND INPUT VECTOR *; * ======================================================= *; * AUTHOR: WENSUI.LIU@53.COM *; ***********************************************************; data _1; set &data; where &x ~= . and &y ~= .; _id = _n_; run; proc sql; create table _2 as select _id, &x, &y, 1 / (select count(*) from _1) as _p_xy from _1; create table _3 as select _id, &x as _x, sum(_p_xy) as _p_x, sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_x from _2 group by &x; create table _4 as select _id, &y as _y, sum(_p_xy) as _p_y, sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_y from _2 group by &y; create table _5 as select a.*, b._p_x, b._h_x, c._p_y, c._h_y, a._p_xy * log(a._p_xy / (b._p_x * c._p_y)) as mutual from _2 as a, _3 as b, _4 as c where a._id = b._id and a._id = c._id; select sum(mutual) as MI format = 12.8, case when sum(mutual) = 0 then 0 else sum(mutual) / (sum(_h_x) * sum(_h_y)) ** 0.5 end as NMI format = 12.8 from _5; quit; %mend mutual;