Yet Another Blog in Statistical Computing


Copas Test for Overfitting in SAS

Overfitting is a concern for overly complex models. An overfitted model tends to over-explain its training data and fails to generalize well in out-of-sample (OOS) prediction. Many statistical measures, such as adjusted R-squared and various information criteria, have been developed to guard against overfitting. However, these statistics are more suggestive than conclusive.

The Copas statistic offers a convenient way to test the null hypothesis of no overfitting. It is based upon the fact that the conditional expectation of a response given its out-of-sample prediction, E(Y|Y_oos), can be expressed as a linear function of Y_oos. For a model without overfitting, E(Y|Y_oos) and Y_oos should be equal. In his research, Copas also showed that this method can be generalized to the entire GLM family.
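Written out in symbols, the idea above might look as follows. This is only a restatement of the paragraph, with beta_0 and beta_1 introduced here as names for the intercept and slope of the linear function:

```latex
\[
E(Y \mid Y_{oos}) = \beta_0 + \beta_1 \, Y_{oos},
\qquad
H_0:\ \beta_0 = 0 \ \text{and} \ \beta_1 = 1 .
\]
```

Under H_0 the right-hand side reduces to Y_oos, which is the "no overfitting" case. A slope noticeably below 1 would suggest that the predictions are too extreme and need to be shrunk toward the mean, which is the usual signature of an overfitted model.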

The Copas test is implemented as outlined below.
– First, given a data sample, we generate out-of-sample predictions, which can be obtained by several approaches, such as n-fold cross-validation, split-sample validation, or leave-one-out.
– Next, we fit a simple OLS regression of the observed Y on the out-of-sample prediction Y_oos such that Y = B0 + B1 * Y_oos.
– If the joint null hypothesis B0 = 0 and B1 = 1 is not rejected, then there is no evidence of overfitting.
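For reference, the joint test in the last step is the standard F test for two linear restrictions in an OLS regression. A sketch of the statistic, with y_i the observed response, y_{oos,i} its out-of-sample prediction, b_0 and b_1 the OLS estimates, and n the number of observations (notation introduced here for illustration):

```latex
\[
F = \frac{(RSS_0 - RSS_1)/2}{RSS_1/(n-2)},
\qquad
RSS_0 = \sum_{i=1}^{n} \left( y_i - y_{oos,i} \right)^2,
\qquad
RSS_1 = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 \, y_{oos,i} \right)^2 .
\]
```

RSS_0 is the residual sum of squares with B0 = 0 and B1 = 1 imposed, and RSS_1 is the one from the unrestricted fit. Under the usual OLS assumptions, F follows an F distribution with 2 and n - 2 degrees of freedom under the null. For a non-normal response such as a count, that reference distribution should be treated as an approximation.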

Below is a SAS implementation of the Copas test for Poisson regression based on leave-one-out (LOO) predictions. It can be easily generalized to other cases with a few tweaks.

%macro copas(data = , y = , x = );
*************************************************;
*         COPAS TEST FOR OVERFITTING            *;
* ============================================= *;
* INPUT PARAMETERS:                             *;
*  DATA: A SAS DATASET INCLUDING BOTH DEPENDENT *;
*        AND INDEPENDENT VARIABLES              *; 
*  Y   : THE DEPENDENT VARIABLE                 *;
*  X   : A LIST OF INDEPENDENT VARIABLES        *;
* ============================================= *;
* Reference:                                    *;
* Measuring Overfitting and Mispecification in  *;
* Nonlinear Models                              *;
*************************************************;
options mprint mlogic symbolgen;

data _1;
  set &data;
  _id = _n_;
  keep _id &x &y;
run;  

proc sql noprint;
  select count(*) into :cnt from _1;
quit;  

%do i = 1 %to &cnt;
  /* Refit the Poisson model with observation i held out */
  ods select none;
  proc genmod data = _1;
    where _id ~= &i;
    model &y = &x / dist = poisson link = log;
    store _est;
  run;
  ods select all;

  /* Score the held-out observation on the response scale */
  proc plm restore = _est noprint;
    score data = _1(where = (_id = &i)) out = _2 / ilink;
  run;

  /* Stack the LOO predictions in _3 */
  %if &i = 1 %then %do;
    data _3;
      set _2;
    run;
  %end;
  %else %do;
    proc append base = _3 data = _2;
    run;
  %end;
%end;

/* Regress the observed response on the LOO prediction and test B0 = 0, B1 = 1 jointly */
title "H0: No Overfitting (B0 = 0 and B1 = 1)";
ods select testanova;
proc reg data = _3;
  Copas_Test: model &y = predicted;
  Copas_Statistic: test intercept = 0, predicted = 1;
run;
quit;
title;

%mend;
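
A minimal sketch of how the macro might be called. The data set `_sim`, the variables `y`, `x1` and `x2`, the seed, and the coefficients are all made up for illustration and are not from the original analysis.

```sas
/* Simulate a small Poisson data set (hypothetical example data) */
data _sim;
  call streaminit(2016);
  do i = 1 to 100;
    x1 = rand("normal");
    x2 = rand("normal");
    /* log link: the mean is exp() of the linear predictor */
    y = rand("poisson", exp(0.5 + 0.3 * x1 - 0.2 * x2));
    output;
  end;
  drop i;
run;

/* Run the Copas test with y as the response and x1, x2 as predictors */
%copas(data = _sim, y = y, x = x1 x2);
```

Because the macro refits the model once per observation, a small sample keeps the run time short. With the `mprint mlogic symbolgen` options turned on inside the macro, expect a long log.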

Written by statcompute

August 21, 2016 at 12:08 pm
