## Flavors of SQL on Pandas DataFrame

In R, sqldf() provides a convenient interface of running SQL statement on data frames. Similarly, Python also offers multiple ways to interact between SQL and Pandas DataFrames by leveraging the lightweight SQLite engine. While pandasql (https://github.com/yhat/pandasql) works similarly to sqldf() in R, pysqldf (https://github.com/airtoxin/pysqldf) is even more powerful. In my experiments shown below, advantages of pysqldf over pandasql are two-fold. First of all, pysqldf is 2 – 3 times faster than pandasql. Secondly, pysqldf supports new function definitions, which is not available in pandasql. However, it is worth mentioning that the generic python interface to an in-memory SQLite database can be more efficient and flexible than both pysqldf and pandasql, as demonstrated below, as long as we are able to get the DataFrame into the SQLite and let it stay in-memory.

from sqlite3 import connect from pandas import read_sql_query import pandasql import pysqldf import numpy # CREATE AN IN-MEMORY SQLITE DB con = connect(":memory:") cur = con.cursor() cur.execute("attach 'my.db' as filedb") cur.execute("create table df as select * from filedb.hflights") cur.execute("detach filedb") # IMPORT SQLITE TABLE INTO PANDAS DF df = read_sql_query("select * from df", con) # WRITE QUERIES sql01 = "select * from df where DayofWeek = 1 and Dest = 'CVG';" sql02 = "select DayofWeek, AVG(ArrTime) from df group by DayofWeek;" sql03 = "select DayofWeek, median(ArrTime) from df group by DayofWeek;" # SELECTION: # 1. PANDASQL %time t11 = pandasql.sqldf(sql01, globals()) # 2. PYSQLDF %time t12 = pysqldf.SQLDF(globals()).execute(sql01) # 3. GENERIC SQLITE CONNECTION %time t13 = read_sql_query(sql01, con) # AGGREGATION: # 1. PANDASQL %time t21 = pandasql.sqldf(sql02, globals()) # 2. PYSQLDF %time t22 = pysqldf.SQLDF(globals()).execute(sql02) # 3. GENERIC SQLITE CONNECTION %time t23 = read_sql_query(sql02, con) # DEFINING A NEW FUNCTION: # DEFINE A FUNCTION NOT SUPPORTED IN SQLITE class median(object): def __init__(self): self.a = [] def step(self, x): self.a.append(x) def finalize(self): return numpy.median(self.a) # 1. PYSQLDF udafs = {"median": median} %time t31 = pysqldf.SQLDF(globals(), udafs = udafs).execute(sql03) # 2 GENERIC SQLITE CONNECTION con.create_aggregate("median", 1, median) %time t32 = read_sql_query(sql03, con)

## Test Drive Proc Lua – Convert SAS Table to 2-Dimension Lua Table

data one (drop = i); array a x1 x2 x3 x4 x5; do i = 1 to 5; do over a; a = ranuni(i); end; output; end; run; proc lua; submit; local ds = sas.open("one") local tbl = {} for var in sas.vars(ds) do tbl[var.name] = {} end while sas.next(ds) do for i, v in pairs(tbl) do table.insert(tbl[i], sas.get_value(ds, i)) end end sas.close(ds) for i, item in pairs(tbl) do print(i, table.concat(item, " ")) end endsubmit; run;

## Copas Test for Overfitting in SAS

Overfitting is a concern for overly complex models. When a model suffers from the overfitting, it will tend to over-explain the model training data and canâ€™t generalize well in the out-of-sample (OOS) prediction. Many statistical measures, such as Adjusted R-squared and various Information criterion, have been developed to guard against the overfitting. However, these statistics are more suggestive than conclusive.

To test the null hypothesis of no overfitting, the Copas statistic is a convenient statistical measure to detect the overfitting and is based upon the fact that the conditional expectation of a response, e.g. E(Y|Y_oos), can be expressed as a linear function of its out-of-sample prediction Y_oos. For a model without the overfitting problem, E(Y|Y_oos) and Y_oos should be equal. In his research work, Copas also showed that this method can be generalized to the entire GLM family.

The implementation routine of Copas test is outlined as below.

– First of all, given a testing data sample, we generate the out-of-sample prediction, which could be derived from multiple approaches, such as n-fold, split-sample, or leave-one-out.

– Next, we fit a simple OLS regression between the observed Y and the out-of-sample prediction Y_hat such that Y = B0 + B1 * Y_hat.

– If the null hypothesis B0 = 0 and B1 = 1 is not rejected, then there is no concern about the overfitting.

Below is the SAS implementation of Copas test for Poisson regression based on LOO predictions and can be easily generalized to other cases with a few tweaks.

%macro copas(data = , y = , x = ); *************************************************; * COPAS TEST FOR OVERFITTING *; * ============================================= *; * INPUT PARAMETERS: *; * DATA: A SAS DATASET INCLUDING BOTH DEPENDENT *; * AND INDEPENDENT VARIABLES *; * Y : THE DEPENDENT VARIABLE *; * X : A LIST OF INDEPENDENT VARIABLES *; * ============================================= *; * Reference: *; * Measuring Overfitting and Mispecification in *; * Nonlinear Models *; *************************************************; options mprint mlogic symbolgen; data _1; set &data; _id = _n_; keep _id &x &y; run; proc sql noprint; select count(*) into :cnt from _1; quit; %do i = 1 %to &cnt; ods select none; proc genmod data = _1; where _id ~= &i; model &y = &x / dist = poisson link = log; store _est; run; ods select all; proc plm source = _est noprint; score data = _1(where = (_id = &i)) out = _2 / ilink; run; %if &i = 1 %then %do; data _3; set _2; run; %end; %else %do; proc append base = _3 data = _2; run; %end; %end; title "H0: No Overfitting (B0 = 0 and B1 = 1)"; ods select testanova; proc reg data = _3; Copas_Test: model &y = predicted; Copas_Statistic: test intercept = 0, predicted = 1; run; quit; %mend;

## SAS Macro Calculating Mutual Information

In statistics, various correlation functions, either Spearman or Pearson, have been used to measure the dependence between two data vectors under the linear or monotonic assumption. Mutual Information (MI) is an alternative widely used in Information Theory and is considered a more general measurement of the dependence between two vectors. More specifically, MI quantifies how much information two vectors, regardless of their actual values, might share based on their joint and marginal probability distribution functions.

Below is a sas macro implementing MI and Normalized MI by mimicking functions in Python, e.g. mutual_info_score() and normalized_mutual_info_score(). Although MI is used to evaluate the cluster analysis performance in sklearn package, it can also be used as an useful tool for Feature Selection in the context of Machine Learning and Statistical Modeling.

%macro mutual(data = , x = , y = ); ***********************************************************; * SAS MACRO CALCULATING MUTUAL INFORMATION AND ITS *; * NORMALIZED VARIANT BETWEEN TWO VECTORS BY MIMICKING *; * SKLEARN.METRICS.NORMALIZED_MUTUAL_INFO_SCORE() *; * SKLEARN.METRICS.MUTUAL_INFO_SCORE() IN PYTHON *; * ======================================================= *; * INPUT PAREMETERS: *; * DATA : INPUT SAS DATA TABLE *; * X : FIRST INPUT VECTOR *; * Y : SECOND INPUT VECTOR *; * ======================================================= *; * AUTHOR: WENSUI.LIU@53.COM *; ***********************************************************; data _1; set &data; where &x ~= . and &y ~= .; _id = _n_; run; proc sql; create table _2 as select _id, &x, &y, 1 / (select count(*) from _1) as _p_xy from _1; create table _3 as select _id, &x as _x, sum(_p_xy) as _p_x, sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_x from _2 group by &x; create table _4 as select _id, &y as _y, sum(_p_xy) as _p_y, sum(_p_xy) * log(sum(_p_xy)) / count(*) as _h_y from _2 group by &y; create table _5 as select a.*, b._p_x, b._h_x, c._p_y, c._h_y, a._p_xy * log(a._p_xy / (b._p_x * c._p_y)) as mutual from _2 as a, _3 as b, _4 as c where a._id = b._id and a._id = c._id; select sum(mutual) as MI format = 12.8, case when sum(mutual) = 0 then 0 else sum(mutual) / (sum(_h_x) * sum(_h_y)) ** 0.5 end as NMI format = 12.8 from _5; quit; %mend mutual;

## Scorecard Development with Data from Multiple Sources

This week, one of my friends asked me a very interesting and practical question in the scorecard development. The model development data were collected from multiple independent sources with various data sizes, heterogeneous risk profiles and different bad rates. While the performance statistics seem satisfactory on the model training dataset, the model doesn’t generalize well with new accounts that might come from a unknown source. The situation is very common in a consulting company where a risk or marketing model is sometimes developed with the data from multiple organizations.

To better understand the issue, I simulated a dataset consisting of two groups. In the dataset, while x0 and x1 govern the group segmentation, x2 and x3 define the bad definition. It is important to point out that the group information “grp” is only known in the model development sample but is unknown in the production population. Therefore, the variable “grp”, albeit predictive, can not be explicitly used in the model estimation.

data one; do i = 1 to 100000; x0 = ranuni(0); x1 = ranuni(1); x2 = ranuni(2); x3 = ranuni(3); if 1 + x0 * 2 + x1 * 4 + rannor(1) > 5 then do; grp = 1; if x2 * 2 + x3 * 4 + rannor(2) > 5 then bad = 1; else bad = 0; end; else do; grp = 0; if x2 * 4 + x3 * 2 + rannor(3) > 4 then bad = 1; else bad = 0; end; output; end; run;

Our first approach is to use all variables x0 – x3 to build a logistic regression and then evaluate the model altogether and by groups.

proc logistic data = one desc noprint; model bad = x0 x1 x2 x3; score data = one out = mdl1 (rename = (p_1 = score1)); run; GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1 MAXIMUM KS = 59.5763 AT SCORE POINT 0.2281 ( AUC STATISTICS = 0.8800, GINI COEFFICIENT = 0.7599, DIVERGENCE = 2.6802 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.6800 0.9699 2,057 7,943 10,000 79.43% 79.43% 33.81% 33.81% | 0.4679 0.6799 4,444 5,556 10,000 55.56% 67.50% 23.65% 57.46% | 0.3094 0.4679 6,133 3,867 10,000 38.67% 57.89% 16.46% 73.92% | 0.1947 0.3094 7,319 2,681 10,000 26.81% 50.12% 11.41% 85.33% | 0.1181 0.1946 8,364 1,636 10,000 16.36% 43.37% 6.96% 92.29% | 0.0690 0.1181 9,044 956 10,000 9.56% 37.73% 4.07% 96.36% | 0.0389 0.0690 9,477 523 10,000 5.23% 33.09% 2.23% 98.59% | 0.0201 0.0389 9,752 248 10,000 2.48% 29.26% 1.06% 99.64% V 0.0085 0.0201 9,925 75 10,000 0.75% 26.09% 0.32% 99.96% GOOD 0.0005 0.0085 9,991 9 10,000 0.09% 23.49% 0.04% 100.00% ========== ========== ========== ========== ========== 0.0005 0.9699 76,506 23,494 100,000 GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 0)) MAXIMUM KS = 61.0327 AT SCORE POINT 0.2457 ( AUC STATISTICS = 0.8872, GINI COEFFICIENT = 0.7744, DIVERGENCE = 2.8605 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.7086 0.9699 1,051 6,162 7,213 85.43% 85.43% 30.51% 30.51% | 0.5019 0.7086 2,452 4,762 7,214 66.01% 75.72% 23.58% 54.10% | 0.3407 0.5019 3,710 3,504 7,214 48.57% 66.67% 17.35% 71.45% | 0.2195 0.3406 4,696 2,517 7,213 34.90% 58.73% 12.46% 83.91% | 0.1347 0.2195 5,650 1,564 7,214 21.68% 51.32% 7.74% 91.66% | 0.0792 0.1347 6,295 919 7,214 12.74% 44.89% 4.55% 96.21% | 0.0452 0.0792 6,737 476 7,213 6.60% 39.42% 2.36% 98.56% | 0.0234 0.0452 7,000 214 7,214 2.97% 34.86% 1.06% 99.62% V 0.0099 0.0234 7,150 64 7,214 0.89% 31.09% 0.32% 99.94% GOOD 0.0007 0.0099 7,201 12 7,213 0.17% 27.99% 0.06% 100.00% ========== ========== ========== ========== ========== 0.0007 0.9699 51,942 20,194 72,136 GOOD BAD SEPARATION REPORT FOR SCORE1 IN DATA MDL1(WHERE = (GRP = 1)) MAXIMUM KS = 53.0942 AT SCORE POINT 0.2290 ( AUC STATISTICS = 0.8486, GINI COEFFICIENT = 0.6973, DIVERGENCE = 2.0251 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.5863 0.9413 1,351 1,435 2,786 51.51% 51.51% 43.48% 43.48% | 0.3713 0.5862 2,136 651 2,787 23.36% 37.43% 19.73% 63.21% | 0.2299 0.3712 2,340 446 2,786 16.01% 30.29% 13.52% 76.73% | 0.1419 0.2298 2,525 262 2,787 9.40% 25.07% 7.94% 84.67% | 0.0832 0.1419 2,584 202 2,786 7.25% 21.50% 6.12% 90.79% | 0.0480 0.0832 2,643 144 2,787 5.17% 18.78% 4.36% 95.15% | 0.0270 0.0480 2,682 104 2,786 3.73% 16.63% 3.15% 98.30% | 0.0140 0.0270 2,741 46 2,787 1.65% 14.76% 1.39% 99.70% V 0.0058 0.0140 2,776 10 2,786 0.36% 13.16% 0.30% 100.00% GOOD 0.0005 0.0058 2,786 0 2,786 0.00% 11.84% 0.00% 100.00% ========== ========== ========== ========== ========== 0.0005 0.9413 24,564 3,300 27,864

As shown in the above output, while the overall model performance looks ok, it doesn’t generalize well in the dataset from the 2nd group with a smaller size. While the overall KS could be as high as 60, the KS for the 2nd group is merely 53. The reason is that the overall model performance is heavily influenced by the dataset from the 1st group with the larger size. Therefore, the estimated model is biased toward the risk profile reflected in the 1st group.

To alleviate the bias in the first model, we could first introduce a look-alike model driven by x0 – x1 with the purpose to profile the group and then build two separate risk models with x2 – x3 only for 1st and 2nd groups respectively. As a result, the final predicted probability should be the composite of all three sub-models, as shown below. The model evaluation is also provided to compared with the first model.

proc logistic data = one desc noprint; where grp = 0; model bad = x2 x3; score data = one out = mdl20(rename = (p_1 = p_10)); run; proc logistic data = one desc noprint; where grp = 1; model bad = x2 x3; score data = one out = mdl21(rename = (p_1 = p_11)); run; proc logistic data = one desc noprint; model grp = x0 x1; score data = one out = seg; run; data mdl2; merge seg mdl20 mdl21; by i; score2 = p_10 * (1 - p_1) + p_11 * p_1; run; GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2 MAXIMUM KS = 60.6234 AT SCORE POINT 0.2469 ( AUC STATISTICS = 0.8858, GINI COEFFICIENT = 0.7715, DIVERGENCE = 2.8434 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.6877 0.9677 2,011 7,989 10,000 79.89% 79.89% 34.00% 34.00% | 0.4749 0.6876 4,300 5,700 10,000 57.00% 68.45% 24.26% 58.27% | 0.3125 0.4748 6,036 3,964 10,000 39.64% 58.84% 16.87% 75.14% | 0.1932 0.3124 7,451 2,549 10,000 25.49% 50.51% 10.85% 85.99% | 0.1142 0.1932 8,379 1,621 10,000 16.21% 43.65% 6.90% 92.89% | 0.0646 0.1142 9,055 945 10,000 9.45% 37.95% 4.02% 96.91% | 0.0345 0.0646 9,533 467 10,000 4.67% 33.19% 1.99% 98.90% | 0.0166 0.0345 9,800 200 10,000 2.00% 29.29% 0.85% 99.75% V 0.0062 0.0166 9,946 54 10,000 0.54% 26.10% 0.23% 99.98% GOOD 0.0001 0.0062 9,995 5 10,000 0.05% 23.49% 0.02% 100.00% ========== ========== ========== ========== ========== 0.0001 0.9677 76,506 23,494 100,000 GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 0)) MAXIMUM KS = 61.1591 AT SCORE POINT 0.2458 ( AUC STATISTICS = 0.8880, GINI COEFFICIENT = 0.7759, DIVERGENCE = 2.9130 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.7221 0.9677 1,075 6,138 7,213 85.10% 85.10% 30.40% 30.40% | 0.5208 0.7221 2,436 4,778 7,214 66.23% 75.66% 23.66% 54.06% | 0.3533 0.5208 3,670 3,544 7,214 49.13% 66.82% 17.55% 71.61% | 0.2219 0.3532 4,726 2,487 7,213 34.48% 58.73% 12.32% 83.92% | 0.1309 0.2219 5,617 1,597 7,214 22.14% 51.41% 7.91% 91.83% | 0.0731 0.1309 6,294 920 7,214 12.75% 44.97% 4.56% 96.39% | 0.0387 0.0731 6,762 451 7,213 6.25% 39.44% 2.23% 98.62% | 0.0189 0.0387 7,009 205 7,214 2.84% 34.86% 1.02% 99.63% V 0.0074 0.0189 7,152 62 7,214 0.86% 31.09% 0.31% 99.94% GOOD 0.0002 0.0073 7,201 12 7,213 0.17% 27.99% 0.06% 100.00% ========== ========== ========== ========== ========== 0.0002 0.9677 51,942 20,194 72,136 GOOD BAD SEPARATION REPORT FOR SCORE2 IN DATA MDL2(WHERE = (GRP = 1)) MAXIMUM KS = 57.6788 AT SCORE POINT 0.1979 ( AUC STATISTICS = 0.8717, GINI COEFFICIENT = 0.7434, DIVERGENCE = 2.4317 ) MIN MAX GOOD BAD TOTAL BAD CUMULATIVE BAD CUMU. BAD SCORE SCORE # # # RATE BAD RATE PERCENT PERCENT -------------------------------------------------------------------------------------------------------- BAD 0.5559 0.9553 1,343 1,443 2,786 51.79% 51.79% 43.73% 43.73% | 0.3528 0.5559 2,001 786 2,787 28.20% 40.00% 23.82% 67.55% | 0.2213 0.3528 2,364 422 2,786 15.15% 31.71% 12.79% 80.33% | 0.1372 0.2213 2,513 274 2,787 9.83% 26.24% 8.30% 88.64% | 0.0840 0.1372 2,588 198 2,786 7.11% 22.42% 6.00% 94.64% | 0.0484 0.0840 2,683 104 2,787 3.73% 19.30% 3.15% 97.79% | 0.0256 0.0483 2,729 57 2,786 2.05% 16.84% 1.73% 99.52% | 0.0118 0.0256 2,776 11 2,787 0.39% 14.78% 0.33% 99.85% V 0.0040 0.0118 2,781 5 2,786 0.18% 13.16% 0.15% 100.00% GOOD 0.0001 0.0040 2,786 0 2,786 0.00% 11.84% 0.00% 100.00% ========== ========== ========== ========== ========== 0.0001 0.9553 24,564 3,300 27,864

After comparing KS statistics from two modeling approaches, we can see that, while the performance from the 2nd approach on the overall sample is only slightly better than the one from the 1st approach, the KS on the 2nd group with a smaller size, e.g. grp = 1, increases from 53 upto 58 by 8.6%. While the example is just for two groups, it is trivial to generalize in cases with more than two groups.

## Risk Models with Generalized PLS

While developing risk models with hundreds of potential variables, we often run into the situation that risk characteristics or macro-economic indicators are highly correlated, namely multicollinearity. In such cases, we might have to drop variables with high VIFs or employ “variable shrinkage” methods, e.g. lasso or ridge, to suppress variables with colinearity.

Feature extraction approaches based on PCA and PLS have been widely discussed but are rarely used in real-world applications due to concerns around model interpretability and implementation. In the example below, it is shown that there shouldn’t any hurdle in the model implementation, e.g. score, given that coefficients can be extracted from a GPLS model in the similar way from a GLM model. In addition, compared with GLM with 8 variables, GPLS with only 5 components is able to provide a comparable performance in the hold-out testing data.

**R Code**

library(gpls) library(pROC) df1 <- read.csv("credit_count.txt") df2 <- df1[df1$CARDHLDR == 1, -c(1, 10, 11, 12, 13)] set.seed(2016) n <- nrow(df2) sample <- sample(seq(n), size = n / 2, replace = FALSE) train <- df2[sample, ] test <- df2[-sample, ] m1 <- glm(DEFAULT ~ ., data = train, family = "binomial") cat("\n### ROC OF GLM PREDICTION WITH TRAINING DATA ###\n") print(roc(train$DEFAULT, predict(m1, newdata = train, type = "response"))) cat("\n### ROC OF GLM PREDICTION WITH TESTING DATA ###\n") print(roc(test$DEFAULT, predict(m1, newdata = test, type = "response"))) m2 <- gpls(DEFAULT ~ ., data = train, family = "binomial", K.prov = 5) cat("\n### ROC OF GPLS PREDICTION WITH TRAINING DATA ###\n") print(roc(train$DEFAULT, predict(m2, newdata = train)$predicted[, 1])) cat("\n### ROC OF GPLS PREDICTION WITH TESTING DATA ###\n") print(roc(test$DEFAULT, predict(m2, newdata = test)$predicted[, 1])) cat("\n### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ###\n") print(data.frame(glm = m1$coefficients, gpls = m2$coefficients))

**Output**

### ROC OF GLM PREDICTION WITH TRAINING DATA ### Call: roc.default(response = train$DEFAULT, predictor = predict(m1, newdata = train, type = "response")) Data: predict(m1, newdata = train, type = "response") in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1). Area under the curve: 0.6641 ### ROC OF GLM PREDICTION WITH TESTING DATA ### Call: roc.default(response = test$DEFAULT, predictor = predict(m1, newdata = test, type = "response")) Data: predict(m1, newdata = test, type = "response") in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1). Area under the curve: 0.6537 ### ROC OF GPLS PREDICTION WITH TRAINING DATA ### Call: roc.default(response = train$DEFAULT, predictor = predict(m2, newdata = train)$predicted[, 1]) Data: predict(m2, newdata = train)$predicted[, 1] in 4753 controls (train$DEFAULT 0) < 496 cases (train$DEFAULT 1). Area under the curve: 0.6627 ### ROC OF GPLS PREDICTION WITH TESTING DATA ### Call: roc.default(response = test$DEFAULT, predictor = predict(m2, newdata = test)$predicted[, 1]) Data: predict(m2, newdata = test)$predicted[, 1] in 4750 controls (test$DEFAULT 0) < 500 cases (test$DEFAULT 1). Area under the curve: 0.6542 ### COEFFICIENTS COMPARISON BETWEEN GLM AND GPLS ### glm gpls (Intercept) -0.1940785071 -0.1954618828 AGE -0.0122709412 -0.0147883358 ACADMOS 0.0005302022 0.0003671781 ADEPCNT 0.1090667092 0.1352491711 MAJORDRG 0.0757313171 0.0813835741 MINORDRG 0.2621574192 0.2547176301 OWNRENT -0.2803919685 -0.1032119571 INCOME -0.0004222914 -0.0004531543 LOGSPEND -0.1688395555 -0.1525963363

## Duplicate Breusch-Godfrey Test Logic in SAS Autoreg Procedure

Since it appears that SAS and R might give slightly different B-G test results, I spent a couple hours on duplicating the logic of B-G test implemented in SAS Autoreg Procedure. The written SAS macro should give my team more flexibility to perform B-G test in CCAR 2017 model developments in cases that models will not be estimated with Autoreg Procedure.

**B-G Test with Proc Autoreg**

data one; do i = 1 to 100; x1 = uniform(1); x2 = uniform(2); r = normal(3) * 0.5; y = 10 + 8 * x1 + 6 * x2 + r; output; end; run; proc autoreg data = one; model y = x1 x2 / godfrey = 4; run; quit; /* Godfrey's Serial Correlation Test Alternative LM Pr > LM AR(1) 0.2976 0.5854 AR(2) 1.5919 0.4512 AR(3) 1.7168 0.6332 AR(4) 1.7839 0.7754 */

**Home-brew SAS Macro**

%macro bgtest(data = , r = , x = , order = 4); options nocenter nonumber nodate mprint mlogic symbolgen formchar = "|----|+|---+=|-/\<>*"; proc sql noprint; select mean(&r) format = 12.8 into :m from &data; quit; data _1 (drop = _i); set &data (keep = &r &x); %do i = 1 %to ℴ _lag&i._&r = lag&i.(&r); %end; _i + 1; _index = _i - ℴ array _l _lag:; do over _l; if _l = . then _l = &m; end; run; proc reg data = _last_ noprint; model &r = &x _lag:; output out = _2 p = rhat; run; proc sql noprint; create table _result as select sum((rhat - &m) ** 2) / sum((&r - &m) ** 2) as _r2, (select count(*) from _2) * calculated _r2 as _chisq, 1 - probchi(calculated _chisq, &order.) as _p_chisq, &order as _df from _2; quit; title; proc report data = _last_ spacing = 1 headline nowindows split = "*"; column(" * BREUSCH-GODFREY TEST FOR SERIAL CORRELATION * H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO &order * " _chisq _df _p_chisq); define _chisq / "CHI-SQUARE" width = 20 format = 15.10; define _df / "DF" width = 10; define _p_chisq / "P-VALUE" width = 20 format = 15.10; run; %mend bgtest; proc reg data = one noprint; model y = x1 x2; output out = two r = r2; run; quit; data _null_; do i = 1 to 4; call execute('%bgtest(data = two, r = r2, x = x1 x2, order = '||put(i, 2.)||');'); end; run; /* BREUSCH-GODFREY TEST FOR SERIAL CORRELATION H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 1 CHI-SQUARE DF P-VALUE ------------------------------------------------------- 0.2976458421 1 0.5853620441 BREUSCH-GODFREY TEST FOR SERIAL CORRELATION H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 2 CHI-SQUARE DF P-VALUE ------------------------------------------------------- 1.5918785412 2 0.4511572771 BREUSCH-GODFREY TEST FOR SERIAL CORRELATION H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 3 CHI-SQUARE DF P-VALUE ------------------------------------------------------- 1.7167785901 3 0.6332099963 BREUSCH-GODFREY TEST FOR SERIAL CORRELATION H0: THERE IS NO SERIAL CORRELATION OF ANY ORDER UP TO 4 CHI-SQUARE DF P-VALUE ------------------------------------------------------- 1.7839349922 4 0.7754201982 */