Developed by Wes McKinney, pandas is an efficient and powerful data analysis library for data scientists in the Python language. Like R, pandas reads the data into memory. As a result, we might often run out of memory while analyzing large data sets with pandas.
Similar to Blaze, ibis is a new data analysis framework in Python built on top of other back-end data engines, such as SQLite and Impala. Even better, ibis offers higher compatibility with pandas and better performance than Blaze.
In a previous blog post (https://statcompute.wordpress.com/2015/03/27/a-comparison-between-blaze-and-pandas), I showed the efficiency of Blaze through a simple example. However, the demonstration below shows that, when applied to the same data with the SQLite engine, ibis is 50% more efficient than Blaze in terms of the "real time".
import ibis

tbl = ibis.sqlite.connect('//home/liuwensui/Documents/data/flights.db').table('tbl2008')
exp = tbl[tbl.DayOfWeek > 1].group_by("DayOfWeek").aggregate(avg_AirTime = tbl.AirTime.mean())
pd = exp.execute()
print(pd)
#    DayOfWeek  avg_AirTime
# 0          2   103.214930
# 1          3   103.058508
# 2          4   103.467138
# 3          5   103.557539
# 4          6   107.400631
# 5          7   104.864885
#
# real 0m10.346s
# user 0m9.585s
# sys  0m1.181s
Similar to quasi-Poisson regressions, quasi-binomial regressions address the excessive variance by including a dispersion parameter. In addition to addressing over-dispersion, quasi-binomial regressions also offer additional value in other areas, such as LGD model development in credit risk modeling, due to their flexible distributional assumption.
Measured as the ratio between NCO and GCO, LGD can take any value in the range [0, 1], and there is currently no unanimous consensus in the industry on its distributional assumption. An advantage of quasi-binomial regression is that it assumes no specific distribution but merely specifies the conditional mean of the model response. The trade-off is the lack of likelihood-based measures such as AIC and BIC.
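To make the dispersion idea above concrete, here is a minimal Python sketch of how the quasi-binomial dispersion parameter is typically estimated: the Pearson chi-square statistic divided by the residual degrees of freedom, with the binomial variance function mu * (1 - mu). The observed rates and fitted means below are made-up numbers purely for illustration, not data from this post.

```python
def quasibinomial_dispersion(y, mu, n_params):
    """Pearson-based dispersion estimate under VAR = phi * mu * (1 - mu)."""
    pearson_chisq = sum((yi - mi) ** 2 / (mi * (1 - mi)) for yi, mi in zip(y, mu))
    return pearson_chisq / (len(y) - n_params)

# hypothetical observed rates and fitted means from a logit model
y  = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
mu = [0.12, 0.22, 0.38, 0.58, 0.72, 0.80]
print(round(quasibinomial_dispersion(y, mu, n_params=2), 4))
```

A value below 1 would indicate under-dispersion and a value above 1 over-dispersion, which is exactly the "Residual" estimate reported by GLIMMIX in the SAS output further down.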
Below is a demonstration of how to estimate a quasi-binomial model with the GLIMMIX procedure in SAS.
proc glimmix data = _last_;
  model y = age number start / link = logit solution;
  _variance_ = _mu_ * (1 - _mu_);
  random _residual_;
run;
/*
Model Information
Data Set                    WORK.KYPHOSIS
Response Variable           y
Response Distribution       Unknown
Link Function               Logit
Variance Function           _mu_ * (1-_mu_)
Variance Matrix             Diagonal
Estimation Technique        Quasi-Likelihood
Degrees of Freedom Method   Residual

Parameter Estimates
                         Standard
Effect     Estimate         Error    DF   t Value   Pr > |t|
Intercept   -2.0369        1.3853    77     -1.47     0.1455
age         0.01093      0.006160    77      1.77     0.0800
number       0.4106        0.2149    77      1.91     0.0598
start       -0.2065       0.06470    77     -3.19     0.0020
Residual     0.9132             .     .         .          .
*/
For comparison, the same model is also estimated with the R glm() function, showing identical output.
summary(glm(data = kyphosis, Kyphosis ~ ., family = quasibinomial))
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -2.03693    1.38527  -1.470  0.14552
# Age           0.01093    0.00616   1.774  0.07996 .
# Number        0.41060    0.21489   1.911  0.05975 .
# Start        -0.20651    0.06470  -3.192  0.00205 **
# ---
# (Dispersion parameter for quasibinomial family taken to be 0.913249)
When modeling the frequency measure in operational risk with regressions, most modelers prefer Poisson or Negative Binomial regressions as the industry best practice. However, as an alternative approach, Quasi-Poisson regression provides a more flexible estimation routine with at least two benefits. First of all, Quasi-Poisson regression can address both over-dispersion and under-dispersion by assuming that the variance is a function of the mean such that VAR(Y|X) = Theta * MEAN(Y|X), where Theta > 1 indicates over-dispersion and Theta < 1 under-dispersion. Secondly, the coefficients estimated by Quasi-Poisson regression are identical to those from standard Poisson regression, which is considered the prevailing practice in the industry.
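The variance relationship VAR(Y|X) = Theta * MEAN(Y|X) can be illustrated with a small Python simulation (not part of the original analysis): counts are drawn from a Poisson-gamma mixture, so their variance exceeds their mean, and the moment-based estimate of Theta comes out above 1.

```python
import math
import random

def rpois(lam):
    """Poisson draw via Knuth's multiplication method (for modest lam)."""
    l, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= l:
            return k
        k += 1

random.seed(2015)
mu = 4.0
# gamma-distributed rates => over-dispersed (negative binomial) counts
y = [rpois(random.gammavariate(2.0, mu / 2.0)) for _ in range(5000)]

mean_y = sum(y) / len(y)
theta = sum((yi - mean_y) ** 2 for yi in y) / (len(y) - 1) / mean_y
print(theta > 1)  # variance exceeds the mean, i.e. over-dispersion
```

With a gamma shape of 2, the theoretical dispersion is 1 + mu / 2 = 3, so the simulated Theta lands well above 1, which is the situation Quasi-Poisson regression is designed to absorb.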
While Quasi-Poisson regression can be easily estimated with glm() in R, its estimation in SAS is not as straightforward. Luckily, with the GLIMMIX procedure, we can estimate Quasi-Poisson regression by directly specifying the functional relationship between the variance and the mean while making no distributional assumption in the MODEL statement, as demonstrated below.
proc glimmix data = credit_count;
  model MAJORDRG = AGE ACADMOS MINORDRG OWNRENT / link = log solution;
  _variance_ = _mu_;
  random _residual_;
run;
/*
Model Information
Data Set                    WORK.CREDIT_COUNT
Response Variable           MAJORDRG
Response Distribution       Unknown
Link Function               Log
Variance Function           _mu_
Variance Matrix             Diagonal
Estimation Technique        Quasi-Likelihood
Degrees of Freedom Method   Residual

Fit Statistics
-2 Log Quasi-Likelihood          19125.57
Quasi-AIC  (smaller is better)   19135.57
Quasi-AICC (smaller is better)   19135.58
Quasi-BIC  (smaller is better)   19173.10
Quasi-CAIC (smaller is better)   19178.10
Quasi-HQIC (smaller is better)   19148.09
Pearson Chi-Square               51932.87
Pearson Chi-Square / DF              3.86

Parameter Estimates
                         Standard
Effect      Estimate        Error      DF   t Value   Pr > |t|
Intercept    -1.3793      0.08613   13439    -16.01     <.0001
AGE          0.01039     0.002682   13439      3.88     0.0001
ACADMOS     0.001532     0.000385   13439      3.98     <.0001
MINORDRG      0.4611      0.01348   13439     34.22     <.0001
OWNRENT      -0.1994      0.05568   13439     -3.58     0.0003
Residual      3.8643            .       .         .          .
*/
For comparison, we also estimated a Quasi-Poisson regression in R, showing identical statistical results.
summary(glm(MAJORDRG ~ AGE + ACADMOS + MINORDRG + OWNRENT,
            data = credit_count, family = quasipoisson(link = "log")))
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.3793249  0.0861324 -16.014  < 2e-16 ***
# AGE          0.0103949  0.0026823   3.875 0.000107 ***
# ACADMOS      0.0015322  0.0003847   3.983 6.84e-05 ***
# MINORDRG     0.4611297  0.0134770  34.216  < 2e-16 ***
# OWNRENT     -0.1993933  0.0556757  -3.581 0.000343 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for quasipoisson family taken to be 3.864409)
#
# Null deviance:     24954 on 13443 degrees of freedom
# Residual deviance: 22048 on 13439 degrees of freedom
# AIC: NA
In the Loss Distribution Approach (LDA) for operational risk models, multiple distributions, including Log Normal, Gamma, Burr, Pareto, and so on, can be considered candidates for the distribution of severity measures. However, a challenge remains in stress testing exercises, e.g. CCAR, where operational losses need to be related to macro-economic scenarios described by a set of macro-economic attributes.
As a result, a more sensible approach for modeling operational losses in the annual CCAR exercise might be the regression-based approach, which can intuitively link the severity measure of operational losses to macro-economic drivers with an explicit functional form within the framework of Generalized Linear Models (GLM). While the 2-parameter Pareto distribution and the 3-parameter Burr distribution are theoretically attractive, their implementations in the regression setting can become difficult and even impractical without off-the-shelf modeling tools and variable selection routines. In such a situation, Log Normal and Gamma distributional assumptions are much more realistic, with successful applications in actuarial practice. For details, please see "Severity Distributions for GLMs" by Fu and Moncher (2004).
While Log Normal and Gamma are the most popular choices for the severity model, each has its own pros and cons. For instance, while the Log Normal distributional assumption is extremely flexible and easy to understand, its predicted outcomes need to be adjusted for estimation bias. Fortunately, both SAS, e.g. the SEVERITY procedure, and R, e.g. the fitdistrplus package, provide convenient interfaces for distribution selection based on goodness-of-fit statistics and information criteria.
library(fitdistrplus)
library(insuranceData)

Fit1 <- fitdist(AutoCollision$Severity, dist = "lnorm", method = "mme")
Fit2 <- fitdist(AutoCollision$Severity, dist = "gamma", method = "mme")
gofstat(list(Fit1, Fit2))
# Goodness-of-fit statistics
#                              1-mme-lnorm 2-mme-gamma
# Kolmogorov-Smirnov statistic   0.1892567   0.1991059
# Cramer-von Mises statistic     0.2338694   0.2927953
# Anderson-Darling statistic     1.5772642   1.9370056
#
# Goodness-of-fit criteria
#                                1-mme-lnorm 2-mme-gamma
# Aikake's Information Criterion    376.2738    381.2264
# Bayesian Information Criterion    379.2053    384.1578
In the above output, Log Normal fits marginally better than Gamma in this particular case. Since Log(Severity) under the Log Normal assumption and Severity under the Gamma assumption both belong to the exponential family, it is convenient to employ glm() together with related variable selection routines in the model development exercise.
summary(mdl1 <- glm(log(Severity) ~ -1 + Vehicle_Use, data = AutoCollision,
                    family = gaussian(link = "identity")))
# Coefficients:
#                       Estimate Std. Error t value Pr(>|t|)
# Vehicle_UseBusiness    5.92432    0.07239   81.84   <2e-16 ***
# Vehicle_UseDriveLong   5.57621    0.07239   77.03   <2e-16 ***
# Vehicle_UseDriveShort  5.43405    0.07239   75.07   <2e-16 ***
# Vehicle_UsePleasure    5.35171    0.07239   73.93   <2e-16 ***

summary(mdl2 <- glm(Severity ~ -1 + Vehicle_Use, data = AutoCollision,
                    family = Gamma(link = "log")))
# Coefficients:
#                       Estimate Std. Error t value Pr(>|t|)
# Vehicle_UseBusiness    5.97940    0.08618   69.38   <2e-16 ***
# Vehicle_UseDriveLong   5.58072    0.08618   64.76   <2e-16 ***
# Vehicle_UseDriveShort  5.44560    0.08618   63.19   <2e-16 ***
# Vehicle_UsePleasure    5.36225    0.08618   62.22   <2e-16 ***
As shown above, the estimated coefficients are very similar in both the Log Normal and Gamma regressions, while the standard errors differ due to the different distributional assumptions. However, please note that the predicted values of the Log Normal regression should be adjusted by adding (RMSE ^ 2) / 2 to the linear predictor before applying EXP().
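The retransformation bias behind that adjustment is easy to verify numerically. The short Python sketch below (simulated data, not from the AutoCollision example) shows that exponentiating the fitted mean of log(Severity) understates the mean severity unless sigma^2 / 2 is added back first.

```python
import math
import random

random.seed(1)
mu, sigma = 5.5, 0.8
severity = [random.lognormvariate(mu, sigma) for _ in range(200000)]

# fit on the log scale, as the Log Normal regression does
logs = [math.log(s) for s in severity]
m = sum(logs) / len(logs)
s2 = sum((x - m) ** 2 for x in logs) / (len(logs) - 1)

naive = math.exp(m)              # biased low
adjusted = math.exp(m + s2 / 2)  # adds (RMSE ^ 2) / 2 before EXP()
actual = sum(severity) / len(severity)
print(naive < actual)  # the unadjusted back-transform understates the mean
```

The adjusted back-transform recovers the sample mean closely, while the naive one misses it by roughly exp(sigma^2 / 2), about 38% here.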
In the coursework of time series analysis, we are taught that a time series regression of Y on X is valid only when both X and Y are stationary, due to the so-called "spurious regression problem". However, there is one exception: if X and Y, albeit non-stationary, share a common trend such that their trends cancel each other out, then X and Y are co-integrated and the regression of Y on X is valid. As a result, it is important to test for co-integration between X and Y.
Following the definition of co-integration, it is straightforward to formulate a test procedure. First, construct a linear combination of Y and X such that e = Y - (a + b * X). Second, test whether e is stationary with the ADF test. If e is stationary, then X and Y are co-integrated. This two-step procedure is also called the Engle-Granger co-integration test.
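The two steps above can be sketched in a few lines of Python on simulated data. This is a bare-bones illustration only: it runs a Dickey-Fuller regression with no augmentation lags, and it does not compute p-values (in practice the tau statistic must be compared against Engle-Granger critical values, not standard t tables).

```python
import random

def ols(x, y):
    """Simple OLS intercept / slope for one regressor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

random.seed(42)
# a common stochastic trend => x and y are cointegrated by construction
trend = [0.0]
for _ in range(499):
    trend.append(trend[-1] + random.gauss(0, 1))
x = [t + random.gauss(0, 0.5) for t in trend]
y = [1.0 + 2.0 * t + random.gauss(0, 0.5) for t in trend]

# step 1: long-run regression of y on x, keep the residuals
a, b = ols(x, y)
e = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# step 2: Dickey-Fuller regression of delta e_t on e_{t-1} (zero mean)
num = sum(e[t - 1] * (e[t] - e[t - 1]) for t in range(1, len(e)))
den = sum(e[t - 1] ** 2 for t in range(1, len(e)))
rho = num / den
resid = [(e[t] - e[t - 1]) - rho * e[t - 1] for t in range(1, len(e))]
s2 = sum(r * r for r in resid) / (len(resid) - 1)
tau = rho / (s2 / den) ** 0.5
print(tau)  # strongly negative => stationary residuals => cointegration
```

Since the residuals here are close to white noise by construction, the tau statistic is deeply negative, which is the signature of co-integrated series.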
Below is a SAS macro implementing the Engle-Granger co-integration test to show the long-term relationship between GDP and other macro-economic variables, e.g. Personal Consumption and Personal Disposable Income.
%macro eg_coint(data = , y = , xs = );
*********************************************************************;
* THIS SAS MACRO IMPLEMENTS THE ENGLE-GRANGER COINTEGRATION TEST IN *;
* A BATCH MODE TO PROCESS MANY TIME SERIES                          *;
*********************************************************************;
* INPUT PARAMETERS:                                                 *;
*  DATA: AN INPUT SAS DATASET                                       *;
*  Y   : A DEPENDENT VARIABLE IN THE COINTEGRATION REGRESSION       *;
*  XS  : A LIST OF INDEPENDENT VARIABLES IN THE COINTEGRATION       *;
*        REGRESSION                                                 *;
*********************************************************************;
* AUTHOR: WENSUI.LIU@53.COM                                         *;
*********************************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*";

%local sig loop;
%let sig = 0.1;
%let loop = 1;

%do %while (%scan(&xs, &loop) ne %str());
  %let x = %scan(&xs, &loop);

  ods listing close;
  ods output FitStatistics = _fit;
  proc reg data = &data;
    model &y = &x;
    output out = _1 residual = r;
  run;
  quit;

  proc sql noprint;
    select cvalue2 into :r2 from _fit where upcase(label2) = "R-SQUARE";
  quit;

  proc arima data = _1;
    ods output stationaritytests =
      _adf1 (where = (upcase(type) = "ZERO MEAN" and lags = 1)
             drop = rho probrho fvalue probf);
    identify var = r stationarity = (adf = 1);
  run;
  quit;
  ods listing;

  %if &loop = 1 %then %do;
    data _adf;
      format vars $32. lterm_r2 best12. flg_coint $3.;
      set _adf1 (drop = type lags);
      vars = upcase("&x");
      lterm_r2 = &r2;
      if probtau < &sig then flg_coint = "YES";
      else flg_coint = "NO";
    run;
  %end;
  %else %do;
    data _adf;
      set _adf _adf1 (in = new drop = type lags);
      if new then do;
        vars = upcase("&x");
        lterm_r2 = &r2;
        if probtau < &sig then flg_coint = "YES";
        else flg_coint = "NO";
      end;
    run;
  %end;

  %let loop = %eval(&loop + 1);
%end;

proc sort data = _last_;
  by descending flg_coint probtau;
run;

proc report data = _last_ box spacing = 1 split = "/" nowd;
  column("ENGLE-GRANGER COINTEGRATION TEST BETWEEN %UPCASE(&Y) AND EACH VARIABLE BELOW/ "
         vars lterm_r2 flg_coint tau probtau);
  define vars      / "VARIABLES" width = 35;
  define lterm_r2  / "LONG-RUN/R-SQUARED" width = 15 format = 9.4 center;
  define flg_coint / "COINTEGRATION/FLAG" width = 15 center;
  define tau       / "TAU STATISTIC/FOR ADF TEST" width = 20 format = 15.4;
  define probtau   / "P-VALUE FOR/ADF TEST" width = 15 format = 9.4 center;
run;

%mend eg_coint;

%eg_coint(data = sashelp.citiqtr, y = gdp, xs = gyd gc);
---------------------------------------------------------------------------------------------------------- | ENGLE-GRANGER COINTEGRATION TEST BETWEEN GDP AND EACH VARIABLE BELOW | | | | LONG-RUN COINTEGRATION TAU STATISTIC P-VALUE FOR | |VARIABLES R-SQUARED FLAG FOR ADF TEST ADF TEST | |--------------------------------------------------------------------------------------------------------| |GC | 0.9985 | YES | -2.8651| 0.0051 | |-----------------------------------+---------------+---------------+--------------------+---------------| |GYD | 0.9976 | YES | -1.7793| 0.0715 | ----------------------------------------------------------------------------------------------------------
From the output, it is interesting to see that U.S. GDP is driven more by Personal Consumption than by Personal Disposable Income.
To determine whether a time series is stationary or has a unit root, three methods can be used:
A. The most intuitive way, which is sufficient in most cases, is to eyeball the ACF (autocorrelation function) plot of the time series. An ACF pattern with a fast decay might imply a stationary series.
B. Statistical tests for unit roots, e.g. the ADF (augmented Dickey–Fuller) or PP (Phillips–Perron) test, can be employed as well. With the null hypothesis of a unit root, a statistically significant outcome might suggest a stationary series.
C. In addition to the aforementioned tests for unit roots, statistical tests for stationarity, e.g. the KPSS (Kwiatkowski–Phillips–Schmidt–Shin) test, can be a useful complement. With the null hypothesis of stationarity, a statistically insignificant outcome might suggest a stationary series.
By testing for both the unit root and stationarity, the analyst should gain a better understanding of the nature of a specific time series.
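Since eyeballing the ACF is the first method listed above, here is a minimal Python sketch of how the sample autocorrelation at lag k is actually computed; a plain white-noise series (simulated here, not taken from the post) is used, so the ACF dies off immediately, as expected for a stationary series.

```python
import random

def acf(series, k):
    """Sample autocorrelation of a series at lag k."""
    n = len(series)
    m = sum(series) / n
    c0 = sum((v - m) ** 2 for v in series)
    ck = sum((series[t] - m) * (series[t + k] - m) for t in range(n - k))
    return ck / c0

random.seed(9)
white_noise = [random.gauss(0, 1) for _ in range(2000)]
# for a stationary series the ACF decays quickly toward zero
print([round(acf(white_noise, k), 2) for k in (1, 5, 10)])
```

For a random walk, by contrast, the same function would return values close to 1 across many lags, which is the slow-decay pattern that hints at a unit root.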
The SAS macro below is a convenient wrapper for running stationarity tests on many time series in a production environment. (Please note that this macro only works in SAS 9.2 or above.)
%macro stationary(data = , vars =);
***********************************************************;
* THIS SAS MACRO IS TO DO STATIONARITY TESTS FOR MANY     *;
* TIME SERIES                                             *;
* ------------------------------------------------------- *;
* INPUT PARAMETERS:                                       *;
*  DATA: AN INPUT SAS DATASET                             *;
*  VARS: A LIST OF TIME SERIES                            *;
* ------------------------------------------------------- *;
* AUTHOR: WENSUI.LIU@53.COM                               *;
***********************************************************;

options nocenter nonumber nodate mprint mlogic symbolgen
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*";

%local sig loop;
%let sig = 0.1;
%let loop = 1;

%do %while (%scan(&vars, &loop) ne %str());
  %let x = %scan(&vars, &loop);

  proc sql noprint;
    select int(12 * ((count(&x) / 100) ** 0.25)) into :nlag1 from &data;
    select int(max(1, (count(&x) ** 0.5) / 5)) into :nlag2 from &data;
  quit;

  ods listing close;
  ods output
    kpss = _kpss (drop = model lags rename = (prob = probeta))
    adf = _adf (drop = model lags rho probrho fstat probf
                rename = (tau = adf_tau probtau = adf_probtau))
    philperron = _pp (drop = model lags rho probrho
                      rename = (tau = pp_tau probtau = pp_probtau));
  proc autoreg data = &data;
    model &x = / noint
      stationarity = (adf = &nlag1, phillips = &nlag2,
                      kpss = (kernel = nw lag = &nlag1));
  run;
  quit;
  ods listing;

  proc sql noprint;
    create table _1 as
    select
      upcase("&x")      as vars length = 32,
      upcase(_adf.type) as type,
      _adf.adf_tau,
      _adf.adf_probtau,
      _pp.pp_tau,
      _pp.pp_probtau,
      _kpss.eta,
      _kpss.probeta,
      case when _adf.adf_probtau < &sig or _pp.pp_probtau < &sig or
                _kpss.probeta > &sig then "*" else " " end as _flg,
      &loop        as _i,
      monotonic()  as _j
    from _adf
    inner join _pp on _adf.type = _pp.type
    inner join _kpss on _adf.type = _kpss.type;
  quit;

  %if &loop = 1 %then %do;
    data _result;
      set _1;
    run;
  %end;
  %else %do;
    proc append base = _result data = _1;
    run;
  %end;

  proc datasets library = work nolist;
    delete _1 _adf _pp _kpss / memtype = data;
  quit;

  %let loop = %eval(&loop + 1);
%end;

proc sort data = _result;
  by _i _j;
run;

proc report data = _result box spacing = 1 split = "/" nowd;
  column("STATISTICAL TESTS FOR STATIONARITY/ "
         vars type adf_tau adf_probtau pp_tau pp_probtau eta probeta _flg);
  define vars        / "VARIABLES/ " width = 20 group order order = data;
  define type        / "TYPE/ " width = 15 order order = data;
  define adf_tau     / "ADF TEST/FOR/UNIT ROOT" width = 10 format = 8.2;
  define adf_probtau / "P-VALUE/FOR/ADF TEST" width = 10 format = 8.4 center;
  define pp_tau      / "PP TEST/FOR/UNIT ROOT" width = 10 format = 8.2;
  define pp_probtau  / "P-VALUE/FOR/PP TEST" width = 10 format = 8.4 center;
  define eta         / "KPSS TEST/FOR/STATIONARY" width = 10 format = 8.2;
  define probeta     / "P-VALUE/FOR/KPSS TEST" width = 10 format = 8.4 center;
  define _flg        / "STATIONARY/FLAG" width = 10 center;
run;
%mend stationary;
In the Advanced Measurement Approaches (AMA) for operational risk models, a bank needs to segment operational losses into homogeneous segments known as "Units of Measure (UoM)", which are often defined by the combination of lines of business (LOB) and Basel II event types. However, how do we establish that the losses in one UoM are statistically different from those in another? The answer is to test whether the losses from various UoMs are distributionally different or equivalent.
Empirically, the Kolmogorov-Smirnov (K-S) test is often used to test whether two samples come from the same distribution. In the example below, although x and y share the same mean, the K-S test still shows a significant result due to their different variances.
n <- 300
set.seed(2015)
x <- rnorm(n, 0, 1)
y <- rnorm(n, 0, 1.5)

### 2-SAMPLE DISTRIBUTIONAL COMPARISON ###
ks.test(x, y, alternative = "two.sided")
# Two-sample Kolmogorov-Smirnov test
#
# data: x and y
# D = 0.1567, p-value = 0.001268
# alternative hypothesis: two-sided
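The D statistic reported by ks.test is simply the largest vertical gap between the two empirical CDFs. The Python sketch below computes it directly on freshly simulated data with the same distributional setup (the p-value calculation is not reproduced here).

```python
import random

def ks_statistic(x, y):
    """Two-sample K-S statistic: sup |F1(v) - F2(v)| over the pooled sample."""
    xs, ys = sorted(x), sorted(y)
    grid = xs + ys
    def ecdf(sample, v):
        return sum(1 for s in sample if s <= v) / len(sample)
    return max(abs(ecdf(xs, v) - ecdf(ys, v)) for v in grid)

random.seed(2015)
x = [random.gauss(0, 1.0) for _ in range(300)]
y = [random.gauss(0, 1.5) for _ in range(300)]
print(round(ks_statistic(x, y), 4))
```

Because the two samples differ only in spread, the gap between the CDFs opens up in the tails rather than at the center, which is exactly the kind of difference a mean-based test would miss.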
However, the K-S test cannot be generalized to K-sample cases with K > 2. In such a scenario, the univariate coverage test or the more general multivariate MRPP (multi-response permutation procedure) test might be more appropriate. The Blossom package developed by Talbert and Cade (https://www.fort.usgs.gov/products/23735) provides convenient functions implementing both tests, as shown below.
z <- rnorm(n, 0, 2)
df <- data.frame(x = c(x, y, z), g = unlist(lapply(c(1, 2, 3), rep, n)))

### K-SAMPLE DISTRIBUTIONAL COMPARISON ###

# COVERAGE TEST FOR THE UNIVARIATE RESPONSES
library(Blossom)
ctest <- coverage(df$x, df$g)
summary(ctest)
# Results:
# Observed coverage statistic              : 1.870273
# Mean of coverage statistic               : 1.774817
# Estimated variance of coverage statistic : 0.002275862
# Standard deviation of the variance
#   of the coverage statistic              : 5.108031e-05
#
# Observed standardized coverage statistic : 2.00093
# Skewness of observed coverage statistic  : 0.08127759
# Probability (Pearson Type III)
#   of a larger or equal coverage statistic: 0.02484709
# Probability (Resampled)
#   of a larger or equal coverage statistic: 0.02475*

# MULTIRESPONSE PERMUTATION PROCEDURE FOR MULTIVARIATE RESPONSES
mtest <- mrpp(x, g, df)
summary(mtest)
# Results:
# Delta Observed : 1.676303
# Delta Expected : 1.708194
# Delta Variance : 4.262303e-06
# Delta Skewness : -1.671773
#
# Standardized test statistic : -15.44685
# Probability (Pearson Type III)
#   of a smaller or equal delta : 9.433116e-09***