## Archive for **November 2015**

## Ibis – A New Kid in Town

Developed by Wes McKinney, pandas is an efficient and powerful data analysis tool for data scientists working in Python. Like R, pandas reads the data into memory. As a result, we often run out of memory when analyzing large data sets with pandas.

Similar to Blaze, ibis is a new data analysis framework in Python built on top of other back-end data engines, such as SQLite and Impala. Even better, ibis is more compatible with pandas and performs better than Blaze.

In a previous post (https://statcompute.wordpress.com/2015/03/27/a-comparison-between-blaze-and-pandas), I showed the efficiency of Blaze with a simple example. The demonstration below shows that, when applied to the same data with the SQLite engine, ibis is 50% more efficient than Blaze in terms of the “real time”, i.e. the elapsed wall-clock time listed as `real` at the end of the output.

    import ibis

    tbl = ibis.sqlite.connect('//home/liuwensui/Documents/data/flights.db').table('tbl2008')
    exp = tbl[tbl.DayOfWeek > 1].group_by("DayOfWeek").aggregate(avg_AirTime = tbl.AirTime.mean())
    pd = exp.execute()
    print(pd)
    #i  DayOfWeek  avg_AirTime
    #0          2   103.214930
    #1          3   103.058508
    #2          4   103.467138
    #3          5   103.557539
    #4          6   107.400631
    #5          7   104.864885
    #
    #real 0m10.346s
    #user 0m9.585s
    #sys  0m1.181s
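Because ibis delegates the work to the back-end engine, the filter, grouping, and averaging run inside SQLite and only the small aggregated result is returned to Python. The sketch below illustrates that idea with the standard-library `sqlite3` module and a hand-written query. It is an approximation of the kind of SQL such an expression corresponds to, not the exact statement ibis generates, and the toy rows are hypothetical stand-ins for the real `tbl2008` table.

```python
import sqlite3

# Hypothetical toy rows (DayOfWeek, AirTime); the real flights data is not reproduced here.
rows = [(1, 90.0), (2, 100.0), (2, 110.0), (3, 95.0), (3, 105.0), (3, 100.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl2008 (DayOfWeek INTEGER, AirTime REAL)")
conn.executemany("INSERT INTO tbl2008 VALUES (?, ?)", rows)

# Filter, group, and average inside SQLite; only the aggregated
# rows are fetched back into Python memory.
query = """
    SELECT DayOfWeek, AVG(AirTime) AS avg_AirTime
    FROM tbl2008
    WHERE DayOfWeek > 1
    GROUP BY DayOfWeek
    ORDER BY DayOfWeek
"""
result = conn.execute(query).fetchall()
conn.close()

for day, avg_air_time in result:
    print(day, avg_air_time)
```

Since the raw rows never leave the database, memory use in Python depends on the size of the result rather than the size of the table.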

## Quasi-Binomial Model in SAS

Similar to quasi-Poisson regressions, quasi-binomial regressions address excess variance by including a dispersion parameter. Beyond handling over-dispersion, quasi-binomial regressions are also valuable in other areas, such as loss given default (LGD) model development in credit risk modeling, because of their flexible distributional assumption.

Measured as the ratio of net charge-off (NCO) to gross charge-off (GCO), LGD can take any value in the range [0, 1], and the industry currently has no consensus on its distributional assumption. An advantage of quasi-binomial regression is that it assumes no specific distribution and merely specifies the conditional mean, together with a variance function, for the model response. The trade-off is that likelihood-based measures such as AIC and BIC are not available.
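To make these moment assumptions concrete, a quasi-binomial model with a logit link can be written as follows. The notation is introduced here for illustration and is not taken from the original post: $x_i$ is the covariate vector, $\beta$ the coefficient vector, $\phi$ the dispersion parameter, $n$ the number of observations, and $p$ the number of coefficients.

$$
E[y_i \mid x_i] = \mu_i = \frac{1}{1 + \exp(-x_i^{\top}\beta)}, \qquad
\mathrm{Var}(y_i \mid x_i) = \phi \, \mu_i (1 - \mu_i)
$$

$$
\hat{\phi} = \frac{1}{n - p} \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i (1 - \hat{\mu}_i)}
$$

The second expression is the usual Pearson-based estimate of the dispersion. Values of $\hat{\phi}$ above 1 point to over-dispersion relative to the binomial variance, and values below 1 to under-dispersion.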

Below is a demonstration of how to estimate a quasi-binomial model with the GLIMMIX procedure in SAS.

    proc glimmix data = _last_;
      model y = age number start / link = logit solution;
      _variance_ = _mu_ * (1-_mu_);
      random _residual_;
    run;
    /*
    Model Information
    Data Set                     WORK.KYPHOSIS
    Response Variable            y
    Response Distribution        Unknown
    Link Function                Logit
    Variance Function            _mu_ * (1-_mu_)
    Variance Matrix              Diagonal
    Estimation Technique         Quasi-Likelihood
    Degrees of Freedom Method    Residual

    Parameter Estimates
                             Standard
    Effect       Estimate       Error    DF    t Value    Pr > |t|
    Intercept     -2.0369      1.3853    77      -1.47      0.1455
    age           0.01093    0.006160    77       1.77      0.0800
    number         0.4106      0.2149    77       1.91      0.0598
    start         -0.2065     0.06470    77      -3.19      0.0020
    Residual       0.9132           .     .          .           .
    */

For comparison, the same model is also estimated with the R glm() function, which yields identical results.

    library(rpart)  # provides the kyphosis data set
    summary(glm(data = kyphosis, Kyphosis ~ ., family = quasibinomial))
    #Coefficients:
    #            Estimate Std. Error t value Pr(>|t|)
    #(Intercept) -2.03693    1.38527  -1.470  0.14552
    #Age          0.01093    0.00616   1.774  0.07996 .
    #Number       0.41060    0.21489   1.911  0.05975 .
    #Start       -0.20651    0.06470  -3.192  0.00205 **
    #---
    #(Dispersion parameter for quasibinomial family taken to be 0.913249)
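For readers who want to see what the estimation involves, here is a minimal pure-Python sketch of quasi-likelihood fitting with a logit link and variance function mu * (1 - mu), followed by the Pearson-based dispersion estimate. It is an illustration under stated assumptions rather than the algorithm used internally by SAS or R: the data are hypothetical fractional responses with a single covariate, and the helper names `fit_quasi_binomial` and `pearson_dispersion` are made up for this example.

```python
import math

def fit_quasi_binomial(x, y, n_iter=50, tol=1e-12):
    """Fit logit(mu) = b0 + b1 * x by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [m * (1.0 - m) for m in mu]  # variance function mu * (1 - mu)
        # Working response of the IRLS step
        z = [e + (yi - m) / wi for e, yi, m, wi in zip(eta, y, mu, w)]
        # Weighted normal equations for the two coefficients
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

def pearson_dispersion(x, y, b0, b1):
    """Pearson chi-square divided by the residual degrees of freedom (n - 2)."""
    mu = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    chi2 = sum((yi - m) ** 2 / (m * (1.0 - m)) for yi, m in zip(y, mu))
    return chi2 / (len(y) - 2)

# Hypothetical fractional responses in (0, 1), e.g. LGD-like ratios
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0.10, 0.05, 0.30, 0.25, 0.55, 0.40, 0.80, 0.70]

b0, b1 = fit_quasi_binomial(x, y)
phi = pearson_dispersion(x, y, b0, b1)
print("intercept:", b0, "slope:", b1, "dispersion:", phi)
```

The coefficients depend only on the mean and variance functions. The dispersion is estimated afterwards and rescales the standard errors, which is why no full likelihood, and therefore no AIC or BIC, is available.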