Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Query Pandas DataFrame with SQL

Similar to SQLDF package providing a seamless interface between SQL statement and R data.frame, PANDASQL allows python users to use SQL querying Pandas DataFrames.

Below are some examples showing how to use PANDASQL to do SELECT / AGGREGATE / JOIN operations. More information is also available on the GitHub (https://github.com/yhat/pandasql).

In [1]: import sas7bdat as sas

In [2]: import pandas as pd

In [3]: import pandasql as pdsql

In [4]: data = sas.SAS7BDAT("accepts.sas7bdat")

In [5]: df = data.toDataFrame()

In [6]: pysql = lambda q: pdsql.sqldf(q, globals())

In [7]: ### SELECT ###

In [8]: str1 = "select bureau_score, ltv from df where bureau_score < 600 and ltv > 100 limit 3;"

In [9]: df1 = pysql(str1)

In [10]: df1
Out[10]: 
   bureau_score  ltv
0           590  103
1           575  120
2           538  113

In [11]: ### AGGREGATE ###

In [12]: str2 = "select ltv, min(bureau_score) as min_score, max(bureau_score) as max_score from df group by ltv order by ltv DESC;"

In [13]: df2 = pysql(str2);

In [14]: df2.head(3)
Out[14]: 
   ltv  min_score  max_score
0  176        709        709
1  168        723        723
2  167        688        688

In [15]: ### JOIN ###

In [16]: str3 = "select b.*, a.bureau_score from df a inner join df2 b on a.ltv = b.ltv order by ltv DESC;"

In [17]: df3 = pysql(str3)

In [18]: df3.head(3)
Out[18]: 
   ltv  min_score  max_score  bureau_score
0  176        709        709           709
1  168        723        723           723
2  167        688        688           688

Written by statcompute

November 1, 2014 at 11:23 pm

Posted in Big Data, PYTHON, S+/R

Tagged with , ,

Flexible Beta Modeling

library(betareg)
library(sas7bdat)

df1 <- read.sas7bdat('lgd.sas7bdat')
df2 <- df1[df1$y < 1, ]

fml <- as.formula('y ~ x2 + x3 + x4 + x5 + x6 | x3 + x4 | x1 + x2')

### LATENT-CLASS BETA REGRESSION: AIC = -565 ###
mdl1 <- betamix(fml, data = df2, k = 2, FLXcontrol = list(iter.max = 500, minprior = 0.1))
print(mdl1)
#betamix(formula = fml, data = df2, k = 2, FLXcontrol = list(iter.max = 500, 
#    minprior = 0.1))
#
#Cluster sizes:
#  1   2 
#157 959 

summary(mdl1, which = 'concomitant')
#            Estimate Std. Error z value Pr(>|z|)   
#(Intercept) -1.35153    0.41988 -3.2188 0.001287 **
#x1           2.92537    1.13046  2.5878 0.009660 **
#x2           2.82809    1.42139  1.9897 0.046628 * 

summary(mdl1)
#$Comp.1$mean
#              Estimate Std. Error z value  Pr(>|z|)    
#(Intercept) -0.8963228  1.0385545 -0.8630 0.3881108    
#x2           3.1769062  0.6582108  4.8266 1.389e-06 ***
#x3          -0.0520060  0.0743714 -0.6993 0.4843805    
#x4           4.9642998  1.4204071  3.4950 0.0004741 ***
#x5           0.0021647  0.0022659  0.9554 0.3393987    
#x6           0.0248573  0.0062982  3.9467 7.922e-05 ***
#
#$Comp.1$precision
#            Estimate Std. Error z value  Pr(>|z|)    
#(Intercept) -5.37817    1.44817 -3.7138 0.0002042 ***
#x3           0.45009    0.10094  4.4589 8.239e-06 ***
#x4           3.06969    1.41450  2.1702 0.0299948 *  
#
#$Comp.2
#$Comp.2$mean
#              Estimate Std. Error z value  Pr(>|z|)    
#(Intercept) -1.8737088  0.3888454 -4.8186 1.445e-06 ***
#x2          -0.6318086  0.1892501 -3.3385 0.0008424 ***
#x3           0.1786425  0.0265428  6.7303 1.693e-11 ***
#x4           2.0646272  0.5256002  3.9281 8.561e-05 ***
#x5          -0.0064821  0.0014053 -4.6127 3.974e-06 ***
#x6           0.0018828  0.0022873  0.8231 0.4104318    
#
#$Comp.2$precision
#            Estimate Std. Error z value Pr(>|z|)   
#(Intercept) 1.092403   0.616974  1.7706 0.076630 . 
#x3          0.017330   0.040024  0.4330 0.665029   
#x4          2.138812   0.717702  2.9801 0.002882 **


### BETA REGRESSION TREE: AIC = -578 ###
mdl2 <- betatree(fml, data = df2, minsplit = 100)
print(mdl2)
#1) x2 <= 0.08584895; criterion = 1, statistic = 154.716
#  2)*  weights = 121 
#Terminal node model
#betaReg fit with coefficients:
#      (Intercept)                 x2                 x3                 x4  
#         3.307359          -2.854351          -0.262815          -2.414481  
#               x5                 x6  (phi)_(Intercept)           (phi)_x3  
#        -0.007555           0.030346           1.003767          -0.002907  
#         (phi)_x4  
#         2.528602  
#
#1) x2 > 0.08584895
#  3)*  weights = 995 
#Terminal node model
#betaReg fit with coefficients:
#      (Intercept)                 x2                 x3                 x4  
#        -2.134931          -0.194830           0.168136           2.811077  
#               x5                 x6  (phi)_(Intercept)           (phi)_x3  
#        -0.002070           0.004677          -1.018102           0.151778  
#         (phi)_x4  
#         2.142995  

sctest(mdl2, node = 1)
#                x1       x2
#statistic 113.4781 154.7165
#p.value     0.0000   0.0000

summary(mdl2)
#$`2`
#
#Coefficients (mean model with logit link):
#             Estimate Std. Error z value Pr(>|z|)    
#(Intercept)  3.307359   1.091289   3.031 0.002440 ** 
#x2          -2.854351   3.644882  -0.783 0.433561    
#x3          -0.262815   0.074716  -3.518 0.000436 ***
#x4          -2.414481   1.785447  -1.352 0.176276    
#x5          -0.007555   0.002788  -2.710 0.006738 ** 
#x6           0.030346   0.006833   4.441 8.96e-06 ***
#
#Phi coefficients (precision model with log link):
#             Estimate Std. Error z value Pr(>|z|)
#(Intercept)  1.003767   1.353496   0.742    0.458
#x3          -0.002907   0.090816  -0.032    0.974
#x4           2.528602   2.344241   1.079    0.281

#$`3`
#
#Coefficients (mean model with logit link):
#             Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -2.134931   0.337784  -6.320 2.61e-10 ***
#x2          -0.194830   0.144062  -1.352  0.17625    
#x3           0.168136   0.022521   7.466 8.28e-14 ***
#x4           2.811077   0.387788   7.249 4.20e-13 ***
#x5          -0.002070   0.001136  -1.822  0.06848 .  
#x6           0.004677   0.001770   2.643  0.00822 ** 
#
#Phi coefficients (precision model with log link):
#            Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -1.01810    0.46575  -2.186 0.028821 *  
#x3           0.15178    0.03057   4.965 6.88e-07 ***
#x4           2.14300    0.56979   3.761 0.000169 ***

Written by statcompute

October 27, 2014 at 10:44 pm

Posted in S+/R, Statistical Models

Tagged with ,

Model Segmentation with Recursive Partitioning

library(party)

df1 <- read.csv("credit_count.csv")
df2 <- df1[df1$CARDHLDR == 1, ]

mdl <- mob(DEFAULT ~ MAJORDRG + MINORDRG + INCOME + OWNRENT | AGE + SELFEMPL, data = df2, family = binomial(), control = mob_control(minsplit = 1000), model = glinearModel)

print(mdl)
#1) AGE <= 22.91667; criterion = 1, statistic = 48.255
#  2)*  weights = 1116 
#Terminal node model
#Binomial GLM with coefficients:
#(Intercept)     MAJORDRG     MINORDRG       INCOME      OWNRENT  
# -0.6651905    0.0633978    0.5182472   -0.0006038    0.3071785  
#
#1) AGE > 22.91667
#  3)*  weights = 9383 
#Terminal node model
#Binomial GLM with coefficients:
#(Intercept)     MAJORDRG     MINORDRG       INCOME      OWNRENT  
# -1.4117010    0.2262091    0.2067880   -0.0003822   -0.2127193  

### TEST FOR STRUCTURAL CHANGE ###
sctest(mdl, node = 1)
#                   AGE    SELFEMPL
#statistic 4.825458e+01 20.88612025
#p.value   1.527484e-07  0.04273836

summary(mdl, node = 2)
#Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -0.6651905  0.2817480  -2.361 0.018229 *  
#MAJORDRG     0.0633978  0.3487305   0.182 0.855743    
#MINORDRG     0.5182472  0.2347656   2.208 0.027278 *  
#INCOME      -0.0006038  0.0001639  -3.685 0.000229 ***
#OWNRENT      0.3071785  0.2028491   1.514 0.129945    

summary(mdl, node = 3)
#Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -1.412e+00  1.002e-01 -14.093  < 2e-16 ***
#MAJORDRG     2.262e-01  7.067e-02   3.201  0.00137 ** 
#MINORDRG     2.068e-01  4.925e-02   4.199 2.68e-05 ***
#INCOME      -3.822e-04  4.186e-05  -9.131  < 2e-16 ***
#OWNRENT     -2.127e-01  7.755e-02  -2.743  0.00609 ** 

Written by statcompute

October 26, 2014 at 2:23 am

Posted in Credit Risk, S+/R, Statistical Models

Tagged with

Estimating a Beta Regression with The Variable Dispersion in R

pkgs <- c('sas7bdat', 'betareg', 'lmtest')
lapply(pkgs, require, character.only = T)

df1 <- read.sas7bdat("lgd.sas7bdat")
df2 <- df1[which(df1$y < 1), ]

xvar <- paste("x", 1:7, sep = '', collapse = " + ")
fml1 <- as.formula(paste("y ~ ", xvar))
fml2 <- as.formula(paste("y ~ ", xvar, "|", xvar))

# FIT A BETA MODEL WITH THE FIXED PHI
beta1 <- betareg(fml1, data = df2)
summary(beta1)

# Coefficients (mean model with logit link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.500242   0.329670  -4.551 5.35e-06 ***
# x1           0.007516   0.026020   0.289 0.772680    
# x2           0.429756   0.135899   3.162 0.001565 ** 
# x3           0.099202   0.022285   4.452 8.53e-06 ***
# x4           2.465055   0.415657   5.931 3.02e-09 ***
# x5          -0.003687   0.001070  -3.446 0.000568 ***
# x6           0.007181   0.001821   3.943 8.06e-05 ***
# x7           0.128796   0.186715   0.690 0.490319    
#
# Phi coefficients (precision model with identity link):
#       Estimate Std. Error z value Pr(>|z|)    
# (phi)   3.6868     0.1421   25.95   <2e-16 ***

# FIT A BETA MODEL WITH THE VARIABLE PHI
beta2 <- betareg(fml2, data = df2)
summary(beta2)

# Coefficients (mean model with logit link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.996661   0.336445  -5.935 2.95e-09 ***
# x1           0.007033   0.026809   0.262 0.793072    
# x2           0.371098   0.135186   2.745 0.006049 ** 
# x3           0.133356   0.022624   5.894 3.76e-09 ***
# x4           2.951245   0.401493   7.351 1.97e-13 ***
# x5          -0.003475   0.001119  -3.105 0.001902 ** 
# x6           0.006528   0.001884   3.466 0.000529 ***
# x7           0.100274   0.190915   0.525 0.599424    
#
# Phi coefficients (precision model with log link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -0.454399   0.452302  -1.005 0.315072    
# x1           0.009119   0.035659   0.256 0.798150    
# x2           0.611049   0.188225   3.246 0.001169 ** 
# x3           0.092073   0.030678   3.001 0.002689 ** 
# x4           2.248399   0.579440   3.880 0.000104 ***
# x5          -0.002188   0.001455  -1.504 0.132704    
# x6          -0.000317   0.002519  -0.126 0.899847    
# x7          -0.166226   0.256199  -0.649 0.516457    

# LIKELIHOOD RATIO TEST TO COMPARE BOTH BETA MODELS
lrtest(beta1, beta2) 

# Likelihood ratio test
#
# Model 1: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7
# Model 2: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 | x1 + x2 + x3 + x4 + x5 + x6 + x7
#   #Df LogLik Df Chisq Pr(>Chisq)    
# 1   9 231.55                        
# 2  16 257.24  7 51.38  7.735e-09 ***

Written by statcompute

October 19, 2014 at 5:25 pm

Posted in Econometrics, S+/R, Statistical Models

Tagged with ,

Julia: Function vs. Macro

using DataFrames, Benchmark

# FUNCTION TO READ CSV FILE TO DATAFRAME
function csv2df(file)
  d = readcsv(file, header = true)
  head = d[2]
  data = d[1]
  dt = DataFrame()
  for i in 1:size(head)[2]
    dt[symbol(head[i])] = data[:, i]
  end
  return dt
end

df_funct() = csv2df("survdt.csv");

# MACRO TO READ CSV FILE TO DATAFRAME
macro csv2df(file)
  d = readcsv(file, header = true)
  head = d[2]
  data = d[1]
  dt = DataFrame()
  for i in 1:size(head)[2]
    dt[symbol(head[i])] = data[:, i]
  end
  return dt
end

df_macro() = @csv2df "survdt.csv";

# FUNCTION IN THE DATAFRAME PACKAGE
df_frame() = readtable("survdt.csv");

print(compare([df_funct, df_macro, df_frame], 10))

# | Row | Function   | Average  | Relative  | Replications |
# |-----|------------|----------|-----------|--------------|
# | 1   | "df_funct" | 0.164968 | 4.07328e6 | 10           |
# | 2   | "df_macro" | 4.05e-8  | 1.0       | 10           |
# | 3   | "df_frame" | 0.063254 | 1.56183e6 | 10           |

Written by statcompute

October 10, 2014 at 1:18 am

Posted in Julia

Tagged with

A Learning Section for Julia – Recoding A String Vector to A Integer Vector

julia> using RDatasets, DataFrames

julia> data = dataset("datasets", "iris");

julia> species = array(data[:, 5]);

julia> x = int([i == "setosa" for i in species]);

julia> sum(x)
50

Written by statcompute

October 7, 2014 at 11:34 pm

Posted in Julia

Tagged with

Fitting Lasso with Julia

Julia Code

using RDatasets, DataFrames, GLMNet

data = dataset("MASS", "Boston");
y = array(data[:, 14]);
x = array(data[:, 1:13]);

cv = glmnetcv(x, y);
cv.path.betas[:, indmin(cv.meanloss)];
result = DataFrame();
result[:Vars] = names(data)[1:13];
result[:Beta] = cv.path.betas[:, indmin(cv.meanloss)];
result

# | Row | Vars    | Beta       |
# |-----|---------|------------|
# | 1   | Crim    | -0.0983463 |
# | 2   | Zn      | 0.0414416  |
# | 3   | Indus   | 0.0        |
# | 4   | Chas    | 2.68519    |
# | 5   | NOx     | -16.3066   |
# | 6   | Rm      | 3.86694    |
# | 7   | Age     | 0.0        |
# | 8   | Dis     | -1.39602   |
# | 9   | Rad     | 0.252687   |
# | 10  | Tax     | -0.0098268 |
# | 11  | PTRatio | -0.929989  |
# | 12  | Black   | 0.00902588 |
# | 13  | LStat   | -0.5225    |

R Code

library(glmnet)
data(Boston, package = "MASS")

x <- as.matrix(Boston[, 1:13])
y <- as.matrix(Boston[, 14])

cv <- cv.glmnet(x, y, nfolds = 10) 	
mdl <- glmnet(x, y, lambda = cv$lambda.min)
mdl$beta

# crim     -0.098693203
# zn        0.041588291
# indus     .          
# chas      2.681633344
# nox     -16.354590598
# rm        3.860035926
# age       .          
# dis      -1.399697121
# rad       0.255484621
# tax      -0.009935509
# ptratio  -0.931031828
# black     0.009031422
# lstat    -0.522741592

Written by statcompute

October 7, 2014 at 11:09 pm

Follow

Get every new post delivered to your Inbox.

Join 81 other followers