Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for August 2013

Multinomial Logit with Python

In [1]: import statsmodels.api as st

In [2]: iris = st.datasets.get_rdataset('iris', 'datasets')

In [3]: ### get the y 

In [4]: y = iris.data.Species

In [5]: print y.head(3)
0    setosa
1    setosa
2    setosa
Name: Species, dtype: object

In [6]: ### get the x 

In [7]: x = iris.data.ix[:, 0]

In [8]: x = st.add_constant(x, prepend = False)

In [9]: print x.head(3)
   Sepal.Length  const
0           5.1      1
1           4.9      1
2           4.7      1

In [10]: ### specify the model

In [11]: mdl = st.MNLogit(y, x)

In [12]: mdl_fit = mdl.fit()
Optimization terminated successfully.
         Current function value: 0.606893
         Iterations 8

In [13]: ### print model summary ###

In [14]: print mdl_fit.summary()
                          MNLogit Regression Results                          
==============================================================================
Dep. Variable:                Species   No. Observations:                  150
Model:                        MNLogit   Df Residuals:                      146
Method:                           MLE   Df Model:                            2
Date:                Fri, 23 Aug 2013   Pseudo R-squ.:                  0.4476
Time:                        22:22:58   Log-Likelihood:                -91.034
converged:                       True   LL-Null:                       -164.79
                                        LLR p-value:                 9.276e-33
=====================================================================================
Species=versicolor       coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Sepal.Length           4.8157      0.907      5.310      0.000         3.038     6.593
const                -26.0819      4.889     -5.335      0.000       -35.665   -16.499
--------------------------------------------------------------------------------------
Species=virginica       coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Sepal.Length          6.8464      1.022      6.698      0.000         4.843     8.850
const               -38.7590      5.691     -6.811      0.000       -49.913   -27.605
=====================================================================================

In [15]: ### marginal effects ###

In [16]: mdl_margeff = mdl_fit.get_margeff()

In [17]: print mdl_margeff.summary()
       MNLogit Marginal Effects      
=====================================
Dep. Variable:                Species
Method:                          dydx
At:                           overall
=====================================================================================
    Species=setosa      dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Sepal.Length          -0.3785      0.003   -116.793      0.000        -0.385    -0.372
--------------------------------------------------------------------------------------
Species=versicolor      dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Sepal.Length           0.0611      0.022      2.778      0.005         0.018     0.104
--------------------------------------------------------------------------------------
Species=virginica      dy/dx    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Sepal.Length          0.3173      0.022     14.444      0.000         0.274     0.360
=====================================================================================

In [18]: ### aic and bic ###

In [19]: print mdl_fit.aic
190.06793279

In [20]: print mdl_fit.bic
202.110473966
Advertisements

Written by statcompute

August 23, 2013 at 10:35 pm

Prototyping Multinomial Logit with R

Recently, I am working on a new modeling proposal based on the competing risk and need to prototype multinomial logit models with R. There are 2 R packages implementing multinomial logit models that I’ve tested, namely nnet and vgam. Model outputs with iris data are shown below.

data(iris)

### method 1: nnet package ###
library(nnet)
mdl1 <- multinom(Species ~ Sepal.Length, data = iris, model = TRUE)
summary(mdl1)

# Coefficients:
#            (Intercept) Sepal.Length
# versicolor   -26.08339     4.816072
# virginica    -38.76786     6.847957
#
# Std. Errors:
#            (Intercept) Sepal.Length
# versicolor    4.889635    0.9069211
# virginica     5.691596    1.0223867

### method 2: vgam package ### 
library(VGAM)
mdl2 <- vglm(Species ~ Sepal.Length, data = iris, multinomial(refLevel = 1)) 
summary(mdl2)

# Coefficients:
#                Estimate Std. Error z value
# (Intercept):1  -26.0819    4.88924 -5.3346
# (Intercept):2  -38.7590    5.69064 -6.8110
# Sepal.Length:1   4.8157    0.90683  5.3105
# Sepal.Length:2   6.8464    1.02222  6.6976

However, in my view, above methods are not flexible for real-world problems. For instance, there is no off-shelf solution for the variable selection for above multinomial logit models. Instead of building one multinomial logit model, we might develop two separate binomial logit models to accomplish the same task.

### method 3: two binary logit models ### 
iris$y <- ifelse(iris$Species == 'setosa', 0, 1)
mdl31 <- glm(y ~ Sepal.Length, data = iris, subset = (Species != 'virginica'), family = binomial)
summary(mdl31)

#  Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept)   -27.831      5.434  -5.122 3.02e-07 ***
# Sepal.Length    5.140      1.007   5.107 3.28e-07 ***

mdl32 <- glm(y ~ Sepal.Length, data = iris, subset = (Species != 'versicolor'), family = binomial)
summary(mdl32)

# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept)   -38.547      9.557  -4.033 5.50e-05 ***
# Sepal.Length    6.805      1.694   4.016 5.91e-05 ***

As shown above, we can get a set of similar estimated parameters by the third approach with much simpler models.

Written by statcompute

August 21, 2013 at 11:22 pm