Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for November 2012

Another Way to Access R from Python – PypeR

Different from RPy2, PypeR provides another simple way to access R from Python through pipes (http://www.jstatsoft.org/v35/c02/paper). This handy feature enables data analysts to do the data munging with Python and the statistical analysis with R by passing objects interactively between the two computing environments.

Below is a simple demonstration of how to call R within Python through PypeR, estimate a Beta regression, and then return the model prediction from R back to Python.

In [1]: # LOAD PYTHON PACKAGES

In [2]: import pandas as pd

In [3]: import pyper as pr

In [4]: # READ DATA

In [5]: data = pd.read_table("/home/liuwensui/Documents/data/csdata.txt", header = 0)

In [6]: # CREATE A R INSTANCE WITH PYPER

In [7]: r = pr.R(use_pandas = True)

In [8]: # PASS DATA FROM PYTHON TO R

In [9]: r.assign("rdata", data)

In [10]: # SHOW DATA SUMMARY

In [11]: print r("summary(rdata)")
try({summary(rdata)})
    LEV_LT3           TAX_NDEB           COLLAT1           SIZE1       
 Min.   :0.00000   Min.   :  0.0000   Min.   :0.0000   Min.   : 7.738  
 1st Qu.:0.00000   1st Qu.:  0.3494   1st Qu.:0.1241   1st Qu.:12.317  
 Median :0.00000   Median :  0.5666   Median :0.2876   Median :13.540  
 Mean   :0.09083   Mean   :  0.8245   Mean   :0.3174   Mean   :13.511  
 3rd Qu.:0.01169   3rd Qu.:  0.7891   3rd Qu.:0.4724   3rd Qu.:14.751  
 Max.   :0.99837   Max.   :102.1495   Max.   :0.9953   Max.   :18.587  
     PROF2              GROWTH2             AGE              LIQ         
 Min.   :0.0000158   Min.   :-81.248   Min.   :  6.00   Min.   :0.00000  
 1st Qu.:0.0721233   1st Qu.: -3.563   1st Qu.: 11.00   1st Qu.:0.03483  
 Median :0.1203435   Median :  6.164   Median : 17.00   Median :0.10854  
 Mean   :0.1445929   Mean   : 13.620   Mean   : 20.37   Mean   :0.20281  
 3rd Qu.:0.1875148   3rd Qu.: 21.952   3rd Qu.: 25.00   3rd Qu.:0.29137  
 Max.   :1.5902009   Max.   :681.354   Max.   :210.00   Max.   :1.00018  
     IND2A            IND3A            IND4A             IND5A        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :1.0000   Median :0.0000   Median :0.00000   Median :0.00000  
 Mean   :0.6116   Mean   :0.1902   Mean   :0.02692   Mean   :0.09907  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  


In [12]: # LOAD R PACKAGE

In [13]: r("library(betareg)")
Out[13]: 'try({library(betareg)})\nLoading required package: Formula\n'

In [14]: # ESTIMATE A BETA REGRESSION

In [15]: r("m <- betareg(LEV_LT3 ~ SIZE1 + PROF2 + GROWTH2 + AGE + IND3A, data = rdata, subset = LEV_LT3 > 0)")
Out[15]: 'try({m <- betareg(LEV_LT3 ~ SIZE1 + PROF2 + GROWTH2 + AGE + IND3A, data = rdata, subset = LEV_LT3 > 0)})\n'

In [16]: # OUTPUT MODEL SUMMARY

In [17]: print r("summary(m)")
try({summary(m)})

Call:
betareg(formula = LEV_LT3 ~ SIZE1 + PROF2 + GROWTH2 + AGE + IND3A, data = rdata, 
    subset = LEV_LT3 > 0)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-7.2802 -0.5194  0.0777  0.6037  5.8777 

Coefficients (mean model with logit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.229773   0.312990   3.929 8.53e-05 ***
SIZE1       -0.105009   0.021211  -4.951 7.39e-07 ***
PROF2       -2.414794   0.377271  -6.401 1.55e-10 ***
GROWTH2      0.003306   0.001043   3.169  0.00153 ** 
AGE         -0.004999   0.001795  -2.786  0.00534 ** 
IND3A        0.688314   0.074069   9.293  < 2e-16 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)   3.9362     0.1528   25.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood: 266.7 on 7 Df
Pseudo R-squared: 0.1468
Number of iterations: 25 (BFGS) + 2 (Fisher scoring) 


In [18]: # CALCULATE MODEL PREDICTION

In [19]: r("beta_fit <- predict(m, link = 'response')")
Out[19]: "try({beta_fit <- predict(m, link = 'response')})\n"

In [20]: # SHOW PREDICTION SUMMARY IN R

In [21]: print r("summary(beta_fit)")
try({summary(beta_fit)})
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1634  0.3069  0.3465  0.3657  0.4007  0.6695 


In [22]: # PASS DATA FROM R TO PYTHON

In [23]: pydata = pd.DataFrame(r.get("beta_fit"), columns = ["y_hat"])

In [24]: # SHOW PREDICTION SUMMARY IN PYTHON

In [25]: pydata.y_hat.describe()
Out[25]: 
count    1116.000000
mean        0.365675
std         0.089804
min         0.163388
25%         0.306897
50%         0.346483
75%         0.400656
max         0.669489

Written by statcompute

November 29, 2012 at 11:22 pm

Run R Code Within Python On The Fly

Below is an example showing how to run R code within Python, which is an extremely attractive feature for hardcore R programmers.

In [1]: import rpy2.robjects as ro

In [2]: _null_ = ro.r('data <- read.table("/home/liuwensui/data/credit_count.txt", header = TRUE, sep = ",")')

In [3]: print ro.r('str(data)')
'data.frame':	13444 obs. of  14 variables:
 $ CARDHLDR: int  0 0 1 1 1 1 1 1 1 1 ...
 $ DEFAULT : int  0 0 0 0 0 0 0 0 0 0 ...
 $ AGE     : num  27.2 40.8 37.7 42.5 21.3 ...
 $ ACADMOS : int  4 111 54 60 8 78 25 6 20 162 ...
 $ ADEPCNT : int  0 3 3 3 0 1 1 0 3 7 ...
 $ MAJORDRG: int  0 0 0 0 0 0 0 0 0 0 ...
 $ MINORDRG: int  0 0 0 0 0 0 0 0 0 0 ...
 $ OWNRENT : int  0 1 1 1 0 0 1 0 0 1 ...
 $ INCOME  : num  1200 4000 3667 2000 2917 ...
 $ SELFEMPL: int  0 0 0 0 0 0 0 0 0 0 ...
 $ INCPER  : num  18000 13500 11300 17250 35000 ...
 $ EXP_INC : num  0.000667 0.000222 0.03327 0.048427 0.016523 ...
 $ SPENDING: num  NA NA 122 96.9 48.2 ...
 $ LOGSPEND: num  NA NA 4.8 4.57 3.88 ...
NULL

In [4]: _null_ = ro.r('sample <- data[data$CARDHLDR == 1,]')

In [5]: print ro.r('summary(sample)')
    CARDHLDR    DEFAULT             AGE           ACADMOS         ADEPCNT      
 Min.   :1   Min.   :0.00000   Min.   : 0.00   Min.   :  0.0   Min.   :0.0000  
 1st Qu.:1   1st Qu.:0.00000   1st Qu.:25.75   1st Qu.: 12.0   1st Qu.:0.0000  
 Median :1   Median :0.00000   Median :31.67   Median : 30.0   Median :0.0000  
 Mean   :1   Mean   :0.09487   Mean   :33.67   Mean   : 55.9   Mean   :0.9904  
 3rd Qu.:1   3rd Qu.:0.00000   3rd Qu.:39.75   3rd Qu.: 72.0   3rd Qu.:2.0000  
 Max.   :1   Max.   :1.00000   Max.   :88.67   Max.   :564.0   Max.   :9.0000  
    MAJORDRG         MINORDRG         OWNRENT           INCOME    
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :  50  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1750  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :2292  
 Mean   :0.1433   Mean   :0.2207   Mean   :0.4791   Mean   :2606  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3042  
 Max.   :6.0000   Max.   :7.0000   Max.   :1.0000   Max.   :8333  
    SELFEMPL           INCPER          EXP_INC            SPENDING       
 Min.   :0.00000   Min.   :   700   Min.   :0.000096   Min.   :   0.111  
 1st Qu.:0.00000   1st Qu.: 12900   1st Qu.:0.025998   1st Qu.:  58.753  
 Median :0.00000   Median : 20000   Median :0.058957   Median : 139.992  
 Mean   :0.05362   Mean   : 22581   Mean   :0.090744   Mean   : 226.983  
 3rd Qu.:0.00000   3rd Qu.: 28337   3rd Qu.:0.116123   3rd Qu.: 284.440  
 Max.   :1.00000   Max.   :150000   Max.   :2.037728   Max.   :4810.309  
    LOGSPEND     
 Min.   :-2.197  
 1st Qu.: 4.073  
 Median : 4.942  
 Mean   : 4.729  
 3rd Qu.: 5.651  
 Max.   : 8.479  

In [6]: print ro.r('summary(glm(DEFAULT ~ MAJORDRG + MINORDRG + OWNRENT + INCOME, data = sample, family = binomial))')

Call:
glm(formula = DEFAULT ~ MAJORDRG + MINORDRG + OWNRENT + INCOME, 
    family = binomial, data = sample)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9587  -0.5003  -0.4351  -0.3305   3.1928  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.204e+00  9.084e-02 -13.259  < 2e-16 ***
MAJORDRG     2.031e-01  6.926e-02   2.933  0.00336 ** 
MINORDRG     2.027e-01  4.798e-02   4.225 2.38e-05 ***
OWNRENT     -2.012e-01  7.163e-02  -2.809  0.00496 ** 
INCOME      -4.422e-04  4.044e-05 -10.937  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6586.1  on 10498  degrees of freedom
Residual deviance: 6376.2  on 10494  degrees of freedom
AIC: 6386.2

Number of Fisher Scoring iterations: 6

Written by statcompute

November 24, 2012 at 11:19 pm

A Light Touch on RPy2

For a statistical analyst, the first step in a data analysis project is to import the data into the program and then to screen the descriptive statistics of the data. In Python, we can easily do so with the pandas package.

In [1]: import pandas as pd

In [2]: data = pd.read_table("/home/liuwensui/Documents/data/csdata.txt", header = 0)

In [3]: pd.set_printoptions(precision = 5)

In [4]: print data.describe().to_string()
         LEV_LT3   TAX_NDEB    COLLAT1      SIZE1      PROF2    GROWTH2        AGE        LIQ      IND2A      IND3A      IND4A      IND5A
count  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000  4421.0000
mean      0.0908     0.8245     0.3174    13.5109     0.1446    13.6196    20.3664     0.2028     0.6116     0.1902     0.0269     0.0991
std       0.1939     2.8841     0.2272     1.6925     0.1109    36.5177    14.5390     0.2333     0.4874     0.3925     0.1619     0.2988
min       0.0000     0.0000     0.0000     7.7381     0.0000   -81.2476     6.0000     0.0000     0.0000     0.0000     0.0000     0.0000
25%       0.0000     0.3494     0.1241    12.3170     0.0721    -3.5632    11.0000     0.0348     0.0000     0.0000     0.0000     0.0000
50%       0.0000     0.5666     0.2876    13.5396     0.1203     6.1643    17.0000     0.1085     1.0000     0.0000     0.0000     0.0000
75%       0.0117     0.7891     0.4724    14.7511     0.1875    21.9516    25.0000     0.2914     1.0000     0.0000     0.0000     0.0000
max       0.9984   102.1495     0.9953    18.5866     1.5902   681.3542   210.0000     1.0002     1.0000     1.0000     1.0000     1.0000

Tonight, I’d like to add some spice to my Python learning experience and do the work in a different flavor with the rpy2 package, which allows me to call R functions from Python.
In [5]: import rpy2.robjects as ro

In [6]: rdata = ro.packages.importr('utils').read_table("/home/liuwensui/Documents/data/csdata.txt", header = True)

In [7]: print ro.r.summary(rdata)
    LEV_LT3           TAX_NDEB           COLLAT1           SIZE1       
 Min.   :0.00000   Min.   :  0.0000   Min.   :0.0000   Min.   : 7.738  
 1st Qu.:0.00000   1st Qu.:  0.3494   1st Qu.:0.1241   1st Qu.:12.317  
 Median :0.00000   Median :  0.5666   Median :0.2876   Median :13.540  
 Mean   :0.09083   Mean   :  0.8245   Mean   :0.3174   Mean   :13.511  
 3rd Qu.:0.01169   3rd Qu.:  0.7891   3rd Qu.:0.4724   3rd Qu.:14.751  
 Max.   :0.99837   Max.   :102.1495   Max.   :0.9953   Max.   :18.587  
     PROF2              GROWTH2             AGE              LIQ         
 Min.   :0.0000158   Min.   :-81.248   Min.   :  6.00   Min.   :0.00000  
 1st Qu.:0.0721233   1st Qu.: -3.563   1st Qu.: 11.00   1st Qu.:0.03483  
 Median :0.1203435   Median :  6.164   Median : 17.00   Median :0.10854  
 Mean   :0.1445929   Mean   : 13.620   Mean   : 20.37   Mean   :0.20281  
 3rd Qu.:0.1875148   3rd Qu.: 21.952   3rd Qu.: 25.00   3rd Qu.:0.29137  
 Max.   :1.5902009   Max.   :681.354   Max.   :210.00   Max.   :1.00018  
     IND2A            IND3A            IND4A             IND5A        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :1.0000   Median :0.0000   Median :0.00000   Median :0.00000  
 Mean   :0.6116   Mean   :0.1902   Mean   :0.02692   Mean   :0.09907  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  

As shown above, a similar analysis can be conducted by calling R functions from Python. This feature enables us to extract and process the data effectively with Python without losing the graphical and statistical functionality of R.

Written by statcompute

November 24, 2012 at 12:31 am

Posted in PYTHON, S+/R, Statistics

Calculating K-S Statistic with Python

The K-S (Kolmogorov-Smirnov) statistic is a measure of the predictiveness of a statistical model for binary outcomes and has been widely used in direct marketing and risk modeling.

Below is a demonstration of how to calculate the K-S statistic in fewer than 20 lines of Python code. In this snippet, I am also trying to show how to do data munging effectively with the pandas and numpy packages.

As Wes McKinney pointed out in his book “Python for Data Analysis”, it is not necessary to become a proficient Python software developer in order to be a proficient Python data analyst.

In [1]: # IMPORT PACKAGES

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: # LOAD DATA FROM CSV FILE

In [5]: data = pd.read_csv('c:\\projects\\data.csv')

In [6]: data.describe()
Out[6]:
               bad        score
count  5522.000000  5522.000000
mean      0.197573   693.466135
std       0.398205    57.829769
min       0.000000   443.000000
25%       0.000000   653.000000
50%       0.000000   692.500000
75%       0.000000   735.000000
max       1.000000   848.000000

In [7]: data['good'] = 1 - data.bad

In [8]: # DEFINE 10 BUCKETS WITH EQUAL SIZE

In [9]: data['bucket'] = pd.qcut(data.score, 10)

In [10]: # GROUP THE DATA FRAME BY BUCKETS

In [11]: grouped = data.groupby('bucket', as_index = False)

In [12]: # CREATE A SUMMARY DATA FRAME

In [13]: agg1 = grouped.min().score

In [14]: agg1 = pd.DataFrame(grouped.min().score, columns = ['min_scr'])

In [15]: agg1['max_scr'] = grouped.max().score

In [16]: agg1['bads'] = grouped.sum().bad

In [17]: agg1['goods'] = grouped.sum().good

In [18]: agg1['total'] = agg1.bads + agg1.goods

In [19]: agg1
Out[19]:
   min_scr  max_scr  bads  goods  total
0      621      645   201    365    566
1      646      661   173    359    532
2      662      677   125    441    566
3      678      692    99    436    535
4      693      708    89    469    558
5      709      725    66    492    558
6      726      747    42    520    562
7      748      772    30    507    537
8      773      848    14    532    546
9      443      620   252    310    562

In [20]: # SORT THE DATA FRAME BY SCORE

In [21]: agg2 = (agg1.sort_index(by = 'min_scr')).reset_index(drop = True)

In [22]: agg2['odds'] = (agg2.goods / agg2.bads).apply('{0:.2f}'.format)

In [23]: agg2['bad_rate'] = (agg2.bads / agg2.total).apply('{0:.2%}'.format)

In [24]: # CALCULATE KS STATISTIC

In [25]: agg2['ks'] = np.round(((agg2.bads / data.bad.sum()).cumsum() - (agg2.goods / data.good.sum()).cumsum()), 4) * 100

In [26]: # DEFINE A FUNCTION TO FLAG MAX KS

In [27]: flag = lambda x: '<----' if x == agg2.ks.max() else ''

In [28]: # FLAG OUT MAX KS

In [29]: agg2['max_ks'] = agg2.ks.apply(flag)

In [30]: agg2
Out[30]:
   min_scr  max_scr  bads  goods  total   odds bad_rate     ks max_ks
0      443      620   252    310    562   1.23   44.84%  16.10
1      621      645   201    365    566   1.82   35.51%  26.29
2      646      661   173    359    532   2.08   32.52%  34.04
3      662      677   125    441    566   3.53   22.08%  35.55  <----
4      678      692    99    436    535   4.40   18.50%  34.78
5      693      708    89    469    558   5.27   15.95%  32.36
6      709      725    66    492    558   7.45   11.83%  27.30
7      726      747    42    520    562  12.38    7.47%  19.42
8      748      772    30    507    537  16.90    5.59%  10.72
9      773      848    14    532    546  38.00    2.56%  -0.00
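
For reuse, the steps above can be condensed into a small function. Below is only a sketch of the same logic, assuming a data frame with the bad and score columns as in this session.

import numpy as np
import pandas as pd

def ks_stat(data, score='score', target='bad', n=10):
    # bucket scores into n quantile bins and tally bads / goods per bin
    df = data[[score, target]].copy()
    df['good'] = 1 - df[target]
    agg = df.groupby(pd.qcut(df[score], n))[[target, 'good']].sum()
    # cumulative shares of bads and goods across ascending score bins
    cum_bad = agg[target].cumsum() / agg[target].sum()
    cum_good = agg['good'].cumsum() / agg['good'].sum()
    return 100 * (cum_bad - cum_good).abs().max()

# e.g. ks_stat(data) should reproduce the max KS flagged above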

Written by statcompute

November 18, 2012 at 6:47 pm

Fitting A Logistic Regression with Python

In [1]: from pandas import *

In [2]: import statsmodels.api as sm

In [3]: # LOAD EXTERNAL DATA

In [4]: data = read_table('C:\\data\\credit_count.txt', sep = ',')

In [5]: data
Out[5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13444 entries, 0 to 13443
Data columns:
CARDHLDR     13444  non-null values
DEFAULT      13444  non-null values
AGE          13444  non-null values
ACADMOS      13444  non-null values
ADEPCNT      13444  non-null values
MAJORDRG     13444  non-null values
MINORDRG     13444  non-null values
OWNRENT      13444  non-null values
INCOME       13444  non-null values
SELFEMPL     13444  non-null values
INCPER       13444  non-null values
EXP_INC      13444  non-null values
SPENDING     13444  non-null values
LOGSPEND     13444  non-null values
dtypes: float64(4), int64(8), object(2)

In [6]: # DEFINE RESPONSE

In [7]: Y = data[data.CARDHLDR == 1].DEFAULT

In [8]: # SUMMARIZE RESPONSE VARIABLE

In [9]: Y.describe()
Out[9]:
count    10499.000000
mean         0.094866
std          0.293044
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000

In [10]: # DEFINE PREDICTORS

In [11]: X = sm.add_constant(data[data.CARDHLDR == 1][['AGE', 'ADEPCNT', 'MAJORDRG', 'MINORDRG', 'INCOME', 'OWNRENT']])

In [12]: # SUMMARIZE PREDICTORS

In [13]: X.describe()
Out[13]:
                AGE       ADEPCNT      MAJORDRG      MINORDRG        INCOME       OWNRENT  const
count  10499.000000  10499.000000  10499.000000  10499.000000  10499.000000  10499.000000  10499
mean      33.674945      0.990380      0.143252      0.220688   2606.125933      0.479093      1
std       10.290998      1.273887      0.461568      0.637142   1287.983386      0.499587      0
min        0.000000      0.000000      0.000000      0.000000     50.000000      0.000000      1
25%       25.750000      0.000000      0.000000      0.000000   1750.000000      0.000000      1
50%       31.666666      0.000000      0.000000      0.000000   2291.666667      0.000000      1
75%       39.750000      2.000000      0.000000      0.000000   3041.666667      1.000000      1
max       88.666664      9.000000      6.000000      7.000000   8333.250000      1.000000      1

In [14]: # DEFINE A MODEL

In [15]: model = sm.GLM(Y, X, family = sm.families.Binomial())

In [16]: # FIT A MODEL

In [17]: result = model.fit()

In [18]: # PRINT RESULTS

In [19]: print result.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                DEFAULT   No. Observations:                10499
Model:                            GLM   Df Residuals:                    10492
Model Family:                Binomial   Df Model:                            6
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -3175.8
Date:                Thu, 08 Nov 2012   Deviance:                       6351.7
Time:                        23:24:02   Pearson chi2:                 1.11e+04
No. Iterations:                     7
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
AGE           -0.0095      0.004     -2.450      0.014        -0.017    -0.002
ADEPCNT        0.1338      0.029      4.655      0.000         0.077     0.190
MAJORDRG       0.2103      0.070      3.016      0.003         0.074     0.347
MINORDRG       0.2007      0.048      4.178      0.000         0.107     0.295
INCOME        -0.0005   4.19e-05    -11.057      0.000        -0.001    -0.000
OWNRENT       -0.2263      0.077     -2.924      0.003        -0.378    -0.075
const         -0.9648      0.133     -7.245      0.000        -1.226    -0.704
==============================================================================

Written by statcompute

November 8, 2012 at 11:29 pm

Another Class of Risk Models

In retail banking, it is a key interest to predict the probability of accounts’ adverse behaviors, such as delinquencies or defaults. A widely accepted practice in the industry is to classify accounts into two groups, the good and the bad, based upon the presence of certain adverse behaviors, and then to model this binary outcome with discriminant models, e.g. logistic regression. However, an obvious limitation of discriminant models based upon the binary outcome is that the two-state classification over-simplifies adverse behaviors of accounts. What financially impacts a financial institution is not only the presence of a certain adverse behavior but also the frequency of such behavior.

In the definition of the binary outcome, it is important to notice that delinquencies can also be measured directly as the frequency of overdue payments. Therefore, instead of modeling the binary outcome, a more sensible alternative might be to model the frequency of delinquencies within a given valuation horizon. In the statistical context, the canonical model for a count outcome, e.g. a frequency, is the Poisson regression model with probability function

f(Y_i | X_i) = exp(-λ_i) * (λ_i ^ Y_i) / Y_i!, where λ_i = exp(X_i'B)

It is assumed that each observed outcome Y_i is drawn from a Poisson distribution with the conditional mean λ_i on a given covariate vector X_i for case i. In the Poisson model, a strong assumption is that the mean is equal to the variance such that E(Y_i | X_i) = Var(Y_i | X_i) = λ_i, which is also known as Equi-Dispersion. However, in practice, this Equi-Dispersion assumption is too restrictive for many empirical applications. In real-world count outcomes, the variance often exceeds the mean, namely Over-Dispersion, due to various reasons, such as excess zeroes or a long right tail. For instance, in a credit card portfolio, the majority of cardholders should have zero delinquency at any point in time, while a few might have more than three. With a consequence similar to heteroskedasticity in a linear regression, Over-Dispersion in a Poisson model will lead to deflated standard errors of parameter estimates and therefore inflated t-statistics. Hence, the Poisson model is often inadequate and practically unusable.
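
As a quick numerical sketch, the snippet below simulates a Poisson outcome, fits a Poisson GLM with statsmodels, and checks the Pearson chi-square to degrees-of-freedom ratio, which should sit near 1 under Equi-Dispersion and well above 1 under Over-Dispersion; the coefficients here are arbitrary.

import numpy as np
import statsmodels.api as sm

# simulate a Poisson outcome so the sketch is self-contained
np.random.seed(1)
n = 5000
x = np.random.normal(size=n)
lam = np.exp(0.5 + 0.3 * x)          # lambda_i = exp(X_i'B)
y = np.random.poisson(lam)

# fit a Poisson GLM with the canonical log link
X = sm.add_constant(x)
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Equi-Dispersion check: Pearson chi-square / df near 1 is consistent
# with E(Y|X) = Var(Y|X); a ratio well above 1 signals Over-Dispersion
print(result.pearson_chi2 / result.df_resid)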

Considered a generalization of the basic Poisson model, the Negative Binomial (NB) model accommodates Over-Dispersion in the data by including a dispersion parameter. In an NB model, it is assumed that the conditional mean λ_i for case i is determined not only by the observed heterogeneity explained by the covariate vector X_i but also by the unobserved heterogeneity denoted as ε_i that is independent of X_i, such that

λ_i = exp(X_i'B + ε_i) = exp(X_i'B) * exp(ε_i), where exp(ε_i) ~ Gamma(1/α, 1/α)

While there are many variants of the NB model, the most common one is the NB2 model proposed by Cameron and Trivedi (1986) with probability function

f(Y_i | X_i) = Gamma(Y_i + 1/α) / [Gamma(Y_i + 1) * Gamma(1/α)] * [(1/α) / (1/α + λ_i)] ^ (1/α) * [λ_i / (1/α + λ_i)] ^ Y_i, where α is the dispersion parameter

For the NB2 model, the conditional mean E(Y_i | X_i) is still λ_i, while the variance Var(Y_i | X_i) becomes λ_i + α * λ_i ^ 2. Since both λ_i > 0 and α > 0, the variance must exceed the mean, and therefore the issue of Over-Dispersion is addressed.
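
To see the NB2 variance function numerically, a small sketch maps the (λ, α) parameterization onto scipy's nbinom distribution with n = 1/α and p = (1/α) / (1/α + λ); the parameter values below are arbitrary.

from scipy.stats import nbinom

alpha, lam = 0.5, 2.0
n, p = 1.0 / alpha, (1.0 / alpha) / (1.0 / alpha + lam)

dist = nbinom(n, p)
print(dist.mean())   # equals lam
print(dist.var())    # equals lam + alpha * lam ** 2, exceeding the mean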

A major limitation of standard count data models, such as Poisson and Negative Binomial, is that the data is assumed to be generated by a single process. However, in many cases, it might be more appropriate to assume that the data is governed by two or more processes. For instance, it is believed that risk drivers of a first-time delinquent account might be very different from those of an account that has been delinquent multiple times. From the business standpoint, the assumption of multiple processes is particularly attractive in that it provides the potential to segment the portfolio into two or more sub-groups based upon their delinquency patterns and loan characteristics.

Known as a two-part model, the Hurdle Poisson model assumes that count outcomes come from two systematically different processes: a Binomial distribution determining the probability of zero counts and a Truncated-at-Zero Poisson governing positive outcomes. The probability function can be expressed as

for Y_i = 0, f(Y_i | X_i) = θ_i, where θ_i = Prob(Y_i = 0)
for Y_i > 0, f(Y_i | X_i) = (1 - θ_i) * exp(-λ_i) * λ_i ^ Y_i / {[1 - exp(-λ_i)] * Y_i!}, where λ_i = exp(X_i'B)

In this modeling framework, the first process can be analyzed by a logistic regression and the second can be reflected by a Truncated-at-Zero Poisson model. An advantage of the Hurdle model is that it is flexible enough to effectively model both Over-Dispersed data with too many zeroes and Under-Dispersed data with too few zeroes.
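
As an illustrative sketch rather than a full estimation routine, the Hurdle Poisson log-likelihood can be coded in a few lines of Python; θ and λ are held constant here for clarity, whereas in practice each would be linked to covariates through the logit and log links respectively.

import numpy as np
from scipy.stats import poisson

def hurdle_poisson_loglik(y, theta, lam):
    # theta = Prob(Y = 0); lam = mean of the Truncated-at-Zero Poisson
    y = np.asarray(y)
    ll = np.zeros(len(y))
    zero = (y == 0)
    # first process: a point mass of theta at zero
    ll[zero] = np.log(theta)
    # second process: Truncated-at-Zero Poisson, scaled by (1 - theta)
    ll[~zero] = (np.log(1 - theta) + poisson.logpmf(y[~zero], lam)
                 - np.log(1 - np.exp(-lam)))
    return ll.sum()

print(hurdle_poisson_loglik([0, 0, 1, 3], theta=0.5, lam=1.2))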

Similar to the Hurdle model, the Zero-Inflated Poisson (ZIP) model is another way to model count outcomes with excess zeroes under the assumption of two components. However, it is slightly different from the Hurdle model in the sense that zero outcomes are assumed to come from two different sources: one generating only zero outcomes and the other generating both zero and nonzero outcomes. Specifically, a Binomial distribution decides whether an individual is from the Always-Zero or the Not-Always-Zero group, and then a standard Poisson distribution describes counts in the Not-Always-Zero group. The probability function of the ZIP model is given as

for Y_i = 0, f(Y_i | X_i) = ω_i + (1 - ω_i) * exp(-λ_i), where ω_i = Prob(Y_i is drawn from the Always-Zero group)
for Y_i > 0, f(Y_i | X_i) = (1 - ω_i) * exp(-λ_i) * λ_i ^ Y_i / Y_i!

With an idea similar to the Hurdle model, the ZIP model can also be represented jointly by two different sub-models: a logistic regression is used to separate the Always-Zero group from the Not-Always-Zero group, and a basic Poisson model is applied to individuals in the Not-Always-Zero group. From a business perspective, the ZIP model describes an important fact: some not-at-risk accounts are so well established that they will never have financial problems, while the other at-risk ones might get into trouble during tough times. Therefore, risk exposures and underlying metrics for accounts with the same zero-count outcome might still be differentiable.
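
As a sketch, the ZIP probability function above translates directly into Python, with ω held constant and denoting the probability of the Always-Zero group.

import numpy as np
from scipy.stats import poisson

def zip_pmf(y, omega, lam):
    y = np.asarray(y)
    # the Not-Always-Zero group follows a standard Poisson
    pmf = (1 - omega) * poisson.pmf(y, lam)
    # zero outcomes come from both the Always-Zero group and
    # the Poisson group drawing a zero
    return np.where(y == 0, omega + pmf, pmf)

print(zip_pmf(np.arange(4), omega=0.3, lam=1.5))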

In practice, a sharp dichotomization between the at-risk group and the not-at-risk group might not be realistic. Even a customer in good financial condition might be exposed to risks in certain situations. Therefore, it might make sense to split the whole portfolio into a few segments with different levels of risk exposure. A Latent Class Poisson model provides such a mechanism by assuming that the population of interest is actually a mixture of S > 1 latent (unobservable) components and that each individual is a draw from one of these latent groups. The probability function of a Latent Class Poisson model with S = 2 classes can be written as

f(Y_i | X_i) = P1_i * exp(-λ1_i) * λ1_i ^ Y_i / Y_i! + P2_i * exp(-λ2_i) * λ2_i ^ Y_i / Y_i!, where P1_i + P2_i = 1

Each latent component in the mixture is assumed to have a different parameter λ_i, which accounts for the unobserved heterogeneity in the population. For instance, in the case of S = 2, the portfolio is assumed to be a mixture of a high-risk group and a low-risk one. Impacts of predictors are allowed to differ across latent groups, allowing more informative and flexible interpretations.
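
For illustration, the two-class probability function amounts to a weighted sum of two Poisson densities; the mixing weight and component means below are arbitrary, e.g. a low-risk component mixed with a high-risk one.

import numpy as np
from scipy.stats import poisson

def latent_class_poisson_pmf(y, p1, lam1, lam2):
    # mixture of two Poisson components with weights p1 and (1 - p1)
    return p1 * poisson.pmf(y, lam1) + (1 - p1) * poisson.pmf(y, lam2)

print(latent_class_poisson_pmf(np.arange(5), p1=0.8, lam1=0.2, lam2=3.0))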

Besides the models discussed above, it is worth pointing out that discrete choice models, such as Logit or Probit, have also been widely used to model count outcomes. However, such a discrete choice model needs to be based upon a sequential or ordered rather than multinomial response, namely the ordered Logit.

Written by statcompute

November 3, 2012 at 10:19 pm