Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. - Isaac Newton

A Two-Stage Approach in Loan-Level Consumer PD Models

In the traditional methodology of loan-level consumer probability of default (PD) models, a general functional form is given as below

Logit(PD) = A * X + B * Y

where A * X is the linear combination of loan-level risk characteristics and B * Y is the linear combination of macro-economic indicators. While B * Y is considered a dynamic factor and might vary according to macro-economic conditions, A * X usually remains static for each loan regardless of the macro-economic scenario during the forecast horizon, e.g. 9 quarters for CCAR exercises. However, based upon our observation in the 2008 financial crisis, it is neither realistic nor conservative to assume that loan-level consumer risk characteristics, e.g. loan-to-value, credit score, and delinquency status, are immune to an economic downturn.
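
As a minimal sketch of this legacy one-stage formulation in R, assuming a hypothetical loan-quarter data frame df with a binary default flag, loan-level covariates (ltv, fico, delinq) standing in for A * X, and macro-economic covariates (unemp_chg, hpi_yoy) standing in for B * Y:

# one-stage logit PD model: only the macro terms move across scenarios
legacy_pd <- glm(default ~ ltv + fico + delinq +   # A * X: static loan-level profile
                   unemp_chg + hpi_yoy,            # B * Y: macro-economic dynamics
                 family = binomial(link = "logit"), data = df)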

In this article, a two-stage approach is proposed to address the above drawback of the legacy method by means of injecting macro-economic dynamics into the loan-level risk characteristics that are assumed static in the PD model formulation. With this innovative two-stage method, it is assumed that the risk profile of each consumer loan, e.g. A * X, is jointly determined by both the historical profile and macro-economic dynamics such that

A * X = Z = F(macro-economic indicators), offset by the historical Z for each consumer loan
==> Delta(Z) = C * M

where C * M represents a linear combination of macro-economic indicators or their variants and scores the health of the general economy, such that C * M is positive when the economy is deteriorating and negative when it is improving. As a result, the loan-level risk characteristics, e.g. A * X, would be updated dynamically upon macro-economic inputs over the whole forecast horizon and across various macro-economic scenarios.
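
As a minimal sketch of the two stages in R, assuming a hypothetical historical panel of quarterly score changes (panel with column z_chg), a scenario data frame of 9-quarter macro paths (scenario), a loan's historical score z0, and its macro term B * Y from the PD equation (by_term):

# stage 1: Delta(Z) = C * M, the quarterly change in the loan-level score
# regressed on macro-economic indicators over the historical panel
stage1 <- lm(z_chg ~ unemp_chg + hpi_yoy, data = panel)

# stage 2: cumulate the predicted score changes over the 9-quarter scenario,
# offset by the loan's historical score z0, then map back to PD through the
# logit link together with the macro term B * Y
z_path  <- z0 + cumsum(predict(stage1, newdata = scenario))
pd_path <- plogis(z_path + by_term)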

It is worth mentioning that Fair Isaac also announced the FICO score economic calibration service in 2013, which forecasts how the FICO score would change under different macro-economic scenarios. It shares a similar idea with the two-stage approach and intends to address the same issue faced by CCAR practitioners in the banking industry.

Written by statcompute

March 1, 2015 at 3:53 pm

Posted in CCAR, Credit Risk

Chain Operations of Pandas DataFrame

Chaining operations is an innovative feature provided by the dplyr package in the R language (https://statcompute.wordpress.com/2014/07/28/chain-operations-an-interesting-feature-in-dplyr-package). This functionality makes data manipulation more compact and expressive.

In Python, the pandas-ply package brings the same capability to pandas DataFrame objects, as shown below.

In [1]: import pandas as pd

In [2]: df = pd.read_csv('/home/liuwensui/Downloads/2008.csv')

In [3]: ### STANDARD PANDAS OPERATIONS ###

In [4]: grp = df[df.Month < 6].groupby(["Month", "DayOfWeek"])

In [5]: df1 = pd.DataFrame()

In [6]: df1["ArrMedian"] = grp.ArrTime.median()

In [7]: df1["DepMean"] = grp.DepTime.mean()

In [8]: df2 = df1[(df1.ArrMedian > 1515) & (df1.DepMean > 1350)]

In [9]: print(df2)
                 ArrMedian      DepMean
Month DayOfWeek                        
1     7               1545  1375.156487
2     5               1522  1352.657670
      7               1538  1375.605698
3     7               1542  1377.128506
4     7               1536  1366.719829
5     7               1535  1361.323864

In [10]: ### PLY CHAIN OPERATIONS ###

In [11]: from pandas_ply import install_ply, X

In [12]: install_ply(pd)

In [13]: df1 = (df
   ....:        .ply_where(X.Month < 6)
   ....:        .groupby(['Month', 'DayOfWeek'])
   ....:        .ply_select(DepMean = X.DepTime.mean(), ArrMedian = X.ArrTime.median())
   ....:        .ply_where(X.ArrMedian > 1515, X.DepMean > 1350))

In [14]: print(df1)
                  ArrMedian      DepMean
Month DayOfWeek                         
1     7                1545  1375.156487
2     5                1522  1352.657670
      7                1538  1375.605698
3     7                1542  1377.128506
4     7                1536  1366.719829
5     7                1535  1361.323864

Written by statcompute

January 31, 2015 at 4:26 pm

Posted in PYTHON

Query Pandas DataFrame with SQL

Similar to the sqldf package, which provides a seamless interface between SQL statements and R data frames, the pandasql package allows Python users to query pandas DataFrames with SQL.

Below are some examples showing how to use pandasql to do SELECT / AGGREGATE / JOIN operations. More information is available on GitHub (https://github.com/yhat/pandasql).

In [1]: import sas7bdat as sas

In [2]: import pandas as pd

In [3]: import pandasql as pdsql

In [4]: data = sas.SAS7BDAT("accepts.sas7bdat")

In [5]: df = data.toDataFrame()

In [6]: pysql = lambda q: pdsql.sqldf(q, globals())

In [7]: ### SELECT ###

In [8]: str1 = "select bureau_score, ltv from df where bureau_score < 600 and ltv > 100 limit 3;"

In [9]: df1 = pysql(str1)

In [10]: df1
Out[10]: 
   bureau_score  ltv
0           590  103
1           575  120
2           538  113

In [11]: ### AGGREGATE ###

In [12]: str2 = "select ltv, min(bureau_score) as min_score, max(bureau_score) as max_score from df group by ltv order by ltv DESC;"

In [13]: df2 = pysql(str2)

In [14]: df2.head(3)
Out[14]: 
   ltv  min_score  max_score
0  176        709        709
1  168        723        723
2  167        688        688

In [15]: ### JOIN ###

In [16]: str3 = "select b.*, a.bureau_score from df a inner join df2 b on a.ltv = b.ltv order by ltv DESC;"

In [17]: df3 = pysql(str3)

In [18]: df3.head(3)
Out[18]: 
   ltv  min_score  max_score  bureau_score
0  176        709        709           709
1  168        723        723           723
2  167        688        688           688

Written by statcompute

November 1, 2014 at 11:23 pm

Posted in Big Data, PYTHON, S+/R

Flexible Beta Modeling

library(betareg)
library(sas7bdat)

df1 <- read.sas7bdat('lgd.sas7bdat')
df2 <- df1[df1$y < 1, ]

# three-part formula: mean model | precision model | concomitant / partitioning variables
fml <- as.formula('y ~ x2 + x3 + x4 + x5 + x6 | x3 + x4 | x1 + x2')

### LATENT-CLASS BETA REGRESSION: AIC = -565 ###
mdl1 <- betamix(fml, data = df2, k = 2, FLXcontrol = list(iter.max = 500, minprior = 0.1))
print(mdl1)
#betamix(formula = fml, data = df2, k = 2, FLXcontrol = list(iter.max = 500, 
#    minprior = 0.1))
#
#Cluster sizes:
#  1   2 
#157 959 

summary(mdl1, which = 'concomitant')
#            Estimate Std. Error z value Pr(>|z|)   
#(Intercept) -1.35153    0.41988 -3.2188 0.001287 **
#x1           2.92537    1.13046  2.5878 0.009660 **
#x2           2.82809    1.42139  1.9897 0.046628 * 

summary(mdl1)
#$Comp.1$mean
#              Estimate Std. Error z value  Pr(>|z|)    
#(Intercept) -0.8963228  1.0385545 -0.8630 0.3881108    
#x2           3.1769062  0.6582108  4.8266 1.389e-06 ***
#x3          -0.0520060  0.0743714 -0.6993 0.4843805    
#x4           4.9642998  1.4204071  3.4950 0.0004741 ***
#x5           0.0021647  0.0022659  0.9554 0.3393987    
#x6           0.0248573  0.0062982  3.9467 7.922e-05 ***
#
#$Comp.1$precision
#            Estimate Std. Error z value  Pr(>|z|)    
#(Intercept) -5.37817    1.44817 -3.7138 0.0002042 ***
#x3           0.45009    0.10094  4.4589 8.239e-06 ***
#x4           3.06969    1.41450  2.1702 0.0299948 *  
#
#$Comp.2
#$Comp.2$mean
#              Estimate Std. Error z value  Pr(>|z|)    
#(Intercept) -1.8737088  0.3888454 -4.8186 1.445e-06 ***
#x2          -0.6318086  0.1892501 -3.3385 0.0008424 ***
#x3           0.1786425  0.0265428  6.7303 1.693e-11 ***
#x4           2.0646272  0.5256002  3.9281 8.561e-05 ***
#x5          -0.0064821  0.0014053 -4.6127 3.974e-06 ***
#x6           0.0018828  0.0022873  0.8231 0.4104318    
#
#$Comp.2$precision
#            Estimate Std. Error z value Pr(>|z|)   
#(Intercept) 1.092403   0.616974  1.7706 0.076630 . 
#x3          0.017330   0.040024  0.4330 0.665029   
#x4          2.138812   0.717702  2.9801 0.002882 **


### BETA REGRESSION TREE: AIC = -578 ###
mdl2 <- betatree(fml, data = df2, minsplit = 100)
print(mdl2)
#1) x2 <= 0.08584895; criterion = 1, statistic = 154.716
#  2)*  weights = 121 
#Terminal node model
#betaReg fit with coefficients:
#      (Intercept)                 x2                 x3                 x4  
#         3.307359          -2.854351          -0.262815          -2.414481  
#               x5                 x6  (phi)_(Intercept)           (phi)_x3  
#        -0.007555           0.030346           1.003767          -0.002907  
#         (phi)_x4  
#         2.528602  
#
#1) x2 > 0.08584895
#  3)*  weights = 995 
#Terminal node model
#betaReg fit with coefficients:
#      (Intercept)                 x2                 x3                 x4  
#        -2.134931          -0.194830           0.168136           2.811077  
#               x5                 x6  (phi)_(Intercept)           (phi)_x3  
#        -0.002070           0.004677          -1.018102           0.151778  
#         (phi)_x4  
#         2.142995  

sctest(mdl2, node = 1)
#                x1       x2
#statistic 113.4781 154.7165
#p.value     0.0000   0.0000

summary(mdl2)
#$`2`
#
#Coefficients (mean model with logit link):
#             Estimate Std. Error z value Pr(>|z|)    
#(Intercept)  3.307359   1.091289   3.031 0.002440 ** 
#x2          -2.854351   3.644882  -0.783 0.433561    
#x3          -0.262815   0.074716  -3.518 0.000436 ***
#x4          -2.414481   1.785447  -1.352 0.176276    
#x5          -0.007555   0.002788  -2.710 0.006738 ** 
#x6           0.030346   0.006833   4.441 8.96e-06 ***
#
#Phi coefficients (precision model with log link):
#             Estimate Std. Error z value Pr(>|z|)
#(Intercept)  1.003767   1.353496   0.742    0.458
#x3          -0.002907   0.090816  -0.032    0.974
#x4           2.528602   2.344241   1.079    0.281

#$`3`
#
#Coefficients (mean model with logit link):
#             Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -2.134931   0.337784  -6.320 2.61e-10 ***
#x2          -0.194830   0.144062  -1.352  0.17625    
#x3           0.168136   0.022521   7.466 8.28e-14 ***
#x4           2.811077   0.387788   7.249 4.20e-13 ***
#x5          -0.002070   0.001136  -1.822  0.06848 .  
#x6           0.004677   0.001770   2.643  0.00822 ** 
#
#Phi coefficients (precision model with log link):
#            Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -1.01810    0.46575  -2.186 0.028821 *  
#x3           0.15178    0.03057   4.965 6.88e-07 ***
#x4           2.14300    0.56979   3.761 0.000169 ***

Written by statcompute

October 27, 2014 at 10:44 pm

Posted in S+/R, Statistical Models

Model Segmentation with Recursive Partitioning

library(party)

df1 <- read.csv("credit_count.csv")
df2 <- df1[df1$CARDHLDR == 1, ]

# logit model in the first formula part; AGE and SELFEMPL as partitioning variables
mdl <- mob(DEFAULT ~ MAJORDRG + MINORDRG + INCOME + OWNRENT | AGE + SELFEMPL,
           data = df2, family = binomial(), control = mob_control(minsplit = 1000),
           model = glinearModel)

print(mdl)
#1) AGE <= 22.91667; criterion = 1, statistic = 48.255
#  2)*  weights = 1116 
#Terminal node model
#Binomial GLM with coefficients:
#(Intercept)     MAJORDRG     MINORDRG       INCOME      OWNRENT  
# -0.6651905    0.0633978    0.5182472   -0.0006038    0.3071785  
#
#1) AGE > 22.91667
#  3)*  weights = 9383 
#Terminal node model
#Binomial GLM with coefficients:
#(Intercept)     MAJORDRG     MINORDRG       INCOME      OWNRENT  
# -1.4117010    0.2262091    0.2067880   -0.0003822   -0.2127193  

### TEST FOR STRUCTURAL CHANGE ###
sctest(mdl, node = 1)
#                   AGE    SELFEMPL
#statistic 4.825458e+01 20.88612025
#p.value   1.527484e-07  0.04273836

summary(mdl, node = 2)
#Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -0.6651905  0.2817480  -2.361 0.018229 *  
#MAJORDRG     0.0633978  0.3487305   0.182 0.855743    
#MINORDRG     0.5182472  0.2347656   2.208 0.027278 *  
#INCOME      -0.0006038  0.0001639  -3.685 0.000229 ***
#OWNRENT      0.3071785  0.2028491   1.514 0.129945    

summary(mdl, node = 3)
#Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
#(Intercept) -1.412e+00  1.002e-01 -14.093  < 2e-16 ***
#MAJORDRG     2.262e-01  7.067e-02   3.201  0.00137 ** 
#MINORDRG     2.068e-01  4.925e-02   4.199 2.68e-05 ***
#INCOME      -3.822e-04  4.186e-05  -9.131  < 2e-16 ***
#OWNRENT     -2.127e-01  7.755e-02  -2.743  0.00609 ** 

Written by statcompute

October 26, 2014 at 2:23 am

Posted in Credit Risk, S+/R, Statistical Models

Estimating a Beta Regression with Variable Dispersion in R

pkgs <- c('sas7bdat', 'betareg', 'lmtest')
lapply(pkgs, require, character.only = T)

df1 <- read.sas7bdat("lgd.sas7bdat")
df2 <- df1[which(df1$y < 1), ]

xvar <- paste("x", 1:7, sep = '', collapse = " + ")
fml1 <- as.formula(paste("y ~ ", xvar))
fml2 <- as.formula(paste("y ~ ", xvar, "|", xvar))

# FIT A BETA MODEL WITH THE FIXED PHI
beta1 <- betareg(fml1, data = df2)
summary(beta1)

# Coefficients (mean model with logit link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.500242   0.329670  -4.551 5.35e-06 ***
# x1           0.007516   0.026020   0.289 0.772680    
# x2           0.429756   0.135899   3.162 0.001565 ** 
# x3           0.099202   0.022285   4.452 8.53e-06 ***
# x4           2.465055   0.415657   5.931 3.02e-09 ***
# x5          -0.003687   0.001070  -3.446 0.000568 ***
# x6           0.007181   0.001821   3.943 8.06e-05 ***
# x7           0.128796   0.186715   0.690 0.490319    
#
# Phi coefficients (precision model with identity link):
#       Estimate Std. Error z value Pr(>|z|)    
# (phi)   3.6868     0.1421   25.95   <2e-16 ***

# FIT A BETA MODEL WITH THE VARIABLE PHI
beta2 <- betareg(fml2, data = df2)
summary(beta2)

# Coefficients (mean model with logit link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.996661   0.336445  -5.935 2.95e-09 ***
# x1           0.007033   0.026809   0.262 0.793072    
# x2           0.371098   0.135186   2.745 0.006049 ** 
# x3           0.133356   0.022624   5.894 3.76e-09 ***
# x4           2.951245   0.401493   7.351 1.97e-13 ***
# x5          -0.003475   0.001119  -3.105 0.001902 ** 
# x6           0.006528   0.001884   3.466 0.000529 ***
# x7           0.100274   0.190915   0.525 0.599424    
#
# Phi coefficients (precision model with log link):
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -0.454399   0.452302  -1.005 0.315072    
# x1           0.009119   0.035659   0.256 0.798150    
# x2           0.611049   0.188225   3.246 0.001169 ** 
# x3           0.092073   0.030678   3.001 0.002689 ** 
# x4           2.248399   0.579440   3.880 0.000104 ***
# x5          -0.002188   0.001455  -1.504 0.132704    
# x6          -0.000317   0.002519  -0.126 0.899847    
# x7          -0.166226   0.256199  -0.649 0.516457    

# LIKELIHOOD RATIO TEST TO COMPARE BOTH BETA MODELS
lrtest(beta1, beta2) 

# Likelihood ratio test
#
# Model 1: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7
# Model 2: y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 | x1 + x2 + x3 + x4 + x5 + x6 + x7
#   #Df LogLik Df Chisq Pr(>Chisq)    
# 1   9 231.55                        
# 2  16 257.24  7 51.38  7.735e-09 ***

Written by statcompute

October 19, 2014 at 5:25 pm

Posted in Econometrics, S+/R, Statistical Models

Julia: Function vs. Macro

using DataFrames, Benchmark

# FUNCTION TO READ CSV FILE TO DATAFRAME
function csv2df(file)
  d = readcsv(file, header = true)
  head = d[2]
  data = d[1]
  dt = DataFrame()
  for i in 1:size(head)[2]
    dt[symbol(head[i])] = data[:, i]
  end
  return dt
end

df_funct() = csv2df("survdt.csv");

# MACRO TO READ CSV FILE TO DATAFRAME
# (the macro body runs once at expansion time, so df_macro() below simply
# returns a precomputed DataFrame, which explains the benchmark result)
macro csv2df(file)
  d = readcsv(file, header = true)
  head = d[2]
  data = d[1]
  dt = DataFrame()
  for i in 1:size(head)[2]
    dt[symbol(head[i])] = data[:, i]
  end
  return dt
end

df_macro() = @csv2df "survdt.csv";

# FUNCTION IN THE DATAFRAME PACKAGE
df_frame() = readtable("survdt.csv");

print(compare([df_funct, df_macro, df_frame], 10))

# | Row | Function   | Average  | Relative  | Replications |
# |-----|------------|----------|-----------|--------------|
# | 1   | "df_funct" | 0.164968 | 4.07328e6 | 10           |
# | 2   | "df_macro" | 4.05e-8  | 1.0       | 10           |
# | 3   | "df_frame" | 0.063254 | 1.56183e6 | 10           |

Written by statcompute

October 10, 2014 at 1:18 am

Posted in Julia
