Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Create Highly Customized Excel Chart with Python

import pandas as pd
# pandas.io.data was later split out into the separate pandas-datareader package
import pandas.io.data as web

# download daily prices for Fifth Third Bancorp (FITB) from Yahoo! Finance
data = web.get_data_yahoo('FITB', '1/1/2009', '9/1/2014')
df = pd.DataFrame({'FITB': data['Adj Close']})

excel_file = 'fitb.xlsx'
sheet_name = 'Sheet1'

writer = pd.ExcelWriter(excel_file, engine = 'xlsxwriter')
df.to_excel(writer, sheet_name = sheet_name)

workbook = writer.book
worksheet = writer.sheets[sheet_name]
worksheet.set_column('A:A', 20)

chart = workbook.add_chart({'type': 'line'})
# to_excel() writes the header in row 0, so the data occupies rows 1 through len(df)
max_row = len(df)
chart.add_series({
    'name':       ['Sheet1', 0, 1],
    'categories': ['Sheet1', 1, 0, max_row, 0],
    'values':     ['Sheet1', 1, 1, max_row, 1],
    'line':       {'width': 1.00},
})

chart.set_x_axis({'name': 'Date', 'date_axis': True})
chart.set_y_axis({'name': 'Price', 'major_gridlines': {'visible': False}})
chart.set_legend({'position': 'top'})

worksheet.insert_chart('D1', chart)
writer.save()

[Screenshot: the resulting Excel line chart]

Written by statcompute

September 1, 2014 at 10:41 am

Posted in PYTHON

Thoughts on Modeling Practices in Exposure at Default

In the consumer credit risk arena, EAD (Exposure at Default) is a major component in the calculation of EL (Expected Loss), particularly for line-of-credit products such as Credit Card or HELOC. While it is common to gauge EAD at the portfolio level based upon the utilization rate, a leading practice implemented in more sophisticated banks is to estimate account-level EAD models that incorporate individual risk characteristics.

Empirically, three risk measures, namely LEQ, CCF, and EADF, are employed to describe EAD at the account level. Each of them carries a different business meaning and therefore should be applied to a different type of account. For instance, in a Credit Card portfolio, while LEQ is applicable to most accounts, it might not be suitable for accounts with utilization rates close to 100%. In addition, since newly originated or inactive accounts do not have any account activity, EADF might be more appropriate for them.
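To make the distinction concrete, below is a minimal R sketch of how the three measures are typically translated into an account-level EAD, assuming a simple data frame with the drawn balance and the credit limit at the reference date. The account data and the factor values are purely illustrative, and the formulas follow the standard textbook definitions (LEQ applied to the undrawn amount, CCF applied to the drawn balance, and EADF applied to the total limit) rather than any particular bank's implementation.

# hypothetical account-level data: drawn balance and credit limit at the reference date
acct <- data.frame(
  id      = c(1, 2, 3),
  balance = c(2000, 4900, 0),
  limit   = c(5000, 5000, 5000)
)

# assumed estimates of the three risk measures (for illustration only)
leq  <- 0.6   # Loan Equivalent factor, applied to the undrawn amount
ccf  <- 1.1   # Credit Conversion Factor, applied to the drawn balance
eadf <- 0.4   # EAD Factor, applied to the credit limit

# account-level EAD under each measure
acct$ead_leq  <- acct$balance + leq * (acct$limit - acct$balance)
acct$ead_ccf  <- ccf * acct$balance
acct$ead_eadf <- eadf * acct$limit

print(acct)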

The table below summarizes the three EAD measures together with their use cases in practice. I hope it proves useful for consumer banking practitioners, and I wish them the best in the coming CCAR (Comprehensive Capital Analysis and Review) 2015.
[Table: EAD measures and their use cases]

Written by statcompute

August 31, 2014 at 2:42 pm

Posted in Credit Risk, Statistics

Chain Operations: An Interesting Feature in the dplyr Package


library(data.table)
library(dplyr)

data1 <- fread('/home/liuwensui/Downloads/2008.csv', header = T, sep = ',')
dim(data1)
# [1] 7009728      29

data2 <- data1 %.%
           filter(Year == 2008, Month %in% c(1, 2, 3, 4, 5, 6)) %.%
           select(Year, Month, AirTime) %.%
           group_by(Year, Month) %.%
           summarize(avg_time = mean(AirTime, na.rm = TRUE)) %.%
           arrange(desc(avg_time))

print(data2)
#   Year Month avg_time
# 1 2008     3 106.1939
# 2 2008     2 105.3185
# 3 2008     6 104.7604
# 4 2008     1 104.6181
# 5 2008     5 104.3720
# 6 2008     4 104.2694
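As a side note, later dplyr releases replaced the %.% operator with the magrittr pipe %>%. The same chain, reusing the data1 table loaded above, would look roughly like this (a sketch with the newer syntax, not re-run against the 2008.csv file):

library(dplyr)

data2 <- data1 %>%
           filter(Year == 2008, Month %in% c(1, 2, 3, 4, 5, 6)) %>%
           select(Year, Month, AirTime) %>%
           group_by(Year, Month) %>%
           summarize(avg_time = mean(AirTime, na.rm = TRUE)) %>%
           arrange(desc(avg_time))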

Written by statcompute

July 28, 2014 at 10:57 pm

Posted in Big Data, S+/R

Estimating GLM with Julia

using DataFrames, GLM

df1 = readtable("credit_count.txt");

df2 = df1[df1[:CARDHLDR] .== 1, [:DEFAULT, :MAJORDRG, :MINORDRG, :INCOME, :OWNRENT]];

mdl = glm(DEFAULT ~ MAJORDRG + MINORDRG + INCOME + OWNRENT, df2, Binomial());

print(mdl);
# Coefficients:
#                  Estimate  Std.Error  z value Pr(>|z|)
# (Intercept)      -1.20444  0.0908218 -13.2616  < eps()
# MAJORDRG         0.203135  0.0692537  2.93319   0.0034
# MINORDRG         0.202727  0.0479741  4.22575   2.4e-5
# INCOME       -0.000442229 4.04222e-5 -10.9402  < eps()
# OWNRENT         -0.201223  0.0716217 -2.80953    0.005

print(deviance(mdl));
# 6376.220859525586

Written by statcompute

March 8, 2014 at 7:13 pm

Learning Pig Latin on 2014-03-01

grunt> sh head -4 2008.csv;
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA

grunt> sh sed '1d' 2008.csv > 2008.csv2;

grunt> sh head -4 2008.csv2;            
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,926,930,1054,1100,WN,1746,N612SW,88,90,78,-6,-4,IND,BWI,515,3,7,0,,0,NA,NA,NA,NA,NA

grunt> A = LOAD '2008.csv2' USING PigStorage(',');                

grunt> B = GROUP A BY ($0, $1); 

grunt> C = FOREACH B GENERATE group, COUNT(A);

grunt> D = FILTER C BY $0.$1 IN (1, 2, 3);

grunt> SPLIT D INTO D1 IF $0.$1 == 1, D2 IF $0.$1 == 2, D3 IF $0.$1 == 3;

grunt> dump D3;
((2008,3),616090)

Written by statcompute

March 1, 2014 at 11:05 pm

Posted in Big Data, Pig Latin

My First Learning Session in Pig Latin

-- load data
A = load '/home/liuwensui/Documents/data/test_pig.txt' using PigStorage(',') as (id: chararray, x: int, y: float);
dump A;

/*
(A,1,1.1)
(A,2,2.2)
(A,2,3.3)
(B,1,1.1)
(B,1,2.2)
*/

-- group data
B = group A by id;
dump B;

/*
(A,{(A,1,1.1),(A,2,2.2),(A,2,3.3)})
(B,{(B,1,1.1),(B,1,2.2)})
*/

-- summarize data by group
C1 = foreach B generate group as id1, MAX(A.x) as max_x;
dump C1;

/*
(A,2)
(B,1)
*/

C2 = foreach B generate group as id2, MIN(A.y) as min_y;
dump C2;

/*
(A,1.1)
(B,1.1)
*/

-- select data by criterion
C3 = filter C2 by id2 == 'A';
dump C3;

/*
(A,1.1)
*/

-- inner join
C4 = join C1 by id1, C3 by id2;
dump C4;

/*
(A,2,A,1.1)
*/

-- full join
C5 = join C1 by id1 full, C3 by id2;
dump C5;

/*
(A,2,A,1.1)
(B,1,,)
*/

-- union
C6 = union C4, C5;
dump C6;

/*
(A,2,A,1.1)
(B,1,,)
(A,2,A,1.1)
*/

Written by statcompute

February 23, 2014 at 11:23 pm

Posted in Big Data, Pig Latin

Efficiency of Importing Large CSV Files in R

### size of csv file: 689.4MB (7,009,728 rows * 29 columns) ###

system.time(read.csv('../data/2008.csv', header = T))
#   user  system elapsed 
# 88.301   2.416  90.716

library(data.table)
system.time(fread('../data/2008.csv', header = T, sep = ',')) 
#   user  system elapsed 
#  4.740   0.048   4.785

library(bigmemory)
system.time(read.big.matrix('../data/2008.csv', header = T))
#   user  system elapsed 
# 59.544   0.764  60.308

library(ff)
system.time(read.csv.ffdf(file = '../data/2008.csv', header = T))
#   user  system elapsed 
# 60.028   1.280  61.335 

library(sqldf)
system.time(read.csv.sql('../data/2008.csv'))
#   user  system elapsed 
# 87.461   3.880  91.447

Written by statcompute

February 11, 2014 at 12:07 am

Posted in Big Data, S+/R
