Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. - Isaac Newton

Performance Improvement by Expression Evaluation in Pandas

In [1]: import pandas as pd

In [2]: df1 = pd.read_csv('/home/liuwensui/Downloads/2008.csv')

In [3]: df1.shape
Out[3]: (7009728, 29)

In [4]: # STANDARD PANDAS SYNTAX

In [5]: %timeit -n 10 df2 = df1[(df1['Month'] == 1) & (df1['DayofMonth'] == 1)]
10 loops, best of 3: 85.1 ms per loop

In [6]: # EXPRESSION EVALUATION

In [7]: %timeit -n 10 df3 = df1.query('Month == 1 and DayofMonth == 1')
10 loops, best of 3: 44 ms per loop
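
As a footnote, query() hands the expression off to the numexpr engine when that package is installed, which is where the saving comes from. The same engine backs eval() for column arithmetic; a minimal sketch against the same data (the Delay column name is made up for illustration):

df1['Delay'] = df1.eval('ArrDelay + DepDelay')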

Written by statcompute

September 14, 2014 at 11:04 pm

Posted in PYTHON


SAS Macro Aligning The Logit Variable to A Scorecard with Specific PDO and Base Point

%macro align_score(data = , y = , logit = , pdo = , base_point = , base_odds = , min_point = 100, max_point = 900); 
***********************************************************; 
* THE MACRO IS TO ALIGN A LOGIT VARIABLE TO A SCORE WITH  *; 
* SPECIFIC PDO, BASE POINT, AND BASE ODDS                 *; 
* ======================================================= *; 
* PARAMETERS:                                             *; 
*  DATA      : INPUT SAS DATA TABLE                       *; 
*  Y         : PERFORMANCE VARIABLE WITH 0/1 VALUE        *; 
*  LOGIT     : A LOGIT VARIABLE TO BE ALIGNED FROM        *; 
*  PDO       : PDO OF THE SCORE ALIGNED TO                *; 
*  BASE_POINT: BASE POINT OF THE SCORE ALIGNED TO         *; 
*  BASE_ODDS : ODDS AT BASE POINT OF THE SCORE ALIGNED TO *; 
*  MIN_POINT : LOWER END OF SCORE POINT, 100 BY DEFAULT   *; 
*  MAX_POINT : UPPER END OF SCORE POINT, 900 BY DEFAULT   *; 
* ======================================================= *; 
* OUTPUTS:                                                *; 
*  ALIGN_FORMULA.SAS                                      *; 
*  A SAS CODE WITH THE FORMULA TO ALIGN THE LOGIT FIELD   *; 
*  TO A SPECIFIC SCORE TOGETHER WITH THE STATISTICAL      *; 
*  SUMMARY OF ALIGN_SCORE                                 *;                                    
***********************************************************; 
 
options nocenter nonumber nodate mprint mlogic symbolgen 
        orientation = landscape ls = 150 formchar = "|----|+|---+=|-/\<>*"; 
 
%local b0 b1; 
 
data _tmp1 (keep = &y &logit); 
  set &data; 
  where &y in (0, 1) and &logit ~= .; 
run; 
 
ods listing close; 
ods output ParameterEstimates = _est1 (keep = variable estimate); 
proc logistic data = _last_ desc; 
  model &y = &logit; 
run; 
ods listing; 
 
data _null_; 
  set _last_; 
   
  if _n_ = 1 then do; 
    b = - (estimate + (log(&base_odds) - (log(2) / &pdo) * &base_point)) / (log(2) / &pdo); 
    call symput('b0', put(b, 15.8)); 
  end; 
  else do; 
    b = estimate / (log(2) / &pdo); 
    call symput('b1', put(b, 15.8)); 
  end; 
run; 
 
filename formula "ALIGN_FORMULA.SAS"; 
 
data _null_; 
  file formula; 
 
  put @3 3 * "*" " SCORE ALIGNMENT FORMULA OF %upcase(&logit) " 3 * "*" ";"; 
  put; 
  put @3 "ALIGN_SCORE = MAX(MIN(ROUND((%trim(&b0)) - (%trim(&b1)) * %upcase(&logit), 1), &max_point), &min_point);"; 
  put; 
run; 
 
data _tmp2; 
  set _tmp1; 
  %inc formula; 
run; 
 
proc summary data = _last_ nway; 
  class &y; 
  output out = _tmp3(drop = _type_ _freq_) 
  min(align_score) = min_scr max(align_score) = max_scr 
  median(align_score) = mdn_scr; 
run; 
 
data _null_; 
  set _last_; 
  file formula mod; 
 
  if _n_ = 1 then do; 
    put @3 3 * "*" " STATISTICAL SUMMARY OF ALIGN_SCORE BY INPUT DATA: " 3 * "*" ";"; 
  end; 
  put @3 "* WHEN %upcase(&y) = " &y ": MIN(SCORE) = " min_scr " MEDIAN(SCORE) = " mdn_scr " MAX(SCORE) = " max_scr "*;"; 
run; 
 
proc datasets library = work nolist; 
  delete _: (mt = data); 
run; 
quit; 
 
***********************************************************; 
*                     END OF THE MACRO                    *; 
***********************************************************;  
%mend align_score;
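
A hypothetical invocation is sketched below, assuming a table SCORES with a 0/1 performance flag BAD and a logit variable LOGIT, aligned to a score with PDO = 20 and 600 points at 30-to-1 odds (all names and values here are illustrative):

%align_score(data = scores, y = bad, logit = logit, pdo = 20, base_point = 600, base_odds = 30);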

Written by statcompute

September 3, 2014 at 9:51 pm

Posted in Credit Risk, SAS, Scorecard

Create Highly Customized Excel Chart with Python

import pandas as pd
import pandas.io.data as web

# download daily FITB prices from Yahoo! Finance and keep the adjusted close
data = web.get_data_yahoo('FITB', '1/1/2009', '9/1/2014')
df = pd.DataFrame({'FITB': data['Adj Close']})

excel_file = 'fitb.xlsx'
sheet_name = 'Sheet1'

# write the price series to Excel through the xlsxwriter engine
writer = pd.ExcelWriter(excel_file, engine = 'xlsxwriter')
df.to_excel(writer, sheet_name = sheet_name)

# grab the underlying xlsxwriter objects for direct formatting
workbook = writer.book
worksheet = writer.sheets[sheet_name]
worksheet.set_column('A:A', 20)

chart = workbook.add_chart({'type': 'line'})

# the header sits in zero-indexed row 0; data occupies rows 1 through len(df)
max_row = len(df)
chart.add_series({
    'name':       ['Sheet1', 0, 1],
    'categories': ['Sheet1', 1, 0, max_row, 0],
    'values':     ['Sheet1', 1, 1, max_row, 1],
    'line':       {'width': 1.00},
})

chart.set_x_axis({'name': 'Date', 'date_axis': True})
chart.set_y_axis({'name': 'Price', 'major_gridlines': {'visible': False}})
chart.set_legend({'position': 'top'})
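
# an optional touch beyond the original post: enlarge the chart from
# xlsxwriter's default 480x288 pixels before inserting it
chart.set_size({'width': 720, 'height': 400})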

worksheet.insert_chart('D1', chart)
writer.save()

[Screenshot: the customized Excel line chart of FITB adjusted close prices]

Written by statcompute

September 1, 2014 at 10:41 am

Posted in PYTHON

Thoughts on Modeling Practices in Exposure at Default

In the consumer credit risk arena, EAD (Exposure at Default) is a major component in the calculation of EL (Expected Loss), particularly for line-of-credit products such as credit cards or HELOCs. While it is common to gauge EAD at the portfolio level based upon the utilization rate, a leading practice in more sophisticated banks is to estimate account-level EAD models that incorporate individual risk characteristics.

Empirically, three risk measures, namely LEQ (Loan Equivalent), CCF (Credit Conversion Factor), and EADF (EAD Factor), are employed to describe EAD at the account level. Each carries a different business meaning and therefore should be applied to a different type of account. For instance, in a credit card portfolio, LEQ should be applicable to most accounts but might not be suitable for those with utilization rates close to 100%. In addition, since newly originated or inactive accounts do not show any account activity, EADF might be more appropriate for them.
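
For reference, a minimal sketch of the three measures, following the definitions common in the EAD literature (the function and argument names are illustrative, not drawn from the table below):

def leq(balance, limit, ead_bal):
    # Loan Equivalent: fraction of the undrawn line converted into exposure;
    # the denominator vanishes as utilization approaches 100%, which is why
    # LEQ breaks down for nearly maxed-out accounts
    return (ead_bal - balance) / (limit - balance)

def ccf(balance, ead_bal):
    # Credit Conversion Factor: exposure as a multiple of the drawn balance
    return ead_bal / balance

def eadf(limit, ead_bal):
    # EAD Factor: exposure as a share of the total credit line
    return ead_bal / limit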

The table below summarizes the three EAD measures together with their use cases in practice. I hope it is useful for consumer banking practitioners, and I wish everyone the best in the coming CCAR (Comprehensive Capital Analysis and Review) 2015.
[Table: the three EAD measures and their use cases]

Written by statcompute

August 31, 2014 at 2:42 pm

Posted in Credit Risk, Statistics

Chain Operations: An Interesting Feature in dplyr Package


library(data.table)
library(dplyr)

data1 <- fread('/home/liuwensui/Downloads/2008.csv', header = T, sep = ',')
dim(data1)
# [1] 7009728      29

data2 <- data1 %.%
           filter(Year == 2008, Month %in% c(1, 2, 3, 4, 5, 6)) %.%
           select(Year, Month, AirTime) %.%
           group_by(Year, Month) %.%
           summarize(avg_time = mean(AirTime, na.rm = TRUE)) %.%
           arrange(desc(avg_time))

print(data2)
#   Year Month avg_time
# 1 2008     3 106.1939
# 2 2008     2 105.3185
# 3 2008     6 104.7604
# 4 2008     1 104.6181
# 5 2008     5 104.3720
# 6 2008     4 104.2694
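
As a side note, later dplyr releases deprecated %.% in favor of the magrittr pipe %>%; the same chain would read as below (only the operator changes):

data2 <- data1 %>%
           filter(Year == 2008, Month %in% c(1, 2, 3, 4, 5, 6)) %>%
           select(Year, Month, AirTime) %>%
           group_by(Year, Month) %>%
           summarize(avg_time = mean(AirTime, na.rm = TRUE)) %>%
           arrange(desc(avg_time))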

Written by statcompute

July 28, 2014 at 10:57 pm

Posted in Big Data, S+/R

Estimating GLM with Julia

using DataFrames, GLM

# read the delimited file into a DataFrame
df1 = readtable("credit_count.txt");

# subset to card holders and keep the modeling variables
df2 = df1[df1[:CARDHLDR] .== 1, [:DEFAULT, :MAJORDRG, :MINORDRG, :INCOME, :OWNRENT]];

# fit a logistic regression (Binomial family with the default logit link)
mdl = glm(DEFAULT ~ MAJORDRG + MINORDRG + INCOME + OWNRENT, df2, Binomial());

print(mdl);
# Coefficients:
#                  Estimate  Std.Error  z value Pr(>|z|)
# (Intercept)      -1.20444  0.0908218 -13.2616  < eps()
# MAJORDRG         0.203135  0.0692537  2.93319   0.0034
# MINORDRG         0.202727  0.0479741  4.22575   2.4e-5
# INCOME       -0.000442229 4.04222e-5 -10.9402  < eps()
# OWNRENT         -0.201223  0.0716217 -2.80953    0.005

print(deviance(mdl));
# 6376.220859525586
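
As a follow-up sketch (not in the original post), the bare coefficient vector is also available through the usual StatsBase accessor:

print(coef(mdl));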

Written by statcompute

March 8, 2014 at 7:13 pm

Learning Pig Latin on 2014-03-01

grunt> sh head -4 2008.csv;
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA

grunt> sh sed '1d' 2008.csv > 2008.csv2;

grunt> sh head -4 2008.csv2;            
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,926,930,1054,1100,WN,1746,N612SW,88,90,78,-6,-4,IND,BWI,515,3,7,0,,0,NA,NA,NA,NA,NA

grunt> A = LOAD '2008.csv2' USING PigStorage(',');                

grunt> B = GROUP A BY ($0, $1); 

grunt> C = FOREACH B GENERATE group, COUNT(A);

grunt> D = FILTER C BY $0.$1 IN (1, 2, 3);

grunt> SPLIT D INTO D1 IF $0.$1 == 1, D2 IF $0.$1 == 2, D3 IF $0.$1 == 3;

grunt> dump D3;
((2008,3),616090)
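
To persist a split instead of dumping it to the console, a STORE statement would do (the output path is illustrative):

grunt> STORE D1 INTO 'month1_counts' USING PigStorage(',');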

Written by statcompute

March 1, 2014 at 11:05 pm

Posted in Big Data, Pig Latin
