Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for October 2012

Data Aggregation with Python

In the code below, I tried three different methods of data aggregation by groups in python and compared their efficiency. It is also my first time using ipython, which is an awesome testing environment.

It turns out that the pandas package is the fastest, while sqlite3 through an in-memory database is the slowest.

In [1]: from random import *

In [2]: from pandas import *

In [3]: from sqlite3 import *

In [4]: from datetime import datetime, timedelta

In [5]: # SIMULATE A LIST WITH 1M RECORDS

In [6]: list = [[randint(1, 10), x] for x in xrange(1, 1000001)]

In [7]: # CALCULATE THE SUM BY GENERIC CODING

In [8]: start1 = datetime.now()

In [9]: summ1 = []

In [10]: for i in set(x[0] for x in list):
   ....:   total = sum([x[1] for x in list if x[0] == i])
   ....:   summ1.append([i, total])
   ....:

In [11]: end1 = datetime.now()

In [12]: print str(end1 - start1)
0:00:04.703000

In [13]: summ1
Out[13]:
[[1, 49899161104L],
 [2, 50089982753L],
 [3, 50289189719L],
 [4, 50152366638L],
 [5, 49876070739L],
 [6, 50024578837L],
 [7, 50024658776L],
 [8, 50082058403L],
 [9, 49665851019L],
 [10, 49896582012L]]

In [14]: # CALCULATE THE SUM BY DATAFRAME

In [15]: start2 = datetime.now()

In [16]: df = DataFrame(list, columns = ["x", "y"])

In [17]: summ2 = df.groupby('x', as_index = False).sum()

In [18]: end2 = datetime.now()

In [19]: print str(end2 - start2)
0:00:01.500000

In [20]: summ2
Out[20]:
    x            y
0   1  49899161104
1   2  50089982753
2   3  50289189719
3   4  50152366638
4   5  49876070739
5   6  50024578837
6   7  50024658776
7   8  50082058403
8   9  49665851019
9  10  49896582012

In [21]: # CALCULATE THE SUM BY SQLITE

In [22]: start3 = datetime.now()

In [23]: con = connect(":memory:")

In [24]: with con:
   ....:   cur = con.cursor()
   ....:   cur.execute("create table tbl (x real, y real)")
   ....:   cur.executemany("insert into tbl values(?, ?)", list)
   ....:   cur.execute("select x, sum(y) from tbl group by x")
   ....:   summ3 = cur.fetchall()
   ....:   cur.close()
   ....:

In [25]: con.close()

In [26]: end3 = datetime.now()

In [27]: print str(end3 - start3)
0:00:22.016000

In [28]: summ3
Out[28]:
[(1.0, 49899161104.0),
 (2.0, 50089982753.0),
 (3.0, 50289189719.0),
 (4.0, 50152366638.0),
 (5.0, 49876070739.0),
 (6.0, 50024578837.0),
 (7.0, 50024658776.0),
 (8.0, 50082058403.0),
 (9.0, 49665851019.0),
 (10.0, 49896582012.0)]
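
A fourth approach, not timed above, is a single pass over the list with a dictionary, which avoids re-scanning the whole list for every group. Below is a minimal sketch with collections.defaultdict on the same kind of [group, value] list; the names data, totals, and summ4 are arbitrary.

from collections import defaultdict
from random import randint

# SIMULATE THE SAME KIND OF LIST WITH 1M [GROUP, VALUE] RECORDS
data = [[randint(1, 10), x] for x in range(1, 1000001)]

# SINGLE PASS: ACCUMULATE THE SUM FOR EACH GROUP IN A DICTIONARY
totals = defaultdict(int)
for grp, val in data:
  totals[grp] += val

summ4 = sorted(totals.items())
print(summ4)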

Written by statcompute

October 24, 2012 at 12:14 am

Data Munging with In-Memory SQLite Database

The in-memory database is a powerful feature of SQLite. Because the whole database is kept in memory instead of on the hard drive, it allows faster access to the data. When used properly with python, it becomes a handy tool to efficiently accomplish complex data munging operations that can’t be done easily with generic python code.

In the demonstration below, I will show how to load lists into tables in an in-memory database, merge the two tables, and then fetch data from the resulting table.

acct = [['id1', '123-456-7890'], ['id2', '234-567-8901'],
        ['id3', '999-999-9999']]

hist = [['id1', 'deposited', 100], ['id1', 'withdraw', 40],
        ['id2', 'withdraw', 15], ['id1', 'withdraw', 25],
        ['id2',  'deposited', 30], ['id4',  'deposited', 30]]

import sqlite3

con = sqlite3.connect(":memory:")

with con:
  cur = con.cursor()
  
  cur.execute("create table tbl_acct (id text, phone text)")
  cur.executemany("insert into tbl_acct values(?, ?)", acct)

  cur.execute("create table tbl_hist (id text, trans text, amount real)")
  cur.executemany("insert into tbl_hist values(?, ?, ?)", hist)

  cur.execute("drop table if exists tbl_join")
  cur.execute("create table tbl_join as select a.*, b.phone from tbl_hist as a \
               inner join tbl_acct as b on a.id = b.id order by a.id")
  
  cur.execute("select * from tbl_join")
  for i in cur:
    print i
    
  cur.close()
    
con.close()

# output:
#(u'id1', u'deposited', 100.0, u'123-456-7890')
#(u'id1', u'withdraw', 40.0, u'123-456-7890')
#(u'id1', u'withdraw', 25.0, u'123-456-7890')
#(u'id2', u'withdraw', 15.0, u'234-567-8901')
#(u'id2', u'deposited', 30.0, u'234-567-8901')
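
As a variation, the inner join above drops id3 (an account with no transactions) and id4 (a transaction with no account record). A left outer join from the account table keeps every account instead; below is a brief sketch that reuses the acct and hist lists from above against a fresh in-memory database, with the same table and column names.

import sqlite3

con = sqlite3.connect(":memory:")

with con:
  cur = con.cursor()

  cur.execute("create table tbl_acct (id text, phone text)")
  cur.executemany("insert into tbl_acct values(?, ?)", acct)

  cur.execute("create table tbl_hist (id text, trans text, amount real)")
  cur.executemany("insert into tbl_hist values(?, ?, ?)", hist)

  # LEFT OUTER JOIN: EVERY ACCOUNT IS KEPT, EVEN WITHOUT TRANSACTION HISTORY
  cur.execute("select a.id, a.phone, b.trans, b.amount from tbl_acct as a \
               left outer join tbl_hist as b on a.id = b.id order by a.id")
  for row in cur.fetchall():
    print(row)

  cur.close()

con.close()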

Written by statcompute

October 21, 2012 at 12:54 am

Data Munging with Pandas

Being a long-time user of SAS and R, I feel very comfortable with rectangle-shaped data tables and database-style operations in data munging. However, when I started diving into python and tried to pick it up as a data analysis tool, it took me a while to get used to the “python style”.

For instance, there are two data tables, one account table with 1 record per id and one transaction history table with multiple records per id.

acct = [['id1', '123-456-7890'], ['id2', '234-567-8901'],
        ['id3', '999-999-9999']]

hist = [['id1', 'deposited', 100], ['id1', 'withdraw', 40],
        ['id2', 'withdraw', 15], ['id1', 'withdraw', 25],
        ['id2',  'deposited', 30], ['id4',  'deposited', 30]]

If I’d like to do an inner join between these two tables, the generic coding logic might look like the code below, which is not very intuitive (to me at least).

join = []
for h in hist:
  for a in acct:
    if h[0] == a[0]:
      row = h + [a[1]]
      join.append(row)

for line in join:
  print line

# output:
#['id1', 'deposited', 100, '123-456-7890']
#['id1', 'withdraw', 40, '123-456-7890']
#['id2', 'withdraw', 15, '234-567-8901']
#['id1', 'withdraw', 25, '123-456-7890']
#['id2', 'deposited', 30, '234-567-8901']
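
For what it is worth, the nested loops above scan the whole account table once for every history record. A dictionary lookup keyed on id performs the same inner join in a single pass; below is a minimal sketch reusing the acct and hist lists from above, with the helper names phone and join3 made up for illustration.

# BUILD A LOOKUP TABLE MAPPING id TO phone
phone = dict(acct)

# INNER JOIN: KEEP ONLY HISTORY RECORDS WHOSE id APPEARS IN THE ACCOUNT TABLE
join3 = [h + [phone[h[0]]] for h in hist if h[0] in phone]

for line in join3:
  print(line)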

The pandas package provides a very flexible column-oriented data structure, the DataFrame, which makes data munging operations extremely easy.

from pandas import *

df_acct = DataFrame(acct, columns = ['id', 'phone'])
df_hist = DataFrame(hist, columns = ['id', 'trans', 'amount'])
join2 = merge(df_acct, df_hist, on = 'id', how = 'inner')
print join2

# output
#    id         phone      trans  amount
#0  id1  123-456-7890  deposited     100
#1  id1  123-456-7890   withdraw      40
#2  id1  123-456-7890   withdraw      25
#3  id2  234-567-8901   withdraw      15
#4  id2  234-567-8901  deposited      30
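
As a follow-up, merge also supports other join types, and the resulting DataFrame can be summarized directly. Below is a brief sketch of a left join plus a groupby aggregation on the same df_acct and df_hist; the names df_join and summ are arbitrary.

# LEFT JOIN: id3 IS KEPT EVEN THOUGH IT HAS NO TRANSACTION HISTORY
df_join = merge(df_acct, df_hist, on = 'id', how = 'left')

# SUMMARIZE THE JOINED TABLE: TOTAL AMOUNT BY id AND TRANSACTION TYPE
summ = df_join.groupby(['id', 'trans'], as_index = False)['amount'].sum()
print(summ)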

Written by statcompute

October 20, 2012 at 5:42 pm

Adjustment of Selection Bias in Marketing Campaign

A standard practice to evaluate the effect of a marketing campaign is to divide the targeted prospects into two testing groups, a control group without the marketing intervention and the other group with the intervention, and then to compare the difference in results between the two groups given observed characteristics. The sole purpose is to see whether the intervention has a causal effect on the results that we are interested in, e.g. the response or conversion rate. From the statistical perspective, it is desirable to randomly assign targeted individuals to one of the testing groups such that the background information of individuals in different groups is comparable or balanced and the assignment of individuals to groups is independent of the outcomes of the intervention. This practice is also called Randomization.

However, in many marketing campaigns that we have observed, the randomization mentioned above is prohibitively difficult due to various real-world constraints. For example, the assignment of an individual to a specific group might be determined in part by his background information, which might in turn be related to his response to the campaign. As a result, this characteristic heterogeneity of targeted individuals between the testing groups will give a biased estimate of the campaign effect, which is also called selection bias. A typical symptom of such bias is that the campaign effect looks more optimistic than it should be.

Selection bias is a common issue in many observational studies in social science. While different methods have been proposed to adjust or correct this bias, we’d like to demonstrate two modeling techniques to correct the selection bias in marketing campaigns, namely the propensity score method and the Heckman selection method.

Introduced by Rosenbaum and Rubin (1983), the propensity score can be defined as the conditional probability of an individual receiving a specific exposure (treatment or campaign) given the observed background information. The idea of the propensity score is to partial out the observed characteristic heterogeneity of individuals between testing groups so as to make the group assignment conditionally independent of the campaign outcome. While there are numerous ways, such as matching or weighting, to implement the propensity score, we are particularly interested in the one proposed by Cela (2003) due to its simplicity and flexibility. The implementation of Cela’s method is straightforward and can be considered a two-step approach.

Step One
First of all, we build a logistic regression using the group assignment as the response variable and then estimate the probability of an individual being assigned to the intervention group given a set of background variables. The propensity score model can be formulated as:

Prob(d = 1 | x) = p = EXP(T * x) / [1 + EXP(T * x)]

where d = 1 for an individual assigned to the intervention group, p is the propensity score, and x is the covariate matrix of background information. In this step, while we propose using a logistic regression, other models designed for binary outcomes, such as a probit model or a classification tree, might also work well.

Step Two
Secondly, we build another logistic regression to estimate the causal effect of the marketing campaign such that

Prob(y = 1 | z, d, p) = EXP(A * z + B * d + C * p) / [1 + EXP(A * z + B * d + C * p)]

where y = 1 for an individual with a positive response and z is the covariate matrix of control factors and background information. In this model, the propensity score acts as a control factor such that the outcome y is independent of the group assignment d after partialling out the observed characteristic heterogeneity incorporated in the propensity score p. In the formulation of our second model, the parameter B, the coefficient of the group assignment d, is the estimated causal effect of the marketing intervention that we are interested in. While we are using a logistic regression for the binary outcome in our second model, any model within the framework of Generalized Linear Models (GLM) should be applicable to the appropriate response variable.
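
To make the two-step approach concrete, below is a minimal sketch in python with pandas and statsmodels, assuming a hypothetical data frame df with the group indicator d, the response y, background variables x1 and x2, and a control factor z1; the file name and all variable names are made up for illustration.

import pandas as pd
import statsmodels.api as sm

# HYPOTHETICAL DATA: d (1 = INTERVENTION GROUP), y (1 = POSITIVE RESPONSE),
# x1 / x2 (BACKGROUND VARIABLES), z1 (CONTROL FACTOR)
df = pd.read_csv("campaign.csv")

# STEP 1: LOGISTIC REGRESSION OF THE GROUP ASSIGNMENT ON BACKGROUND VARIABLES
x = sm.add_constant(df[["x1", "x2"]])
step1 = sm.Logit(df["d"], x).fit()

# THE PROPENSITY SCORE IS THE PREDICTED PROBABILITY OF RECEIVING THE INTERVENTION
df["p"] = step1.predict(x)

# STEP 2: LOGISTIC REGRESSION OF THE RESPONSE ON THE CONTROL FACTOR, THE GROUP
# ASSIGNMENT, AND THE PROPENSITY SCORE; THE COEFFICIENT OF d IS THE CAMPAIGN EFFECT
z = sm.add_constant(df[["z1", "d", "p"]])
step2 = sm.Logit(df["y"], z).fit()
print(step2.summary())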

While the propensity score method is able to adjust the selection bias due to observed characteristics, it is criticized for failing to address the bias arising from unobservables. Originating with Heckman (1976), the Heckman selection method is able to control the bias due to unobservables, provided the assumption about the underlying distribution is valid. In the framework of the Heckman model, it is assumed that there are two simultaneous processes, one called the Selection Equation and the other the Outcome Equation, whose error terms follow a Bivariate Normal Distribution with a nonzero correlation. While the two equations can be modeled simultaneously, we prefer a two-stage estimation method due to its simplicity and its advantage in algorithm convergence.

Step One
Similar to the propensity score method, we fit a probit model to estimate the probability of an individual being assigned to the intervention group.

Prob(d = 1 | x) = G(T * x)

where G(.) is the cumulative distribution function of the Normal Distribution. Please note that the use of the probit model is determined by the normality assumption of the Heckman model. Based upon the result from this probit model, we calculate the Inverse Mills Ratio (IMR) for each individual, which is considered a measure of selection bias and can be formulated as

IMR = g(T * x) / G(T * x)

where g(.) is the probability density function of the Normal Distribution and x is the covariate matrix in the Selection Equation.

Step Two
We build a second probit model by including IMR as one of the predictors, which can be formulated as

Prob(y = 1 | z, d, IMR) = G(A * z + B * d + C * IMR)

where y = 1 for an individual with a positive response and z is the covariate matrix of control factors and background information. While the significance of the parameter C indicates the existence of selection bias, the lack of such significance doesn’t necessarily imply that there is no bias. The consistency of estimates in the Heckman model relies strongly on the distributional assumption, which is difficult to justify with real-world data.
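
Below is a similarly minimal sketch of the two-stage estimation with statsmodels and scipy, again on a hypothetical data frame df with columns d, y, x1, x2, and z1; it only illustrates the mechanics described above rather than a full Heckman implementation.

import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm

# HYPOTHETICAL DATA: d (1 = INTERVENTION GROUP), y (1 = POSITIVE RESPONSE),
# x1 / x2 (BACKGROUND VARIABLES), z1 (CONTROL FACTOR)
df = pd.read_csv("campaign.csv")

# STAGE 1: PROBIT MODEL FOR THE GROUP ASSIGNMENT (SELECTION EQUATION)
x = sm.add_constant(df[["x1", "x2"]])
stage1 = sm.Probit(df["d"], x).fit()

# INVERSE MILLS RATIO = g(T * x) / G(T * x) BASED ON THE FITTED LINEAR PREDICTOR
xb = x.dot(stage1.params)
df["imr"] = norm.pdf(xb) / norm.cdf(xb)

# STAGE 2: PROBIT MODEL FOR THE RESPONSE WITH IMR AS AN ADDITIONAL PREDICTOR;
# A SIGNIFICANT COEFFICIENT ON imr SUGGESTS THE PRESENCE OF SELECTION BIAS
z = sm.add_constant(df[["z1", "d", "imr"]])
stage2 = sm.Probit(df["y"], z).fit()
print(stage2.summary())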

Written by statcompute

October 20, 2012 at 12:03 am

Risk Classification of Auto Insurance Claims

In the insurance industry, Poisson regression (PR) has been widely used to model insurance claims. In real-world auto insurance data, more than 90% of the insured report zero claims, which brings difficulties in two aspects if PR is employed. First of all, the observed number of zero counts is larger than the number predicted by PR. Secondly, the variance might exceed the mean in the observed outcome, a violation of the equi-dispersion assumption of PR. In order to improve the model performance, alternatives to PR should be considered.

As the most common alternative to PR, Negative Binomial (NB) regression addresses the issue of over-dispersion by including a random component with a Gamma distribution. In NB, the variance can be expressed as the mean plus a non-negative term such that

V(Y|X) = mu + mu^2 / theta >= mu = E(Y|X)

Therefore, NB has a more flexible variance structure than PR does. While NB is able to capture the heterogeneity in the count outcome, it is criticized for failing to provide a satisfactory explanation for excess zeros.
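
For illustration, a minimal sketch of fitting PR and NB side by side in python with statsmodels is given below, assuming a hypothetical data frame df with the claim count claims and two rating variables age and premium; the file and variable names are made up for demonstration.

import pandas as pd
import statsmodels.api as sm

# HYPOTHETICAL CLAIM DATA: claims (COUNT OUTCOME), age AND premium (COVARIATES)
df = pd.read_csv("claims.csv")
x = sm.add_constant(df[["age", "premium"]])

# POISSON REGRESSION ASSUMES EQUI-DISPERSION: V(Y|X) = E(Y|X)
pr = sm.Poisson(df["claims"], x).fit()

# NEGATIVE BINOMIAL REGRESSION ADDS A DISPERSION PARAMETER alpha (= 1 / theta),
# SO THAT V(Y|X) = mu + alpha * mu^2 >= mu
nb = sm.NegativeBinomial(df["claims"], x).fit()

# COMPARE THE TWO MODELS, E.G. BY AIC
print(pr.aic, nb.aic)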

Known as a two-part model, Hurdle regression (Mullahy 1986) provides a more appealing interpretation for count outcomes with excess zeros. While the first part separates the potential insurance claimants (Y ≥ 1) from the non-claimants (Y = 0) with a Binomial model, the second part models the number of claims made by those potential claimants (Y ≥ 1) with a truncated-at-zero Poisson model. Thus, the density of the Hurdle model can be expressed as

F(Y|X) = θ for Y = 0

       = (1 – θ) * G(Y) / (1 – G(0)) for Y > 0

where θ is the probability of non-claimants and G(.) is a count density, e.g. Poisson or Negative Binomial. The major motivation of Hurdle regression in the auto insurance context is that an insured individual tends to behave differently once he/she makes the first claim, which is in line with our observation of most human behaviors.
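
As an illustration of the two parts, below is a minimal sketch in python on the same hypothetical claim data: a logistic regression separates the claimants from the non-claimants, and the truncated-at-zero Poisson part is fitted by maximizing its log-likelihood directly with scipy, since no off-the-shelf hurdle routine is assumed here.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.optimize import minimize
from scipy.special import gammaln

# HYPOTHETICAL CLAIM DATA: claims (COUNT OUTCOME), age AND premium (COVARIATES)
df = pd.read_csv("claims.csv")
x = sm.add_constant(df[["age", "premium"]])

# PART 1: BINOMIAL MODEL SEPARATING CLAIMANTS (Y >= 1) FROM NON-CLAIMANTS (Y = 0)
part1 = sm.Logit((df["claims"] > 0).astype(int), x).fit()

# PART 2: TRUNCATED-AT-ZERO POISSON FOR THE POSITIVE COUNTS ONLY
pos = df["claims"] > 0
y2, x2 = df.loc[pos, "claims"].values, x[pos].values

def neg_loglik(beta):
  # NEGATIVE LOG-LIKELIHOOD OF THE ZERO-TRUNCATED POISSON: -log[G(Y) / (1 - G(0))]
  mu = np.exp(x2.dot(beta))
  ll = y2 * np.log(mu) - mu - gammaln(y2 + 1) - np.log(1 - np.exp(-mu))
  return -ll.sum()

part2 = minimize(neg_loglik, np.zeros(x2.shape[1]), method = "BFGS")
print(part1.params, part2.x)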

Falling into the class of finite mixture models, Zero-Inflated Poisson (ZIP) regression (Lambert 1992) tries to accommodate excess zeros by assuming that zero counts come from two different sources, one always producing a zero outcome and the other generating both zero and nonzero outcomes. For instance, in the insured population, some never make an insurance claim (Y = 0) despite the occurrence of accidents, while the others make claims (Y ≥ 0) whenever accidents happen. The density of the ZIP model can be formulated as

F(Y|X) = ω + (1 – ω) * H(0) for Y = 0

       = (1 – ω) * H(Y) for Y > 0

where ω is the probability of an individual belonging to the always-zero group and H(.) is a count density, e.g. Poisson or Negative Binomial. It is noted that, unlike θ in the Hurdle model, ω in the ZIP model is not directly observable but is determined by the claim pattern of the insured. From the marketing perspective, the ZIP model is attractive in that the insurance company is able to differentiate the insured and therefore charge different premiums based upon their claim behaviors.
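
A minimal sketch of a ZIP fit in python is shown below, under the assumption of a reasonably recent statsmodels (0.9 or later) that ships ZeroInflatedPoisson, again on the hypothetical claim data; the inflation part models ω through a logit link.

import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

# HYPOTHETICAL CLAIM DATA: claims (COUNT OUTCOME), age AND premium (COVARIATES)
df = pd.read_csv("claims.csv")
x = sm.add_constant(df[["age", "premium"]])

# exog DRIVES THE POISSON MEAN; exog_infl DRIVES THE PROBABILITY omega OF
# BELONGING TO THE ALWAYS-ZERO GROUP THROUGH A LOGIT LINK
zip_fit = ZeroInflatedPoisson(df["claims"], x, exog_infl = x, inflation = "logit")
zip_res = zip_fit.fit(method = "bfgs")
print(zip_res.summary())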

Considered a generalization of the ZIP model, Latent Class Poisson (LCP) regression (Wedel 1993) assumes that all individuals, instead of just the ones with zero counts, are drawn from a finite mixture of Poisson distributions. Given that even a careful driver might have a chance to file a claim, it is more desirable to avoid the somewhat unrealistic dichotomization based solely upon the claim pattern and instead to distinguish the insured by latent risk factors such as lifestyle, attitude to risk, financial ability, and so on. For a population consisting of K mixing components with proportions πi, the density of LCP is given as

F(Y|X) = SUM(i = 1 to K) πi * Fi(Y|X)

where πi is the probability of an individual coming from component i, with SUM(i = 1 to K) πi = 1, and Fi(.) is the count density for component i. In the mixture distribution, the impact of each explanatory variable is allowed to differ across components to account for the heterogeneity in the population. Therefore, LCP provides the ability to differentiate the insured in a more flexible manner and a better framework for market segmentation.
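
To give a flavor of the estimation, below is a minimal sketch of an EM algorithm for a two-component Poisson mixture without covariates; a full LCP regression would further let each component mean depend on the explanatory variables, e.g. by fitting weighted Poisson regressions in the M-step. The counts are simulated here purely for demonstration.

import numpy as np
from scipy.stats import poisson

# SIMULATE COUNTS FROM A TWO-COMPONENT POISSON MIXTURE FOR DEMONSTRATION
np.random.seed(2012)
grp = np.random.rand(10000) < 0.7
y = np.where(grp, np.random.poisson(0.2, 10000), np.random.poisson(3.0, 10000))

# INITIAL GUESSES FOR THE MIXING PROPORTION AND THE TWO COMPONENT MEANS
pi1, lam1, lam2 = 0.5, 0.5, 2.0

for _ in range(100):
  # E-STEP: POSTERIOR PROBABILITY THAT EACH OBSERVATION COMES FROM COMPONENT 1
  d1 = pi1 * poisson.pmf(y, lam1)
  d2 = (1 - pi1) * poisson.pmf(y, lam2)
  r = d1 / (d1 + d2)
  # M-STEP: UPDATE THE MIXING PROPORTION AND THE COMPONENT MEANS WITH THE WEIGHTS
  pi1 = r.mean()
  lam1 = (r * y).sum() / r.sum()
  lam2 = ((1 - r) * y).sum() / (1 - r).sum()

print(pi1, lam1, lam2)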

Written by statcompute

October 19, 2012 at 12:46 am

Composite Conditional Mean and Variance Modeling in Time Series

In time series analysis, it is often necessary to model the conditional mean and the conditional variance simultaneously, which is the so-called composite modeling. For instance, while the conditional mean follows an AR(1) model, the conditional variance can follow a GARCH(1, 1) model.

In the SAS/ETS module, it is convenient to build such a composite model with the AUTOREG procedure if the conditional mean specification is as simple as the one shown below.

data garch1;
  lu = 0;
  lh = 0;
  do i = 1 to 5000;
    x = ranuni(1);
    h = 0.3 + 0.4 * lu ** 2 + 0.5 * lh;
    u = sqrt(h) * rannor(1);
    y = 1 + 3 * x + u;
    lu = u;
    lh = h;
    output;
  end;
run;

proc autoreg data = _last_;
  model y = x / garch = (p = 1, q = 1);
run;
/*
                                    Standard                 Approx
Variable        DF     Estimate        Error    t Value    Pr > |t|

Intercept        1       1.0125       0.0316      32.07      <.0001
x                1       2.9332       0.0536      54.72      <.0001
ARCH0            1       0.2886       0.0256      11.28      <.0001
ARCH1            1       0.3881       0.0239      16.22      <.0001
GARCH1           1       0.5040       0.0239      21.10      <.0001
*/

However, when the conditional mean has a more complex structure, the MODEL procedure should be used instead of AUTOREG. Below is a perfect example showing the flexibility of the MODEL procedure. In the demonstration, the conditional mean is an ARMA(1, 1) model and the conditional variance is a GARCH(1, 1) model.

data garch2;
  lu = 0;
  lh = 0;
  ly = 0;
  do i = 1 to 5000;
    x = ranuni(1);
    h = 0.3 + 0.4 * lu ** 2 + 0.5 * lh;
    u = sqrt(h) * rannor(1);
    y = 1 + 3 * x + 0.6 * (ly - 1) + u - 0.7 * lu;
    lu = u;
    lh = h;
    ly = y;
    output;
  end;
run;

proc model data = _last_;
  parms mu x_beta ar1 ma1 arch0 arch1 garch1;
  y = mu + x_beta * x + ar1 * zlag1(y - mu) + ma1 * zlag1(resid.y);
  h.y = arch0 + arch1 * xlag(resid.y ** 2, mse.y) +
        garch1 * xlag(h.y, mse.y);
  fit y / method = marquardt fiml;
run;
/*
                              Approx                  Approx
Parameter       Estimate     Std Err    t Value     Pr > |t|

mu              0.953905      0.0673      14.18       <.0001
x_beta           2.92509      0.0485      60.30       <.0001
ar1             0.613025     0.00819      74.89       <.0001
ma1             0.700154      0.0126      55.49       <.0001
arch0           0.288948      0.0257      11.26       <.0001
arch1           0.387436      0.0238      16.28       <.0001
garch1          0.504588      0.0237      21.26       <.0001
*/

Written by statcompute

October 15, 2012 at 6:27 pm

Posted in Econometrics, SAS, Statistical Models


Download Stock Price Online with R

library(chron)
library(zoo)

# STOCK TICKER OF Fifth Third Bancorp 
stock <- 'FITB'

# DEFINE STARTING DATE
start.date  <- 1
start.month <- 1
start.year  <- 2012

# DEFINE ENDING DATE
end.date  <- 11
end.month <- 10
end.year  <- 2012

# DEFINE URL LINK
link <- paste("http://ichart.finance.yahoo.com/table.csv?s=", stock,
              "&a=", as.character(start.month - 1), 
              "&b=", as.character(start.date), 
              "&c=", as.character(start.year),
              "&d=", as.character(end.month - 1),
              "&e=", as.character(end.date), 
              "&f=", as.character(end.year),
              "&g=d&ignore=.csv", sep = '')

# DOWNLOAD STOCK PRICE AS CSV FILE
download.file(link, "c:/projects/data.csv")

# READ THE CSV FILE INTO R
data <- read.csv("c:/projects/data.csv")

# CONVERT CHARACTER INTO DATE
dt <- dates(as.character(data[, 1]), format = "y-m-d")

# CONVERT DATA FRAME INTO TS OBJECT
ts <- zoo(data[, 2:5], dt)

# CREATE A PLOT FOR OPEN/CLOSE/HIGH/LOW PRICES
plot(ts, main = stock)

Written by statcompute

October 11, 2012 at 10:46 pm

Posted in S+/R
