Archive for October 2012
In the code below, I tried to do the data aggregation by groups with three different methods with python and compared their efficiency. It is also my first time to use ipython, which is an awesome testing environment.
It turns out that pandas package is the fastest and sqlite3 through in-memory database is the slowest.
In : from random import * In : from pandas import * In : from sqlite3 import * In : from datetime import datetime, timedelta In : # SIMULATE A LIST WITH 1M RECORDS In : list = [[randint(1, 10), x] for x in xrange(1, 1000001)] In : # CALCULATE THE SUM BY GENERIC CODING In : start1 = datetime.now() In : summ1 =  In : for i in set(x for x in list): ....: total = sum([x for x in list if x == i]) ....: summ1.append([i, total]) ....: In : end1 = datetime.now() In : print str(end1 - start1) 0:00:04.703000 In : summ1 Out: [[1, 49899161104L], [2, 50089982753L], [3, 50289189719L], [4, 50152366638L], [5, 49876070739L], [6, 50024578837L], [7, 50024658776L], [8, 50082058403L], [9, 49665851019L], [10, 49896582012L]] In : # CALCULATE THE SUM BY DATAFRAME In : start2 = datetime.now() In : df = DataFrame(list, columns = ["x", "y"]) In : summ2 = df.groupby('x', as_index = False).sum() In : end2 = datetime.now() In : print str(end2 - start2) 0:00:01.500000 In : summ2 Out: x y 0 1 49899161104 1 2 50089982753 2 3 50289189719 3 4 50152366638 4 5 49876070739 5 6 50024578837 6 7 50024658776 7 8 50082058403 8 9 49665851019 9 10 49896582012 In : # CALCULATE THE SUM BY SQLITE In : start3 = datetime.now() In : con = connect(":memory:") In : with con: ....: cur = con.cursor() ....: cur.execute("create table tbl (x real, y real)") ....: cur.executemany("insert into tbl values(?, ?)", list) ....: cur.execute("select x, sum(y) from tbl group by x") ....: summ3 = cur.fetchall() ....: cur.close() ....: In : con.close Out: <function close> In : end3 = datetime.now() In : print str(end3 - start3) 0:00:22.016000 In : summ3 Out: [(1.0, 49899161104.0), (2.0, 50089982753.0), (3.0, 50289189719.0), (4.0, 50152366638.0), (5.0, 49876070739.0), (6.0, 50024578837.0), (7.0, 50024658776.0), (8.0, 50082058403.0), (9.0, 49665851019.0), (10.0, 49896582012.0)]
In-memory database is a powerful functionality in SQLite. Because the whole in-memory database is saved in memory instead of hard drive, it allows a faster access to the data. When used properly with python, it can become a handy tool to efficiently accomplish complex operations in data munging that can’t be easily done with generic python codes.
In the demonstrate below, I will show how to input lists into tables in a in-memory database, merge 2 tables, and then fetch data from the resulted table.
acct = [['id1', '123-456-7890'], ['id2', '234-567-8901'], ['id3', '999-999-9999']] hist = [['id1', 'deposited', 100], ['id1', 'withdraw', 40], ['id2', 'withdraw', 15], ['id1', 'withdraw', 25], ['id2', 'deposited', 30], ['id4', 'deposited', 30]] import sqlite3 con = sqlite3.connect(":memory:") with con: cur = con.cursor() cur.execute("create table tbl_acct (id text, phone text)") cur.executemany("insert into tbl_acct values(?, ?)", acct) cur.execute("create table tbl_hist (id text, trans text, amount real)") cur.executemany("insert into tbl_hist values(?, ?, ?)", hist) cur.execute("drop table if exists tbl_join") cur.execute("create table tbl_join as select a.*, b.phone from tbl_hist as a \ inner join tbl_acct as b on a.id = b.id order by a.id") cur.execute("select * from tbl_join") for i in cur: print i cur.close() con.close # output: #(u'id1', u'deposited', 100.0, u'123-456-7890') #(u'id1', u'withdraw', 40.0, u'123-456-7890') #(u'id1', u'withdraw', 25.0, u'123-456-7890') #(u'id2', u'withdraw', 15.0, u'234-567-8901') #(u'id2', u'deposited', 30.0, u'234-567-8901')
Being a long-time user of SAS and R, I feel very comfortable with rectangle-shaped data tables and database-style operations in data munging. However, when I started diving into python and tried to pick it up as a data analysis tool, it took me a while to get used to the “python style”.
For instance, there are two data tables, one account table with 1 record per id and one transaction history table with multiple records per id.
acct = [['id1', '123-456-7890'], ['id2', '234-567-8901'], ['id3', '999-999-9999']] hist = [['id1', 'deposited', 100], ['id1', 'withdraw', 40], ['id2', 'withdraw', 15], ['id1', 'withdraw', 25], ['id2', 'deposited', 30], ['id4', 'deposited', 30]]
If I’d like to do an inner join between these two tables, the generic coding logic might look like below, which is not be very intuitive (to me at least).
join =  for h in hist: for a in acct: if h == a: list = [x for x in h] list.append(a) join.append(list) for line in join: print line # output: #['id1', 'deposited', 100, '123-456-7890'] #['id1', 'withdraw', 40, '123-456-7890'] #['id2', 'withdraw', 15, '234-567-8901'] #['id1', 'withdraw', 25, '123-456-7890'] #['id2', 'deposited', 30, '234-567-8901']
Pandas package provides a very flexible column-oriented data structure DataFrame, which makes data munging operations extremely easy.
from pandas import * df_acct = DataFrame(acct, columns = ['id', 'phone']) df_hist = DataFrame(hist, columns = ['id', 'trans', 'amount']) join2 = merge(df_acct, df_hist, on = 'id', how = 'inner') print join2 # output # id phone trans amount #0 id1 123-456-7890 deposited 100 #1 id1 123-456-7890 withdraw 40 #2 id1 123-456-7890 withdraw 25 #3 id2 234-567-8901 withdraw 15 #4 id2 234-567-8901 deposited 30
A standard practice to evaluate the effect of a marketing campaign is to divide the targeted prospects into two testing groups, a control group without the marketing intervention and the other with the intervention, and then to compare the difference of results between two groups given observed characteristics. The sole purpose is to see if the intervention is the causal effect of results that we are interested in, e.g. response or conversion rate. From the statistical perspective, it is desirable to randomly assign targeted individuals into one of the testing groups such that the background information of individuals in different groups are comparable or balancing and the assignment of individuals to groups is independent of outcomes of the intervention. This practice is also called Randomization. However, in many marketing campaigns that we observed, the randomization mentioned above is prohibitively difficult due to various constraints in the real world. An example is that the assignment of an individual to the specific group might be somehow determined by his background information, which might be related to his response to the campaign. As a result, this characteristic heterogeneity of targeted individuals between different testing groups will give a biased estimation for the campaign effect, which is also called selection bias. A typical observation of such bias is that the campaign effect looks more optimistic than it is supposed to be.
The selection bias is a common issue in many observational studies in social science. While different methods have been purposed to adjust or correct this bias, we’d like to demonstrate two modeling techniques to correct the selection bias in the marketing campaign, namely propensity score method and Heckman selection method.
Introduced by Rosenbaum and Rubin (1983), propensity score can be defined as the conditional probability of an individual receiving a specific exposure (treatment or campaign) given a certain observed background information. The idea of propensity score is to partial out the observed characteristic heterogeneity of individuals between testing groups so as to make the assignment of groups conditionally independent of the campaign outcome. While there are numerous ways, such as matching or weighting, to implement propensity score, we are particularly interested in the one proposed by Cela (2003) due to its simplicity and flexibility. The implementation of Cela’s method is straight-forward and considered a two-step approach.
First of all, we build a logistic regression using the assignment of groups as the response variable and then estimate the probability of an individual assigned to the group with marketing intervention given a set of background variables. The propensity score model can be formulated as:
Prob(d = 1 | x) = p = EXP(T * x) / [1 + EXP(T * x)]
where d = 1 for an individual assigned to the intervention group, p is the propensity score, and x is the covariate matrix of background information. In this step, while we propose using a logistic regression, other models designed for binary outcome, such as probit model or classification tree, might also work well.
Secondly, we build another logistic regression to estimate the casual effect of marketing campaign such that
Prob(y = 1 | z, d, p) = EXP(A * z + B * d) / [1 + EXP(A * z + B * d + C * p)]
where y = 1 for the individual with a positive response and z is the covariate matrix of control factors and background information. In this model, the propensity score acts as a control factor such that the outcome y is independent of the group assignment d after partialling out the observed characteristic heterogeneity incorporated in the propensity score p. In the formulation of our second model, the parameter C is the estimated causal effect of marketing intervention that we are interested. While we are using logistic regression for the binary outcome in our second model, any model within the framework of Generalized Linear Models (GLM) should be applicable for the appropriate response variable.
While the propensity score method is able to adjust the selection bias due to observed characteristics, it is under the criticism that it fails to address the bias arised from the unobservable. Originated by Heckman (1976), Heckman selection method is able to control the bias due to the unobservable, if the assumption about the underlying distribution is valid. In the framework of Heckman model, it is assumed that there are 2 simultaneous processes existed in the model, one called Selection Equation and the other Outcome Equation, of which the error terms follow a Bivariate Normal Distribution with the nonzero correlation. While two equations can be modeled simultaneously, we prefer a 2-stage estimation method due to its simplicity and its advantage in algorithm convergence.
Similar to the propensity score method, we fit a probit model to estimate the probability of an individual assigned to the group with marketing intervention.
Prob(d = 1 | x) = G(T * x)
where G(.) is the cumulative density functin of the Normal Distribution. Please note that the use of probit model is determined by the assumption of Heckman model for the Normal Distribution. Based upon the result from this probit model, we calculate Inverse Mills Ratio (IMR) for each individual, which is considered a measure of selection bias and can be formulated as
IMR = g(T * x) / G(T * x)
where g(.) is the probability density function and x is the covariate matrix in Selection Equation.
We build a second probit model by including IMR as one of the predictors, which can be formulated as
Prob(y = 1 | z, d, IMR) = G(A * z + B * d + C * IMR)
where y = 1 for the individual with a positive response and z is the covariate matrix of control factors and background information. While the significance of parameter C indicates the existence of selection bias, the lack of such significance doesn’t necessarily imply that there is no bias. The consistency of estimates in Heckman model strongly replies on the distributional assumption, which is difficult to justify in the real-world data.
In the insurance industry, Poisson regression (PR) has been widely used to model insurance claims. In the real-world auto insurance data, more than 90% of the insured reported zero claim, which might bring difficulties in two aspects if PR is employed. First of all, the observed number of zero count outcome is more than the predicted one by PR. Secondly, the variance might exceed the mean in the observed outcome, a violation of equi-dispersion assumption in PR. In order to improve the model performance, alternatives to PR should be considered.
As the most common alternative to PR, Negative Binomial (NB) regression addresses the issue of over-dispersion by including a random component with Gamma distribution. In NB, the variance can be expressed as the mean plus a non-negative term such that
V(Y|X) = mu + mu^2 / theta >= mu = E(Y|X)
Therefore, NB has a more flexible variance structure than PR does. While NB is able to capture the heterogeneity in the count outcome, it is under the criticism of failing to provide a satisfactory explanation on excess zeros.
Known as the two-part model, Hurdle regression (Mullahy 1986) provides a more appealing interpretation for the count outcome with excess zeros. While the first part separates the potential insurance claimants (Y ≥ 1) from the non-claimants (Y = 0) with a Binomial model, the second part models the number of claims made by those potential claimants (Y ≥ 1) with a truncated-at-zero Poisson model. Thus, the density of Hurdle model can be expressed as
F(Y|X) = θ for Y = 0
(1 – θ) * G(Y) / (1 – G(Y)) for Y > 0
θ is the probability of non-claimants and G(.) is a count density, e.g. Poisson or Negative Binomial. The major motivation of Hurdle regression in the auto insurance content is that the insured individual tends to behave differently once he/she makes the first claim, which is in line with our observation on most human behaviors.
Fallen into the class of finite mixture model, Zero-Inflated Poisson (ZIP) regression (Lambert 1992) tries to accommodate excess zeros by assuming zero count outcome coming from 2 different sources, one always with zero outcome and the other generating both zero and nonzero outcome. For instance, in the insured population, some never make insurance claim (Y = 0) despite the occurrence of accidents, while the other make claims (Y ≥ 0) whenever accidents happen. The density of ZIP model can be formulated as
F(Y|X) = ω + (1 – ω) * H(0) for Y = 0
(1 – ω) * H(Y) for Y > 0
ω is the probability of an individual belonging to the always-zero group and H(.) is a count density, e.g. Poisson or Negative Binomial. It is noted that, unlike θ in Hurdle model, ω in ZIP model is not directly observable but determined by the claim pattern of the insured. From the marketing perspective, ZIP model is attractive in that the insurance company is able to differentiate the insured and therefore to charge different premiums based upon their claim behaviors.
Considered a generalization of ZIP model, Latent Class Poisson (LCP) regression (Wedel 1993) assumes that all individuals instead of just the ones with zero counts are drawn from a finite mixture of Poisson distributions. Given the fact that even a careful driver might also have chance to file the claim, it is more desirable to avoid the somehow unrealistic dichotomization solely based upon the claim pattern but to distinguish the insured by latent risk factors such as lifestyle, attitude to risk, financial ability, and so on. For a population consisting of K mixing components with the proportion π, the density of LCP is given as
F(Y|X) = SUM πi * Fi(Y|X)
πi is the probability of an individual coming from component i with sum(πi = 1 to K) = 1 and Fi(.) is the count density for component i. In the mixture distribution, impacts of each explanatory variable are allowed to differ across different components to account for the heterogeneity in the population. Therefore, LCP provides an ability to differentiate the insured in a more flexible manner and a better framework for the market segmentation.
In time series analysis, it is often necessary to model both conditional mean and conditional variance simultaneously, which is so-called composite modeling. For instance, while the conditional mean is an AR(1) model, the conditional variance can be a GARCH(1, 1) model.
In SAS/ETS module, it is convenient to build such composite models with AUTOREG procedure if the conditional mean specification is as simple as shown below.
data garch1; lu = 0; lh = 0; do i = 1 to 5000; x = ranuni(1); h = 0.3 + 0.4 * lu ** 2 + 0.5 * lh; u = sqrt(h) * rannor(1); y = 1 + 3 * x + u; lu = u; lh = h; output; end; run; proc autoreg data = _last_; model y = x / garch = (p = 1, q = 1); run; /* Standard Approx Variable DF Estimate Error t Value Pr > |t| Intercept 1 1.0125 0.0316 32.07 <.0001 x 1 2.9332 0.0536 54.72 <.0001 ARCH0 1 0.2886 0.0256 11.28 <.0001 ARCH1 1 0.3881 0.0239 16.22 <.0001 GARCH1 1 0.5040 0.0239 21.10 <.0001 */
However, when the conditional mean has a more complex structure, then MODEL instead of AUTOREG procedure should be used. Below is an perfect example showing the flexibility of MODEL procedure. In the demonstration, the conditional mean is an ARMA(1, 1) model and the conditional variance is a GARCH(1, 1) model.
data garch2; lu = 0; lh = 0; ly = 0; do i = 1 to 5000; x = ranuni(1); h = 0.3 + 0.4 * lu ** 2 + 0.5 * lh; u = sqrt(h) * rannor(1); y = 1 + 3 * x + 0.6 * (ly - 1) + u - 0.7 * lu; lu = u; lh = h; ly = y; output; end; run; proc model data = _last_; parms mu x_beta ar1 ma1 arch0 arch1 garch1; y = mu + x_beta * x + ar1 * zlag1(y - mu) + ma1 * zlag1(resid.y); h.y = arch0 + arch1 * xlag(resid.y ** 2, mse.y) + garch1 * xlag(h.y, mse.y); fit y / method = marquardt fiml; run; /* Approx Approx Parameter Estimate Std Err t Value Pr > |t| mu 0.953905 0.0673 14.18 <.0001 x_beta 2.92509 0.0485 60.30 <.0001 ar1 0.613025 0.00819 74.89 <.0001 ma1 0.700154 0.0126 55.49 <.0001 arch0 0.288948 0.0257 11.26 <.0001 arch1 0.387436 0.0238 16.28 <.0001 garch1 0.504588 0.0237 21.26 <.0001 */
library(chron) library(zoo) # STOCK TICKER OF Fifth Third Bancorp stock <- 'FITB' # DEFINE STARTING DATE start.date <- 1 start.month <- 1 start.year <- 2012 # DEFINE ENDING DATE end.date <- 11 end.month <- 10 end.year <- 2012 # DEFINE URL LINK link <- paste("http://ichart.finance.yahoo.com/table.csv?s=", stock, "&a=", as.character(start.month - 1), "&b=", as.character(start.date), "&c=", as.character(start.year), "&d=", as.character(end.month - 1), "&e=", as.character(end.date), "&f=", as.character(end.year), "&g=d&ignore=.csv", sep = '') # DOWNLOAD STOCK PRICE AS CSV FILE download.file(link, "c:/projects/data.csv") # READ THE CSV FILE INTO R data <- read.csv("c:/projects/data.csv") # CONVERT CHARACTER INTO DATE dt <- dates(as.character(data[, 1]), format = "y-m-d") # CONVERT DATA FRAME INTO TS OBJECT ts <- zoo(data[, 2:5], dt) # CREATE A PLOT FOR OPEN/CLOSE/HIGH/LOW PRICES plot(ts, main = stock)