Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Archive for October 2013

Faster Random Sampling with Replacement in SAS

Most SAS users like to use SURVEYSELECT procedures to do the random sampling. However, when it comes to the big dataset, the efficiency of SURVEYSELECT procedure seems pretty low. As a result, I normally like to use data step to do the sampling.

While the simple random sample without replacement is trivial and can be easily accomplished by generating a random number with the uniform distribution, the random sample with replacement doesn’t seem straightforward with the data step. In the demo below, I will show how to do sampling with replacement by both SURVEYSELECT and data step and compare their efficiencies.

First of all, I will artificially simulate a data set with 10 million rows.

data one;
  do i = 1 to 10000000;
  output;
  end;
run;

Secondly, I will wrap SURVEYSELECT procedure into a macro to do sampling with replacement. with this method, it took more than 20 seconds CPU time to get the work done even after subtracting ~1 second simulation time.

%macro urs1(indata = , outdata = );
options mprint;

proc sql noprint;
  select put(count(*), 10.) into :n from &indata;
quit;

proc surveyselect data = &indata out = &outdata n = &n method = urs seed = 2013;
run;

proc freq data = &outdata;
  tables numberhits;
run;

%mend urs1;

%urs1(indata = one, outdata = two);
/*
      real time           30.32 seconds
      cpu time            22.54 seconds

                                       Cumulative    Cumulative
NumberHits    Frequency     Percent     Frequency      Percent
---------------------------------------------------------------
         1     3686249       58.25       3686249        58.25
         2     1843585       29.13       5529834        87.38
         3      611396        9.66       6141230        97.04
         4      151910        2.40       6293140        99.44
         5       30159        0.48       6323299        99.91
         6        4763        0.08       6328062        99.99
         7         641        0.01       6328703       100.00
         8          98        0.00       6328801       100.00
         9          11        0.00       6328812       100.00
        10           1        0.00       6328813       100.00
*/

At last, let’s take a look at how to accomplish the same task with a simple data step. The real trick here is to understand the statistical nature of a Poisson distribution. As shown below, while delivering a very similar result, this approach only consumes roughly a quarter of the CPU time. This efficiency gain would be particularly more attractive when we need to apply complex machine learning algorithms, e.g. bagging, to big data problems.

%macro urs2(indata = , outdata = );
options mprint;

data &outdata;
  set &indata;
  numberhits = ranpoi(2013, 1);
  if numberhits > 0 then output;
run;

proc freq data = &outdata;
  tables numberhits;
run;

%mend urs2;

%urs2(indata = one, outdata = two);
/*
      real time           13.42 seconds
      cpu time            6.60 seconds

                                       Cumulative    Cumulative
numberhits    Frequency     Percent     Frequency      Percent
---------------------------------------------------------------
         1     3677134       58.18       3677134        58.18
         2     1840742       29.13       5517876        87.31
         3      612487        9.69       6130363        97.00
         4      152895        2.42       6283258        99.42
         5       30643        0.48       6313901        99.90
         6        5180        0.08       6319081        99.99
         7         732        0.01       6319813       100.00
         8          92        0.00       6319905       100.00
         9          12        0.00       6319917       100.00
        10           2        0.00       6319919       100.00
*/

Written by statcompute

October 18, 2013 at 10:13 am

rPython – R Interface to Python

> library(rPython)
Loading required package: RJSONIO
> ### load r data.frame ###
> data(iris)
> r_df1 <- iris
> class(r_df1)
[1] "data.frame"
> head(r_df1, n = 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> ### pass r data.frame to python dict ###
> python.assign('py_dict1', r_df1)
> python.exec('print type(py_dict1)')
<type 'dict'>
> ### convert python dict to pandas DataFrame ###
> python.exec('import pandas as pd')
> python.exec('py_df = pd.DataFrame(py_dict1)')
> python.method.call('py_df', 'info')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 5 columns):
Petal.Length    150  non-null values
Petal.Width     150  non-null values
Sepal.Length    150  non-null values
Sepal.Width     150  non-null values
Species         150  non-null values
dtypes: float64(4), object(1)NULL
> python.exec('print py_df.head(3)')
   Petal.Length  Petal.Width  Sepal.Length  Sepal.Width Species
0           1.4          0.2           5.1          3.5  setosa
1           1.4          0.2           4.9          3.0  setosa
2           1.3          0.2           4.7          3.2  setosa
> ### convert pandas DataFrame back to dict ###
> python.exec('py_dict2 = py_df.to_dict(outtype = "list")') 
> ### pass python dict back to r list ###
> r_list <- python.get('py_dict2')
> class(r_list)
[1] "list"
> ### convert r list to r data.frame ###
> r_df2 <- data.frame(r_list)
> class(r_df2)
[1] "data.frame"
> head(r_df2, n = 3)
  Petal.Length Sepal.Length Petal.Width Sepal.Width Species
1          1.4          5.1         0.2         3.5  setosa
2          1.4          4.9         0.2         3.0  setosa
3          1.3          4.7         0.2         3.2  setosa

Written by statcompute

October 13, 2013 at 10:31 pm

Posted in PYTHON, S+/R

Calculating Marginal Effects in Zero-Inflated Beta Model

libname data 'c:\projects\sgf14\data';

ods output parameterestimates = _parms;
proc nlmixed data = data.full tech = trureg alpha = 0.01;
  parms a0 = 0  a1 = 0  a2 = 0  a3 = 0  a4 = 0  a5 = 0  a6 = 0  a7 = 0
        b0 = 0  b1 = 0  b2 = 0  b3 = 0  b4 = 0  b5 = 0  b6 = 0  b7 = 0 
        c0 = 1  c1 = 0  c2 = 0  c3 = 0  c4 = 0  c5 = 0  c6 = 0  c7 = 0;
  xa = a0 + a1 * x1 + a2 * x2 + a3 * x3 + a4 * x4 + a5 * x5 + a6 * x6 + a7 * x7;
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5 + b6 * x6 + b7 * x7;
  xc = c0 + c1 * x1 + c2 * x2 + c3 * x3 + c4 * x4 + c5 * x5 + c6 * x6 + c7 * x7;
  mu_xa = 1 / (1 + exp(-xa));
  mu_xb = 1 / (1 + exp(-xb));
  phi = exp(xc);
  w = mu_xb * phi;
  t = (1 - mu_xb) * phi;
  if y = 0 then lh = 1 - mu_xa;
  else lh = mu_xa * (gamma(w + t) / (gamma(w) * gamma(t)) * (y ** (w - 1)) * ((1 - y) ** (t - 1)));
  ll = log(lh);
  model y ~ general(ll);
  *** calculate components for marginal effects ***;
  _mfx_a = exp(xa) / ((1 + exp(xa)) ** 2);
  _mfx_b = exp(xb) / ((1 + exp(xb)) ** 2);
  _p_a = 1 / (1 + exp(-xa));
  _p_b = 1 / (1 + exp(-xb));
  predict _mfx_a out = _marg1 (rename = (pred = _mfx_a) keep = id pred y);
  predict _p_a   out = _marg2 (rename = (pred = _p_a)   keep = id pred);
  predict _mfx_b out = _marg3 (rename = (pred = _mfx_b) keep = id pred);
  predict _p_b   out = _marg4 (rename = (pred = _p_b)   keep = id pred);
run;

data _null_;
  set _parms;
  call symput(parameter, put(estimate, 20.15));
run;

data _marg5;
  merge _marg1 _marg2 _marg3 _marg4;
  by id;

  _marg_x1 = _mfx_a * &a1 * _p_b + _mfx_b * &b1 * _p_a;
  _marg_x2 = _mfx_a * &a2 * _p_b + _mfx_b * &b2 * _p_a;
  _marg_x3 = _mfx_a * &a3 * _p_b + _mfx_b * &b3 * _p_a;
  _marg_x4 = _mfx_a * &a4 * _p_b + _mfx_b * &b4 * _p_a;
  _marg_x5 = _mfx_a * &a5 * _p_b + _mfx_b * &b5 * _p_a;
  _marg_x6 = _mfx_a * &a6 * _p_b + _mfx_b * &b6 * _p_a;
  _marg_x7 = _mfx_a * &a7 * _p_b + _mfx_b * &b7 * _p_a;
run;

proc means data = _marg5 mean;
  var _marg_x:;
run;
/*
Variable            Mean
------------------------
_marg_x1      -0.0037445
_marg_x2       0.0783118
_marg_x3       0.0261884
_marg_x4      -0.3105482
_marg_x5     0.000156693
_marg_x6    -0.000430756
_marg_x7      -0.0977589
------------------------
*/

Written by statcompute

October 12, 2013 at 4:17 pm