## Archive for **October 2013**

## Faster Random Sampling with Replacement in SAS

Most SAS users like to use the SURVEYSELECT procedure for random sampling. However, when it comes to big data sets, the efficiency of the SURVEYSELECT procedure seems pretty low. As a result, I normally prefer to do the sampling in a data step.

While a simple random sample without replacement is trivial and can easily be drawn by generating random numbers from the uniform distribution, a random sample with replacement doesn't seem as straightforward in a data step. In the demo below, I will show how to do sampling with replacement both with SURVEYSELECT and with a data step, and compare their efficiencies.

First of all, I will simulate an artificial data set with 10 million rows.

```sas
data one;
  do i = 1 to 10000000;
    output;
  end;
run;
```

Secondly, I will wrap the SURVEYSELECT procedure in a macro to do sampling with replacement. With this method, it took more than 20 seconds of CPU time to get the work done, even after subtracting the ~1 second of simulation time.

```sas
%macro urs1(indata = , outdata = );
  options mprint;

  proc sql noprint;
    select put(count(*), 10.) into :n from &indata;
  quit;

  proc surveyselect data = &indata out = &outdata n = &n method = urs seed = 2013;
  run;

  proc freq data = &outdata;
    tables numberhits;
  run;
%mend urs1;

%urs1(indata = one, outdata = two);

/*
real time           30.32 seconds
cpu time            22.54 seconds

                                    Cumulative    Cumulative
NumberHits    Frequency   Percent    Frequency       Percent
-------------------------------------------------------------
         1     3686249     58.25      3686249         58.25
         2     1843585     29.13      5529834         87.38
         3      611396      9.66      6141230         97.04
         4      151910      2.40      6293140         99.44
         5       30159      0.48      6323299         99.91
         6        4763      0.08      6328062         99.99
         7         641      0.01      6328703        100.00
         8          98      0.00      6328801        100.00
         9          11      0.00      6328812        100.00
        10           1      0.00      6328813        100.00
*/
```

Finally, let's take a look at how to accomplish the same task with a simple data step. The real trick here is to understand the statistical nature of the Poisson distribution: when drawing n items with replacement from n rows, the number of times a given row is selected is Binomial(n, 1/n), which for large n is approximately Poisson with mean 1. As shown below, while delivering a very similar result, this approach consumes only roughly a quarter of the CPU time. The efficiency gain would be particularly attractive when we need to apply complex machine learning algorithms, e.g. bagging, to big data problems.

```sas
%macro urs2(indata = , outdata = );
  options mprint;

  data &outdata;
    set &indata;
    numberhits = ranpoi(2013, 1);
    if numberhits > 0 then output;
  run;

  proc freq data = &outdata;
    tables numberhits;
  run;
%mend urs2;

%urs2(indata = one, outdata = two);

/*
real time           13.42 seconds
cpu time             6.60 seconds

                                    Cumulative    Cumulative
numberhits    Frequency   Percent    Frequency       Percent
-------------------------------------------------------------
         1     3677134     58.18      3677134         58.18
         2     1840742     29.13      5517876         87.31
         3      612487      9.69      6130363         97.00
         4      152895      2.42      6283258         99.42
         5       30643      0.48      6313901         99.90
         6        5180      0.08      6319081         99.99
         7         732      0.01      6319813        100.00
         8          92      0.00      6319905        100.00
         9          12      0.00      6319917        100.00
        10           2      0.00      6319919        100.00
*/
```
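The Poisson trick is not specific to SAS. Below is a minimal Python sketch (standard library only; the sample size is illustrative) that confirms the reasoning: drawing n items with replacement from n rows leaves each row selected a Binomial(n, 1/n) number of times, so the fraction of rows never hit should approach exp(-1) ≈ 0.368, matching the ~63% of rows kept in the SAS output above.

```python
import math
import random
from collections import Counter

random.seed(2013)

n = 100_000
# Bootstrap resample: draw n row indices with replacement from n rows.
draws = [random.randrange(n) for _ in range(n)]
hits = Counter(draws)  # hits[i] = number of times row i was selected

# Fraction of rows never selected; Poisson(1) predicts exp(-1) ~ 0.368.
zero_frac = 1 - len(hits) / n
print(round(zero_frac, 3), round(math.exp(-1), 3))
```

This is why replacing the explicit resampling with one Poisson(1) draw per row, as the data step does with ranpoi(), reproduces the bootstrap's NumberHits distribution so closely.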

## rPython – R Interface to Python

```r
> library(rPython)
Loading required package: RJSONIO

> ### load r data.frame ###
> data(iris)
> r_df1 <- iris
> class(r_df1)
[1] "data.frame"
> head(r_df1, n = 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

> ### pass r data.frame to python dict ###
> python.assign('py_dict1', r_df1)
> python.exec('print type(py_dict1)')
<type 'dict'>

> ### convert python dict to pandas DataFrame ###
> python.exec('import pandas as pd')
> python.exec('py_df = pd.DataFrame(py_dict1)')
> python.method.call('py_df', 'info')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 5 columns):
Petal.Length    150  non-null values
Petal.Width     150  non-null values
Sepal.Length    150  non-null values
Sepal.Width     150  non-null values
Species         150  non-null values
dtypes: float64(4), object(1)
NULL
> python.exec('print py_df.head(3)')
   Petal.Length  Petal.Width  Sepal.Length  Sepal.Width Species
0           1.4          0.2           5.1          3.5  setosa
1           1.4          0.2           4.9          3.0  setosa
2           1.3          0.2           4.7          3.2  setosa

> ### convert pandas DataFrame back to dict ###
> python.exec('py_dict2 = py_df.to_dict(outtype = "list")')

> ### pass python dict back to r list ###
> r_list <- python.get('py_dict2')
> class(r_list)
[1] "list"

> ### convert r list to r data.frame ###
> r_df2 <- data.frame(r_list)
> class(r_df2)
[1] "data.frame"
> head(r_df2, n = 3)
  Petal.Length Sepal.Length Petal.Width Sepal.Width Species
1          1.4          5.1         0.2         3.5  setosa
2          1.4          4.9         0.2         3.0  setosa
3          1.3          4.7         0.2         3.2  setosa
```
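As the RJSONIO dependency in the transcript suggests, rPython moves data across the R/Python boundary as JSON, so an R data.frame arrives on the Python side as a dict of column lists. Here is a hedged, standard-library-only sketch of that representation (py_dict1 hand-copies the first three rows of iris; the JSON round trip stands in for what python.assign()/python.get() transport):

```python
import json

# An R data.frame serialized column-wise, as rPython's JSON transport
# delivers it to the Python session.
py_dict1 = {
    "Sepal.Length": [5.1, 4.9, 4.7],
    "Sepal.Width":  [3.5, 3.0, 3.2],
    "Petal.Length": [1.4, 1.4, 1.3],
    "Petal.Width":  [0.2, 0.2, 0.2],
    "Species":      ["setosa", "setosa", "setosa"],
}

# Round-trip through JSON text, losing no structure.
payload = json.dumps(py_dict1)
py_dict2 = json.loads(payload)
print(py_dict2 == py_dict1)
```

This column-list layout is exactly why pd.DataFrame(py_dict1) works directly in the session above, and why to_dict(outtype = "list") is the natural inverse before handing the data back to R.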

## Calculating Marginal Effects in Zero-Inflated Beta Model

```sas
libname data 'c:\projects\sgf14\data';

ods output parameterestimates = _parms;

proc nlmixed data = data.full tech = trureg alpha = 0.01;
  parms a0 = 0 a1 = 0 a2 = 0 a3 = 0 a4 = 0 a5 = 0 a6 = 0 a7 = 0
        b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0
        c0 = 1 c1 = 0 c2 = 0 c3 = 0 c4 = 0 c5 = 0 c6 = 0 c7 = 0;
  xa = a0 + a1 * x1 + a2 * x2 + a3 * x3 + a4 * x4 + a5 * x5 + a6 * x6 + a7 * x7;
  xb = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5 + b6 * x6 + b7 * x7;
  xc = c0 + c1 * x1 + c2 * x2 + c3 * x3 + c4 * x4 + c5 * x5 + c6 * x6 + c7 * x7;
  mu_xa = 1 / (1 + exp(-xa));
  mu_xb = 1 / (1 + exp(-xb));
  phi = exp(xc);
  w = mu_xb * phi;
  t = (1 - mu_xb) * phi;
  if y = 0 then lh = 1 - mu_xa;
  else lh = mu_xa * (gamma(w + t) / (gamma(w) * gamma(t)) *
            (y ** (w - 1)) * ((1 - y) ** (t - 1)));
  ll = log(lh);
  model y ~ general(ll);
  *** calculate components for marginal effects ***;
  _mfx_a = exp(xa) / ((1 + exp(xa)) ** 2);
  _mfx_b = exp(xb) / ((1 + exp(xb)) ** 2);
  _p_a = 1 / (1 + exp(-xa));
  _p_b = 1 / (1 + exp(-xb));
  predict _mfx_a out = _marg1 (rename = (pred = _mfx_a) keep = id pred y);
  predict _p_a   out = _marg2 (rename = (pred = _p_a)   keep = id pred);
  predict _mfx_b out = _marg3 (rename = (pred = _mfx_b) keep = id pred);
  predict _p_b   out = _marg4 (rename = (pred = _p_b)   keep = id pred);
run;

data _null_;
  set _parms;
  call symput(parameter, put(estimate, 20.15));
run;

data _marg5;
  merge _marg1 _marg2 _marg3 _marg4;
  by id;
  _marg_x1 = _mfx_a * &a1 * _p_b + _mfx_b * &b1 * _p_a;
  _marg_x2 = _mfx_a * &a2 * _p_b + _mfx_b * &b2 * _p_a;
  _marg_x3 = _mfx_a * &a3 * _p_b + _mfx_b * &b3 * _p_a;
  _marg_x4 = _mfx_a * &a4 * _p_b + _mfx_b * &b4 * _p_a;
  _marg_x5 = _mfx_a * &a5 * _p_b + _mfx_b * &b5 * _p_a;
  _marg_x6 = _mfx_a * &a6 * _p_b + _mfx_b * &b6 * _p_a;
  _marg_x7 = _mfx_a * &a7 * _p_b + _mfx_b * &b7 * _p_a;
run;

proc means data = _marg5 mean;
  var _marg_x:;
run;

/*
Variable             Mean
-------------------------
_marg_x1       -0.0037445
_marg_x2        0.0783118
_marg_x3        0.0261884
_marg_x4       -0.3105482
_marg_x5      0.000156693
_marg_x6     -0.000430756
_marg_x7       -0.0977589
-------------------------
*/
```
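The marginal-effect formula in the data step is just the product rule applied to E(y) = P(y > 0) * E(y | y > 0) = logistic(xa) * logistic(xb), with _mfx_a and _mfx_b being the logistic derivatives. A small Python sketch (illustrative coefficients and a single covariate, all hypothetical) verifies the formula against a central finite difference:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def ey(x, a0, a1, b0, b1):
    # E(y) = P(y > 0) * E(y | y > 0) = logistic(xa) * logistic(xb)
    return logistic(a0 + a1 * x) * logistic(b0 + b1 * x)

# Illustrative coefficients, one covariate for clarity.
a0, a1, b0, b1 = 0.5, -0.3, -0.2, 0.8
x = 1.0

xa, xb = a0 + a1 * x, b0 + b1 * x
p_a, p_b = logistic(xa), logistic(xb)
mfx_a = math.exp(xa) / (1 + math.exp(xa)) ** 2   # d logistic(xa) / d xa
mfx_b = math.exp(xb) / (1 + math.exp(xb)) ** 2   # d logistic(xb) / d xb

# Product-rule marginal effect, mirroring _marg_x1 etc. in the data step.
marg = mfx_a * a1 * p_b + mfx_b * b1 * p_a

# Numerical check via central finite difference.
h = 1e-6
fd = (ey(x + h, a0, a1, b0, b1) - ey(x - h, a0, a1, b0, b1)) / (2 * h)
print(round(marg, 6), round(fd, 6))
```

The two printed values agree to numerical precision, confirming that the data-step expression is the analytic derivative of the expected outcome with respect to each covariate.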