## Faster Random Sampling with Replacement in SAS

Most SAS users like to use SURVEYSELECT procedures to do the random sampling. However, when it comes to the big dataset, the efficiency of SURVEYSELECT procedure seems pretty low. As a result, I normally like to use data step to do the sampling.

While the simple random sample without replacement is trivial and can be easily accomplished by generating a random number with the uniform distribution, the random sample with replacement doesn’t seem straightforward with the data step. In the demo below, I will show how to do sampling with replacement by both SURVEYSELECT and data step and compare their efficiencies.

First of all, I will artificially simulate a data set with 10 million rows.

data one; do i = 1 to 10000000; output; end; run;

Secondly, I will wrap SURVEYSELECT procedure into a macro to do sampling with replacement. with this method, it took more than 20 seconds CPU time to get the work done even after subtracting ~1 second simulation time.

%macro urs1(indata = , outdata = ); options mprint; proc sql noprint; select put(count(*), 10.) into :n from &indata; quit; proc surveyselect data = &indata out = &outdata n = &n method = urs seed = 2013; run; proc freq data = &outdata; tables numberhits; run; %mend urs1; %urs1(indata = one, outdata = two); /* real time 30.32 seconds cpu time 22.54 seconds Cumulative Cumulative NumberHits Frequency Percent Frequency Percent --------------------------------------------------------------- 1 3686249 58.25 3686249 58.25 2 1843585 29.13 5529834 87.38 3 611396 9.66 6141230 97.04 4 151910 2.40 6293140 99.44 5 30159 0.48 6323299 99.91 6 4763 0.08 6328062 99.99 7 641 0.01 6328703 100.00 8 98 0.00 6328801 100.00 9 11 0.00 6328812 100.00 10 1 0.00 6328813 100.00 */

At last, let’s take a look at how to accomplish the same task with a simple data step. The real trick here is to understand the statistical nature of a Poisson distribution. As shown below, while delivering a very similar result, this approach only consumes roughly a quarter of the CPU time. This efficiency gain would be particularly more attractive when we need to apply complex machine learning algorithms, e.g. bagging, to big data problems.

%macro urs2(indata = , outdata = ); options mprint; data &outdata; set &indata; numberhits = ranpoi(2013, 1); if numberhits > 0 then output; run; proc freq data = &outdata; tables numberhits; run; %mend urs2; %urs2(indata = one, outdata = two); /* real time 13.42 seconds cpu time 6.60 seconds Cumulative Cumulative numberhits Frequency Percent Frequency Percent --------------------------------------------------------------- 1 3677134 58.18 3677134 58.18 2 1840742 29.13 5517876 87.31 3 612487 9.69 6130363 97.00 4 152895 2.42 6283258 99.42 5 30643 0.48 6313901 99.90 6 5180 0.08 6319081 99.99 7 732 0.01 6319813 100.00 8 92 0.00 6319905 100.00 9 12 0.00 6319917 100.00 10 2 0.00 6319919 100.00 */