Most SAS users like to use SURVEYSELECT procedures to do the random sampling. However, when it comes to the big dataset, the efficiency of SURVEYSELECT procedure seems pretty low. As a result, I normally like to use data step to do the sampling.

While the simple random sample without replacement is trivial and can be easily accomplished by generating a random number with the uniform distribution, the random sample with replacement doesn’t seem straightforward with the data step. In the demo below, I will show how to do sampling with replacement by both SURVEYSELECT and data step and compare their efficiencies.

First of all, I will artificially simulate a data set with 10 million rows.

data one; do i = 1 to 10000000; output; end; run;

Secondly, I will wrap SURVEYSELECT procedure into a macro to do sampling with replacement. with this method, it took more than 20 seconds CPU time to get the work done even after subtracting ~1 second simulation time.

%macro urs1(indata = , outdata = ); options mprint; proc sql noprint; select put(count(*), 10.) into :n from &indata; quit; proc surveyselect data = &indata out = &outdata n = &n method = urs seed = 2013; run; proc freq data = &outdata; tables numberhits; run; %mend urs1; %urs1(indata = one, outdata = two); /* real time 30.32 seconds cpu time 22.54 seconds Cumulative Cumulative NumberHits Frequency Percent Frequency Percent --------------------------------------------------------------- 1 3686249 58.25 3686249 58.25 2 1843585 29.13 5529834 87.38 3 611396 9.66 6141230 97.04 4 151910 2.40 6293140 99.44 5 30159 0.48 6323299 99.91 6 4763 0.08 6328062 99.99 7 641 0.01 6328703 100.00 8 98 0.00 6328801 100.00 9 11 0.00 6328812 100.00 10 1 0.00 6328813 100.00 */

At last, let’s take a look at how to accomplish the same task with a simple data step. The real trick here is to understand the statistical nature of a Poisson distribution. As shown below, while delivering a very similar result, this approach only consumes roughly a quarter of the CPU time. This efficiency gain would be particularly more attractive when we need to apply complex machine learning algorithms, e.g. bagging, to big data problems.

%macro urs2(indata = , outdata = ); options mprint; data &outdata; set &indata; numberhits = ranpoi(2013, 1); if numberhits > 0 then output; run; proc freq data = &outdata; tables numberhits; run; %mend urs2; %urs2(indata = one, outdata = two); /* real time 13.42 seconds cpu time 6.60 seconds Cumulative Cumulative numberhits Frequency Percent Frequency Percent --------------------------------------------------------------- 1 3677134 58.18 3677134 58.18 2 1840742 29.13 5517876 87.31 3 612487 9.69 6130363 97.00 4 152895 2.42 6283258 99.42 5 30643 0.48 6313901 99.90 6 5180 0.08 6319081 99.99 7 732 0.01 6319813 100.00 8 92 0.00 6319905 100.00 9 12 0.00 6319917 100.00 10 2 0.00 6319919 100.00 */