Yet Another Blog in Statistical Computing

I can calculate the motion of heavenly bodies but not the madness of people. -Isaac Newton

Faster Random Sampling with Replacement in SAS

Most SAS users like to use SURVEYSELECT procedures to do the random sampling. However, when it comes to the big dataset, the efficiency of SURVEYSELECT procedure seems pretty low. As a result, I normally like to use data step to do the sampling.

While the simple random sample without replacement is trivial and can be easily accomplished by generating a random number with the uniform distribution, the random sample with replacement doesn’t seem straightforward with the data step. In the demo below, I will show how to do sampling with replacement by both SURVEYSELECT and data step and compare their efficiencies.

First of all, I will artificially simulate a data set with 10 million rows.

data one;
  do i = 1 to 10000000;
  output;
  end;
run;

Secondly, I will wrap SURVEYSELECT procedure into a macro to do sampling with replacement. with this method, it took more than 20 seconds CPU time to get the work done even after subtracting ~1 second simulation time.

%macro urs1(indata = , outdata = );
options mprint;

proc sql noprint;
  select put(count(*), 10.) into :n from &indata;
quit;

proc surveyselect data = &indata out = &outdata n = &n method = urs seed = 2013;
run;

proc freq data = &outdata;
  tables numberhits;
run;

%mend urs1;

%urs1(indata = one, outdata = two);
/*
      real time           30.32 seconds
      cpu time            22.54 seconds

                                       Cumulative    Cumulative
NumberHits    Frequency     Percent     Frequency      Percent
---------------------------------------------------------------
         1     3686249       58.25       3686249        58.25
         2     1843585       29.13       5529834        87.38
         3      611396        9.66       6141230        97.04
         4      151910        2.40       6293140        99.44
         5       30159        0.48       6323299        99.91
         6        4763        0.08       6328062        99.99
         7         641        0.01       6328703       100.00
         8          98        0.00       6328801       100.00
         9          11        0.00       6328812       100.00
        10           1        0.00       6328813       100.00
*/

At last, let’s take a look at how to accomplish the same task with a simple data step. The real trick here is to understand the statistical nature of a Poisson distribution. As shown below, while delivering a very similar result, this approach only consumes roughly a quarter of the CPU time. This efficiency gain would be particularly more attractive when we need to apply complex machine learning algorithms, e.g. bagging, to big data problems.

%macro urs2(indata = , outdata = );
options mprint;

data &outdata;
  set &indata;
  numberhits = ranpoi(2013, 1);
  if numberhits > 0 then output;
run;

proc freq data = &outdata;
  tables numberhits;
run;

%mend urs2;

%urs2(indata = one, outdata = two);
/*
      real time           13.42 seconds
      cpu time            6.60 seconds

                                       Cumulative    Cumulative
numberhits    Frequency     Percent     Frequency      Percent
---------------------------------------------------------------
         1     3677134       58.18       3677134        58.18
         2     1840742       29.13       5517876        87.31
         3      612487        9.69       6130363        97.00
         4      152895        2.42       6283258        99.42
         5       30643        0.48       6313901        99.90
         6        5180        0.08       6319081        99.99
         7         732        0.01       6319813       100.00
         8          92        0.00       6319905       100.00
         9          12        0.00       6319917       100.00
        10           2        0.00       6319919       100.00
*/
Advertisements

Written by statcompute

October 18, 2013 at 10:13 am

%d bloggers like this: