How the SAS Random Number Generators Work

From sasCommunity

RANxxx() functions and CALL RANxxx() subroutines

All the SAS RNGs named RANxxx are based on RANUNI and use some transform, inversion, or acceptance/rejection method to generate pseudorandom number streams with various other distributional properties. UNIFORM() is an alias for RANUNI(), and NORMAL() is an alias for RANNOR().

RANUNI() uses a multiplicative linear congruential generator (from SAS docs) where

  • SEED = mod( SEED * 397204094, 2**31-1 )

and then returns

  • SEED / (2**31-1)

as the uniform random number. When using the SAS random number functions or subroutines, specify a SEED in the range 1 to 2**31-2 to fix the starting point of the pseudorandom number stream, or use a nonpositive integer (0 or negative) to create the initial seed from the system clock. If a nonpositive number is used, SAS reads the system time and computes the initial seed value using an algorithm equivalent to the following (see http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0902d&L=sas-l&D=0&P=23540):

SEED = 1e3 * mod(round(1e3 * datetime()), 1e6) + 1;
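As a quick check, the formula can be run in a data step to see what kind of seed it produces. This is only a sketch of the equivalent algorithm quoted above, not SAS's internal code. Because mod(..., 1e6) lies between 0 and 999999, the result always falls between 1 and 999,999,001, which is inside the valid seed range.

```sas
data _null_;
  /* milliseconds since the SAS datetime origin, rounded to an integer */
  ms = round(1e3 * datetime());
  /* keep the last six digits, scale by 1000, and add 1 so the seed is never 0 */
  seed = 1e3 * mod(ms, 1e6) + 1;
  put ms= seed=;
run;
```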

The period of the RANUNI generator is 2**31 - 2. (The SAS documentation says the period is 2**31 - 1, but the seed can only take the values 1 through 2**31 - 2, so the sequence repeats after 2**31 - 2 numbers.) To illustrate what this means, examine the following simple RNG.

data random;
  seed=6; *seed must be between 1 and 30 (i.e. 2**5-2);
  do _n_=1 to 100;
    seed = mod( 3*seed, 2**5-1 );
    urand = seed/(2**5-1);
    output;
  end;
run;

The period for this generator is 2**5-2 = 30. It will generate a sequence of 30 numbers and then repeat that sequence over and over again. In the above example:

  • the initial seed is 6
  • the first updated seed value will be mod(3*6, 31) = 18, and
  • the first uniform pseudorandom number is 18/31 or approximately 0.5806452. The updated seed value will be saved for use the next time a pseudorandom number is to be computed.

The next time a pseudorandom number is called for:

  • the saved updated seed value (18) will be used to get the next seed value, mod(3*18, 31) = 23, and
  • the pseudorandom number returned is 23/31 or approximately 0.7419355.

With an initial seed of 6, the sequence of updated seed values generated will be

18 23  7 21  1  3  9 27 19 26 16 17 20 29 25 13  8 24 10 30 28 22  4 12  5 15 14 11  2  6

and that cycle will repeat for as long as the data step continues to run. These numbers are not computed in advance, stored, and looked up when needed; they are computed on the fly as requested. But because the process is deterministic, we know in what order they will occur. If you change the initial seed to, say, 22, the sequence will begin at a different place,

4 12  5 15 14 11  2  6 18 23  7 21  1  3  9 27 19 26 16 17 20 29 25 13  8 24 10 30 28 22

but it will be the same ordering of updated seeds (and uniform random numbers). If you look for 4 in the previous sequence (where the initial seed was 6), you will see that it is followed by the same numbers as here, where the initial seed was 22, with the numbers wrapping around to the beginning of the sequence.
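The cycle length can also be checked directly rather than read off the listing. The sketch below runs the same toy generator from both starting seeds and counts the steps until the seed returns to its starting value; the variable names start and period are illustrative. For both seeds it should report period=30 and print the two rotated sequences shown above.

```sas
data _null_;
  do start = 6, 22;
    seed = start;
    period = 0;
    put 'start=' start 'sequence:' @;
    /* iterate the toy generator until the seed comes back to where it began */
    do until (seed = start);
      seed = mod(3*seed, 2**5-1);
      period + 1;
      put seed @;
    end;
    put / period=;
  end;
run;
```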

Obviously, this toy RNG is not very useful because it doesn't produce enough unique numbers and the period is too short. However, it illustrates how RANUNI works. As stated above, the period for RANUNI is 2**31-2, or 2,147,483,646. It will generate a sequence of 2**31-2 pseudorandom numbers before it starts repeating.
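The same experiment can be tried at full size. The sketch below applies the documented recurrence by hand and prints the result next to CALL RANUNI started from the same seed. One caveat: the product SEED * 397204094 can exceed 2**53, so it cannot be formed exactly in a SAS numeric variable. The code therefore splits the multiplier into high and low 16-bit parts (a choice made for this sketch, not something taken from the SAS documentation) so that every intermediate value stays exactly representable. The starting seed 12345 and all variable names are arbitrary. If CALL RANUNI follows the documented recurrence, the two seed columns and the two uniform columns should agree.

```sas
data _null_;
  retain a 397204094 m 2147483647;  /* multiplier and modulus 2**31-1 */
  retain call_seed 12345;           /* seed variable updated by CALL RANUNI */
  seed = 12345;                     /* same starting point for the hand-rolled version */
  a_hi = floor(a / 65536);          /* a = a_hi*65536 + a_lo */
  a_lo = mod(a, 65536);
  do _n_ = 1 to 5;
    /* (seed*a) mod m, computed in pieces that each stay below 2**53 */
    hi = mod(mod(seed * a_hi, m) * 65536, m);
    lo = mod(seed * a_lo, m);
    seed = mod(hi + lo, m);
    u_manual = seed / m;
    call ranuni(call_seed, u_call);
    put seed= call_seed= u_manual= u_call=;
  end;
run;
```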

Differences between RANxxx functions and CALL RANxxx subroutines

The RANxxx functions and CALL RANxxx subroutines use the same algorithm, but differ in how the seed is managed. The difference is that with the CALL versions, one can change the seed and the stream of numbers will change, and one can maintain separate number streams with different starting points. With the FUNCTION versions, the starting point in the random number stream is fixed by the first reference to a FUNCTION RNG, and all FUNCTION RNG calls use the same stream of seeds.

RANxxx functions

  1. The seed can be a constant or variable with an integer value in the range [1, 2147483646] (or 0 to seed from the system time).
  2. Within a data step, the seed used by the first RANxxx function executed initializes the RNG (sets the sequence of pseudorandom numbers).
  3. All RANxxx functions encountered later draw from the same stream.
  4. Changing the value of a seed variable does not alter the sequence of random variables within a data step.
  5. The value of a variable used as a random number seed IS NOT altered by calling a RANxxx function (the internal seed, once initialized, is no longer available to the data step), and altering the seed variable value after it has been used does not affect the values returned by the RNG function.

CALL RANxxx subroutines

  1. The seed MUST be a variable; a constant is not allowed. The seed variable should be retained, or it will be set to missing at the top of the data step. The SAS documentation does not seem to be explicit about this, but its examples show the seed being initialized with a RETAIN statement when using the CALL RANxxx subroutines.
  2. Each occurrence of a CALL RANxxx subroutine sets up an RNG which can be initialized with its own seed, independent of other subroutines.
  3. Each CALL RANxxx() subroutine can draw from its own stream of numbers, not influenced by other subroutines, if a different seed variable is used. If one uses the same seed variable in multiple CALL RANxxx subroutines, then they will all draw from the same sequence of numbers as the RANxxx functions do.
  4. Changing the value of a seed variable alters the stream of numbers.
  5. The value of a variable used as a random number seed IS altered each time CALL RANxxx is executed, and that value is available to the data step. Changing the value of the seed variable used by a CALL RANxxx subroutine will alter the output of the pseudorandom number generator.

In the following data step, seed1 and seed2 are set and never change. If one interleaves the values of r1 and r2 (r1, r2, r1, r2, ...), they form the same sequence as the values of r3, because the seed for the RANxxx functions is set by the first use of any of the family of functions and all the RANxxx functions draw from the same sequence of numbers. The value of seed2 is therefore never used. However, the CALL RANxxx subroutines can be started at different points in the sequence of numbers. The values of seed3 and seed4 change with each call, and they can be changed to alter the output of the call (not recommended, because unless one is very careful it is possible to introduce serial correlation into the sequence of numbers or otherwise modify the sequence in unhelpful ways).

data _null_;
  seed1 = 1;
  seed2 = 3271985;
  retain seed3 1 seed4 3271985 ;
  do _n_ = 1 to 10;
    r1 = ranuni(seed1);
    r2 = ranuni(seed2);
    call ranuni(seed3, r3);
    call ranuni(seed4, r4);
    put _all_;
  end;
run;
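A second sketch, built on the rules listed above, shows what happens when a seed variable is changed in mid-stream. The variable names fseed, cseed, f and c and the replacement seed 999 are illustrative. According to those rules, f and c should match for the first three iterations; from the fourth iteration on, f should continue the original stream while c restarts from the new seed.

```sas
data _null_;
  fseed = 1;          /* seed argument for the RANUNI function */
  retain cseed 1;     /* seed variable for CALL RANUNI */
  do _n_ = 1 to 6;
    if _n_ = 4 then do;
      fseed = 999;    /* expected to have no effect on the function stream */
      cseed = 999;    /* expected to restart the CALL stream at a new point */
    end;
    f = ranuni(fseed);
    call ranuni(cseed, c);
    put _n_= fseed= f= cseed= c=;
  end;
run;
```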

RAND() -- Another RNG function (new to SAS 9)

The newest SAS RNG is RAND (see also RANDGEN) which is available in SAS 9.1 and later. This RNG is based on the Mersenne-Twister algorithm. It is much more complicated than the linear congruential algorithm and has a cycle length of 2**19937-1 according to the SAS documentation. Most SAS users (and I include myself in this group) will have to simply accept what the number theorists have to say about the properties of this RNG as there is no way to generate this many numbers in a lifetime given current computing power.

To select the seed for the RAND function, use the CALL STREAMINIT routine. Do this once, before using RAND() to generate any random numbers. If you don't use CALL STREAMINIT, or if you specify a nonpositive seed, the seed will be set using the system clock.
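A minimal sketch of this pattern follows. The seed value 12345 and the data set name work.rand_demo are arbitrary choices for illustration.

```sas
data work.rand_demo;
  call streaminit(12345);   /* set the seed once, before the first RAND call */
  do i = 1 to 5;
    u = rand('UNIFORM');    /* uniform on (0,1) */
    z = rand('NORMAL');     /* standard normal */
    output;
  end;
run;
```

Rerunning the step with the same seed should reproduce the same values of u and z.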

Questions

which pseudo-random number generators (RNGs) are best for which situations?

One could argue that for most of the situations that the average SAS user will encounter, the RANxxx functions or subroutines are adequate.

what are the pros, cons, merits and gotchas related to each of those methods?

The period or cycle length is very important; the longer the better. For serious statistical work, simulations, etc., where one is going to simulate a lot of data that needs to be "as random as possible" then one ought to be using the RAND() function which has a longer period and better properties than the RANxxx functions.

Changing the seed when using the CALL RANxxx subroutines could lead to serial correlations or other disturbances in the properties of the RNGs.

which seeds are recommended for various situations and methods?

It doesn't matter which seed one uses as long as the seed is within the range specified in the documentation and as long as one doesn't use the same seed every time an RNG is initialized. Again, the seed only initializes where one begins in the pseudorandom sequence of numbers, and this sequence is fixed by the specific algorithm used and the initial seed.

which seeds should be avoided for various situations and methods?

No seed needs to be avoided. As long as the seed meets the requirements of the method used it is appropriate.

why might one want to keep a record of the seed they used?

In order to be able to replicate results if a client or boss asks you to show how you got your results.
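One simple way to keep that record, sketched here with a hypothetical macro variable named seed, is to set the seed in one place and echo it to the log so that it is saved along with the rest of the run.

```sas
%let seed = 3271985;                 /* hypothetical seed; change in one place only */
%put NOTE: random seed used = &seed;

data work.sim;
  call streaminit(&seed);
  do i = 1 to 5;
    x = rand('UNIFORM');
    output;
  end;
run;
```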

what are the benefits and differences from using the random number generator functions and calls?

The RANxxx function RNGs are the same as the CALL subroutine versions in terms of the stream of random numbers which are produced. The difference is that with the CALL versions, one can change the seed and the stream of numbers will change, and one can maintain separate number streams with different starting points. With the FUNCTION versions, the starting point in the random number stream is fixed by the first reference to a FUNCTION RNG, and all FUNCTION RNG calls use the same stream of seeds.

what are some of the ways users have benefited from the availability of the random number generator functions and calls?

The SAS community is encouraged to chime in here, but at least one benefit is the ease with which one can do "random sampling".
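Two common sampling idioms are sketched below using the SASHELP.CLASS sample data set. The output data set names, the seed, the 0.5 sampling rate and the sample size of 5 are all arbitrary. The first step keeps each row independently with a fixed probability, so the sample size varies; the second sorts on a random key and keeps the first rows, which gives a sample of exactly the requested size.

```sas
/* approximate-size sample: keep each row with probability 0.5 */
data work.sample;
  set sashelp.class;
  if ranuni(3271985) < 0.5;
run;

/* exact-size sample: attach a random key, sort on it, keep the first 5 rows */
data work.keyed;
  set sashelp.class;
  sortkey = ranuni(3271985);
run;

proc sort data=work.keyed;
  by sortkey;
run;

data work.sample5;
  set work.keyed(obs=5);
run;
```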

Another area in which random number generators have proven useful is data masking. They can also supply numerical keys to be used for encryption purposes in conjunction with other bit operations in SAS (for more information, look for the function XOR in the SAS documentation).

Random number generators might also be helpful to people who might be using Monte Carlo simulations.

is there anything else one should know?

Probably. Others are welcome to add neglected topics, corrections, or otherwise reorganize things.

One major thing to note is that while these random number generators are statistically random, they are not cryptographically secure generators [1].