Re: help with random #s!

[text edited slightly since first posted]

Newsgroups: comp.lang.c
From: scs@eskimo.com (Steve Summit)
Subject: Re: help with random #s!
Date: Wed, 17 Nov 1993 06:10:36 GMT
Message-ID: <CGMH5u.1r7@eskimo.com>

In article <2c80r2INN4js@umbc8.umbc.edu>, Jonas Schlein writes:
> ...Why not just do the % operator? For instance if I wanted a number
> from 0 - 99 I would do something like:
> rand () % 100;

This ought to work, but has problems if the implementation of rand() is poor, which is all too common.

In article <ZY7IBWQQ@math.fu-berlin.de>, denholm writes:
> doesn't this introduce a non-uniformity..?
> it is clearer if one considers RAND_MAX=150
> rand() % 100 will give 0->49 twice as often as 50->99
>
> (double) rand() / RAND_MAX will give a real 0->1 ...

It won't be a ``real'' 0-1 range; it will still be quantized into 150 (i.e. RAND_MAX) distinct values, and the nonuniformity will persist.

The question of how to produce pseudorandom numbers in a given range is frequently-asked and not nearly as trivial as it might at first appear. One might well wonder why it is not addressed in the comp.lang.c FAQ list -- the reason is that I have not been satisfied with my own understanding of the subject. Having spent much of today exploring at least one facet of the issue to my satisfaction, I am ready to say a few words about this latest incarnation of the ``help with random #s!'' thread, and provide an inverse outline of the entry which I will soon add to the FAQ list. (I say ``inverse outline'' because the entry to be added to the FAQ list will in fact be an outline or summary of this article, which will be considerably too long for an FAQ list entry.)

We will consider several ways of writing the routine

	int myrand(unsigned int n)

, which is supposed to return well-distributed random integers in the half-open interval [0, n); that is, integers x such that 0 <= x < n. All of our attempts at writing myrand will be based on the underlying, ANSI Standard rand routine, which returns integers in the closed interval [0, RAND_MAX]. (It is obviously trivial to use our extend myrand to generate random numbers for ranges which are not 0-based.) RAND_MAX is #defined in <stdlib.h>; that header should be considered to be #included in the code fragments below.

There are a couple of things to worry about when writing myrand. One is that many existing rand implementations are deficient; in particular, they are not very random in the low-order bits. Another is that unless n is much less than RAND_MAX, myrand's output distribution is likely to be lopsided.

The most obvious way of writing myrand is

	myrand1(n)
	unsigned int n;
	{
	return rand() % n;
	}

It is a very great tragedy that this implementation cannot be generally used, because it otherwise displays a number of significant virtues, and would otherwise be preferable to the alternatives which I shall consider, all of which have deficiencies of their own.

One of myrand1's virtues is that it is very obvious how it works, and that it works. Another is that it does not require knowing RAND_MAX, which is useful when code must be portable to pre-ANSI systems. Still another is that its output distribution is superior (when n is not much less than RAND_MAX) to another alternative which I shall consider later on.

However, the widespread existence of poor rand implementations dooms myrand1 for general-purpose use. If the underlying rand implementation is a linear congruential random number generator with period m, and if n in myrand1 is a divisor of m, then myrand1's output will repeat with period n [Knuth Volume 2 Sec. 3.2.1.1; p. 12 in the second edition]. If m matches the computer's word size and is a power of 2 (a popular choice), then sequential calls to myrand1(2) will generate output which alternates 0, 1, 0, 1, 0, 1...

Our second attempt at implementing myrand gets at the higher order bits of rand's output by using RAND_MAX and floating point:

	myrand2(n)
	unsigned int n;
	{
	return (double)rand() / ((double)RAND_MAX + 1) * n;
	}

myrand2 works well; its only real disadvantage is that it uses floating point, which can decrease speed and increase executable size.

(It is often claimed that an implementation like myrand2 will also overcome distribution problems if n approaches RAND_MAX; but this belief is quite false, as a moment's thought will demonstrate. As long as rand's output domain is quantized and consists of RAND_MAX+1 values, no simple mapping, floating or integral, can avoid maldistribution when (RAND_MAX+1) % n != 0.)

If we can't trust rand to be random in the low-order bits, and we don't want to use floating point, we can use what amounts to a fixed-point version of myrand2:

	myrand3(n)
	unsigned int n;
	{
	return rand() / (RAND_MAX / n + 1);
	}

This is probably the preferable routine for general-purpose use. Its only drawback is that if n is greater than about RAND_MAX/2, it will not only distribute its outputs unevenly, but some values will not appear at all. (As n approaches RAND_MAX, myrand1 has a bimodal distribution, but myrand3 tends to have a trimodal distribution, with some values at zero. myrand2, on the other hand, will display the same number of ``heavy'' outputs as myrand1, but spread out over the range.) However, as long as n is much less than RAND_MAX, as is usually the case, myrand3 should be perfectly adequate.

If you want to avoid the discrepancies that arise when (RAND_MAX+1) % n != 0, you don't have much choice but to call rand multiple times, as in our final attempt:

	#define MAXTIMES 5 
	myrand4(n)
	unsigned int n;
	{
	int d = ((unsigned)RAND_MAX + 1) / n;
	unsigned int m = d * n;
	int i;
	int r;

	for(i = 0; i < MAXTIMES; i++)
		{
		r = rand();
		if((unsigned)r < m)
			return r / d;
		}

	return r / (RAND_MAX / n + 1);
	}

Here, we throw away some of rand's output values, to simulate another random number generator with a period which is a multiple of n. The obvious concern is that if we're really unlucky, we may hit a long run of these discarded values, and waste a lot of time calling rand during one call to myrand4. (In fact, if we wait long enough, we might get an arbitrarily long string of values we need to discard.) Therefore, the code above implements a limit; if it gets too many bad values in a row, it falls back to to myrand3's algorithm. (The code displays a classic tradeoff between time and accuracy: the larger you make MAXTIMES, the less will be the accumulated error, but the longer you might spend on a particular call. If MAXTIMES is 1, myrand4 is essentially myrand3.)

Since this is obviously a rather pedagogical article, I can't resist suggesting a few exercises for the reader. (I include these as exercises not only because they are interesting but also because it's getting late and I don't feel like figuring out and presenting their answers; therefore, please don't send me e-mail asking for the answers or inquiring whether your answers are correct.)

Write myrandmn(int m, int n), which returns integers in the range m to n. Use myrand to do most of the work. Should myrandmn be defined in terms of a closed or half-open interval?
Explain the presence of the +1 ``fudge factors'' in myrand2, myrand3, and myrand4. Is the expression in myrand3 parenthesized correctly?
Why does myrand accept an unsigned value? (This one I do have an answer for; see below.)
Fix myrand3 so that in the worst case it has a bimodal rather than trimodal distribution, yet still uses no floating point. Consider using something like Walker's Algorithm [Knuth Volume 2 Sec. 3.4.1; p. 115 in the second edition].
Improve myrand4 so that when it does have to break out of the loop to avoid too many calls to rand, it at least spreads the ``leftover'' values around evenly, rather than always favoring the same values that myrand3 favors. Hint: you'll probably need a local static variable.
Investigate the behavior of various myrand implementations, in conjunction with several underlying rand implementations, some with good and some with poor randomness properties. Make RAND_MAX artificially small, to explore in detail the behavior as n approaches RAND_MAX. Collect histograms to display the output distribution. Verify whether the output repeats with unacceptably low period.

Finally, a few disclaimers. As I mentioned, random numbers are subtle, and not nearly as trivial as they might at first appear. Knuth devotes some 177 pages to the subject, not all of which I can claim to understand completely, yet he does not touch on many of the implementation issues I have raised here. This article satisfies one level of my own curiosity about the subject, but is by no means definitive. My assessment of the tradeoffs between the four code fragments I have presented is superficial; there are a few additional factors I haven't fully explained, and doubtless more which I'm not fully aware of. I just wrote myrand3 today and don't distinctly remember seeing something like it before, it may have some lurking deficiency as ultimately fatal as myrand1's.

Steve Summit
scs@eskimo.com

Answers to selected exercises:

3. myrand accepts unsigned values so that it works in the limiting case, namely myrand(RAND_MAX+1), when RAND_MAX == INT_MAX (as is usually the case).