How a simple algorithm for generating a random output can be biased may be a very difficult analysis, and unless every step is carefully analyzed, randomness can be easily lost. In this post I'll go through some basics of probabilities, using an ideal coin-toss-generator, and create a completely useless integer generator just for the sake of this post.
Note: Before you read the whole post, let me tell you that there is nothing on real techniques for random integer generation. The idea is to study the equiprobabilty of the results of some algorithms, and see how that translates into randomness. For some real-world random-generation algorithms, I strongly recommend reading chapter 3 of Knuth’s Art of Computer Programming.
Note 2: This post is dedicated to Facundo, who apparently does not enjoy my posts on Silverlight.
Let’s suppose we have a perfect coin. This is, every time we toss it, there are exactly the same chances of getting heads or tails. Or, for this experiments, every time we invoke it we get 0 or 1:
def coin(): return random.randint(0,1)
Running this method 100 000 times yielded rather good results:
0: 0.50042 1: 0.49958
A difference of 8.4 E-04 seems acceptable. So let’s see what we can do with our boolean random generator.
Step one will be simulating a die using this generator. Some of you out there may already know that the naive approach of adding five coins simply does not work:
def die1(): value = 1 for i in range(5): value += coin() return value
These are the results obtained after the 100 000 iterations:
This non-uniform distribution can be explained by taking a look at all the different outcomes of the algorithm based on the possible values of the coin flip.
In the algorithm we are making a total of 5 coin flips, each of them with two possible outcomes. That is a total of 2^5 = 32 total possibilities. Suppose we are looking for a 6 in the die. The only way to obtain it is to get all 1s in the 5 tosses. That is only 1/32 chances. Note that 1/32 is 0.03125, which is rather close to the 0.0288 obtained.
Now for the 3. We must get tails in 2 tosses and heads in the other 3 (or vice versa, depending on to which outcome we assigned the value of 1). There are a few more possibilities now:
|[0, 0, 1, 1, 1]||[1, 0, 0, 1, 1]||[1, 1, 0, 0, 1]|
|[0, 1, 0, 1, 1]||[1, 0, 1, 0, 1]||[1, 1, 0, 1, 0]|
|[0, 1, 1, 0, 1]||[1, 0, 1, 1, 0]||[1, 1, 1, 0, 0]|
|[0, 1, 1, 1, 0]|
That’s ten times more than before, and explains why we are actually getting 11.06 times more threes than sixes.
Let’s freeze our die generator for a second and go for a more general one, so we can later compare the results.
The experiment we just made proved that adding booleans for generating a random integer simply does not work. We should take a different approach.
If we think every coin toss as a node in a decision binary tree, we can easily see how our algorithm performs regarding results. Suppose we have two coin tosses, our tree would look like the following:
As every branch in the tree happens with the same probability, every leaf has 0.25 chances of being returned. This is a nice way of generating a four-result random generator. Or what is the same, an integer generator that returns numbers from 0 to 3.
How do we assign each of this outcomes to a number? Well, being computer scientists, the most natural way is to think them as binary representations of the results.
A random integer generator can then be built by generating numbers through their binary representations. In order not to mess with strings, we can directly build the number by adding powers of two.
This is, if the coin is 1, we add the greatest power of 2 allowed and call the algorithm recursively with the next smaller power. We repeat this process until there are no powers of 2 left.
def generate(pot): if pot == 0: return 0 else: return coin() * 2 ** (pot-1) + generate(pot-1)
Therefore, the only input to the algorithm is the maximum power of 2 that it should generate. To make things simpler, we will use the maximum exponent of the exclusive upper bound to be returned. The lower bound will always be zero.
Now we do have a generator that returns uniformly distributed random integers using our coin toss. Let’s test it by generating die-numbers. The main problem for this is that our generator only handles ranges of [0, 2^k). Therefore, we will use it by discarding any results out of range and re requesting a number. Clearly not the most efficient way, but useful nevertheless.
Now these are properly uniformly distributed values.
Let’s draw the decision tree for our previous die generator. At least, for a 4-sided die, just to keep it small and be able to compare it against our binary (and correct) 4 results generator.
A 4-sided die generator would take 3 tosses (instead of the two required by the binary one), so the tree would look like this:
See why the adding scheme does not work? The more tosses, the more probable are the values in the “middle” of the tree’s last level.
In a next post I’ll attempt to continue with this subject. Some pending matters are seeing what happens if the flow of the algorithm is changed based on partial results, and creating another integer generator, based on prime numbers.