Be careful using MOD with random numbers

The C function rand is terrible source of random numbers for almost any application, just don’t use it ever. I’m particular fond of RC4 as a source as it is easy to implement and takes very little code. A simple method of generating random numbers with RC4 is to first gather some entropy values. These are values that should be different each time you gather them, and between computers. The nice property about entropy is that it only increases. So as long as you mix in some good high entropy values you can’t “dilute” it by mixing in low entropy ones (unless you do something stupid with the way you mix them). If you have less than 256 bytes of entropy values and you want to keep code size down to a minimum then you can feed those values in as the key to the RC4 stream. However it is better and more flexible to use a hash function on the entropy values and use the resulting hash as the key to the RC4 stream. High precision timers are a good source of entropy, as are various input sources such as mouse position, memory usage, process lists etc. Just keep on feeding more entropy values into the hash function. As long as you have some pretty good values in there you should be okay. For extra points run the hash function multiple times before using the final value as the key (ie hash the hash and repeat). This makes it much more difficult for someone to determine the initial state of the RC4 stream.

Now you have an RC4 stream that has been keyed by a random value. Remember to throw away the first 256 bytes and now you have a very good source of random bytes. RC4 produces one byte per iteration. This makes it very easy to get random values of bytes, or larger words. If you want a 32 bit random number, just output four bytes, you don’t even have to worry about endianness.

Sometimes, however, you need a range of random values that is not a multiple of bytes. If for example you need random numbers in the range 0-31, this is easily produced. For each value take an output byte from RC4 and then chop off the top 3 bits. This can be easily done with an AND operation (with value 0x1f), or alternatively a MOD 32 operation.

newValue = randValue & 0x1f;
newValue = randValue % 32;

The new value is now a random value in the range 0-31 and still has a nice flat distribution. This is because each bit in the RC4 stream has a random distribution and we can take any 5 bits out of the 8 bit output. Another (more complicated) method would be to treat the RC4 as a bitstream and take 5 bits at a time. This requires more both programatically and computationally and provides no advantage, its not as if you are “wasting” the 3 bits produced by the first method and this is more efficient!.

While in the above example using MOD is equivalent to using the AND function, it can give the wrong impression that you can use the MOD operation more flexibly to get ranges that are not a whole number of bits. For example if you want the range of values to be 0-29, it looks like this could easily be produced by using the MOD function with a value 30. Don’t do this, not ever. The problem is that the MOD function is a mapping of the 256 values into 30 values. 256 does not evenly divide into 30.

The following original 8 bit values become a 0 when MOD 30 is applied: 0, 30, 60, 90, 120, 150, 180, 210, 240. That is 9 original values map to a 0. Similarly there are 9 values that map to 1 (1, 31, 61, 91, 121, 151, 181, 211, 241). So far so good, but look what happens for the values that map to 16: 16, 46, 76, 106, 136, 166, 196, 226. This is only 8 values. So in the final output the values from 0-15 have 9 original input values, whereas the values 16-29 have 8 original input values. This is not a flat distribution, instead of each value 0-29 having a probability for selection of 3.3333%, half of them have a probability of 3.5156% and the other half have a lower probability of 3.125%. This is bad for any application, and tragic for encryption.

There is fortunately a very simple solution which will produce a flat distribution of any desired range. First truncate the value to the next whole number of bits for the range wanted. In the case of the range 0-29 this is 32 (5 bits). If the value is larger than 8 bits, then produce larger values by outputting multiple bytes at a time, filling the word, and then mask the required bits. Next check to see if the value is within the desired range (eg 0-29). If it is then output it, if it is not, then throw it away and repeat the process until one is found within the range. This will produce a flat distribution of values. The function will take a variable amount of time to process as it may have to repeat several times before it gets a value within the range, however because we have truncated the bits to start with, there will always be more than a 50% chance of a value being in the range desired on each try, so it shouldn’t have to repeat much. You could skip the first step, but then you would be throwing away a lot more values and therefore repeating a lot more, but the output will still be just as good.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s