Saturday, 8 December 2018

Thoughts on Random Number Generators

In DevOps, I’ve used RNGs for creating message payloads of arbitrary size, and for file or directory names. These values are often created using scripts written in bash. This first article explores three simple RNGs that can be run from bash. It is not an exhaustive list; there are others, such as jot, that are also easy to use from bash. However, the three described here are likely to already be installed on your Linux box.
The three RNGs being evaluated here are:

- the bash RANDOM shell variable
- the GNU Awk rand() function
- the /dev/urandom device

Caution: Do not use for password generation!
None of these tools are suitable for generating passwords or for cryptography.
Source and test data for this article can be found here.
In future articles we will take a closer look at random values used in testing code and how they can be used for model simulation in statistics.

Testing Numeric Sequences

This discussion will not test the programs for randomness. Instead, we are going to evaluate a short list of 1000 values generated by each RNG. To ease comparison, the values are scaled to the range [0,1), rounded to two decimal places, and formatted as 0.nn.
There are many tests that can be applied to sequences to check for randomness. Here we are only looking at one: the Bartels rank test. Its limitations, and those of other tests, are described here. I’ve chosen this test as it is relatively easy to understand and interpret. Rather than comparing the magnitude of each observation with that of its preceding sample, the Bartels rank test ranks all the samples from smallest to largest. The rank is the corresponding sequential number in the list of possibilities. Under the null hypothesis of randomness, any rank arrangement from all possibilities should be equiprobable. The Bartels rank test is also suitable for small samples.
To get a feel of the test: consider two of the vignettes provided by the R package, randtests.

Example 5.1 in Gibbons and Chakraborti (2003), p.98.

Annual data on total number of tourists to the United States for 1970-1982.
years <- 1970:1982
tourists <- c(12362, 12739, 13057, 13955, 14123, 15698, 17523, 18610, 19842,
20310, 22500, 23080, 21916)
plot(years, tourists, col = 'purple')
bartels.rank.test(tourists, alternative = "left.sided", pvalue = "beta")
## 
##  Bartels Ratio Test
## 
## data:  tourists
## statistic = -3.6453, n = 13, p-value = 1.21e-08
## alternative hypothesis: trend
What this tells us is that there is strong evidence against the null hypothesis of randomness in the sample data. Instead, it favours the alternative hypothesis: that of a trend.
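To get a feel for where that statistic comes from, here is a rough sketch in awk (using the tourists data above; the variance formula is the exact small-sample one). It computes the ratio RVN = sum((R_i - R_{i+1})^2) / sum((R_i - Rbar)^2) from the ranks, then standardizes it about its expected value of 2:

```shell
# Compute the Bartels ratio RVN for the tourists series, then standardize it.
echo "12362 12739 13057 13955 14123 15698 17523 18610 19842 20310 22500 23080 21916" |
awk '{
  n = NF
  for (i = 1; i <= n; i++) {
    # rank of element i = 1 + number of smaller elements (no ties here)
    r[i] = 1
    for (j = 1; j <= n; j++) if ($j < $i) r[i]++
  }
  num = 0; den = 0; mean = (n + 1) / 2
  for (i = 1; i < n; i++)  num += (r[i] - r[i+1])^2
  for (i = 1; i <= n; i++) den += (r[i] - mean)^2
  rvn = num / den
  # exact variance of RVN under the null hypothesis
  var = 4 * (n-2) * (5*n^2 - 2*n - 9) / (5 * n * (n+1) * (n-1)^2)
  printf("RVN = %.4f, standardized = %.4f\n", rvn, (rvn - 2) / sqrt(var))
}'
```

The standardized value agrees with the statistic = -3.6453 reported by bartels.rank.test above.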

Example in Bartels (1982)

Changes in stock levels for 1968-1969 to 1977-1978 (in $AUD million), deflated by the Australian gross domestic product (GDP) price index (base 1966-1967). Source
y <- c(528, 348, 264, -20, -167, 575, 410, -4, 430, -122)
x1 <- 1968:1977
x2 <- 1969:1978
df <- data.frame(period = paste(sep = '-', x1, x2), gdp = y, stringsAsFactors = TRUE)
plot(df, xlab = 'Financial Year', ylab = 'GDP ($AUD million)')
bartels.rank.test(y, pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  y
## statistic = 0.083357, n = 10, p-value = 0.9379
## alternative hypothesis: nonrandomness
Here, the sampled data provides only weak evidence against the null hypothesis of randomness, so it does not support the alternative hypothesis of non-random data.
(For a simple guide on how to interpret the p-value, see this)

Random Number Generators that can be used in Bash scripts

The following sections describe three RNGs. Each section gives a short description and typical use of the RNG; then a list of 1000 values is produced and analysed using the Bartels rank test.

bash RANDOM variable

Bash provides the shell variable $RANDOM, which generates a pseudorandom integer in the range 0 to 32767 (15 bits).
RANDOM is easy to use in bash:
RANDOM=314
echo $RANDOM
## 1750
Here, we have seeded RANDOM with a value. Seeding with the same value returns the same sequence of numbers, which is required for results to be reproducible.
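A quick sanity check of that claim (a sketch, assuming bash is on the PATH): two separately seeded shells should emit identical sequences.

```shell
# Re-seeding RANDOM with the same value replays the same sequence,
# so a seeded run is reproducible across invocations.
a=$(bash -c 'RANDOM=314; echo $RANDOM $RANDOM $RANDOM')
b=$(bash -c 'RANDOM=314; echo $RANDOM $RANDOM $RANDOM')
[ "$a" = "$b" ] && echo reproducible
```

Note that the actual values may differ between bash versions; only the within-version sequence is fixed by the seed.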
To generate a random integer between START and END use:
RANGE=$(( END - START + 1))
echo $(( (RANDOM % RANGE) + START ))
where START and END are non-negative integers with START < END.
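One caveat: the modulo step is slightly biased toward smaller results whenever RANGE does not divide 32768 evenly. A minimal sketch of the usual fix, rejection sampling (variable names are my own), discards draws from the incomplete final block:

```shell
#!/bin/bash
# RANDOM spans 0..32767 (32768 values). If RANGE does not divide 32768,
# low results of (RANDOM % RANGE) occur slightly more often than high ones.
# Rejection sampling: discard draws at or above the largest exact multiple of RANGE.
START=1
END=6
RANGE=$(( END - START + 1 ))
LIMIT=$(( 32768 - 32768 % RANGE ))   # 32766 for RANGE=6
r=$RANDOM
while [ "$r" -ge "$LIMIT" ]; do
   r=$RANDOM
done
echo $(( (r % RANGE) + START ))
```

For a six-sided die the bias is tiny (2 of the 32768 values are over-represented), so the plain modulo form above is fine for casual use.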
For example, to simulate 10 rolls of a six-sided die:
START=1
END=6
RANGE=$(( END - START + 1 ))

RANDOM=314;for i in $(seq 10); do
   echo -n $(( (RANDOM % RANGE) + START )) " "
done
## 5  4  6  3  2  1  1  2  2  4

Checking Sequences from RANDOM

Prepare the test data:
RANDOM=314;for i in $(seq 1000); do
   printf "%0.2f\n" $(bc <<< "scale=2; $RANDOM/32768"); 
done > ./bash.random
bashText <- readLines(con <- file(paste0(getwd(), '/bash.random')))
close(con)
bashRandom <- as.numeric(bashText)
Show first 10 values:
head(bashRandom, n = 10)
##  [1] 0.41 0.62 0.66 0.11 0.88 0.18 0.35 0.29 0.30 0.09
Plot the sequence vs value from RNG:
plot(x = seq(1000), y = bashRandom, 
     xlab = 'sequence', ylab = 'RANDOM', 
     title('bash RANDOM'), 
     pch = 20, cex = 0.4, col = 'purple')
Run Bartels Rank test:
bartels.rank.test(bashRandom, 'two.sided', pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  bashRandom
## statistic = -0.17763, n = 1000, p-value = 0.8591
## alternative hypothesis: nonrandomness

Result

With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.

awk rand()

Here we are using the rand() function from GNU Awk. It generates random numbers greater than or equal to 0 and less than 1.
Example where rand() is seeded:
echo | awk -e 'BEGIN {srand(314)} {print rand();}'
## 0.669965
If you never call srand(), rand() starts from the same fixed seed and returns the same sequence on every run; calling srand() with no argument seeds from the current time of day instead.
You can also generate random integers in a range. For example, to simulate 10 rolls of a six-sided die:
echo | awk 'BEGIN {srand(314)} {for (i=0; i<10; i++) printf("%d ", int(rand()*6+1));}'
## 5 2 2 5 3 6 1 6 5 5

Checking Sequences from rand()

Prepare the test data:
seq 1000 | awk -e 'BEGIN {srand(314)} {printf("%0.2f\n",rand());}' > awk.random
awkText <- readLines(con <- file(paste0(getwd(), '/awk.random')))
close(con)
awkRandom <- as.numeric(awkText)
Show first 10 values:
head(awkRandom, n = 10)
##  [1] 0.67 0.33 0.22 0.78 0.40 0.84 0.11 0.94 0.70 0.68
Plot the sequence vs value from RNG:
plot(x = seq(1000), y = awkRandom, 
     xlab = 'sequence', ylab = 'rand()', 
     title('awk rand()'), 
     pch = 20, cex = 0.4, col = 'purple')
Run Bartels Rank test:
bartels.rank.test(awkRandom, 'two.sided', pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  awkRandom
## statistic = 0.29451, n = 1000, p-value = 0.7685
## alternative hypothesis: nonrandomness

Result

With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.

urandom device

The final tool we will look at is the /dev/urandom device. The device provides an interface to the kernel’s random number generator. This is a useful tool as it can generate a wide variety of data types.
For example, to print a list of unsigned decimal integers using od(1):
seq 5 | xargs -I -- od -vAn -N4 -tu4 /dev/urandom
##  2693130566
##  2935068248
##    55879791
##  1981604742
##  3938513101
It can also be used to source random hexadecimal values:
seq 5 | xargs -I -- od -vAn -N4 -tx4 /dev/urandom
##  43541111
##  a3ddd4ba
##  1beec8f7
##  eb1a967b
##  0c62a481
Or it can generate blocks of 64 random alphanumeric characters:
cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 64 | head -n 5
## FrnEm2U112PUt1S3Qwnzx8oQb9luADbfqx2vaQmRiDVU9glWzmQSTEfRWeHlWBol
## HmHfI8PDveicuOUPkC0hl1nXNuGKJZF7SplTbkAd090H6A9ZMMOtn7bVMLJC5rud
## 0yZkFLeLDLLCIlAPoLOTB86j3SeMX7me38imXCamgW73zq2YiPIV7HHVUbu6zRAC
## kIMGUrDixHYbRCpfKCs0X0KQclkdfjvavw0RuxAcKEVIMgbtDzyGB6H1xiq4MY3G
## 2JrlyQrHwyYwl7c3T8OQalFnMh5Ncio2IxYhFO8pVDIuVeBcBjooZRZ4nA0Wo32W

Checking Sequences from urandom

Prepare the test data:
cat /dev/urandom | tr -dc  '0-9' | fold -w2 | \
   awk '{printf("0.%02d\n",$1)}' | head -1000 > urandom.random
urText <- readLines(con <- file(paste0(getwd(), '/urandom.random')))
close(con)
urRandom <- as.numeric(urText)
Show first 10 values:
head(urRandom, n = 10)
##  [1] 0.99 0.56 0.82 0.05 0.77 0.92 0.73 0.79 0.69 0.11
Plot the sequence vs value from RNG:
plot(x = seq(1000), y = urRandom, 
     xlab = 'sequence', ylab = 'urandom', 
     title('/dev/urandom'), 
     pch = 20, cex = 0.4, col = 'purple')
Run Bartels Rank test:
bartels.rank.test(urRandom, 'two.sided', pvalue = 'beta')
## 
##  Bartels Ratio Test
## 
## data:  urRandom
## statistic = 0.68108, n = 1000, p-value = 0.4961
## alternative hypothesis: nonrandomness

Result

With a p-value > 0.05 there is weak evidence against the null hypothesis of randomness.

Some final thoughts

Of the tools explored, urandom is the most versatile, so it has the broadest application. The downside is that its results are not easily reproducible, and issues with the Linux kernel version 2.6.10 generator were identified in a study by Gutterman, Zvi; Pinkas, Benny; Reinman, Tzachy (2006-03-06). Indeed, our own test here suggests it may not be as random as the other tools tested.
Personally, this has been a useful learning exercise. For one, it showed the limitations in generating and testing for (pseudo)random sequences. Indeed, Aaron Roth has suggested:
As others have mentioned, a fixed sequence is a deterministic object, but you can still meaningfully talk about how “random” it is using Kolmogorov Complexity: (Kolmogorov complexity). Intuitively, a Kolmogorov random object is one that cannot be compressed. Sequences that are drawn from truly random sources are Kolmogorov random with extremely high probability.
Unfortunately, it is not possible to compute the Kolmogorov complexity of sequences in general (it is an undecidable property of strings). However, you can still estimate it simply by trying to compress the sequence. Run it through a Zip compression engine, or anything else. If the algorithm succeeds in achieving significant compression, then this certifies that the sequence is -not- Kolmogorov random, and hence very likely was not drawn from a random source. If the compression fails, of course, it doesn’t prove that the sequence has high Kolmogorov complexity (since you are just using a heuristic algorithm, not the optimal (undecidable) compression). But at least you can certify the answer in one direction.
In light of this knowledge, let’s run the compression tests for the sequences above:
ls -l *.random
## -rw-r--r-- 1 frank frank 5000 Dec  8 14:00 awk.random
## -rw-r--r-- 1 frank frank 5000 Dec  8 14:00 bash.random
## -rw-r--r-- 1 frank frank 5000 Dec  8 14:00 urandom.random
Compress using zip:
for z in 'awk' 'bash' 'urandom'; do zip $z $z.random; done
## updating: awk.random (deflated 71%)
## updating: bash.random (deflated 72%)
## updating: urandom.random (deflated 72%)
Compare this to non-random (trend) data:
for i in $(seq 1000); do printf "0.%02d\n" $(( i % 100 )) ; done > test.nonrandom
zip test test.nonrandom 
## updating: test.nonrandom (deflated 96%)
Or just constant data:
for i in $(seq 1000); do echo 0.00 ; done > test.constant
zip constant test.constant 
## updating: test.constant (deflated 99%)
So zipping is a good, if rough, proxy for a measure of randomness.
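As a rough illustration of that proxy (a sketch; the file names sample.random and sample.constant are my own), compare how much gzip shaves off a file of random 0.nn values versus a constant file of the same size:

```shell
# Build two 5000-byte files of 1000 lines each:
# one of random two-digit values drawn from /dev/urandom, one constant.
tr -dc '0-9' < /dev/urandom | fold -w2 | head -1000 | sed 's/^/0./' > sample.random
yes 0.00 | head -1000 > sample.constant
# The random file should compress far less than the constant one.
echo "random:   $(gzip -c sample.random   | wc -c) bytes compressed"
echo "constant: $(gzip -c sample.constant | wc -c) bytes compressed"
```

The exact byte counts vary run to run, but the random file should stay several times larger after compression.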
Stay tuned for part two to discover how random data can be used in testing code.
