Not sure what should happen theoretically for the code in vseq.c, but I see the same pattern with the R generators I tried (default, Super-Duper, and L'Ecuyer) and with with bash $RANDOM using N <- 10000 X1 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = TRUE))) X2 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = TRUE))) X <- X1 + 2 ^ 15 * (X2 > 2^14) and with numbers from random.org library(random) X <- randomNumbers(N, 0, 2^16-1, col = 1) So I'm not convinced there is an issue. Best, luke On Fri, 21 Sep 2018, Steve Grubb wrote:> Hello, > > Top posting. Several people have asked about the code to replicate my > results. I have cleaned up the code to remove an x/y coordinate bias for > displaying the results directly on a 640 x 480 VGA adapter. You can find the > code here: > > http://people.redhat.com/sgrubb/files/vseq.c > > To collect R samples: > X <- runif(10000, min = 0, max = 65535) > write.table(X, file = "~/r-rand.txt", sep = "\n", row.names = FALSE) > > Then: > cat ~/r-rand.txt | ./vseq > ~/r-rand.csv > > And then to create the chart: > > library(ggplot2); > num.csv <- read.csv("~/random.csv", header=T) > qplot(X, Y, data=num.csv); > > Hope this helps sort this out. > > Best Regards, > -Steve > > On Thursday, September 20, 2018 5:09:23 PM EDT Steve Grubb wrote: >> On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote: >>> On 20/09/2018 6:59 AM, Ralf Stubner wrote: >>>> On 9/20/18 1:43 AM, Carl Boettiger wrote: >>>>> For a well-tested C algorithm, based on my reading of Lemire, the >>>>> unbiased "algorithm 3" in https://arxiv.org/abs/1805.10941 is part >>>>> already of the C standard library in OpenBSD and macOS (as >>>>> arc4random_uniform), and in the GNU standard library. Lemire also >>>>> provides C++ code in the appendix of his piece for both this and the >>>>> faster "nearly divisionless" algorithm. >>>>> >>>>> It would be excellent if any R core members were interested in >>>>> considering bindings to these algorithms as a patch, or might express >>>>> expectations for how that patch would have to operate (e.g. re >>>>> Duncan's >>>>> comment about non-integer arguments to sample size). Otherwise, an R >>>>> package binding seems like a good starting point, but I'm not the >>>>> right >>>>> volunteer. >>>> >>>> It is difficult to do this in a package, since R does not provide >>>> access >>>> to the random bits generated by the RNG. Only a float in (0,1) is >>>> available via unif_rand(). >>> >>> I believe it is safe to multiply the unif_rand() value by 2^32, and take >>> the whole number part as an unsigned 32 bit integer. Depending on the >>> RNG in use, that will give at least 25 random bits. (The low order bits >>> are the questionable ones. 25 is just a guess, not a guarantee.) >>> >>> However, if one is willing to use an external >>> >>>> RNG, it is of course possible. After reading about Lemire's work [1], I >>>> had planned to integrate such an unbiased sampling scheme into the >>>> dqrng >>>> package, which I have now started. [2] >>>> >>>> Using Duncan's example, the results look much better: >>>>> library(dqrng) >>>>> m <- (2/5)*2^32 >>>>> y <- dqsample(m, 1000000, replace = TRUE) >>>>> table(y %% 2) >>>>> >>>> 0 1 >>>> >>>> 500252 499748 >>> >>> Another useful diagnostic is >>> >>> plot(density(y[y %% 2 == 0])) >>> >>> Obviously that should give a more or less uniform density, but for >>> values near m, the default sample() gives some nice pretty pictures of >>> quite non-uniform densities. >>> >>> By the way, there are actually quite a few examples of very large m >>> besides m = (2/5)*2^32 where performance of sample() is noticeably bad. >>> You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a) >>> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 + >>> 3a)*2^32, etc. >>> >>> So perhaps I'm starting to be convinced that the default sample() should >>> be fixed. >> >> I find this discussion fascinating. I normally test random numbers in >> different languages every now and again using various methods. One simple >> check that I do is to use Michal Zalewski's method when he studied Strange >> Attractors and Initial TCP/IP Sequence Numbers: >> >> http://lcamtuf.coredump.cx/newtcp/ >> https://pdfs.semanticscholar.org/ >> adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf >> >> The technique works by mapping the dynamics of the generated numbers into a >> three-dimensional phase space. This is then plotted in a graph so that you >> can visually see if something odd is going on. >> >> I used runif(10000, min = 0, max = 65535) to get a set of numbers. This >> is the resulting plot that was generated from R's numbers using this >> technique: >> >> http://people.redhat.com/sgrubb/files/r-random.jpg >> >> And for comparison this was generated by collecting the same number of >> samples from the bash shell: >> >> http://people.redhat.com/sgrubb/files/bash-random.jpg >> >> The net result is that it shows some banding in the R generated random >> numbers where bash has uniform random numbers with no discernible pattern. >> >> Best Regards, >> -Steve >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >-- Luke Tierney Ralph E. Wareham Professor of Mathematical Sciences University of Iowa Phone: 319-335-3386 Department of Statistics and Fax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: luke-tierney at uiowa.edu Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
On 9/21/18 6:38 PM, Tierney, Luke wrote:> Not sure what should happen theoretically for the code in vseq.c, but > I see the same pattern with the R generators I tried (default, > Super-Duper, and L'Ecuyer) and with with bash $RANDOM using > > N <- 10000 > X1 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = TRUE))) > X2 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = TRUE))) > X <- X1 + 2 ^ 15 * (X2 > 2^14) > > and with numbers from random.org > > library(random) > X <- randomNumbers(N, 0, 2^16-1, col = 1) > > So I'm not convinced there is an issue.There is an issue, but it is in vseq.c. The plot I found striking was this: http://people.redhat.com/sgrubb/files/r-random.jpg It shows a scatter plot that is bounded to some rectangle where the upper right and lower left corner are empty. Roughly speaking, X and Y correspond to *consecutive differences* between random draws. It is obvious that differences between random draws are bounded by the range of the RNG, and that there cannot be two *differences in a row* that are close to the maximum (or minimum). Hence the expected shape for such a scatter plot is a rectangle with two corners being forbidden. Within the allowed region, there should be no structure what so ever (given enough draws). And that was striking about the above picture: It showed clear vertical bands which should not be there. MT does fail some statistical tests, but it cannot be brought down that easily. Interestingly, I first used Dirk's C++ function for convenience, and that did *not* show these bands. But when I compiled vseq.c I could reproduce this. To cut this short: There is an error in vseq.c when the numbers are read in: tmp = strtoul(buf, NULL, 16); The third argument to strtoul is the base in which the numbers should be interpreted. However, R has written numbers with base 10. Those can be interpreted as base 16, but they will mean something different. Once one changes the above line to tmp = strtoul(buf, NULL, 10); the bands do disappear. cheerio ralf -- Ralf Stubner Senior Software Engineer / Trainer daqana GmbH Dortustra?e 48 14467 Potsdam T: +49 331 23 61 93 11 F: +49 331 23 61 93 90 M: +49 162 20 91 196 Mail: ralf.stubner at daqana.com Sitz: Potsdam Register: AG Potsdam HRB 27966 P Ust.-IdNr.: DE300072622 Gesch?ftsf?hrer: Prof. Dr. Dr. Karl-Kuno Kunze -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: OpenPGP digital signature URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20180921/b4bc371f/attachment.sig>
On Friday, September 21, 2018 12:38:15 PM EDT Tierney, Luke wrote:> Not sure what should happen theoretically for the code in vseq.c, but > I see the same pattern with the R generators I tried (default, > Super-Duper, and L'Ecuyer) and with with bash $RANDOM using > > N <- 10000 > X1 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern > TRUE))) X2 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", > intern = TRUE))) X <- X1 + 2 ^ 15 * (X2 > 2^14) > > and with numbers from random.org > > library(random) > X <- randomNumbers(N, 0, 2^16-1, col = 1) > > So I'm not convinced there is an issue.Indeed. I found the issue: Hex vs Decimal output. If the bash command was printf "0x%08x\n" $RANDOM, then everything is fine. So, the code in vseq.c needs to be strtoul(buf, NULL, 0); so that it handles Hex, Decimal, or Octal. With this correction to vseq all is fine. No banding in any generated image. Corrected program is uploaded. So, the technique works. When the numbers had a problem, due to a conversion error, it showed a distinct pattern. :-) Best Regards, -Steve
On Friday, September 21, 2018 5:28:38 PM EDT Ralf Stubner wrote:> On 9/21/18 6:38 PM, Tierney, Luke wrote: > > Not sure what should happen theoretically for the code in vseq.c, but > > I see the same pattern with the R generators I tried (default, > > Super-Duper, and L'Ecuyer) and with with bash $RANDOM using > > > > N <- 10000 > > X1 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern > > TRUE))) X2 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", > > intern = TRUE))) X <- X1 + 2 ^ 15 * (X2 > 2^14) > > > > and with numbers from random.org > > > > library(random) > > X <- randomNumbers(N, 0, 2^16-1, col = 1) > > > > So I'm not convinced there is an issue. > > There is an issue, but it is in vseq.c. > > The plot I found striking was this: > > http://people.redhat.com/sgrubb/files/r-random.jpg > > It shows a scatter plot that is bounded to some rectangle where the > upper right and lower left corner are empty. Roughly speaking, X and Y > correspond to *consecutive differences* between random draws. It is > obvious that differences between random draws are bounded by the range > of the RNG, and that there cannot be two *differences in a row* that are > close to the maximum (or minimum). Hence the expected shape for such a > scatter plot is a rectangle with two corners being forbidden. > > Within the allowed region, there should be no structure what so ever > (given enough draws). And that was striking about the above picture: It > showed clear vertical bands which should not be there. MT does fail some > statistical tests, but it cannot be brought down that easily. > > Interestingly, I first used Dirk's C++ function for convenience, and > that did *not* show these bands. But when I compiled vseq.c I could > reproduce this. To cut this short: There is an error in vseq.c when the > numbers are read in: > > tmp = strtoul(buf, NULL, 16); > > The third argument to strtoul is the base in which the numbers should be > interpreted. However, R has written numbers with base 10. Those can be > interpreted as base 16, but they will mean something different. Once one > changes the above line to > > tmp = strtoul(buf, NULL, 10); > > the bands do disappear.Yes. I just discovered the problem also. I was looking at how my bash script worked fine and how the example Luke gave had a problem. I was using the print command to keep things in hex. A corrected copy was uploaded so no one else runs across this. Best Regards, -Steve