Hello, On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:> On 20/09/2018 6:59 AM, Ralf Stubner wrote: > > On 9/20/18 1:43 AM, Carl Boettiger wrote: > >> For a well-tested C algorithm, based on my reading of Lemire, the > >> unbiased "algorithm 3" in https://arxiv.org/abs/1805.10941 is part > >> already of the C standard library in OpenBSD and macOS (as > >> arc4random_uniform), and in the GNU standard library. Lemire also > >> provides C++ code in the appendix of his piece for both this and the > >> faster "nearly divisionless" algorithm. > >> > >> It would be excellent if any R core members were interested in > >> considering bindings to these algorithms as a patch, or might express > >> expectations for how that patch would have to operate (e.g. re Duncan's > >> comment about non-integer arguments to sample size). Otherwise, an R > >> package binding seems like a good starting point, but I'm not the right > >> volunteer. > > > > It is difficult to do this in a package, since R does not provide access > > to the random bits generated by the RNG. Only a float in (0,1) is > > available via unif_rand(). > > I believe it is safe to multiply the unif_rand() value by 2^32, and take > the whole number part as an unsigned 32 bit integer. Depending on the > RNG in use, that will give at least 25 random bits. (The low order bits > are the questionable ones. 25 is just a guess, not a guarantee.) > > However, if one is willing to use an external > > > RNG, it is of course possible. After reading about Lemire's work [1], I > > had planned to integrate such an unbiased sampling scheme into the dqrng > > package, which I have now started. [2] > > > > Using Duncan's example, the results look much better: > >> library(dqrng) > >> m <- (2/5)*2^32 > >> y <- dqsample(m, 1000000, replace = TRUE) > >> table(y %% 2) > >> > > 0 1 > > > > 500252 499748 > > Another useful diagnostic is > > plot(density(y[y %% 2 == 0])) > > Obviously that should give a more or less uniform density, but for > values near m, the default sample() gives some nice pretty pictures of > quite non-uniform densities. > > By the way, there are actually quite a few examples of very large m > besides m = (2/5)*2^32 where performance of sample() is noticeably bad. > You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a) > * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 + > 3a)*2^32, etc. > > So perhaps I'm starting to be convinced that the default sample() should > be fixed.I find this discussion fascinating. I normally test random numbers in different languages every now and again using various methods. One simple check that I do is to use Michal Zalewski's method when he studied Strange Attractors and Initial TCP/IP Sequence Numbers: http://lcamtuf.coredump.cx/newtcp/ https://pdfs.semanticscholar.org/ adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf The technique works by mapping the dynamics of the generated numbers into a three-dimensional phase space. This is then plotted in a graph so that you can visually see if something odd is going on. I used runif(10000, min = 0, max = 65535) to get a set of numbers. This is the resulting plot that was generated from R's numbers using this technique: http://people.redhat.com/sgrubb/files/r-random.jpg And for comparison this was generated by collecting the same number of samples from the bash shell: http://people.redhat.com/sgrubb/files/bash-random.jpg The net result is that it shows some banding in the R generated random numbers where bash has uniform random numbers with no discernible pattern. Best Regards, -Steve
Hello, Top posting. Several people have asked about the code to replicate my results. I have cleaned up the code to remove an x/y coordinate bias for displaying the results directly on a 640 x 480 VGA adapter. You can find the code here: http://people.redhat.com/sgrubb/files/vseq.c To collect R samples: X <- runif(10000, min = 0, max = 65535) write.table(X, file = "~/r-rand.txt", sep = "\n", row.names = FALSE) Then: cat ~/r-rand.txt | ./vseq > ~/r-rand.csv And then to create the chart: library(ggplot2); num.csv <- read.csv("~/random.csv", header=T) qplot(X, Y, data=num.csv); Hope this helps sort this out. Best Regards, -Steve On Thursday, September 20, 2018 5:09:23 PM EDT Steve Grubb wrote:> On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote: > > On 20/09/2018 6:59 AM, Ralf Stubner wrote: > > > On 9/20/18 1:43 AM, Carl Boettiger wrote: > > >> For a well-tested C algorithm, based on my reading of Lemire, the > > >> unbiased "algorithm 3" in https://arxiv.org/abs/1805.10941 is part > > >> already of the C standard library in OpenBSD and macOS (as > > >> arc4random_uniform), and in the GNU standard library. Lemire also > > >> provides C++ code in the appendix of his piece for both this and the > > >> faster "nearly divisionless" algorithm. > > >> > > >> It would be excellent if any R core members were interested in > > >> considering bindings to these algorithms as a patch, or might express > > >> expectations for how that patch would have to operate (e.g. re > > >> Duncan's > > >> comment about non-integer arguments to sample size). Otherwise, an R > > >> package binding seems like a good starting point, but I'm not the > > >> right > > >> volunteer. > > > > > > It is difficult to do this in a package, since R does not provide > > > access > > > to the random bits generated by the RNG. Only a float in (0,1) is > > > available via unif_rand(). > > > > I believe it is safe to multiply the unif_rand() value by 2^32, and take > > the whole number part as an unsigned 32 bit integer. Depending on the > > RNG in use, that will give at least 25 random bits. (The low order bits > > are the questionable ones. 25 is just a guess, not a guarantee.) > > > > However, if one is willing to use an external > > > > > RNG, it is of course possible. After reading about Lemire's work [1], I > > > had planned to integrate such an unbiased sampling scheme into the > > > dqrng > > > package, which I have now started. [2] > > > > > > Using Duncan's example, the results look much better: > > >> library(dqrng) > > >> m <- (2/5)*2^32 > > >> y <- dqsample(m, 1000000, replace = TRUE) > > >> table(y %% 2) > > >> > > > 0 1 > > > > > > 500252 499748 > > > > Another useful diagnostic is > > > > plot(density(y[y %% 2 == 0])) > > > > Obviously that should give a more or less uniform density, but for > > values near m, the default sample() gives some nice pretty pictures of > > quite non-uniform densities. > > > > By the way, there are actually quite a few examples of very large m > > besides m = (2/5)*2^32 where performance of sample() is noticeably bad. > > You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a) > > * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 + > > 3a)*2^32, etc. > > > > So perhaps I'm starting to be convinced that the default sample() should > > be fixed. > > I find this discussion fascinating. I normally test random numbers in > different languages every now and again using various methods. One simple > check that I do is to use Michal Zalewski's method when he studied Strange > Attractors and Initial TCP/IP Sequence Numbers: > > http://lcamtuf.coredump.cx/newtcp/ > https://pdfs.semanticscholar.org/ > adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf > > The technique works by mapping the dynamics of the generated numbers into a > three-dimensional phase space. This is then plotted in a graph so that you > can visually see if something odd is going on. > > I used runif(10000, min = 0, max = 65535) to get a set of numbers. This > is the resulting plot that was generated from R's numbers using this > technique: > > http://people.redhat.com/sgrubb/files/r-random.jpg > > And for comparison this was generated by collecting the same number of > samples from the bash shell: > > http://people.redhat.com/sgrubb/files/bash-random.jpg > > The net result is that it shows some banding in the R generated random > numbers where bash has uniform random numbers with no discernible pattern. > > Best Regards, > -Steve > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
On 21 September 2018 at 09:50, Steve Grubb wrote: | Hello, | | Top posting. Several people have asked about the code to replicate my | results. I have cleaned up the code to remove an x/y coordinate bias for | displaying the results directly on a 640 x 480 VGA adapter. You can find the | code here: | | http://people.redhat.com/sgrubb/files/vseq.c | | To collect R samples: | X <- runif(10000, min = 0, max = 65535) | write.table(X, file = "~/r-rand.txt", sep = "\n", row.names = FALSE) | | Then: | cat ~/r-rand.txt | ./vseq > ~/r-rand.csv | | And then to create the chart: | | library(ggplot2); | num.csv <- read.csv("~/random.csv", header=T) | qplot(X, Y, data=num.csv); | | Hope this helps sort this out. Here is a simpler version in one file, combining the operations in a bit of C++ and also an auto-executed piece of R to create the data, call the transformation you created, and plot. Store as eg /tmp/stevegrubb.cpp and then call Rcpp::sourceCpp("/tmp/stevegrubb.cpp") Dirk --- snip -------------------------------------------------------------------- #include <Rcpp.h> #define SCALING 0.5 // [[Rcpp::export]] Rcpp::NumericMatrix stevePlot(Rcpp::NumericVector itable) { size_t cnt = itable.size(); Rcpp::NumericMatrix M(cnt,2); size_t num = 0; double nth, n1th = 0, n2th = 0, n3th = 0; double x, y; double usex, usey, pha = 0; usex = sin(pha); usey = cos(pha); while (num < cnt) { double x1, y1, z1; nth = itable[num]; x1 = (nth - n1th) * SCALING; y1 = (n1th - n2th) * SCALING; z1 = (n2th - n3th) * SCALING; x = (x1*usey+z1*usex); y = y1 + (z1*usey-x1*usex) / 2; if (num > 4) { M(num,0) = x; M(num,1) = y; } else { M(num,0) = M(num,1) = NA_REAL; } num++; n3th=n2th; n2th=n1th; n1th=nth; } Rcpp::colnames(M) = Rcpp::CharacterVector::create("X", "Y"); return(M); } /*** R library(ggplot2); Xin <- runif(10000, min = 0, max = 65535) Xout <- stevePlot(Xin); qplot(X, Y, data=as.data.frame(Xout)); */ --- snip -------------------------------------------------------------------- -- http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
Slightly nicer version: --- snip -------------------------------------------------------------------- #include <Rcpp.h> // [[Rcpp::export]] Rcpp::DataFrame stevePlot(Rcpp::NumericVector itable) { size_t cnt = itable.size(), num = 0; double nth, n1th = 0, n2th = 0, n3th = 0; double x, y, usex, usey, pha = 0; std::vector<double> xv, yv; const double scaling = 0.5; usex = sin(pha); usey = cos(pha); while (num < cnt) { double x1, y1, z1; nth = itable[num]; num++; x1 = (nth - n1th) * scaling; y1 = (n1th - n2th) * scaling; z1 = (n2th - n3th) * scaling; x = (x1*usey+z1*usex); y = y1 + (z1*usey-x1*usex) / 2; if (num > 4) { xv.push_back(x); yv.push_back(y); } n3th=n2th; n2th=n1th; n1th=nth; } return(Rcpp::DataFrame::create(Rcpp::Named("X") = xv, Rcpp::Named("Y") = yv)); } /*** R library(ggplot2) Xin <- runif(10000, min = 0, max = 65535) Xout <- stevePlot(Xin) qplot(X, Y, data=Xout) */ --- snip -------------------------------------------------------------------- Still just Rcpp::sourceCpp("filenamehere.cpp") it. Dirk -- http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
Not sure what should happen theoretically for the code in vseq.c, but I see the same pattern with the R generators I tried (default, Super-Duper, and L'Ecuyer) and with with bash $RANDOM using N <- 10000 X1 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = TRUE))) X2 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = TRUE))) X <- X1 + 2 ^ 15 * (X2 > 2^14) and with numbers from random.org library(random) X <- randomNumbers(N, 0, 2^16-1, col = 1) So I'm not convinced there is an issue. Best, luke On Fri, 21 Sep 2018, Steve Grubb wrote:> Hello, > > Top posting. Several people have asked about the code to replicate my > results. I have cleaned up the code to remove an x/y coordinate bias for > displaying the results directly on a 640 x 480 VGA adapter. You can find the > code here: > > http://people.redhat.com/sgrubb/files/vseq.c > > To collect R samples: > X <- runif(10000, min = 0, max = 65535) > write.table(X, file = "~/r-rand.txt", sep = "\n", row.names = FALSE) > > Then: > cat ~/r-rand.txt | ./vseq > ~/r-rand.csv > > And then to create the chart: > > library(ggplot2); > num.csv <- read.csv("~/random.csv", header=T) > qplot(X, Y, data=num.csv); > > Hope this helps sort this out. > > Best Regards, > -Steve > > On Thursday, September 20, 2018 5:09:23 PM EDT Steve Grubb wrote: >> On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote: >>> On 20/09/2018 6:59 AM, Ralf Stubner wrote: >>>> On 9/20/18 1:43 AM, Carl Boettiger wrote: >>>>> For a well-tested C algorithm, based on my reading of Lemire, the >>>>> unbiased "algorithm 3" in https://arxiv.org/abs/1805.10941 is part >>>>> already of the C standard library in OpenBSD and macOS (as >>>>> arc4random_uniform), and in the GNU standard library. Lemire also >>>>> provides C++ code in the appendix of his piece for both this and the >>>>> faster "nearly divisionless" algorithm. >>>>> >>>>> It would be excellent if any R core members were interested in >>>>> considering bindings to these algorithms as a patch, or might express >>>>> expectations for how that patch would have to operate (e.g. re >>>>> Duncan's >>>>> comment about non-integer arguments to sample size). Otherwise, an R >>>>> package binding seems like a good starting point, but I'm not the >>>>> right >>>>> volunteer. >>>> >>>> It is difficult to do this in a package, since R does not provide >>>> access >>>> to the random bits generated by the RNG. Only a float in (0,1) is >>>> available via unif_rand(). >>> >>> I believe it is safe to multiply the unif_rand() value by 2^32, and take >>> the whole number part as an unsigned 32 bit integer. Depending on the >>> RNG in use, that will give at least 25 random bits. (The low order bits >>> are the questionable ones. 25 is just a guess, not a guarantee.) >>> >>> However, if one is willing to use an external >>> >>>> RNG, it is of course possible. After reading about Lemire's work [1], I >>>> had planned to integrate such an unbiased sampling scheme into the >>>> dqrng >>>> package, which I have now started. [2] >>>> >>>> Using Duncan's example, the results look much better: >>>>> library(dqrng) >>>>> m <- (2/5)*2^32 >>>>> y <- dqsample(m, 1000000, replace = TRUE) >>>>> table(y %% 2) >>>>> >>>> 0 1 >>>> >>>> 500252 499748 >>> >>> Another useful diagnostic is >>> >>> plot(density(y[y %% 2 == 0])) >>> >>> Obviously that should give a more or less uniform density, but for >>> values near m, the default sample() gives some nice pretty pictures of >>> quite non-uniform densities. >>> >>> By the way, there are actually quite a few examples of very large m >>> besides m = (2/5)*2^32 where performance of sample() is noticeably bad. >>> You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a) >>> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 + >>> 3a)*2^32, etc. >>> >>> So perhaps I'm starting to be convinced that the default sample() should >>> be fixed. >> >> I find this discussion fascinating. I normally test random numbers in >> different languages every now and again using various methods. One simple >> check that I do is to use Michal Zalewski's method when he studied Strange >> Attractors and Initial TCP/IP Sequence Numbers: >> >> http://lcamtuf.coredump.cx/newtcp/ >> https://pdfs.semanticscholar.org/ >> adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf >> >> The technique works by mapping the dynamics of the generated numbers into a >> three-dimensional phase space. This is then plotted in a graph so that you >> can visually see if something odd is going on. >> >> I used runif(10000, min = 0, max = 65535) to get a set of numbers. This >> is the resulting plot that was generated from R's numbers using this >> technique: >> >> http://people.redhat.com/sgrubb/files/r-random.jpg >> >> And for comparison this was generated by collecting the same number of >> samples from the bash shell: >> >> http://people.redhat.com/sgrubb/files/bash-random.jpg >> >> The net result is that it shows some banding in the R generated random >> numbers where bash has uniform random numbers with no discernible pattern. >> >> Best Regards, >> -Steve >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >-- Luke Tierney Ralph E. Wareham Professor of Mathematical Sciences University of Iowa Phone: 319-335-3386 Department of Statistics and Fax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: luke-tierney at uiowa.edu Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu