Before the next release we really should to sort out the bias issue in sample() reported by Ottoboni and Stark in https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and filed aa a bug report by Duncan Murdoch at https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494. Here are two examples of bad behavior through current R-devel: set.seed(123) m <- (2/5) * 2^32 x <- sample(m, 1000000, replace = TRUE) table(x %% 2, x > m / 2) ## ## FALSE TRUE ## 0 300620 198792 ## 1 200196 300392 table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2) ## ## 0 1 ## 429054 570946 I committed a modification to R_unif_index to address this by generating random bits (blocks of 16) and rejection sampling, but for now this is only enabled if the environment variable R_NEW_SAMPLE is set before the first call. Some things still needed: - someone to look over the change and see if there are any issues - adjustment of RNGkind to allowing the old behavior to be selected - make the new behavior the default - adjust documentation - ??? Unfortunately I don't have enough free cycles to do this, but I can help if someone else can take the lead. There are two other places I found that might suffer from the same issue, in walker_ProbSampleReplace (pointed out bu O & S) and in src/nmath/wilcox.c. Both can be addressed by using R_unif_index. I have done that for walker_ProbSampleReplace, but the wilcox change might need adjusting to support the standalone math library and I don't feel confident enough I'd get that right. Best, luke -- Luke Tierney Ralph E. Wareham Professor of Mathematical Sciences University of Iowa Phone: 319-335-3386 Department of Statistics and Fax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: luke-tierney at uiowa.edu Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
Luke, I'm happy to help with this. Its great to see this get tackled (I've cc'ed Kelli Ottoboni who helped flag this issue). I can prepare a patch for the RNGkind related stuff and the doc update. As for ???, what are your (and others') thoughts about the possibility of a) a reproducibility API which takes either an R version (or maybe alternatively a date) and sets the RNGkind to the default for that version/date, and/or b) that sessionInfo be modified to capture (and display) the RNGkind in effect. Best, ~G On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <luke-tierney at uiowa.edu> wrote:> > Before the next release we really should to sort out the bias issue in > sample() reported by Ottoboni and Stark in > https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and > filed aa a bug report by Duncan Murdoch at > https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494. > > Here are two examples of bad behavior through current R-devel: > > set.seed(123) > m <- (2/5) * 2^32 > x <- sample(m, 1000000, replace = TRUE) > table(x %% 2, x > m / 2) > ## > ## FALSE TRUE > ## 0 300620 198792 > ## 1 200196 300392 > > table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2) > ## > ## 0 1 > ## 429054 570946 > > I committed a modification to R_unif_index to address this by > generating random bits (blocks of 16) and rejection sampling, but for > now this is only enabled if the environment variable R_NEW_SAMPLE is > set before the first call. > > Some things still needed: > > - someone to look over the change and see if there are any issues > - adjustment of RNGkind to allowing the old behavior to be selected > - make the new behavior the default > - adjust documentation > - ??? > > Unfortunately I don't have enough free cycles to do this, but I can > help if someone else can take the lead. > > There are two other places I found that might suffer from the same > issue, in walker_ProbSampleReplace (pointed out bu O & S) and in > src/nmath/wilcox.c. Both can be addressed by using R_unif_index. I > have done that for walker_ProbSampleReplace, but the wilcox change > might need adjusting to support the standalone math library and I > don't feel confident enough I'd get that right. > > Best, > > luke > > > -- > Luke Tierney > Ralph E. Wareham Professor of Mathematical Sciences > University of Iowa Phone: 319-335-3386 > Department of Statistics and Fax: 319-335-3017 > Actuarial Science > 241 Schaeffer Hall email: luke-tierney at uiowa.edu > Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >[[alternative HTML version deleted]]
Gabe As mentioned on Twitter, I think the following behavior should be fixed as part of the upcoming changes: R.version.string ## [1] "R Under development (unstable) (2019-02-25 r76160)" .Machine$double.digits ## [1] 53 set.seed(123) RNGkind() ## [1] "Mersenne-Twister" "Inversion"??????? "Rejection" length(table(runif(1e6))) ## [1] 999863 I don't expect any collisions when using Mersenne-Twister to generate a million floating point values. I'm not sure what causes this behavior, but it's documented in ?Random: "Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)" The "Wichman-Hill" bit is interesting: RNGkind("Wichmann-Hill") length(table(runif(1e6))) ## [1] 1000000 length(table(runif(1e6))) ## [1] 1000000 Mersenne-Twister has a much much larger periodicity than Wichmann-Hill, it would be great to see the above behavior also for Mersenne-Twister. Thanks for considering. Best regards Kirill On 20.02.19 08:01, Gabriel Becker wrote:> Luke, > > I'm happy to help with this. Its great to see this get tackled (I've cc'ed > Kelli Ottoboni who helped flag this issue). > > I can prepare a patch for the RNGkind related stuff and the doc update. > > As for ???, what are your (and others') thoughts about the possibility of > a) a reproducibility API which takes either an R version (or maybe > alternatively a date) and sets the RNGkind to the default for that > version/date, and/or b) that sessionInfo be modified to capture (and > display) the RNGkind in effect. > > Best, > ~G > > > On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <luke-tierney at uiowa.edu> > wrote: > >> Before the next release we really should to sort out the bias issue in >> sample() reported by Ottoboni and Stark in >> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and >> filed aa a bug report by Duncan Murdoch at >> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494. >> >> Here are two examples of bad behavior through current R-devel: >> >> set.seed(123) >> m <- (2/5) * 2^32 >> x <- sample(m, 1000000, replace = TRUE) >> table(x %% 2, x > m / 2) >> ## >> ## FALSE TRUE >> ## 0 300620 198792 >> ## 1 200196 300392 >> >> table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2) >> ## >> ## 0 1 >> ## 429054 570946 >> >> I committed a modification to R_unif_index to address this by >> generating random bits (blocks of 16) and rejection sampling, but for >> now this is only enabled if the environment variable R_NEW_SAMPLE is >> set before the first call. >> >> Some things still needed: >> >> - someone to look over the change and see if there are any issues >> - adjustment of RNGkind to allowing the old behavior to be selected >> - make the new behavior the default >> - adjust documentation >> - ??? >> >> Unfortunately I don't have enough free cycles to do this, but I can >> help if someone else can take the lead. >> >> There are two other places I found that might suffer from the same >> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in >> src/nmath/wilcox.c. Both can be addressed by using R_unif_index. I >> have done that for walker_ProbSampleReplace, but the wilcox change >> might need adjusting to support the standalone math library and I >> don't feel confident enough I'd get that right. >> >> Best, >> >> luke >> >> >> -- >> Luke Tierney >> Ralph E. Wareham Professor of Mathematical Sciences >> University of Iowa Phone: 319-335-3386 >> Department of Statistics and Fax: 319-335-3017 >> Actuarial Science >> 241 Schaeffer Hall email: luke-tierney at uiowa.edu >> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > [[alternative HTML version deleted]] > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel