Gabe As mentioned on Twitter, I think the following behavior should be fixed as part of the upcoming changes: R.version.string ## [1] "R Under development (unstable) (2019-02-25 r76160)" .Machine$double.digits ## [1] 53 set.seed(123) RNGkind() ## [1] "Mersenne-Twister" "Inversion"??????? "Rejection" length(table(runif(1e6))) ## [1] 999863 I don't expect any collisions when using Mersenne-Twister to generate a million floating point values. I'm not sure what causes this behavior, but it's documented in ?Random: "Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)" The "Wichman-Hill" bit is interesting: RNGkind("Wichmann-Hill") length(table(runif(1e6))) ## [1] 1000000 length(table(runif(1e6))) ## [1] 1000000 Mersenne-Twister has a much much larger periodicity than Wichmann-Hill, it would be great to see the above behavior also for Mersenne-Twister. Thanks for considering. Best regards Kirill On 20.02.19 08:01, Gabriel Becker wrote:> Luke, > > I'm happy to help with this. Its great to see this get tackled (I've cc'ed > Kelli Ottoboni who helped flag this issue). > > I can prepare a patch for the RNGkind related stuff and the doc update. > > As for ???, what are your (and others') thoughts about the possibility of > a) a reproducibility API which takes either an R version (or maybe > alternatively a date) and sets the RNGkind to the default for that > version/date, and/or b) that sessionInfo be modified to capture (and > display) the RNGkind in effect. > > Best, > ~G > > > On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <luke-tierney at uiowa.edu> > wrote: > >> Before the next release we really should to sort out the bias issue in >> sample() reported by Ottoboni and Stark in >> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and >> filed aa a bug report by Duncan Murdoch at >> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494. >> >> Here are two examples of bad behavior through current R-devel: >> >> set.seed(123) >> m <- (2/5) * 2^32 >> x <- sample(m, 1000000, replace = TRUE) >> table(x %% 2, x > m / 2) >> ## >> ## FALSE TRUE >> ## 0 300620 198792 >> ## 1 200196 300392 >> >> table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2) >> ## >> ## 0 1 >> ## 429054 570946 >> >> I committed a modification to R_unif_index to address this by >> generating random bits (blocks of 16) and rejection sampling, but for >> now this is only enabled if the environment variable R_NEW_SAMPLE is >> set before the first call. >> >> Some things still needed: >> >> - someone to look over the change and see if there are any issues >> - adjustment of RNGkind to allowing the old behavior to be selected >> - make the new behavior the default >> - adjust documentation >> - ??? >> >> Unfortunately I don't have enough free cycles to do this, but I can >> help if someone else can take the lead. >> >> There are two other places I found that might suffer from the same >> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in >> src/nmath/wilcox.c. Both can be addressed by using R_unif_index. I >> have done that for walker_ProbSampleReplace, but the wilcox change >> might need adjusting to support the standalone math library and I >> don't feel confident enough I'd get that right. >> >> Best, >> >> luke >> >> >> -- >> Luke Tierney >> Ralph E. Wareham Professor of Mathematical Sciences >> University of Iowa Phone: 319-335-3386 >> Department of Statistics and Fax: 319-335-3017 >> Actuarial Science >> 241 Schaeffer Hall email: luke-tierney at uiowa.edu >> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > [[alternative HTML version deleted]] > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
Kirill, I think some level of collision is actually expected! R uses a 32bit MT that can produce 2^32 different doubles. The probability for a collision within a million draws is> pbirthday(1e6, classes = 2^32)[1] 1 Greetings Ralf On 26.02.19 07:06, Kirill M?ller wrote:> Gabe > > > As mentioned on Twitter, I think the following behavior should be fixed > as part of the upcoming changes: > > R.version.string > ## [1] "R Under development (unstable) (2019-02-25 r76160)" > .Machine$double.digits > ## [1] 53 > set.seed(123) > RNGkind() > ## [1] "Mersenne-Twister" "Inversion"??????? "Rejection" > length(table(runif(1e6))) > ## [1] 999863 > > I don't expect any collisions when using Mersenne-Twister to generate a > million floating point values. I'm not sure what causes this behavior, > but it's documented in ?Random: > > "Do not rely on randomness of low-order bits from RNGs. Most of the > supplied uniform generators return 32-bit integer values that are > converted to doubles, so they take at most 2^32 distinct values and long > runs will return duplicated values (Wichmann-Hill is the exception, and > all give at least 30 varying bits.)" > > The "Wichman-Hill" bit is interesting: > > RNGkind("Wichmann-Hill") > length(table(runif(1e6))) > ## [1] 1000000 > length(table(runif(1e6))) > ## [1] 1000000 > > Mersenne-Twister has a much much larger periodicity than Wichmann-Hill, > it would be great to see the above behavior also for Mersenne-Twister. > Thanks for considering. > > > Best regards > > Kirill > > > On 20.02.19 08:01, Gabriel Becker wrote: >> Luke, >> >> I'm happy to help with this. Its great to see this get tackled (I've >> cc'ed >> Kelli Ottoboni who helped flag this issue). >> >> I can prepare a patch for the RNGkind related stuff and the doc update. >> >> As for ???, what are your (and others') thoughts about the possibility of >> a) a reproducibility API which takes either an R version (or maybe >> alternatively a date) and sets the RNGkind to the default for that >> version/date, and/or b) that sessionInfo be modified to capture (and >> display) the RNGkind in effect. >> >> Best, >> ~G >> >> >> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <luke-tierney at uiowa.edu> >> wrote: >> >>> Before the next release we really should to sort out the bias issue in >>> sample() reported by Ottoboni and Stark in >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and >>> filed aa a bug report by Duncan Murdoch at >>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494. >>> >>> Here are two examples of bad behavior through current R-devel: >>> >>> ????? set.seed(123) >>> ????? m <- (2/5) * 2^32 >>> ????? x <- sample(m, 1000000, replace = TRUE) >>> ????? table(x %% 2, x > m / 2) >>> ????? ## >>> ????? ##??? FALSE?? TRUE >>> ????? ## 0 300620 198792 >>> ????? ## 1 200196 300392 >>> >>> ????? table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2) >>> ????? ## >>> ????? ##????? 0????? 1 >>> ????? ## 429054 570946 >>> >>> I committed a modification to R_unif_index to address this by >>> generating random bits (blocks of 16) and rejection sampling, but for >>> now this is only enabled if the environment variable R_NEW_SAMPLE is >>> set before the first call. >>> >>> Some things still needed: >>> >>> - someone to look over the change and see if there are any issues >>> - adjustment of RNGkind to allowing the old behavior to be selected >>> - make the new behavior the default >>> - adjust documentation >>> - ??? >>> >>> Unfortunately I don't have enough free cycles to do this, but I can >>> help if someone else can take the lead. >>> >>> There are two other places I found that might suffer from the same >>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in >>> src/nmath/wilcox.c.? Both can be addressed by using R_unif_index. I >>> have done that for walker_ProbSampleReplace, but the wilcox change >>> might need adjusting to support the standalone math library and I >>> don't feel confident enough I'd get that right. >>> >>> Best, >>> >>> luke >>> >>> >>> -- >>> Luke Tierney >>> Ralph E. Wareham Professor of Mathematical Sciences >>> University of Iowa????????????????? Phone:???????????? 319-335-3386 >>> Department of Statistics and??????? Fax:?????????????? 319-335-3017 >>> ???? Actuarial Science >>> 241 Schaeffer Hall????????????????? email:?? luke-tierney at uiowa.edu >>> Iowa City, IA 52242???????????????? WWW:? http://www.stat.uiowa.edu >>> >>> ______________________________________________ >>> R-devel at r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >>> >> ????[[alternative HTML version deleted]] >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel-- Ralf Stubner Senior Software Engineer / Trainer daqana GmbH Dortustra?e 48 14467 Potsdam T: +49 331 23 61 93 11 F: +49 331 23 61 93 90 M: +49 162 20 91 196 Mail: ralf.stubner at daqana.com Sitz: Potsdam Register: AG Potsdam HRB 27966 Ust.-IdNr.: DE300072622 Gesch?ftsf?hrer: Dr.-Ing. Stefan Knirsch, Prof. Dr. Dr. Karl-Kuno Kunze -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: OpenPGP digital signature URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20190226/2b4562c2/attachment.sig>
Ralf I don't doubt this is expected with the current implementation, I doubt the implementation is desirable. Suggesting to turn this to pbirthday(1e6, classes = 2^53) ## [1] 5.550956e-05 (which is still non-zero, but much less likely to cause confusion.) Best regards Kirill On 26.02.19 10:18, Ralf Stubner wrote:> Kirill, > > I think some level of collision is actually expected! R uses a 32bit MT > that can produce 2^32 different doubles. The probability for a collision > within a million draws is > >> pbirthday(1e6, classes = 2^32) > [1] 1 > > Greetings > Ralf > > > On 26.02.19 07:06, Kirill M?ller wrote: >> Gabe >> >> >> As mentioned on Twitter, I think the following behavior should be fixed >> as part of the upcoming changes: >> >> R.version.string >> ## [1] "R Under development (unstable) (2019-02-25 r76160)" >> .Machine$double.digits >> ## [1] 53 >> set.seed(123) >> RNGkind() >> ## [1] "Mersenne-Twister" "Inversion"??????? "Rejection" >> length(table(runif(1e6))) >> ## [1] 999863 >> >> I don't expect any collisions when using Mersenne-Twister to generate a >> million floating point values. I'm not sure what causes this behavior, >> but it's documented in ?Random: >> >> "Do not rely on randomness of low-order bits from RNGs. Most of the >> supplied uniform generators return 32-bit integer values that are >> converted to doubles, so they take at most 2^32 distinct values and long >> runs will return duplicated values (Wichmann-Hill is the exception, and >> all give at least 30 varying bits.)" >> >> The "Wichman-Hill" bit is interesting: >> >> RNGkind("Wichmann-Hill") >> length(table(runif(1e6))) >> ## [1] 1000000 >> length(table(runif(1e6))) >> ## [1] 1000000 >> >> Mersenne-Twister has a much much larger periodicity than Wichmann-Hill, >> it would be great to see the above behavior also for Mersenne-Twister. >> Thanks for considering. >> >> >> Best regards >> >> Kirill >> >> >> On 20.02.19 08:01, Gabriel Becker wrote: >>> Luke, >>> >>> I'm happy to help with this. Its great to see this get tackled (I've >>> cc'ed >>> Kelli Ottoboni who helped flag this issue). >>> >>> I can prepare a patch for the RNGkind related stuff and the doc update. >>> >>> As for ???, what are your (and others') thoughts about the possibility of >>> a) a reproducibility API which takes either an R version (or maybe >>> alternatively a date) and sets the RNGkind to the default for that >>> version/date, and/or b) that sessionInfo be modified to capture (and >>> display) the RNGkind in effect. >>> >>> Best, >>> ~G >>> >>> >>> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke <luke-tierney at uiowa.edu> >>> wrote: >>> >>>> Before the next release we really should to sort out the bias issue in >>>> sample() reported by Ottoboni and Stark in >>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and >>>> filed aa a bug report by Duncan Murdoch at >>>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494. >>>> >>>> Here are two examples of bad behavior through current R-devel: >>>> >>>> ????? set.seed(123) >>>> ????? m <- (2/5) * 2^32 >>>> ????? x <- sample(m, 1000000, replace = TRUE) >>>> ????? table(x %% 2, x > m / 2) >>>> ????? ## >>>> ????? ##??? FALSE?? TRUE >>>> ????? ## 0 300620 198792 >>>> ????? ## 1 200196 300392 >>>> >>>> ????? table(sample(2/7 * 2^32, 1000000, replace = TRUE) %% 2) >>>> ????? ## >>>> ????? ##????? 0????? 1 >>>> ????? ## 429054 570946 >>>> >>>> I committed a modification to R_unif_index to address this by >>>> generating random bits (blocks of 16) and rejection sampling, but for >>>> now this is only enabled if the environment variable R_NEW_SAMPLE is >>>> set before the first call. >>>> >>>> Some things still needed: >>>> >>>> - someone to look over the change and see if there are any issues >>>> - adjustment of RNGkind to allowing the old behavior to be selected >>>> - make the new behavior the default >>>> - adjust documentation >>>> - ??? >>>> >>>> Unfortunately I don't have enough free cycles to do this, but I can >>>> help if someone else can take the lead. >>>> >>>> There are two other places I found that might suffer from the same >>>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in >>>> src/nmath/wilcox.c.? Both can be addressed by using R_unif_index. I >>>> have done that for walker_ProbSampleReplace, but the wilcox change >>>> might need adjusting to support the standalone math library and I >>>> don't feel confident enough I'd get that right. >>>> >>>> Best, >>>> >>>> luke >>>> >>>> >>>> -- >>>> Luke Tierney >>>> Ralph E. Wareham Professor of Mathematical Sciences >>>> University of Iowa????????????????? Phone:???????????? 319-335-3386 >>>> Department of Statistics and??????? Fax:?????????????? 319-335-3017 >>>> ???? Actuarial Science >>>> 241 Schaeffer Hall????????????????? email:?? luke-tierney at uiowa.edu >>>> Iowa City, IA 52242???????????????? WWW:? http://www.stat.uiowa.edu >>>> >>>> ______________________________________________ >>>> R-devel at r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>> >>> ????[[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-devel at r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel