On Wed, 19 Sep 2018 at 13:43, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:> > I think the analyses are correct, but I doubt if a change to the default > is likely to be accepted as it would make it more difficult to reproduce > older results.I'm a bit alarmed by the logic here. Unbiased sampling seems basic for a statistical language. As a consumer of R I'd like to think that e.g. my bootstrapped p values are correct. Surely if the old results depend on the biased algorithm, then they are false results? -- Sent from Gmail Mobile [[alternative HTML version deleted]]
On 2018-09-19 09:40 AM, David Hugh-Jones wrote:> On Wed, 19 Sep 2018 at 13:43, Duncan Murdoch <murdoch.duncan at gmail.com> > wrote: > >> >> I think the analyses are correct, but I doubt if a change to the default >> is likely to be accepted as it would make it more difficult to reproduce >> older results. > > > I'm a bit alarmed by the logic here. Unbiased sampling seems basic for a > statistical language. As a consumer of R I'd like to think that e.g. my > bootstrapped p values are correct. > Surely if the old results depend on the biased algorithm, then they are > false results? >Balancing backward compatibility and correctness is a tough problem here. If this goes into base R, what's the best way to do it? What was the protocol for migrating away from the "buggy Kinderman-Ramage" generator, back in the day? (Version 1.7 was sometime between 2001 and 2004). I couldn't find the exact commit in the GitHub mirror: this is related ... https://github.com/wch/r-source/commit/7ad3044639fd1fe093c655e573fd1a67aa7f55f6#diff-dbcad570d4fb9b7005550ff630543b37 ==?normal.kind? can be ?"Kinderman-Ramage"?, ?"Buggy Kinderman-Ramage"? (not for ?set.seed?), ?"Ahrens-Dieter"?, ?"Box-Muller"?, ?"Inversion"? (the default), or ?"user-supplied"?. (For inversion, see the reference in ?qnorm?.) The Kinderman-Ramage generator used in versions prior to 1.7.0 (now called ?"Buggy"?) had several approximation errors and should only be used for reproduction of old results.
On 19/09/2018 9:40 AM, David Hugh-Jones wrote:> > > On Wed, 19 Sep 2018 at 13:43, Duncan Murdoch <murdoch.duncan at gmail.com > <mailto:murdoch.duncan at gmail.com>> wrote: > > > I think the analyses are correct, but I doubt if a change to the > default > is likely to be accepted as it would make it more difficult to > reproduce > older results. > > > I'm a bit alarmed by the logic here. Unbiased sampling seems basic for a > statistical language. As a consumer of R I'd like to think that e.g. my > bootstrapped p values are correct. > Surely if the old results depend on the biased algorithm, then they are > false results?All Monte Carlo results contain Monte Carlo error. Using the biased function will have some additional error, but for almost all simulations, it will be negligible compared to the Monte Carlo error. I suspect the only simulations where the bias was anywhere near the same order of magnitude as the Monte Carlo error would be ones designed with this specific code in mind. Duncan Murdoch
That depends on the number of replications, among other things. Moreover, because of the bias, the usual formulae for uncertainty in estimates based on random samples, etc., are incorrect: sample() does not give a simple random sample. On Wed, Sep 19, 2018 at 9:15 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:> On 19/09/2018 9:40 AM, David Hugh-Jones wrote: > > > > > > On Wed, 19 Sep 2018 at 13:43, Duncan Murdoch <murdoch.duncan at gmail.com > > <mailto:murdoch.duncan at gmail.com>> wrote: > > > > > > I think the analyses are correct, but I doubt if a change to the > > default > > is likely to be accepted as it would make it more difficult to > > reproduce > > older results. > > > > > > I'm a bit alarmed by the logic here. Unbiased sampling seems basic for a > > statistical language. As a consumer of R I'd like to think that e.g. my > > bootstrapped p values are correct. > > Surely if the old results depend on the biased algorithm, then they are > > false results? > > All Monte Carlo results contain Monte Carlo error. Using the biased > function will have some additional error, but for almost all > simulations, it will be negligible compared to the Monte Carlo error. I > suspect the only simulations where the bias was anywhere near the same > order of magnitude as the Monte Carlo error would be ones designed with > this specific code in mind. > > Duncan Murdoch > >-- Philip B. Stark | Associate Dean, Mathematical and Physical Sciences | Professor, Department of Statistics | University of California Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark | @philipbstark [[alternative HTML version deleted]]
On 09/19/2018 10:03 AM, Ben Bolker wrote: ...> Balancing backward compatibility and correctness is a tough problem > here.I think improvements in the RNG is a situation where backward compatibility is not really going to be lost, because people can specify the old generator, they just will not get it by default. My opinion is that the default needs to generally be the best option available because too many people will be expecting that, or not know better, in which case that is what they should get. There are only two small problems that occur to me: 1/ Researchers that want to have reproducible results (all I hope) need to be aware the change has happened. In theory they should have recorded the RNG they were using, along with the seed (and, BTW, the number of nodes if they generate with a parallel generator). If they have not done that then they can figure out the RNG from knowing what version of R they used. If they haven't recorded that then they can figure it out by some experimentation and knowing roughly when they did the research. If none of this works then the research probably should be lost. As an exercise, researchers might also want to experiment with whether the new default qualitatively changes their results. That might lead to publishable research, so no one should complain. 2/ Package maintainers that have used the default RNG to generate tests may need to change their tests to specify the old generator, or modify results used for comparisons in the tests. Since package testing is usually for code checking rather than statistical results, not using the best available generator is not usually an issue. Most of my own package testing already specifies the generator, lots uses "buggy Kinderman-Ramage" because tests were set up a long time ago. I will have to change package setRNG which warns when the default generator changes. (This warning is intentional because I was bitten badly by a small change in the S generator circa 1990.)> If this goes into base R, what's the best way to do it? What was > the protocol for migrating away from the "buggy Kinderman-Ramage" > generator, back in the day? (Version 1.7 was sometime between 2001 and > 2004).I think there may have been a change in R 0.99 too. At least my notes suggest that the code I changed for R 1.7.0 had worked with the default generator from R 0.99 to 1.6.2. I don't recall the protocol, I think it just happened and was announced in the NEWS. (Has this protocol changed?) The ramification for me was that I had to go through all of my packages' testing and change the name of the explicitly specified RNG to "buggy Kinderman-Ramage". Perhaps there does need to be a protocol for testing before release. When my package setRNG fails then many of my other packages will also fail because they depend on it. This is a simple fix but reverse dependencies may make it look like lots of things are broken. Paul Gilbert> I couldn't find the exact commit in the GitHub mirror: this is related ... > > https://github.com/wch/r-source/commit/7ad3044639fd1fe093c655e573fd1a67aa7f55f6#diff-dbcad570d4fb9b7005550ff630543b37 > > > > ==> ?normal.kind? can be ?"Kinderman-Ramage"?, ?"Buggy > Kinderman-Ramage"? (not for ?set.seed?), ?"Ahrens-Dieter"?, > ?"Box-Muller"?, ?"Inversion"? (the default), or ?"user-supplied"?. > (For inversion, see the reference in ?qnorm?.) The > Kinderman-Ramage generator used in versions prior to 1.7.0 (now > called ?"Buggy"?) had several approximation errors and should only > be used for reproduction of old results. > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >