It doesn't seem too hard to come up with plausible ways in which this
could give bad results. Suppose I sample rows from a large dataset,
maybe for bootstrapping. Suppose the rows are non-randomly ordered,
e.g. odd rows are males, even rows are females. Oops! Very
non-representative sample, bootstrap p values are garbage.

David

On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

> On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> > Hi Duncan--
>
> That is a mathematically true statement, but I suspect it is not very
> relevant.  Pseudo-random number generators always have test functions
> whose sample averages are quite different from the expectation under the
> true distribution.  Remember Von Neumann's "state of sin" quote.  The
> bug in sample() just means it is easier to find such a function than it
> would otherwise be.
>
> The practical question is whether such a function is likely to arise in
> practice or not.
>
> > Whether those correspond to commonly used statistics or not, I have no
> > idea.
>
> I am pretty confident that this bug rarely matters.
>
> > Regarding backwards compatibility: as a user, I'd rather the default
> > sample() do the best possible thing, and take an extra step to use
> > something like sample(..., legacy=TRUE) if I want to reproduce old
> > results.
>
> I suspect there's a good chance the bug I discovered today (non-integer
> x values not being truncated) will be declared to be a feature, and the
> documentation will be changed.  Then the rejection sampling approach
> would need to be quite a bit more complicated.
>
> I think a documentation warning about the accuracy of sampling
> probabilities would also be a sufficient fix here, and would be quite a
> bit less trouble than changing the default sample().  But as I said in
> my original post, a contribution of a function without this bug would be
> a nice addition.
>
> Duncan Murdoch
>
> > Regards,
> > Philip
> >
> > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> >
> > > On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> > > > No, the 2nd call only happens when m > 2**31. Here's the code:
> > >
> > > Yes, you're right. Sorry!
> > >
> > > So the ratio really does come close to 2.  However, the difference in
> > > probabilities between outcomes is still at most 2^-32 when m is less
> > > than that cutoff.  That's not feasible to detect; the only detectable
> > > difference would happen if some event was constructed to hold an
> > > abundance of outcomes with especially low (or especially high)
> > > probability.
> > >
> > > As I said in my original post, it's probably not hard to construct such
> > > a thing, but as I've said more recently, it probably wouldn't happen by
> > > chance.  Here's one attempt to do it:
> > >
> > > Call the values from unif_rand() "the unif_rand() outcomes".  Call the
> > > values from sample() the sample outcomes.
> > >
> > > It would be easiest to see the error if half of the sample() outcomes
> > > used two unif_rand() outcomes, and half used just one.  That would mean
> > > m should be (2/3) * 2^32, but that's too big and would trigger the
> > > other version.
> > >
> > > So how about half use 2 unif_rands(), and half use 3?  That means
> > > m = (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes
> > > would alternate between the two possibilities, so our event could be
> > > even versus odd outcomes.
> > >
> > > Let's try it:
> > >
> > >  > m <- (2/5)*2^32
> > >  > m > 2^31
> > > [1] FALSE
> > >  > x <- sample(m, 1000000, replace = TRUE)
> > >  > table(x %% 2)
> > >
> > >      0      1
> > > 399850 600150
> > >
> > > Since m is an even number, the true proportions of evens and odds
> > > should be exactly 0.5.  That's some pretty strong evidence of the bug
> > > in the generator.  (Note that the ratio of the observed probabilities
> > > is about 1.5, so I may not be the first person to have done this.)
> > >
> > > I'm still not convinced that there has ever been a simulation run with
> > > detectable bias compared to Monte Carlo error unless it (like this
> > > one) was designed specifically to show the problem.
> > >
> > > Duncan Murdoch
> > >
> > > > (RNG.c, lines 793ff)
> > > >
> > > > double R_unif_index(double dn)
> > > > {
> > > >     double cut = INT_MAX;
> > > >
> > > >     switch(RNG_kind) {
> > > >     case KNUTH_TAOCP:
> > > >     case USER_UNIF:
> > > >     case KNUTH_TAOCP2:
> > > >         cut = 33554431.0; /* 2^25 - 1 */
> > > >         break;
> > > >     default:
> > > >         break;
> > > >     }
> > > >
> > > >     double u = dn > cut ? ru() : unif_rand();
> > > >     return floor(dn * u);
> > > > }
> > > >
> > > > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> > > >
> > > > > On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> > > > > > The 53 bits only encode at most 2^{32} possible values, because
> > > > > > the source of the float is the output of a 32-bit PRNG (the
> > > > > > obsolete version of MT). 53 bits isn't the relevant number here.
> > > > >
> > > > > No, two calls to unif_rand() are used.  There are two 32 bit
> > > > > values, but some of the bits are thrown away.
> > > > >
> > > > > Duncan Murdoch
> > > > >
> > > > > > The selection ratios can get close to 2. Computer scientists
> > > > > > don't do it the way R does, for a reason.
> > > > > >
> > > > > > Regards,
> > > > > > Philip
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> > > > > >
> > > > > > > On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> > > > > > > > On Wed., 19 Sep. 2018 at 14:43, Duncan Murdoch
> > > > > > > > (<murdoch.duncan at gmail.com>) wrote:
> > > > > > > >>
> > > > > > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> > > > > > > >>> Dear list,
> > > > > > > >>>
> > > > > > > >>> It looks to me that R samples random integers using an
> > > > > > > >>> intuitive but biased algorithm by going from a random number
> > > > > > > >>> on [0,1) from the PRNG to a random integer, e.g.
> > > > > > > >>>
> > > > > > > >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> > > > > > > >>>
> > > > > > > >>> Many other languages use various rejection sampling
> > > > > > > >>> approaches which provide an unbiased method for sampling,
> > > > > > > >>> such as in Go, python, and others described here:
> > > > > > > >>> https://arxiv.org/abs/1805.10941 (I believe the biased
> > > > > > > >>> algorithm currently used in R is also described there).  I'm
> > > > > > > >>> not an expert in this area, but does it make sense for R to
> > > > > > > >>> adopt one of the unbiased random sample algorithms outlined
> > > > > > > >>> there and used in other languages?  Would a patch providing
> > > > > > > >>> such an algorithm be welcome?  What concerns would need to
> > > > > > > >>> be addressed first?
> > > > > > > >>>
> > > > > > > >>> I believe this issue was also raised by Kellie & Philip in
> > > > > > > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html,
> > > > > > > >>> and more recently in
> > > > > > > >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> > > > > > > >>> pointing to the python implementation for comparison:
> > > > > > > >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> > > > > > > >>
> > > > > > > >> I think the analyses are correct, but I doubt if a change to
> > > > > > > >> the default is likely to be accepted as it would make it more
> > > > > > > >> difficult to reproduce older results.
> > > > > > > >>
> > > > > > > >> On the other hand, a contribution of a new function like
> > > > > > > >> sample() but not suffering from the bias would be good.  The
> > > > > > > >> normal way to make such a contribution is in a user
> > > > > > > >> contributed package.
> > > > > > > >>
> > > > > > > >> By the way, R code illustrating the bias is probably not very
> > > > > > > >> hard to put together.  I believe the bias manifests itself in
> > > > > > > >> sample() producing values with two different probabilities
> > > > > > > >> (instead of all equal probabilities).  Those may differ by as
> > > > > > > >> much as one part in 2^32.  It's
> > > > > > > >
> > > > > > > > According to Kellie and Philip, in the attachment of the
> > > > > > > > thread referenced by Carl, "The maximum ratio of selection
> > > > > > > > probabilities can get as large as 1.5 if n is just below 2^31".
> > > > > > >
> > > > > > > Sorry, I didn't write very well.  I meant to say that the
> > > > > > > difference in probabilities would be 2^-32, not that the ratio
> > > > > > > of probabilities would be 1 + 2^-32.
> > > > > > >
> > > > > > > By the way, I don't see the statement giving the ratio as 1.5,
> > > > > > > but maybe I was looking in the wrong place.  In Theorem 1 of the
> > > > > > > paper I was looking in the ratio was "1 + m 2^{-w + 1}".  In
> > > > > > > that formula m is your n.  If it is near 2^31, R uses w = 57
> > > > > > > random bits, so the ratio would be very, very small (one part
> > > > > > > in 2^25).
> > > > > > >
> > > > > > > The worst case for R would happen when m is just below 2^25,
> > > > > > > where w is at least 31 for the default generators.  In that
> > > > > > > case the ratio could be about 1.03.
> > > > > > >
> > > > > > > Duncan Murdoch
> > > > > >
> > > > > > --
> > > > > > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> > > > > > Professor, Department of Statistics |
> > > > > > University of California
> > > > > > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> > > > > > @philipbstark
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Sent from Gmail Mobile
On 19/09/2018 5:57 PM, David Hugh-Jones wrote:
> It doesn't seem too hard to come up with plausible ways in which this
> could give bad results. Suppose I sample rows from a large dataset,
> maybe for bootstrapping. Suppose the rows are non-randomly ordered,
> e.g. odd rows are males, even rows are females. Oops! Very
> non-representative sample, bootstrap p values are garbage.

That would only happen if your dataset was exactly 1717986918 elements
in size.  (And in fact, it will be less extreme than I posted: I had x
set to 1717986918.4, as described in another thread.  If you use an
integer value you need a different pattern; add or subtract an element
or two and the pattern needed to see a problem changes drastically.)

But if you're sampling from a dataset of that exact size, then you
should worry about this bug.  Don't use sample().  Use the algorithm
that Carl described.

Duncan Murdoch
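[Editorial note: the "algorithm that Carl described" is rejection sampling. A minimal sketch under stated assumptions, not R's actual source: `rand32()` is a hypothetical stand-in for one 32-bit draw from the underlying generator (R's unif_rand() returns a double instead, so this illustrates the technique rather than a drop-in patch).]

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one 32-bit word from the underlying
 * generator; any source of uniform 32-bit words would do. */
static uint32_t rand32(void)
{
    return ((uint32_t) rand() << 16) ^ (uint32_t) rand();
}

/* Largest multiple of m that is <= 2^32.  Words landing at or above
 * this limit are rejected, so each value in [0, m) receives exactly
 * limit/m of the accepted words: all values are equally likely. */
static uint64_t reject_limit(uint32_t m)
{
    return ((uint64_t) 1 << 32) / m * m;
}

/* Unbiased integer in [0, m) by classical rejection sampling. */
static uint32_t unbiased_index(uint32_t m)
{
    uint64_t limit = reject_limit(m);
    uint32_t r;
    do {
        r = rand32();       /* redraw until below the limit */
    } while (r >= limit);
    return r % m;
}
```

Fewer than m of the 2^32 words are ever rejected, so for m far below 2^32 the expected number of extra draws is negligible; the cost is the occasional redraw, in exchange for exactly equal probabilities.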
The same issue occurs in walker_ProbSampleReplace() in random.c, lines 386-387: rU = unif_rand() * n; k = (int) rU; Cheers, Philip On Wed, Sep 19, 2018 at 3:08 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:> On 19/09/2018 5:57 PM, David Hugh-Jones wrote: > > > > It doesn't seem too hard to come up with plausible ways in which this > > could give bad results. Suppose I sample rows from a large dataset, > > maybe for bootstrapping. Suppose the rows are non-randomly ordered, e.g. > > odd rows are males, even rows are females. Oops! Very non-representative > > sample, bootstrap p values are garbage. > > That would only happen if your dataset was exactly 1717986918 elements > in size. (And in fact, it will be less extreme than I posted: I had x > set to 1717986918.4, as described in another thread. If you use an > integer value you need a different pattern; add or subtract an element > or two and the pattern needed to see a problem changes drastically.) > > But if you're sampling from a dataset of that exact size, then you > should worry about this bug. Don't use sample(). Use the algorithm that > Carl described. > > Duncan Murdoch > > > > > David > > > > On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan at gmail.com > > <mailto:murdoch.duncan at gmail.com>> wrote: > > > > On 19/09/2018 3:52 PM, Philip B. Stark wrote: > > > Hi Duncan-- > > > > > > > > > > That is a mathematically true statement, but I suspect it is not very > > relevant. Pseudo-random number generators always have test functions > > whose sample averages are quite different from the expectation under > > the > > true distribution. Remember Von Neumann's "state of sin" quote. The > > bug in sample() just means it is easier to find such a function than > it > > would otherwise be. > > > > The practical question is whether such a function is likely to arise > in > > practice or not. > > > > > Whether those correspond to commonly used statistics or not, I > > have no > > > idea. 
> > > > I am pretty confident that this bug rarely matters. > > > > > Regarding backwards compatibility: as a user, I'd rather the > default > > > sample() do the best possible thing, and take an extra step to use > > > something like sample(..., legacy=TRUE) if I want to reproduce > > old results. > > > > I suspect there's a good chance the bug I discovered today > (non-integer > > x values not being truncated) will be declared to be a feature, and > the > > documentation will be changed. Then the rejection sampling approach > > would need to be quite a bit more complicated. > > > > I think a documentation warning about the accuracy of sampling > > probabilities would also be a sufficient fix here, and would be > quite a > > bit less trouble than changing the default sample(). But as I said > in > > my original post, a contribution of a function without this bug > > would be > > a nice addition. > > > > Duncan Murdoch > > > > > > > > Regards, > > > Philip > > > > > > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch > > <murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com> > > > <mailto:murdoch.duncan at gmail.com > > <mailto:murdoch.duncan at gmail.com>>> wrote: > > > > > > On 19/09/2018 12:23 PM, Philip B. Stark wrote: > > > > No, the 2nd call only happens when m > 2**31. Here's the > code: > > > > > > Yes, you're right. Sorry! > > > > > > So the ratio really does come close to 2. However, the > > difference in > > > probabilities between outcomes is still at most 2^-32 when m > > is less > > > than that cutoff. That's not feasible to detect; the only > > detectable > > > difference would happen if some event was constructed to hold > an > > > abundance of outcomes with especially low (or especially high) > > > probability. > > > > > > As I said in my original post, it's probably not hard to > > construct such > > > a thing, but as I've said more recently, it probably wouldn't > > happen by > > > chance. 
> > >     Here's one attempt to do it:
> > >
> > >     Call the values from unif_rand() "the unif_rand() outcomes".  Call the
> > >     values from sample() the sample outcomes.
> > >
> > >     It would be easiest to see the error if half of the sample() outcomes
> > >     used two unif_rand() outcomes, and half used just one.  That would mean
> > >     m should be (2/3) * 2^32, but that's too big and would trigger the
> > >     other version.
> > >
> > >     So how about half use 2 unif_rands(), and half use 3?  That means
> > >     m = (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes
> > >     would alternate between the two possibilities, so our event could be
> > >     even versus odd outcomes.
> > >
> > >     Let's try it:
> > >
> > >     > m <- (2/5)*2^32
> > >     > m > 2^31
> > >     [1] FALSE
> > >     > x <- sample(m, 1000000, replace = TRUE)
> > >     > table(x %% 2)
> > >
> > >          0      1
> > >     399850 600150
> > >
> > >     Since m is an even number, the true proportions of evens and odds
> > >     should be exactly 0.5.  That's some pretty strong evidence of the bug
> > >     in the generator.  (Note that the ratio of the observed probabilities
> > >     is about 1.5, so I may not be the first person to have done this.)
> > >
> > >     I'm still not convinced that there has ever been a simulation run with
> > >     detectable bias compared to Monte Carlo error unless it (like this one)
> > >     was designed specifically to show the problem.
> > >
> > >     Duncan Murdoch
> > >
> > > > (RNG.c, lines 793ff)
> > > >
> > > > double R_unif_index(double dn)
> > > > {
> > > >     double cut = INT_MAX;
> > > >
> > > >     switch(RNG_kind) {
> > > >     case KNUTH_TAOCP:
> > > >     case USER_UNIF:
> > > >     case KNUTH_TAOCP2:
> > > >         cut = 33554431.0; /* 2^25 - 1 */
> > > >         break;
> > > >     default:
> > > >         break;
> > > >     }
> > > >
> > > >     double u = dn > cut ? ru() : unif_rand();
> > > >     return floor(dn * u);
> > > > }
> > > >
> > > > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
> > > > <murdoch.duncan at gmail.com> wrote:
> > > >
> > > >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> > > >     > The 53 bits only encode at most 2^{32} possible values, because the
> > > >     > source of the float is the output of a 32-bit PRNG (the obsolete
> > > >     > version of MT).  53 bits isn't the relevant number here.
> > > >
> > > >     No, two calls to unif_rand() are used.  There are two 32 bit values,
> > > >     but some of the bits are thrown away.
> > > >
> > > >     Duncan Murdoch
> > > >
> > > > > The selection ratios can get close to 2.  Computer scientists don't do
> > > > > it the way R does, for a reason.
> > > > >
> > > > > Regards,
> > > > > Philip
> > > > >
> > > > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> > > > > <murdoch.duncan at gmail.com> wrote:
> > > > >
> > > > >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> > > > >     > On Wed., 19 Sep 2018 at 14:43, Duncan Murdoch
> > > > >     > (<murdoch.duncan at gmail.com>) wrote:
> > > > >     >>
> > > > >     >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> > > > >     >>> Dear list,
> > > > >     >>>
> > > > >     >>> It looks to me that R samples random integers using an intuitive
> > > > >     >>> but biased algorithm by going from a random number on [0,1) from
> > > > >     >>> the PRNG to a random integer, e.g.
> > > > >     >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> > > > >     >>>
> > > > >     >>> Many other languages use various rejection sampling approaches
> > > > >     >>> which provide an unbiased method for sampling, such as in Go,
> > > > >     >>> python, and others described here: https://arxiv.org/abs/1805.10941
> > > > >     >>> (I believe the biased algorithm currently used in R is also
> > > > >     >>> described there).  I'm not an expert in this area, but does it
> > > > >     >>> make sense for R to adopt one of the unbiased random sample
> > > > >     >>> algorithms outlined there and used in other languages?  Would a
> > > > >     >>> patch providing such an algorithm be welcome?
> > > > >     >>> What concerns would need to be addressed first?
> > > > >     >>>
> > > > >     >>> I believe this issue was also raised by Kellie & Philip in
> > > > >     >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
> > > > >     >>> more recently in
> > > > >     >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> > > > >     >>> pointing to the python implementation for comparison:
> > > > >     >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> > > > >     >>
> > > > >     >> I think the analyses are correct, but I doubt if a change to the
> > > > >     >> default is likely to be accepted as it would make it more
> > > > >     >> difficult to reproduce older results.
> > > > >     >>
> > > > >     >> On the other hand, a contribution of a new function like sample()
> > > > >     >> but not suffering from the bias would be good.  The normal way to
> > > > >     >> make such a contribution is in a user contributed package.
> > > > >     >>
> > > > >     >> By the way, R code illustrating the bias is probably not very
> > > > >     >> hard to put together.  I believe the bias manifests itself in
> > > > >     >> sample() producing values with two different probabilities
> > > > >     >> (instead of all equal probabilities).  Those may differ by as
> > > > >     >> much as one part in 2^32.
> > > > >     >> It's
> > > > >     >
> > > > >     > According to Kellie and Philip, in the attachment of the thread
> > > > >     > referenced by Carl, "The maximum ratio of selection probabilities
> > > > >     > can get as large as 1.5 if n is just below 2^31".
> > > > >
> > > > >     Sorry, I didn't write very well.  I meant to say that the difference
> > > > >     in probabilities would be 2^-32, not that the ratio of probabilities
> > > > >     would be 1 + 2^-32.
> > > > >
> > > > >     By the way, I don't see the statement giving the ratio as 1.5, but
> > > > >     maybe I was looking in the wrong place.  In Theorem 1 of the paper I
> > > > >     was looking in the ratio was "1 + m 2^{-w + 1}".  In that formula m
> > > > >     is your n.  If it is near 2^31, R uses w = 57 random bits, so the
> > > > >     ratio would be very, very small (one part in 2^25).
> > > > >
> > > > >     The worst case for R would happen when m is just below 2^25, where w
> > > > >     is at least 31 for the default generators.  In that case the ratio
> > > > >     could be about 1.03.
> > > > >
> > > > >     Duncan Murdoch
> > > > >
> > > > > --
> > > > > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> > > > > Professor, Department of Statistics |
> > > > > University of California
> > > > > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> > > > > @philipbstark
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Sent from Gmail Mobile

--
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor, Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark