The 53 bits only encode at most 2^{32} possible values, because the source of the float is the output of a 32-bit PRNG (the obsolete version of MT). 53 bits isn't the relevant number here. The selection ratios can get close to 2. Computer scientists don't do it the way R does, for a reason. Regards, Philip On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:> On 19/09/2018 9:09 AM, I?aki Ucar wrote: > > El mi?., 19 sept. 2018 a las 14:43, Duncan Murdoch > > (<murdoch.duncan at gmail.com>) escribi?: > >> > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote: > >>> Dear list, > >>> > >>> It looks to me that R samples random integers using an intuitive but > biased > >>> algorithm by going from a random number on [0,1) from the PRNG to a > random > >>> integer, e.g. > >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808 > >>> > >>> Many other languages use various rejection sampling approaches which > >>> provide an unbiased method for sampling, such as in Go, python, and > others > >>> described here: https://arxiv.org/abs/1805.10941 (I believe the > biased > >>> algorithm currently used in R is also described there). I'm not an > expert > >>> in this area, but does it make sense for the R to adopt one of the > unbiased > >>> random sample algorithms outlined there and used in other languages? > Would > >>> a patch providing such an algorithm be welcome? What concerns would > need to > >>> be addressed first? > >>> > >>> I believe this issue was also raised by Killie & Philip in > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and more > >>> recently in > >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf, > >>> pointing to the python implementation for comparison: > >>> > https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265 > >> > >> I think the analyses are correct, but I doubt if a change to the default > >> is likely to be accepted as it would make it more difficult to reproduce > >> older results. > >> > >> On the other hand, a contribution of a new function like sample() but > >> not suffering from the bias would be good. The normal way to make such > >> a contribution is in a user contributed package. > >> > >> By the way, R code illustrating the bias is probably not very hard to > >> put together. I believe the bias manifests itself in sample() producing > >> values with two different probabilities (instead of all equal > >> probabilities). Those may differ by as much as one part in 2^32. It's > > > > According to Kellie and Philip, in the attachment of the thread > > referenced by Carl, "The maximum ratio of selection probabilities can > > get as large as 1.5 if n is just below 2^31". > > Sorry, I didn't write very well. I meant to say that the difference in > probabilities would be 2^-32, not that the ratio of probabilities would > be 1 + 2^-32. > > By the way, I don't see the statement giving the ratio as 1.5, but maybe > I was looking in the wrong place. In Theorem 1 of the paper I was > looking in the ratio was "1 + m 2^{-w + 1}". In that formula m is your > n. If it is near 2^31, R uses w = 57 random bits, so the ratio would be > very, very small (one part in 2^25). > > The worst case for R would happen when m is just below 2^25, where w > is at least 31 for the default generators. In that case the ratio could > be about 1.03. > > Duncan Murdoch >-- Philip B. Stark | Associate Dean, Mathematical and Physical Sciences | Professor, Department of Statistics | University of California Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark | @philipbstark [[alternative HTML version deleted]]
On 19/09/2018 12:09 PM, Philip B. Stark wrote:> The 53 bits only encode at most 2^{32} possible values, because the > source of the float is the output of a 32-bit PRNG (the obsolete version > of MT). 53 bits isn't the relevant number here.No, two calls to unif_rand() are used. There are two 32 bit values, but some of the bits are thrown away. Duncan Murdoch> > The selection ratios can get close to 2. Computer scientists don't do it > the way R does, for a reason. > > Regards, > Philip > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch <murdoch.duncan at gmail.com > <mailto:murdoch.duncan at gmail.com>> wrote: > > On 19/09/2018 9:09 AM, I?aki Ucar wrote: > > El mi?., 19 sept. 2018 a las 14:43, Duncan Murdoch > > (<murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com>>) > escribi?: > >> > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote: > >>> Dear list, > >>> > >>> It looks to me that R samples random integers using an > intuitive but biased > >>> algorithm by going from a random number on [0,1) from the PRNG > to a random > >>> integer, e.g. > >>> > https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808 > >>> > >>> Many other languages use various rejection sampling approaches > which > >>> provide an unbiased method for sampling, such as in Go, python, > and others > >>> described here: https://arxiv.org/abs/1805.10941 (I believe the > biased > >>> algorithm currently used in R is also described there).? I'm > not an expert > >>> in this area, but does it make sense for the R to adopt one of > the unbiased > >>> random sample algorithms outlined there and used in other > languages?? Would > >>> a patch providing such an algorithm be welcome? What concerns > would need to > >>> be addressed first? > >>> > >>> I believe this issue was also raised by Killie & Philip in > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and > more > >>> recently in > >>> > https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf > <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>, > >>> pointing to the python implementation for comparison: > >>> > https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265 > >> > >> I think the analyses are correct, but I doubt if a change to the > default > >> is likely to be accepted as it would make it more difficult to > reproduce > >> older results. > >> > >> On the other hand, a contribution of a new function like > sample() but > >> not suffering from the bias would be good.? The normal way to > make such > >> a contribution is in a user contributed package. > >> > >> By the way, R code illustrating the bias is probably not very > hard to > >> put together.? I believe the bias manifests itself in sample() > producing > >> values with two different probabilities (instead of all equal > >> probabilities).? Those may differ by as much as one part in > 2^32.? It's > > > > According to Kellie and Philip, in the attachment of the thread > > referenced by Carl, "The maximum ratio of selection probabilities can > > get as large as 1.5 if n is just below 2^31". > > Sorry, I didn't write very well.? I meant to say that the difference in > probabilities would be 2^-32, not that the ratio of probabilities would > be 1 + 2^-32. > > By the way, I don't see the statement giving the ratio as 1.5, but > maybe > I was looking in the wrong place.? In Theorem 1 of the paper I was > looking in the ratio was "1 + m 2^{-w + 1}".? In that formula m is your > n.? If it is near 2^31, R uses w = 57 random bits, so the ratio > would be > very, very small (one part in 2^25). > > The worst case for R would happen when m? is just below? 2^25, where w > is at least 31 for the default generators.? In that case the ratio > could > be about 1.03. > > Duncan Murdoch > > > > -- > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences | > Professor, ?Department of Statistics | > University of California > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark > <http://statistics.berkeley.edu/%7Estark> | > @philipbstark >
No, the 2nd call only happens when m > 2**31. Here's the code: (RNG.c, lines 793ff) double R_unif_index(double dn) { double cut = INT_MAX; switch(RNG_kind) { case KNUTH_TAOCP: case USER_UNIF: case KNUTH_TAOCP2: cut = 33554431.0; /* 2^25 - 1 */ break; default: break; } double u = dn > cut ? ru() : unif_rand(); return floor(dn * u); } On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:> On 19/09/2018 12:09 PM, Philip B. Stark wrote: > > The 53 bits only encode at most 2^{32} possible values, because the > > source of the float is the output of a 32-bit PRNG (the obsolete version > > of MT). 53 bits isn't the relevant number here. > > No, two calls to unif_rand() are used. There are two 32 bit values, but > some of the bits are thrown away. > > Duncan Murdoch > > > > > The selection ratios can get close to 2. Computer scientists don't do it > > the way R does, for a reason. > > > > Regards, > > Philip > > > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch <murdoch.duncan at gmail.com > > <mailto:murdoch.duncan at gmail.com>> wrote: > > > > On 19/09/2018 9:09 AM, I?aki Ucar wrote: > > > El mi?., 19 sept. 2018 a las 14:43, Duncan Murdoch > > > (<murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com>>) > > escribi?: > > >> > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote: > > >>> Dear list, > > >>> > > >>> It looks to me that R samples random integers using an > > intuitive but biased > > >>> algorithm by going from a random number on [0,1) from the PRNG > > to a random > > >>> integer, e.g. > > >>> > > > https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808 > > >>> > > >>> Many other languages use various rejection sampling approaches > > which > > >>> provide an unbiased method for sampling, such as in Go, python, > > and others > > >>> described here: https://arxiv.org/abs/1805.10941 (I believe the > > biased > > >>> algorithm currently used in R is also described there). I'm > > not an expert > > >>> in this area, but does it make sense for the R to adopt one of > > the unbiased > > >>> random sample algorithms outlined there and used in other > > languages? Would > > >>> a patch providing such an algorithm be welcome? What concerns > > would need to > > >>> be addressed first? > > >>> > > >>> I believe this issue was also raised by Killie & Philip in > > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and > > more > > >>> recently in > > >>> > > https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf > > < > https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>, > > >>> pointing to the python implementation for comparison: > > >>> > > > https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265 > > >> > > >> I think the analyses are correct, but I doubt if a change to the > > default > > >> is likely to be accepted as it would make it more difficult to > > reproduce > > >> older results. > > >> > > >> On the other hand, a contribution of a new function like > > sample() but > > >> not suffering from the bias would be good. The normal way to > > make such > > >> a contribution is in a user contributed package. > > >> > > >> By the way, R code illustrating the bias is probably not very > > hard to > > >> put together. I believe the bias manifests itself in sample() > > producing > > >> values with two different probabilities (instead of all equal > > >> probabilities). Those may differ by as much as one part in > > 2^32. It's > > > > > > According to Kellie and Philip, in the attachment of the thread > > > referenced by Carl, "The maximum ratio of selection probabilities > can > > > get as large as 1.5 if n is just below 2^31". > > > > Sorry, I didn't write very well. I meant to say that the difference > in > > probabilities would be 2^-32, not that the ratio of probabilities > would > > be 1 + 2^-32. > > > > By the way, I don't see the statement giving the ratio as 1.5, but > > maybe > > I was looking in the wrong place. In Theorem 1 of the paper I was > > looking in the ratio was "1 + m 2^{-w + 1}". In that formula m is > your > > n. If it is near 2^31, R uses w = 57 random bits, so the ratio > > would be > > very, very small (one part in 2^25). > > > > The worst case for R would happen when m is just below 2^25, where > w > > is at least 31 for the default generators. In that case the ratio > > could > > be about 1.03. > > > > Duncan Murdoch > > > > > > > > -- > > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences | > > Professor, Department of Statistics | > > University of California > > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark > > <http://statistics.berkeley.edu/%7Estark> | > > @philipbstark > > > >-- Philip B. Stark | Associate Dean, Mathematical and Physical Sciences | Professor, Department of Statistics | University of California Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark | @philipbstark [[alternative HTML version deleted]]