Dear list, It looks to me that R samples random integers using an intuitive but biased algorithm by going from a random number on [0,1) from the PRNG to a random integer, e.g. https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808 Many other languages use various rejection sampling approaches which provide an unbiased method for sampling, such as in Go, python, and others described here: https://arxiv.org/abs/1805.10941 (I believe the biased algorithm currently used in R is also described there). I'm not an expert in this area, but does it make sense for the R to adopt one of the unbiased random sample algorithms outlined there and used in other languages? Would a patch providing such an algorithm be welcome? What concerns would need to be addressed first? I believe this issue was also raised by Killie & Philip in http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and more recently in https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf, pointing to the python implementation for comparison: https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265 Thanks! Carl -- http://carlboettiger.info [[alternative HTML version deleted]]
On 18/09/2018 5:46 PM, Carl Boettiger wrote:> Dear list, > > It looks to me that R samples random integers using an intuitive but biased > algorithm by going from a random number on [0,1) from the PRNG to a random > integer, e.g. > https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808 > > Many other languages use various rejection sampling approaches which > provide an unbiased method for sampling, such as in Go, python, and others > described here: https://arxiv.org/abs/1805.10941 (I believe the biased > algorithm currently used in R is also described there). I'm not an expert > in this area, but does it make sense for the R to adopt one of the unbiased > random sample algorithms outlined there and used in other languages? Would > a patch providing such an algorithm be welcome? What concerns would need to > be addressed first? > > I believe this issue was also raised by Killie & Philip in > http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and more > recently in > https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf, > pointing to the python implementation for comparison: > https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265I think the analyses are correct, but I doubt if a change to the default is likely to be accepted as it would make it more difficult to reproduce older results. On the other hand, a contribution of a new function like sample() but not suffering from the bias would be good. The normal way to make such a contribution is in a user contributed package. By the way, R code illustrating the bias is probably not very hard to put together. I believe the bias manifests itself in sample() producing values with two different probabilities (instead of all equal probabilities). Those may differ by as much as one part in 2^32. It's very difficult to detect a probability difference that small, but if you define the partition of values into the high probability values vs the low probability values, you can probably detect the difference in a feasible simulation. Duncan Murdoch
El mi?., 19 sept. 2018 a las 14:43, Duncan Murdoch (<murdoch.duncan at gmail.com>) escribi?:> > On 18/09/2018 5:46 PM, Carl Boettiger wrote: > > Dear list, > > > > It looks to me that R samples random integers using an intuitive but biased > > algorithm by going from a random number on [0,1) from the PRNG to a random > > integer, e.g. > > https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808 > > > > Many other languages use various rejection sampling approaches which > > provide an unbiased method for sampling, such as in Go, python, and others > > described here: https://arxiv.org/abs/1805.10941 (I believe the biased > > algorithm currently used in R is also described there). I'm not an expert > > in this area, but does it make sense for the R to adopt one of the unbiased > > random sample algorithms outlined there and used in other languages? Would > > a patch providing such an algorithm be welcome? What concerns would need to > > be addressed first? > > > > I believe this issue was also raised by Killie & Philip in > > http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and more > > recently in > > https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf, > > pointing to the python implementation for comparison: > > https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265 > > I think the analyses are correct, but I doubt if a change to the default > is likely to be accepted as it would make it more difficult to reproduce > older results. > > On the other hand, a contribution of a new function like sample() but > not suffering from the bias would be good. The normal way to make such > a contribution is in a user contributed package. > > By the way, R code illustrating the bias is probably not very hard to > put together. I believe the bias manifests itself in sample() producing > values with two different probabilities (instead of all equal > probabilities). Those may differ by as much as one part in 2^32. It'sAccording to Kellie and Philip, in the attachment of the thread referenced by Carl, "The maximum ratio of selection probabilities can get as large as 1.5 if n is just below 2^31". I?aki> very difficult to detect a probability difference that small, but if you > define the partition of values into the high probability values vs the > low probability values, you can probably detect the difference in a > feasible simulation. > > Duncan Murdoch >
On Wed, 19 Sep 2018 at 13:43, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:> > I think the analyses are correct, but I doubt if a change to the default > is likely to be accepted as it would make it more difficult to reproduce > older results.I'm a bit alarmed by the logic here. Unbiased sampling seems basic for a statistical language. As a consumer of R I'd like to think that e.g. my bootstrapped p values are correct. Surely if the old results depend on the biased algorithm, then they are false results? -- Sent from Gmail Mobile [[alternative HTML version deleted]]