thr3ads.net - R devel - [Rd] Bias in R's random integers? [Sep 2018]

If this information is useful, please help other people find it:
Share via:

Iñaki Ucar

2018-Sep-19 13:09 UTC

[Rd] Bias in R's random integers?

El mi?., 19 sept. 2018 a las 14:43, Duncan Murdoch
(<murdoch.duncan at gmail.com>) escribi?:>
> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> > Dear list,
> >
> > It looks to me that R samples random integers using an intuitive but
biased
> > algorithm by going from a random number on [0,1) from the PRNG to a
random
> > integer, e.g.
> > https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> >
> > Many other languages use various rejection sampling approaches which
> > provide an unbiased method for sampling, such as in Go, python, and
others
> > described here:  https://arxiv.org/abs/1805.10941 (I believe the
biased
> > algorithm currently used in R is also described there).  I'm not
an expert
> > in this area, but does it make sense for the R to adopt one of the
unbiased
> > random sample algorithms outlined there and used in other languages? 
Would
> > a patch providing such an algorithm be welcome? What concerns would
need to
> > be addressed first?
> >
> > I believe this issue was also raised by Killie & Philip in
> > http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and more
> > recently in
> > https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> > pointing to the python implementation for comparison:
> >
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>
> I think the analyses are correct, but I doubt if a change to the default
> is likely to be accepted as it would make it more difficult to reproduce
> older results.
>
> On the other hand, a contribution of a new function like sample() but
> not suffering from the bias would be good.  The normal way to make such
> a contribution is in a user contributed package.
>
> By the way, R code illustrating the bias is probably not very hard to
> put together.  I believe the bias manifests itself in sample() producing
> values with two different probabilities (instead of all equal
> probabilities).  Those may differ by as much as one part in 2^32.  It's
According to Kellie and Philip, in the attachment of the thread
referenced by Carl, "The maximum ratio of selection probabilities can
get as large as 1.5 if n is just below 2^31".

I?aki
> very difficult to detect a probability difference that small, but if you
> define the partition of values into the high probability values vs the
> low probability values, you can probably detect the difference in a
> feasible simulation.
>
> Duncan Murdoch
>

Duncan Murdoch

2018-Sep-19 16:05 UTC

head link

[Rd] Bias in R's random integers?

On 19/09/2018 9:09 AM, I?aki Ucar wrote:> El mi?., 19 sept. 2018 a las 14:43, Duncan Murdoch
> (<murdoch.duncan at gmail.com>) escribi?:
>>
>> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
>>> Dear list,
>>>
>>> It looks to me that R samples random integers using an intuitive
but biased
>>> algorithm by going from a random number on [0,1) from the PRNG to a
random
>>> integer, e.g.
>>>
https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>>>
>>> Many other languages use various rejection sampling approaches
which
>>> provide an unbiased method for sampling, such as in Go, python, and
others
>>> described here:  https://arxiv.org/abs/1805.10941 (I believe the
biased
>>> algorithm currently used in R is also described there).  I'm
not an expert
>>> in this area, but does it make sense for the R to adopt one of the
unbiased
>>> random sample algorithms outlined there and used in other
languages?  Would
>>> a patch providing such an algorithm be welcome? What concerns would
need to
>>> be addressed first?
>>>
>>> I believe this issue was also raised by Killie & Philip in
>>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
more
>>> recently in
>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
>>> pointing to the python implementation for comparison:
>>>
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>>
>> I think the analyses are correct, but I doubt if a change to the
default
>> is likely to be accepted as it would make it more difficult to
reproduce
>> older results.
>>
>> On the other hand, a contribution of a new function like sample() but
>> not suffering from the bias would be good.  The normal way to make such
>> a contribution is in a user contributed package.
>>
>> By the way, R code illustrating the bias is probably not very hard to
>> put together.  I believe the bias manifests itself in sample()
producing
>> values with two different probabilities (instead of all equal
>> probabilities).  Those may differ by as much as one part in 2^32. 
It's
> 
> According to Kellie and Philip, in the attachment of the thread
> referenced by Carl, "The maximum ratio of selection probabilities can
> get as large as 1.5 if n is just below 2^31".
Sorry, I didn't write very well.  I meant to say that the difference in 
probabilities would be 2^-32, not that the ratio of probabilities would 
be 1 + 2^-32.

By the way, I don't see the statement giving the ratio as 1.5, but maybe 
I was looking in the wrong place.  In Theorem 1 of the paper I was 
looking in the ratio was "1 + m 2^{-w + 1}".  In that formula m is
your
n.  If it is near 2^31, R uses w = 57 random bits, so the ratio would be 
very, very small (one part in 2^25).

The worst case for R would happen when m  is just below  2^25, where w 
is at least 31 for the default generators.  In that case the ratio could 
be about 1.03.

Duncan Murdoch

Philip B. Stark

2018-Sep-19 16:09 UTC

head link

[Rd] Bias in R's random integers?

The 53 bits only encode at most 2^{32} possible values, because the source
of the float is the output of a 32-bit PRNG (the obsolete version of MT).
53 bits isn't the relevant number here.

The selection ratios can get close to 2. Computer scientists don't do it
the way R does, for a reason.

Regards,
Philip

On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch <murdoch.duncan at
gmail.com>
wrote:
> On 19/09/2018 9:09 AM, I?aki Ucar wrote:
> > El mi?., 19 sept. 2018 a las 14:43, Duncan Murdoch
> > (<murdoch.duncan at gmail.com>) escribi?:
> >>
> >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> >>> Dear list,
> >>>
> >>> It looks to me that R samples random integers using an
intuitive but
> biased
> >>> algorithm by going from a random number on [0,1) from the PRNG
to a
> random
> >>> integer, e.g.
> >>>
https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> >>>
> >>> Many other languages use various rejection sampling approaches
which
> >>> provide an unbiased method for sampling, such as in Go,
python, and
> others
> >>> described here:  https://arxiv.org/abs/1805.10941 (I believe
the
> biased
> >>> algorithm currently used in R is also described there). 
I'm not an
> expert
> >>> in this area, but does it make sense for the R to adopt one of
the
> unbiased
> >>> random sample algorithms outlined there and used in other
languages?
> Would
> >>> a patch providing such an algorithm be welcome? What concerns
would
> need to
> >>> be addressed first?
> >>>
> >>> I believe this issue was also raised by Killie & Philip in
> >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html,
and more
> >>> recently in
> >>>
https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> >>> pointing to the python implementation for comparison:
> >>>
>
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> >>
> >> I think the analyses are correct, but I doubt if a change to the
default
> >> is likely to be accepted as it would make it more difficult to
reproduce
> >> older results.
> >>
> >> On the other hand, a contribution of a new function like sample()
but
> >> not suffering from the bias would be good.  The normal way to make
such
> >> a contribution is in a user contributed package.
> >>
> >> By the way, R code illustrating the bias is probably not very hard
to
> >> put together.  I believe the bias manifests itself in sample()
producing
> >> values with two different probabilities (instead of all equal
> >> probabilities).  Those may differ by as much as one part in 2^32. 
It's
> >
> > According to Kellie and Philip, in the attachment of the thread
> > referenced by Carl, "The maximum ratio of selection probabilities
can
> > get as large as 1.5 if n is just below 2^31".
>
> Sorry, I didn't write very well.  I meant to say that the difference in
> probabilities would be 2^-32, not that the ratio of probabilities would
> be 1 + 2^-32.
>
> By the way, I don't see the statement giving the ratio as 1.5, but
maybe
> I was looking in the wrong place.  In Theorem 1 of the paper I was
> looking in the ratio was "1 + m 2^{-w + 1}".  In that formula m
is your
> n.  If it is near 2^31, R uses w = 57 random bits, so the ratio would be
> very, very small (one part in 2^25).
>
> The worst case for R would happen when m  is just below  2^25, where w
> is at least 31 for the default generators.  In that case the ratio could
> be about 1.03.
>
> Duncan Murdoch
>

-- 
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor,  Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark

	[[alternative HTML version deleted]]

Seemingly Similar Threads

Search for more reasonably related threads

R devel - Sep 2018 - Bias in R's random integers?

[Rd] Bias in R's random integers?

[Rd] Bias in R's random integers?

[Rd] Bias in R's random integers?

Seemingly Similar Threads