thr3ads.net - R devel - [Rd] Bias in R's random integers? [Sep 2018]

If this information is useful, please help other people find it:
Share via:

Philip B. Stark

2018-Sep-19 19:52 UTC

[Rd] Bias in R's random integers?

Hi Duncan--

Nice simulation!

The absolute difference in probabilities is small, but the maximum relative
difference grows from something negligible to almost 2 as m approaches
2**31.

Because the L_1 distance between the uniform distribution on {1, ..., m}
and what you actually get is large, there have to be test functions whose
expectations are quite different under the two distributions. Whether those
correspond to commonly used statistics or not, I have no idea.

Regarding backwards compatibility: as a user, I'd rather the default
sample() do the best possible thing, and take an extra step to use
something like sample(..., legacy=TRUE) if I want to reproduce old results.

Regards,
Philip

On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.duncan at
gmail.com>
wrote:
> On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> > No, the 2nd call only happens when m > 2**31. Here's the code:
>
> Yes, you're right. Sorry!
>
> So the ratio really does come close to 2.  However, the difference in
> probabilities between outcomes is still at most 2^-32 when m is less
> than that cutoff.  That's not feasible to detect; the only detectable
> difference would happen if some event was constructed to hold an
> abundance of outcomes with especially low (or especially high) probability.
>
> As I said in my original post, it's probably not hard to construct such
> a thing, but as I've said more recently, it probably wouldn't
happen by
> chance.  Here's one attempt to do it:
>
> Call the values from unif_rand() "the unif_rand() outcomes". 
Call the
> values from sample() the sample outcomes.
>
> It would be easiest to see the error if half of the sample() outcomes
> used two unif_rand() outcomes, and half used just one.  That would mean
> m should be (2/3) * 2^32, but that's too big and would trigger the
other
> version.
>
> So how about half use 2 unif_rands(), and half use 3?  That means m >
(2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes would
> alternate between the two possibilities, so our event could be even
> versus odd outcomes.
>
> Let's try it:
>
>  > m <- (2/5)*2^32
>  > m > 2^31
> [1] FALSE
>  > x <- sample(m, 1000000, replace = TRUE)
>  > table(x %% 2)
>
>       0      1
> 399850 600150
>
> Since m is an even number, the true proportions of evens and odds should
> be exactly 0.5.  That's some pretty strong evidence of the bug in the
> generator.  (Note that the ratio of the observed probabilities is about
> 1.5, so I may not be the first person to have done this.)
>
> I'm still not convinced that there has ever been a simulation run with
> detectable bias compared to Monte Carlo error unless it (like this one)
> was designed specifically to show the problem.
>
> Duncan Murdoch
>
> >
> > (RNG.c, lines 793ff)
> >
> > double R_unif_index(double dn)
> > {
> >      double cut = INT_MAX;
> >
> >      switch(RNG_kind) {
> >      case KNUTH_TAOCP:
> >      case USER_UNIF:
> >      case KNUTH_TAOCP2:
> > cut = 33554431.0; /* 2^25 - 1 */
> > break;
> >      default:
> > break;
> >     }
> >
> >      double u = dn > cut ? ru() : unif_rand();
> >      return floor(dn * u);
> > }
> >
> > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <murdoch.duncan at
gmail.com
> > <mailto:murdoch.duncan at gmail.com>> wrote:
> >
> >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> >      > The 53 bits only encode at most 2^{32} possible values,
because
> the
> >      > source of the float is the output of a 32-bit PRNG (the
obsolete
> >     version
> >      > of MT). 53 bits isn't the relevant number here.
> >
> >     No, two calls to unif_rand() are used.  There are two 32 bit
values,
> >     but
> >     some of the bits are thrown away.
> >
> >     Duncan Murdoch
> >
> >      >
> >      > The selection ratios can get close to 2. Computer scientists
> >     don't do it
> >      > the way R does, for a reason.
> >      >
> >      > Regards,
> >      > Philip
> >      >
> >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> >     <murdoch.duncan at gmail.com <mailto:murdoch.duncan at
gmail.com>
> >      > <mailto:murdoch.duncan at gmail.com
> >     <mailto:murdoch.duncan at gmail.com>>> wrote:
> >      >
> >      >     On 19/09/2018 9:09 AM, I?aki Ucar wrote:
> >      >      > El mi?., 19 sept. 2018 a las 14:43, Duncan Murdoch
> >      >      > (<murdoch.duncan at gmail.com
> >     <mailto:murdoch.duncan at gmail.com>
<mailto:murdoch.duncan at gmail.com
> >     <mailto:murdoch.duncan at gmail.com>>>)
> >      >     escribi?:
> >      >      >>
> >      >      >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> >      >      >>> Dear list,
> >      >      >>>
> >      >      >>> It looks to me that R samples random
integers using an
> >      >     intuitive but biased
> >      >      >>> algorithm by going from a random number on
[0,1) from
> >     the PRNG
> >      >     to a random
> >      >      >>> integer, e.g.
> >      >      >>>
> >      >
> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> >      >      >>>
> >      >      >>> Many other languages use various rejection
sampling
> >     approaches
> >      >     which
> >      >      >>> provide an unbiased method for sampling,
such as in Go,
> >     python,
> >      >     and others
> >      >      >>> described here:
https://arxiv.org/abs/1805.10941 (I
> >     believe the
> >      >     biased
> >      >      >>> algorithm currently used in R is also
described there).
> I'm
> >      >     not an expert
> >      >      >>> in this area, but does it make sense for
the R to adopt
> >     one of
> >      >     the unbiased
> >      >      >>> random sample algorithms outlined there
and used in other
> >      >     languages?  Would
> >      >      >>> a patch providing such an algorithm be
welcome? What
> >     concerns
> >      >     would need to
> >      >      >>> be addressed first?
> >      >      >>>
> >      >      >>> I believe this issue was also raised by
Killie & Philip
> in
> >      >      >>>
> >     http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
> >      >     more
> >      >      >>> recently in
> >      >      >>>
> >      >
> >     https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
> >     <
> https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
> >      >
> >       <
> https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,
> >      >      >>> pointing to the python implementation for
comparison:
> >      >      >>>
> >      >
> >
>
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> >      >      >>
> >      >      >> I think the analyses are correct, but I doubt
if a change
> >     to the
> >      >     default
> >      >      >> is likely to be accepted as it would make it
more
> >     difficult to
> >      >     reproduce
> >      >      >> older results.
> >      >      >>
> >      >      >> On the other hand, a contribution of a new
function like
> >      >     sample() but
> >      >      >> not suffering from the bias would be good. 
The normal
> way to
> >      >     make such
> >      >      >> a contribution is in a user contributed
package.
> >      >      >>
> >      >      >> By the way, R code illustrating the bias is
probably not
> very
> >      >     hard to
> >      >      >> put together.  I believe the bias manifests
itself in
> >     sample()
> >      >     producing
> >      >      >> values with two different probabilities
(instead of all
> equal
> >      >      >> probabilities).  Those may differ by as much
as one part
> in
> >      >     2^32.  It's
> >      >      >
> >      >      > According to Kellie and Philip, in the attachment
of the
> >     thread
> >      >      > referenced by Carl, "The maximum ratio of
selection
> >     probabilities can
> >      >      > get as large as 1.5 if n is just below 2^31".
> >      >
> >      >     Sorry, I didn't write very well.  I meant to say
that the
> >     difference in
> >      >     probabilities would be 2^-32, not that the ratio of
> >     probabilities would
> >      >     be 1 + 2^-32.
> >      >
> >      >     By the way, I don't see the statement giving the
ratio as
> >     1.5, but
> >      >     maybe
> >      >     I was looking in the wrong place.  In Theorem 1 of the
paper
> >     I was
> >      >     looking in the ratio was "1 + m 2^{-w + 1}". 
In that formula
> >     m is your
> >      >     n.  If it is near 2^31, R uses w = 57 random bits, so
the
> ratio
> >      >     would be
> >      >     very, very small (one part in 2^25).
> >      >
> >      >     The worst case for R would happen when m  is just below
> >     2^25, where w
> >      >     is at least 31 for the default generators.  In that case
the
> >     ratio
> >      >     could
> >      >     be about 1.03.
> >      >
> >      >     Duncan Murdoch
> >      >
> >      >
> >      >
> >      > --
> >      > Philip B. Stark | Associate Dean, Mathematical and Physical
> >     Sciences |
> >      > Professor,  Department of Statistics |
> >      > University of California
> >      > Berkeley, CA 94720-3860 | 510-394-5077 |
> >     statistics.berkeley.edu/~stark
> >     <http://statistics.berkeley.edu/%7Estark>
> >      > <http://statistics.berkeley.edu/%7Estark> |
> >      > @philipbstark
> >      >
> >
> >
> >
> > --
> > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> > Professor,  Department of Statistics |
> > University of California
> > Berkeley, CA 94720-3860 | 510-394-5077 |
statistics.berkeley.edu/~stark
> > <http://statistics.berkeley.edu/%7Estark> |
> > @philipbstark
> >
>
>
-- 
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor,  Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark

	[[alternative HTML version deleted]]

Philip B. Stark

2018-Sep-19 19:58 UTC

head link

[Rd] Bias in R's random integers?

One more thing, apropos this:

I'm still not convinced that there has ever been a simulation run with
detectable bias compared to Monte Carlo error unless it (like this one) was
designed specifically to show the problem.


I often use random permutations to simulate p-values to calibrate
permutation tests. If I'm trying to simulate the probability of a
low-probability event, this could matter a lot.

Best wishes,
Philip

On Wed, Sep 19, 2018 at 12:52 PM Philip B. Stark <stark at
stat.berkeley.edu>
wrote:
> Hi Duncan--
>
> Nice simulation!
>
> The absolute difference in probabilities is small, but the maximum
> relative difference grows from something negligible to almost 2 as m
> approaches 2**31.
>
> Because the L_1 distance between the uniform distribution on {1, ..., m}
> and what you actually get is large, there have to be test functions whose
> expectations are quite different under the two distributions. Whether those
> correspond to commonly used statistics or not, I have no idea.
>
> Regarding backwards compatibility: as a user, I'd rather the default
> sample() do the best possible thing, and take an extra step to use
> something like sample(..., legacy=TRUE) if I want to reproduce old results.
>
> Regards,
> Philip
>
> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.duncan at
gmail.com>
> wrote:
>
>> On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>> > No, the 2nd call only happens when m > 2**31. Here's the
code:
>>
>> Yes, you're right. Sorry!
>>
>> So the ratio really does come close to 2.  However, the difference in
>> probabilities between outcomes is still at most 2^-32 when m is less
>> than that cutoff.  That's not feasible to detect; the only
detectable
>> difference would happen if some event was constructed to hold an
>> abundance of outcomes with especially low (or especially high)
>> probability.
>>
>> As I said in my original post, it's probably not hard to construct
such
>> a thing, but as I've said more recently, it probably wouldn't
happen by
>> chance.  Here's one attempt to do it:
>>
>> Call the values from unif_rand() "the unif_rand() outcomes". 
Call the
>> values from sample() the sample outcomes.
>>
>> It would be easiest to see the error if half of the sample() outcomes
>> used two unif_rand() outcomes, and half used just one.  That would mean
>> m should be (2/3) * 2^32, but that's too big and would trigger the
other
>> version.
>>
>> So how about half use 2 unif_rands(), and half use 3?  That means m
>> (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes
would
>> alternate between the two possibilities, so our event could be even
>> versus odd outcomes.
>>
>> Let's try it:
>>
>>  > m <- (2/5)*2^32
>>  > m > 2^31
>> [1] FALSE
>>  > x <- sample(m, 1000000, replace = TRUE)
>>  > table(x %% 2)
>>
>>       0      1
>> 399850 600150
>>
>> Since m is an even number, the true proportions of evens and odds
should
>> be exactly 0.5.  That's some pretty strong evidence of the bug in
the
>> generator.  (Note that the ratio of the observed probabilities is about
>> 1.5, so I may not be the first person to have done this.)
>>
>> I'm still not convinced that there has ever been a simulation run
with
>> detectable bias compared to Monte Carlo error unless it (like this one)
>> was designed specifically to show the problem.
>>
>> Duncan Murdoch
>>
>> >
>> > (RNG.c, lines 793ff)
>> >
>> > double R_unif_index(double dn)
>> > {
>> >      double cut = INT_MAX;
>> >
>> >      switch(RNG_kind) {
>> >      case KNUTH_TAOCP:
>> >      case USER_UNIF:
>> >      case KNUTH_TAOCP2:
>> > cut = 33554431.0; /* 2^25 - 1 */
>> > break;
>> >      default:
>> > break;
>> >     }
>> >
>> >      double u = dn > cut ? ru() : unif_rand();
>> >      return floor(dn * u);
>> > }
>> >
>> > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <
>> murdoch.duncan at gmail.com
>> > <mailto:murdoch.duncan at gmail.com>> wrote:
>> >
>> >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>> >      > The 53 bits only encode at most 2^{32} possible values,
because
>> the
>> >      > source of the float is the output of a 32-bit PRNG (the
obsolete
>> >     version
>> >      > of MT). 53 bits isn't the relevant number here.
>> >
>> >     No, two calls to unif_rand() are used.  There are two 32 bit
values,
>> >     but
>> >     some of the bits are thrown away.
>> >
>> >     Duncan Murdoch
>> >
>> >      >
>> >      > The selection ratios can get close to 2. Computer
scientists
>> >     don't do it
>> >      > the way R does, for a reason.
>> >      >
>> >      > Regards,
>> >      > Philip
>> >      >
>> >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>> >     <murdoch.duncan at gmail.com <mailto:murdoch.duncan at
gmail.com>
>> >      > <mailto:murdoch.duncan at gmail.com
>> >     <mailto:murdoch.duncan at gmail.com>>> wrote:
>> >      >
>> >      >     On 19/09/2018 9:09 AM, I?aki Ucar wrote:
>> >      >      > El mi?., 19 sept. 2018 a las 14:43, Duncan
Murdoch
>> >      >      > (<murdoch.duncan at gmail.com
>> >     <mailto:murdoch.duncan at gmail.com>
<mailto:murdoch.duncan at gmail.com
>> >     <mailto:murdoch.duncan at gmail.com>>>)
>> >      >     escribi?:
>> >      >      >>
>> >      >      >> On 18/09/2018 5:46 PM, Carl Boettiger
wrote:
>> >      >      >>> Dear list,
>> >      >      >>>
>> >      >      >>> It looks to me that R samples random
integers using an
>> >      >     intuitive but biased
>> >      >      >>> algorithm by going from a random
number on [0,1) from
>> >     the PRNG
>> >      >     to a random
>> >      >      >>> integer, e.g.
>> >      >      >>>
>> >      >
>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>> >      >      >>>
>> >      >      >>> Many other languages use various
rejection sampling
>> >     approaches
>> >      >     which
>> >      >      >>> provide an unbiased method for
sampling, such as in Go,
>> >     python,
>> >      >     and others
>> >      >      >>> described here:
https://arxiv.org/abs/1805.10941 (I
>> >     believe the
>> >      >     biased
>> >      >      >>> algorithm currently used in R is also
described
>> there).  I'm
>> >      >     not an expert
>> >      >      >>> in this area, but does it make sense
for the R to adopt
>> >     one of
>> >      >     the unbiased
>> >      >      >>> random sample algorithms outlined
there and used in
>> other
>> >      >     languages?  Would
>> >      >      >>> a patch providing such an algorithm be
welcome? What
>> >     concerns
>> >      >     would need to
>> >      >      >>> be addressed first?
>> >      >      >>>
>> >      >      >>> I believe this issue was also raised
by Killie & Philip
>> in
>> >      >      >>>
>> >     http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html,
and
>> >      >     more
>> >      >      >>> recently in
>> >      >      >>>
>> >      >
>> >    
https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
>> >     <
>>
https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
>> >      >
>> >       <
>>
https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,
>> >      >      >>> pointing to the python implementation
for comparison:
>> >      >      >>>
>> >      >
>> >
>>
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>> >      >      >>
>> >      >      >> I think the analyses are correct, but I
doubt if a change
>> >     to the
>> >      >     default
>> >      >      >> is likely to be accepted as it would make
it more
>> >     difficult to
>> >      >     reproduce
>> >      >      >> older results.
>> >      >      >>
>> >      >      >> On the other hand, a contribution of a new
function like
>> >      >     sample() but
>> >      >      >> not suffering from the bias would be good.
The normal
>> way to
>> >      >     make such
>> >      >      >> a contribution is in a user contributed
package.
>> >      >      >>
>> >      >      >> By the way, R code illustrating the bias
is probably not
>> very
>> >      >     hard to
>> >      >      >> put together.  I believe the bias
manifests itself in
>> >     sample()
>> >      >     producing
>> >      >      >> values with two different probabilities
(instead of all
>> equal
>> >      >      >> probabilities).  Those may differ by as
much as one part
>> in
>> >      >     2^32.  It's
>> >      >      >
>> >      >      > According to Kellie and Philip, in the
attachment of the
>> >     thread
>> >      >      > referenced by Carl, "The maximum ratio of
selection
>> >     probabilities can
>> >      >      > get as large as 1.5 if n is just below
2^31".
>> >      >
>> >      >     Sorry, I didn't write very well.  I meant to say
that the
>> >     difference in
>> >      >     probabilities would be 2^-32, not that the ratio of
>> >     probabilities would
>> >      >     be 1 + 2^-32.
>> >      >
>> >      >     By the way, I don't see the statement giving the
ratio as
>> >     1.5, but
>> >      >     maybe
>> >      >     I was looking in the wrong place.  In Theorem 1 of
the paper
>> >     I was
>> >      >     looking in the ratio was "1 + m 2^{-w +
1}".  In that formula
>> >     m is your
>> >      >     n.  If it is near 2^31, R uses w = 57 random bits,
so the
>> ratio
>> >      >     would be
>> >      >     very, very small (one part in 2^25).
>> >      >
>> >      >     The worst case for R would happen when m  is just
below
>> >     2^25, where w
>> >      >     is at least 31 for the default generators.  In that
case the
>> >     ratio
>> >      >     could
>> >      >     be about 1.03.
>> >      >
>> >      >     Duncan Murdoch
>> >      >
>> >      >
>> >      >
>> >      > --
>> >      > Philip B. Stark | Associate Dean, Mathematical and
Physical
>> >     Sciences |
>> >      > Professor,  Department of Statistics |
>> >      > University of California
>> >      > Berkeley, CA 94720-3860 | 510-394-5077 |
>> >     statistics.berkeley.edu/~stark
>> >     <http://statistics.berkeley.edu/%7Estark>
>> >      > <http://statistics.berkeley.edu/%7Estark> |
>> >      > @philipbstark
>> >      >
>> >
>> >
>> >
>> > --
>> > Philip B. Stark | Associate Dean, Mathematical and Physical
Sciences |
>> > Professor,  Department of Statistics |
>> > University of California
>> > Berkeley, CA 94720-3860 | 510-394-5077 |
statistics.berkeley.edu/~stark
>> > <http://statistics.berkeley.edu/%7Estark> |
>> > @philipbstark
>> >
>>
>>
>
> --
> Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> Professor,  Department of Statistics |
> University of California
> Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> @philipbstark
>
>
-- 
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor,  Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark

	[[alternative HTML version deleted]]

Duncan Murdoch

2018-Sep-19 20:19 UTC

head link

[Rd] Bias in R's random integers?

On 19/09/2018 3:52 PM, Philip B. Stark wrote:> Hi Duncan--
> 
> Nice simulation!
> 
> The absolute difference in probabilities is small, but the maximum 
> relative difference grows from something negligible to almost 2 as m 
> approaches 2**31.
> 
> Because the L_1 distance between the uniform distribution on {1, ..., m} 
> and what you actually get is large, there have to be test functions 
> whose expectations are quite different under the two distributions. 
That is a mathematically true statement, but I suspect it is not very 
relevant.  Pseudo-random number generators always have test functions 
whose sample averages are quite different from the expectation under the 
true distribution.  Remember Von Neumann's "state of sin" quote. 
The
bug in sample() just means it is easier to find such a function than it 
would otherwise be.

The practical question is whether such a function is likely to arise in 
practice or not.

 > Whether those correspond to commonly used statistics or not, I have no
 > idea.

I am pretty confident that this bug rarely matters.
> Regarding backwards compatibility: as a user, I'd rather the default 
> sample() do the best possible thing, and take an extra step to use 
> something like sample(..., legacy=TRUE) if I want to reproduce old results.
I suspect there's a good chance the bug I discovered today (non-integer 
x values not being truncated) will be declared to be a feature, and the 
documentation will be changed.  Then the rejection sampling approach 
would need to be quite a bit more complicated.

I think a documentation warning about the accuracy of sampling 
probabilities would also be a sufficient fix here, and would be quite a 
bit less trouble than changing the default sample().  But as I said in 
my original post, a contribution of a function without this bug would be 
a nice addition.

Duncan Murdoch
> 
> Regards,
> Philip
> 
> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.duncan at
gmail.com
> <mailto:murdoch.duncan at gmail.com>> wrote:
> 
>     On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>      > No, the 2nd call only happens when m > 2**31. Here's the
code:
> 
>     Yes, you're right. Sorry!
> 
>     So the ratio really does come close to 2.? However, the difference in
>     probabilities between outcomes is still at most 2^-32 when m is less
>     than that cutoff.? That's not feasible to detect; the only
detectable
>     difference would happen if some event was constructed to hold an
>     abundance of outcomes with especially low (or especially high)
>     probability.
> 
>     As I said in my original post, it's probably not hard to construct
such
>     a thing, but as I've said more recently, it probably wouldn't
happen by
>     chance.? Here's one attempt to do it:
> 
>     Call the values from unif_rand() "the unif_rand() outcomes".?
Call the
>     values from sample() the sample outcomes.
> 
>     It would be easiest to see the error if half of the sample() outcomes
>     used two unif_rand() outcomes, and half used just one.? That would mean
>     m should be (2/3) * 2^32, but that's too big and would trigger the
>     other
>     version.
> 
>     So how about half use 2 unif_rands(), and half use 3?? That means m
>     (2/5) * 2^32 = 1717986918.? A good guess is that sample() outcomes
>     would
>     alternate between the two possibilities, so our event could be even
>     versus odd outcomes.
> 
>     Let's try it:
> 
>      ?> m <- (2/5)*2^32
>      ?> m > 2^31
>     [1] FALSE
>      ?> x <- sample(m, 1000000, replace = TRUE)
>      ?> table(x %% 2)
> 
>      ? ? ? 0? ? ? 1
>     399850 600150
> 
>     Since m is an even number, the true proportions of evens and odds
>     should
>     be exactly 0.5.? That's some pretty strong evidence of the bug in
the
>     generator.? (Note that the ratio of the observed probabilities is about
>     1.5, so I may not be the first person to have done this.)
> 
>     I'm still not convinced that there has ever been a simulation run
with
>     detectable bias compared to Monte Carlo error unless it (like this one)
>     was designed specifically to show the problem.
> 
>     Duncan Murdoch
> 
>      >
>      > (RNG.c, lines 793ff)
>      >
>      > double R_unif_index(double dn)
>      > {
>      >? ? ? double cut = INT_MAX;
>      >
>      >? ? ? switch(RNG_kind) {
>      >? ? ? case KNUTH_TAOCP:
>      >? ? ? case USER_UNIF:
>      >? ? ? case KNUTH_TAOCP2:
>      > cut = 33554431.0; /* 2^25 - 1 */
>      > break;
>      >? ? ? default:
>      > break;
>      >? ? ?}
>      >
>      >? ? ? double u = dn > cut ? ru() : unif_rand();
>      >? ? ? return floor(dn * u);
>      > }
>      >
>      > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
>     <murdoch.duncan at gmail.com <mailto:murdoch.duncan at
gmail.com>
>      > <mailto:murdoch.duncan at gmail.com
>     <mailto:murdoch.duncan at gmail.com>>> wrote:
>      >
>      >? ? ?On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>      >? ? ? > The 53 bits only encode at most 2^{32} possible values,
>     because the
>      >? ? ? > source of the float is the output of a 32-bit PRNG (the
>     obsolete
>      >? ? ?version
>      >? ? ? > of MT). 53 bits isn't the relevant number here.
>      >
>      >? ? ?No, two calls to unif_rand() are used.? There are two 32 bit
>     values,
>      >? ? ?but
>      >? ? ?some of the bits are thrown away.
>      >
>      >? ? ?Duncan Murdoch
>      >
>      >? ? ? >
>      >? ? ? > The selection ratios can get close to 2. Computer
scientists
>      >? ? ?don't do it
>      >? ? ? > the way R does, for a reason.
>      >? ? ? >
>      >? ? ? > Regards,
>      >? ? ? > Philip
>      >? ? ? >
>      >? ? ? > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>      >? ? ?<murdoch.duncan at gmail.com <mailto:murdoch.duncan at
gmail.com>
>     <mailto:murdoch.duncan at gmail.com <mailto:murdoch.duncan at
gmail.com>>
>      >? ? ? > <mailto:murdoch.duncan at gmail.com
>     <mailto:murdoch.duncan at gmail.com>
>      >? ? ?<mailto:murdoch.duncan at gmail.com
>     <mailto:murdoch.duncan at gmail.com>>>> wrote:
>      >? ? ? >
>      >? ? ? >? ? ?On 19/09/2018 9:09 AM, I?aki Ucar wrote:
>      >? ? ? >? ? ? > El mi?., 19 sept. 2018 a las 14:43, Duncan
Murdoch
>      >? ? ? >? ? ? > (<murdoch.duncan at gmail.com
>     <mailto:murdoch.duncan at gmail.com>
>      >? ? ?<mailto:murdoch.duncan at gmail.com
>     <mailto:murdoch.duncan at gmail.com>>
<mailto:murdoch.duncan at gmail.com
>     <mailto:murdoch.duncan at gmail.com>
>      >? ? ?<mailto:murdoch.duncan at gmail.com
>     <mailto:murdoch.duncan at gmail.com>>>>)
>      >? ? ? >? ? ?escribi?:
>      >? ? ? >? ? ? >>
>      >? ? ? >? ? ? >> On 18/09/2018 5:46 PM, Carl Boettiger
wrote:
>      >? ? ? >? ? ? >>> Dear list,
>      >? ? ? >? ? ? >>>
>      >? ? ? >? ? ? >>> It looks to me that R samples random
integers
>     using an
>      >? ? ? >? ? ?intuitive but biased
>      >? ? ? >? ? ? >>> algorithm by going from a random
number on [0,1) from
>      >? ? ?the PRNG
>      >? ? ? >? ? ?to a random
>      >? ? ? >? ? ? >>> integer, e.g.
>      >? ? ? >? ? ? >>>
>      >? ? ? >
>     https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>      >? ? ? >? ? ? >>>
>      >? ? ? >? ? ? >>> Many other languages use various
rejection sampling
>      >? ? ?approaches
>      >? ? ? >? ? ?which
>      >? ? ? >? ? ? >>> provide an unbiased method for
sampling, such as
>     in Go,
>      >? ? ?python,
>      >? ? ? >? ? ?and others
>      >? ? ? >? ? ? >>> described here:
https://arxiv.org/abs/1805.10941 (I
>      >? ? ?believe the
>      >? ? ? >? ? ?biased
>      >? ? ? >? ? ? >>> algorithm currently used in R is also
described
>     there).? I'm
>      >? ? ? >? ? ?not an expert
>      >? ? ? >? ? ? >>> in this area, but does it make sense
for the R to
>     adopt
>      >? ? ?one of
>      >? ? ? >? ? ?the unbiased
>      >? ? ? >? ? ? >>> random sample algorithms outlined
there and used
>     in other
>      >? ? ? >? ? ?languages?? Would
>      >? ? ? >? ? ? >>> a patch providing such an algorithm
be welcome? What
>      >? ? ?concerns
>      >? ? ? >? ? ?would need to
>      >? ? ? >? ? ? >>> be addressed first?
>      >? ? ? >? ? ? >>>
>      >? ? ? >? ? ? >>> I believe this issue was also raised
by Killie &
>     Philip in
>      >? ? ? >? ? ? >>>
>      > http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
>      >? ? ? >? ? ?more
>      >? ? ? >? ? ? >>> recently in
>      >? ? ? >? ? ? >>>
>      >? ? ? >
>      >
>     https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
>    
<https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
>      >   
>     
?<https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
>      >? ? ? >
>      >     
>     
?<https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,
>      >? ? ? >? ? ? >>> pointing to the python implementation
for comparison:
>      >? ? ? >? ? ? >>>
>      >? ? ? >
>      >
>    
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>      >? ? ? >? ? ? >>
>      >? ? ? >? ? ? >> I think the analyses are correct, but I
doubt if a
>     change
>      >? ? ?to the
>      >? ? ? >? ? ?default
>      >? ? ? >? ? ? >> is likely to be accepted as it would make
it more
>      >? ? ?difficult to
>      >? ? ? >? ? ?reproduce
>      >? ? ? >? ? ? >> older results.
>      >? ? ? >? ? ? >>
>      >? ? ? >? ? ? >> On the other hand, a contribution of a
new
>     function like
>      >? ? ? >? ? ?sample() but
>      >? ? ? >? ? ? >> not suffering from the bias would be
good.? The
>     normal way to
>      >? ? ? >? ? ?make such
>      >? ? ? >? ? ? >> a contribution is in a user contributed
package.
>      >? ? ? >? ? ? >>
>      >? ? ? >? ? ? >> By the way, R code illustrating the bias
is
>     probably not very
>      >? ? ? >? ? ?hard to
>      >? ? ? >? ? ? >> put together.? I believe the bias
manifests itself in
>      >? ? ?sample()
>      >? ? ? >? ? ?producing
>      >? ? ? >? ? ? >> values with two different probabilities
(instead
>     of all equal
>      >? ? ? >? ? ? >> probabilities).? Those may differ by as
much as
>     one part in
>      >? ? ? >? ? ?2^32.? It's
>      >? ? ? >? ? ? >
>      >? ? ? >? ? ? > According to Kellie and Philip, in the
attachment
>     of the
>      >? ? ?thread
>      >? ? ? >? ? ? > referenced by Carl, "The maximum ratio
of selection
>      >? ? ?probabilities can
>      >? ? ? >? ? ? > get as large as 1.5 if n is just below
2^31".
>      >? ? ? >
>      >? ? ? >? ? ?Sorry, I didn't write very well.? I meant to
say that the
>      >? ? ?difference in
>      >? ? ? >? ? ?probabilities would be 2^-32, not that the ratio of
>      >? ? ?probabilities would
>      >? ? ? >? ? ?be 1 + 2^-32.
>      >? ? ? >
>      >? ? ? >? ? ?By the way, I don't see the statement giving
the ratio as
>      >? ? ?1.5, but
>      >? ? ? >? ? ?maybe
>      >? ? ? >? ? ?I was looking in the wrong place.? In Theorem 1 of
the
>     paper
>      >? ? ?I was
>      >? ? ? >? ? ?looking in the ratio was "1 + m 2^{-w +
1}".? In that
>     formula
>      >? ? ?m is your
>      >? ? ? >? ? ?n.? If it is near 2^31, R uses w = 57 random bits,
so
>     the ratio
>      >? ? ? >? ? ?would be
>      >? ? ? >? ? ?very, very small (one part in 2^25).
>      >? ? ? >
>      >? ? ? >? ? ?The worst case for R would happen when m? is just
below
>      >? ? ?2^25, where w
>      >? ? ? >? ? ?is at least 31 for the default generators.? In that
>     case the
>      >? ? ?ratio
>      >? ? ? >? ? ?could
>      >? ? ? >? ? ?be about 1.03.
>      >? ? ? >
>      >? ? ? >? ? ?Duncan Murdoch
>      >? ? ? >
>      >? ? ? >
>      >? ? ? >
>      >? ? ? > --
>      >? ? ? > Philip B. Stark | Associate Dean, Mathematical and
Physical
>      >? ? ?Sciences |
>      >? ? ? > Professor, ?Department of Statistics |
>      >? ? ? > University of California
>      >? ? ? > Berkeley, CA 94720-3860 | 510-394-5077 |
>      > statistics.berkeley.edu/~stark
>     <http://statistics.berkeley.edu/%7Estark>
>      >? ? ?<http://statistics.berkeley.edu/%7Estark>
>      >? ? ? > <http://statistics.berkeley.edu/%7Estark> |
>      >? ? ? > @philipbstark
>      >? ? ? >
>      >
>      >
>      >
>      > --
>      > Philip B. Stark | Associate Dean, Mathematical and Physical
>     Sciences |
>      > Professor, ?Department of Statistics |
>      > University of California
>      > Berkeley, CA 94720-3860 | 510-394-5077 |
>     statistics.berkeley.edu/~stark
>     <http://statistics.berkeley.edu/%7Estark>
>      > <http://statistics.berkeley.edu/%7Estark> |
>      > @philipbstark
>      >
> 
> 
> 
> -- 
> Philip B. Stark | Associate Dean, Mathematical and Physical Sciences | 
> Professor, ?Department of Statistics |
> University of California
> Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark 
> <http://statistics.berkeley.edu/%7Estark> |
> @philipbstark
>

David Hugh-Jones

2018-Sep-19 21:57 UTC

head link

[Rd] Bias in R's random integers?

It doesn't seem too hard to come up with plausible ways in which this could
give bad results. Suppose I sample rows from a large dataset, maybe for
bootstrapping. Suppose the rows are non-randomly ordered, e.g. odd rows are
males, even rows are females. Oops! Very non-representative sample,
bootstrap p values are garbage.

David

On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan at gmail.com>
wrote:
> On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> > Hi Duncan--
> >
> >
>
> That is a mathematically true statement, but I suspect it is not very
> relevant.  Pseudo-random number generators always have test functions
> whose sample averages are quite different from the expectation under the
> true distribution.  Remember Von Neumann's "state of sin"
quote.  The
> bug in sample() just means it is easier to find such a function than it
> would otherwise be.
>
> The practical question is whether such a function is likely to arise in
> practice or not.
>
>  > Whether those correspond to commonly used statistics or not, I have
no
>  > idea.
>
> I am pretty confident that this bug rarely matters.
>
> > Regarding backwards compatibility: as a user, I'd rather the
default
> > sample() do the best possible thing, and take an extra step to use
> > something like sample(..., legacy=TRUE) if I want to reproduce old
> results.
>
> I suspect there's a good chance the bug I discovered today (non-integer
> x values not being truncated) will be declared to be a feature, and the
> documentation will be changed.  Then the rejection sampling approach
> would need to be quite a bit more complicated.
>
> I think a documentation warning about the accuracy of sampling
> probabilities would also be a sufficient fix here, and would be quite a
> bit less trouble than changing the default sample().  But as I said in
> my original post, a contribution of a function without this bug would be
> a nice addition.
>
> Duncan Murdoch
>
> >
> > Regards,
> > Philip
> >
> > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.duncan at
gmail.com
> > <mailto:murdoch.duncan at gmail.com>> wrote:
> >
> >     On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> >      > No, the 2nd call only happens when m > 2**31. Here's
the code:
> >
> >     Yes, you're right. Sorry!
> >
> >     So the ratio really does come close to 2.  However, the difference
in
> >     probabilities between outcomes is still at most 2^-32 when m is
less
> >     than that cutoff.  That's not feasible to detect; the only
detectable
> >     difference would happen if some event was constructed to hold an
> >     abundance of outcomes with especially low (or especially high)
> >     probability.
> >
> >     As I said in my original post, it's probably not hard to
construct
> such
> >     a thing, but as I've said more recently, it probably
wouldn't happen
> by
> >     chance.  Here's one attempt to do it:
> >
> >     Call the values from unif_rand() "the unif_rand()
outcomes".  Call
> the
> >     values from sample() the sample outcomes.
> >
> >     It would be easiest to see the error if half of the sample()
outcomes
> >     used two unif_rand() outcomes, and half used just one.  That would
> mean
> >     m should be (2/3) * 2^32, but that's too big and would trigger
the
> >     other
> >     version.
> >
> >     So how about half use 2 unif_rands(), and half use 3?  That means
m > >     (2/5) * 2^32 = 1717986918.  A good guess is that sample()
outcomes
> >     would
> >     alternate between the two possibilities, so our event could be
even
> >     versus odd outcomes.
> >
> >     Let's try it:
> >
> >       > m <- (2/5)*2^32
> >       > m > 2^31
> >     [1] FALSE
> >       > x <- sample(m, 1000000, replace = TRUE)
> >       > table(x %% 2)
> >
> >            0      1
> >     399850 600150
> >
> >     Since m is an even number, the true proportions of evens and odds
> >     should
> >     be exactly 0.5.  That's some pretty strong evidence of the bug
in the
> >     generator.  (Note that the ratio of the observed probabilities is
> about
> >     1.5, so I may not be the first person to have done this.)
> >
> >     I'm still not convinced that there has ever been a simulation
run
> with
> >     detectable bias compared to Monte Carlo error unless it (like this
> one)
> >     was designed specifically to show the problem.
> >
> >     Duncan Murdoch
> >
> >      >
> >      > (RNG.c, lines 793ff)
> >      >
> >      > double R_unif_index(double dn)
> >      > {
> >      >      double cut = INT_MAX;
> >      >
> >      >      switch(RNG_kind) {
> >      >      case KNUTH_TAOCP:
> >      >      case USER_UNIF:
> >      >      case KNUTH_TAOCP2:
> >      > cut = 33554431.0; /* 2^25 - 1 */
> >      > break;
> >      >      default:
> >      > break;
> >      >     }
> >      >
> >      >      double u = dn > cut ? ru() : unif_rand();
> >      >      return floor(dn * u);
> >      > }
> >      >
> >      > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
> >     <murdoch.duncan at gmail.com <mailto:murdoch.duncan at
gmail.com>
> >      > <mailto:murdoch.duncan at gmail.com
> >     <mailto:murdoch.duncan at gmail.com>>> wrote:
> >      >
> >      >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> >      >      > The 53 bits only encode at most 2^{32} possible
values,
> >     because the
> >      >      > source of the float is the output of a 32-bit PRNG
(the
> >     obsolete
> >      >     version
> >      >      > of MT). 53 bits isn't the relevant number
here.
> >      >
> >      >     No, two calls to unif_rand() are used.  There are two 32
bit
> >     values,
> >      >     but
> >      >     some of the bits are thrown away.
> >      >
> >      >     Duncan Murdoch
> >      >
> >      >      >
> >      >      > The selection ratios can get close to 2. Computer
> scientists
> >      >     don't do it
> >      >      > the way R does, for a reason.
> >      >      >
> >      >      > Regards,
> >      >      > Philip
> >      >      >
> >      >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> >      >     <murdoch.duncan at gmail.com
<mailto:murdoch.duncan at gmail.com>
> >     <mailto:murdoch.duncan at gmail.com <mailto:murdoch.duncan
at gmail.com>>
> >      >      > <mailto:murdoch.duncan at gmail.com
> >     <mailto:murdoch.duncan at gmail.com>
> >      >     <mailto:murdoch.duncan at gmail.com
> >     <mailto:murdoch.duncan at gmail.com>>>> wrote:
> >      >      >
> >      >      >     On 19/09/2018 9:09 AM, I?aki Ucar wrote:
> >      >      >      > El mi?., 19 sept. 2018 a las 14:43,
Duncan Murdoch
> >      >      >      > (<murdoch.duncan at gmail.com
> >     <mailto:murdoch.duncan at gmail.com>
> >      >     <mailto:murdoch.duncan at gmail.com
> >     <mailto:murdoch.duncan at gmail.com>>
<mailto:murdoch.duncan at gmail.com
> >     <mailto:murdoch.duncan at gmail.com>
> >      >     <mailto:murdoch.duncan at gmail.com
> >     <mailto:murdoch.duncan at gmail.com>>>>)
> >      >      >     escribi?:
> >      >      >      >>
> >      >      >      >> On 18/09/2018 5:46 PM, Carl
Boettiger wrote:
> >      >      >      >>> Dear list,
> >      >      >      >>>
> >      >      >      >>> It looks to me that R samples
random integers
> >     using an
> >      >      >     intuitive but biased
> >      >      >      >>> algorithm by going from a random
number on [0,1)
> from
> >      >     the PRNG
> >      >      >     to a random
> >      >      >      >>> integer, e.g.
> >      >      >      >>>
> >      >      >
> >
> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> >      >      >      >>>
> >      >      >      >>> Many other languages use various
rejection
> sampling
> >      >     approaches
> >      >      >     which
> >      >      >      >>> provide an unbiased method for
sampling, such as
> >     in Go,
> >      >     python,
> >      >      >     and others
> >      >      >      >>> described here:
https://arxiv.org/abs/1805.10941
> (I
> >      >     believe the
> >      >      >     biased
> >      >      >      >>> algorithm currently used in R is
also described
> >     there).  I'm
> >      >      >     not an expert
> >      >      >      >>> in this area, but does it make
sense for the R to
> >     adopt
> >      >     one of
> >      >      >     the unbiased
> >      >      >      >>> random sample algorithms
outlined there and used
> >     in other
> >      >      >     languages?  Would
> >      >      >      >>> a patch providing such an
algorithm be welcome?
> What
> >      >     concerns
> >      >      >     would need to
> >      >      >      >>> be addressed first?
> >      >      >      >>>
> >      >      >      >>> I believe this issue was also
raised by Killie &
> >     Philip in
> >      >      >      >>>
> >      > http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html,
and
> >      >      >     more
> >      >      >      >>> recently in
> >      >      >      >>>
> >      >      >
> >      >
> >     https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
> >     <
> https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
> >      >
> >       <
> https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
> >      >      >
> >      >
> >       <
> https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,
> >      >      >      >>> pointing to the python
implementation for
> comparison:
> >      >      >      >>>
> >      >      >
> >      >
> >
>
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> >      >      >      >>
> >      >      >      >> I think the analyses are correct,
but I doubt if a
> >     change
> >      >     to the
> >      >      >     default
> >      >      >      >> is likely to be accepted as it would
make it more
> >      >     difficult to
> >      >      >     reproduce
> >      >      >      >> older results.
> >      >      >      >>
> >      >      >      >> On the other hand, a contribution of
a new
> >     function like
> >      >      >     sample() but
> >      >      >      >> not suffering from the bias would be
good.  The
> >     normal way to
> >      >      >     make such
> >      >      >      >> a contribution is in a user
contributed package.
> >      >      >      >>
> >      >      >      >> By the way, R code illustrating the
bias is
> >     probably not very
> >      >      >     hard to
> >      >      >      >> put together.  I believe the bias
manifests itself
> in
> >      >     sample()
> >      >      >     producing
> >      >      >      >> values with two different
probabilities (instead
> >     of all equal
> >      >      >      >> probabilities).  Those may differ by
as much as
> >     one part in
> >      >      >     2^32.  It's
> >      >      >      >
> >      >      >      > According to Kellie and Philip, in the
attachment
> >     of the
> >      >     thread
> >      >      >      > referenced by Carl, "The maximum
ratio of selection
> >      >     probabilities can
> >      >      >      > get as large as 1.5 if n is just below
2^31".
> >      >      >
> >      >      >     Sorry, I didn't write very well.  I meant
to say that
> the
> >      >     difference in
> >      >      >     probabilities would be 2^-32, not that the
ratio of
> >      >     probabilities would
> >      >      >     be 1 + 2^-32.
> >      >      >
> >      >      >     By the way, I don't see the statement
giving the ratio
> as
> >      >     1.5, but
> >      >      >     maybe
> >      >      >     I was looking in the wrong place.  In Theorem
1 of the
> >     paper
> >      >     I was
> >      >      >     looking in the ratio was "1 + m 2^{-w +
1}".  In that
> >     formula
> >      >     m is your
> >      >      >     n.  If it is near 2^31, R uses w = 57 random
bits, so
> >     the ratio
> >      >      >     would be
> >      >      >     very, very small (one part in 2^25).
> >      >      >
> >      >      >     The worst case for R would happen when m  is
just below
> >      >     2^25, where w
> >      >      >     is at least 31 for the default generators.  In
that
> >     case the
> >      >     ratio
> >      >      >     could
> >      >      >     be about 1.03.
> >      >      >
> >      >      >     Duncan Murdoch
> >      >      >
> >      >      >
> >      >      >
> >      >      > --
> >      >      > Philip B. Stark | Associate Dean, Mathematical and
Physical
> >      >     Sciences |
> >      >      > Professor,  Department of Statistics |
> >      >      > University of California
> >      >      > Berkeley, CA 94720-3860 | 510-394-5077 |
> >      > statistics.berkeley.edu/~stark
> >     <http://statistics.berkeley.edu/%7Estark>
> >      >     <http://statistics.berkeley.edu/%7Estark>
> >      >      > <http://statistics.berkeley.edu/%7Estark> |
> >      >      > @philipbstark
> >      >      >
> >      >
> >      >
> >      >
> >      > --
> >      > Philip B. Stark | Associate Dean, Mathematical and Physical
> >     Sciences |
> >      > Professor,  Department of Statistics |
> >      > University of California
> >      > Berkeley, CA 94720-3860 | 510-394-5077 |
> >     statistics.berkeley.edu/~stark
> >     <http://statistics.berkeley.edu/%7Estark>
> >      > <http://statistics.berkeley.edu/%7Estark> |
> >      > @philipbstark
> >      >
> >
> >
> >
> > --
> > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> > Professor,  Department of Statistics |
> > University of California
> > Berkeley, CA 94720-3860 | 510-394-5077 |
statistics.berkeley.edu/~stark
> > <http://statistics.berkeley.edu/%7Estark> |
> > @philipbstark
> >
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>-- 
Sent from Gmail Mobile

	[[alternative HTML version deleted]]

Ben Bolker

2018-Sep-19 23:17 UTC

head link

[Rd] Bias in R's random integers?

A quick point of order here: arguing with Duncan in this forum is
helpful to expose ideas, but probably neither side will convince the
other; eventually, if you want this adopted in core R, you'll need to
convince an R-core member to pursue this fix.

  In the meantime, a good, well-tested implementation in a
user-contributed package (presumably written in C for speed) would be
enormously useful.  Volunteers ... ?



On 2018-09-19 04:19 PM, Duncan Murdoch wrote:> On 19/09/2018 3:52 PM, Philip B. Stark wrote:
>> Hi Duncan--
>>
>> Nice simulation!
>>
>> The absolute difference in probabilities is small, but the maximum
>> relative difference grows from something negligible to almost 2 as m
>> approaches 2**31.
>>
>> Because the L_1 distance between the uniform distribution on {1, ...,
>> m} and what you actually get is large, there have to be test functions
>> whose expectations are quite different under the two distributions. 
> 
> That is a mathematically true statement, but I suspect it is not very
> relevant.? Pseudo-random number generators always have test functions
> whose sample averages are quite different from the expectation under the
> true distribution.? Remember Von Neumann's "state of sin"
quote.? The
> bug in sample() just means it is easier to find such a function than it
> would otherwise be.
> 
> The practical question is whether such a function is likely to arise in
> practice or not.
> 
>> Whether those correspond to commonly used statistics or not, I have no
>> idea.
> 
> I am pretty confident that this bug rarely matters.
> 
>> Regarding backwards compatibility: as a user, I'd rather the
default
>> sample() do the best possible thing, and take an extra step to use
>> something like sample(..., legacy=TRUE) if I want to reproduce old
>> results.
> 
> I suspect there's a good chance the bug I discovered today (non-integer
> x values not being truncated) will be declared to be a feature, and the
> documentation will be changed.? Then the rejection sampling approach
> would need to be quite a bit more complicated.
> 
> I think a documentation warning about the accuracy of sampling
> probabilities would also be a sufficient fix here, and would be quite a
> bit less trouble than changing the default sample().? But as I said in
> my original post, a contribution of a function without this bug would be
> a nice addition.
> 
> Duncan Murdoch
> 
>>
>> Regards,
>> Philip
>>
>> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
>> <murdoch.duncan at gmail.com <mailto:murdoch.duncan at
gmail.com>> wrote:
>>
>> ??? On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>> ???? > No, the 2nd call only happens when m > 2**31. Here's
the code:
>>
>> ??? Yes, you're right. Sorry!
>>
>> ??? So the ratio really does come close to 2.? However, the difference
in
>> ??? probabilities between outcomes is still at most 2^-32 when m is
less
>> ??? than that cutoff.? That's not feasible to detect; the only
detectable
>> ??? difference would happen if some event was constructed to hold an
>> ??? abundance of outcomes with especially low (or especially high)
>> ??? probability.
>>
>> ??? As I said in my original post, it's probably not hard to
construct
>> such
>> ??? a thing, but as I've said more recently, it probably
wouldn't
>> happen by
>> ??? chance.? Here's one attempt to do it:
>>
>> ??? Call the values from unif_rand() "the unif_rand()
outcomes".? Call
>> the
>> ??? values from sample() the sample outcomes.
>>
>> ??? It would be easiest to see the error if half of the sample()
outcomes
>> ??? used two unif_rand() outcomes, and half used just one.? That would
>> mean
>> ??? m should be (2/3) * 2^32, but that's too big and would trigger
the
>> ??? other
>> ??? version.
>>
>> ??? So how about half use 2 unif_rands(), and half use 3?? That means m
>> ??? (2/5) * 2^32 = 1717986918.? A good guess is that sample() outcomes
>> ??? would
>> ??? alternate between the two possibilities, so our event could be even
>> ??? versus odd outcomes.
>>
>> ??? Let's try it:
>>
>> ???? ?> m <- (2/5)*2^32
>> ???? ?> m > 2^31
>> ??? [1] FALSE
>> ???? ?> x <- sample(m, 1000000, replace = TRUE)
>> ???? ?> table(x %% 2)
>>
>> ???? ? ? ? 0? ? ? 1
>> ??? 399850 600150
>>
>> ??? Since m is an even number, the true proportions of evens and odds
>> ??? should
>> ??? be exactly 0.5.? That's some pretty strong evidence of the bug
in the
>> ??? generator.? (Note that the ratio of the observed probabilities is
>> about
>> ??? 1.5, so I may not be the first person to have done this.)
>>
>> ??? I'm still not convinced that there has ever been a simulation
run
>> with
>> ??? detectable bias compared to Monte Carlo error unless it (like this
>> one)
>> ??? was designed specifically to show the problem.
>>
>> ??? Duncan Murdoch
>>
>> ???? >
>> ???? > (RNG.c, lines 793ff)
>> ???? >
>> ???? > double R_unif_index(double dn)
>> ???? > {
>> ???? >? ? ? double cut = INT_MAX;
>> ???? >
>> ???? >? ? ? switch(RNG_kind) {
>> ???? >? ? ? case KNUTH_TAOCP:
>> ???? >? ? ? case USER_UNIF:
>> ???? >? ? ? case KNUTH_TAOCP2:
>> ???? > cut = 33554431.0; /* 2^25 - 1 */
>> ???? > break;
>> ???? >? ? ? default:
>> ???? > break;
>> ???? >? ? ?}
>> ???? >
>> ???? >? ? ? double u = dn > cut ? ru() : unif_rand();
>> ???? >? ? ? return floor(dn * u);
>> ???? > }
>> ???? >
>> ???? > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
>> ??? <murdoch.duncan at gmail.com <mailto:murdoch.duncan at
gmail.com>
>> ???? > <mailto:murdoch.duncan at gmail.com
>> ??? <mailto:murdoch.duncan at gmail.com>>> wrote:
>> ???? >
>> ???? >? ? ?On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>> ???? >? ? ? > The 53 bits only encode at most 2^{32} possible
values,
>> ??? because the
>> ???? >? ? ? > source of the float is the output of a 32-bit PRNG
(the
>> ??? obsolete
>> ???? >? ? ?version
>> ???? >? ? ? > of MT). 53 bits isn't the relevant number here.
>> ???? >
>> ???? >? ? ?No, two calls to unif_rand() are used.? There are two 32
bit
>> ??? values,
>> ???? >? ? ?but
>> ???? >? ? ?some of the bits are thrown away.
>> ???? >
>> ???? >? ? ?Duncan Murdoch
>> ???? >
>> ???? >? ? ? >
>> ???? >? ? ? > The selection ratios can get close to 2. Computer
>> scientists
>> ???? >? ? ?don't do it
>> ???? >? ? ? > the way R does, for a reason.
>> ???? >? ? ? >
>> ???? >? ? ? > Regards,
>> ???? >? ? ? > Philip
>> ???? >? ? ? >
>> ???? >? ? ? > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>> ???? >? ? ?<murdoch.duncan at gmail.com <mailto:murdoch.duncan
at gmail.com>
>> ??? <mailto:murdoch.duncan at gmail.com <mailto:murdoch.duncan at
gmail.com>>
>> ???? >? ? ? > <mailto:murdoch.duncan at gmail.com
>> ??? <mailto:murdoch.duncan at gmail.com>
>> ???? >? ? ?<mailto:murdoch.duncan at gmail.com
>> ??? <mailto:murdoch.duncan at gmail.com>>>> wrote:
>> ???? >? ? ? >
>> ???? >? ? ? >? ? ?On 19/09/2018 9:09 AM, I?aki Ucar wrote:
>> ???? >? ? ? >? ? ? > El mi?., 19 sept. 2018 a las 14:43,
Duncan Murdoch
>> ???? >? ? ? >? ? ? > (<murdoch.duncan at gmail.com
>> ??? <mailto:murdoch.duncan at gmail.com>
>> ???? >? ? ?<mailto:murdoch.duncan at gmail.com
>> ??? <mailto:murdoch.duncan at gmail.com>>
<mailto:murdoch.duncan at gmail.com
>> ??? <mailto:murdoch.duncan at gmail.com>
>> ???? >? ? ?<mailto:murdoch.duncan at gmail.com
>> ??? <mailto:murdoch.duncan at gmail.com>>>>)
>> ???? >? ? ? >? ? ?escribi?:
>> ???? >? ? ? >? ? ? >>
>> ???? >? ? ? >? ? ? >> On 18/09/2018 5:46 PM, Carl Boettiger
wrote:
>> ???? >? ? ? >? ? ? >>> Dear list,
>> ???? >? ? ? >? ? ? >>>
>> ???? >? ? ? >? ? ? >>> It looks to me that R samples
random integers
>> ??? using an
>> ???? >? ? ? >? ? ?intuitive but biased
>> ???? >? ? ? >? ? ? >>> algorithm by going from a random
number on
>> [0,1) from
>> ???? >? ? ?the PRNG
>> ???? >? ? ? >? ? ?to a random
>> ???? >? ? ? >? ? ? >>> integer, e.g.
>> ???? >? ? ? >? ? ? >>>
>> ???? >? ? ? >
>> ???
https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>> ???? >? ? ? >? ? ? >>>
>> ???? >? ? ? >? ? ? >>> Many other languages use various
rejection
>> sampling
>> ???? >? ? ?approaches
>> ???? >? ? ? >? ? ?which
>> ???? >? ? ? >? ? ? >>> provide an unbiased method for
sampling, such as
>> ??? in Go,
>> ???? >? ? ?python,
>> ???? >? ? ? >? ? ?and others
>> ???? >? ? ? >? ? ? >>> described here:
>> https://arxiv.org/abs/1805.10941 (I
>> ???? >? ? ?believe the
>> ???? >? ? ? >? ? ?biased
>> ???? >? ? ? >? ? ? >>> algorithm currently used in R is
also described
>> ??? there).? I'm
>> ???? >? ? ? >? ? ?not an expert
>> ???? >? ? ? >? ? ? >>> in this area, but does it make
sense for the R to
>> ??? adopt
>> ???? >? ? ?one of
>> ???? >? ? ? >? ? ?the unbiased
>> ???? >? ? ? >? ? ? >>> random sample algorithms outlined
there and used
>> ??? in other
>> ???? >? ? ? >? ? ?languages?? Would
>> ???? >? ? ? >? ? ? >>> a patch providing such an
algorithm be welcome?
>> What
>> ???? >? ? ?concerns
>> ???? >? ? ? >? ? ?would need to
>> ???? >? ? ? >? ? ? >>> be addressed first?
>> ???? >? ? ? >? ? ? >>>
>> ???? >? ? ? >? ? ? >>> I believe this issue was also
raised by Killie &
>> ??? Philip in
>> ???? >? ? ? >? ? ? >>>
>> ???? > http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html,
and
>> ???? >? ? ? >? ? ?more
>> ???? >? ? ? >? ? ? >>> recently in
>> ???? >? ? ? >? ? ? >>>
>> ???? >? ? ? >
>> ???? >
>> ??? https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
>> ???
>>
<https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
>> ???? >?? ????
>>
?<https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
>> ???? >? ? ? >
>> ???? >???? ????
>>
?<https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,
>> ???? >? ? ? >? ? ? >>> pointing to the python
implementation for
>> comparison:
>> ???? >? ? ? >? ? ? >>>
>> ???? >? ? ? >
>> ???? >
>> ???
>>
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>>
>> ???? >? ? ? >? ? ? >>
>> ???? >? ? ? >? ? ? >> I think the analyses are correct, but
I doubt if a
>> ??? change
>> ???? >? ? ?to the
>> ???? >? ? ? >? ? ?default
>> ???? >? ? ? >? ? ? >> is likely to be accepted as it would
make it more
>> ???? >? ? ?difficult to
>> ???? >? ? ? >? ? ?reproduce
>> ???? >? ? ? >? ? ? >> older results.
>> ???? >? ? ? >? ? ? >>
>> ???? >? ? ? >? ? ? >> On the other hand, a contribution of
a new
>> ??? function like
>> ???? >? ? ? >? ? ?sample() but
>> ???? >? ? ? >? ? ? >> not suffering from the bias would be
good.? The
>> ??? normal way to
>> ???? >? ? ? >? ? ?make such
>> ???? >? ? ? >? ? ? >> a contribution is in a user
contributed package.
>> ???? >? ? ? >? ? ? >>
>> ???? >? ? ? >? ? ? >> By the way, R code illustrating the
bias is
>> ??? probably not very
>> ???? >? ? ? >? ? ?hard to
>> ???? >? ? ? >? ? ? >> put together.? I believe the bias
manifests
>> itself in
>> ???? >? ? ?sample()
>> ???? >? ? ? >? ? ?producing
>> ???? >? ? ? >? ? ? >> values with two different
probabilities (instead
>> ??? of all equal
>> ???? >? ? ? >? ? ? >> probabilities).? Those may differ by
as much as
>> ??? one part in
>> ???? >? ? ? >? ? ?2^32.? It's
>> ???? >? ? ? >? ? ? >
>> ???? >? ? ? >? ? ? > According to Kellie and Philip, in the
attachment
>> ??? of the
>> ???? >? ? ?thread
>> ???? >? ? ? >? ? ? > referenced by Carl, "The maximum
ratio of selection
>> ???? >? ? ?probabilities can
>> ???? >? ? ? >? ? ? > get as large as 1.5 if n is just below
2^31".
>> ???? >? ? ? >
>> ???? >? ? ? >? ? ?Sorry, I didn't write very well.? I meant
to say
>> that the
>> ???? >? ? ?difference in
>> ???? >? ? ? >? ? ?probabilities would be 2^-32, not that the
ratio of
>> ???? >? ? ?probabilities would
>> ???? >? ? ? >? ? ?be 1 + 2^-32.
>> ???? >? ? ? >
>> ???? >? ? ? >? ? ?By the way, I don't see the statement
giving the
>> ratio as
>> ???? >? ? ?1.5, but
>> ???? >? ? ? >? ? ?maybe
>> ???? >? ? ? >? ? ?I was looking in the wrong place.? In Theorem 1
of the
>> ??? paper
>> ???? >? ? ?I was
>> ???? >? ? ? >? ? ?looking in the ratio was "1 + m 2^{-w +
1}".? In that
>> ??? formula
>> ???? >? ? ?m is your
>> ???? >? ? ? >? ? ?n.? If it is near 2^31, R uses w = 57 random
bits, so
>> ??? the ratio
>> ???? >? ? ? >? ? ?would be
>> ???? >? ? ? >? ? ?very, very small (one part in 2^25).
>> ???? >? ? ? >
>> ???? >? ? ? >? ? ?The worst case for R would happen when m? is
just below
>> ???? >? ? ?2^25, where w
>> ???? >? ? ? >? ? ?is at least 31 for the default generators.? In
that
>> ??? case the
>> ???? >? ? ?ratio
>> ???? >? ? ? >? ? ?could
>> ???? >? ? ? >? ? ?be about 1.03.
>> ???? >? ? ? >
>> ???? >? ? ? >? ? ?Duncan Murdoch
>> ???? >? ? ? >
>> ???? >? ? ? >
>> ???? >? ? ? >
>> ???? >? ? ? > --
>> ???? >? ? ? > Philip B. Stark | Associate Dean, Mathematical and
Physical
>> ???? >? ? ?Sciences |
>> ???? >? ? ? > Professor, ?Department of Statistics |
>> ???? >? ? ? > University of California
>> ???? >? ? ? > Berkeley, CA 94720-3860 | 510-394-5077 |
>> ???? > statistics.berkeley.edu/~stark
>> ??? <http://statistics.berkeley.edu/%7Estark>
>> ???? >? ? ?<http://statistics.berkeley.edu/%7Estark>
>> ???? >? ? ? > <http://statistics.berkeley.edu/%7Estark> |
>> ???? >? ? ? > @philipbstark
>> ???? >? ? ? >
>> ???? >
>> ???? >
>> ???? >
>> ???? > --
>> ???? > Philip B. Stark | Associate Dean, Mathematical and Physical
>> ??? Sciences |
>> ???? > Professor, ?Department of Statistics |
>> ???? > University of California
>> ???? > Berkeley, CA 94720-3860 | 510-394-5077 |
>> ??? statistics.berkeley.edu/~stark
>> ??? <http://statistics.berkeley.edu/%7Estark>
>> ???? > <http://statistics.berkeley.edu/%7Estark> |
>> ???? > @philipbstark
>> ???? >
>>
>>
>>
>> --?
>> Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
>> Professor, ?Department of Statistics |
>> University of California
>> Berkeley, CA 94720-3860 | 510-394-5077 |
>> statistics.berkeley.edu/~stark
>> <http://statistics.berkeley.edu/%7Estark> |
>> @philipbstark
>>
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Seemingly Similar Threads

Search for more seemingly similar threads

R devel - Sep 2018 - Bias in R's random integers?

[Rd] Bias in R's random integers?

[Rd] Bias in R's random integers?

[Rd] Bias in R's random integers?

[Rd] Bias in R's random integers?

[Rd] Bias in R's random integers?

Seemingly Similar Threads