It doesn't seem too hard to come up with plausible ways in which this
could give bad results. Suppose I sample rows from a large dataset,
maybe for bootstrapping. Suppose the rows are non-randomly ordered,
e.g. odd rows are males, even rows are females. Oops! Very
non-representative sample, bootstrap p values are garbage.

David

On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

> On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> > Hi Duncan--
>
> That is a mathematically true statement, but I suspect it is not very
> relevant.  Pseudo-random number generators always have test functions
> whose sample averages are quite different from the expectation under the
> true distribution.  Remember Von Neumann's "state of sin" quote.  The
> bug in sample() just means it is easier to find such a function than it
> would otherwise be.
>
> The practical question is whether such a function is likely to arise in
> practice or not.
>
> > Whether those correspond to commonly used statistics or not, I have no
> > idea.
>
> I am pretty confident that this bug rarely matters.
>
> > Regarding backwards compatibility: as a user, I'd rather the default
> > sample() do the best possible thing, and take an extra step to use
> > something like sample(..., legacy=TRUE) if I want to reproduce old
> > results.
>
> I suspect there's a good chance the bug I discovered today (non-integer
> x values not being truncated) will be declared to be a feature, and the
> documentation will be changed.  Then the rejection sampling approach
> would need to be quite a bit more complicated.
>
> I think a documentation warning about the accuracy of sampling
> probabilities would also be a sufficient fix here, and would be quite a
> bit less trouble than changing the default sample().  But as I said in
> my original post, a contribution of a function without this bug would be
> a nice addition.
>
> Duncan Murdoch
>
> > Regards,
> > Philip
> >
> > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> >
> > > On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> > > > No, the 2nd call only happens when m > 2**31. Here's the code:
> > >
> > > Yes, you're right. Sorry!
> > >
> > > So the ratio really does come close to 2.  However, the difference in
> > > probabilities between outcomes is still at most 2^-32 when m is less
> > > than that cutoff.  That's not feasible to detect; the only detectable
> > > difference would happen if some event was constructed to hold an
> > > abundance of outcomes with especially low (or especially high)
> > > probability.
> > >
> > > As I said in my original post, it's probably not hard to construct such
> > > a thing, but as I've said more recently, it probably wouldn't happen by
> > > chance.  Here's one attempt to do it:
> > >
> > > Call the values from unif_rand() "the unif_rand() outcomes".  Call the
> > > values from sample() the sample outcomes.
> > >
> > > It would be easiest to see the error if half of the sample() outcomes
> > > used two unif_rand() outcomes, and half used just one.  That would mean
> > > m should be (2/3) * 2^32, but that's too big and would trigger the
> > > other version.
> > >
> > > So how about half use 2 unif_rands(), and half use 3?  That means
> > > m = (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes
> > > would alternate between the two possibilities, so our event could be
> > > even versus odd outcomes.
> > >
> > > Let's try it:
> > >
> > >  > m <- (2/5)*2^32
> > >  > m > 2^31
> > > [1] FALSE
> > >  > x <- sample(m, 1000000, replace = TRUE)
> > >  > table(x %% 2)
> > >
> > >      0      1
> > > 399850 600150
> > >
> > > Since m is an even number, the true proportions of evens and odds
> > > should be exactly 0.5.  That's some pretty strong evidence of the bug
> > > in the generator.  (Note that the ratio of the observed probabilities
> > > is about 1.5, so I may not be the first person to have done this.)
> > >
> > > I'm still not convinced that there has ever been a simulation run with
> > > detectable bias compared to Monte Carlo error unless it (like this
> > > one) was designed specifically to show the problem.
> > >
> > > Duncan Murdoch
> > >
> > > > (RNG.c, lines 793ff)
> > > >
> > > > double R_unif_index(double dn)
> > > > {
> > > >     double cut = INT_MAX;
> > > >
> > > >     switch(RNG_kind) {
> > > >     case KNUTH_TAOCP:
> > > >     case USER_UNIF:
> > > >     case KNUTH_TAOCP2:
> > > >         cut = 33554431.0; /* 2^25 - 1 */
> > > >         break;
> > > >     default:
> > > >         break;
> > > >     }
> > > >
> > > >     double u = dn > cut ? ru() : unif_rand();
> > > >     return floor(dn * u);
> > > > }
> > > >
> > > > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> > > >
> > > > > On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> > > > > > The 53 bits only encode at most 2^{32} possible values, because
> > > > > > the source of the float is the output of a 32-bit PRNG (the
> > > > > > obsolete version of MT). 53 bits isn't the relevant number here.
> > > > >
> > > > > No, two calls to unif_rand() are used.  There are two 32 bit
> > > > > values, but some of the bits are thrown away.
> > > > >
> > > > > Duncan Murdoch
> > > > >
> > > > > > The selection ratios can get close to 2. Computer scientists
> > > > > > don't do it the way R does, for a reason.
> > > > > >
> > > > > > Regards,
> > > > > > Philip
> > > > > >
> > > > > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> > > > > >
> > > > > > > On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> > > > > > > > On Wed., 19 Sep. 2018 at 14:43, Duncan Murdoch
> > > > > > > > (<murdoch.duncan at gmail.com>) wrote:
> > > > > > > >>
> > > > > > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> > > > > > > >>> Dear list,
> > > > > > > >>>
> > > > > > > >>> It looks to me that R samples random integers using an
> > > > > > > >>> intuitive but biased algorithm by going from a random number
> > > > > > > >>> on [0,1) from the PRNG to a random integer, e.g.
> > > > > > > >>>
> > > > > > > >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> > > > > > > >>>
> > > > > > > >>> Many other languages use various rejection sampling
> > > > > > > >>> approaches which provide an unbiased method for sampling,
> > > > > > > >>> such as in Go, python, and others described here:
> > > > > > > >>> https://arxiv.org/abs/1805.10941 (I believe the biased
> > > > > > > >>> algorithm currently used in R is also described there).  I'm
> > > > > > > >>> not an expert in this area, but does it make sense for R to
> > > > > > > >>> adopt one of the unbiased random sample algorithms outlined
> > > > > > > >>> there and used in other languages?  Would a patch providing
> > > > > > > >>> such an algorithm be welcome?  What concerns would need to
> > > > > > > >>> be addressed first?
> > > > > > > >>>
> > > > > > > >>> I believe this issue was also raised by Kellie & Philip in
> > > > > > > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html,
> > > > > > > >>> and more recently in
> > > > > > > >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> > > > > > > >>> pointing to the python implementation for comparison:
> > > > > > > >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> > > > > > > >>
> > > > > > > >> I think the analyses are correct, but I doubt if a change to
> > > > > > > >> the default is likely to be accepted as it would make it more
> > > > > > > >> difficult to reproduce older results.
> > > > > > > >>
> > > > > > > >> On the other hand, a contribution of a new function like
> > > > > > > >> sample() but not suffering from the bias would be good.  The
> > > > > > > >> normal way to make such a contribution is in a user
> > > > > > > >> contributed package.
> > > > > > > >>
> > > > > > > >> By the way, R code illustrating the bias is probably not very
> > > > > > > >> hard to put together.  I believe the bias manifests itself in
> > > > > > > >> sample() producing values with two different probabilities
> > > > > > > >> (instead of all equal probabilities).  Those may differ by as
> > > > > > > >> much as one part in 2^32.  It's
> > > > > > > >
> > > > > > > > According to Kellie and Philip, in the attachment of the
> > > > > > > > thread referenced by Carl, "The maximum ratio of selection
> > > > > > > > probabilities can get as large as 1.5 if n is just below 2^31".
> > > > > > >
> > > > > > > Sorry, I didn't write very well.  I meant to say that the
> > > > > > > difference in probabilities would be 2^-32, not that the ratio
> > > > > > > of probabilities would be 1 + 2^-32.
> > > > > > >
> > > > > > > By the way, I don't see the statement giving the ratio as 1.5,
> > > > > > > but maybe I was looking in the wrong place.  In Theorem 1 of the
> > > > > > > paper I was looking in the ratio was "1 + m 2^{-w + 1}".  In
> > > > > > > that formula m is your n.  If it is near 2^31, R uses w = 57
> > > > > > > random bits, so the ratio would be very, very small (one part
> > > > > > > in 2^25).
> > > > > > >
> > > > > > > The worst case for R would happen when m is just below 2^25,
> > > > > > > where w is at least 31 for the default generators.  In that
> > > > > > > case the ratio could be about 1.03.
> > > > > > >
> > > > > > > Duncan Murdoch
> > > > > >
> > > > > > --
> > > > > > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> > > > > > Professor, Department of Statistics |
> > > > > > University of California
> > > > > > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> > > > > > @philipbstark
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Sent from Gmail Mobile
On 19/09/2018 5:57 PM, David Hugh-Jones wrote:
> It doesn't seem too hard to come up with plausible ways in which this
> could give bad results. Suppose I sample rows from a large dataset,
> maybe for bootstrapping. Suppose the rows are non-randomly ordered,
> e.g. odd rows are males, even rows are females. Oops! Very
> non-representative sample, bootstrap p values are garbage.

That would only happen if your dataset was exactly 1717986918 elements
in size.  (And in fact, it will be less extreme than I posted: I had x
set to 1717986918.4, as described in another thread.  If you use an
integer value you need a different pattern; add or subtract an element
or two and the pattern needed to see a problem changes drastically.)

But if you're sampling from a dataset of that exact size, then you
should worry about this bug.  Don't use sample().  Use the algorithm
that Carl described.

Duncan Murdoch
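[Editorial note: the "algorithm that Carl described" is rejection sampling. A minimal sketch under stated assumptions, not R's actual source: `rand32()` is a hypothetical stand-in for one 32-bit draw from the underlying generator (R's unif_rand() returns a double instead, so this illustrates the technique rather than a drop-in patch).]

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one 32-bit word from the underlying
 * generator; any source of uniform 32-bit words would do. */
static uint32_t rand32(void)
{
    return ((uint32_t) rand() << 16) ^ (uint32_t) rand();
}

/* Largest multiple of m that is <= 2^32.  Words landing at or above
 * this limit are rejected, so each value in [0, m) receives exactly
 * limit/m of the accepted words: all values are equally likely. */
static uint64_t reject_limit(uint32_t m)
{
    return ((uint64_t) 1 << 32) / m * m;
}

/* Unbiased integer in [0, m) by classical rejection sampling. */
static uint32_t unbiased_index(uint32_t m)
{
    uint64_t limit = reject_limit(m);
    uint32_t r;
    do {
        r = rand32();       /* redraw until below the limit */
    } while (r >= limit);
    return r % m;
}
```

Fewer than m of the 2^32 words are ever rejected, so for m far below 2^32 the expected number of extra draws is negligible; the cost is the occasional redraw, in exchange for exactly equal probabilities.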
The same issue occurs in walker_ProbSampleReplace() in random.c, lines 386-387: rU = unif_rand() * n; k = (int) rU; Cheers, Philip On Wed, Sep 19, 2018 at 3:08 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:> On 19/09/2018 5:57 PM, David Hugh-Jones wrote: > > > > It doesn't seem too hard to come up with plausible ways in which this > > could give bad results. Suppose I sample rows from a large dataset, > > maybe for bootstrapping. Suppose the rows are non-randomly ordered, e.g. > > odd rows are males, even rows are females. Oops! Very non-representative > > sample, bootstrap p values are garbage. > > That would only happen if your dataset was exactly 1717986918 elements > in size. (And in fact, it will be less extreme than I posted: I had x > set to 1717986918.4, as described in another thread. If you use an > integer value you need a different pattern; add or subtract an element > or two and the pattern needed to see a problem changes drastically.) > > But if you're sampling from a dataset of that exact size, then you > should worry about this bug. Don't use sample(). Use the algorithm that > Carl described. > > Duncan Murdoch > > > > > David > > > > On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.duncan at gmail.com > > <mailto:murdoch.duncan at gmail.com>> wrote: > > > > On 19/09/2018 3:52 PM, Philip B. Stark wrote: > > > Hi Duncan-- > > > > > > > > > > That is a mathematically true statement, but I suspect it is not very > > relevant. Pseudo-random number generators always have test functions > > whose sample averages are quite different from the expectation under > > the > > true distribution. Remember Von Neumann's "state of sin" quote. The > > bug in sample() just means it is easier to find such a function than > it > > would otherwise be. > > > > The practical question is whether such a function is likely to arise > in > > practice or not. > > > > > Whether those correspond to commonly used statistics or not, I > > have no > > > idea. 
> > > > I am pretty confident that this bug rarely matters. > > > > > Regarding backwards compatibility: as a user, I'd rather the > default > > > sample() do the best possible thing, and take an extra step to use > > > something like sample(..., legacy=TRUE) if I want to reproduce > > old results. > > > > I suspect there's a good chance the bug I discovered today > (non-integer > > x values not being truncated) will be declared to be a feature, and > the > > documentation will be changed. Then the rejection sampling approach > > would need to be quite a bit more complicated. > > > > I think a documentation warning about the accuracy of sampling > > probabilities would also be a sufficient fix here, and would be > quite a > > bit less trouble than changing the default sample(). But as I said > in > > my original post, a contribution of a function without this bug > > would be > > a nice addition. > > > > Duncan Murdoch > > > > > > > > Regards, > > > Philip > > > > > > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch > > <murdoch.duncan at gmail.com <mailto:murdoch.duncan at gmail.com> > > > <mailto:murdoch.duncan at gmail.com > > <mailto:murdoch.duncan at gmail.com>>> wrote: > > > > > > On 19/09/2018 12:23 PM, Philip B. Stark wrote: > > > > No, the 2nd call only happens when m > 2**31. Here's the > code: > > > > > > Yes, you're right. Sorry! > > > > > > So the ratio really does come close to 2. However, the > > difference in > > > probabilities between outcomes is still at most 2^-32 when m > > is less > > > than that cutoff. That's not feasible to detect; the only > > detectable > > > difference would happen if some event was constructed to hold > an > > > abundance of outcomes with especially low (or especially high) > > > probability. > > > > > > As I said in my original post, it's probably not hard to > > construct such > > > a thing, but as I've said more recently, it probably wouldn't > > happen by > > > chance. 
> > >     Here's one attempt to do it:
> > >
> > >     Call the values from unif_rand() "the unif_rand() outcomes".  Call the
> > >     values from sample() the sample outcomes.
> > >
> > >     It would be easiest to see the error if half of the sample() outcomes
> > >     used two unif_rand() outcomes, and half used just one.  That would mean
> > >     m should be (2/3) * 2^32, but that's too big and would trigger the
> > >     other version.
> > >
> > >     So how about half use 2 unif_rands(), and half use 3?  That means
> > >     m = (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes
> > >     would alternate between the two possibilities, so our event could be
> > >     even versus odd outcomes.
> > >
> > >     Let's try it:
> > >
> > >     > m <- (2/5)*2^32
> > >     > m > 2^31
> > >     [1] FALSE
> > >     > x <- sample(m, 1000000, replace = TRUE)
> > >     > table(x %% 2)
> > >
> > >          0      1
> > >     399850 600150
> > >
> > >     Since m is an even number, the true proportions of evens and odds
> > >     should be exactly 0.5.  That's some pretty strong evidence of the bug
> > >     in the generator.  (Note that the ratio of the observed probabilities
> > >     is about 1.5, so I may not be the first person to have done this.)
> > >
> > >     I'm still not convinced that there has ever been a simulation run with
> > >     detectable bias compared to Monte Carlo error unless it (like this one)
> > >     was designed specifically to show the problem.
> > >
> > >     Duncan Murdoch
> > >
> > > > (RNG.c, lines 793ff)
> > > >
> > > > double R_unif_index(double dn)
> > > > {
> > > >     double cut = INT_MAX;
> > > >
> > > >     switch(RNG_kind) {
> > > >     case KNUTH_TAOCP:
> > > >     case USER_UNIF:
> > > >     case KNUTH_TAOCP2:
> > > >         cut = 33554431.0; /* 2^25 - 1 */
> > > >         break;
> > > >     default:
> > > >         break;
> > > >     }
> > > >
> > > >     double u = dn > cut ? ru() : unif_rand();
> > > >     return floor(dn * u);
> > > > }
> > > >
> > > > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
> > > > <murdoch.duncan at gmail.com> wrote:
> > > >
> > > >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> > > >     > The 53 bits only encode at most 2^{32} possible values, because the
> > > >     > source of the float is the output of a 32-bit PRNG (the obsolete
> > > >     > version of MT).  53 bits isn't the relevant number here.
> > > >
> > > >     No, two calls to unif_rand() are used.  There are two 32 bit values,
> > > >     but some of the bits are thrown away.
> > > >
> > > >     Duncan Murdoch
> > > >
> > > > > The selection ratios can get close to 2.  Computer scientists don't do
> > > > > it the way R does, for a reason.
> > > > >
> > > > > Regards,
> > > > > Philip
> > > > >
> > > > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> > > > > <murdoch.duncan at gmail.com> wrote:
> > > > >
> > > > >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> > > > >     > On Wed., 19 Sep 2018 at 14:43, Duncan Murdoch
> > > > >     > (<murdoch.duncan at gmail.com>) wrote:
> > > > >     >>
> > > > >     >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> > > > >     >>> Dear list,
> > > > >     >>>
> > > > >     >>> It looks to me that R samples random integers using an intuitive
> > > > >     >>> but biased algorithm by going from a random number on [0,1) from
> > > > >     >>> the PRNG to a random integer, e.g.
> > > > >     >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> > > > >     >>>
> > > > >     >>> Many other languages use various rejection sampling approaches
> > > > >     >>> which provide an unbiased method for sampling, such as in Go,
> > > > >     >>> python, and others described here: https://arxiv.org/abs/1805.10941
> > > > >     >>> (I believe the biased algorithm currently used in R is also
> > > > >     >>> described there).  I'm not an expert in this area, but does it
> > > > >     >>> make sense for R to adopt one of the unbiased random sample
> > > > >     >>> algorithms outlined there and used in other languages?  Would a
> > > > >     >>> patch providing such an algorithm be welcome?
> > > > >     >>> What concerns would need to be addressed first?
> > > > >     >>>
> > > > >     >>> I believe this issue was also raised by Kellie & Philip in
> > > > >     >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
> > > > >     >>> more recently in
> > > > >     >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> > > > >     >>> pointing to the python implementation for comparison:
> > > > >     >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> > > > >     >>
> > > > >     >> I think the analyses are correct, but I doubt if a change to the
> > > > >     >> default is likely to be accepted as it would make it more
> > > > >     >> difficult to reproduce older results.
> > > > >     >>
> > > > >     >> On the other hand, a contribution of a new function like sample()
> > > > >     >> but not suffering from the bias would be good.  The normal way to
> > > > >     >> make such a contribution is in a user contributed package.
> > > > >     >>
> > > > >     >> By the way, R code illustrating the bias is probably not very
> > > > >     >> hard to put together.  I believe the bias manifests itself in
> > > > >     >> sample() producing values with two different probabilities
> > > > >     >> (instead of all equal probabilities).  Those may differ by as
> > > > >     >> much as one part in 2^32.
> > > > >     >> It's
> > > > >     >
> > > > >     > According to Kellie and Philip, in the attachment of the thread
> > > > >     > referenced by Carl, "The maximum ratio of selection probabilities
> > > > >     > can get as large as 1.5 if n is just below 2^31".
> > > > >
> > > > >     Sorry, I didn't write very well.  I meant to say that the difference
> > > > >     in probabilities would be 2^-32, not that the ratio of probabilities
> > > > >     would be 1 + 2^-32.
> > > > >
> > > > >     By the way, I don't see the statement giving the ratio as 1.5, but
> > > > >     maybe I was looking in the wrong place.  In Theorem 1 of the paper I
> > > > >     was looking in the ratio was "1 + m 2^{-w + 1}".  In that formula m
> > > > >     is your n.  If it is near 2^31, R uses w = 57 random bits, so the
> > > > >     ratio would be very, very small (one part in 2^25).
> > > > >
> > > > >     The worst case for R would happen when m is just below 2^25, where w
> > > > >     is at least 31 for the default generators.  In that case the ratio
> > > > >     could be about 1.03.
> > > > >
> > > > >     Duncan Murdoch
> > > > >
> > > > > --
> > > > > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> > > > > Professor, Department of Statistics |
> > > > > University of California
> > > > > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> > > > > @philipbstark
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Sent from Gmail Mobile

--
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor, Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark