The 53 bits only encode at most 2^{32} possible values, because the source
of the float is the output of a 32-bit PRNG (the obsolete version of MT).
53 bits isn't the relevant number here.
The selection ratios can get close to 2. Computer scientists don't do it
the way R does, for a reason.
Regards,
Philip
On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> > On Wed., 19 Sept. 2018 at 14:43, Duncan Murdoch
> > (<murdoch.duncan at gmail.com>) wrote:
> >>
> >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> >>> Dear list,
> >>>
> >>> It looks to me that R samples random integers using an intuitive but
> >>> biased algorithm by going from a random number on [0,1) from the PRNG
> >>> to a random integer, e.g.
> >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> >>>
> >>> Many other languages use various rejection sampling approaches which
> >>> provide an unbiased method for sampling, such as in Go, python, and
> >>> others described here: https://arxiv.org/abs/1805.10941 (I believe the
> >>> biased algorithm currently used in R is also described there).  I'm not
> >>> an expert in this area, but does it make sense for R to adopt one of
> >>> the unbiased random sample algorithms outlined there and used in other
> >>> languages?  Would a patch providing such an algorithm be welcome? What
> >>> concerns would need to be addressed first?
> >>>
> >>> I believe this issue was also raised by Kellie & Philip in
> >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and more
> >>> recently in
> >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> >>> pointing to the python implementation for comparison:
> >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> >>
> >> I think the analyses are correct, but I doubt if a change to the default
> >> is likely to be accepted as it would make it more difficult to reproduce
> >> older results.
> >>
> >> On the other hand, a contribution of a new function like sample() but
> >> not suffering from the bias would be good.  The normal way to make such
> >> a contribution is in a user contributed package.
> >>
> >> By the way, R code illustrating the bias is probably not very hard to
> >> put together.  I believe the bias manifests itself in sample() producing
> >> values with two different probabilities (instead of all equal
> >> probabilities).  Those may differ by as much as one part in 2^32.  It's
> >
> > According to Kellie and Philip, in the attachment of the thread
> > referenced by Carl, "The maximum ratio of selection probabilities can
> > get as large as 1.5 if n is just below 2^31".
>
> Sorry, I didn't write very well.  I meant to say that the difference in
> probabilities would be 2^-32, not that the ratio of probabilities would
> be 1 + 2^-32.
>
> By the way, I don't see the statement giving the ratio as 1.5, but maybe
> I was looking in the wrong place.  In Theorem 1 of the paper I was
> looking in, the ratio was "1 + m 2^{-w + 1}".  In that formula m is your
> n.  If it is near 2^31, R uses w = 57 random bits, so the ratio would be
> very, very small (one part in 2^25).
>
> The worst case for R would happen when m is just below 2^25, where w
> is at least 31 for the default generators.  In that case the ratio could
> be about 1.03.
>
> Duncan Murdoch
>
-- 
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor,  Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark
On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> The 53 bits only encode at most 2^{32} possible values, because the
> source of the float is the output of a 32-bit PRNG (the obsolete version
> of MT). 53 bits isn't the relevant number here.

No, two calls to unif_rand() are used.  There are two 32 bit values, but
some of the bits are thrown away.

Duncan Murdoch
No, the 2nd call only happens when m > 2**31. Here's the code:
(RNG.c, lines 793ff)
double R_unif_index(double dn)
{
    double cut = INT_MAX;

    switch(RNG_kind) {
    case KNUTH_TAOCP:
    case USER_UNIF:
    case KNUTH_TAOCP2:
        cut = 33554431.0; /* 2^25 - 1 */
        break;
    default:
        break;
    }

    double u = dn > cut ? ru() : unif_rand();
    return floor(dn * u);
}
On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> > The 53 bits only encode at most 2^{32} possible values, because the
> > source of the float is the output of a 32-bit PRNG (the obsolete
> > version of MT). 53 bits isn't the relevant number here.
>
> No, two calls to unif_rand() are used.  There are two 32 bit values, but
> some of the bits are thrown away.
>
> Duncan Murdoch
-- 
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor,  Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark