thr3ads.net - R devel - [Rd] Bias in R's random integers? [Sep 2018]

If this information is useful, please help other people find it:
Share via:

Steve Grubb

2018-Sep-20 21:09 UTC

[Rd] Bias in R's random integers?

Hello,

On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch
wrote:> On 20/09/2018 6:59 AM, Ralf Stubner wrote:
> > On 9/20/18 1:43 AM, Carl Boettiger wrote:
> >> For a well-tested C algorithm, based on my reading of Lemire, the
> >> unbiased "algorithm 3" in
https://arxiv.org/abs/1805.10941 is part
> >> already of the C standard library in OpenBSD and macOS (as
> >> arc4random_uniform), and in the GNU standard library.  Lemire also
> >> provides C++ code in the appendix of his piece for both this and
the
> >> faster "nearly divisionless" algorithm.
> >> 
> >> It would be excellent if any R core members were interested in
> >> considering bindings to these algorithms as a patch, or might
express
> >> expectations for how that patch would have to operate (e.g. re
Duncan's
> >> comment about non-integer arguments to sample size).  Otherwise,
an R
> >> package binding seems like a good starting point, but I'm not
the right
> >> volunteer.
> > 
> > It is difficult to do this in a package, since R does not provide
access
> > to the random bits generated by the RNG. Only a float in (0,1) is
> > available via unif_rand().
> 
> I believe it is safe to multiply the unif_rand() value by 2^32, and take
> the whole number part as an unsigned 32 bit integer.  Depending on the
> RNG in use, that will give at least 25 random bits.  (The low order bits
> are the questionable ones.  25 is just a guess, not a guarantee.)
> 
> However, if one is willing to use an external
> 
> > RNG, it is of course possible. After reading about Lemire's work
[1], I
> > had planned to integrate such an unbiased sampling scheme into the
dqrng
> > package, which I have now started. [2]
> > 
> > Using Duncan's example, the results look much better:
> >> library(dqrng)
> >> m <- (2/5)*2^32
> >> y <- dqsample(m, 1000000, replace = TRUE)
> >> table(y %% 2)
> >> 
> >       0      1
> > 
> > 500252 499748
> 
> Another useful diagnostic is
> 
>    plot(density(y[y %% 2 == 0]))
> 
> Obviously that should give a more or less uniform density, but for
> values near m, the default sample() gives some nice pretty pictures of
> quite non-uniform densities.
> 
> By the way, there are actually quite a few examples of very large m
> besides m = (2/5)*2^32 where performance of sample() is noticeably bad.
> You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 +
2a)
> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
> 3a)*2^32, etc.
> 
> So perhaps I'm starting to be convinced that the default sample()
should
> be fixed.
I find this discussion fascinating. I normally test random numbers in 
different languages every now and again using various methods. One simple 
check that I do is to use Michal Zalewski's method when he studied Strange 
Attractors and Initial TCP/IP Sequence Numbers:

http://lcamtuf.coredump.cx/newtcp/
https://pdfs.semanticscholar.org/
adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf

The technique works by mapping the dynamics of the generated numbers into a 
three-dimensional phase space. This is then plotted in a graph so that you 
can visually see if something odd is going on.

I used   runif(10000, min = 0, max = 65535)  to get a set of numbers. This is 
the resulting plot that was generated from R's numbers using this technique:

http://people.redhat.com/sgrubb/files/r-random.jpg

And for comparison this was generated by collecting the same number of 
samples from the bash shell:

http://people.redhat.com/sgrubb/files/bash-random.jpg

The net result is that it shows some banding in the R generated random 
numbers where bash has uniform random numbers with no discernible pattern.

Best Regards,
-Steve

Steve Grubb

2018-Sep-21 13:50 UTC

head link

[Rd] Bias in R's random integers?

Hello,

Top posting. Several people have asked about the code to replicate my 
results. I have cleaned up the code to remove an x/y coordinate bias for 
displaying the results directly on a 640 x 480 VGA adapter. You can find the 
code here:

http://people.redhat.com/sgrubb/files/vseq.c

To collect R samples:
X <- runif(10000, min = 0, max = 65535)
write.table(X, file = "~/r-rand.txt", sep = "\n", row.names
= FALSE)

Then:
cat ~/r-rand.txt | ./vseq > ~/r-rand.csv

And then to create the chart:

library(ggplot2);
num.csv <- read.csv("~/random.csv", header=T)
qplot(X, Y, data=num.csv);

Hope this helps sort this out.

Best Regards,
-Steve

On Thursday, September 20, 2018 5:09:23 PM EDT Steve Grubb
wrote:> On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:
> > On 20/09/2018 6:59 AM, Ralf Stubner wrote:
> > > On 9/20/18 1:43 AM, Carl Boettiger wrote:
> > >> For a well-tested C algorithm, based on my reading of Lemire,
the
> > >> unbiased "algorithm 3" in
https://arxiv.org/abs/1805.10941 is part
> > >> already of the C standard library in OpenBSD and macOS (as
> > >> arc4random_uniform), and in the GNU standard library.  Lemire
also
> > >> provides C++ code in the appendix of his piece for both this
and the
> > >> faster "nearly divisionless" algorithm.
> > >> 
> > >> It would be excellent if any R core members were interested
in
> > >> considering bindings to these algorithms as a patch, or might
express
> > >> expectations for how that patch would have to operate (e.g.
re
> > >> Duncan's
> > >> comment about non-integer arguments to sample size). 
Otherwise, an R
> > >> package binding seems like a good starting point, but I'm
not the
> > >> right
> > >> volunteer.
> > > 
> > > It is difficult to do this in a package, since R does not provide
> > > access
> > > to the random bits generated by the RNG. Only a float in (0,1) is
> > > available via unif_rand().
> > 
> > I believe it is safe to multiply the unif_rand() value by 2^32, and
take
> > the whole number part as an unsigned 32 bit integer.  Depending on the
> > RNG in use, that will give at least 25 random bits.  (The low order
bits
> > are the questionable ones.  25 is just a guess, not a guarantee.)
> > 
> > However, if one is willing to use an external
> > 
> > > RNG, it is of course possible. After reading about Lemire's
work [1], I
> > > had planned to integrate such an unbiased sampling scheme into
the
> > > dqrng
> > > package, which I have now started. [2]
> > > 
> > > Using Duncan's example, the results look much better:
> > >> library(dqrng)
> > >> m <- (2/5)*2^32
> > >> y <- dqsample(m, 1000000, replace = TRUE)
> > >> table(y %% 2)
> > >> 
> > >       0      1
> > > 
> > > 500252 499748
> > 
> > Another useful diagnostic is
> > 
> >    plot(density(y[y %% 2 == 0]))
> > 
> > Obviously that should give a more or less uniform density, but for
> > values near m, the default sample() gives some nice pretty pictures of
> > quite non-uniform densities.
> > 
> > By the way, there are actually quite a few examples of very large m
> > besides m = (2/5)*2^32 where performance of sample() is noticeably
bad.
> > You'll see problems in y %% 2 for any integer a > 1 with m =
2/(1 + 2a)
> > * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
> > 3a)*2^32, etc.
> > 
> > So perhaps I'm starting to be convinced that the default sample()
should
> > be fixed.
> 
> I find this discussion fascinating. I normally test random numbers in
> different languages every now and again using various methods. One simple
> check that I do is to use Michal Zalewski's method when he studied
Strange
> Attractors and Initial TCP/IP Sequence Numbers:
> 
> http://lcamtuf.coredump.cx/newtcp/
> https://pdfs.semanticscholar.org/
> adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf
> 
> The technique works by mapping the dynamics of the generated numbers into a
> three-dimensional phase space. This is then plotted in a graph so that you
> can visually see if something odd is going on.
> 
> I used   runif(10000, min = 0, max = 65535)  to get a set of numbers. This
> is the resulting plot that was generated from R's numbers using this
> technique:
> 
> http://people.redhat.com/sgrubb/files/r-random.jpg
> 
> And for comparison this was generated by collecting the same number of
> samples from the bash shell:
> 
> http://people.redhat.com/sgrubb/files/bash-random.jpg
> 
> The net result is that it shows some banding in the R generated random
> numbers where bash has uniform random numbers with no discernible pattern.
> 
> Best Regards,
> -Steve
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Dirk Eddelbuettel

2018-Sep-21 14:20 UTC

head link

[Rd] Bias in R's random integers?

On 21 September 2018 at 09:50, Steve Grubb wrote:
| Hello,
| 
| Top posting. Several people have asked about the code to replicate my 
| results. I have cleaned up the code to remove an x/y coordinate bias for 
| displaying the results directly on a 640 x 480 VGA adapter. You can find the 
| code here:
| 
| http://people.redhat.com/sgrubb/files/vseq.c
| 
| To collect R samples:
| X <- runif(10000, min = 0, max = 65535)
| write.table(X, file = "~/r-rand.txt", sep = "\n",
row.names = FALSE)
| 
| Then:
| cat ~/r-rand.txt | ./vseq > ~/r-rand.csv
| 
| And then to create the chart:
| 
| library(ggplot2);
| num.csv <- read.csv("~/random.csv", header=T)
| qplot(X, Y, data=num.csv);
| 
| Hope this helps sort this out.

Here is a simpler version in one file, combining the operations in a bit of
C++ and also an auto-executed piece of R to create the data, call the
transformation you created, and plot.  Store as eg /tmp/stevegrubb.cpp
and then call   Rcpp::sourceCpp("/tmp/stevegrubb.cpp")

Dirk

--- snip --------------------------------------------------------------------
#include <Rcpp.h>

#define SCALING 0.5

// [[Rcpp::export]]
Rcpp::NumericMatrix stevePlot(Rcpp::NumericVector itable) {
  size_t cnt = itable.size();
  Rcpp::NumericMatrix M(cnt,2);
  size_t num = 0;
  double nth, n1th = 0, n2th = 0, n3th = 0;
  double x, y;
  double usex, usey, pha = 0;

  
  usex = sin(pha);
  usey = cos(pha);
  while (num < cnt) {
    double x1, y1, z1;
    nth = itable[num];
    x1 = (nth - n1th)  * SCALING;
    y1 = (n1th - n2th) * SCALING;
    z1 = (n2th - n3th) * SCALING;
    x = (x1*usey+z1*usex);
    y = y1 + (z1*usey-x1*usex) / 2;
    if (num > 4) {
      M(num,0) = x;
      M(num,1) = y;
    } else {
      M(num,0) = M(num,1) = NA_REAL;
    }
    num++;
    n3th=n2th; n2th=n1th; n1th=nth;
  }
  Rcpp::colnames(M) = Rcpp::CharacterVector::create("X",
"Y");
  return(M);
}

/*** R
library(ggplot2);
Xin <- runif(10000, min = 0, max = 65535)
Xout <- stevePlot(Xin);
qplot(X, Y, data=as.data.frame(Xout));
*/
--- snip --------------------------------------------------------------------


-- 
http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org

Dirk Eddelbuettel

2018-Sep-21 14:25 UTC

head link

[Rd] Bias in R's random integers?

Slightly nicer version:

--- snip --------------------------------------------------------------------
#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::DataFrame stevePlot(Rcpp::NumericVector itable) {
  size_t cnt = itable.size(), num = 0;
  double nth, n1th = 0, n2th = 0, n3th = 0;
  double x, y, usex, usey, pha = 0;
  std::vector<double> xv, yv;
  const double scaling = 0.5;
  usex = sin(pha);
  usey = cos(pha);
  while (num < cnt) {
    double x1, y1, z1;
    nth = itable[num];
    num++;
    x1 = (nth - n1th)  * scaling;
    y1 = (n1th - n2th) * scaling;
    z1 = (n2th - n3th) * scaling;
    x = (x1*usey+z1*usex);
    y = y1 + (z1*usey-x1*usex) / 2;
    if (num > 4) {
      xv.push_back(x);
      yv.push_back(y);
    }
    n3th=n2th; n2th=n1th; n1th=nth;
  }
  return(Rcpp::DataFrame::create(Rcpp::Named("X") = xv,
                                 Rcpp::Named("Y") = yv));
}

/*** R
library(ggplot2)
Xin <- runif(10000, min = 0, max = 65535)
Xout <- stevePlot(Xin)
qplot(X, Y, data=Xout)
*/
--- snip --------------------------------------------------------------------

Still just   Rcpp::sourceCpp("filenamehere.cpp")   it.

Dirk

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org

Tierney, Luke

2018-Sep-21 16:38 UTC

head link

[Rd] Bias in R's random integers?

Not sure what should happen theoretically for the code in vseq.c, but
I see the same pattern with the R generators I tried (default,
Super-Duper, and L'Ecuyer) and with with bash $RANDOM using

N <- 10000
X1 <- replicate(N, as.integer(system("bash -c 'echo
$RANDOM'", intern = TRUE)))
X2 <- replicate(N, as.integer(system("bash -c 'echo
$RANDOM'", intern = TRUE)))
X <- X1 + 2 ^ 15 * (X2 > 2^14)

and with numbers from random.org

library(random)
X <- randomNumbers(N, 0, 2^16-1, col = 1)

So I'm not convinced there is an issue.

Best,

luke

On Fri, 21 Sep 2018, Steve Grubb wrote:
> Hello,
>
> Top posting. Several people have asked about the code to replicate my
> results. I have cleaned up the code to remove an x/y coordinate bias for
> displaying the results directly on a 640 x 480 VGA adapter. You can find
the
> code here:
>
> http://people.redhat.com/sgrubb/files/vseq.c
>
> To collect R samples:
> X <- runif(10000, min = 0, max = 65535)
> write.table(X, file = "~/r-rand.txt", sep = "\n",
row.names = FALSE)
>
> Then:
> cat ~/r-rand.txt | ./vseq > ~/r-rand.csv
>
> And then to create the chart:
>
> library(ggplot2);
> num.csv <- read.csv("~/random.csv", header=T)
> qplot(X, Y, data=num.csv);
>
> Hope this helps sort this out.
>
> Best Regards,
> -Steve
>
> On Thursday, September 20, 2018 5:09:23 PM EDT Steve Grubb wrote:
>> On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:
>>> On 20/09/2018 6:59 AM, Ralf Stubner wrote:
>>>> On 9/20/18 1:43 AM, Carl Boettiger wrote:
>>>>> For a well-tested C algorithm, based on my reading of
Lemire, the
>>>>> unbiased "algorithm 3" in
https://arxiv.org/abs/1805.10941 is part
>>>>> already of the C standard library in OpenBSD and macOS (as
>>>>> arc4random_uniform), and in the GNU standard library. 
Lemire also
>>>>> provides C++ code in the appendix of his piece for both
this and the
>>>>> faster "nearly divisionless" algorithm.
>>>>>
>>>>> It would be excellent if any R core members were interested
in
>>>>> considering bindings to these algorithms as a patch, or
might express
>>>>> expectations for how that patch would have to operate (e.g.
re
>>>>> Duncan's
>>>>> comment about non-integer arguments to sample size). 
Otherwise, an R
>>>>> package binding seems like a good starting point, but
I'm not the
>>>>> right
>>>>> volunteer.
>>>>
>>>> It is difficult to do this in a package, since R does not
provide
>>>> access
>>>> to the random bits generated by the RNG. Only a float in (0,1)
is
>>>> available via unif_rand().
>>>
>>> I believe it is safe to multiply the unif_rand() value by 2^32, and
take
>>> the whole number part as an unsigned 32 bit integer.  Depending on
the
>>> RNG in use, that will give at least 25 random bits.  (The low order
bits
>>> are the questionable ones.  25 is just a guess, not a guarantee.)
>>>
>>> However, if one is willing to use an external
>>>
>>>> RNG, it is of course possible. After reading about Lemire's
work [1], I
>>>> had planned to integrate such an unbiased sampling scheme into
the
>>>> dqrng
>>>> package, which I have now started. [2]
>>>>
>>>> Using Duncan's example, the results look much better:
>>>>> library(dqrng)
>>>>> m <- (2/5)*2^32
>>>>> y <- dqsample(m, 1000000, replace = TRUE)
>>>>> table(y %% 2)
>>>>>
>>>>       0      1
>>>>
>>>> 500252 499748
>>>
>>> Another useful diagnostic is
>>>
>>>    plot(density(y[y %% 2 == 0]))
>>>
>>> Obviously that should give a more or less uniform density, but for
>>> values near m, the default sample() gives some nice pretty pictures
of
>>> quite non-uniform densities.
>>>
>>> By the way, there are actually quite a few examples of very large m
>>> besides m = (2/5)*2^32 where performance of sample() is noticeably
bad.
>>> You'll see problems in y %% 2 for any integer a > 1 with m =
2/(1 + 2a)
>>> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
>>> 3a)*2^32, etc.
>>>
>>> So perhaps I'm starting to be convinced that the default
sample() should
>>> be fixed.
>>
>> I find this discussion fascinating. I normally test random numbers in
>> different languages every now and again using various methods. One
simple
>> check that I do is to use Michal Zalewski's method when he studied
Strange
>> Attractors and Initial TCP/IP Sequence Numbers:
>>
>> http://lcamtuf.coredump.cx/newtcp/
>> https://pdfs.semanticscholar.org/
>> adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf
>>
>> The technique works by mapping the dynamics of the generated numbers
into a
>> three-dimensional phase space. This is then plotted in a graph so that
you
>> can visually see if something odd is going on.
>>
>> I used   runif(10000, min = 0, max = 65535)  to get a set of numbers.
This
>> is the resulting plot that was generated from R's numbers using
this
>> technique:
>>
>> http://people.redhat.com/sgrubb/files/r-random.jpg
>>
>> And for comparison this was generated by collecting the same number of
>> samples from the bash shell:
>>
>> http://people.redhat.com/sgrubb/files/bash-random.jpg
>>
>> The net result is that it shows some banding in the R generated random
>> numbers where bash has uniform random numbers with no discernible
pattern.
>>
>> Best Regards,
>> -Steve
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

Reasonably Related Threads

Search for more maybe matching threads

R devel - Sep 2018 - Bias in R's random integers?

[Rd] Bias in R's random integers?

[Rd] Bias in R's random integers?

[Rd] Bias in R's random integers?

[Rd] Bias in R's random integers?

[Rd] Bias in R's random integers?

Reasonably Related Threads