Hello, On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:> On 20/09/2018 6:59 AM, Ralf Stubner wrote: > > On 9/20/18 1:43 AM, Carl Boettiger wrote: > >> For a well-tested C algorithm, based on my reading of Lemire, the > >> unbiased "algorithm 3" in https://arxiv.org/abs/1805.10941 is part > >> already of the C standard library in OpenBSD and macOS (as > >> arc4random_uniform), and in the GNU standard library. Lemire also > >> provides C++ code in the appendix of his piece for both this and the > >> faster "nearly divisionless" algorithm. > >> > >> It would be excellent if any R core members were interested in > >> considering bindings to these algorithms as a patch, or might express > >> expectations for how that patch would have to operate (e.g. re Duncan's > >> comment about non-integer arguments to sample size). Otherwise, an R > >> package binding seems like a good starting point, but I'm not the right > >> volunteer. > > > > It is difficult to do this in a package, since R does not provide access > > to the random bits generated by the RNG. Only a float in (0,1) is > > available via unif_rand(). > > I believe it is safe to multiply the unif_rand() value by 2^32, and take > the whole number part as an unsigned 32 bit integer. Depending on the > RNG in use, that will give at least 25 random bits. (The low order bits > are the questionable ones. 25 is just a guess, not a guarantee.) > > However, if one is willing to use an external > > > RNG, it is of course possible. After reading about Lemire's work [1], I > > had planned to integrate such an unbiased sampling scheme into the dqrng > > package, which I have now started. [2] > > > > Using Duncan's example, the results look much better: > >> library(dqrng) > >> m <- (2/5)*2^32 > >> y <- dqsample(m, 1000000, replace = TRUE) > >> table(y %% 2) > >> > > 0 1 > > > > 500252 499748 > > Another useful diagnostic is > > plot(density(y[y %% 2 == 0])) > > Obviously that should give a more or less uniform density, but for > values near m, the default sample() gives some nice pretty pictures of > quite non-uniform densities. > > By the way, there are actually quite a few examples of very large m > besides m = (2/5)*2^32 where performance of sample() is noticeably bad. > You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a) > * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 + > 3a)*2^32, etc. > > So perhaps I'm starting to be convinced that the default sample() should > be fixed.I find this discussion fascinating. I normally test random numbers in different languages every now and again using various methods. One simple check that I do is to use Michal Zalewski's method when he studied Strange Attractors and Initial TCP/IP Sequence Numbers: http://lcamtuf.coredump.cx/newtcp/ https://pdfs.semanticscholar.org/ adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf The technique works by mapping the dynamics of the generated numbers into a three-dimensional phase space. This is then plotted in a graph so that you can visually see if something odd is going on. I used runif(10000, min = 0, max = 65535) to get a set of numbers. This is the resulting plot that was generated from R's numbers using this technique: http://people.redhat.com/sgrubb/files/r-random.jpg And for comparison this was generated by collecting the same number of samples from the bash shell: http://people.redhat.com/sgrubb/files/bash-random.jpg The net result is that it shows some banding in the R generated random numbers where bash has uniform random numbers with no discernible pattern. Best Regards, -Steve
Hello,
Top posting. Several people have asked about the code to replicate my
results. I have cleaned up the code to remove an x/y coordinate bias for
displaying the results directly on a 640 x 480 VGA adapter. You can find the
code here:
http://people.redhat.com/sgrubb/files/vseq.c
To collect R samples:
X <- runif(10000, min = 0, max = 65535)
write.table(X, file = "~/r-rand.txt", sep = "\n", row.names
= FALSE)
Then:
cat ~/r-rand.txt | ./vseq > ~/r-rand.csv
And then to create the chart:
library(ggplot2);
num.csv <- read.csv("~/random.csv", header=T)
qplot(X, Y, data=num.csv);
Hope this helps sort this out.
Best Regards,
-Steve
On Thursday, September 20, 2018 5:09:23 PM EDT Steve Grubb
wrote:> On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:
> > On 20/09/2018 6:59 AM, Ralf Stubner wrote:
> > > On 9/20/18 1:43 AM, Carl Boettiger wrote:
> > >> For a well-tested C algorithm, based on my reading of Lemire,
the
> > >> unbiased "algorithm 3" in
https://arxiv.org/abs/1805.10941 is part
> > >> already of the C standard library in OpenBSD and macOS (as
> > >> arc4random_uniform), and in the GNU standard library. Lemire
also
> > >> provides C++ code in the appendix of his piece for both this
and the
> > >> faster "nearly divisionless" algorithm.
> > >>
> > >> It would be excellent if any R core members were interested
in
> > >> considering bindings to these algorithms as a patch, or might
express
> > >> expectations for how that patch would have to operate (e.g.
re
> > >> Duncan's
> > >> comment about non-integer arguments to sample size).
Otherwise, an R
> > >> package binding seems like a good starting point, but I'm
not the
> > >> right
> > >> volunteer.
> > >
> > > It is difficult to do this in a package, since R does not provide
> > > access
> > > to the random bits generated by the RNG. Only a float in (0,1) is
> > > available via unif_rand().
> >
> > I believe it is safe to multiply the unif_rand() value by 2^32, and
take
> > the whole number part as an unsigned 32 bit integer. Depending on the
> > RNG in use, that will give at least 25 random bits. (The low order
bits
> > are the questionable ones. 25 is just a guess, not a guarantee.)
> >
> > However, if one is willing to use an external
> >
> > > RNG, it is of course possible. After reading about Lemire's
work [1], I
> > > had planned to integrate such an unbiased sampling scheme into
the
> > > dqrng
> > > package, which I have now started. [2]
> > >
> > > Using Duncan's example, the results look much better:
> > >> library(dqrng)
> > >> m <- (2/5)*2^32
> > >> y <- dqsample(m, 1000000, replace = TRUE)
> > >> table(y %% 2)
> > >>
> > > 0 1
> > >
> > > 500252 499748
> >
> > Another useful diagnostic is
> >
> > plot(density(y[y %% 2 == 0]))
> >
> > Obviously that should give a more or less uniform density, but for
> > values near m, the default sample() gives some nice pretty pictures of
> > quite non-uniform densities.
> >
> > By the way, there are actually quite a few examples of very large m
> > besides m = (2/5)*2^32 where performance of sample() is noticeably
bad.
> > You'll see problems in y %% 2 for any integer a > 1 with m =
2/(1 + 2a)
> > * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
> > 3a)*2^32, etc.
> >
> > So perhaps I'm starting to be convinced that the default sample()
should
> > be fixed.
>
> I find this discussion fascinating. I normally test random numbers in
> different languages every now and again using various methods. One simple
> check that I do is to use Michal Zalewski's method when he studied
Strange
> Attractors and Initial TCP/IP Sequence Numbers:
>
> http://lcamtuf.coredump.cx/newtcp/
> https://pdfs.semanticscholar.org/
> adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf
>
> The technique works by mapping the dynamics of the generated numbers into a
> three-dimensional phase space. This is then plotted in a graph so that you
> can visually see if something odd is going on.
>
> I used runif(10000, min = 0, max = 65535) to get a set of numbers. This
> is the resulting plot that was generated from R's numbers using this
> technique:
>
> http://people.redhat.com/sgrubb/files/r-random.jpg
>
> And for comparison this was generated by collecting the same number of
> samples from the bash shell:
>
> http://people.redhat.com/sgrubb/files/bash-random.jpg
>
> The net result is that it shows some banding in the R generated random
> numbers where bash has uniform random numbers with no discernible pattern.
>
> Best Regards,
> -Steve
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
On 21 September 2018 at 09:50, Steve Grubb wrote:
| Hello,
|
| Top posting. Several people have asked about the code to replicate my
| results. I have cleaned up the code to remove an x/y coordinate bias for
| displaying the results directly on a 640 x 480 VGA adapter. You can find the
| code here:
|
| http://people.redhat.com/sgrubb/files/vseq.c
|
| To collect R samples:
| X <- runif(10000, min = 0, max = 65535)
| write.table(X, file = "~/r-rand.txt", sep = "\n",
row.names = FALSE)
|
| Then:
| cat ~/r-rand.txt | ./vseq > ~/r-rand.csv
|
| And then to create the chart:
|
| library(ggplot2);
| num.csv <- read.csv("~/random.csv", header=T)
| qplot(X, Y, data=num.csv);
|
| Hope this helps sort this out.
Here is a simpler version in one file, combining the operations in a bit of
C++ and also an auto-executed piece of R to create the data, call the
transformation you created, and plot. Store as eg /tmp/stevegrubb.cpp
and then call Rcpp::sourceCpp("/tmp/stevegrubb.cpp")
Dirk
--- snip --------------------------------------------------------------------
#include <Rcpp.h>
#define SCALING 0.5
// [[Rcpp::export]]
Rcpp::NumericMatrix stevePlot(Rcpp::NumericVector itable) {
size_t cnt = itable.size();
Rcpp::NumericMatrix M(cnt,2);
size_t num = 0;
double nth, n1th = 0, n2th = 0, n3th = 0;
double x, y;
double usex, usey, pha = 0;
usex = sin(pha);
usey = cos(pha);
while (num < cnt) {
double x1, y1, z1;
nth = itable[num];
x1 = (nth - n1th) * SCALING;
y1 = (n1th - n2th) * SCALING;
z1 = (n2th - n3th) * SCALING;
x = (x1*usey+z1*usex);
y = y1 + (z1*usey-x1*usex) / 2;
if (num > 4) {
M(num,0) = x;
M(num,1) = y;
} else {
M(num,0) = M(num,1) = NA_REAL;
}
num++;
n3th=n2th; n2th=n1th; n1th=nth;
}
Rcpp::colnames(M) = Rcpp::CharacterVector::create("X",
"Y");
return(M);
}
/*** R
library(ggplot2);
Xin <- runif(10000, min = 0, max = 65535)
Xout <- stevePlot(Xin);
qplot(X, Y, data=as.data.frame(Xout));
*/
--- snip --------------------------------------------------------------------
--
http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
Slightly nicer version:
--- snip --------------------------------------------------------------------
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::DataFrame stevePlot(Rcpp::NumericVector itable) {
size_t cnt = itable.size(), num = 0;
double nth, n1th = 0, n2th = 0, n3th = 0;
double x, y, usex, usey, pha = 0;
std::vector<double> xv, yv;
const double scaling = 0.5;
usex = sin(pha);
usey = cos(pha);
while (num < cnt) {
double x1, y1, z1;
nth = itable[num];
num++;
x1 = (nth - n1th) * scaling;
y1 = (n1th - n2th) * scaling;
z1 = (n2th - n3th) * scaling;
x = (x1*usey+z1*usex);
y = y1 + (z1*usey-x1*usex) / 2;
if (num > 4) {
xv.push_back(x);
yv.push_back(y);
}
n3th=n2th; n2th=n1th; n1th=nth;
}
return(Rcpp::DataFrame::create(Rcpp::Named("X") = xv,
Rcpp::Named("Y") = yv));
}
/*** R
library(ggplot2)
Xin <- runif(10000, min = 0, max = 65535)
Xout <- stevePlot(Xin)
qplot(X, Y, data=Xout)
*/
--- snip --------------------------------------------------------------------
Still just Rcpp::sourceCpp("filenamehere.cpp") it.
Dirk
--
http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
Not sure what should happen theoretically for the code in vseq.c, but
I see the same pattern with the R generators I tried (default,
Super-Duper, and L'Ecuyer) and with with bash $RANDOM using
N <- 10000
X1 <- replicate(N, as.integer(system("bash -c 'echo
$RANDOM'", intern = TRUE)))
X2 <- replicate(N, as.integer(system("bash -c 'echo
$RANDOM'", intern = TRUE)))
X <- X1 + 2 ^ 15 * (X2 > 2^14)
and with numbers from random.org
library(random)
X <- randomNumbers(N, 0, 2^16-1, col = 1)
So I'm not convinced there is an issue.
Best,
luke
On Fri, 21 Sep 2018, Steve Grubb wrote:
> Hello,
>
> Top posting. Several people have asked about the code to replicate my
> results. I have cleaned up the code to remove an x/y coordinate bias for
> displaying the results directly on a 640 x 480 VGA adapter. You can find
the
> code here:
>
> http://people.redhat.com/sgrubb/files/vseq.c
>
> To collect R samples:
> X <- runif(10000, min = 0, max = 65535)
> write.table(X, file = "~/r-rand.txt", sep = "\n",
row.names = FALSE)
>
> Then:
> cat ~/r-rand.txt | ./vseq > ~/r-rand.csv
>
> And then to create the chart:
>
> library(ggplot2);
> num.csv <- read.csv("~/random.csv", header=T)
> qplot(X, Y, data=num.csv);
>
> Hope this helps sort this out.
>
> Best Regards,
> -Steve
>
> On Thursday, September 20, 2018 5:09:23 PM EDT Steve Grubb wrote:
>> On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:
>>> On 20/09/2018 6:59 AM, Ralf Stubner wrote:
>>>> On 9/20/18 1:43 AM, Carl Boettiger wrote:
>>>>> For a well-tested C algorithm, based on my reading of
Lemire, the
>>>>> unbiased "algorithm 3" in
https://arxiv.org/abs/1805.10941 is part
>>>>> already of the C standard library in OpenBSD and macOS (as
>>>>> arc4random_uniform), and in the GNU standard library.
Lemire also
>>>>> provides C++ code in the appendix of his piece for both
this and the
>>>>> faster "nearly divisionless" algorithm.
>>>>>
>>>>> It would be excellent if any R core members were interested
in
>>>>> considering bindings to these algorithms as a patch, or
might express
>>>>> expectations for how that patch would have to operate (e.g.
re
>>>>> Duncan's
>>>>> comment about non-integer arguments to sample size).
Otherwise, an R
>>>>> package binding seems like a good starting point, but
I'm not the
>>>>> right
>>>>> volunteer.
>>>>
>>>> It is difficult to do this in a package, since R does not
provide
>>>> access
>>>> to the random bits generated by the RNG. Only a float in (0,1)
is
>>>> available via unif_rand().
>>>
>>> I believe it is safe to multiply the unif_rand() value by 2^32, and
take
>>> the whole number part as an unsigned 32 bit integer. Depending on
the
>>> RNG in use, that will give at least 25 random bits. (The low order
bits
>>> are the questionable ones. 25 is just a guess, not a guarantee.)
>>>
>>> However, if one is willing to use an external
>>>
>>>> RNG, it is of course possible. After reading about Lemire's
work [1], I
>>>> had planned to integrate such an unbiased sampling scheme into
the
>>>> dqrng
>>>> package, which I have now started. [2]
>>>>
>>>> Using Duncan's example, the results look much better:
>>>>> library(dqrng)
>>>>> m <- (2/5)*2^32
>>>>> y <- dqsample(m, 1000000, replace = TRUE)
>>>>> table(y %% 2)
>>>>>
>>>> 0 1
>>>>
>>>> 500252 499748
>>>
>>> Another useful diagnostic is
>>>
>>> plot(density(y[y %% 2 == 0]))
>>>
>>> Obviously that should give a more or less uniform density, but for
>>> values near m, the default sample() gives some nice pretty pictures
of
>>> quite non-uniform densities.
>>>
>>> By the way, there are actually quite a few examples of very large m
>>> besides m = (2/5)*2^32 where performance of sample() is noticeably
bad.
>>> You'll see problems in y %% 2 for any integer a > 1 with m =
2/(1 + 2a)
>>> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
>>> 3a)*2^32, etc.
>>>
>>> So perhaps I'm starting to be convinced that the default
sample() should
>>> be fixed.
>>
>> I find this discussion fascinating. I normally test random numbers in
>> different languages every now and again using various methods. One
simple
>> check that I do is to use Michal Zalewski's method when he studied
Strange
>> Attractors and Initial TCP/IP Sequence Numbers:
>>
>> http://lcamtuf.coredump.cx/newtcp/
>> https://pdfs.semanticscholar.org/
>> adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf
>>
>> The technique works by mapping the dynamics of the generated numbers
into a
>> three-dimensional phase space. This is then plotted in a graph so that
you
>> can visually see if something odd is going on.
>>
>> I used runif(10000, min = 0, max = 65535) to get a set of numbers.
This
>> is the resulting plot that was generated from R's numbers using
this
>> technique:
>>
>> http://people.redhat.com/sgrubb/files/r-random.jpg
>>
>> And for comparison this was generated by collecting the same number of
>> samples from the bash shell:
>>
>> http://people.redhat.com/sgrubb/files/bash-random.jpg
>>
>> The net result is that it shows some banding in the R generated random
>> numbers where bash has uniform random numbers with no discernible
pattern.
>>
>> Best Regards,
>> -Steve
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa Phone: 319-335-3386
Department of Statistics and Fax: 319-335-3017
Actuarial Science
241 Schaeffer Hall email: luke-tierney at uiowa.edu
Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu