> Duncan Murdoch:
>
> and you can see it in the original m with
>
> x <- sample(m, 1000000, replace = TRUE)
> plot(density(x[x %% 2 == 0]))
OK. Thanks. I see there is a real problem.
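For concreteness, here is a self-contained variant of that snippet. The
value of m below is only an illustrative choice of a value that is a large
fraction of 2^32 - it is not the "original m" referred to above - and on
newer versions of R the old rounding-based sampler has to be requested
explicitly for the effect to appear.

  # Illustrative only: this m is an assumed value, roughly (2/5) * 2^32,
  # not the m from earlier in the thread.  The bias arises because the old
  # sampler computes floor(unif_rand() * m) from a uniform with only about
  # 2^32 distinct values, so some outputs get 3 grid points and others 2.
  # RNGkind(sample.kind = "Rounding")   # uncomment on R >= 3.6.0
  m <- 1717986918
  x <- sample(m, 1000000, replace = TRUE)
  plot(density(x[x %% 2 == 0]))   # slow wave: even values are alternately
                                  # over- and under-represented along the range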
One option to fix it while mostly retaining backwards compatibility
would be to add extra bits from a second RNG call only when m is large
- e.g., larger than 2^27. That would retain reproducibility for most
analyses of small to moderate-size data sets. Of course, there would
still be some small, detectable error for values of m a bit less than
2^27, but perhaps that's tolerable. (The 2^27 threshold could obviously
be debated.)
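To make that concrete, here is a rough R-level sketch of what adding
extra bits from a second RNG call could look like. The function name, the
exact cutoff, and the way the two uniforms are combined are all just
assumptions for the sketch, not a proposal for the actual C code:

  # Sketch only: when m is large, combine two uniform draws to get roughly
  # 26 + 32 random bits before scaling to 1..m, instead of the ~32 bits a
  # single call provides.
  sample_int_extra_bits <- function (m, n)
  {
    if (m <= 2^27)
      floor (runif(n) * m) + 1        # small m: one draw per value, as now
    else {
      hi <- floor (runif(n) * 2^26)   # high-order bits, first call
      lo <- runif(n)                  # further bits, second call
      floor ((hi + lo) / 2^26 * m) + 1
    }
  }

  # e.g., sample_int_extra_bits (2^31 - 1, 5) gives 5 values in 1..2^31-1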
R Core made a similar decision when implementing a new hashing
algorithm for sampling without replacement, which produces different
results from the old algorithm. It is enabled by default only when
m > 1e7 and no more than half the values are to be sampled (a small
R-level illustration follows the quoted message below), as was noted:
> Note that it wouldn't be the first time that sample() changes behavior
> in a non-backward compatible way:
>
> https://stat.ethz.ch/pipermail/r-devel/2012-October/065049.html
>
> Cheers,
> H.
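For reference, the hashing method in question is exposed at the R level
through the useHash argument of sample.int(); the default condition in the
small illustration below is paraphrased from the documentation of recent R
versions and may differ between versions.

  # Hashing is used by default only for sampling without replacement,
  # without weights, when no more than half the values are sampled and the
  # population is large (a paraphrase - check ?sample.int in your version).
  n <- 1e7 + 1; size <- 1e6; replace <- FALSE; prob <- NULL
  use_hash_default <- !replace && is.null(prob) && size <= n/2 && n > 1e7
  use_hash_default                                  # TRUE for these values
  y <- sample.int (n, size, replace = FALSE, useHash = TRUE)  # explicit request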
That incompatibility could have been avoided. A year ago I posted
a fast hashing algorithm that produces the same results as the simple
algorithm, here:
https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html
The latest version of this will be in the forthcoming new release of
pqR, and will of course be enabled automatically whenever that seems
desirable, for a considerable speed gain in many cases.
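For readers who have not seen that post, the idea can be sketched at the
R level as follows. This is only a conceptual sketch, with made-up names
and an R environment standing in for the hash table - both R's simple
algorithm and pqR's hashed version are actually implemented in C:

  # Sketch: do the same partial Fisher-Yates draws as the simple algorithm,
  # but store only those entries of the notional vector 0..n-1 that have
  # been modified, so memory is proportional to the sample size while the
  # random numbers consumed - and hence the results - match the simple
  # algorithm.
  sample_no_replace_hashed <- function (n, k)
  {
    tbl <- new.env (hash = TRUE)
    lookup <- function (j) {          # value of notional x[j]; defaults to j
      key <- as.character(j)
      if (exists (key, envir = tbl, inherits = FALSE)) get (key, envir = tbl)
      else j
    }
    y <- numeric(k)
    for (i in seq_len(k)) {
      j <- floor (n * runif(1))                       # same draw as before
      y[i] <- lookup(j) + 1
      assign (as.character(j), lookup(n - 1), envir = tbl)   # x[j] <- x[n-1]
      n <- n - 1
    }
    y
  }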
Radford Neal