Shengqiao Li
2008-Aug-17  00:29 UTC
[R] Wichmann-Hill Random Number Generator and the Birthday Problem
Dear all, Recently I am generating large random samples (10M) and any duplicated numbers are not desired. We tried several RNGs in R and found Wichmann-Hill did not produce duplications. The duplication problem is the interesting birthday problem. If there are M possible numbers, randomly draw N numbers from them, the average number of dupilcations D = N(N-1)/2/M. For Knuth-TAOCP and Knuth-TAOCP-2002, M=2^30, since this modulus is used. D = 46566.12 for N=10M samples. For Marsaglia-Multicarry, Super-Duper and Mersene-Twister, M=2^32. D = 11641.53 for N = 10M samples. My testing results (see below) agree with above analysis. But for Wichmann-Hill, it wasn't. Wichmann-Hill's cycle is 6.9536e12 (refer to RNG help by ?RNG and Whichmann's correction in 1984). Thus M <= 6.9536e12. D>= 7.19052 when N=1e7 and D>= 179.763 for N=5e7.But in my tests, duplications were surprisingly not observed. It seems that Wichmann-Hill has a much larger cycle than the one documented! Anybody can solve this puzzle? Regards, Shengqiao Li Department of Statistics West Virgina Unversity ==============Testing==================> RNGkind(kind="Knuth-TAOCP"); > RNGkind();[1] "Knuth-TAOCP" "Inversion"> sum(duplicated(runif(1e7)));[1] 46216> > RNGkind(kind="Knuth-TAOCP-2002"); > RNGkind();[1] "Knuth-TAOCP-2002" "Inversion"> sum(duplicated(runif(1e7)));[1] 46373>> RNGkind(kind="Marsaglia-Multicarry"); > RNGkind();[1] "Marsaglia-Multicarry" "Inversion"> sum(duplicated(runif(1e7)));[1] 11671> > > RNGkind(kind="Super-Duper"); > RNGkind();[1] "Super-Duper" "Inversion"> sum(duplicated(runif(1e7)));[1] 11704> > RNGkind(kind="Mersenne-Twister"); > RNGkind();[1] "Mersenne-Twister" "Inversion"> sum(duplicated(runif(1e7)));[1] 11508> > RNGkind(kind="Wichmann-Hill"); > RNGkind();[1] "Wichmann-Hill" "Inversion"> sum(duplicated(runif(1e7)));[1] 0> gc()used (Mb) gc trigger (Mb) max used (Mb) Ncells 199157 5.4 646480 17.3 10149018 271.1 Vcells 4497151 34.4 108714629 829.5 169585390 1293.9> sum(duplicated(runif(5e7)))[1] 0> sessionInfo()R version 2.7.1 (2008-06-23) i386-pc-mingw32 locale: LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.7.1>==============End of Testing===================
Duncan Murdoch
2008-Aug-17  10:24 UTC
[R] Wichmann-Hill Random Number Generator and the Birthday Problem
Shengqiao Li wrote:> Dear all, > > Recently I am generating large random samples (10M) and any duplicated > numbers are not desired. > We tried several RNGs in R and found Wichmann-Hill did not produce > duplications. > > The duplication problem is the interesting birthday problem. If there are > M possible numbers, randomly draw N numbers from them, > the average number of dupilcations D = N(N-1)/2/M. > > For Knuth-TAOCP and Knuth-TAOCP-2002, M=2^30, since this modulus is used. > D = 46566.12 for N=10M samples. > > For Marsaglia-Multicarry, Super-Duper and Mersene-Twister, M=2^32. D = > 11641.53 for N = 10M samples. > > My testing results (see below) agree with above analysis. But for > Wichmann-Hill, it wasn't. Wichmann-Hill's cycle is 6.9536e12 (refer to RNG > help by ?RNG and Whichmann's correction in 1984). Thus M <= 6.9536e12. D > >> = 7.19052 when N=1e7 and D>= 179.763 for N=5e7. >> > But in my tests, duplications were surprisingly not observed. > > It seems that Wichmann-Hill has a much larger cycle than the one > documented! > > Anybody can solve this puzzle? >As I told you, look at the code. The cycle length of Wichmann-Hill in R appears to be 2.8e13, and that also appears to be M in your notation. Your birthday problem calculation does not apply here. You could probably get a good approximation to it by rounding the Wichmann-Hill output (e.g. look at round(2^30 * runif()), or maybe more severe rounding). M is larger than what was previously documented, but Brian Ripley has corrected the documentation. Duncan Murdoch> Regards, > > Shengqiao Li > > Department of Statistics > West Virgina Unversity > > ==============Testing==================> > >> RNGkind(kind="Knuth-TAOCP"); >> RNGkind(); >> > [1] "Knuth-TAOCP" "Inversion" > >> sum(duplicated(runif(1e7))); >> > [1] 46216 > >> RNGkind(kind="Knuth-TAOCP-2002"); >> RNGkind(); >> > [1] "Knuth-TAOCP-2002" "Inversion" > >> sum(duplicated(runif(1e7))); >> > [1] 46373 > > > >> RNGkind(kind="Marsaglia-Multicarry"); >> RNGkind(); >> > [1] "Marsaglia-Multicarry" "Inversion" > >> sum(duplicated(runif(1e7))); >> > [1] 11671 > >> RNGkind(kind="Super-Duper"); >> RNGkind(); >> > [1] "Super-Duper" "Inversion" > >> sum(duplicated(runif(1e7))); >> > [1] 11704 > >> RNGkind(kind="Mersenne-Twister"); >> RNGkind(); >> > [1] "Mersenne-Twister" "Inversion" > >> sum(duplicated(runif(1e7))); >> > [1] 11508 > >> RNGkind(kind="Wichmann-Hill"); >> RNGkind(); >> > [1] "Wichmann-Hill" "Inversion" > >> sum(duplicated(runif(1e7))); >> > [1] 0 > > >> gc() >> > used (Mb) gc trigger (Mb) max used (Mb) > Ncells 199157 5.4 646480 17.3 10149018 271.1 > Vcells 4497151 34.4 108714629 829.5 169585390 1293.9 > > >> sum(duplicated(runif(5e7))) >> > [1] 0 > > > >> sessionInfo() >> > R version 2.7.1 (2008-06-23) > i386-pc-mingw32 > > locale: > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United > States.1252;LC_MONETARY=English_United > States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > loaded via a namespace (and not attached): > [1] tools_2.7.1 > > > ==============End of Testing==================> > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >