Dear R-Aficionados: I realize that no random number generator is perfect, so what I report below may be a result of that simple fact. However, if I have made an error in my thinking I would greatly appreciate being corrected. I wish to illustrate the behavior of small samples (n=10) and so generate 100,000 of them. n.samples <- 1000000 sample.size = 10 p <- 0.0001 z.normal <- qnorm(p) # generate n.samples of sample.size each from a normal(mean=0, sd=1) density # small.sample <- matrix(rnorm(n=sample.size*n.samples, mean=0, sd=1), nrow=n.samples, ncol=sample.size) # Verify that from the entire small.sample matrix, p sampled values are below, p above. # observed.fraction.below <- sum(small.sample < z.normal)/length(small.sample) observed.fraction.above <- sum(small.sample > -z.normal)/length(small.sample)> observed.fraction.below[1] 6.3e-05> observed.fraction.above[1] 0.000142>I've checked the behavior of the entire sample's mean and median and they seem fine. The total fraction in both tails is 0.0002, as it should be. However in every instance about 1/3 are in the lower tail, 2/3 in the upper. I also observe the same 1/3:2/3 ratio for one million samples of ten. Is this simply because random number generators aren't perfect? Or have I stepped in something? Thank you for your kind counsel. Charles Annis, P.E. Charles.Annis at StatisticalEngineering.com phone: 561-352-9699 eFAX: 503-217-5849 http://www.StatisticalEngineering.com
"Charles Annis, P.E." <AnnisC at asme.org> writes:> Dear R-Aficionados: > > I realize that no random number generator is perfect, so what I report > below may be a result of that simple fact. However, if I have made an > error in my thinking I would greatly appreciate being corrected. > > I wish to illustrate the behavior of small samples (n=10) and so > generate 100,000 of them. > > n.samples <- 1000000 > sample.size = 10 > p <- 0.0001 > z.normal <- qnorm(p) > # generate n.samples of sample.size each from a normal(mean=0, sd=1) > density > # > small.sample <- matrix(rnorm(n=sample.size*n.samples, mean=0, sd=1), > nrow=n.samples, ncol=sample.size) > # Verify that from the entire small.sample matrix, p sampled values are > below, p above. > # > observed.fraction.below <- sum(small.sample < > z.normal)/length(small.sample) > observed.fraction.above <- sum(small.sample > > -z.normal)/length(small.sample) > > > observed.fraction.below > [1] 6.3e-05 > > observed.fraction.above > [1] 0.000142 > > > > I've checked the behavior of the entire sample's mean and median and > they seem fine. The total fraction in both tails is 0.0002, as it > should be. However in every instance about 1/3 are in the lower tail, > 2/3 in the upper. I also observe the same 1/3:2/3 ratio for one million > samples of ten. > > Is this simply because random number generators aren't perfect? Or have > I stepped in something? > > Thank you for your kind counsel.You stepped in something, I think, but I probably shouldn't elaborate on the metaphor ... There's an unfortunate interaction between the two methods that are used for generating uniform and normal variables (the latter uses the former). This has been reported a couple of times before and typically gives anomalous tail behaviour. Changing one of the generators (see help(RNGkind)) usually helps. -- O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
Might I suggest taking a poll (even though unscientific) of how many people will be affected by a change in default RNG? My totally arbitrary guess is very few, if any. If I'm not mistaken, Python had only recently changed the default RNG to Mersenne-Twister. If Python can do it, I should think R can, too, without too much pain... Just my $0.02... Andy> -----Original Message----- > From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk] > Sent: Tuesday, January 28, 2003 3:53 PM > To: Charles Annis, P.E. > Cc: r-help at stat.math.ethz.ch > Subject: RE: [R] random number generator? > > > Can I suggest > > RNGkind("Mersenne-Twister", "Inversion") > > and especially the use of Inversion where tail behaviour of > the normal is > important. > > Were it not for concerns about reproducibility we would have > switched to > Inversion a while back. > > On Tue, 28 Jan 2003, Charles Annis, P.E. wrote: > > > > > Earlier today I reported finding an unbalanced number of > observations in > > the p=0.0001 tails of rnorm. > > > > Many thanks to Peter Dalgaard who suggested changing the normal.kind > > generator. > > > > Using RNGkind(kind = NULL, normal.kind ="Box-Muller") > > seems to have provided the remedy. For example: > > > > > observed.fraction.below > > [1] 0.000103 > > > observed.fraction.above > > [1] 0.000101 > > > > > > > Thank you, Peter! > > > > > > Charles Annis, P.E. > > > > Charles.Annis at StatisticalEngineering.com > > phone: 561-352-9699 > > eFAX: 503-217-5849 > > http://www.StatisticalEngineering.com > > > > > > -----Original Message----- > > From: r-help-admin at stat.math.ethz.ch > > [mailto:r-help-admin at stat.math.ethz.ch] On Behalf Of Peter > Dalgaard BSA > > Sent: Tuesday, January 28, 2003 2:36 PM > > To: Charles Annis, P.E. > > Cc: r-help at stat.math.ethz.ch > > Subject: Re: [R] random number generator? > > > > "Charles Annis, P.E." <AnnisC at asme.org> writes: > > > > > Dear R-Aficionados: > > > > > > I realize that no random number generator is perfect, so > what I report > > > below may be a result of that simple fact. However, if I > have made an > > > error in my thinking I would greatly appreciate being corrected. > > > > > > I wish to illustrate the behavior of small samples (n=10) and so > > > generate 100,000 of them. > > > > > > n.samples <- 1000000 > > > sample.size = 10 > > > p <- 0.0001 > > > z.normal <- qnorm(p) > > > # generate n.samples of sample.size each from a > normal(mean=0, sd=1) > > > density > > > # > > > small.sample <- matrix(rnorm(n=sample.size*n.samples, > mean=0, sd=1), > > > nrow=n.samples, ncol=sample.size) > > > # Verify that from the entire small.sample matrix, p > sampled values > > are > > > below, p above. > > > # > > > observed.fraction.below <- sum(small.sample < > > > z.normal)/length(small.sample) > > > observed.fraction.above <- sum(small.sample > > > > -z.normal)/length(small.sample) > > > > > > > observed.fraction.below > > > [1] 6.3e-05 > > > > observed.fraction.above > > > [1] 0.000142 > > > > > > > > > > I've checked the behavior of the entire sample's mean and > median and > > > they seem fine. The total fraction in both tails is 0.0002, as it > > > should be. However in every instance about 1/3 are in > the lower tail, > > > 2/3 in the upper. I also observe the same 1/3:2/3 ratio for one > > million > > > samples of ten. > > > > > > Is this simply because random number generators aren't > perfect? Or > > have > > > I stepped in something? > > > > > > Thank you for your kind counsel. > > > > You stepped in something, I think, but I probably shouldn't > elaborate > > on the metaphor ... There's an unfortunate interaction > between the two > > methods that are used for generating uniform and normal > variables (the > > latter uses the former). This has been reported a couple of times > > before and typically gives anomalous tail behaviour. Changing one of > > the generators (see help(RNGkind)) usually helps. > > > > > > -- > Brian D. Ripley, ripley at stats.ox.ac.uk > Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ > University of Oxford, Tel: +44 1865 272861 (self) > 1 South Parks Road, +44 1865 272866 (PA) > Oxford OX1 3TG, UK Fax: +44 1865 272595 > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > http://www.stat.math.ethz.ch/mailman/listinfo/r-help >------------------------------------------------------------------------------
> From: rossini at blindglobe.net [mailto:rossini at blindglobe.net] > > >>>>> "AL" == Andy Liaw <Liaw> writes: > > AL> Might I suggest taking a poll (even though > unscientific) of how many people > AL> will be affected by a change in default RNG? My > totally arbitrary guess is > AL> very few, if any. > > But they may tend to be important (i.e. those with large stat software > projects/tools) doing heavy regression testing. >But if those who want to use better RNG in R can now put RNGkind() in their .Rprofile, surely those who really need the old generator can do similar, if the default is changed? As Prof. Dalgaard said (privately), I don't think is that painful if the current default generator is kept as an option in RNGkind(). (I.e., all I'm asking is to change the default RNG, not getting rid of the current default.) Cheers, Andy> AL> If I'm not mistaken, Python had only recently changed > the default RNG to > AL> Mersenne-Twister. If Python can do it, I should > think R can, too, without > AL> too much pain... > > Depends on which L_p norm you measure pain by. > > best, > -tony > > -- > A.J. Rossini Rsrch. Asst. Prof. of > Biostatistics > U. of Washington Biostatistics > rossini at u.washington.edu > FHCRC/SCHARP/HIV Vaccine Trials Net rossini at scharp.org > -------------- http://software.biostat.washington.edu/ > ---------------- > FHCRC: M: 206-667-7025 (fax=4812)|Voicemail is pretty > sketchy/use Email > UW: Th: 206-543-1044 (fax=3286)|Change last 4 digits of phone to FAX > (my tuesday/wednesday/friday locations are completely unpredictable.) >------------------------------------------------------------------------------
With respect to the method used to produce normals, Professor Ripley noted:>Were it not for concerns about reproducibility we would have switched to >Inversion a while back.Presumably this refers to reproducibility across platforms. I searched the archive and could find nothing on this point. Gentle's text on RNGs provided no answer, either. Might Professor Ripley or someone else briefly explain why the inversion method is not reproducible but Box-Muller is, or perhaps offer a citation? Thanks, Bruce
Reproducibility of existing R scripts was meant. Changing the default behaviour of R is going to surprise a large number of users, who may not even know which was the default generator when they did their runs. On Wed, 29 Jan 2003, Bruce D McCullough wrote:> With respect to the method used to produce > normals, Professor Ripley noted: > > > >Were it not for concerns about reproducibility we would have switched to > >Inversion a while back. > > Presumably this refers to reproducibility across platforms.Presumption seems rather out of place. -- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595