Dear list,

I did simulations in which I generated 10000 independent Bernoulli(0.5) sequences of length 100. I estimated p for each sequence, and I also estimated the conditional probability that a one is followed by another one (which should be p as well). However, the second probability is significantly smaller than 0.5 (namely about 0.494, see below), and of course smaller than the direct estimates of p as well, indicating negative correlation between the random numbers.

See below for the code and the results. Did I do something wrong, or are the numbers in fact negatively correlated? (A type I error is quite unlikely with a p-value below 2.2e-16.)

Best,
Christian

set.seed(123456)
n <- 100
p <- 0.5
simruns <- 10000
est <- est11 <- numeric(0)
for (i in 1:simruns){
  # if (i/100==round(i/100)) print(i)
  x <- rbinom(n,1,p)
  est[i] <- mean(x)
  x11 <- 3*x[2:n]-x[1:(n-1)]
  # x11==(-1): 0 follows 1; x11==2: 1 follows 1
  est11[i] <- sum(x11==2)/sum(x11==2 | x11==(-1))
}

> print(mean(est))
[1] 0.499554
> print(sd(est)/sqrt(simruns))
[1] 0.0004958232        # OK
> print(mean(est11))
[1] 0.4935211
> print(sd(est11)/sqrt(simruns))
[1] 0.0007136213        # mean(est11)+2*sd(mean) < 0.495
> print(sum(est>est11))
[1] 5575
> binom.test(5575,10000)

        Exact binomial test

data:  5575 and 10000
number of successes = 5575, number of trials = 10000, p-value < 2.2e-16

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
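A quick way to see that the deficit comes from the per-sequence ratio rather than from the random number generator is to pool the transition counts across all sequences before dividing. This is a minimal sketch, not code from the thread, assuming the same simulation settings as above:

```r
# Pooled estimator: count all 1->1 transitions and all transitions that
# start with a 1, across every simulated sequence, then take one ratio.
set.seed(123456)
n <- 100; p <- 0.5; simruns <- 10000
ones <- 0; after1 <- 0
for (i in 1:simruns) {
  x <- rbinom(n, 1, p)
  ones   <- ones   + sum(x[1:(n-1)] == 1 & x[2:n] == 1)  # 1 followed by 1
  after1 <- after1 + sum(x[1:(n-1)] == 1)                # any pair starting with 1
}
print(ones / after1)  # approximately 0.5
```

Pooling removes the problem because the ratio of the two pooled (unbiased) counts is taken only once over roughly half a million transitions, instead of averaging 10000 small-sample ratios.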
On Tue, 27 Jun 2006, Christian Hennig wrote:

> Dear list,
>
> I did simulations in which I generated 10000 independent Bernoulli(0.5)
> sequences of length 100. I estimated p for each sequence and I also
> estimated the conditional probability that a one is followed by another
> one (which should be p as well). However, the second probability is
> significantly smaller than 0.5 (namely about 0.494, see below) and of
> course smaller than the direct estimates of p as well, indicating
> negative correlation between the random numbers.
>
> See below the code and the results. Did I do something wrong or are the
> numbers in fact negatively correlated? (A type I error is quite unlikely
> with a p-value below 2.2e-16.)

I think you did something wrong, and that there is a problem with overlapping blocks of two. If you do

x <- matrix(rbinom(1e6,1,p), ncol=2)
tt <- table(x[,1], x[,2])

you get much better-looking results. In this case you have 500,000 independent pairs of numbers that can be 01, 10, 11, 00. A test for independence seems fine.

> tt
         0      1
  0 125246 124814
  1 125140 124800
> fisher.test(tt)

        Fisher's Exact Test for Count Data

data:  tt
p-value = 0.8987
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.9896745 1.0119211
sample estimates:
odds ratio
  1.000735

In your case the deficit in est11 is suspiciously close to 0.5/n. Changing n to 1000 and using the same seed I get

> mean(est11)-0.5
[1] -0.0005534743

which is 10 times smaller, and still close to 0.5/n.

Now, consider what happens in a case where we can see all the possibilities, n=3:

   x   1/0  1/1
  000   -    -
  001   -    -
  010   1    0
  011   0    1
  100   1    0
  101   1    0
  110   1    1
  111   2    0

So if each of these eight triplets has the same probability, your est11 would be 2/8 rather than 4/8, and est11 is not an unbiased estimate of the long-run conditional probability. The bias is of order 1/n, so you need n to be of larger order than sqrt(simruns).

        -thomas
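The claim that the bias is of order 1/n can be checked directly by rerunning the original estimator for several sequence lengths. A small sketch, not from the thread (the helper name deficit is illustrative; it uses the same est11 construction as the original post):

```r
# deficit(n): how far mean(est11) falls below 0.5 for sequences of length n.
# The shortfall should shrink roughly in proportion to 1/n.
deficit <- function(n, simruns = 20000, p = 0.5) {
  est11 <- numeric(simruns)
  for (i in 1:simruns) {
    x <- rbinom(n, 1, p)
    x11 <- 3 * x[2:n] - x[1:(n-1)]
    # x11==-1: 0 follows 1; x11==2: 1 follows 1 (as in the original post)
    est11[i] <- sum(x11 == 2) / sum(x11 == 2 | x11 == -1)
  }
  # na.rm guards against the (vanishingly rare) all-zero sequence, where
  # the ratio is 0/0.
  0.5 - mean(est11, na.rm = TRUE)
}

set.seed(1)
d <- round(c(n50 = deficit(50), n100 = deficit(100), n200 = deficit(200)), 4)
print(d)  # each deficit is roughly half the previous one
```

Doubling n should roughly halve the deficit, which is why increasing n to 1000 above shrank it by a factor of about 10.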
On Tue, 27 Jun 2006, Thomas Lumley wrote:

I got the table wrong; it should read

   x   1/0  1/1  est11
  000   -    -     -
  001   -    -     -
  010   1    0     0
  011   0    1     1
  100   1    0     0
  101   1    0     0
  110   1    1     0.5
  111   0    2     1

So the explanation is slightly more complicated. The problem is that although sum(x11==2)/n and sum(x11==2 | x11==(-1))/n are unbiased estimators, their ratio is not: for n=3 it has mean (0+1+0+0+0.5+1)/6 = 5/12, and for larger n the mean falls short of 0.5 by roughly 0.5/n, matching the deficits above.

        -thomas
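The 5/12 figure can be confirmed by brute-force enumeration of all 2^n equally likely sequences. A small sketch, not from the thread (the helper name exact_mean_est11 is illustrative):

```r
# Exact mean of est11 over all 2^n equally likely 0/1 sequences,
# restricted to sequences where the estimator is defined (at least one 1
# among the first n-1 values).
exact_mean_est11 <- function(n) {
  seqs <- as.matrix(expand.grid(rep(list(0:1), n)))  # one row per sequence
  vals <- apply(seqs, 1, function(x) {
    after1 <- sum(x[1:(n-1)] == 1)          # transitions starting with a 1
    if (after1 == 0) return(NA)             # est11 undefined for this sequence
    sum(x[1:(n-1)] == 1 & x[2:n] == 1) / after1
  })
  mean(vals, na.rm = TRUE)
}

print(exact_mean_est11(3))  # 5/12 = 0.41666..., as in the corrected table
```

This reproduces exactly the six defined est11 values in the table (0, 1, 0, 0, 0.5, 1), whose average is 2.5/6 = 5/12.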