Dear list,
I did simulations in which I generated 10000
independent Bernoulli(0.5)-sequences of length 100. I estimated
p for each sequence and I also estimated the conditional probability that
a one is followed by another one (which should be p as well).
However, the second probability is significantly smaller than 0.5 (namely
about 0.494, see below) and of course smaller than the direct estimates of
p as well, indicating negative correlation between the random numbers.
See the code and the results below.
Did I do something wrong or are the numbers in
fact negatively correlated? (A type I error is quite unlikely with a
p-value below 2.2e-16.)
Best,
Christian
set.seed(123456)
n <- 100
p <- 0.5
simruns <- 10000
est <- est11 <- numeric(simruns)  # preallocate result vectors
for (i in 1:simruns){
  # if (i %% 100 == 0) print(i)  # progress
  x <- rbinom(n, 1, p)
  est[i] <- mean(x)
  # Encode consecutive pairs: x11 == 2 means a 1 followed by a 1,
  # x11 == -1 means a 1 followed by a 0.
  x11 <- 3*x[2:n] - x[1:(n-1)]
  est11[i] <- sum(x11 == 2)/sum(x11 == 2 | x11 == (-1))
}
> print(mean(est))
[1] 0.499554
> print(sd(est)/sqrt(simruns))
[1] 0.0004958232
# OK
> print(mean(est11))
[1] 0.4935211
> print(sd(est11)/sqrt(simruns))
[1] 0.0007136213
# mean(est11) + 2*sd(mean) < 0.495
> print(sum(est > est11))
[1] 5575
> binom.test(5575, 10000)

	Exact binomial test

data:  5575 and 10000
number of successes = 5575, number of trials = 10000,
p-value < 2.2e-16
*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
On Tue, 27 Jun 2006, Christian Hennig wrote:

> Did I do something wrong or are the numbers in
> fact negatively correlated? (A type I error is quite unlikely with a
> p-value below 2.2e-16.)

I think you did something wrong, and that there is a problem with
overlapping blocks of two. If you do

x <- matrix(rbinom(1e6, 1, p), ncol = 2)
tt <- table(x[, 1], x[, 2])

you get much better-looking results. In this case you have 500,000
independent pairs of numbers that can be 01, 10, 11, 00. A test for
independence seems fine.

> tt

         0      1
  0 125246 124814
  1 125140 124800

> fisher.test(tt)

	Fisher's Exact Test for Count Data

data:  tt
p-value = 0.8987
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.9896745 1.0119211
sample estimates:
odds ratio
  1.000735

In your case the deficit in est11 is suspiciously close to 0.5/n.
Changing n to 1000 and using the same seed I get

> mean(est11) - 0.5
[1] -0.0005534743

10 times smaller, and still close to 0.5/n.

Now consider what happens in a case where we can see all the
possibilities, n = 3:

x     1/0   1/1
000    -     -
001    -     -
010    1     0
011    0     1
100    1     0
101    1     0
110    1     1
111    2     0

So if each of these eight triplets has the same probability, your est11
would be 2/8 rather than 4/8, and est11 is not an unbiased estimate of
the long-run conditional probability. The bias is of order 1/n, so you
need n to be of larger order than sqrt(simruns).

-thomas
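The disjoint-pairs check above can also be cross-checked outside R. Below is a minimal sketch in Python (a hypothetical re-implementation, not code from the thread): it draws independent Bernoulli(0.5) pairs that share no elements and estimates P(second = 1 | first = 1) directly.

```python
# Hypothetical Python analogue of the disjoint-pairs check from the thread.
# Draw independent (first, second) Bernoulli(0.5) pairs with no overlap,
# then estimate P(second = 1 | first = 1).
import random

random.seed(1)
p = 0.5
n_pairs = 500_000  # mirrors the 500,000 pairs in the R example

n_first_one = 0  # pairs whose first element is 1
n_one_one = 0    # pairs that are (1, 1)
for _ in range(n_pairs):
    first = 1 if random.random() < p else 0
    second = 1 if random.random() < p else 0
    if first == 1:
        n_first_one += 1
        n_one_one += second

est = n_one_one / n_first_one
print(est)  # close to 0.5: no overlap, no bias
```

Because each pair is disjoint from every other, the count of (1, 1) pairs is, conditionally on the number of pairs starting with 1, binomial with success probability p, so this ratio is unbiased; the deficit in est11 comes purely from the overlapping blocks.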
On Tue, 27 Jun 2006, Thomas Lumley wrote:

I got the table wrong; it should read

x     1/0   1/1   est11
000    -     -     -
001    -     -     -
010    1     0     0
011    0     1     1
100    1     0     0
101    1     0     0
110    1     1     0.5
111    0     2     1

So the explanation is slightly more complicated. The problem is that
although sum(x11==2)/n and sum(x11==2 | x11==(-1))/n are unbiased
estimators, their ratio is not an unbiased estimator: for n = 3 it has
mean 5/12, a downward bias of 1/12, i.e. of order 1/n.

-thomas
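The 5/12 figure can be verified by brute force. Here is a short sketch (hypothetical code, not from the thread) that enumerates all 2^n equally likely sequences with exact rational arithmetic and averages est11 over the sequences where it is defined:

```python
# Exact enumeration of E[est11] for short sequences, mirroring the table
# above: average est11 over all binary sequences of length n, skipping
# sequences where the estimator is undefined (no 1 with a successor).
from fractions import Fraction
from itertools import product

def mean_est11(n):
    total = Fraction(0)
    defined = 0
    for x in product((0, 1), repeat=n):
        # denominator of est11: ones that have a successor;
        # numerator: consecutive (1, 1) pairs
        denom = sum(x[i] for i in range(n - 1))
        numer = sum(x[i] * x[i + 1] for i in range(n - 1))
        if denom > 0:
            total += Fraction(numer, denom)
            defined += 1
    return total / defined

print(mean_est11(3))  # 5/12, matching the corrected table
```

For n = 3 this reproduces 5/12 exactly; larger n can be enumerated the same way (feasible up to n around 20) to watch the downward bias shrink.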