Jason Rupert
2009-Mar-27 02:44 UTC
[R] Physical or Statistical Explanation for the "Funnel" Plot?
The R code below produces (after running for a few minutes on a decent computer) the plot shown at the following location:

http://n2.nabble.com/Is-there-a-physical-and-quantitative-explanation-for-this-plot--td2542321.html

I'm just taking the mean of a given set of random variables, where the set size is increased. There appears to be a quick convergence and then a pretty steady variance out to a set size of 10,000.

I'm just wondering whether there is a statistical explanation out there for this convergence and whether it has been explored further.

Thanks again.

# First case
N <- 100000
X <- rnorm(N)

step_size <- 1

# Groups: consecutive blocks of step_size values
g <- rep(1:(N/step_size), each=step_size)

# The result: one mean per group
tmp_output <- tapply(X[1:length(g)], g, mean)
length_tmp_output <- length(tmp_output)
tmp_x_vals <- rep(step_size, length_tmp_output)

plot(tmp_x_vals, tmp_output, xlim=c(0,10000))
#points(tmp_x_vals, tmp_output)

for(ii in 1:10000)
{
    step_size <- ii

    # Groups (values left over after the last complete group are dropped)
    g <- rep(1:(N/step_size), each=step_size)

    # The result
    #tmp_output<-tapply(X,g,mean)
    tmp_output <- tapply(X[1:length(g)], g, mean)
    length_tmp_output <- length(tmp_output)
    tmp_x_vals <- rep(step_size, length_tmp_output)

    points(tmp_x_vals, tmp_output)
}
Mike Miller
2009-Mar-27 04:34 UTC
[R] Physical or Statistical Explanation for the "Funnel" Plot?
On Thu, 26 Mar 2009, Jason Rupert wrote:

> The R code below produces (after running for a few minutes on a decent
> computer) the plot shown at the following location:
>
> http://n2.nabble.com/Is-there-a-physical-and-quantitative-explanation-for-this-plot--td2542321.html
>
> I'm just taking the mean of a given set of random variables, where the
> set size is increased. There appears to be a quick convergence and then
> a pretty steady variance out to a set size of 10,000.

I don't have time to study your code, but it sounds like you are taking random normal variables with mean 0 and variance 1, and then taking the mean for sets of those. We know exactly the distribution for the mean of the "set" (a.k.a., "sample"). The mean has a normal distribution with mean 0 and variance 1/N, where N is the size of the sample.

When you allow N to vary, you produce a mixture of random normal variables all having mean 0 but with different variances. The plot you show looks correct -- the distributions in the mixture that have small variance pile up in the middle, while those with greater variance form the long tails. You could get a lot of different shapes depending on the distribution of N.

But save yourself some time. Instead of making N normal variables and taking the mean, just make one and divide it by sqrt(N) -- that will give you a value with *exactly* the same distribution.

Your graph looks a little weird. First, why turn it sideways? We normally plot density on the ordinate, not on the abscissa. Second, there is a thick black bar on the left, but that seems to be an artifact because at least half of it is below zero -- how can that happen?

Mike
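The 1/N variance claim and the sqrt(N) shortcut can be checked with a short simulation. This is only a minimal sketch: the names `sizes`, `reps`, `sd_means` and `sd_short`, the seed, and the chosen sample sizes are illustrative assumptions, not part of the original posts.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
sizes <- c(1, 10, 100, 1000)     # illustrative sample sizes (assumption)
reps  <- 2000                    # number of simulated means per sample size

# Long way: draw n standard normal values and average them, reps times per n
sd_means <- sapply(sizes, function(n) sd(replicate(reps, mean(rnorm(n)))))

# Shortcut: a single standard normal draw scaled by 1/sqrt(n)
sd_short <- sapply(sizes, function(n) sd(rnorm(reps) / sqrt(n)))

# Both empirical standard deviations should sit close to 1/sqrt(n)
result <- data.frame(n = sizes, theory = 1 / sqrt(sizes), sd_means, sd_short)
print(result)
```

The two approaches produce different random numbers, so they match in distribution rather than value by value; the shortcut simply avoids generating n draws per mean.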
Thomas Lumley
2009-Mar-27 07:55 UTC
[R] Physical or Statistical Explanation for the "Funnel" Plot?
On Thu, 26 Mar 2009, Jason Rupert wrote:

>
> The R code below produces (after running for a few minutes on a decent computer) the plot shown at the following location:
>
> http://n2.nabble.com/Is-there-a-physical-and-quantitative-explanation-for-this-plot--td2542321.html
>
> I'm just taking the mean of a given set of random variables, where the set size
> is increased. There appears to be a quick convergence and then a pretty steady
> variance out to a set size of 10,000.

Part of the convergence is just that the standard deviation of a mean of N observations is proportional to 1/sqrt(N). In your case the distributions are all exactly Normal; the same convergence would occur with other distributions, but you would also see the change in shape from left to right as the distribution converged to Normal.

There are also some plotting artifacts due to the size of the points. The apparent stabilization at large N (and the wide vertical bar at zero that Marc Schwartz commented on) are due partly to the slow convergence of 1/sqrt(N) but largely to the fact that the plotted width can't be smaller than the width of a point.

When I draw funnel plots like this for whole-genome association data I use the 'hexbin' package, which doesn't have these artifacts, is much faster, and produces smaller graphics files.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
tlumley at u.washington.edu
University of Washington, Seattle
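A hexbin version of the funnel plot might look roughly like the sketch below. It is not the code used for the genome-wide plots mentioned above: the helper `group_means`, the reduced grid of set sizes in `sizes`, the seed, and the choice of `xbins = 50` are all assumptions made to keep the example quick, and the base-graphics fallback is only there in case the 'hexbin' package is not installed.

```r
set.seed(42)                          # arbitrary seed, only for reproducibility
N <- 100000
X <- rnorm(N)
sizes <- seq(10, 1000, by = 10)       # reduced grid of set sizes (assumption)

# Hypothetical helper: means of consecutive, complete groups of k values
group_means <- function(x, k) {
  m <- length(x) %/% k                # number of complete groups
  colMeans(matrix(x[seq_len(m * k)], nrow = k))
}

means  <- lapply(sizes, function(k) group_means(X, k))
funnel <- data.frame(size = rep(sizes, lengths(means)),
                     mean = unlist(means))

if (requireNamespace("hexbin", quietly = TRUE)) {
  library(hexbin)
  # Bin the points into hexagons; shading then reflects counts, not point size
  hb <- hexbin(funnel$size, funnel$mean, xbins = 50)
  plot(hb, xlab = "set size", ylab = "group mean")
} else {
  # Fallback without hexbin: tiny plotting symbols reduce the point-size artifact
  plot(funnel$size, funnel$mean, pch = ".",
       xlab = "set size", ylab = "group mean")
  curve( 2 / sqrt(x), add = TRUE, lty = 2)   # approximate +2 SD envelope
  curve(-2 / sqrt(x), add = TRUE, lty = 2)   # approximate -2 SD envelope
}
```

Building the groups with `matrix()` and `colMeans()` instead of `tapply()` inside a loop is a design choice for speed; it computes the same block means over consecutive values.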