I'm afraid I can't follow your examples, but you seem to me to be
mixing contingency table tests and goodness of fit tests in a somewhat
incoherent fashion.
Note that your ``x2()'' function does a contingency table test, and not
a goodness of fit test.
Note that in chisq.test(), if ``x'' is a matrix, then the ``p''
argument is ignored.
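In the versions of R I have looked at you can see this directly: with a
matrix the result is the same whether or not ``p'' is supplied (a quick
illustrative check, not a substitute for reading the code):

```r
x <- matrix(c(1, 2, 3, 4), nc = 2)
# With a matrix, chisq.test() takes the independence-test branch,
# so these two calls give identical results; 'p' plays no part.
chisq.test(x)$p.value
chisq.test(x, p = rep(1/4, 4))$p.value
```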
For a goodness of fit test, the sampling in chisq.test ***is***
multinomial. It uses the sample() function, as a quick glance at the
code would have told you.
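For the record, here is a minimal paraphrase (my own sketch, not the
actual chisq.test code) of what that branch does: draw n cell labels
with sample() and tabulate them, which is exactly a multinomial draw.

```r
set.seed(42)
n <- 10; p <- rep(1/4, 4)
# One simulated goodness-of-fit table: n independent cell labels,
# drawn with probabilities p and then tabulated
tab <- tabulate(sample(4, n, replace = TRUE, prob = p), nbins = 4)
sum(tab)           # always n, so each simulated table is multinomial
# the same distribution, drawn directly:
rmultinom(1, n, p)
```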
I haven't the time to plough through your code and figure out what
you're driving at, but I think that part of your problem could be
the degrees of freedom. Under the contingency table test the
degrees of freedom are 1; under the goodness of fit test the
degrees of freedom are 3. (The vector of probabilities is
*known* under the g.o.f. test, not estimated.)
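You can check the two null distributions' degrees of freedom directly
(an illustrative session; the p-values themselves are beside the point):

```r
x <- matrix(c(1, 2, 3, 4), nc = 2)
p <- outer(rowSums(x), colSums(x)) / sum(x)^2
chisq.test(x)$parameter                    # df = 1: margins estimated
chisq.test(as.vector(x),
           p = as.vector(p))$parameter     # df = 3: p treated as known
```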
cheers,
Rolf Turner
On 18/01/2008, at 7:59 AM, Bob wrote:
> R Help on 'chisq.test' states that
> "if 'simulate.p.value' is 'TRUE', the p-value is computed by Monte
> Carlo simulation with 'B' replicates.
>
> In the contingency table case this is done by random sampling from
> the set of all contingency tables with given marginals, and works
> only if the marginals are positive...
>
> In the goodness-of-fit case this is done by random sampling from
> the discrete distribution specified by 'p', each sample being of
> size 'n = sum(x)'."
>
> The last paragraph suggests that in the goodness-of-fit case, if p
> gives the expected probability for each cell, this random sampling
> is multinomial.
>
> Unfortunately, as the following examples reveal, the sampling model
> is neither multinomial nor hypergeometric - at least when it is
> applied to a 4-fold table.
>
> This is rather sad, as some people assume that R's chisq.test
> function can perform a Monte Carlo test of X-squared employing a
> multinomial model - in other words, assuming that your data are a
> single random sample.
>
> ### Example 1.
>> x=matrix(c(1,2,3,4),nc=2)
>> # To begin with, let us apply the large-sample approximations
>> chisq.test(x,correct=TRUE)$p.value
> [1] 0.6726038
> Warning message:
> Chi-squared approximation may be incorrect in: chisq.test(x,
> correct = TRUE)
>> chisq.test(x,correct=FALSE)$p.value
> [1] 0.7781597
> Warning message:
> Chi-squared approximation may be incorrect in: chisq.test(x,
> correct = FALSE)
>>
>> # So let us apply a 2-tailed test of O.R. = 1, using a
>> # hypergeometric model
>> fisher.test(x)$p.value
> [1] 1
>> # This should also apply a hypergeometric model
>> chisq.test(x,simulate.p.value=TRUE,B=500000)$p.value
> [1] 1
>>
>> # Now we work out the expected probability for each cell
>> p=outer(c(sum(x[1,]),sum(x[2,])),c(sum(x[,1]),sum(x[,2])))/sum(x)^2
>> # But this applies a hypergeometric model, presumably because p
>> # is not scalar
>> chisq.test(x,p=p,simulate.p.value=TRUE,B=500000)$p.value
> [1] 1
>> # This seems to do something different,
>> # at any rate it is much slower, and needs more memory
>> chisq.test(x[1:4],p=p[1:4],simulate.p.value=TRUE,B=10000)$p.value
> [1] 1
>> # Which would appear to be using the same model as above
>>
>> # Now let us apply an X2 test using a multinomial model
>> # (The code for this x2.test function is in Appendix 1, below.)
>> x2.test(x,R=200000)
> with cc P = 0.7316812
> conventional-P = 0.8838786
> mid-P = 0.8423058
>>
>> # All of these P-values are higher than those given by the
>> # Chi-squared approximation,
>> # but they certainly do not equal 1.
>> # But is this an artefact of our very small sample?
>>
>> ### Example 2.
>> # Let us try a larger sample
>> x=matrix(c(56,35,23,42),nc=2)
>>
>> # First we apply the asymptotic model
>> chisq.test(x,correct=TRUE)$p.value
> [1] 0.002222425
>> chisq.test(x,correct=FALSE)$p.value
> [1] 0.001276595
>>
>> # Now for the hypergeometric (fixed margin totals model)
>> fisher.test(x)$p.value
> [1] 0.001931078
>> chisq.test(x,simulate.p.value=TRUE,B=500000)$p.value
> [1] 0.001913996
>> p=outer(c(sum(x[1,]),sum(x[2,])),c(sum(x[,1]),sum(x[,2])))/sum(x)^2
>> chisq.test(x,p=p,simulate.p.value=TRUE,B=500000)$p.value
> [1] 0.001891996
>>
>> # Next comes what we had hoped to be a multinomial test
>> chisq.test(x[1:4],p=p[1:4],simulate.p.value=TRUE,B=10000)$p.value
> [1] 0.01639836
>> # This is obviously not the same hypergeometric model as used for
>> # a chi-squared test.
>> # The P-value is about 10x that of the approximate tests (above)
>> # or the exact tests (below).
>>
>> x2.test(x,R=200000)
> with cc P = 0.002059990
> conventional-P = 0.001184994
> mid-P = 0.001172494
>>
>> # Whatever that chi-squared test model IS, it is certainly not
>> # multinomial!
>> # Could it possibly be Poisson and, if so, why???
>
>
> ######## Appendix 1:
>
> # We have used these functions to do a 2x2 multinomial test of X2:
>
> # Chi-squared statistic for a 2x2 table (given as a matrix or a
> # length-4 vector), with optional continuity correction 'cc'
> x2=function(y,cc=FALSE){
>   y=y*1.; n=sum(y); C=cc*n/2
>   a=y[1]; b=y[2]; c=y[3]; d=y[4]
>   ab=a+b; cd=c+d; ac=a+c; bd=b+d
>   D=ab*cd*ac*bd
>   if(D==0) x2=NA else x2=n*(abs(a*d-b*c)-C)^2/D
>   x2}
>
> # Monte Carlo X2 test under a multinomial model: R tables of total
> # n are drawn with rmultinom(), using the products of the observed
> # margin proportions as cell probabilities
> x2.test=function(x,R=5000){
>   n=sum(x)
>   p=outer(c(sum(x[1,]),sum(x[2,])),c(sum(x[,1]),sum(x[,2])))/n/n
>   Q=sort(apply(rmultinom(R,n,p),2,x2))
>   q=x2(x,cc=TRUE)
>   pl=rank(c(q,Q),ties.method='max')[1]/(length(Q)+1)
>   pe=sum(c(q,Q)==q)/(length(Q)+1); pu=1-pl+pe
>   cat('with cc P = ',pu,'\n')
>   q=x2(x)
>   pl=rank(c(q,Q),ties.method='max')[1]/(length(Q)+1)
>   pe=sum(c(q,Q)==q)/(length(Q)+1); pu=1-pl+pe
>   cat('conventional-P = ',pu,'\n')
>   cat('mid-P = ',pu-pe/2,'\n')}
>
> Bob
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.