michael watson (IAH-C)
2008-Sep-10 13:19 UTC
[R] A question about the hypergeometric distribution and phyper()
Dear All,

I have a question about the hypergeometric distribution.

Example 1: I have a universe of 6187 objects, and 164 have a particular attribute, therefore 6187-164 do not have that attribute. I sample 249 of those objects, and find that 19 have that attribute. I get a p-value here (looking at just over-representation):

phyper(19, 164, 6187-164, 249, lower.tail=FALSE)
[1] 7.816235e-06

Example 2: I have a universe of 6187 objects, and 12 have a particular attribute, therefore 6187-12 do not have that attribute. I sample 249 of those objects, and find that 4 have that attribute. I get a p-value here (looking at just over-representation):

phyper(4, 12, 6187-12, 249, lower.tail=FALSE)
[1] 6.368919e-05

It seems counter-intuitive to me that the probability of seeing 19 out of 164 in a sample of 249 is less than the probability of seeing 4 out of 12 in a sample of the same size.

First off, am I using phyper() properly?

Secondly, can someone point me to some documentation explaining why these seemingly counter-intuitive p-values occur?

Thanks
Mick
Stefan Evert
2008-Sep-10 14:11 UTC
[R] A question about the hypergeometric distribution and phyper()
On 10 Sep 2008, at 15:19, michael watson (IAH-C) wrote:

> Example 1: I have a universe of 6187 objects, and 164 have a particular
> attribute, therefore 6187-164 do not have that attribute. I sample 249
> of those objects, and find that 19 have that attribute. I get a p-value
> here (looking at just over-representation):
>
> phyper(19, 164, 6187-164, 249, lower.tail=FALSE)
> [1] 7.816235e-06

Actually, if you look at ?phyper, you'll see that this should be

phyper(18, 164, 6187-164, 249, lower.tail=FALSE)
[1] 2.775819e-05

if you want to calculate Pr(X >= 19) = Pr(X > 18). Similarly:

> phyper(4, 12, 6187-12, 249, lower.tail=FALSE)
> [1] 6.368919e-05

phyper(3, 12, 6187-12, 249, lower.tail=FALSE)
[1] 0.0009816739

Which you'll still find counterintuitive, of course.

> It seems to me that the probability of seeing 19 out of 164 in a sample
> of 249 being less than the probability of seeing 4 out of 12 in a sample
> of the same size is counter-intuitive.
>
> Secondly, can someone point me to some documentation explaining why
> these seemingly counter-intuitive p-values occur?

I think it's just because the hypergeometric distribution becomes very skewed and non-normal for expected values < 1 (expectations should be roughly 6.6 in the first case and 0.5 in the second case). Perhaps it helps to visualize the two distributions?

M <- rbind(dhyper(0:20, 164, 6187-164, 249), dhyper(0:20, 12, 6187-12, 249))
rownames(M) <- c("164 out of 6187", "12 out of 6187")
colnames(M) <- 0:20
barplot(M, beside=TRUE, legend = TRUE)

Best regards,
Stefan Evert

[ stefan.evert at uos.de | http://purl.org/stefan.evert ]
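
The off-by-one point and the expected values quoted in this thread can be checked directly. The following is only a minimal sketch in base R: the names N, k and upper_tail are illustrative and do not come from the thread, and the tail is summed by hand from dhyper() purely as a cross-check on phyper().

```r
# Figures from the thread: a universe of 6187 objects and a sample of 249.
# N, k and upper_tail are illustrative names, not part of the original posts.
N <- 6187
k <- 249

# Pr(X >= x), summed explicitly from the point probabilities.
# m is the number of objects in the universe that carry the attribute.
upper_tail <- function(x, m) {
  sum(dhyper(x:min(m, k), m, N - m, k))
}

# phyper(q, ..., lower.tail = FALSE) returns Pr(X > q),
# so Pr(X >= x) has to be requested with q = x - 1.
p19 <- phyper(18, 164, N - 164, k, lower.tail = FALSE)
p4  <- phyper(3, 12, N - 12, k, lower.tail = FALSE)

# The explicit sums should agree with the corrected phyper() calls.
print(all.equal(p19, upper_tail(19, 164)))
print(all.equal(p4, upper_tail(4, 12)))

# Expected number of attribute carriers in the sample: k * m / N.
e1 <- k * 164 / N
e2 <- k * 12 / N
print(c(e1, e2))
```

If both all.equal() calls print TRUE, the calls with q = 18 and q = 3 are the ones that give Pr(X >= 19) and Pr(X >= 4). The two expected counts come out close to the rough figures given in the reply above.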