Ethan Johnsons
2006-Dec-02 15:56 UTC
[R] Chi-squared approximation may be incorrect in: chisq.test(x)
I am getting "Chi-squared approximation may be incorrect in: chisq.test(x)" with the data bleow. Frequency distribution of number of male offspring in families of size 5. Number of Male Offspring N 0 518 1 2245 2 4621 3 4753 4 2476 5 549 Total 15,162> x=matrix(c(0,1,2,3,4,5,518,2245,4621,4753,2476,549), ncol=2) > x[,1] [,2] [1,] 0 518 [2,] 1 2245 [3,] 2 4621 [4,] 3 4753 [5,] 4 2476 [6,] 5 549> chisq.test(x)Pearson's Chi-squared test data: x X-squared = 40.4672, df = 5, p-value = 1.202e-07 Warning message: Chi-squared approximation may be incorrect in: chisq.test(x) Does the warning mean that chisq.test(x) can't be used for this data. If chisq.test(x) can't be used for this data, which test is to be used? Pls advise. thx ej
(Ted Harding)
2006-Dec-02 17:42 UTC
[R] Chi-squared approximation may be incorrect in: chisq.tes
On 02-Dec-06 Ethan Johnsons wrote:> I am getting "Chi-squared approximation may be incorrect in: > chisq.test(x)" with the data bleow. > > Frequency distribution of number of male offspring in families of size > 5. > Number of Male Offspring N > 0 518 > 1 2245 > 2 4621 > 3 4753 > 4 2476 > 5 549 > Total 15162 > > >> x=matrix(c(0,1,2,3,4,5,518,2245,4621,4753,2476,549), ncol=2) >> x > [,1] [,2] > [1,] 0 518 > [2,] 1 2245 > [3,] 2 4621 > [4,] 3 4753 > [5,] 4 2476 > [6,] 5 549 >> chisq.test(x) > > Pearson's Chi-squared test > > data: x > X-squared = 40.4672, df = 5, p-value = 1.202e-07 > > Warning message: > Chi-squared approximation may be incorrect in: chisq.test(x) > > Does the warning mean that chisq.test(x) can't be used for this data. > > If chisq.test(x) can't be used for this data, which test is to be used?You are feeding a matrix with 6 rows and *2 columns* to chisq.test() See "?chisq.test": If 'x' is a matrix with at least two rows and columns, it is taken as a two-dimensional contingency table, and hence its entries should be nonnegative integers. In other words, the first column -- giving the numbers (0-5) of male offspring -- is also being used as count data, so what you're getting is a test of association between the two sets of "counts" {0,1,2,3,4,5} and {518,2245,4621,4753,2476,549}! So, if (which seems likely) you want to test the counts in the second column against (say) a binomial distribution, you could use the following code (assuming P[boy]=0.5) chisq.test(x[,2],p=dbinom(x[,1],5,0.5)) ## Chi-squared test for given probabilities ## data: x[, 2] ## X-squared = 30.3181, df = 5, p-value = 1.277e-05 so it seems a binomial with P=0.5 won't do. Hence next try the best estimate of P assuming binomial: P<-sum((0:5)*x[,2])/(5*sum(x[,2])) P ## [1] 0.5064635 So does this improve things?: chisq.test(x[,2],p=dbinom(x[,1],5,P)) ## Chi-squared test for given probabilities ## data: x[, 2] ## X-squared = 17.744, df = 5, p-value = 0.003285 well, quite a bit -- but of course chisq.test() uses the wrong degrees of freedom (should be 4, since we've estimated P this time), so actually it's worse than that: ## 1-pchisq(17.744,4) ## [1] 0.001384664 though at 1.4e-3 it's still better than 1.3e-5! Nevertheless, now that we've got the best-fitting binomial model, the hypothesis that the numbers of male children in families of size 5 is binomially distributed is clearly implausible. Sufficient conditions for binomial distribution are a) The genders of different births in the same family are independent of each other b) The probability of "boy" is the same for all families. Hence either (a) or (b) [or both] fails for these data. I don't know of any way to force chisq.test() itself to take account of degrees of freedom used up in estimation. chisq.test() is a test for contingency tables (including a 1-row table tested against a fixed given probability distribution). Anyway, an indication of how the data fail to follow the best fitting binomial distribution can be obtained by looking at X2T<-chisq.test(x[,2],p=dbinom(x[,1],5,P)) cbind((0:5),X2T$expected,X2T$observed) ## [,1] [,2] [,3] ## [1,] 0 443.9691 518 ## [2,] 1 2277.9893 2245 ## [3,] 2 4675.3120 4621 ## [4,] 3 4797.7711 4753 ## [5,] 4 2461.7188 2476 ## [6,] 5 505.2396 549 showing (at least) that there seem to be excess counts in the "0" and "5" classes. This can be further clarified with n.E<-X2T$expected; n.O<-X2T$observed; Dev<-(n.O-n.E)/sqrt(n.E) cbind((0:5),X2T$expected,X2T$observed,Dev) ## Dev ## [1,] 0 443.9691 518 3.5134726 ## [2,] 1 2277.9893 2245 -0.6911901 ## [3,] 2 4675.3120 4621 -0.7943114 ## [4,] 3 4797.7711 4753 -0.6463652 ## [5,] 4 2461.7188 2476 0.2878353 ## [6,] 5 505.2396 549 1.9468512 sum(Dev^2) ## [1] 17.74403 [agreeing with the 17.744 from chisq.test()] Hence there seems to be a clear excess of all-girl families, and some excess of all-boy families, but nothing obviously remarkable about the others (which does not of course exclude that there is a dependency between the genders of successive births in these too). Quite how you proceed further depends on how you wish to model this dependency. See VGAM package for betabin.ab(), which fits a beta-binomial (binomial where the probability P varies from case [family] to case according to a beta-distribution), for instance. Hoping this helps, Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 02-Dec-06 Time: 17:42:16 ------------------------------ XFMail ------------------------------
Reasonably Related Threads
- In chisq.test(x) : Chi-squared approximation may be incorrect
- visualizing frequencies
- Chi square value of anova(binomialglmnull, binomglmmod, test="Chisq")
- accuracy of chi-square distribution approximations
- Extracting approximate Wald test (Chisq) from coxph(..frailty)