This question is mainly aimed at Kurt Hornik as author of the ctest
package, but I'm cc'ing it to r-help as I suspect there will be other
valuable opinions out there.

I have been attempting two-sample Kolmogorov-Smirnov tests using the
ks.test function from the ctest package (ctest v.0.9-15, R v.0.63.3
win32). I am comparing fish length-frequency distributions. My main
reference for the KS test at present is Sokal & Rohlf, Biometry (2nd
edn), pages 440-445.

The individuals in my samples are measured to the nearest 0.5cm, so in
most samples there are several identical length values. It appears that
the KS test statistic D is being overestimated (and the p-value
therefore underestimated). I think this is best illustrated by a
trivial (but extreme) example:

> library(ctest)
> x <- y <- rep(1, 10)
> ks.test(x, y)

         Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 1, p-value = 9.08e-005
alternative hypothesis: two.sided

Obviously, when two identical vectors are compared, the test statistic
D should be zero and the probability that the two vectors represent the
same underlying distribution should be 1. If D is calculated using the
first method outlined by Sokal & Rohlf (the maximum absolute difference
between relative cumulative frequencies) then D is indeed 0.

The method used in the ctest code is presented by Sokal & Rohlf as an
alternative (NB not approximate) computation scheme and attributed to
Gideon & Mueller (1978). The pertinent code is the line:

    z <- ifelse(order(c(x, y)) <= n.x, 1/n.x, -1/n.y)

If the two vectors in the example above had been identical but with no
repeated values, the result of order(c(x, y)) would have been along the
lines of

     [1]  1 11  2 12  3 13  4 14  5 15  6 16  7 17  8 18  9 19 10 20

(the essential point being that items in the result come alternately
from x and y). D is calculated as max(abs(cumsum(z))), with the result
that the minimum D for identical vectors is min(1/n.x, 1/n.y). (It
therefore appears to me that this computational method should be
considered an approximate rather than an alternative method.)

In the case of vectors with replicated values the problem is worse,
because values from one vector are grouped together in the vector
returned by order. In the case of the example above:

> order(c(x, y))
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

I don't think this can be considered a bug, but it is certainly a
problem for the method used in computing D. Has anyone coded
alternative KS test computation methods in R/S? It's obviously not
hard, but could be slow unless done elegantly!

Thanks

David Middleton
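For concreteness, a small sketch contrasting the two computations on
the tied example above (an illustration only, not the ctest source;
D.direct and D.order are just labels chosen here):

    x <- y <- rep(1, 10)
    n.x <- length(x); n.y <- length(y)

    ## "direct" computation: maximum absolute difference between the
    ## two relative cumulative frequency distributions, evaluated at
    ## each distinct pooled value
    v <- sort(unique(c(x, y)))
    D.direct <- max(abs(sapply(v, function(t) mean(x <= t) - mean(y <= t))))

    ## order()-based computation, as in the ctest line quoted above
    z <- ifelse(order(c(x, y)) <= n.x, 1/n.x, -1/n.y)
    D.order <- max(abs(cumsum(z)))

    c(D.direct, D.order)    # 0 by the direct method, 1 by the order() method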
On Thu, 8 Apr 1999, David Middleton wrote:

> The individuals in my samples are measured to the nearest 0.5cm, so in
> most samples there are several identical length values. It appears
> that the KS test statistic D is being overestimated (and the p-value
> therefore underestimated).

If the data are discretized, the KS test does not have the standard
(distribution-free) distribution. `Distribution-free' here means
independent samples from a continuous distribution. So the KS test is
not, IMHO, appropriate in your problem.

My view is that the function should warn you off, and not give a
p-value, if it finds ties. It might be good to construct the exact
statistic, though.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
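A minimal sketch of the kind of check being suggested here (purely
illustrative, not part of ctest; the wrapper name is made up, and it
only warns rather than withholding the p-value):

    ## warn about ties before handing the data to ks.test
    ks.test.checked <- function(x, y, ...) {
        if (any(duplicated(c(x, y))))
            warning("ties in the pooled sample: the KS p-value assumes samples from a continuous distribution")
        ks.test(x, y, ...)    # ks.test as provided by library(ctest)
    }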
Brian

Many thanks for the rapid response. Here are the inevitable follow-up
questions!

> If the data are discretized, the KS test does not have the standard
> (distribution-free) distribution. `Distribution-free' here means
> independent samples from a continuous distribution. So the KS test is
> not, IMHO, appropriate in your problem.

I am aware that the KS test assumes samples from a continuous
distribution. Fish length obviously is a continuous variable, though
measuring to the nearest 0.5cm (or 1cm in some cases) does introduce a
certain discretization. In the case I'm considering, lengths to the
nearest 0.5cm are the highest precision available. I wonder, therefore,
whether there are guidelines on the precision required before a
continuous variable yields effectively continuous measurements?
Possibly some criterion based on the ratio of precision to range? In
this case the fact that there are repeated length values suggests that
size classes have, in effect, been created.

> My view is that the function should warn you off, and not give a
> p-value, if it finds ties. It might be good to construct the exact
> statistic, though.

Sokal & Rohlf do give an approximate two-sample KS test for large
sample sizes. Again D is the maximum absolute difference between
cumulative relative frequencies, but the difference is calculated only
once per measurement class rather than for each individual measurement.
Their example has sample sizes of 400-500. Is there any published
guidance on what sample size is considered "large enough"?

I hope these questions are not too general for the R list -
unfortunately my access to statistical publications is somewhat limited
at present. I do note with satisfaction that it will be relatively easy
to code the approximate test in R.

Thanks

David Middleton, dajm at deeq.demon.co.uk
Falkland Islands Fisheries Department
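For illustration, one possible sketch of such an approximate test (not
code from ctest or from Sokal & Rohlf; the function name ks.approx is
made up, and 1.36 is the usual asymptotic 5% constant for the
two-sample critical value):

    ## D computed once per measurement class, compared with the
    ## large-sample critical value crit * sqrt((n1 + n2) / (n1 * n2))
    ks.approx <- function(x, y, crit = 1.36) {
        classes <- sort(unique(c(x, y)))        # the measurement classes
        Fx <- cumsum(table(factor(x, levels = classes))) / length(x)
        Fy <- cumsum(table(factor(y, levels = classes))) / length(y)
        D <- max(abs(Fx - Fy))
        D.crit <- crit * sqrt((length(x) + length(y)) / (length(x) * length(y)))
        list(D = D, D.crit = D.crit, reject = D > D.crit)
    }

On the trivial tied example earlier in the thread,
ks.approx(rep(1, 10), rep(1, 10)) gives D = 0, as expected.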