I frequently want to test for differences between animal size frequency distributions. The obvious test (I think) to use is the Kolmogorov-Smirnov two-sample test (provided in R as the function ks.test in package ctest).

The KS test is for continuous variables, and this obviously includes length, weight etc. However, limitations in measuring (e.g. length to the nearest cm/mm, weight to the nearest g/mg etc.) have the obvious effect of "discretising" real data. The ks.test function checks for the presence of ties, noting in the help page that "continuous distributions do not generate them". Given the problem of "measuring to the nearest..." noted above, I frequently find that my data have ties and ks.test generates a warning.

I was interested to note that the example of a two-sample KS test given in Sokal & Rohlf's "Biometry" (I have the 2nd edition, where the example is on p.441) has exactly the same problem:

> A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
> B <- c(100,105,107,107,108,111,116,120,121,123)
> ks.test(A,B)

        Two-sample Kolmogorov-Smirnov test

data:  A and B
D = 0.475, p-value = 0.1244
alternative hypothesis: two.sided

Warning message:
cannot compute correct p-values with ties in: ks.test(A, B)

In their chapter 2, "Data in Biology", Sokal & Rohlf note "any given reading of a continuous variable ... is therefore an approximation to the exact reading, which is in practice unknowable. However, for the purposes of computation these approximations are usually sufficient..."

I am interested to know whether this can be made more exact. Are there methods to test that data are measured at an appropriate scale so as to be regarded as sufficiently continuous for a KS test, or is a common-sense choice of measurement precision widely regarded as sufficient?

Any comments/references would be appreciated!

David Middleton

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
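For readers following the example, here is a minimal base-R sketch of where the reported D comes from and how heavily tied the data are. It is an illustration only, not the internals of ks.test; the names FA, FB, z and gap are made up for this sketch.

```r
# Data from the Sokal & Rohlf example quoted above
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

# Empirical distribution functions of the two samples
FA <- ecdf(A)
FB <- ecdf(B)

# Both ECDFs are step functions that jump only at observed values,
# so the largest vertical gap must occur at one of the pooled data points
z   <- sort(unique(c(A, B)))
gap <- abs(FA(z) - FB(z))
D   <- max(gap)

D                     # 0.475, the D reported by ks.test above
z[which.max(gap)]     # the size at which the two ECDFs are furthest apart

# Number of observations that repeat a value already seen in the pooled data
sum(duplicated(c(A, B)))
```

The statistic itself is well defined with ties; it is only the p-value calculation that relies on the continuity assumption, which is what the warning refers to.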
> I frequently want to test for differences between animal size frequency
> distributions. The obvious test (I think) to use is the Kolmogorov-Smirnov
> two sample test (provided in R as the function ks.test in package ctest).

"obvious" depends on the problem you want to test: KS tests the hypothesis

H_0: F(z) = G(z) for all z   vs.   H_1: F(z) != G(z) for at least one z

ks.test assumes that both F and G are continuous distribution functions. However, if you want to test

H_0: F(z) = G(z)   vs.   H_1: F(z) = G(z - delta); delta != 0

as "test for differences" indicates, the Wilcoxon rank sum test is "obvious". Or, more generally, if your hypothesis is "exchangeability", a permutation test can be used.

> The KS test is for continuous variables and this obviously includes length,
> weight etc. However, limitations in measuring (e.g length to the nearest
> cm/mm, weight to the nearest g/mg etc) has the obvious effect of
> "discretising" real data.

or maybe the underlying distribution is discrete? Anyway: ks.test and wilcox.test in ctest assume data from continuous distributions and fall back on a large-sample approximation (the normal approximation for wilcox.test) if ties occur. For the Wilcoxon and permutation tests, the conditional distribution (that is: conditional on the ties) can be computed using the exactRankTests package.

> The ks.test function checks for the presence of ties noting in the help page
> that "continuous distributions do not generate them". Given the problem of
> "measuring to the nearest..." noted above I frequently find that my data has
> ties and ks.test generates a warning.
> I was interested to note that the example of a two-sample KS test given in
> Sokal & Rohlf's "Biometry" (I have the 2nd edition where the example is on
> p.441) has exactly the same problem:
>
> > A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
> > B <- c(100,105,107,107,108,111,116,120,121,123)

For your example:

R> library(exactRankTests)
R> wilcox.exact(B, A)

        Exact Wilcoxon rank sum test

data:  B and A
W = 36.5, p-value = 0.02039
alternative hypothesis: true mu is not equal to 0

R> perm.test(B, A)

        2-sample Permutation Test

data:  B and A
T = 1118, p-value = 0.01864
alternative hypothesis: true mu is not equal to 0

Torsten
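The permutation idea can also be sketched in base R without any extra package. What follows is a Monte Carlo approximation under the exchangeability hypothesis, not the exact conditional computation that exactRankTests performs; the statistic is the sum of the B sample (which matches the T = 1118 reported above), and n_perm, T_perm and p_mc are names invented for this sketch.

```r
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

set.seed(1)                  # arbitrary seed, only for reproducibility
pooled <- c(A, B)
n_B    <- length(B)
n_perm <- 9999               # number of random relabellings (arbitrary choice)

# Test statistic: sum of the observations labelled "B"
T_obs <- sum(B)

# Under exchangeability every relabelling is equally likely, so draw
# random subsets of size n_B from the pooled data and recompute the sum
T_perm <- replicate(n_perm, sum(sample(pooled, n_B)))

# Two-sided Monte Carlo p-value: proportion of relabellings at least as far
# from the permutation mean as the observed statistic (observed one included)
T_mean <- n_B * mean(pooled)
p_mc   <- (1 + sum(abs(T_perm - T_mean) >= abs(T_obs - T_mean))) / (n_perm + 1)

T_obs
p_mc
```

Because the pooled values are resampled as they stand, ties are carried through the relabelling automatically, which is the sense in which the test is conditional on the ties. The Monte Carlo p-value will vary slightly with the seed and with n_perm.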
Thanks for the input, and sorry for the delay in returning to the thread.

> > I frequently want to test for differences between animal size frequency
> > distributions. The obvious test (I think) to use is the Kolmogorov-Smirnov
> > two sample test (provided in R as the function ks.test in package ctest).
>
> "obvious" depends on the problem you want to test: KS tests the hypothesis
>
> H_0: F(z) = G(z) for all z vs. H_1: F(z) != G(z) for at least one z
>
> ks.test assumes that both F and G are continuous variables. However, if
> you want to test
>
> H_0: F(z) = G(z) vs. H_1: F(z) = G(z - delta); delta != 0
>
> as "test for differences" indicates, the Wilcoxon rank sum test is
> "obvious". Or, more general, if your hypothesis is "exchangeability", a
> permutation test can be used.

Apologies for my vague description. The Wilcoxon rank sum test is a test of difference in location, as is the permutation test I believe. I am interested in more than just location - the animal size distributions I have in mind are often multimodal, encompassing different cohorts for example - so I am interested in a more general test of differences in the distributions, both for exploratory purposes and to see if it is appropriate to lump samples. Thus the KS test seems the "obvious" choice.

> > The KS test is for continuous variables and this obviously includes length,
> > weight etc. However, limitations in measuring (e.g length to the nearest
> > cm/mm, weight to the nearest g/mg etc) has the obvious effect of
> > "discretising" real data.
>
> or maybe the underlying distribution is discrete?

In the case I described (animal size) it is pretty clear that the variable is continuous, and likewise the underlying distribution. The ties really are the result of rounding error.

Off list, both Don MacQueen and Ross Darnell came up with the idea of "jittering" the values (adding a random number from a uniform distribution half the width of the measurement unit) to remove the ties, and re-testing to see if the rounding was influencing the results. This seems to be what I need.

David Middleton
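The jittering idea might look something like the sketch below. It assumes the values were rounded to the nearest whole unit, so the noise is drawn uniformly within half a unit either side of each recorded value; that reading of the suggestion, the helper jitter_ks, and the names unit and n_rep are assumptions of this sketch rather than anything given in the off-list replies.

```r
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

set.seed(42)     # arbitrary seed, only for reproducibility
unit  <- 1       # assumed measurement resolution (values rounded to nearest 1)
n_rep <- 200     # number of jittered replicates (arbitrary choice)

# Hypothetical helper: jitter both samples once, then return the KS p-value
jitter_ks <- function(x, y, unit) {
  xj <- x + runif(length(x), -unit / 2, unit / 2)
  yj <- y + runif(length(y), -unit / 2, unit / 2)
  ks.test(xj, yj)$p.value
}

# Repeat the jitter-and-test step to see how much the p-value moves around
p_jit <- replicate(n_rep, jitter_ks(A, B, unit))

summary(p_jit)          # spread of p-values across jittered data sets
mean(p_jit < 0.05)      # share of replicates that would reject at the 5% level
```

If the jittered p-values cluster tightly and lead to the same conclusion as the test on the rounded data, the rounding is probably not driving the result. Base R's jitter(x, amount = unit / 2) applies the same kind of uniform perturbation and could replace the runif() lines.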
>>> David Middleton <dmiddleton at fisheries.gov.fk> 03/28/02 05:48AM >>> wrote

>> I frequently want to test for differences between animal size
>> frequency distributions. The obvious test (I think) to use is the
>> Kolmogorov-Smirnov two sample test (provided in R as the function
>> ks.test in package ctest).

and later added:

> Apologies for my vague description. The Wilcoxon rank sum test is a test
> of difference in location, as is the permutation test I believe. I am
> interested in more than just location - the animal size distributions I have
> in mind are often multimodal, encompassing different cohorts for example
> - so I am interested in a more general test of differences in the
> distributions, both for exploratory purposes and to see if it is
> appropriate to lump samples. Thus the KS test seems the "obvious"
> choice.

In which case, I recommend the methods developed and advocated by Handcock & Morris; see www.stat.washington.edu/handcock/RelDist, for which code in R is available. These provide more complete methods for comparing two distributions; I think they're really good. The only caveat is that the sample size should be large (at least hundreds, preferably thousands).

Peter

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
(212) 845-4485 (voice)
(917) 438-0894 (fax)
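As a rough illustration of the idea behind relative distribution methods (and not the Handcock & Morris code itself), the "relative data" can be formed in base R by evaluating the reference sample's empirical CDF at each comparison value. The choice of B as the reference, the quartile bins, and the names F_ref, rel and counts are assumptions of this sketch, and the 26 observations used here are far below the sample sizes recommended above.

```r
A <- c(104,109,112,114,116,118,118,117,121,123,125,126,126,128,128,128)
B <- c(100,105,107,107,108,111,116,120,121,123)

# Treat B as the reference sample and A as the comparison sample
F_ref <- ecdf(B)

# Relative data: the position of each comparison value within the
# reference distribution, on a 0-1 scale
rel <- F_ref(A)

# If the two distributions were the same, the relative data would be
# roughly uniform on [0, 1]; bin them to see where they pile up
counts <- table(cut(rel, breaks = seq(0, 1, by = 0.25), include.lowest = TRUE))
counts
```

Unlike a single test statistic, the shape of the binned relative data shows where in the reference distribution the comparison sample is over- or under-represented, which is the kind of detail that matters for multimodal size distributions.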