ripley@stats.ox.ac.uk
2001-Jul-01 06:50 UTC
(PR#1007) [Rd] ks.test doesn't compute correct empirical
On Sun, 1 Jul 2001 mcdowella@mcdowella.demon.co.uk wrote:

> Full_Name: Andrew Grant McDowell
> Version: R 1.1.1 (but source in 1.3.0 looks fishy as well)
> OS: Windows 2K Professional (Consumer)
> Submission from: (NULL) (194.222.243.209)

Please upgrade: we've found a number of Win2k bugs and worked around them
since then, let alone the bug fixes and improvements in R ....

> In article <xeQ_6.1949$xd.353840@typhoon.snet.net>,
> johnt@tman.dnsalias.com writes
> >Can someone help? In R, I am generating a vector of 1000 samples from
> >Bin (1000, 0.25). I then do a Kolmogorov Smirnov test to test if the
> >vector has been drawn from a population of Bin (1000, 0.25). I would
> >expect a reasonably high p-value.....

You do realize that the Kolmogorov tests (and the Kolmogorov-Smirnov
extension) assume continuous distributions, so the distribution theory
is not valid in this case?

S-PLUS does stop you doing this:

> ks.gof(o, dist="binomial", size=100, prob=0.25)
Problem in not.cont1(ttest = d.test, nx = nx, alt.ex..: For testing
discrete distributions when sample size > 50, use the
 Chi-square test

> >Either I am doing something wrong in R, or I am misunderstanding how this
> >test should work (both quite possible)...
> >
> >Thanks,
> >JT..
> >
> >> #### 1000 random samples from binomial dist with mean =.25, n=100...
> >> o<-rbinom (1000, 100, .25)
> >> mean (o);
> >[1] 25.178
> >> var (o);
> >[1] 19.61193
> >> ks.test (o, "pbinom", 100, .25);
> >
> > One-sample Kolmogorov-Smirnov test
> >
> >data: o
> >D = 0.0967, p-value = 1.487e-08
> >alternative hypothesis: two.sided
> >
> >p-value is mighty small, leading me to reject the null hypothesis that
> >the sample has been drawn from the Bin(100, 0.25) distribution!!!

That's OK. That's not what you tested (see above).
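For reference, a hedged sketch of the chi-square route that the S-PLUS message recommends for discrete data. It bins a Bin(100, 0.25) sample and compares the cell counts with the binomial cell probabilities using chisq.test. The seed, the cut points 19 to 30 and the variable names are assumptions chosen for illustration; none of them come from the original report.

```r
set.seed(1)                      # assumed seed, for reproducibility only
o <- rbinom(1000, 100, 0.25)

# Assumed cells: X <= 19, X = 20, ..., X = 30, X >= 31 (13 cells), chosen so
# that every expected count is well above 5.
cuts <- 19:30
observed <- as.vector(table(cut(o, breaks = c(-Inf, cuts, Inf))))

# Cell probabilities under Bin(100, 0.25), taken as differences of the CDF.
expected_p <- diff(c(0, pbinom(cuts, 100, 0.25), 1))

# Goodness-of-fit test of the observed counts against those probabilities.
fit <- chisq.test(observed, p = expected_p)
fit
```

Since no parameters are estimated from the data here, the reference distribution has 12 degrees of freedom (13 cells minus 1).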
An S language point: the `;' are unnecessary.

> Some more oddities:
>
> > o<-rbinom(10000, 1, 0.25)
> > ks.test(o, "pbinom", 1, 0.25)
>
> One-sample Kolmogorov-Smirnov test
>
> data: o
> D = 0.75, p-value = < 2.2e-16
> alternative hypothesis: two.sided
>
> > length(o[o==0])
> [1] 7491
> > length(o[o==1])
> [1] 2509
> > o<-rep(0,10000)
> > ks.test(o, "pbinom", 1, 0.25)
>
> One-sample Kolmogorov-Smirnov test
>
> data: o
> D = 0.75, p-value = < 2.2e-16
> alternative hypothesis: two.sided
>
> > length(o[o==0])
> [1] 10000
> > length(o[o==1])
> [1] 0
>
> Here zeroing out the data does not change the reported D value

Nor does it change the maximum discrepancy.

> ks.test(rep(1,10000), "pbinom", 1, 0.25)

One-sample Kolmogorov-Smirnov test

data: rep(1, 10000)
D = 1, p-value = < 2.2e-16
alternative hypothesis: two.sided

shows 0 is special here.

> After playing about with
> ks.test(c(rep(0, X), rep(1, 1000-x)), "pbinom", 1, p)
> for a bit I conjecture that ks.test() takes no account
> whatsoever of ties, but merely sorts the input values
> and looks for max (position/N - pbinom(value, 1, p)).
> Anybody got the source handy?
>
> After 30 minutes of download, the relevant part of ks.test.R would appear to be

Eh? Just type ks.test in your R session for the source ....

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
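The D values reported in this message can be reproduced by hand. Below is a minimal sketch of the sorted-sample computation conjectured in the quoted text, in which ties are treated as distinct order statistics. The helper name ks_stat_sorted is hypothetical, and the final max() step is an assumption about how the differences are combined; it is not copied from the ks.test source.

```r
# Hypothetical helper: one-sample D from the sorted sample, ignoring ties.
ks_stat_sorted <- function(x, cdf, ...) {
  n <- length(x)
  d <- cdf(sort(x), ...) - (0:(n - 1)) / n   # F(x_(i)) - (i - 1)/n
  max(c(d, 1 / n - d))                       # assumed: also checks i/n - F(x_(i))
}

d_zeros <- ks_stat_sorted(rep(0, 10000), pbinom, 1, 0.25)
d_ones  <- ks_stat_sorted(rep(1, 10000), pbinom, 1, 0.25)
d_mixed <- ks_stat_sorted(c(rep(0, 7491), rep(1, 2509)), pbinom, 1, 0.25)
c(zeros = d_zeros, ones = d_ones, mixed = d_mixed)
```

Under these assumptions the all-zero sample and the 7491/2509 sample both give 0.75, and the all-one sample gives 1. In each case the first sorted value alone contributes F(x) - 0, which is 0.75 whenever a 0 is present and 1 otherwise.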
ripley@stats.ox.ac.uk
2001-Jul-03 05:18 UTC
(PR#1007) [Rd] ks.test doesn't compute correct empirical
On Tue, 3 Jul 2001, A. G. McDowell wrote:

> In message <Pine.GSO.4.31.0107010731110.7616-100000@auk.stats>, Prof
> Brian D Ripley <ripley@stats.ox.ac.uk> writes
> >
> >You do realize that the Kolmogorov tests (and the Kolmogorov-Smirnov
> >extension) assume continuous distributions, so the distribution theory
> >is not valid in this case?
> >
> >S-PLUS does stop you doing this:
> >
> >> ks.gof(o, dist="binomial", size=100, prob=0.25)
> >Problem in not.cont1(ttest = d.test, nx = nx, alt.ex..: For testing
> >discrete distributions when sample size > 50, use the
> > Chi-square test
> >
>
> Thank you for your prompt reply to my bug report. While I agree that
> the distribution theory for the Kolmogorov tests assumes a continuous
> distribution, I would like to request a modification to the
> existing routines. The purpose of this would be to provide a result
> that would represent a conservative test in the case when the underlying
> distribution is discrete.
>
> This would be in accord with P 432 of the 3rd edition of "Practical
> Nonparametric Statistics", by Conover, and section 25.38 of "Kendall's
> Advanced Theory of Statistics, 6th Edition, Vol 2A", by Stewart, Ord,
> and Arnold, both of which refer to Noether (1963) "Note on the
> Kolmogorov Statistic in the discrete case", Metrika, 7, 115. Users
> reared on these and similar textbooks would be less surprised at the
> behaviour of R if this modification was made, whereas users who do
> not attempt to apply the Kolmogorov-Smirnov test to discrete
> distributions would not notice any difference.

(Hopefully readers of those textbooks would understand that the results
you reported as a bug *are* the behaviour of KS test. Nowhere does R say
it has implemented a modified KS test.
The one data point we have suggests otherwise ....)

> It would also be in accord with the behaviour of R in the two-sample
> case, where the effect of the existing code seems to be to provide
> a conservative test (since the statistic returned is no larger than
> might be returned in any possible tie-breaking) coupled with a warning,
> (to which I would have no objection in the one-sample case).
>
> It seems to me that the following modification would suffice: replace
>
> x <- y(sort(x), ...) - (0 : (n-1)) / n
>
> with
>
> x <- sort(x)
> untied <- c(x[1:n-1] != x[2:n], TRUE)
> x <- y(x, ...) - (0 : (n-1)) / n
> x <- x[untied]

In your original examples, this reduces a sample of size 10000 to one of
size 101 or 2. Conservative - yes. Useful - very unlikely!

> Users dealing with data derived from continuous distributions would
> not see any difference, because (except with very small probability
> due to floating point inaccuracy) they would never produce tied data.

There are circumstances in which one would want the original KS definition
for all data sets, where one wants the test value and not the p value.
I've added a warning, but I do not think we should be implementing a
different definition.

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
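To make the size reduction concrete, here is a hedged sketch of the proposed four-line replacement, wrapped in a hypothetical helper ks_stat_untied. The index expressions x[-n] and x[-1] are used in place of x[1:n-1] and x[2:n]; they select the same elements for n > 1. The final max() step is an assumption about how the retained differences would be combined, since the proposal only shows the lines up to x[untied].

```r
# Hypothetical helper: the proposed modification, keeping only the last
# element of each run of tied values.
ks_stat_untied <- function(x, cdf, ...) {
  n <- length(x)
  x <- sort(x)
  untied <- c(x[-n] != x[-1], TRUE)    # TRUE at the last element of each run
  d <- cdf(x, ...) - (0:(n - 1)) / n
  d <- d[untied]
  # Assumption: the statistic is then formed as for the unmodified differences.
  list(D = max(c(d, 1 / n - d)), kept = sum(untied))
}

r_zeros <- ks_stat_untied(rep(0, 10000), pbinom, 1, 0.25)
r_mixed <- ks_stat_untied(c(rep(0, 7491), rep(1, 2509)), pbinom, 1, 0.25)

set.seed(1)                            # assumed seed, for reproducibility only
o <- rbinom(1000, 100, 0.25)
r_binom <- ks_stat_untied(o, pbinom, 100, 0.25)
c(zeros = r_zeros$kept, mixed = r_mixed$kept, binom = r_binom$kept)
```

The number of retained points equals the number of distinct values in the sample, so it can never exceed 2 for Bin(1, p) data or 101 for Bin(100, p) data, whatever the sample size. Under the stated assumption, the all-zero sample gives D = 0.25 rather than 0.75.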