Hi R Users,

I have two vectors, x and y, of equal length, representing two types of data from two studies. I would like to test whether they are similar enough to be used interchangeably. No assumptions about distributions can be made (initial tests clearly show that they are not normal).

Here is a result:

        Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.1091, p-value < 2.2e-16
alternative hypothesis: two-sided

Warning message:
In ks.test(x[1:nx], y[1:nx], exact = FALSE) :
  cannot compute correct p-values with ties

Here are some questions:

a) What does the error message mean and what does it imply?
b) The data is very noisy and the initial result shows that there is no relation between x and y. Is there a way to calculate an effect size?
c) Can the p-value be used, when running tests over a large number of different data sets, as a metric for ranking similarity between x and y data sets?

Best,
R.
On Aug 4, 2010, at 5:49 PM, Ralf B wrote:

> Hi R Users,
>
> I have two vectors, x and y, of equal length representing two types of
> data from two studies. I would like to test if they are similar enough
> to use them interchangeably. No assumptions about distributions can be
> made (initial tests clearly show that they are not normal).
> Here is a result:
>
> Two-sample Kolmogorov-Smirnov test
>
> data: x and y
> D = 0.1091, p-value < 2.2e-16
> alternative hypothesis: two-sided
>
> Warning message:
> In ks.test(x[1:nx], y[1:nx], exact = FALSE) :
> cannot compute correct p-values with ties
>
> Here are some questions:
>
> a) What does the error message mean and what does it imply?

a) It is not an error message; it is a warning.
b) It does seem rather self-explanatory.

> b) The data is very noisy and the initial result

What "initial result"?

> shows that there is
> no relation between x and y. Is there a way to calculate an effect
> size?

An "effect size" implies some sort of statistical model. You have not offered one yet.

> c) Can the p-value be used, when running tests over a large amount of
> different data sets, as a metric for ranking similarity between x and
> y data sets?

Not in a useful way. The p-value from ks.test on large datasets will almost always be small, but that information does not characterize the differences in distribution in any meaningful way. Many similar questions have been posted and answered over the years on r-help.

> Best
> R.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT
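The point about p-values on large datasets can be illustrated with a minimal sketch on simulated data. The sample size, the 0.1 shift in the mean, and the names x_sim and y_sim are assumptions chosen for illustration; they are not Ralf's data. The two samples differ only slightly, yet with this many observations the p-value comes out tiny while D stays modest:

```r
# Simulated stand-ins for two large samples (hypothetical, for illustration only)
set.seed(1)
n <- 1e5
x_sim <- rnorm(n)               # standard normal
y_sim <- rnorm(n, mean = 0.1)   # same shape, mean shifted by 0.1

res <- ks.test(x_sim, y_sim)

res$statistic   # D: largest vertical gap between the two empirical CDFs
res$p.value     # driven mostly by n, not by how large the gap is
```

Rerunning this with a much smaller n would typically give a far larger p-value for the same underlying shift, which is why the p-value alone is a poor ranking metric across datasets of different sizes.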
It looks like the test is indicating a far bigger difference than could be explained by random variation. Since the sample sizes are equal, have you considered plotting the ordered values of one against the ordered values of the other (essentially an empirical QQ plot), with a 45-degree line drawn in, to examine the way(s) in which the two samples differ?

On Thu, Aug 5, 2010 at 7:49 AM, Ralf B <ralf.bierig at gmail.com> wrote:
> Two-sample Kolmogorov-Smirnov test
>
> data: x and y
> D = 0.1091, p-value < 2.2e-16
> alternative hypothesis: two-sided
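The plot suggested above might be sketched as follows. The data here are simulated placeholders (x_sim and y_sim are hypothetical names, and the exponential distributions are an arbitrary choice), so substitute the real x and y:

```r
# Hypothetical stand-ins for the two equal-length samples
set.seed(2)
x_sim <- rexp(500)
y_sim <- rexp(500, rate = 0.8)

# Ordered values of one sample against ordered values of the other
sx <- sort(x_sim)
sy <- sort(y_sim)

plot(sx, sy, asp = 1,
     xlab = "ordered x", ylab = "ordered y",
     main = "Empirical QQ plot")
abline(0, 1, lty = 2)   # the line y = x; points on it mean matching quantiles
```

Systematic departures from the dashed line show where the samples differ: a different slope suggests a difference in spread, a parallel offset suggests a shift in location, and curvature in the tails suggests differing tail behaviour. Base R's qqplot(x, y) produces the same kind of display.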
This is unbelievable. Now people like yourself start doing background searches on one and accusing one of not being professional, plus posting cheeky R code. The reason I submitted these questions was that the existing answers did not address my particular problem (or perhaps I mistakenly thought so). The point here is that the forum should be a place where one is allowed to ask questions without first studying the history of the entire forum for fear that someone might have asked them before. I was hoping to find clearer answers than the ones I was able to read. I do know how to search in Google. But I am not an expert in statistics, as you already found in your background check. If I were fluent in statistics and R, and if past answers had exactly addressed my problem, I would not post here and I certainly would not have occupied your expensive attention.

On Wed, Aug 4, 2010 at 6:16 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Aug 4, 2010, at 5:49 PM, Ralf B wrote:
>
>> c) Can the p-value be used, when running tests over a large amount of
>> different data sets, as a metric for ranking similarity between x and
>> y data sets?
>
> There has been quite a bit of discussion on this list over the years about
> why the KS test is not good in this situation. If I read the results of a search
> on your name correctly, you are in a department of Information Sciences. I
> would have thought that the first reaction of someone in that field would be
> to do a search on a question. Why are you filling up the archives with
> questions that have been repeatedly asked and answered?
>
> Do you need help in this area?
>
> rhelpSearch <- function(string,
>                  restrict = c("Rhelp10", "Rhelp08", "Rhelp02", "functions"),
>                  matchesPerPage = 100, ...)
>         RSiteSearch(string = string, restrict = restrict,
>                     matchesPerPage = matchesPerPage, ...)
>
> rhelpSearch("KS.test ties p-value")
>
> David Winsemius, MD
> West Hartford, CT
The warning (with an error you would not see any results) means that there are ties in your data. The theory behind the KS test assumes continuous distributions, for which the probability of seeing ties is 0, so your data and the theory do not match; the p-value is therefore suspect (though an OK approximation for some uses).

These types of tests are useful for showing differences (often in a non-meaningful way), not similarities. You really need to decide what you mean by similar. Consider two population distributions: the first is the standard uniform, with density height equal to 1 between 0 and 1 (0 elsewhere); the second has height 1 from 0 to 0.99 and from 99.99 to 100 (0 elsewhere). Are these two populations similar? By some measures they are (the KS statistic for one); by other measures they are not (comparing mean and variance, for example). Whether they are similar or not really depends on what you want to do with them.

One additional "test" you might consider is the vis.test function in the TeachingDemos package. Write a function that will either draw a standard qqplot of your 2 datasets, or pool them together, split them randomly, and create the qqplot. Use this with vis.test; if you cannot pick out the real dataset, then it is less likely to matter if you interchange them. (This assumes 2 random samples from the respective populations; if there is something more going on, then you will need to come up with a different comparison that accounts for any structure.)

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Ralf B
> Sent: Wednesday, August 04, 2010 3:50 PM
> To: r-help at r-project.org
> Subject: [R] KS Test question (2)
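The two-uniform example in the reply above can be checked numerically. This is a minimal simulation sketch; the sample size and the names p1, p2, and upper are assumptions made for illustration. For the populations themselves, the largest gap between the two CDFs is 0.01, while the means are 0.5 and 1.49 and the variances are about 0.083 and about 98.1:

```r
set.seed(3)
n <- 1e5

# Population 1: standard uniform on (0, 1)
p1 <- runif(n)

# Population 2: density 1 on (0, 0.99) and on (99.99, 100),
# so roughly 1% of the draws land in the far piece
upper <- runif(n) < 0.01
p2 <- ifelse(upper, runif(n, 99.99, 100), runif(n, 0, 0.99))

# KS statistic: sample estimate of the largest gap between the two CDFs
D <- unname(ks.test(p1, p2)$statistic)

# Compare the KS statistic with the first two moments
c(D = D,
  mean1 = mean(p1), mean2 = mean(p2),
  var1  = var(p1),  var2  = var(p2))
```

The KS statistic barely registers the 1% of mass sitting near 100, whereas the mean and variance are dominated by it. Which summary matters depends on how the data will be used, which is the point of the example.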