Nitin Agrawal
2008-Aug-21 20:56 UTC
[R] Null and Alternate hypothesis for Significance test
Hi, I have a question about specifying the null hypothesis in a significance test. Apologies in advance if this has been asked before or is a naive question. I have two samples, A and B, and I want to test whether A and B come from the same distribution. The default null hypothesis would be H0: A = B. But since I am trying to prove that A and B do come from the same distribution, I think this is not the right choice for the null hypothesis (it should be one that is set up to be rejected). How do I specify a null hypothesis H0: A not equal to B for, say, a KS test? An example of how to do this in R would be greatly appreciated. On a related note: what is a good way to measure the difference between observed and expected PDFs? Is the D statistic of the KS test a good choice? Thanks! Nitin
Moshe Olshansky
2008-Aug-22 00:40 UTC
[R] Null and Alternate hypothesis for Significance test
Hi Nitin, I believe that you cannot take as your null hypothesis that A and B come from different distributions. Asymptotically (as both sample sizes go to infinity) the KS test has power 1, i.e. it will reject H0: A = B in any case where A and B have different distributions. To work with a finite sample, your null hypothesis must be more specific than that A and B merely have different distributions: for example, that their means differ by at least some amount, or that a certain distance between their distributions is bigger than some threshold. Such hypotheses can be tested (and rejected).
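One way to make the "means differ by at least something" null concrete is the two one-sided tests (TOST) procedure. The following is only a minimal sketch under stated assumptions: the samples `a` and `b` are simulated, and the margin `delta` is a hypothetical value that would have to be chosen from the scientific question before looking at the data. It relies on the `mu` and `alternative` arguments of `t.test`, where `mu` is the hypothesized difference in means for a two-sample test.

```r
# Sketch of a two one-sided tests (TOST) equivalence check for means.
# The samples and the margin delta are made-up assumptions for illustration.
set.seed(1)
a <- rnorm(200, mean = 0, sd = 1)
b <- rnorm(200, mean = 0.05, sd = 1)
delta <- 0.5  # hypothetical "close enough" margin, fixed in advance

# H0: |mean(a) - mean(b)| >= delta, split into two one-sided nulls.
# First null: the difference in means is <= -delta.
p_lower <- t.test(a, b, mu = -delta, alternative = "greater")$p.value
# Second null: the difference in means is >= delta.
p_upper <- t.test(a, b, mu = delta, alternative = "less")$p.value

# Equivalence is claimed only if both one-sided nulls are rejected,
# so the overall p-value is the larger of the two.
p_tost <- max(p_lower, p_upper)
p_tost
```

A small `p_tost` would support the claim that the means differ by less than `delta`. It says nothing about other features of the two distributions, such as spread or tails.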
The problem with trying to prove that 2 distributions are equal is that there is exactly one way in which they can be equal and an infinite number of ways in which they can be different (and it's rather a large infinity). The traditional use of equality as the null hypothesis works because we can assume that it is true and then compute the probability of seeing data as extreme as, or more extreme than, that observed. If we instead assume only that the 2 distributions are different and try to compute the probability of the similarity between the datasets, then we can always assume differences that are small enough, but still non-zero, to easily produce the similarity seen in the data. It is not hard to come up with 2 distributions that are clearly different but that would yield identical datasets.

Given this, any test to show exact equality would need to have a p-value identically set at 1 (the observed data are completely plausible under the null hypothesis). Or we could get a significantly higher-powered test of the correct size by generating a p-value from a uniform(0,1) distribution. Neither option has much scientific merit.

Stepping away from proving equality, the more common approach is to prove equivalence, where the alternative hypothesis is that the 2 distributions are close enough even if not equal. Determining "close enough" is subjective and depends on the scientific question. Close enough is often determined by visual inspection of graphs rather than by a hypothesis test. If you insist on a hypothesis test, then you need to decide in advance what is meant by "close enough": both which distance measure to use and how big that distance has to be before the distributions are no longer equivalent. You asked about the KS distance measure; that is one option for choosing a distance, and there are others. Which works best depends on you and the scientific question.

Take for example 2 distributions F and G.
F is uniform(0,1), meaning that the density function is 1 between 0 and 1 and 0 elsewhere. The density of G is equal to 1 between 0 and 0.99 and also equal to 1 between 99.99 and 100, and zero elsewhere. Are these 2 distributions equivalent? The 2 densities have 99% overlap and the KS distance is small (0.01 if I remember correctly), but the means and variances are quite different. When generating random values from the 2 distributions we will see very similar numbers, with the exception that G will generate outliers near 100 about 1% of the time. Some people would consider these 2 distributions equivalent (they are pretty much the same if we discard the outliers), while others would consider the potential outliers (very extreme) to make them non-equivalent. You need to decide that based on the science from which your data come.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
(801) 408-8111
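The F and G example can be checked numerically. The sketch below is an illustration only: the sample size, the seed, and the helper names `cdf_f` and `cdf_g` are assumptions introduced here. It evaluates the population KS distance from the two CDFs on a grid, then draws samples and compares the sample D statistic, means, and variances.

```r
# Sketch of the F versus G example; sample size and seed are arbitrary.
set.seed(42)
n <- 10000

# F: uniform(0,1).
x <- runif(n)

# G: uniform on [0, 0.99] with probability 0.99,
#    uniform on [99.99, 100] with probability 0.01.
outlier <- runif(n) < 0.01
y <- ifelse(outlier, runif(n, 99.99, 100), runif(n, 0, 0.99))

# Exact CDFs (hypothetical helper names), used to evaluate the
# population KS distance sup |F - G| on a fine grid.
cdf_f <- function(q) punif(q, 0, 1)
cdf_g <- function(q) 0.99 * punif(q, 0, 0.99) + 0.01 * punif(q, 99.99, 100)
grid <- seq(0, 100, by = 0.01)
d_exact <- max(abs(cdf_f(grid) - cdf_g(grid)))

# Sample KS distance; with finite samples this is a noisy estimate and
# will generally not match the population value exactly.
d_sample <- unname(ks.test(x, y)$statistic)

c(d_exact = d_exact, d_sample = d_sample)
c(mean_f = mean(x), mean_g = mean(y), var_f = var(x), var_g = var(y))
```

The grid calculation gives a population KS distance of 0.01, while the mean and variance of G are dominated by the roughly 1% of draws near 100. Whether that counts as "close enough" remains a judgment tied to the scientific question.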