Monnand
2015-Jan-14 07:17 UTC
[R] two-sample KS test: data becomes significantly different after normalization
I know this must be a wrong method, but I cannot help asking: can I use
only the p-value from the KS test, saying that if the p-value is greater
than \beta, then the two samples are from the same distribution? If the
definition of the p-value were the probability that the null hypothesis is
true, then why do so few people use the p-value as a "true" probability?
For example, people normally do not multiply or add p-values to get the
probability that two independent null hypotheses are both true, or that
one of them is true. I have had this question for a very long time.

-Monnand

On Tue Jan 13 2015 at 2:47:30 PM Andrews, Chris <chrisaa at med.umich.edu> wrote:

> This sounds more like quality control than hypothesis testing. Rather
> than statistical significance, you want to determine what is an
> acceptable difference (an 'equivalence margin', if you will). And that
> is a question about the application, not a statistical one.
> ________________________________________
> From: Monnand [monnand at gmail.com]
> Sent: Monday, January 12, 2015 10:14 PM
> To: Andrews, Chris
> Cc: r-help at r-project.org
> Subject: Re: [R] two-sample KS test: data becomes significantly different after normalization
>
> Thank you, Chris!
>
> I think it is exactly the problem you mentioned. I didn't consider
> 1000-point data to be a large sample at first.
>
> I down-sampled the data from 1000 points to 100 points and ran the KS
> test again. It worked as expected. Is there any typical method for
> comparing two large samples? I also tried the KL divergence, but it only
> gives me a number and does not tell me how large the distance must be
> before the samples should be considered significantly different.
>
> Regards,
> -Monnand
>
> On Mon, Jan 12, 2015 at 9:32 AM, Andrews, Chris <chrisaa at med.umich.edu> wrote:
> >
> > The main issue is that the original distributions are the same, you
> > shift the two samples *by different amounts* (about 0.01 SD), and you
> > have a large (n=1000) sample size. Thus the new distributions are not
> > the same.
> >
> > This is a problem with testing for equality of distributions. With
> > large samples, even a small deviation is significant.
> >
> > Chris
> >
> > -----Original Message-----
> > From: Monnand [mailto:monnand at gmail.com]
> > Sent: Sunday, January 11, 2015 10:13 PM
> > To: r-help at r-project.org
> > Subject: [R] two-sample KS test: data becomes significantly different after normalization
> >
> > Hi all,
> >
> > This question is sort of related to R (I'm not sure whether I used an
> > R function correctly), but it is also related to stats in general. I'm
> > sorry if this is considered off-topic.
> >
> > I'm currently working on a data set with two sets of samples. The csv
> > file of the data can be found here: http://pastebin.com/200v10py
> >
> > I would like to use the KS test to see whether these two sets of
> > samples are from different distributions.
> >
> > I ran the following R script:
> >
> >   # read data from the file
> >   > data = read.csv('data.csv')
> >   > ks.test(data[[1]], data[[2]])
> >
> >           Two-sample Kolmogorov-Smirnov test
> >
> >   data:  data[[1]] and data[[2]]
> >   D = 0.025, p-value = 0.9132
> >   alternative hypothesis: two-sided
> >
> > The KS test shows that these two samples are very similar. (In fact,
> > they should come from the same distribution.)
> >
> > However, for certain reasons, instead of the raw values, the actual
> > data that I will get is normalized (zero mean, unit variance). So I
> > tried to normalize the raw data I have and run the KS test again:
> >
> >   > ks.test(scale(data[[1]]), scale(data[[2]]))
> >
> >           Two-sample Kolmogorov-Smirnov test
> >
> >   data:  scale(data[[1]]) and scale(data[[2]])
> >   D = 0.3273, p-value < 2.2e-16
> >   alternative hypothesis: two-sided
> >
> > The p-value becomes almost zero after normalization, indicating that
> > these two samples are significantly different (from different
> > distributions).
> >
> > My question is: how could normalization make two similar samples
> > become different from each other? I can see that if two samples are
> > different, normalization could make them similar. However, if two sets
> > of data are similar, then intuitively, applying the same operation to
> > both should leave them similar, or at least not too different from
> > each other.
> >
> > I did some further analysis of the data. I also tried to normalize the
> > data into the [0,1] range (using the formula
> > (x-min(x))/(max(x)-min(x))), but the same thing happened. At first, I
> > thought outliers might have caused this problem (I can see that an
> > outlier could cause it if I normalize the data into the [0,1] range),
> > so I deleted all data whose absolute value is larger than 4 standard
> > deviations. But it still didn't help.
> >
> > Plus, I even plotted the eCDFs; they *really* look the same to me,
> > even after normalization. Is anything wrong with my usage of the R
> > function?
> >
> > Since the data contains ties, I also tried ks.boot
> > (http://sekhon.berkeley.edu/matching/ks.boot.html), but I got the same
> > result.
> >
> > Could anyone help me explain why this happened? Also, any suggestions
> > about hypothesis testing on normalized data? (The data I have right
> > now is simulated. In the real world, I cannot get the raw data, only
> > the normalized version.)
> >
> > Regards,
> > -Monnand
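Chris's point about sample size, quoted above, is easy to reproduce. In
the sketch below (simulated data, not the poster's), a quarter-SD location
shift is flagged as significant at n = 1000 but usually passes at n = 100:

    set.seed(1)
    x <- rnorm(1000)
    y <- rnorm(1000, mean = 0.25)        # quarter-SD location shift
    ks.test(x, y)$p.value                # typically well below 0.05
    ks.test(x[1:100], y[1:100])$p.value  # typically above 0.05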
Martin Maechler
2015-Jan-14 10:27 UTC
[R] two-sample KS test: data becomes significantly different after normalization
>>>>> Monnand <monnand at gmail.com>
>>>>>     on Wed, 14 Jan 2015 07:17:02 +0000 writes:

  > I know this must be a wrong method, but I cannot help asking: can I
  > use only the p-value from the KS test, saying that if the p-value is
  > greater than \beta, then the two samples are from the same
  > distribution? If the definition of the p-value were the probability
  > that the null hypothesis is true,

Ouch, ouch, ouch, ouch !!!!!!!!

The worst misuse/misunderstanding of statistics, now even on R-help ...

---> please get help from a statistician !!
-->  and erase that sentence from your mind (unless you are a pro and
     want to keep it for anecdotal or didactic purposes ...)
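Martin's objection can be seen in a small simulation: when the null
hypothesis is true, the p-value is (roughly) uniform on [0, 1], so a large
p-value cannot be read as "the probability that H0 is true". A minimal
sketch with simulated data:

    set.seed(42)
    pvals <- replicate(2000, ks.test(rnorm(50), rnorm(50))$p.value)
    hist(pvals)        # roughly flat over [0, 1], not piled up near 1
    mean(pvals > 0.9)  # about 0.1, although H0 is true in every replicate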
Andrews, Chris
2015-Jan-14 12:31 UTC
[R] two-sample KS test: data becomes significantly different after normalization
Your definition of p-value is not correct. See, for example,
http://en.wikipedia.org/wiki/P-value#Misunderstandings
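Chris's earlier "equivalence margin" suggestion could be sketched along
these lines. This is one illustrative reading of that idea, not a standard
named procedure; the function name ks_margin, the 0.1 margin, and the
bootstrap upper bound are all arbitrary choices:

    # Ask whether the KS distance D is credibly below an acceptable
    # margin, rather than testing whether it is exactly zero.
    ks_margin <- function(x, y, margin = 0.1, B = 1000) {
      d_obs <- unname(ks.test(x, y)$statistic)
      # bootstrap the sampling variability of D; suppressWarnings()
      # because resampling with replacement creates ties, which
      # ks.test() warns about
      d_boot <- replicate(B, suppressWarnings(
        unname(ks.test(sample(x, replace = TRUE),
                       sample(y, replace = TRUE))$statistic)))
      list(D = d_obs,
           upper95 = unname(quantile(d_boot, 0.95)),
           within_margin = unname(quantile(d_boot, 0.95)) <= margin)
    }
    ks_margin(rnorm(1000), rnorm(1000))  # D near 0; typically within margin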
Monnand
2015-Jan-16 01:07 UTC
[R] two-sample KS test: data becomes significantly different after normalization
Thank you, Chris and Martin!

On Wed Jan 14 2015 at 7:31:12 AM Andrews, Chris <chrisaa at med.umich.edu> wrote:

> Your definition of p-value is not correct. See, for example,
> http://en.wikipedia.org/wiki/P-value#Misunderstandings
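As for the original puzzle: the poster notes that the data contains ties.
With heavily tied (discrete) data, scale() moves the two samples' atoms by
slightly different amounts (the two sample means and SDs differ a little),
so the eCDF steps no longer line up and D can jump from near 0 to roughly
the mass of a whole atom. The sketch below reproduces that mechanism on
simulated data; whether this is exactly what happened in the poster's data
is an assumption:

    set.seed(7)
    x <- sample(0:3, 1000, replace = TRUE)  # same discrete distribution
    y <- sample(0:3, 1000, replace = TRUE)
    # ks.test() warns about ties in both calls, hence suppressWarnings()
    suppressWarnings(ks.test(x, y))$statistic                # small D: atoms coincide
    suppressWarnings(ks.test(scale(x), scale(y)))$statistic  # D ~ 0.25: atoms misaligned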