Monnand
2015-Jan-14 07:17 UTC
[R] two-sample KS test: data becomes significantly different after normalization
I know this must be a wrong method, but I cannot help asking: can I use
only the p-value from the KS test, saying that if the p-value is greater
than \beta, then the two samples are from the same distribution? If the
definition of the p-value were the probability that the null hypothesis is
true, then why do so few people use the p-value as a "true" probability?
For example, people normally do not multiply or add p-values to get the
probability that two independent null hypotheses are both true, or that
one of them is true. I have had this question for a very long time.

-Monnand

On Tue Jan 13 2015 at 2:47:30 PM Andrews, Chris <chrisaa at med.umich.edu> wrote:

> This sounds more like quality control than hypothesis testing. Rather
> than statistical significance, you want to determine what is an
> acceptable difference (an 'equivalence margin', if you will). And that
> is a question about the application, not a statistical one.
> ________________________________________
> From: Monnand [monnand at gmail.com]
> Sent: Monday, January 12, 2015 10:14 PM
> To: Andrews, Chris
> Cc: r-help at r-project.org
> Subject: Re: [R] two-sample KS test: data becomes significantly different after normalization
>
> Thank you, Chris!
>
> I think it is exactly the problem you mentioned. I didn't consider
> 1000-point data to be a large sample at first.
>
> I down-sampled the data from 1000 points to 100 points and ran the KS
> test again. It worked as expected. Is there any typical method for
> comparing two large samples? I also tried the KL divergence, but it only
> gives me a number and does not tell me how large the distance must be
> before the samples should be considered significantly different.
>
> Regards,
> -Monnand
>
> On Mon, Jan 12, 2015 at 9:32 AM, Andrews, Chris <chrisaa at med.umich.edu> wrote:
> >
> > The main issue is that the original distributions are the same, you
> > shift the two samples *by different amounts* (about 0.01 SD), and you
> > have a large (n=1000) sample size. Thus the new distributions are not
> > the same.
> >
> > This is a problem with testing for equality of distributions. With
> > large samples, even a small deviation is significant.
> >
> > Chris
> >
> > -----Original Message-----
> > From: Monnand [mailto:monnand at gmail.com]
> > Sent: Sunday, January 11, 2015 10:13 PM
> > To: r-help at r-project.org
> > Subject: [R] two-sample KS test: data becomes significantly different after normalization
> >
> > Hi all,
> >
> > This question is sort of related to R (I'm not sure whether I used an
> > R function correctly), but it is also related to stats in general. I'm
> > sorry if this is considered off-topic.
> >
> > I'm currently working on a data set with two sets of samples. The csv
> > file of the data can be found here: http://pastebin.com/200v10py
> >
> > I would like to use the KS test to see whether these two sets of
> > samples are from different distributions.
> >
> > I ran the following R script:
> >
> >   # read data from the file
> >   > data = read.csv('data.csv')
> >   > ks.test(data[[1]], data[[2]])
> >
> >           Two-sample Kolmogorov-Smirnov test
> >
> >   data:  data[[1]] and data[[2]]
> >   D = 0.025, p-value = 0.9132
> >   alternative hypothesis: two-sided
> >
> > The KS test shows that these two samples are very similar. (In fact,
> > they should come from the same distribution.)
> >
> > However, for certain reasons, instead of the raw values, the actual
> > data that I will get is normalized (zero mean, unit variance). So I
> > tried to normalize the raw data I have and run the KS test again:
> >
> >   > ks.test(scale(data[[1]]), scale(data[[2]]))
> >
> >           Two-sample Kolmogorov-Smirnov test
> >
> >   data:  scale(data[[1]]) and scale(data[[2]])
> >   D = 0.3273, p-value < 2.2e-16
> >   alternative hypothesis: two-sided
> >
> > The p-value becomes almost zero after normalization, indicating that
> > these two samples are significantly different (from different
> > distributions).
> >
> > My question is: how could normalization make two similar samples
> > become different from each other? I can see that if two samples are
> > different, normalization could make them similar. However, if two sets
> > of data are similar, then intuitively, applying the same operation to
> > both should leave them similar, or at least not too different from
> > each other.
> >
> > I did some further analysis of the data. I also tried to normalize the
> > data into the [0,1] range (using the formula
> > (x-min(x))/(max(x)-min(x))), but the same thing happened. At first, I
> > thought outliers might have caused this problem (I can see that an
> > outlier could cause it if I normalize the data into the [0,1] range),
> > so I deleted all data whose absolute value is larger than 4 standard
> > deviations. But it still didn't help.
> >
> > Plus, I even plotted the eCDFs; they *really* look the same to me,
> > even after normalization. Is anything wrong with my usage of the R
> > function?
> >
> > Since the data contains ties, I also tried ks.boot
> > (http://sekhon.berkeley.edu/matching/ks.boot.html), but I got the same
> > result.
> >
> > Could anyone help me explain why this happened? Also, any suggestions
> > about hypothesis testing on normalized data? (The data I have right
> > now is simulated. In the real world, I cannot get the raw data, only
> > the normalized version.)
> >
> > Regards,
> > -Monnand
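Chris's point about sample size, quoted above, is easy to reproduce. In
the sketch below (simulated data, not the poster's), a quarter-SD location
shift is flagged as significant at n = 1000 but usually passes at n = 100:

    set.seed(1)
    x <- rnorm(1000)
    y <- rnorm(1000, mean = 0.25)        # quarter-SD location shift
    ks.test(x, y)$p.value                # typically well below 0.05
    ks.test(x[1:100], y[1:100])$p.value  # typically above 0.05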
Martin Maechler
2015-Jan-14 10:27 UTC
[R] two-sample KS test: data becomes significantly different after normalization
>>>>> Monnand <monnand at gmail.com>
>>>>>     on Wed, 14 Jan 2015 07:17:02 +0000 writes:

  > I know this must be a wrong method, but I cannot help asking: can I
  > use only the p-value from the KS test, saying that if the p-value is
  > greater than \beta, then the two samples are from the same
  > distribution? If the definition of the p-value were the probability
  > that the null hypothesis is true,

Ouch, ouch, ouch, ouch !!!!!!!!

The worst misuse/misunderstanding of statistics, now even on R-help ...

---> please get help from a statistician !!
-->  and erase that sentence from your mind (unless you are a pro and
     want to keep it for anecdotal or didactic purposes ...)
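Martin's objection can be seen in a small simulation: when the null
hypothesis is true, the p-value is (roughly) uniform on [0, 1], so a large
p-value cannot be read as "the probability that H0 is true". A minimal
sketch with simulated data:

    set.seed(42)
    pvals <- replicate(2000, ks.test(rnorm(50), rnorm(50))$p.value)
    hist(pvals)        # roughly flat over [0, 1], not piled up near 1
    mean(pvals > 0.9)  # about 0.1, although H0 is true in every replicate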
Andrews, Chris
2015-Jan-14 12:31 UTC
[R] two-sample KS test: data becomes significantly different after normalization
Your definition of p-value is not correct. See, for example,
http://en.wikipedia.org/wiki/P-value#Misunderstandings
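Chris's earlier "equivalence margin" suggestion could be sketched along
these lines. This is one illustrative reading of that idea, not a standard
named procedure; the function name ks_margin, the 0.1 margin, and the
bootstrap upper bound are all arbitrary choices:

    # Ask whether the KS distance D is credibly below an acceptable
    # margin, rather than testing whether it is exactly zero.
    ks_margin <- function(x, y, margin = 0.1, B = 1000) {
      d_obs <- unname(ks.test(x, y)$statistic)
      # bootstrap the sampling variability of D; suppressWarnings()
      # because resampling with replacement creates ties, which
      # ks.test() warns about
      d_boot <- replicate(B, suppressWarnings(
        unname(ks.test(sample(x, replace = TRUE),
                       sample(y, replace = TRUE))$statistic)))
      list(D = d_obs,
           upper95 = unname(quantile(d_boot, 0.95)),
           within_margin = unname(quantile(d_boot, 0.95)) <= margin)
    }
    ks_margin(rnorm(1000), rnorm(1000))  # D near 0; typically within margin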
Monnand
2015-Jan-16 01:07 UTC
[R] two-sample KS test: data becomes significantly different after normalization
Thank you, Chris and Martin!

On Wed Jan 14 2015 at 7:31:12 AM Andrews, Chris <chrisaa at med.umich.edu> wrote:

> Your definition of p-value is not correct. See, for example,
> http://en.wikipedia.org/wiki/P-value#Misunderstandings
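As for the original puzzle: the poster notes that the data contains ties.
With heavily tied (discrete) data, scale() moves the two samples' atoms by
slightly different amounts (the two sample means and SDs differ a little),
so the eCDF steps no longer line up and D can jump from near 0 to roughly
the mass of a whole atom. The sketch below reproduces that mechanism on
simulated data; whether this is exactly what happened in the poster's data
is an assumption:

    set.seed(7)
    x <- sample(0:3, 1000, replace = TRUE)  # same discrete distribution
    y <- sample(0:3, 1000, replace = TRUE)
    # ks.test() warns about ties in both calls, hence suppressWarnings()
    suppressWarnings(ks.test(x, y))$statistic                # small D: atoms coincide
    suppressWarnings(ks.test(scale(x), scale(y)))$statistic  # D ~ 0.25: atoms misaligned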