Hello,
I am trying to perform a Kolmogorov–Smirnov test to assess the difference
between a reference distribution and samples of different sizes drawn with
probability proportional to size. I managed to compute the Kolmogorov–Smirnov
distance, but I am lost with the p-value. I have looked into the ks.test
function without success. Can anyone help me with computing p-values for a
two-tailed test?
Below is a simplified version of my code.
Thanks in advance.
Gianluca
library(spatstat)
#reference distribution
d_1 <- sort(rpois(1000, 500))
p_1 <- d_1/sum(d_1)
m_1 <- data.frame(d_1, p_1)
#data frame to store the values of the simulation
d_stat <- data.frame(1:1000, NA, NA)
names(d_stat) <- c("sample_size", "ks_distance", "p_value")
#simulation
for (i in 1:1000) {
#sample from the reference distribution
m_2 <- m_1[sample(nrow(m_1), size=i, prob=p_1, replace=F), ]
m_2 <- m_2[order(m_2$d_1), ]
d_2 <- m_2$d_1
p_2 <- m_2$p_1
#weighted ecdf for the reference distribution and the sample
f_d_1 <- ewcdf(d_1, normalise=F)
f_d_2 <- ewcdf(d_2, 1/p_2, normalise=F, adjust=1/length(d_2))
#kolmogorov-smirnov distance
d_stat[i,2] <- max(abs(f_d_1(d_2) - f_d_2(d_2)))
}
Hello,
I don't have the algorithms at hand but the KS statistic calculation is
more complicated than your max/abs difference.
Anyway, why not use ks.test? It's not that difficult:
set.seed(1234)
#reference distribution
d_1 <- sort(rpois(1000, 500))
p_1 <- d_1/sum(d_1)
m_1 <- data.frame(d_1, p_1)
#data frame to store the values of the simulation
d_stat <- data.frame(1:1000, NA, NA)
names(d_stat) <- c("sample_size", "ks_distance", "p_value")
#simulation
for (i in 1:1000) {
#sample from the reference distribution
m_2 <- m_1[sample(nrow(m_1), size=i, prob=p_1, replace=F), ]
d_2 <- m_2$d_1
ht <- ks.test(d_1, d_2)
#kolmogorov-smirnov distance
d_stat[i, 2] <- ht$statistic
d_stat[i, 3] <- ht$p.value
}
hist(d_stat[, 2])
hist(d_stat[, 3])
Note that d_2 is not sorted here, but the results are equal in the sense of
function identical(), meaning they are *exactly* the same as with a sorted
sample. Why shouldn't they be?
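For instance, a quick check after the loop (not needed for the simulation
itself, just using the d_2 left over from the last iteration):

ht_u <- ks.test(d_1, d_2)        #unsorted sample
ht_s <- ks.test(d_1, sort(d_2))  #sorted sample
identical(ht_u$statistic, ht_s$statistic)  #TRUE
identical(ht_u$p.value, ht_s$p.value)      #TRUE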
Hope this helps,
Rui Barradas
Thanks for your reply, Rui. I don't think I can use ks.test directly, because
I have a weighted sample (see m_2 <- m_1[sample(nrow(m_1), size=i, prob=p_1,
replace=F), ]) and I want to account for that. That's why I am trying to
compute everything manually. Also, if you look at the results of ks.test in
your simulation, you will notice that the p-value always implies that the
sample is drawn from the same distribution, even with a sample size of 1.
This looks suspicious to me. What are your thoughts?
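The only approach I can think of so far is a Monte Carlo one. Here is a rough
sketch of what I have in mind (just an idea on my part, reusing the same
ewcdf-based distance as above): re-draw weighted samples of the same size from
the reference distribution, recompute the weighted distance each time, and take
the proportion of simulated distances at least as large as the observed one as
an approximate two-sided p-value.

#sketch: Monte Carlo p-value for the weighted KS distance
ks_w <- function(m_ref, n) {
#one weighted sample of size n and its weighted KS distance to the reference
s <- m_ref[sample(nrow(m_ref), size=n, prob=m_ref$p_1, replace=F), ]
f_ref <- ewcdf(m_ref$d_1, normalise=F)
f_s <- ewcdf(s$d_1, 1/s$p_1, normalise=F, adjust=1/n)
max(abs(f_ref(s$d_1) - f_s(s$d_1)))
}
n <- 100                                #example sample size
d_obs <- ks_w(m_1, n)                   #"observed" weighted distance
d_null <- replicate(999, ks_w(m_1, n))  #simulated distances under the null
p_mc <- (sum(d_null >= d_obs) + 1)/(999 + 1)  #approximate two-sided p-value

Does this look reasonable to you?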
Hello,
I'm sorry, but apparently I missed the point of your problem.
Please do not take my previous answer seriously.
But you can use ks.test, just in a different way than what I wrote
previously.
Corrected code:
#simulation
for (i in 1:1000) {
#sample from the reference distribution
m_2 <- m_1[sample(nrow(m_1), size=i, prob=p_1, replace=F), ]
m_2 <- m_2[order(m_2$d_1), ]
d_2 <- m_2$d_1
p_2 <- m_2$p_1
#weighted ecdf for the reference distribution and the sample
f_d_1 <- ewcdf(d_1, normalise=F)
f_d_2 <- ewcdf(d_2, 1/p_2, normalise=F, adjust=1/length(d_2))
#kolmogorov-smirnov distance
x <- f_d_1(d_2)
y <- f_d_2(d_2)
ht <- ks.test(x, y)
d_stat[i, 2] <- ht$statistic
d_stat[i, 3] <- ht$p.value
}
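As before, the distributions of the statistic and of the p-value can then be
inspected with

hist(d_stat[, 2])
hist(d_stat[, 3])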
Hope this helps,
Rui Barradas