I have a fairly simple problem--I have about 80,000 values (call them y) that I am using as an empirical distribution and I want to find the p-value (never mind the multiple testing issues here, for the time being) of 130,000 points (call them x) from the empirical distribution. I typically do that (for one-sided test) something like loop over i in x p.val[i] = sum(y>x[i])/length(y) and repeat for all i. However, length(x) is large here as is length(y), so this process takes quite a long time. Any suggestions? Thanks, Sean
On Thursday 03 March 2005 16:22, Sean Davis wrote:> I have a fairly simple problem--I have about 80,000 values (call them > y) that I am using as an empirical distribution and I want to find > the p-value (never mind the multiple testing issues here, for the > time being) of 130,000 points (call them x) from the empirical > distribution. I typically do that (for one-sided test) something like > > loop over i in x > p.val[i] = sum(y>x[i])/length(y) > > and repeat for all i. However, length(x) is large here as is > length(y), so this process takes quite a long time. Any suggestions?The obvious thing to do would be p.val = 1 - ecdf(x)(y) wouldn't it? On a 1.1 GHz Athlon, I get> x <- rnorm(130000) > y <- rnorm(80000) > system.time(p.val <- 1 - ecdf(y)(x))[1] 1.03 0.03 1.06 0.00 0.00 -Deepayan
When you say the 130,000 points are from the empirical distribution, how did you get them? Is each one really one of the values of y? If you sorted y first, would you know which one (ie which index) each x is? (Sorting 80,000 elements took essentially no time at all on my sub-gigahertz Pentium III.) But maybe that's not an option... more details would help. Reid Huntsinger -----Original Message----- From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Sean Davis Sent: Thursday, March 03, 2005 5:22 PM To: r-help Subject: [R] Rank-based p-value on large dataset I have a fairly simple problem--I have about 80,000 values (call them y) that I am using as an empirical distribution and I want to find the p-value (never mind the multiple testing issues here, for the time being) of 130,000 points (call them x) from the empirical distribution. I typically do that (for one-sided test) something like loop over i in x p.val[i] = sum(y>x[i])/length(y) and repeat for all i. However, length(x) is large here as is length(y), so this process takes quite a long time. Any suggestions? Thanks, Sean ______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
One solution is to cut() 'x' according to the breaks defined by 'y'. Using cut with labels=FALSE is really fast. See a simulation below. However the accuracy depends on the number of ties you have in your "empirical" distribution. I have tried to simulate with the round() function below. # simulate data y <- round(rnorm(80000), 5) # empirical distribution with some ties x <- round(rnorm(130000), 5) # observed values with some ties # current way system.time({ p1 <- numeric( length(x) ) for(i in 1:length(x)){ p1[i] <- sum( x[i] < y )/length(y) } }) [1] 484.67 25.08 512.04 0.00 0.00 # suggested solution system.time({ break.points <- c(-Inf, unique(sort(y)), Inf) p2 <- cut( x, breaks=break.points, labels=FALSE ) p2 <- 1 - p2/length(break.points) }) [1] 0.27 0.01 0.28 0.00 0.00 head( cbind( p1, p2 ) ) p1 p2 [1,] 0.658375 0.65813482 [2,] 0.144000 0.14533705 [3,] 0.815500 0.81436898 [4,] 0.412500 0.41269640 [5,] 0.553625 0.55334516 [6,] 0.044500 0.04510897 ... cor(p1, p2) [1] 0.9999987 The difference arises mainly because I had to use a unique breakpoints in cut(). You may want to investigate further if you are interested. Please let me know if you find anything good or bad about this suggestion as I am using it as part of my codes too. Thank you. Regards, Adai On Thu, 2005-03-03 at 17:22 -0500, Sean Davis wrote:> I have a fairly simple problem--I have about 80,000 values (call them > y) that I am using as an empirical distribution and I want to find the > p-value (never mind the multiple testing issues here, for the time > being) of 130,000 points (call them x) from the empirical distribution. > I typically do that (for one-sided test) something like > > loop over i in x > p.val[i] = sum(y>x[i])/length(y) > > and repeat for all i. However, length(x) is large here as is > length(y), so this process takes quite a long time. Any suggestions? > > Thanks, > Sean > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >