Allan Engelhardt
2009-Jun-04 08:18 UTC
[R] Fast way of finding top-n values of a long vector
If x is a (long) vector and n << length(x), what is a fast way of finding the top-n values of x? Some suggestions (calculating the ratio of the two top values): library("rbenchmark") set.seed(1); x <- runif(1e6, max=1e7); x[1] <- NA; benchmark( replications=20, columns=c("test","elapsed"), order="elapsed" , sort = {a<-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];} , max = {m<-max(x, na.rm=TRUE); w<-which(x==m)[1]; m/max(x[-w], na.rm=TRUE);} , max2 = {w<-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);} ) # test elapsed # 3 max2 0.772 # 2 max 1.732 # 1 sort 4.958 I want to apply this code to a few tens of thousands of vectors so speed does matter. In C or similar I would of course calculate the result with a single pass through x, and not with three passes as in 'max2'. Allan. PS: I know na.last=NA is the default for sort, but there is no harm in being explicit in how you want NA's handled.
Barry Rowlingson
2009-Jun-04 08:36 UTC
[R] Fast way of finding top-n values of a long vector
On Thu, Jun 4, 2009 at 9:18 AM, Allan Engelhardt <allane at cybaea.com> wrote:> I want to apply this code to a few tens of thousands of vectors so speed > does matter. ?In C or similar I would of course calculate the result with a > single pass through x, and not with three passes as in 'max2'. >So why not code it in C? See the R guide on writing extensions for how to link C code with R it's pretty easy especially if you know the size of the objects being passed around, then there's no messy memory management to deal with. Wikipedia has some algorithms, with an inevitable dose of Knuth: http://en.wikipedia.org/wiki/Selection_algorithm and apparently something like that is in the standard C++ library: "C++ also provides the partial_sort algorithm, which solves the problem of selecting the smallest k elements (sorted), with a time complexity of O(nlog k). No algorithm is provided for selecting the greatest k elements, but this can easily be achieved by inverting the ordering predicate." Barry
Try adding a version that uses sort with the partial argument, that should be faster than regular sort (for long enough test vectors) and possibly faster than the max solutions when finding more than just the largest 2. Also, for your max solutions, what happens when the 2 largest values are identical? -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.snow at imail.org 801.408.8111> -----Original Message----- > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > project.org] On Behalf Of Allan Engelhardt > Sent: Thursday, June 04, 2009 2:18 AM > To: r-help at r-project.org > Subject: [R] Fast way of finding top-n values of a long vector > > If x is a (long) vector and n << length(x), what is a fast way of > finding the top-n values of x? > > Some suggestions (calculating the ratio of the two top values): > > > library("rbenchmark") > set.seed(1); x <- runif(1e6, max=1e7); x[1] <- NA; > benchmark( > replications=20, > columns=c("test","elapsed"), > order="elapsed" > , sort = {a<-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];} > , max = {m<-max(x, na.rm=TRUE); w<-which(x==m)[1]; m/max(x[-w], > na.rm=TRUE);} > , max2 = {w<-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);} > ) > # test elapsed > # 3 max2 0.772 > # 2 max 1.732 > # 1 sort 4.958 > > > I want to apply this code to a few tens of thousands of vectors so > speed > does matter. In C or similar I would of course calculate the result > with a single pass through x, and not with three passes as in 'max2'. > > > Allan. > > PS: I know na.last=NA is the default for sort, but there is no harm in > being explicit in how you want NA's handled. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting- > guide.html > and provide commented, minimal, self-contained, reproducible code.
Allan Engelhardt
2009-Jun-04 18:51 UTC
[R] Fast way of finding top-n values of a long vector
Greg Snow wrote:> Try adding a version that uses sort with the partial argument, that should be faster than regular sort (for long enough test vectors) and possibly faster than the max solutions when finding more than just the largest 2.I find the documentation for the partial argument in sort very difficult to understand, but it seems to be something like this: library("rbenchmark") set.seed(1); x <- runif(1e6, max=1e7); x[1] <- NA; benchmark( replications=20, columns=c("test","elapsed"), order="elapsed" , sort = {a<-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];} , qsrt = {a<-sort(x, decreasing=TRUE, na.last=NA, method="quick")[1:2]; a[1]/a[2];} , part = {a<-sort.int(-x, partial=1:2, na.last=NA)[1:2]; a[1]/a[2];} , max1 = {m<-max(x, na.rm=TRUE); w<-which(x==m)[1]; m/max(x[-w],na.rm=TRUE);} , max2 = {w<-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);} ) # test elapsed # 5 max2 0.846 # 4 max1 1.957 # 3 part 2.752 # 2 qsrt 4.561 # 1 sort 5.577 Completely agree on your point about partial sort being faster for larger values of n and certainly giving more scalable code which was also what I was looking for so thanks for that tip!> Also, for your max solutions, what happens when the 2 largest values are identical?It returns those two values, just like the sort solution: x <- c(999,NA,1:10,NA,999) m<-max(x, na.rm=TRUE); w<-which(x==m)[1]; c(m, max(x[-w],na.rm=TRUE)) # [1] 999 999 sort(x, decreasing=TRUE, na.last=NA)[1:2] # [1] 999 999 Allan.