thr3ads.net - R help - [R] Fast way of finding top-n values of a long vector [Jun 2009]

If this information is useful, please help other people find it:
Share via:

Allan Engelhardt

2009-Jun-04 08:18 UTC

[R] Fast way of finding top-n values of a long vector

If x is a (long) vector and n << length(x), what is a fast way of 
finding the top-n values of x?

Some suggestions (calculating the ratio of the two top values):


library("rbenchmark")
set.seed(1); x <- runif(1e6, max=1e7); x[1] <- NA;
benchmark(
 replications=20,
 columns=c("test","elapsed"),
 order="elapsed"
 , sort = {a<-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];}
 , max  = {m<-max(x, na.rm=TRUE); w<-which(x==m)[1]; m/max(x[-w], 
na.rm=TRUE);}
 , max2 = {w<-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);}
)
#   test elapsed
# 3 max2   0.772
# 2  max   1.732
# 1 sort   4.958


I want to apply this code to a few tens of thousands of vectors so speed 
does matter.  In C or similar I would of course calculate the result 
with a single pass through x, and not with three passes as in 'max2'.


Allan.

PS: I know na.last=NA is the default for sort, but there is no harm in 
being explicit in how you want NA's handled.

Barry Rowlingson

2009-Jun-04 08:36 UTC

head link

[R] Fast way of finding top-n values of a long vector

On Thu, Jun 4, 2009 at 9:18 AM, Allan Engelhardt <allane at cybaea.com>
wrote:
> I want to apply this code to a few tens of thousands of vectors so speed
> does matter. ?In C or similar I would of course calculate the result with a
> single pass through x, and not with three passes as in 'max2'.
>
 So why not code it in C? See the R guide on writing extensions for
how to link C code with R it's pretty easy especially if you know the
size of the objects being passed around, then there's no messy memory
management to deal with.

Wikipedia has some algorithms, with an inevitable dose of Knuth:

http://en.wikipedia.org/wiki/Selection_algorithm

and apparently something like that is in the standard C++ library:

"C++ also provides the partial_sort algorithm, which solves the
problem of selecting the smallest k elements (sorted), with a time
complexity of O(nlog k). No algorithm is provided for selecting the
greatest k elements, but this can easily be achieved by inverting the
ordering predicate."

Barry

Greg Snow

2009-Jun-04 17:14 UTC

head link

[R] Fast way of finding top-n values of a long vector

Try adding a version that uses sort with the partial argument, that should be
faster than regular sort (for long enough test vectors) and possibly faster than
the max solutions when finding more than just the largest 2.

Also, for your max solutions, what happens when the 2 largest values are
identical?

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Allan Engelhardt
> Sent: Thursday, June 04, 2009 2:18 AM
> To: r-help at r-project.org
> Subject: [R] Fast way of finding top-n values of a long vector
> 
> If x is a (long) vector and n << length(x), what is a fast way of
> finding the top-n values of x?
> 
> Some suggestions (calculating the ratio of the two top values):
> 
> 
> library("rbenchmark")
> set.seed(1); x <- runif(1e6, max=1e7); x[1] <- NA;
> benchmark(
>  replications=20,
>  columns=c("test","elapsed"),
>  order="elapsed"
>  , sort = {a<-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];}
>  , max  = {m<-max(x, na.rm=TRUE); w<-which(x==m)[1]; m/max(x[-w],
> na.rm=TRUE);}
>  , max2 = {w<-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);}
> )
> #   test elapsed
> # 3 max2   0.772
> # 2  max   1.732
> # 1 sort   4.958
> 
> 
> I want to apply this code to a few tens of thousands of vectors so
> speed
> does matter.  In C or similar I would of course calculate the result
> with a single pass through x, and not with three passes as in
'max2'.
> 
> 
> Allan.
> 
> PS: I know na.last=NA is the default for sort, but there is no harm in
> being explicit in how you want NA's handled.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.

Allan Engelhardt

2009-Jun-04 18:51 UTC

head link

[R] Fast way of finding top-n values of a long vector

Greg Snow wrote:> Try adding a version that uses sort with the partial argument, that should
be faster than regular sort (for long enough test vectors) and possibly faster
than the max solutions when finding more than just the largest 2.
I find the documentation for the partial argument in sort very difficult 
to understand, but it seems to be something like this:


library("rbenchmark")
set.seed(1); x <- runif(1e6, max=1e7); x[1] <- NA;
benchmark(
 replications=20,
 columns=c("test","elapsed"),
 order="elapsed"
 , sort = {a<-sort(x, decreasing=TRUE, na.last=NA)[1:2]; a[1]/a[2];}
 , qsrt = {a<-sort(x, decreasing=TRUE, na.last=NA,
method="quick")[1:2];
a[1]/a[2];}
 , part = {a<-sort.int(-x, partial=1:2, na.last=NA)[1:2]; a[1]/a[2];}
 , max1 = {m<-max(x, na.rm=TRUE); w<-which(x==m)[1]; 
m/max(x[-w],na.rm=TRUE);}
 , max2 = {w<-which.max(x); max(x, na.rm=TRUE)/max(x[-w], na.rm=TRUE);}
)
#   test elapsed
# 5 max2   0.846
# 4 max1   1.957
# 3 part   2.752
# 2 qsrt   4.561
# 1 sort   5.577


Completely agree on your point about partial sort being faster for 
larger values of n and certainly giving more scalable code which was 
also what I was looking for so thanks for that tip!
> Also, for your max solutions, what happens when the 2 largest values are
identical?
It returns those two values, just like the sort solution:

x <- c(999,NA,1:10,NA,999)
m<-max(x, na.rm=TRUE); w<-which(x==m)[1]; c(m, max(x[-w],na.rm=TRUE))
# [1] 999 999
sort(x, decreasing=TRUE, na.last=NA)[1:2]
# [1] 999 999


Allan.

Reasonably Related Threads

Search for more apparently analagous threads

R help - Jun 2009 - Fast way of finding top-n values of a long vector

[R] Fast way of finding top-n values of a long vector

[R] Fast way of finding top-n values of a long vector

[R] Fast way of finding top-n values of a long vector

[R] Fast way of finding top-n values of a long vector

Reasonably Related Threads