Is there a generic binary search routine in a standard library which a) works for character vectors b) runs in O(log(N)) time? I'm aware of findInterval(x,vec), but it is restricted to numeric vectors. I'm also aware of various hashing solutions (e.g. new.env(hash=TRUE) and fastmatch), but I need the greatest-lower-bound match in my application. findInterval is also slow for large N=length(vec) because of the O(N) checking it does, as Duncan Murdoch has pointed out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>: though its documentation says it runs in O(n * log(N)), it actually runs in O(n * log(N) + N), which is quite noticeable for largish N. But that is easy enough to work around by writing a variant of findInterval which calls find_interv_vec without checking. -s PS Yes, binary search is a one-liner in R, but I always prefer to use standard, fast native libraries when possible.... binarysearch <- function(val,tab,L,H) {while (H>=L) { M=L+(H-L) %/% 2; if (tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)}; return(L-1)} [[alternative HTML version deleted]]
> -----Original Message----- > From: r-help-bounces at r-project.org > [mailto:r-help-bounces at r-project.org] On Behalf Of Stavros Macrakis > Sent: Monday, April 04, 2011 1:15 PM > To: r-help > Subject: [R] General binary search? > > Is there a generic binary search routine in a standard library which > > a) works for character vectors > b) runs in O(log(N)) time? > > I'm aware of findInterval(x,vec), but it is restricted to > numeric vectors.xtfrm(x) will convert a character (or other) vector to a numeric vector with the same ordering. findInterval can work on that. E.g., > f0 <- function(x, vec) { tmp <- xtfrm(c(x, vec)) findInterval(tmp[seq_along(x)], tmp[-seq_along(x)]) } > f0(c("Baby", "Aunt", "Dog"), LETTERS) [1] 2 1 4 I've never looked at its speed. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com> > I'm also aware of various hashing solutions (e.g. > new.env(hash=TRUE) and > fastmatch), but I need the greatest-lower-bound match in my > application. > > findInterval is also slow for large N=length(vec) because of the O(N) > checking it does, as Duncan Murdoch has pointed > out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>: > though > its documentation says it runs in O(n * log(N)), it actually > runs in O(n * > log(N) + N), which is quite noticeable for largish N. But > that is easy > enough to work around by writing a variant of findInterval which calls > find_interv_vec without checking. > > -s > > PS Yes, binary search is a one-liner in R, but I always prefer to use > standard, fast native libraries when possible.... > > binarysearch <- function(val,tab,L,H) {while (H>=L) { > M=L+(H-L) %/% 2; if > (tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)}; > return(L-1)} > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Try data.table:::sortedmatch, which is implemented in C. It requires it's input to be sorted (and doesn't check) "Stavros Macrakis" <macrakis at alum.mit.edu> wrote in message news:BANLkTi=j2LF5SyXYTV1Dd4k9wr0ZGK88ig at mail.gmail.com...> Is there a generic binary search routine in a standard library which > > a) works for character vectors > b) runs in O(log(N)) time? > > I'm aware of findInterval(x,vec), but it is restricted to numeric vectors. > > I'm also aware of various hashing solutions (e.g. new.env(hash=TRUE) and > fastmatch), but I need the greatest-lower-bound match in my application. > > findInterval is also slow for large N=length(vec) because of the O(N) > checking it does, as Duncan Murdoch has pointed > out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>: > though > its documentation says it runs in O(n * log(N)), it actually runs in O(n * > log(N) + N), which is quite noticeable for largish N. But that is easy > enough to work around by writing a variant of findInterval which calls > find_interv_vec without checking. > > -s > > PS Yes, binary search is a one-liner in R, but I always prefer to use > standard, fast native libraries when possible.... > > binarysearch <- function(val,tab,L,H) {while (H>=L) { M=L+(H-L) %/% 2; if > (tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)}; > return(L-1)} > > [[alternative HTML version deleted]] >