If anyone has a very fast vectorized method for doing the following I would appreciate some help. I want to avoid outer() to limit memory problems for very large n. Let x = real vector of length n y = real vector of length n w = real vector of length m, m typically less than n/2 but can be > n z = real vector of length m For w[i], i=1,,,m, find the value of x that is closest to w[i]. In the case of ties, select one (optimally at random or just take the first match). Let z[i] = value of y corresponding to the closest x. If there is a Fortran or C function (builtin or otherwise) that does much of the work, all the better. Thanks, Frank -- Frank E Harrell Jr Prof. of Biostatistics & Statistics Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences U. Virginia School of Medicine http://hesweb1.med.virginia.edu/biostat -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Frank E Harrell Jr <fharrell at virginia.edu> writes:> If anyone has a very fast vectorized method for doing the following I would appreciate some help. I want to avoid outer() to limit memory problems for very large n. > > Let > > x = real vector of length n > y = real vector of length n > w = real vector of length m, m typically less than n/2 but can be > n > z = real vector of length m > > For w[i], i=1,,,m, find the value of x that is closest to w[i]. In the case of ties, select one (optimally at random or just take the first match). Let z[i] = value of y corresponding to the closest x. > > If there is a Fortran or C function (builtin or otherwise) that does much of the work, all the better.Sounds rather close to what approx() does in the stepfunction cases, except that you want jumps at the midpoints between the x values rather than at the x themselves. -- O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
On Thu, 28 Mar 2002, Frank E Harrell Jr wrote:> If anyone has a very fast vectorized method for doing the following I > would appreciate some help. I want to avoid outer() to limit memory > problems for very large n. > > Let > > x = real vector of length n > y = real vector of length n > w = real vector of length m, m typically less than n/2 but can be > n > z = real vector of length m > > For w[i], i=1,,,m, find the value of x that is closest to w[i]. In the > case of ties, select one (optimally at random or just take the first > match). Let z[i] = value of y corresponding to the closest x. >I believe the following will work in pre1.5.0 using the new findInterval function (modulo any remaining off-by-one errors that should be easy to fix) # sort them oo<-order(x) x<-x[oo] y<-y[oo] # find the right interval for w j<-pmax(1,findInterval(w,x)) xlow<-x[j] #or possibly j-1 and j or something xhi<-x[j+1] # which end of the interval? bestj<-ifelse(xhi-w<w-xlow,j+1,j) # get z z<-y[j] For length(x)=10^5 it takes 0.5s with length(w)=length(x)/3 and 2.5s with length(w)=length(x)*3 on my machine with a current pre1.5.0 -thomas -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Frank, Apologies if I'm being very stupid, but I'd guess that reducing the overall complexity would help for large datasets. The following is not vectorised, but should be linear apart from the order() operations, which are hopefully O(nlogn):>closest<-function(w,x,y) {ow<-order(w) ox<-order(x) ws<-w[ow] xs<-x[ox] ys<-y[ox] res<-rep(length(x),length(w)) i<-1 j<-1 while(i<=length(w) && j<length(x)) { if ( ws[i]<=(xs[j]+xs[j+1])/2 ) { res[i]<-j i<-i+1 } else j<-j+1 } reshuffle<-order((1:length(w))[ow]) z<-ys[res[reshuffle]] return(z) } It is all based on sorting the data first so that only m+n comparisons are needed rather than mn. I'd be glad to know whether I've messed up/completely misunderstood the problem (& I've just realised you're more interested in space than time so this may all be completely irrelevant.) Cheers, Mike. > -----Original Message----- > From: owner-r-help at stat.math.ethz.ch > [mailto:owner-r-help at stat.math.ethz.ch]On Behalf Of Frank E > Harrell Jr > Sent: 28 March 2002 13:20 > To: rhelp > Subject: [R] Vectorizing closest match > > > If anyone has a very fast vectorized method for doing the > following I would appreciate some help. I want to avoid > outer() to limit memory problems for very large n. > > Let > > x = real vector of length n > y = real vector of length n > w = real vector of length m, m typically less than n/2 but can be > n > z = real vector of length m > > For w[i], i=1,,,m, find the value of x that is closest to > w[i]. In the case of ties, select one (optimally at random > or just take the first match). Let z[i] = value of y > corresponding to the closest x. > > If there is a Fortran or C function (builtin or otherwise) > that does much of the work, all the better. > > Thanks, > > Frank > -- > Frank E Harrell Jr Prof. of Biostatistics & Statistics > Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences > U. Virginia School of Medicine http://hesweb1.med.virginia.edu/biostat -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-. -.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._. _._ -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._