Hi Monica,
I think the key to speeding this up is, for every point in 'track', to
compute the distance to all points in 'classif' simultaneously, using
vectorized calculations. Here's my function. On my laptop it's about
160 times faster than the original for the case I looked at (10,000
observations in track and 500 in classif). I get around 18 seconds for
the 30,000 and 4,000 example (2 GHz processor running Linux).
Dan
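dist.merge2 <- function(x, y, xeast, xnorth, yeast, ynorth) {
    ## construct data frame d in which d[i,] contains information
    ## associated with the closest point in y to x[i,]
    xpos <- as.matrix(x[, c(xeast, xnorth)])
    xposl <- lapply(seq_len(nrow(x)), function(i) xpos[i, ])
    ## transpose so that each column of ypos is one point in y
    ypos <- t(as.matrix(y[, c(yeast, ynorth)]))
    ## drop=FALSE keeps yinfo a data frame even if only one column remains
    yinfo <- y[, !colnames(y) %in% c(yeast, ynorth), drop=FALSE]
    get.match.and.dist <- function(point) {
        ## point (length 2) is recycled down the columns of ypos
        sqdists <- colSums((point - ypos)^2)
        ind <- which.min(sqdists)
        ## index of the closest point in y, and its distance
        c(ind, sqrt(sqdists[ind]))
    }
    ## 2 x nrow(x) matrix: row 1 = index into y, row 2 = distance
    match <- sapply(xposl, get.match.and.dist)
    cbind(xpos, mindist=match[2,], yinfo[match[1,], , drop=FALSE])
}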
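
To make the vectorized step inside get.match.and.dist concrete, here is a minimal sketch on made-up toy data (the coordinates below are invented purely for illustration). It relies on R recycling a length-2 vector down the columns of a 2-row matrix, so a single subtraction handles every candidate point at once.

```r
## Toy data, invented for illustration: three candidate points,
## stored one per column (row 1 = easting, row 2 = northing)
ypos <- rbind(c(0, 3, 10),
              c(0, 4, 10))
point <- c(2, 3)

## point is recycled column by column, so point[1] is subtracted from
## row 1 and point[2] from row 2 of every column
sqdists <- colSums((point - ypos)^2)

## index of the nearest candidate, and the distance to it
ind <- which.min(sqdists)
nearest <- c(ind, sqrt(sqdists[ind]))
```

This only works because ypos has exactly as many rows as point has elements, which is why the function transposes the y coordinates first.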
It's marginally faster to convert xpos to a list followed by sapply as
I do here, than to leave it as a matrix and use apply to get the
matches.
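
For anyone who wants to compare the two variants themselves, the following is a rough sketch on small synthetic data (the sizes and the seed are arbitrary choices, not the data from this thread). It only checks that both approaches give the same matches; wrapping each call in system.time() on realistically sized data would be one way to compare speed.

```r
set.seed(1)                               ## arbitrary seed, synthetic data
xpos <- matrix(runif(20), ncol = 2)       ## 10 query points, one per row
ypos <- t(matrix(runif(10), ncol = 2))    ## 5 candidate points, one per column

get.match.and.dist <- function(point) {
    sqdists <- colSums((point - ypos)^2)
    ind <- which.min(sqdists)
    c(ind, sqrt(sqdists[ind]))
}

## Variant 1: split the rows into a list, then sapply over the list
xposl <- lapply(seq_len(nrow(xpos)), function(i) xpos[i, ])
match.list <- sapply(xposl, get.match.and.dist)

## Variant 2: apply over the rows of the matrix directly
match.apply <- apply(xpos, 1, get.match.and.dist)

## Both results are 2 x nrow(xpos): row 1 = index into ypos, row 2 = distance
same <- isTRUE(all.equal(match.list, match.apply))
```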
On Tue, Sep 16, 2008 at 04:23:33PM +0000, Monica Pisica wrote:
> Hi,
>
> A few days ago I asked about a spatial join on the minimum distance
> between 2 sets of points with coordinates and attributes in 2 different
> data frames.
>
> Simon Knapp sent code to do it when calculating distance on a sphere using
> lat, long coordinates, and I've changed his code to use Euclidean distances
> since my data had UTM coordinates.
>
> Typically one data frame has around 30 000 points and the classification
> data frame has around 4000 points, and the aim is to add to each point from
> the first data frame all the attributes from the second data frame of the
> point that is closest to it.
>
> On my PC (Dell OptiPlex GX620, x86-based PC, 4 GB RAM, 3192 MHz
> processor) it took quite a long time to do the join:
>
> user system elapsed
> 8166.07 2.98 8194.43
>
> Sys.info()
> sysname release
> "Windows" "XP"
> version nodename
> "build 2600, Service Pack 2"
> machine
> "x86"
> I am running R 2.7.1 patched.
> I wonder if any of you can suggest or help (or have time) in optimizing
> this code to make it run faster. My programming skills are not high enough
> to do it.
>
> Thanks,
>
> Monica
>
> #### code follows:
> #### x a data frame with over 30000 points with coord in UTM, xeast, xnorth
> #### y a data frame with over 4000 points with UTM coord (yeast, ynorth)
> #### and classification
> ### calculating Euclidean distance
>
> dist <- function(xeast, xnorth, yeast, ynorth) {
>     ((xeast-yeast)^2 + (xnorth-ynorth)^2)^0.5
> }
>
> ### doing the merge by location with minimum distance
>
> dist.merge <- function(x, y, xeast, xnorth, yeast, ynorth){
>     tmp <- t(apply(x[,c(xeast, xnorth)], 1, function(x, y){
>         dists <- apply(y, 1, function(x, y) dist(x[2],
>                        x[1], y[2], y[1]), x)
>         cbind(1:nrow(y), dists)[dists == min(dists),,drop=FALSE][1,]
>     }
>     , y[,c(yeast, ynorth)]))
>     tmp <- cbind(x, min.dist=tmp[,2], y[tmp[,1],-match(c(yeast,
>                  ynorth), names(y))])
>     row.names(tmp) <- NULL
>     tmp
> }
>
> #### code end
>
--
http://www.stats.ox.ac.uk/~davison