Several of the methods I use for analyzing large data sets, such as
    WinGamma: determining the level of noise in data
    Relief-F: estimating the influence of variables
depend on finding the k nearest neighbors of a point in a data frame or
matrix efficiently. (For large data sets it is not feasible to compute
the 'dist' matrix anyway.)
Seeing the proposed solution to "[R] distance between two matrices"
last month for finding _one_ nearest neighbor I came up with a solution
'nearest(A, n, k)' as appended.
Still, this is very clumsy and slow if you have to find the 3 nearest
neighbors for 1000 points in a data frame with at least 100000 entries
-- about 2 seconds per data point on my computer, or half an hour for a
real-life application.
Are there better/faster ways to perform this task using R functions?
Even better, is there a free implementation of kd-trees that I could
utilize (the one I found did not conform to the ANSI C standard)?
Someone pointed to 'spdep' of the R-Sig-Geo project, but 'knearneigh'
only accepts 2D data points.
Hans Werner Borchers
ABB Corporate Research, Germany
________________________________________________________________________
require(class)

nearest <- function(X, n, k = 3)
## Find the k nearest neighbors of X[n, ] in the data frame
## or matrix X, calling the function knn1 k times.
{
    N <- nrow(X)
    # inds collects the query index plus the neighbor indices found so far
    inds <- c(n); i <- 0
    while (i < k) {
        # use knn1 to get the index of one neighbor in the reduced matrix...
        j <- as.integer(knn1(X[-inds, ], X[n, ], 1:(N - length(inds))))
        # ...and map it back to a row index of the full matrix
        inds <- c(inds, setdiff(1:N, inds)[j])
        i <- i + 1
    }
    # return the nearest neighbor indices (without n, of course)
    return(inds[-1])
}
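The k successive calls to knn1 above can be replaced, for a single query point, by one vectorized distance computation plus a ranking step. A minimal sketch in base R (the function name 'nearest2' is mine, not part of the original post), assuming plain Euclidean distance on numeric columns:

```r
## Vectorized k-nearest-neighbor lookup for one query point.
## Computes squared Euclidean distances from X[n, ] to every row of X
## in one shot, then picks the k smallest (excluding the point itself).
nearest2 <- function(X, n, k = 3) {
    X <- as.matrix(X)
    # t(X) is p x N, so subtracting the length-p query vector recycles
    # it down each column; colSums then yields the N squared distances.
    # No sqrt needed: sqrt is monotone and we only rank the distances.
    d2 <- colSums((t(X) - X[n, ])^2)
    d2[n] <- Inf                # a point is not its own neighbor
    order(d2)[1:k]              # indices of the k nearest rows
}
```

This does one pass of vectorized arithmetic in compiled code per query, instead of k round trips through knn1 that each rebuild an (N - i) x p training matrix, so it should be considerably faster for the 1000-query workload described above.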
Why not modify the C code of knn? (Not knn1)

On Mon, 2 Feb 2004, Hans W Borchers wrote:
> [original message quoted in full]

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA)
Fax: +44 1865 272595
> Why not modify the C code of knn? (Not knn1)

Because I thought it would not be a good idea to apply 'knn' a thousand
times or more. Instead I was looking for a procedure that returns a
structure which makes it easy to find the nearest neighbors of any point
one wants. For smaller data sets I used 'dist' in this way.

The next step could be to extend nearest neighbors to data frames with
factors, generalizing, for example, the 'daisy' function. But you might
be right that looking at the 'knn' implementation is the faster road for
the moment.

Hans Werner Borchers
ABB Corporate Research, Germany
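The 'dist' route mentioned above can be written down as a small helper that returns the whole neighbor structure at once, which is the kind of reusable result being asked for. A sketch under the assumption that the full N x N distance matrix fits in memory (the helper name 'knn_from_dist' is mine, not from the thread):

```r
## All-points k-nearest-neighbor table from a full distance matrix.
## Row i of the result holds the indices of the k nearest neighbors
## of X[i, ]. Only feasible when the N x N matrix fits in memory.
knn_from_dist <- function(X, k = 3) {
    D <- as.matrix(dist(X))     # full symmetric Euclidean distance matrix
    diag(D) <- Inf              # a point is not its own neighbor
    # for each row, the indices of the k smallest distances
    t(apply(D, 1, function(d) order(d)[1:k]))
}
```

For data frames with factor columns, dist() would have to be replaced by a mixed-type dissimilarity such as the 'daisy' function from the cluster package, along the lines suggested above.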