I need to apply a function to a subset of one variable, where the
subset is defined by the n nearest neighbours of each observation in a
second variable.
Here's an example applied to the 'iris' dataset:
$ head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
For each row, I look at the value of Sepal.Length. I then find the n
other rows whose Sepal.Length is closest to that of the original row,
and apply a function to the Sepal.Width values of these rows (typically
returning a scalar).
For example, with n = 5 and the mean as the function, for the first row
(Sepal.Length ~= 5.1) of a slightly modified dataset:
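$ set.seed(1)
$ # add a little noise to break ties in the measurements
$ iris[, 1:4] <- iris[, 1:4] + runif(150) / 100
$ x <- iris$Sepal.Length[1]
$ # order() returns row numbers sorted by distance; position 1 is row 1 itself
$ (pos <- order(abs(iris$Sepal.Length - x))[2:6])
$ mean(iris$Sepal.Width[pos])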
Now, I could easily use a 'for' loop or 'sapply' to do this for all
rows, but I would think there is a better (and perhaps even faster?)
way. Does anyone know of a specific function in a package for this sort
of thing?
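
For reference, a minimal sketch of the brute-force 'sapply'-style
approach mentioned above. The helper name nn_apply and its arguments are
made up for illustration and do not come from any package. The sketch
assumes that all values of x are distinct (as in the jittered data), so
that each row sorts first relative to itself.

```r
# Hypothetical helper, not a packaged function. Assumes all values of x are
# distinct, so that row i itself always sorts first (distance 0).
nn_apply <- function(x, y, n, f = mean) {
  stopifnot(length(x) == length(y), n < length(x))
  vapply(seq_along(x), function(i) {
    idx <- order(abs(x - x[i]))   # row numbers, nearest first
    f(y[idx[2:(n + 1)]])          # skip position 1, which is row i itself
  }, numeric(1))
}

set.seed(1)
# Jittered copy of Sepal.Length; the iris data frame itself is left untouched.
x <- iris$Sepal.Length + runif(150) / 100
res <- nn_apply(x, iris$Sepal.Width, n = 5)
head(res)
```

This sorts the full distance vector once per row, so the cost grows
roughly quadratically with the number of rows. That is fine for 150
rows but is the part a dedicated function could presumably do better.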
Also note that this approach won't necessarily work on the unmodified
dataset, where a number of rows have the same value of 'Sepal.Length',
so the original row won't necessarily come first in the ordering.
(Exactly how ties are broken when more than n observations are at the
same distance from the original row isn't very important, though. For
example, taking the ones with the lowest row numbers, or n random ones,
would both be OK.)
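
One possible way to handle the ties, as a sketch only: pass a second
sort key to 'order' (the row number, or a random permutation) and remove
the original row by its row number rather than by its position. The
names nearest_rows and random_ties are hypothetical, chosen for this
example.

```r
# Hypothetical helper: row numbers of the n nearest neighbours of row i in x.
nearest_rows <- function(x, i, n, random_ties = FALSE) {
  d <- abs(x - x[i])
  # The second key decides ties: lowest row number first, or a random order.
  tie_key <- if (random_ties) sample.int(length(x)) else seq_along(x)
  idx <- order(d, tie_key)
  idx <- idx[idx != i]      # drop row i by number, wherever it sorted
  idx[seq_len(n)]
}

# Works on data with repeated Sepal.Length values, such as the original iris.
pos <- nearest_rows(iris$Sepal.Length, i = 1, n = 5)
mean(iris$Sepal.Width[pos])
```

With random_ties = TRUE the chosen rows depend on the random seed, so
call set.seed() first if the result needs to be reproducible.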
--
Karl Ove Hufthammer