Dear help list,
I think I have found a bug in the R randomForest package. Hopefully, you
are able to reproduce it.
I use R version 2.7.2 and randomForest version 4.5-27.
The following minimal code reproduces the problem:
library(randomForest)
tries <- 20       # number of repetitions
dimension <- 20   # number of features
n <- 200          # number of points
outlyingness <- rep(NaN, tries)
for (o_number in 1:tries) {
  # Generate features: n uncorrelated, normally distributed points
  features <- matrix(rnorm(n*dimension, 0, 1), n, dimension)
  # Compute the random forest (no response given), including the proximity matrix
  outlier.rf <- randomForest(features, ntree=100, proximity=TRUE)
  # Compute the mean proximity for each of the n points
  outlyingness_all <- apply(outlier.rf$proximity, 2, mean)
  # Compute the rank of point 1 according to the outlyingness
  better <- sum(outlyingness_all[1]<outlyingness_all)
  outlyingness[o_number] <- 1+better
}
outlyingness
Point number 1 plays a special role in this code fragment: it almost
always ranks last among the 200 points.
A typical value for "outlyingness" is
200 200 200 200 196 200 200 200 200 200 200 200 200 200 200 200 199 200
200 200
whereas any other point gives what one would expect. For example, if
better <- sum(outlyingness_all[1]<outlyingness_all)
is replaced by
better <- sum(outlyingness_all[17]<outlyingness_all)
one gets
194 7 184 76 25 40 175 174 137 75 49 146 175 150 148 118 100 88
121 14
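To make explicit what the rank means, here is a small sketch using only base R. The vector mean_prox and the helper rank_of are made up for illustration and are not part of randomForest. Rank 1 is the point with the largest mean proximity, and rank n is the point with the smallest, so a value of 200 above means that point 1 had the lowest mean proximity of all 200 points.

```r
# Hypothetical mean proximities for five points (values invented for illustration)
mean_prox <- c(0.02, 0.10, 0.07, 0.05, 0.09)

# Same formula as in the loop above: one plus the number of points
# with a strictly larger mean proximity (rank_of is a hypothetical helper)
rank_of <- function(i, x) 1 + sum(x[i] < x)

# Rank of every point, not only of point 1
ranks <- sapply(seq_along(mean_prox), rank_of, x = mean_prox)
ranks
```

As far as I can tell, this gives the same result as rank(-mean_prox, ties.method = "min"), which could be used to look at all points of one forest at once.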
Is this a bug or am I confused?
Can anybody help me? Has anybody seen this problem before?
Best regards
Jens Roeder