thr3ads.net - R help - [R] Remove redundant observations for cross-validation [Aug 2007]

Eleni Rapsomaniki

2007-Aug-10 12:03 UTC

[R] Remove redundant observations for cross-validation

Hi,

This is a general statistics question that I believe comes up often, so there
may be R functions or packages dedicated to it.
Suppose you want to check the accuracy of a classifier using a large training
data set in which each row represents an observation. Is there a simple approach
for removing redundant rows (rows with very similar values in all columns)
from the training data, so as to obtain a realistic estimate of classification
performance under cross-validation? The only one I can think of is clustering
the data into an arbitrary number of clusters and selecting one observation
from each cluster.

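e.g.
library(cluster)
# Four simulated groups; both columns of a group need the same length,
# otherwise cbind() recycles the shorter vector with a warning
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 2.5), rnorm(15, 5, 2.5)),
           cbind(rnorm(15, 15, 0.5), rnorm(15, 15, 0.5)),
           cbind(rnorm(5, 5, 0.1), rnorm(5, 5, 0.1)))

k <- 15  # arbitrary number of clusters
pamx <- pam(x, k)

# Keep one randomly chosen row per cluster
y <- array(NA, dim = c(k, ncol(x)))
for (i in seq_len(k)) {
  members <- which(pamx$clustering == i)
  # sample(members, 1) would draw from 1:members for a single-member cluster
  y[i, ] <- x[members[sample.int(length(members), 1)], ]
}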
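
A possible variant of the selection step, as a minimal sketch: pam() already
returns one actual observation per cluster (its medoid), so the random draw
could be replaced by the medoid rows. The seed, the smaller data set and the
value of k below are assumptions made only to keep the sketch short.

```r
library(cluster)

set.seed(1)  # assumed seed, only so the sketch is reproducible
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 2.5), rnorm(15, 5, 2.5)))

k <- 5  # still an arbitrary number of clusters
pamx <- pam(x, k)

# id.med holds the row indices of the medoids, which are real rows of x
y_med <- x[pamx$id.med, , drop = FALSE]
```

This makes the choice of representative deterministic for a given clustering,
but the number of clusters is still chosen by hand.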

This seems a bit subjective though... Any better ideas?
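
One way the "very similar values" criterion might be stated more directly, as
a rough sketch rather than a recommendation: cut a complete-linkage hierarchy
at a distance threshold, so that all rows within a group lie no further apart
than that threshold, and keep one row per group. The name threshold and its
value are hypothetical, and the columns are assumed to be on comparable scales
(otherwise scale(x) first).

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(5, 5, 0.1), rnorm(5, 5, 0.1)))

threshold <- 0.3  # hypothetical: largest distance at which rows count as redundant

d <- dist(x)                           # Euclidean distances between rows
hc <- hclust(d, method = "complete")   # merge height = largest pairwise distance
groups <- cutree(hc, h = threshold)    # every group has diameter <= threshold

# Keep the first row encountered in each group
y_thr <- x[!duplicated(groups), , drop = FALSE]
```

The threshold is still a judgement call, but it is expressed in the units of
the data rather than as a number of clusters.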

Eleni Rapsomaniki
