Eleni Rapsomaniki
2007-Aug-10 12:03 UTC
[R] Remove redundant observations for cross-validation
Hi,

This is a general statistics question that I believe comes up often, so there may be R functions or packages dedicated to it.

Suppose you want to check the accuracy of a classifier using a large training data set in which each row represents an observation. Is there a simple approach for removing redundant rows (rows with very similar values in all columns) from the training data, so as to obtain a realistic estimate of classification performance under cross-validation?

The only one I can think of is clustering the data into an arbitrary number of clusters and selecting one observation from each cluster, e.g.:

    library(cluster)

    # Toy data: four groups of 2-d points with different spreads.
    # Both columns within each cbind() need the same length.
    x <- rbind(cbind(rnorm(10, 0, 0.5),  rnorm(10, 0, 0.5)),
               cbind(rnorm(15, 5, 2.5),  rnorm(15, 5, 2.5)),
               cbind(rnorm(15, 15, 0.5), rnorm(15, 15, 0.5)),
               cbind(rnorm(5, 5, 0.1),   rnorm(5, 5, 0.1)))

    # Partition into 15 clusters, then draw one random row from each.
    pamx <- pam(x, 15)
    y <- array(NA, dim = c(15, ncol(x)))
    for (i in 1:15) {
      members <- which(pamx$clustering == i)
      # index directly: sample() on a length-1 vector would draw from 1:n
      y[i, ] <- x[members[sample.int(length(members), 1)], ]
    }

This seems a bit subjective though... Any better ideas?

Eleni Rapsomaniki
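
One way to make the choice above less arbitrary, sketched here under assumptions rather than offered as an established recipe: replace the fixed number of clusters with a distance tolerance. With complete-linkage hierarchical clustering, cutting the tree at height tol gives groups whose members are all within tol of one another, and one row is kept per group. The function name dedup_rows and the argument tol are hypothetical, and standardising the columns first is an assumption that may not suit every data set (it fails for constant columns, for instance).

```r
# Sketch: keep one representative row per group of near-duplicate rows.
# 'dedup_rows' and 'tol' are hypothetical names, not part of any package.
dedup_rows <- function(x, tol) {
  x <- as.matrix(x)
  d <- dist(scale(x))                   # Euclidean distance on standardised columns
  hc <- hclust(d, method = "complete")  # complete linkage: merge height = group diameter
  grp <- cutree(hc, h = tol)            # members of a group are within 'tol' of each other
  keep <- !duplicated(grp)              # keep the first row of each group
  x[keep, , drop = FALSE]
}

# Toy data: 20 distinct points plus 5 slightly perturbed copies.
set.seed(1)
base <- matrix(rnorm(40), ncol = 2)
near <- base[1:5, ] + matrix(rnorm(10, sd = 1e-3), ncol = 2)
x <- rbind(base, near)

y <- dedup_rows(x, tol = 0.05)
nrow(x)  # rows before
nrow(y)  # rows after removing near-duplicates
```

The tolerance is still a judgment call, but it is stated in the units of the (standardised) data rather than as a cluster count. Note that dist() builds the full pairwise distance matrix, so this only scales to moderately sized data sets.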