clayton.springer@pharma.novartis.com
2004-May-07 15:26 UTC
[R] randomForests and Y-scrambling on a small synthetic dataset
Dear r-help,

The following dataset (generated with perl) has 10 observations of 100 dependent variables (integers drawn uniformly from [1:9]), split evenly between two classes. First I show some work, and then I ask two questions at the end.

> data <- read.table("rf_input.dat")
> library(randomForest)

# if we do randomForest one time it looks like this:

> rf <- randomForest(factor(V101) ~ ., data = data)
> rf$confusion
  1 2 class.error
1 5 5         0.5
2 4 6         0.4

# now we do it 100 times

> tnum <- numeric()
> for (i in 1:100) {
+   MT <- data$V101
+   MT.rf <- randomForest(factor(MT) ~ ., data = data[-c(101)])
+   number <- as.integer(summary(predict(MT.rf) == MT)[3])
+   tnum <- c(tnum, number)
+ }

# and get this distribution of results (about 13 correct out of 20)

> quantile(tnum, probs = seq(0, 1, 0.1), na.rm = T)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
   9   11   12   12   13   13   13   14   14   15   17

# now let's permute (re-randomize?) the classes and repeat 1000 times:

> library(gregmisc)
> tnum <- numeric()
> for (i in 1:1000) {
+   MT <- permute(data$V101)
+   MT.rf <- randomForest(factor(MT) ~ ., data = data[-c(101)])
+   number <- as.integer(summary(predict(MT.rf) == MT)[3])
+   tnum <- c(tnum, number)
+ }

# I get these results: the average is about 8 correct (out of 20), with 13 correct being at about
# the 95% confidence level

> quantile(tnum, probs = seq(0, 1, 0.1), na.rm = T)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
   1    4    5    6    7    8    8    9   10   12   18
> quantile(tnum, probs = seq(0.9, 1, 0.01), na.rm = T)
 90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100%
  12   12   12   12   12   13   13   14   14   15   18

--------

My two questions:

Question 1: Naively I might have expected to get 10/20 for the Y-scrambled examples, but instead I got 8/20. Why is that? (Presumably it has something to do with randomForest only training on 2/3 of the examples.)

Question 2: With my Y-scrambling exercise I seem to have demonstrated that the original dataset was not random. Yet it is random by construction. Is this just a fluke, or is something wrong with my protocol?

Thanks in advance,
Clayton
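
The Y-scrambling protocol described above can be written more compactly as in the sketch below. It is only an illustration under stated assumptions, not the original script: base R's sample() stands in for permute() from gregmisc, sum() replaces the summary(...)[3] idiom (which is fragile when every prediction agrees or none does), and the helper names rf_oob_correct, y_scramble, and perm_p_value are hypothetical. The data frame is regenerated in R as a guess at what the perl script produced (20 rows, 10 per class, matching the confusion matrix).

```r
# Count out-of-bag (OOB) correct predictions for one forest.
# fit$predicted holds the OOB predictions for the training rows.
rf_oob_correct <- function(x, y) {
  fit <- randomForest::randomForest(x = x, y = factor(y))
  sum(as.character(fit$predicted) == as.character(y), na.rm = TRUE)
}

# Y-scrambling: re-score after shuffling the labels n_perm times.
# score_fun is any function(x, y) that returns a single number.
y_scramble <- function(x, y, score_fun, n_perm = 1000) {
  vapply(seq_len(n_perm), function(i) score_fun(x, sample(y)), numeric(1))
}

# One-sided empirical p-value: share of scrambled scores at least as
# large as the observed one (the +1 terms count the observed score itself).
perm_p_value <- function(observed, null_scores) {
  (1 + sum(null_scores >= observed)) / (1 + length(null_scores))
}

# Usage on assumed synthetic data; runs only if randomForest is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  x <- as.data.frame(matrix(sample(1:9, 20 * 100, replace = TRUE), nrow = 20))
  y <- rep(1:2, each = 10)
  observed <- rf_oob_correct(x, y)
  null_scores <- y_scramble(x, y, rf_oob_correct, n_perm = 200)
  print(perm_p_value(observed, null_scores))
}
```

Bearing on Question 2: the 100 repeated fits on the true labels capture only the forest's own randomness, while the dataset itself is a single draw. On pure-noise data, an observed count at or above the 95th percentile of the scrambled distribution is expected for roughly one dataset in twenty, so a single such result does not by itself indicate a flaw in the protocol.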
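
On Question 1, one effect that may contribute is that with exactly balanced classes, any held-out point leaves its own class in the minority among the remaining points. The sketch below is a hedged illustration, not a diagnosis of randomForest's internals: the forest is replaced by a hypothetical majority-vote stand-in that ignores the predictors entirely, and the function name oob_majority_accuracy is an assumption made for this example.

```r
# Majority-vote stand-in for a classifier on pure-noise predictors:
# each bootstrap "tree" predicts the majority class of its in-bag labels,
# and is scored only on the rows it did not see (the out-of-bag rows).
oob_majority_accuracy <- function(y, n_boot = 5000) {
  n <- length(y)
  correct <- 0
  total <- 0
  for (b in seq_len(n_boot)) {
    in_bag <- sample.int(n, n, replace = TRUE)   # bootstrap sample
    oob <- setdiff(seq_len(n), in_bag)           # rows left out
    counts <- tabulate(y[in_bag], nbins = 2)     # in-bag class counts
    # Break exact ties at random, otherwise take the larger class.
    pred <- if (counts[1] == counts[2]) sample(1:2, 1) else which.max(counts)
    correct <- correct + sum(y[oob] == pred)
    total <- total + length(oob)
  }
  correct / total
}

set.seed(42)
y <- rep(1:2, each = 10)   # 20 balanced labels, as in the confusion matrix
acc <- oob_majority_accuracy(y)
print(acc)                 # pooled OOB accuracy of the stand-in
```

Because the out-of-bag rows are drawn disproportionately from the class that is under-represented in-bag, the pooled accuracy of this stand-in falls below one half even though the labels carry no signal. Whether the same mechanism fully accounts for the 8/20 seen with randomForest is not established by this sketch.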