Hi, I am using the randomForest package for a prediction task on GWAS data. I first split the data into training and test sets (70% vs. 30%), then used the training set to grow the trees (ntree=100000). The OOB error on the training set looks good (<10%), but performance on the test set is poor, with an AUC of only about 50%. Although some people say no cross-validation is necessary for RF, I still felt unsafe and thought a test set was important. I am really frustrated with the results. Does anyone have suggestions? Thanks.

PS: example code I used

RF <- randomForest(PHENOTYPE ~ ., data = Train, importance = TRUE, ntree = 20000, do.trace = 5000)
rownames(Test) <- Test$IID
Pred <- predict(RF, Test, type = "prob")
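For illustration, a minimal sketch of how a test-set AUC like the one described above could be computed from those predictions; this assumes the pROC package and that PHENOTYPE is a two-level factor, neither of which is stated in the post:

library(pROC)

# Probability of the second class level, from predict(..., type = "prob")
p <- Pred[, 2]

# Compare predicted probabilities with the true test-set labels
roc_obj <- roc(response = Test$PHENOTYPE, predictor = p)
auc(roc_obj)  # area under the ROC curve on the test set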
> I am using the randomForest package for a prediction task on GWAS data. I
> first split the data into training and test sets (70% vs. 30%), then used
> the training set to grow the trees (ntree=100000). The OOB error on the
> training set looks good (<10%), but performance on the test set is poor,
> with an AUC of only about 50%.

Did you do any feature selection in the training set? If so, you also need to include that step in the cross-validation to get realistic performance estimates (see Ambroise and McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," Proceedings of the National Academy of Sciences (2002), vol. 99 (10), pp. 6562-6566).

In the caret package, train() can be used to get cross-validation estimates for RF, and the sbf() function (for selection by filter) can be used to include simple univariate filters in the CV procedure.

> Although some people say no cross-validation is necessary for RF, I still
> felt unsafe and thought a test set was important. I am really frustrated
> with the results.

CV is needed when you want an assessment of performance on a test set. In this sense, RF is like any other method.

--
Max
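A rough sketch of the caret workflow Max describes, using the Train data frame and PHENOTYPE outcome from the original post; the fold count, ntree value, and other settings below are placeholder assumptions, not recommendations:

library(caret)
library(randomForest)

x <- Train[, setdiff(names(Train), "PHENOTYPE")]
y <- Train$PHENOTYPE

# Cross-validated performance estimate for RF itself
ctrl <- trainControl(method = "cv", number = 10)
rf_cv <- train(x, y, method = "rf", trControl = ctrl, ntree = 500)

# Selection by filter: the univariate screen is re-run inside every resample,
# so the filtering step is included in the performance estimate
sbf_ctrl <- sbfControl(functions = rfSBF, method = "cv", number = 10)
rf_sbf <- sbf(x, y, sbfControl = sbf_ctrl)

The point in both cases is that any screening of SNPs happens inside each resample, so the resulting estimates are not optimistically biased by the selection step.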
Thanks, Max. Yes, I did some feature selection in the training set. Basically, I selected the top 1000 SNPs based on OOB error and grew the forest on the training set, then used the test set to validate the forest. But if I do the same selection on the test set, the top SNPs are different from those in the training set, which may be difficult to interpret.
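For illustration, a sketch of how the top-1000 ranking step could be repeated inside each fold of a cross-validation, so that the selection is accounted for in the error estimate (the fold count, ntree value, and the cutoff of 1000 are placeholders; variable names follow the post):

library(randomForest)

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Train)))
err <- numeric(k)

for (i in 1:k) {
  tr <- Train[folds != i, ]
  te <- Train[folds == i, ]

  # Rank SNPs using the training fold only, then keep the top 1000
  rf_all <- randomForest(PHENOTYPE ~ ., data = tr, importance = TRUE, ntree = 500)
  imp <- importance(rf_all, type = 1)
  top <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:1000]

  # Refit on the selected SNPs and assess on the held-out fold
  rf_top <- randomForest(x = tr[, top], y = tr$PHENOTYPE, ntree = 500)
  err[i] <- mean(predict(rf_top, te[, top]) != te$PHENOTYPE)
}
mean(err)  # CV error with the selection step inside each fold

The selected SNPs will indeed differ from fold to fold; the purpose of this arrangement is only to get an honest error estimate, while the final model and its SNP list can still be built on the full training set.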