Hi, I am using the randomForest package for a prediction task on GWAS data. I first split the data into training and test sets (70% vs. 30%), then used the training set to grow the trees (ntree=100000). The OOB error on the training set looks good (<10%), but performance on the test set is poor, with an AUC of only about 50%. Although some people say no cross-validation is necessary for RF, I still felt unsafe and thought a test set was important. I am really frustrated with the results. Does anyone have suggestions? Thanks.

PS: example code I used

RF <- randomForest(PHENOTYPE ~ ., data = Train, importance = TRUE, ntree = 20000, do.trace = 5000)
rownames(Test) <- Test$IID
Pred <- predict(RF, Test, type = "prob")
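For illustration, a minimal sketch of how a test-set AUC like the one described above could be computed from those predictions; this assumes the pROC package and that PHENOTYPE is a two-level factor, neither of which is stated in the post:

library(pROC)

# Probability of the second class level, from predict(..., type = "prob")
p <- Pred[, 2]

# Compare predicted probabilities with the true test-set labels
roc_obj <- roc(response = Test$PHENOTYPE, predictor = p)
auc(roc_obj)  # area under the ROC curve on the test set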
> I am using the randomForest package for a prediction task on GWAS data. I
> first split the data into training and test sets (70% vs. 30%), then used
> the training set to grow the trees (ntree=100000). The OOB error on the
> training set looks good (<10%), but performance on the test set is poor,
> with an AUC of only about 50%.

Did you do any feature selection in the training set? If so, you also need to include that step in the cross-validation to get realistic performance estimates (see Ambroise and McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," Proceedings of the National Academy of Sciences (2002), vol. 99 (10), pp. 6562-6566).

In the caret package, train() can be used to get cross-validation estimates for RF, and the sbf() function (for selection by filter) can be used to include simple univariate filters in the CV procedure.

> Although some people say no cross-validation is necessary for RF, I still
> felt unsafe and thought a test set was important. I am really frustrated
> with the results.

CV is needed when you want an assessment of performance on a test set. In this sense, RF is like any other method.

--
Max
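A rough sketch of the caret workflow Max describes, using the Train data frame and PHENOTYPE outcome from the original post; the fold count, ntree value, and other settings below are placeholder assumptions, not recommendations:

library(caret)
library(randomForest)

x <- Train[, setdiff(names(Train), "PHENOTYPE")]
y <- Train$PHENOTYPE

# Cross-validated performance estimate for RF itself
ctrl <- trainControl(method = "cv", number = 10)
rf_cv <- train(x, y, method = "rf", trControl = ctrl, ntree = 500)

# Selection by filter: the univariate screen is re-run inside every resample,
# so the filtering step is included in the performance estimate
sbf_ctrl <- sbfControl(functions = rfSBF, method = "cv", number = 10)
rf_sbf <- sbf(x, y, sbfControl = sbf_ctrl)

The point in both cases is that any screening of SNPs happens inside each resample, so the resulting estimates are not optimistically biased by the selection step.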
Thanks, Max. Yes, I did some feature selection in the training set. Basically, I selected the top 1000 SNPs based on OOB error and grew the forest on the training set, then used the test set to validate the forest. But if I do the same selection on the test set, the top SNPs are different from those in the training set, which may be difficult to interpret.
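For illustration, a sketch of how the top-1000 ranking step could be repeated inside each fold of a cross-validation, so that the selection is accounted for in the error estimate (the fold count, ntree value, and the cutoff of 1000 are placeholders; variable names follow the post):

library(randomForest)

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Train)))
err <- numeric(k)

for (i in 1:k) {
  tr <- Train[folds != i, ]
  te <- Train[folds == i, ]

  # Rank SNPs using the training fold only, then keep the top 1000
  rf_all <- randomForest(PHENOTYPE ~ ., data = tr, importance = TRUE, ntree = 500)
  imp <- importance(rf_all, type = 1)
  top <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:1000]

  # Refit on the selected SNPs and assess on the held-out fold
  rf_top <- randomForest(x = tr[, top], y = tr$PHENOTYPE, ntree = 500)
  err[i] <- mean(predict(rf_top, te[, top]) != te$PHENOTYPE)
}
mean(err)  # CV error with the selection step inside each fold

The selected SNPs will indeed differ from fold to fold; the purpose of this arrangement is only to get an honest error estimate, while the final model and its SNP list can still be built on the full training set.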