thr3ads.net - R help - [R] simplifying randomForest(s) [Sep 2003]


Ramon Diaz-Uriarte

2003-Sep-16 09:44 UTC

[R] simplifying randomForest(s)

Dear All,

I have been using the randomForest package for a couple of difficult 
prediction problems (which also share p >> n). The performance is good, but
since all the variables in the data set are used, interpretation of what is
going on is not easy, even after looking at variable importance as produced 
by the randomForest run.

I have tried a simple "variable selection" scheme, and it does seem to perform
well (as judged by leave-one-out), but I am not sure if it makes any sense.
The idea is, in a kind of backwards elimination, to eliminate one by one the
variables with smallest importance (or all the ones with negative importance
in one go) until the out-of-bag estimate of classification error becomes
larger than that of the previous model (or of the initial model). So nothing
really new. But I haven't been able to find any comments in the literature
about "simplification" of random forests.
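
A minimal sketch of the elimination loop described above, written in Python with only the standard library. The `fit` callback and the `toy_fit` stand-in are hypothetical names introduced for illustration; they are not part of the randomForest package. In practice `fit` would wrap a forest fit and return its out-of-bag error together with the per-variable importances. Note that the stopping rule reuses the same error estimate that guides the selection.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical callback type: given a list of variable names, fit a model and
# return (error_estimate, {variable: importance}). This interface is an
# assumption made for the sketch, not an API of any real package.
FitFn = Callable[[List[str]], Tuple[float, Dict[str, float]]]


def backward_eliminate(variables: List[str], fit: FitFn) -> Tuple[List[str], float]:
    """Drop the least important variable until the error estimate gets worse."""
    current = list(variables)
    best_error, importance = fit(current)
    while len(current) > 1:
        # Remove the variable with the smallest importance in the current fit.
        weakest = min(current, key=lambda v: importance[v])
        candidate = [v for v in current if v != weakest]
        error, cand_importance = fit(candidate)
        if error > best_error:
            break  # error is larger than that of the previous model: stop
        current, best_error, importance = candidate, error, cand_importance
    return current, best_error


def toy_fit(variables: List[str]) -> Tuple[float, Dict[str, float]]:
    """Stand-in for a real model: 'x1' and 'x2' carry signal, the rest is noise."""
    signal = {"x1": 0.5, "x2": 0.3}
    importance = {v: signal.get(v, 0.0) for v in variables}
    missing_signal = sum(w for v, w in signal.items() if v not in variables)
    noise_count = sum(1 for v in variables if v not in signal)
    return 0.10 + missing_signal + 0.01 * noise_count, importance


selected, error = backward_eliminate(["x1", "x2", "n1", "n2", "n3"], toy_fit)
print(selected, round(error, 2))
```

With the toy scorer, the loop removes the three noise variables and stops when dropping a signal variable would raise the error.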

Any suggestions/comments?

Best,

Ramón

-- 
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://bioinfo.cnio.es/~rdiaz

Liaw, Andy

2003-Sep-16 12:14 UTC


[R] simplifying randomForest(s)

Ramon,
> From: Ramon Diaz-Uriarte [mailto:rdiaz at cnio.es] 
> 
> Dear All,
> 
> I have been using the randomForest package for a couple of difficult
> prediction problems (which also share p >> n). The performance is good, but
> since all the variables in the data set are used, interpretation of what is
> going on is not easy, even after looking at variable importance as produced
> by the randomForest run.
> 
> I have tried a simple "variable selection" scheme, and it does seem to perform
> well (as judged by leave-one-out) but I am not sure if it makes any sense.
> The idea is, in a kind of backwards elimination, to eliminate one by one the
> variables with smallest importance (or all the ones with negative importance
> in one go) until the out-of-bag estimate of classification error becomes
> larger than that of the previous model (or of the initial model). So nothing
> really new. But I haven't been able to find any comments in the literature
> about "simplification" of random forests.

This is quite a hazardous game.  We've been burned by this ourselves.  I'll
send you a paper we submitted on variable selection for random forest
off-line.  (Those who are interested, let me know.)

The basic problem is that when you select important variables by RF and then
re-run RF with those variables, the OOB error rate becomes biased downward.
As you iterate more times, the "overfitting" becomes more and more severe
(in the sense that the OOB error rate will keep decreasing while the error
rate on an independent test set stays flat or increases).  I was naïve enough
to ask Breiman about this, and his reply was something like "any competent
statistician would know that you need something like cross-validation to do
that"...
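
The downward bias can be illustrated with a small simulation, sketched here in Python with only the standard library. Everything in it is an assumption made for illustration: the data are pure noise (so the true error rate is 50%), a nearest-centroid classifier stands in for the random forest, leave-one-out stands in for the OOB estimate, and variables are ranked by the gap between class means rather than by RF importance. The two error rates differ only in whether the selection step is repeated inside each fold.

```python
import random


def class_means(rows, labels, j):
    """Mean of feature j within class 0 and within class 1."""
    a = [r[j] for r, y in zip(rows, labels) if y == 0]
    b = [r[j] for r, y in zip(rows, labels) if y == 1]
    return sum(a) / len(a), sum(b) / len(b)


def top_k(rows, labels, k):
    """Indices of the k features with the largest gap between class means."""
    def gap(j):
        m0, m1 = class_means(rows, labels, j)
        return abs(m0 - m1)
    return sorted(range(len(rows[0])), key=gap, reverse=True)[:k]


def predict(rows, labels, feats, x):
    """Nearest-centroid prediction for x using only the features in feats."""
    d0 = d1 = 0.0
    for j in feats:
        m0, m1 = class_means(rows, labels, j)
        d0 += (x[j] - m0) ** 2
        d1 += (x[j] - m1) ** 2
    return 0 if d0 <= d1 else 1


def loo_error(rows, labels, k, select_inside):
    """Leave-one-out error; selection is either redone per fold or done once."""
    fixed = None if select_inside else top_k(rows, labels, k)
    wrong = 0
    for i in range(len(rows)):
        tr = rows[:i] + rows[i + 1:]
        ty = labels[:i] + labels[i + 1:]
        feats = top_k(tr, ty, k) if select_inside else fixed
        wrong += predict(tr, ty, feats, rows[i]) != labels[i]
    return wrong / len(rows)


random.seed(1)
n, p, k = 20, 500, 10  # p >> n, as in the problems discussed in this thread
labels = [i % 2 for i in range(n)]
rows = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # pure noise

biased = loo_error(rows, labels, k, select_inside=False)
honest = loo_error(rows, labels, k, select_inside=True)
print("selection outside CV:", biased)
print("selection inside CV: ", honest)
```

When the selection is done once on all the data, the held-out case has already influenced which variables were kept, so the first estimate should typically come out well below 0.5 even though there is no signal. Repeating the selection inside each fold removes that leak.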

In the upcoming version 5 of Breiman's Fortran code, he offers an option to
run RF twice, first time with all variables, and the second with the k
(selected by user) most important variables from the 1st run.  The OOB error
rate from the 2nd run is no longer unbiased, but the bias is probably not
too severe with only one iteration.

Best,
Andy

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
