thr3ads.net - R help - [R] Variable Importance


Mathe, Ewy (NIH/NCI) [F]

2007-Aug-24 19:13 UTC

[R] Variable Importance - Random Forest

Hello,

 

I am trying to explore the use of random forests for classification and
am uncertain about the interpretation of the importance measurements.

 

With the option "importance = T" in the randomForest call, the
resulting 'importance' element is a matrix with four columns under the
following headings:

0 - mean raw importance score of variable x for class 0 (where
importance is the difference between the permuted data error and the
original test set error)

1 - mean raw importance score of variable x for class 1

MeanDecreaseAccuracy : average lowering of the margin across all cases
(where margin is the proportion of votes for the true class minus the
maximum proportion of votes for the other classes)

MeanDecreaseGini : summation of the gini decreases over all trees in the
forest

 

Are these definitions correct?  Why is the raw importance score
calculated for each class?  Could one just average the raw importance
scores for class 0 and 1 to get a composite importance score?

 

Now, with the option "importance = F" in the randomForest call,
the 'importance' element is a vector instead.  What values are those?

 

Thank you in advance for any input you may have.

 

Best,

Ewy

 

 

 

 

Ewy Mathe, Ph. D.

Laboratory of Human Carcinogenesis

National Cancer Institute, NIH

37 Convent Drive

Building 37, Room 3068

Bethesda, MD  20892-4255

Tel: 301-496-5835

Fax: 301-496-0497

 



Henric Nilsson (Public)

2007-Aug-25 22:32 UTC

[R] Variable Importance - Random Forest

On 2007-08-24 21:13, Mathe, Ewy (NIH/NCI) [F] wrote:
> Hello,
>
> I am trying to explore the use of random forests for classification and
> am uncertain about the interpretation of the importance measurements.

In case you haven't already done so, you probably want to read

@ARTICLE{Strobl+Boulesteix+Zeileis+Hothorn:2007,
   author = {Carolin Strobl and Anne-Laure Boulesteix and Achim Zeileis
             and Torsten Hothorn},
   title = {Bias in Random Forest Variable Importance Measures:
            Illustrations, Sources and a Solution},
   journal = {{BMC} Bioinformatics},
   year = {2007},
   volume = {8},
   number = {25},
   url = {http://www.biomedcentral.com/1471-2105/8/25/}
}


HTH,
Henric



Liaw, Andy

2007-Sep-07 01:51 UTC

[R] Variable Importance - Random Forest

I'm slowly clearing my back-log of r-help messages...

Please see reply inline below.

Andy
> From: Mathe, Ewy (NIH/NCI) [F]
> Hello,
> 
>  
> 
> I am trying to explore the use of random forests for classification and
> am uncertain about the interpretation of the importance measurements.
> 
>  
> 
> When having the option "importance = T" in the randomForest call, the
> resulting 'importance' element matrix has four columns with the
> following headings:
> 
> 0 - mean raw importance score of variable x for class 0 (where
> importance is the difference between the permuted data error and the
> original test set error)
> 
> 1 - mean raw importance score of variable x for class 1
> 
> MeanDecreaseAccuracy : average lowering of the margin across all cases
> (where margin is the proportion of votes for the true class - the
> maximum proportion of votes for the other classes)
> 
> MeanDecreaseGini : summation of the gini decreases over all trees in the
> forest
> 
>  
> 
> Are these definitions correct?  Why is the raw importance score
> calculated for each class?  Could one just average the raw importance
> scores for class 0 and 1 to get a composite importance score?
The "permutation-based" importance measures are based on OOB
(out-of-bag) data.  For each tree in the forest, the difference in error
rates on the OOB data with and without permuting the variable of
interest is computed.  Call this d[i] for the i-th tree.  The overall
importance measure is mean(d) / se(d), taken over all ntree values of
d[i], where se(d) is sd(d)/sqrt(ntree) (the "standard error").  The
numbers in the "0" and "1" columns are the analogs computed separately
for the "0" class and the "1" class.  These are useful, e.g., when
"balanced sampling" is used.
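
A minimal sketch of the scaling described above, written in plain Python
rather than R so that it needs no packages. The function name
scaled_importance and the per-tree values in d are hypothetical and only
illustrate the arithmetic; this is not the randomForest source code, and
it assumes the d[i] are not all identical (otherwise the standard error
is zero).

```python
import math
import statistics


def scaled_importance(d):
    """Return mean(d) / se(d), where se(d) = sd(d) / sqrt(ntree).

    d holds one value per tree: the OOB error rate with the variable
    permuted minus the OOB error rate without permutation.
    """
    ntree = len(d)
    mean_d = statistics.mean(d)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    se_d = statistics.stdev(d) / math.sqrt(ntree)
    return mean_d / se_d


# Hypothetical per-tree differences for one variable (made-up numbers).
d = [0.02, 0.05, 0.03, 0.04, 0.01]
score = scaled_importance(d)
print(score)
```

The per-class columns would follow the same recipe, with each d[i]
computed only from the OOB cases of that class.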
  
> Now, when having the option "importance = F" in the randomForest call,
> the 'importance' element is now a vector.  What values are those?
Those are the MeanDecreaseGini values.  They come at nearly zero
additional computation, so we might as well keep them.
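
For orientation, a sketch of the textbook Gini decrease for a single
split, again in plain Python. The helper names gini and gini_decrease
and the toy labels are assumptions made for illustration; the package's
internal bookkeeping and the way per-split decreases are aggregated over
trees may differ in weighting or normalization.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Toy node with three cases per class, split perfectly by some variable.
parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0], [1, 1, 1]
decrease = gini_decrease(parent, left, right)
print(decrease)
```

Because every split made while growing a tree already evaluates such a
quantity, accumulating it per variable costs almost nothing extra.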

