thr3ads.net - R help - [R] Random Forests Variable Importance Question [Apr 2009]

Paul Fisch

2009-Apr-13 09:04 UTC

[R] Random Forests Variable Importance Question

I am trying to use the random forests package for classification in R.

The Variable Importance Measures listed are:

-mean raw importance score of variable x for class 0

-mean raw importance score of variable x for class 1

-MeanDecreaseAccuracy

-MeanDecreaseGini

Now I know what these "mean" as in I know their definitions. What I
want to know is how to use them.

What I am trying to figure out is what these values mean in only the
context of how accurate they are, what is a good value, what is a bad
value, what are the maximums and minimums, etc.

If a variable has a high MeanDecreaseAccuracy or MeanDecreaseGini does
that mean it is important or unimportant? Also any information on the
raw scores would be really helpful too. I want to know everything
there is to know about these numbers that is relevant to the
application of them.

I don't really want a technical explanation that uses words like
'error', 'summation', or 'permutated', but rather a simpler
explanation that doesn't involve any discussion of how random forests
work (I have read all about that and didn't find it very helpful).

Like if I wanted someone to explain to me how to use a radio, I
wouldn't expect the explanation to involve how a radio converts radio
waves into sound.

If anyone can help me out at all it would be really great. I have
read many many lectures on random forests and other data mining
lectures but I have never found simple answers about how to read the
variable importance measures.

Thanks,
Paul Fisch

Liaw, Andy

2009-Apr-13 13:09 UTC

[R] Random Forests Variable Importance Question

I'll take a shot.

Let me try to explain the 3rd measure first.  A RF model tries to predict an
outcome variable (the classes) from a group of potential predictor variables
(the "x").  If a predictor variable is "important" in making the prediction
accurate, then messing with it (e.g., giving it random values) should have a
larger impact on how well the prediction can be made than messing with a
variable that contributes little.  The variable importance measure tries to
capture this.  (If you throw a wrench into the trunk of a car, it probably
doesn't affect how the car drives.  However, if you throw the wrench into
the engine compartment, that _may_ be a different story.)
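
The "mess with one variable and see what breaks" idea can be tried out in a few lines of base R. This is only a sketch under stated assumptions: a logistic regression (glm) stands in for the random forest, a held-out test set stands in for the out-of-bag samples, and the predictors `signal` and `noise` are made up for illustration.

```r
# Sketch of the "mess with one variable" idea, using base R only.
# Assumption: glm stands in for the random forest, and a held-out
# test set stands in for the out-of-bag samples.
set.seed(1)
n <- 1000
signal <- rnorm(n)                 # hypothetical predictor that drives the class
noise  <- rnorm(n)                 # hypothetical predictor unrelated to the class
y <- as.integer(signal + rnorm(n, sd = 0.5) > 0)
dat <- data.frame(y, signal, noise)

train <- dat[1:700, ]
test  <- dat[701:1000, ]
fit <- glm(y ~ signal + noise, data = train, family = binomial)

# Share of test rows whose class is predicted correctly.
accuracy <- function(model, newdata) {
  pred <- as.integer(predict(model, newdata, type = "response") > 0.5)
  mean(pred == newdata$y)
}

# Shuffle one column, leave everything else alone, and measure the loss.
drop_in_accuracy <- function(model, newdata, var) {
  shuffled <- newdata
  shuffled[[var]] <- sample(shuffled[[var]])
  accuracy(model, newdata) - accuracy(model, shuffled)
}

drops <- sapply(c("signal", "noise"),
                function(v) drop_in_accuracy(fit, test, v))
print(round(drops, 3))
```

Shuffling `signal` should cost far more accuracy than shuffling `noise`, which is the sense in which a larger MeanDecreaseAccuracy points to a more important variable.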

I don't know about others, but I only look at the relative importance of the
variables, rather than trying to interpret the numbers (raw or scaled).  Any
number below 0 should be treated the same as 0 (if I recall, Breiman &
Cutler's code truncates the values at 0).  Any variable with an importance
value smaller than the absolute value of the minimum is probably not worth a
closer look.
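
As a rough illustration of that rule of thumb, the following base R sketch applies it to a made-up importance vector. The variable names and numbers are hypothetical, and the cutoff is only informative when the smallest score is actually negative.

```r
# Hypothetical importance scores; names and numbers are made up.
imp <- c(age = 12.4, income = 7.9, height = 0.6, zip = -1.3, id = -0.2)

# Treat anything below zero the same as zero.
imp_floor <- pmax(imp, 0)

# The most negative score hints at how large a score can get by chance
# alone, so keep only variables that clearly exceed that magnitude.
noise_level <- abs(min(imp))
worth_a_look <- names(imp)[imp > noise_level]

print(sort(imp_floor, decreasing = TRUE))
print(worth_a_look)
```

Here `height` is positive but still smaller than the magnitude of the most negative score, so it is set aside along with the negative ones.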

The first two measures (you must be predicting an outcome variable with two
classes) are the analogous measures that address each of the two classes
specifically, rather than over all of the data.
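
To see where each measure sits in the output, a minimal sketch with the randomForest package might look like the following. It assumes the package is installed and uses the built-in iris data (three classes rather than two) purely as a stand-in for the poster's data.

```r
# Assumes the randomForest package is installed; iris is a built-in
# data set with three classes, used here only as a stand-in.
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

imp <- importance(rf)   # matrix: one row per predictor, one column per measure
print(colnames(imp))    # per-class columns first, then the two overall measures

# Rank predictors by the overall permutation measure, largest first.
ranked <- sort(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE)
print(ranked)
```

The per-class columns are read the same way as MeanDecreaseAccuracy, only restricted to the cases of one class.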

Andy


Dimitri Liakhovitski

2009-Apr-14 20:47 UTC

[R] Random Forests Variable Importance Question

Paul,

To build on what Andy said:
The measures of importance RF provides are just alternative ways of
getting at the same thing: Variable Importance.
For example, MeanDecreaseAccuracy is one of those alternatives. As
Andy said, it does not make sense to look at the absolute importance
value. In a hypothetical case where all importance values seem "high"
but are equal - that means that all variables have the same
importance. In another case, where all importance values seem "low"
but are equal - that means exactly the same thing, that all variables
have the same importance. The point is: the absolute value of
importance is not very helpful. One needs to build relative importance
values.
I learned to use it like this (similar to what Andy said):

1. Take the RF output for each variable (MeanDecreaseAccuracy, for
example): if the RF object is called "rftest", then I take the vector
as.data.frame(rftest$importance)[1].
2. I divide each variable's (raw) importance by its respective SD
(as.data.frame(rftest$importanceSD)[1]).
3. The resulting values that are less than zero are made equal to
zero, as Andy mentioned.
4. I take each value, multiply it by 100 and divide it by the sum of
all the values from step 3.

This way I get relative importance of each predictor and all
importances sum up to 100.
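
The four steps above can be sketched in base R as follows. The vectors `raw` and `sdv` are hypothetical stand-ins with made-up numbers; with a real classification fit they would presumably be the MeanDecreaseAccuracy columns of rftest$importance and rftest$importanceSD.

```r
# Hypothetical raw importances and their standard deviations (made-up
# numbers). With a real fit these would presumably come from
#   rftest$importance[, "MeanDecreaseAccuracy"]
#   rftest$importanceSD[, "MeanDecreaseAccuracy"]
raw <- c(x1 = 0.040, x2 = 0.015, x3 = 0.002, x4 = -0.001)
sdv <- c(x1 = 0.004, x2 = 0.003, x3 = 0.002, x4 = 0.002)

scaled <- raw / sdv                      # step 2: divide by the SD
scaled[scaled < 0] <- 0                  # step 3: negatives become zero
relative <- 100 * scaled / sum(scaled)   # step 4: rescale to sum to 100

print(round(relative, 1))
```

Selecting the columns by name rather than by position is a deliberate choice in this sketch, since the position of MeanDecreaseAccuracy depends on how many classes the outcome has.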

-- 
Dimitri Liakhovitski
MarketTools, Inc.
Dimitri.Liakhovitski at markettools.com
