Terry Therneau
2011-Jan-24 14:53 UTC
[R] How to measure/rank ?variable importance when using rpart?
--- included message ----
Thus, my question is: *What common measures exist for ranking/measuring
variable importance of participating variables in a CART model? And how
can this be computed using R (for example, when using the rpart package)?*
---end ----
Consider the following printout from rpart:

  summary(rpart(time ~ age + ph.ecog + pat.karno, data=lung))

  Node number 1: 228 observations,    complexity param=0.03665178
    mean=305.2325, MSE=44176.93
    left son=2 (81 obs) right son=3 (147 obs)
    Primary splits:
        pat.karno < 75   to the left,  improve=0.03661157, (3 missing)
        ph.ecog   < 1.5  to the right, improve=0.03620793, (1 missing)
        age       < 75.5 to the right, improve=0.01606491, (0 missing)
    Surrogate splits:
        ph.ecog < 1.5  to the right, agree=0.787, adj=0.392, (3 split)
        age     < 72.5 to the right, agree=0.680, adj=0.089, (0 split)

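In Breiman, Friedman, Olshen, & Stone, the canonical CART book, the
pat.karno variable would get .0366 "points" for this split (the
improvement of the primary split), and each surrogate would get that
improvement times its adj value:
  ph.ecog would get .0366 * .392 points
  age would get .0366 * .089 points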
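
As a quick check of the arithmetic above, here is a minimal sketch in R.
The numbers are copied from the printout; the object names (improve, adj,
node_points) are made up for illustration only.

```r
# Values copied from the printout above; object names are illustrative.
improve <- 0.03661157                       # improvement of the primary split
adj     <- c(ph.ecog = 0.392, age = 0.089)  # adjusted agreement of each surrogate

# The primary variable gets the full improvement;
# each surrogate gets improve * adj.
node_points <- c(pat.karno = improve, improve * adj)
print(round(node_points, 4))
```

Rounded to four digits this gives 0.0366 for pat.karno, 0.0144 for
ph.ecog and 0.0033 for age, for this one node only.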
The reason for adding in surrogates is to account for redundant
variables. Suppose for instance that x1=height but so is x10, just
measured on a different day. They won't be exactly the same, so one
will get picked over the other at any given split; but at the end they
should get the same importance score.
This calculation is added up over all the splits to get a variable
importance. So -- all the necessary ingredients are present. Someone
just needs to write the importance function :-)
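
One possible shape for such a function is sketched below. It is not part
of rpart: the name importance_sketch is hypothetical. It relies on the
documented layout of an rpart object, where each internal node owns one
primary row, then its competitor rows, then its surrogate rows in
fit$splits, with the counts kept in fit$frame. Rescaling the "anova"
improvement by the node deviance is an assumption made here so that
nodes of different size can be added up. The lung data is taken from the
survival package, which is assumed to be installed.

```r
library(rpart)

# Sketch only: Breiman-style importance from an rpart fit.
# 'importance_sketch' is a hypothetical name, not an rpart function.
importance_sketch <- function(fit) {
  if (is.null(fit$splits)) return(numeric(0))   # root-only tree
  ff   <- fit$frame
  node <- which(ff$var != "<leaf>")             # internal nodes

  # Each internal node owns 1 primary + ncompete + nsurrogate rows of
  # fit$splits, in that order; 'first' is the row of each primary split.
  block <- 1 + ff$ncompete[node] + ff$nsurrogate[node]
  first <- cumsum(c(1, block))[seq_along(node)]

  gain <- fit$splits[first, "improve"]
  # Assumption: for method "anova" the printed improvement is relative to
  # the node's sum of squares, so rescale by the node deviance.
  if (fit$method == "anova") gain <- gain * ff$dev[node]

  vars <- as.character(ff$var[node])            # primary split variables
  pts  <- gain
  for (i in seq_along(node)) {
    ns <- ff$nsurrogate[node[i]]
    if (ns > 0) {
      rows <- first[i] + ff$ncompete[node[i]] + seq_len(ns)
      vars <- c(vars, rownames(fit$splits)[rows])
      pts  <- c(pts, gain[i] * fit$splits[rows, "adj"])  # improve * adj
    }
  }
  # Add up the points per variable over all splits.
  sort(c(tapply(pts, vars, sum)), decreasing = TRUE)
}

fit <- rpart(time ~ age + ph.ecog + pat.karno, data = survival::lung)
imp <- importance_sketch(fit)
print(imp)
```

Newer rpart releases also store a variable.importance component on the
fitted object, which is worth comparing against the result of a sketch
like this one.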
Terry T.
Tal Galili
2011-Jan-24 15:14 UTC
[R] How to measure/rank ?variable importance when using rpart?
Hi Terry,
I've actually already written such a function (based on a similar
question I once asked on this list), which I've attached below.
But I have a few problems with my function:
1) I wasn't sure how to include the surrogate variables' importance in
the function: how to access them from the rpart object, and how many of
them to ask for in the original call to rpart (its default is 5; is that
enough?). And how should they be presented in the final tally? Should
all of these numbers be mixed together?
2) I'm not sure which split type (error function) makes this a valid
method of measuring variable importance. For example, should we always
use information gain for this function (e.g.
rpart(..., parms = list(split = "information")))?
Or will this also work with the Gini index?
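
For reference, a small sketch of where these settings and rows live in
rpart. The number of surrogates is set through rpart.control(), and the
splitting criterion for a classification tree through parms. The
kyphosis data ships with rpart; the helper split_roles and the other
object names are hypothetical, introduced only for this illustration.
The sketch does not settle whether improvements from the two criteria
are comparable.

```r
library(rpart)

# kyphosis ships with rpart; all object names below are illustrative.
# With only three predictors at most two surrogates can appear per node,
# so raising maxsurrogate (default 5) only shows where the option goes.
ctrl <- rpart.control(maxsurrogate = 10)

fit_gini <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", parms = list(split = "gini"),
                  control = ctrl)
fit_info <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", parms = list(split = "information"),
                  control = ctrl)

# Label every row of fit$splits as primary / competitor / surrogate,
# using the per-node counts stored in fit$frame.
split_roles <- function(fit) {
  ff <- fit$frame[fit$frame$var != "<leaf>", ]
  unlist(lapply(seq_len(nrow(ff)), function(i)
    c("primary",
      rep("competitor", ff$ncompete[i]),
      rep("surrogate",  ff$nsurrogate[i]))))
}

roles <- split_roles(fit_gini)
print(data.frame(variable = rownames(fit_gini$splits),
                 role     = roles,
                 improve  = unname(fit_gini$splits[, "improve"]),
                 adj      = unname(fit_gini$splits[, "adj"])))

# First split variable chosen under each criterion.
print(as.character(fit_gini$frame$var[1]))
print(as.character(fit_info$frame$var[1]))
```

For surrogate rows the "improve" column holds the agreement with the
primary split rather than an improvement, which is why the rows need to
be told apart before anything is summed.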
Here is the function I've written so far:
info.gain.rpart <- function(fit1, to_plot = TRUE,
                            ylab = "sum of all the improvement (in fit1$splits[, 'improve'])",
                            main = "Information per variable", ...,
                            sort = TRUE, col)
{
  # Sum the "improve" column per variable. Note: fit1$splits also holds
  # competitor and surrogate rows (for surrogates this column is the
  # agreement with the primary split), so this sum mixes the row types.
  info_gain <- tapply(fit1$splits[, "improve"],
                      rownames(fit1$splits), sum)

  # let's order info_gain according to the original order of the
  # variables in the model formula
  # needed function: gets x/y and returns the order of x so it will be like y
  order.x.by.y <- function(x, y) order(match(x, y))
  # the original names of the predictors
  # (names(attr(fit1, "xlevels")) would only list the factor predictors)
  x_names <- attr(fit1$terms, "term.labels")
  info_gain_order <- order.x.by.y(names(info_gain), x_names)  # the needed new order
  info_gain <- info_gain[info_gain_order]

  length_info_gain <- length(info_gain)
  if (missing(col)) col <- rep("grey", length_info_gain)
  if (length(col) < length_info_gain)
    col <- rep(col, length.out = length_info_gain)

  if (sort) {
    ss <- order(info_gain, decreasing = TRUE)
    info_gain <- info_gain[ss]
    col <- col[ss]  # this way we can notice which belongs to which stem...
  }

  if (to_plot) barplot(info_gain, ylab = ylab, main = main, col = col, ...)
  return(info_gain)
}
Thanks,
Tal