Terry Therneau
2011-Jan-24 14:53 UTC
[R] How to measure/rank variable importance when using rpart?
--- included message ----
Thus, my question is: *What common measures exist for ranking/measuring variable importance of participating variables in a CART model? And how can this be computed using R (for example, when using the rpart package)*
---end ----

Consider the following printout from rpart

summary(rpart(time ~ age + ph.ecog + pat.karno, data=lung))

Node number 1: 228 observations,    complexity param=0.03665178
  mean=305.2325, MSE=44176.93
  left son=2 (81 obs) right son=3 (147 obs)
  Primary splits:
      pat.karno < 75   to the left,  improve=0.03661157, (3 missing)
      ph.ecog   < 1.5  to the right, improve=0.03620793, (1 missing)
      age       < 75.5 to the right, improve=0.01606491, (0 missing)
  Surrogate splits:
      ph.ecog < 1.5  to the right, agree=0.787, adj=0.392, (3 split)
      age     < 72.5 to the right, agree=0.680, adj=0.089, (0 split)

In Breiman, Friedman, Olshen, & Stone, the canonical CART book, the pat.karno variable would get .0366 "points" for this split,
   ph.ecog would get .0366 * .392 points
   age would get .0366 * .089 points

The reason for adding in surrogates is to account for redundant variables. Suppose for instance that x1=height but so is x10, just measured on a different day. They won't be exactly the same, so one will get picked over the other at any given split; but at the end they should get the same importance score.

This calculation is added up over all the splits to get a variable importance. So -- all the necessary ingredients are present. Someone just needs to write the importance function :-)

Terry T.
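The arithmetic described above for node 1 can be sketched in a few lines of R. This is only an illustration: the names `improve`, `adj` and `points` are made up here, and the numbers are copied by hand from the printout rather than extracted from a fitted rpart object.

```r
# Improvement of the primary split at node 1 (pat.karno < 75), from the printout
improve <- 0.03661157

# Adjusted agreement ("adj") of each surrogate split at the same node
adj <- c(ph.ecog = 0.392, age = 0.089)

# The primary variable gets the full improvement;
# each surrogate variable gets improve * adj
points <- c(pat.karno = improve, improve * adj)
print(points)
```

Repeating this at every split and summing the points per variable would give the importance score described in the message.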
Tal Galili
2011-Jan-24 15:14 UTC
[R] How to measure/rank variable importance when using rpart?
Hi Terry,

I've actually already written such a function (based on an old similar question I once asked on this list), which I have included below in this e-mail.

But I have a few problems with my function:

1) I wasn't sure how to include the surrogate variable importance level in the function: how to access the surrogates from the rpart object; how many of them to ask for in the original call to rpart (its default is 5; is that enough?); and how they should be presented in the final count. Should all of these numbers be mixed together?

2) I'm not sure which split type (error function) makes this a valid method of measuring variable importance. For example, should we always use information gain for this function (e.g. rpart(..., parms = list(split = "information"))), or will this also work with the Gini index?

Here is the function I've written so far:

info.gain.rpart <- function(fit1, to_plot = TRUE,
        ylab = "sum of all the improvement (in fit$split[, 'improve'])",
        main = "Information per variable", ..., sort = TRUE, col)
{
    # sum the "improve" column of the splits matrix by variable name
    info_gain <- tapply(fit1$splits[, "improve"], rownames(fit1$splits), sum)

    # let's order info_gain according to the original order of the letters in the data.frame
    # needed function: gets x/y and returns the order of x so it will be like y
    order.x.by.y <- function(x, y) order(match(x, y))
    x_names <- names(attr(fit1, "xlevels"))   # the original names of the elements
    info_gain_order <- order.x.by.y(names(info_gain), x_names)   # the needed new order
    info_gain <- info_gain[info_gain_order]
    length_info_gain <- length(info_gain)
    # info.gain <- info.gain[c(8,1:7)]

    if (missing(col)) col <- rep("grey", length_info_gain)
    if (length(col) < length_info_gain) col <- rep(col, length_info_gain)

    if (sort) {
        ss <- order(info_gain, decreasing = TRUE)
        info_gain <- info_gain[ss]
        col <- col[ss]   # this way we can notice which belongs to which stem...
    }

    if (to_plot) barplot(info_gain, ylab = ylab, main = main, col = col, ...)
    return(info_gain)
}

Thanks,
Tal
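On the question of how to reach the surrogates in the rpart object, one possible sketch follows. It assumes the usual layout of an rpart fit: fit$frame has one row per node with columns var, ncompete and nsurrogate, and fit$splits lists, for each non-leaf node in turn, the primary split, then its competitor splits, then its surrogate splits, with an "adj" column for the surrogates. The function name `surrogate_importance` and the hand-built `toy` object (which mimics node 1 of the printout so the example runs without fitting a tree) are hypothetical. It uses the raw improve values as in Terry's arithmetic; whether they should be rescaled per node before summing is left open.

```r
# Sketch: primary improvement plus improve * adj for each surrogate,
# summed per variable over all non-leaf nodes.
surrogate_importance <- function(fit) {
  ff <- fit$frame
  node <- which(ff$var != "<leaf>")            # non-leaf nodes
  block_len <- 1 + ff$ncompete[node] + ff$nsurrogate[node]
  first <- cumsum(c(1, block_len))[seq_along(node)]  # row of each primary split

  vars <- character(0)
  pts <- numeric(0)
  for (i in seq_along(node)) {
    imp <- fit$splits[first[i], "improve"]
    vars <- c(vars, as.character(ff$var[node[i]]))
    pts <- c(pts, imp)
    ns <- ff$nsurrogate[node[i]]
    if (ns > 0) {
      # surrogate rows follow the primary and competitor rows
      rows <- first[i] + ff$ncompete[node[i]] + seq_len(ns)
      vars <- c(vars, rownames(fit$splits)[rows])
      pts <- c(pts, imp * fit$splits[rows, "adj"])
    }
  }
  sort(c(tapply(pts, vars, sum)), decreasing = TRUE)
}

# Hypothetical stand-in for an rpart fit: one split node and two leaves.
# In surrogate rows the "improve" column is filled with the agreement here.
toy <- list(
  frame = data.frame(var = c("pat.karno", "<leaf>", "<leaf>"),
                     ncompete = c(2, 0, 0), nsurrogate = c(2, 0, 0)),
  splits = matrix(
    c(0.03661157, 0.03620793, 0.01606491, 0.787, 0.680,
      0, 0, 0, 0.392, 0.089),
    ncol = 2,
    dimnames = list(c("pat.karno", "ph.ecog", "age", "ph.ecog", "age"),
                    c("improve", "adj")))
)

res <- surrogate_importance(toy)
print(res)
```

With a real fit one would call surrogate_importance(fit) directly. Competitor splits are skipped because they were not used in the tree, which differs from summing the whole "improve" column of fit$splits.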