Jason Jones Medical Informatics
2008-Sep-25 22:56 UTC
[R] varimp in party (or randomForest)
Hi,
There is an excellent article at http://www.biomedcentral.com/1471-2105/9/307 by
Strobl et al. describing variable importance in random forests. Does anyone
have any suggestions (besides imputation or removal of cases) for how to deal
with data that *have* missing values for predictor variables?
Below is an excerpt of some code referenced in the article. I have commented
out one line and added one additional line. The code runs beautifully if only
complete cases are included, but when missing data are present it breaks at
the variable importance step (though it still builds the forest).
# From http://www.biomedcentral.com/content/supplementary/1471-2105-8-25-S1.R
require("party")
arabidopsis_url <-
  "http://www.biomedcentral.com/content/supplementary/1471-2105-5-132-S1.txt"
arabidopsis <- read.table(arabidopsis_url, header = TRUE,
                          sep = " ", na.strings = "X")
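# Commented out: restricting to complete cases makes varimp() work
#arabidopsis <- subset(arabidopsis, complete.cases(arabidopsis))
# Added: drop only the rows where the response (edit) is missing
arabidopsis <- subset(arabidopsis, !is.na(arabidopsis$edit))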
arabidopsis <- arabidopsis[, !(names(arabidopsis) %in% c("X0", "loc"))]
my_cforest_control <- cforest_control(teststat = "quad",
                                      testtype = "Univ", mincriterion = 0,
                                      ntree = 50, mtry = 3, replace = TRUE)
my_cforest <- cforest(edit ~ ., data = arabidopsis,
                      controls = my_cforest_control)
varimp_cforest <- varimp(my_cforest)
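
For anyone wanting to see how much data the complete-case restriction throws away, here is a minimal base R sketch. The data frame `toy` and its values are made up for illustration and only stand in for the arabidopsis data; the same two calls should apply to any data frame.

```r
# Toy data standing in for the arabidopsis data (values are invented).
toy <- data.frame(
  edit = factor(c("yes", "no", "yes", "no", "yes")),
  x1   = c(1.2, NA, 3.1, 4.0, 5.5),
  x2   = factor(c("A", "C", NA, "G", "A")),
  x3   = c(10, 20, 30, NA, 50)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that would survive a complete-case filter
n_complete <- sum(complete.cases(toy))
cat("complete cases:", n_complete, "of", nrow(toy), "\n")
```

In this toy example each predictor has only one missing value, yet the complete-case filter drops three of the five rows, which is why I would rather not remove cases.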
By the way, the same issue arises for the randomForest package.
Does anyone have any suggestions? I'm more interested in the variable
importance than the tree per se.
Thanks,
Jason
Jason Jones, PhD
Medical Informatics
j.jones at imail.org
801.707.6898
