Jason Jones Medical Informatics
2008-Sep-25 22:56 UTC
[R] varimp in party (or randomForest)
Hi,

There is an excellent article at http://www.biomedcentral.com/1471-2105/9/307 by Strobl et al. describing variable importance in random forests. Does anyone have any suggestions (besides imputation or removal of cases) for how to deal with data that *have* missing values for predictor variables?

Below is an excerpt of some code referenced in the article. I have commented out one line and added one additional line. The code runs beautifully if only complete cases are included, but when missing data are present it breaks at the variable importance step (though it does build the tree).

# From http://www.biomedcentral.com/content/supplementary/1471-2105-8-25-S1.R
require("party")
arabidopsis_url <- "http://www.biomedcentral.com/content/supplementary/1471-2105-5-132-S1.txt"
arabidopsis <- read.table(arabidopsis_url, header = TRUE, sep = " ", na.strings = "X")
#arabidopsis <- subset(arabidopsis, complete.cases(arabidopsis))
arabidopsis <- subset(arabidopsis, !is.na(edit))
arabidopsis <- arabidopsis[, !(names(arabidopsis) %in% c("X0", "loc"))]
my_cforest_control <- cforest_control(teststat = "quad", testtype = "Univ",
    mincriterion = 0, ntree = 50, mtry = 3, replace = TRUE)
my_cforest <- cforest(edit ~ ., data = arabidopsis, controls = my_cforest_control)
varimp_cforest <- varimp(my_cforest)

By the way, the same issue arises for the randomForest package. Does anyone have any suggestions? I'm more interested in the variable importance than the tree per se.

Thanks,
Jason

Jason Jones, PhD
Medical Informatics
j.jones at imail.org
801.707.6898
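
P.S. For anyone wanting to gauge how many rows the complete-case filter would discard, here is a minimal sketch in base R. It uses a small made-up data frame (`toy` and its columns `x1` and `x2` are hypothetical stand-ins, not the Arabidopsis data) and only counts missing values; it does not address the varimp problem itself.

```r
# Hypothetical toy data standing in for the real predictors
# (assumption: not the Arabidopsis data set).
toy <- data.frame(
  edit = factor(c("yes", "no", "yes", "no", "yes")),
  x1   = c(1.2, NA, 3.1, 0.7, 2.2),
  x2   = factor(c("a", "b", NA, "a", NA))
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows that would survive a complete-case filter.
n_complete <- sum(complete.cases(toy))

print(na_per_column)
cat("complete cases:", n_complete, "of", nrow(toy), "\n")
```

Running the same two calls on the real data frame would show which predictors carry most of the missing values and how much data the commented-out complete.cases() line throws away.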