Richard L. Valliant
2016-Jan-27 23:38 UTC
[R] randomForest and factor predictors--unexpected results
I've been experimenting with the randomForest R package (v. 4.6-12) and am getting an unexpected difference between rpart and randomForest results that may have something to do with using x's that are factors.

The same model (see code below) is used to predict a 2-value variable called "resp" that is treated as a factor. Four of the x's are factors. The rpart predicted probabilities, when computed on the full dataset, average to the same value as mean(resp). This seems OK. The average of the randomForest predicted probabilities is quite a bit different from mean(resp). This seems unexpected, since random forests amount to repeatedly doing variations of what rpart does.

Has anyone seen anything like this, or can anyone see what I am doing wrong? (I did the same comparison using the kyphosis dataset in rpart with all continuous predictors and found consistent average predicted probabilities between rpart and randomForest.)

Here's the code ...

require(PracTools)      # R package with dataset used
require(rpart)
require(randomForest)

data(nhis)              # dataset in PracTools

table(nhis$resp)/nrow(nhis)
#        0         1
#0.3098952 0.6901048

t1 <- rpart(resp ~ age + as.factor(hisp) + as.factor(race) +
                as.factor(parents_r) + as.factor(educ_r),
            method = "class",
            control = rpart.control(minbucket = 50, cp = 0),
            data = nhis)
rpart.prob <- predict(object = t1, newdata = nhis, type = "prob")
apply(rpart.prob, 2, mean)
#        0         1
#0.3098952 0.6901048    mean of rpart predictions same as mean(resp)

rf.nhis <- randomForest(as.factor(resp) ~ age + as.factor(hisp) +
                            as.factor(race) + as.factor(parents_r) +
                            as.factor(educ_r),
                        importance = TRUE,
                        na.action = na.omit,
                        mtry = 5,
                        ntree = 1000,
                        classwt = c(0.31, 0.69),
                        # cycled through mtry = 1,...,5; the lower mtry is,
                        # the worse are the predicted probs
                        data = nhis)
rfnhis.prob <- predict(object = rf.nhis, newdata = nhis, type = "prob")
apply(rfnhis.prob, 2, mean)
#        0         1
#0.2485541 0.7514459    not too close to mean(resp)

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
randomForest_4.6-12

Thanks for any help,
Richard Valliant
Universities of Maryland and Michigan