Richard L. Valliant
2016-Jan-27 23:38 UTC
[R] randomForest and factor predictors--unexpected results
I've been experimenting with the randomForest R package (v. 4.6-12) and am
getting an unexpected difference between rpart and randomForest results that may
have something to do with using x's that are factors.
The same model (see code below) is used to predict a binary variable called
"resp" that is treated as a factor. Four of the x's are factors.
The rpart predicted probabilities average to the same value as mean(resp) when
computed on the full dataset. This seems OK.
The average of the randomForest predicted probabilities is quite a bit different
from mean(resp). This seems unexpected, since random forests amount to repeatedly
doing variations of what rpart does.
Has anyone seen anything like this, or can anyone see what I am doing wrong?
(I did the same comparison using the kyphosis dataset in rpart with all
continuous predictors and found consistent average predicted probabilities
between rpart and randomForest.)
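A minimal sketch of that kyphosis check follows. The seed, ntree value, and default rpart settings are illustrative assumptions, not necessarily the ones used for the comparison described above.

```r
library(rpart)          # also supplies the kyphosis data
library(randomForest)

data(kyphosis)
set.seed(1)             # arbitrary seed, only for reproducibility

# Observed class proportions of the response
obs <- c(prop.table(table(kyphosis$Kyphosis)))

# Single classification tree, predictions averaged over the training data
t.ky <- rpart(Kyphosis ~ Age + Number + Start,
              method = "class", data = kyphosis)
rpart.ky <- colMeans(predict(t.ky, newdata = kyphosis, type = "prob"))

# Random forest with the same formula, predictions averaged the same way
rf.ky <- randomForest(Kyphosis ~ Age + Number + Start,
                      data = kyphosis, ntree = 1000)
rf.ky.prob <- colMeans(predict(rf.ky, newdata = kyphosis, type = "prob"))

# Side-by-side comparison of the three sets of proportions
print(rbind(observed = obs, rpart = rpart.ky, randomForest = rf.ky.prob))
```

All three predictors here (Age, Number, Start) are numeric, so this run involves no factor x's.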
Here's the code ...
require(PracTools) # R package with dataset used
require(rpart)
require(randomForest)
data(nhis) # dataset in PracTools
table(nhis$resp)/nrow(nhis)
# 0 1
#0.3098952 0.6901048
t1 <- rpart(resp ~ age + as.factor(hisp) + as.factor(race) +
              as.factor(parents_r) + as.factor(educ_r),
            method = "class",
            control = rpart.control(minbucket = 50, cp = 0),
            data = nhis)
rpart.prob <- predict(object = t1, newdata = nhis, type = "prob")
apply(rpart.prob,2,mean)
# 0 1
#0.3098952 0.6901048 mean of rpart predictions same as mean(resp)
rf.nhis <- randomForest(as.factor(resp) ~ age + as.factor(hisp) +
                          as.factor(race) + as.factor(parents_r) +
                          as.factor(educ_r),
                        importance = TRUE, na.action = na.omit, mtry = 5,
                        ntree = 1000, classwt = c(0.31, 0.69),
                        # cycled through mtry = 1,...,5; the lower mtry is,
                        # the worse the predicted probs are
                        data = nhis)
rfnhis.prob <- predict(object = rf.nhis, newdata = nhis, type = "prob")
apply(rfnhis.prob,2,mean)
# 0 1
#0.2485541 0.7514459 not too close to mean(resp)
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
randomForest_4.6-12
Thanks for any help,
Richard Valliant
Universities of Maryland and Michigan