Dimitri Liakhovitski
2014-Jan-27 23:47 UTC
[R] Predictor Importance in Random Forests and bootstrap
Hello!
Below, I:
1. Create a data set with a bunch of factors. All of them are predictors
and 'y' is the dependent variable.
2. Run a classification Random Forest with predictor importance. I
look at 2 measures of importance: MeanDecreaseAccuracy and MeanDecreaseGini.
3. Run 2 bootstrap runs, one for each of the 2 Random Forest importance
measures mentioned above.
Question: Could anyone please explain why I am getting such a huge positive
bias across the board (for all predictors) for MeanDecreaseAccuracy?
Thanks a lot!
Dimitri
#----------------------------------------------------------------
# Creating a data set:
#-------------------------------------------------------------
N<-1000
myset1<-c(1,2,3,4,5)
probs1a<-c(.05,.10,.15,.40,.30)
probs1b<-c(.05,.15,.10,.30,.40)
probs1c<-c(.05,.05,.10,.15,.65)
myset2<-c(1,2,3,4,5,6,7)
probs2a<-c(.02,.03,.10,.15,.20,.30,.20)
probs2b<-c(.02,.03,.10,.15,.20,.20,.30)
probs2c<-c(.02,.03,.10,.10,.10,.25,.40)
myset.y<-c(1,2)
probs.y<-c(.65,.30)
set.seed(1)
y<-as.factor(sample(myset.y,N,replace=TRUE,probs.y))
set.seed(2)
a<-as.factor(sample(myset1, N, replace = TRUE,probs1a))
set.seed(3)
b<-as.factor(sample(myset1, N, replace = TRUE,probs1b))
set.seed(4)
c<-as.factor(sample(myset1, N, replace = TRUE,probs1c))
set.seed(5)
d<-as.factor(sample(myset2, N, replace = TRUE,probs2a))
set.seed(6)
e<-as.factor(sample(myset2, N, replace = TRUE,probs2b))
set.seed(7)
f<-as.factor(sample(myset2, N, replace = TRUE,probs2c))
mydata<-data.frame(a,b,c,d,e,f,y)
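A side note on the data set above: probs.y sums to 0.95 rather than 1. sample() treats its prob argument as weights that need not sum to one, so the class proportions actually drawn are the weights divided by their sum. The sketch below only illustrates that rescaling; the names effective and y_check are introduced here for illustration and are not part of the original post.

```r
# Weights as given in the post; they sum to 0.95, not 1.
probs.y <- c(.65, .30)

# sample() treats `prob` as weights, so the effective class
# probabilities are the weights divided by their sum.
effective <- probs.y / sum(probs.y)
print(round(effective, 3))

# Empirical check using the same seed and sample size as the post.
# (y_check is a hypothetical name used only in this sketch.)
set.seed(1)
y_check <- sample(c(1, 2), 1000, replace = TRUE, prob = probs.y)
print(prop.table(table(y_check)))
```

If exact 65/35 proportions were intended, the weights would need to be changed; the post does not say either way.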
#-------------------------------------------------------------
# Single Random Forests run with predictor importance.
#-------------------------------------------------------------
library(randomForest)
set.seed(123)
rf1<-randomForest(y~.,data=mydata,importance=TRUE)
importance(rf1)[,c(3:4)]
#-------------------------------------------------------------
# Bootstrapping run
#-------------------------------------------------------------
library(boot)
### Defining two functions to be used for bootstrapping:
# myrf3 returns MeanDecreaseAccuracy:
myrf3<-function(usedata,idx){
set.seed(123)
out<-randomForest(y~.,data=usedata[idx,],importance=TRUE)
return(importance(out)[,3])
}
# myrf4 returns MeanDecreaseGini:
myrf4<-function(usedata,idx){
set.seed(123)
out<-randomForest(y~.,data=usedata[idx,],importance=TRUE)
return(importance(out)[,4])
}
### 2 bootstrap runs:
rfboot3<-boot(mydata,myrf3,R=10)
rfboot4<-boot(mydata,myrf4,R=10)
### Results
rfboot3 # for MeanDecreaseAccuracy
colMeans(rfboot3$t)-importance(rf1)[,3]
rfboot4 # for MeanDecreaseGini
colMeans(rfboot4$t)-importance(rf1)[,4] # for MeanDecreaseGini
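One thing that may be worth checking when interpreting the bias: each data set that boot() hands to the statistic function is a resample with replacement, so it contains repeated rows, and randomForest() by default draws its own bootstrap sample per tree on top of that. A row that is out-of-bag for a tree can therefore have an identical copy in-bag. Whether this accounts for the MeanDecreaseAccuracy bias seen here is an assumption, not something established in this thread. The base-R sketch below only shows how many distinct rows and how many repeated rows a single resample of N = 1000 contains; idx, frac_unique, expected and n_dup_rows are illustrative names.

```r
# Draw one index vector by resampling N rows with replacement,
# the kind of resample boot() produces by default.
set.seed(1)
N <- 1000
idx <- sample(N, N, replace = TRUE)

# Share of the original rows that appear at least once in the resample.
frac_unique <- length(unique(idx)) / N
print(frac_unique)

# Theoretical expectation: 1 - (1 - 1/N)^N, close to 1 - exp(-1).
expected <- 1 - (1 - 1 / N)^N
print(expected)

# Number of original rows that appear more than once in the resample.
n_dup_rows <- sum(table(idx) > 1)
print(n_dup_rows)
```

A possible follow-up, again only a suggestion, would be to compare the bootstrap results against repeated subsampling without replacement, where no row can be duplicated.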
--
Dimitri Liakhovitski
Bert Gunter
2014-Jan-28 00:09 UTC
[R] Predictor Importance in Random Forests and bootstrap
I **think** this kind of methodological issue might be better at SO
(stats.stackexchange.com). It's not really about R programming, which
is the main focus of this list. And yes, I know they do intersect.
Nevertheless...
Cheers,
Bert
Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374
"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
H. Gilbert Welch