Dimitri Liakhovitski
2014-Jan-27 23:47 UTC
[R] Predictor Importance in Random Forests and bootstrap
Hello!
Below, I:
1. Create a data set with a bunch of factors. All of them are predictors and 'y' is the dependent variable.
2. Run a classification Random Forests run with predictor importance. I look at 2 measures of importance - MeanDecreaseAccuracy and MeanDecreaseGini.
3. Run 2 bootstrap runs for the 2 Random Forests measures of importance mentioned above.

Question: Could anyone please explain why I am getting such a huge positive bias across the board (for all predictors) for MeanDecreaseAccuracy?

Thanks a lot!
Dimitri

#-------------------------------------------------------------
# Creating a data set:
#-------------------------------------------------------------

N<-1000
myset1<-c(1,2,3,4,5)
probs1a<-c(.05,.10,.15,.40,.30)
probs1b<-c(.05,.15,.10,.30,.40)
probs1c<-c(.05,.05,.10,.15,.65)
myset2<-c(1,2,3,4,5,6,7)
probs2a<-c(.02,.03,.10,.15,.20,.30,.20)
probs2b<-c(.02,.03,.10,.15,.20,.20,.30)
probs2c<-c(.02,.03,.10,.10,.10,.25,.40)
myset.y<-c(1,2)
probs.y<-c(.65,.30)

set.seed(1)
y<-as.factor(sample(myset.y,N,replace=TRUE,probs.y))
set.seed(2)
a<-as.factor(sample(myset1, N, replace = TRUE,probs1a))
set.seed(3)
b<-as.factor(sample(myset1, N, replace = TRUE,probs1b))
set.seed(4)
c<-as.factor(sample(myset1, N, replace = TRUE,probs1c))
set.seed(5)
d<-as.factor(sample(myset2, N, replace = TRUE,probs2a))
set.seed(6)
e<-as.factor(sample(myset2, N, replace = TRUE,probs2b))
set.seed(7)
f<-as.factor(sample(myset2, N, replace = TRUE,probs2c))

mydata<-data.frame(a,b,c,d,e,f,y)

#-------------------------------------------------------------
# Single Random Forests run with predictor importance.
#-------------------------------------------------------------

library(randomForest)
set.seed(123)
rf1<-randomForest(y~.,data=mydata,importance=TRUE)
importance(rf1)[,c(3:4)]

#-------------------------------------------------------------
# Bootstrapping run
#-------------------------------------------------------------

library(boot)

### Defining two functions to be used for bootstrapping:

# myrf3 returns MeanDecreaseAccuracy:
myrf3<-function(usedata,idx){
  set.seed(123)
  out<-randomForest(y~.,data=usedata[idx,],importance=TRUE)
  return(importance(out)[,3])
}

# myrf4 returns MeanDecreaseGini:
myrf4<-function(usedata,idx){
  set.seed(123)
  out<-randomForest(y~.,data=usedata[idx,],importance=TRUE)
  return(importance(out)[,4])
}

### 2 bootstrap runs:
rfboot3<-boot(mydata,myrf3,R=10)
rfboot4<-boot(mydata,myrf4,R=10)

### Results
rfboot3 # for MeanDecreaseAccuracy
colMeans(rfboot3$t)-importance(rf1)[,3]

rfboot4 # for MeanDecreaseGini
colMeans(rfboot4$t)-importance(rf1)[,4] # for MeanDecreaseGini

--
Dimitri Liakhovitski
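For readers following along: the "bias" that boot prints is the mean of the bootstrap replicates minus the statistic evaluated on the original data (stored as t0), which is the same kind of difference the post computes by hand with colMeans(rfboot3$t)-importance(rf1)[,3]. The sketch below shows that relationship on a toy statistic. The data x and the function med are hypothetical and only for illustration; they are not part of the original post and say nothing about why the Random Forests importance shows the bias asked about.

```r
library(boot)  # recommended package, shipped with standard R installations

# Toy data (hypothetical, for illustration only)
set.seed(1)
x <- rexp(200)

# boot() calls the statistic as statistic(data, indices)
med <- function(data, idx) median(data[idx])

b <- boot(x, med, R = 500)

# t0 is the statistic on the original data, i.e. with indices 1:n
all.equal(b$t0, med(x, seq_along(x)))

# The "bias" column printed by boot: mean of the replicates minus t0
bias <- colMeans(b$t) - b$t0
bias
```

With R=10, as in the post, the mean of the replicates is itself quite noisy, so a larger R may be worth trying before reading much into the size of the reported bias.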
Bert Gunter
2014-Jan-28 00:09 UTC
[R] Predictor Importance in Random Forests and bootstrap
I **think** this kind of methodological issue might be better at SO (stats.stackexchange.com). It's not really about R programming, which is the main focus of this list. And yes, I know they do intersect. Nevertheless...

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
H. Gilbert Welch

On Mon, Jan 27, 2014 at 3:47 PM, Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com> wrote:
> Hello!
> Below, I:
> 1. Create a data set with a bunch of factors. All of them are predictors
> and 'y' is the dependent variable.
> 2. I run a classification Random Forests run with predictor importance. I
> look at 2 measures of importance - MeanDecreaseAccuracy and MeanDecreaseGini
> 3. I run 2 bootstrap runs for 2 Random Forests measures of importance
> mentioned above.
>
> Question: Could anyone please explain why I am getting such a huge positive
> bias across the board (for all predictors) for MeanDecreaseAccuracy?
>
> Thanks a lot!