Sam Albers
2012-Mar-03 02:19 UTC
[R] Strategies to deal with unbalanced classification data in randomForest
Hello all,

I have become somewhat confused by the options available for dealing with a highly unbalanced data set (10000 in one class, 50 in the other). In summary, I am unsure:

a) whether I am performing the two class weighting methods properly,
b) whether the data are too unbalanced for this type of analysis to be appropriate, and
c) whether there is any interaction between the weighting for class imbalance and the number of trees in a forest.

An example will illustrate this best. Say I have a data set like the following:

df <- rbind(
  data.frame(var1=runif(10000, 10, 50),
             var2=runif(10000, -3, 3),
             var3=runif(10000, 0.1, 0.25),
             cls=factor("CLASS-1")
  ),
  data.frame(var1=runif(50, 10, 50),
             var2=runif(50, 2, 7),
             var3=runif(50, 0.2, 0.35),
             cls=factor("CLASS-2")
  )
)

## Where the response vector is highly imbalanced like so:
summary(df$cls)

library(randomForest)
set.seed(17)

## Now this is obviously an extreme case, but I am wondering what the options are to deal with something like this.
## The problem with this situation manifests itself when I try to train a random forest
## without accounting for this imbalance
df.rf <- randomForest(cls~var1+var2+var3, data=df, importance=TRUE)

## Now one option is to down-sample the majority class. However, I can't seem to find exactly
## how to do this. Does this seem correct?
df.rf.downsamp <- randomForest(cls~var1+var2+var3, data=df, sampsize=c(50,50), importance=TRUE)
## 50 being the number of observations in the minority class

## The other option, over which there seems to be some confusion, is to establish some class weights
## to balance the error rate.
## This approach I've mostly drawn from here:
## http://stat-www.berkeley.edu/users/breiman/RandomForests/cc_home.htm#balance
## This might not be appropriate, however, as of September it looks like Breiman's method wasn't used in R
df.rf.weights <- randomForest(cls~var1+var2+var3, data=df, classwt=c(1, 600), importance=TRUE)

## Nevertheless, what I am concerned about is the effect that an unbalanced data set has on my randomForest model
## For example:
par(mfrow=c(1,3))
plot(df.rf)
plot(df.rf.downsamp)
plot(df.rf.weights)

presents three very different scenarios, and I am having trouble resolving the issues I mentioned above.

I am extremely grateful for all the work that has been done on randomForest in R up to this point. I was hoping that someone with more experience might be able to advise what the best strategy is to deal with this problem. Which of these approaches is best, and am I using them correctly?

Thanks so much in advance for any help.

Sam

> sessionInfo()
R version 2.14.2 (2012-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8
 [4] LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] ggplot2_0.8.9 plyr_1.7.1 tools_2.14.2
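
For the down-sampling question, one possible way to make the stratification explicit is sketched below. It is only a sketch, not a definitive answer: it assumes the `strata` and `sampsize` arguments of `randomForest()` behave as documented (a `sampsize` vector with one entry per stratum, given in the order of the factor levels), and the names `n.min`, `samp`, and `df.rf.strat` are made up for illustration. It also moves `set.seed()` ahead of the simulation so the data are reproducible, which differs from the post above. Printing the per-class OOB error at a few forest sizes gives a rough view of how the number of trees interacts with the balanced sampling.

```r
library(randomForest)

set.seed(17)  # seed before simulating so the whole example is reproducible

# Same simulated data as in the post above
df <- rbind(
  data.frame(var1 = runif(10000, 10, 50), var2 = runif(10000, -3, 3),
             var3 = runif(10000, 0.1, 0.25), cls = factor("CLASS-1")),
  data.frame(var1 = runif(50, 10, 50), var2 = runif(50, 2, 7),
             var3 = runif(50, 0.2, 0.35), cls = factor("CLASS-2"))
)

# Size of the smallest class; each tree draws this many rows from every class
n.min <- min(table(df$cls))
samp  <- rep(n.min, nlevels(df$cls))

# Stratify explicitly on the response; sampsize entries follow levels(df$cls)
df.rf.strat <- randomForest(cls ~ var1 + var2 + var3, data = df,
                            strata = df$cls, sampsize = samp,
                            ntree = 500, importance = TRUE)

# Per-class OOB error is more informative than the overall OOB rate here
print(df.rf.strat$confusion)

# OOB error (overall and per class) after 50, 100, and 500 trees
print(df.rf.strat$err.rate[c(50, 100, 500), ])
```

With the default `replace=TRUE`, each tree here sees 50 draws from each class, so the minority class is no longer swamped within any single tree, while different trees still see different subsets of the majority class.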