Dear all,

I am trying to train a randomForest using all my control data (12,000 cases, ~20 explanatory variables, 2 classes). Because of memory constraints, I have split my data into 7 subsets and trained a randomForest for each, hoping that using combine() afterwards would solve the memory issue. Unfortunately, combine() still runs out of memory. Is there anything else I can do? (I am not using the formula version.)

Many thanks,
Eleni Rapsomaniki
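A minimal sketch of the split-and-combine approach described above (the object names train.x and train.y and the chunking scheme are placeholders for illustration, not objects from the original post):

library(randomForest)

## split the rows into 7 roughly equal, randomly assigned chunks
chunk <- sample(rep(1:7, length.out = nrow(train.x)))

## grow a modest forest on each chunk, keeping the forest component
rf.list <- lapply(1:7, function(i)
    randomForest(x = train.x[chunk == i, ], y = train.y[chunk == i],
                 ntree = 100, keep.forest = TRUE))

## merge the seven forests into a single 700-tree forest
rf.all <- do.call(combine, rf.list)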
From: Eleni Rapsomaniki

> I'm using R (windows) version 2.1.1, randomForest version 4.15.
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^
Never seen such a version...

> I call randomForest like this:
>
> my.rf = randomForest(x=train.df[,-response_index],
>                      y=train.df[,response_index],
>                      xtest=test.df[,-response_index],
>                      ytest=test.df[,response_index],
>                      importance=TRUE, proximity=FALSE, keep.forest=TRUE)
>
> (where train.df and test.df are my train and test data.frames and
> response_index is the column number specifying the class)
>
> I then save each tree to a file so I can combine them all afterwards.
> There are no memory issues when keep.forest=FALSE. But I think that's
> the bit I need for future predictions (right?).

Yes, but what is your question? (Do you mean each *forest*, instead of
each *tree*?)

> I did check previous messages on memory issues, and thought that
> combining the trees afterwards would solve the problem. Since my
> cross-validation subsets give me a fairly stable error rate, I suppose
> I could just use a randomForest trained on just a subset of my data.
> But would I not be "wasting" data this way?

Perhaps, but see Jerry Friedman's ISLE, where he argued that RF with
very small trees grown on small random samples can give even better
results some of the time.

> A bit off the subject, but should the order in which rows (i.e. sets of
> explanatory variables) are passed to the randomForest function affect
> the result? I have noticed that if I pick a random unordered sample
> from my control data for training, the error rate is much lower than if
> I take an ordered sample. This remains true for all my cross-validation
> results.

I'm not sure I understand. In randomForest() (as in other functions)
variables are in columns, rather than rows, so are you talking about
variables (columns) in different order or data (rows) in different
order?

Andy

> I'm sorry for my many questions.
> Many thanks
> Eleni Rapsomaniki
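Along the lines of the ISLE idea mentioned above, something like the following would grow small trees on small random samples (a sketch only; the sampsize and maxnodes values are arbitrary, and maxnodes assumes a reasonably recent version of the randomForest package):

library(randomForest)

## each tree sees only 500 randomly drawn rows and is limited to 16
## terminal nodes, which also keeps the stored forest small
small.rf <- randomForest(x = train.df[, -response_index],
                         y = train.df[, response_index],
                         ntree = 500,
                         sampsize = 500,
                         maxnodes = 16,
                         keep.forest = TRUE)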
From: Eleni Rapsomaniki

> Hi Andy,
>
> > > I'm using R (windows) version 2.1.1, randomForest version 4.15.
> > >                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
> > Never seen such a version...
>
> Ooops! I meant 4.5-15
>
> > > I then save each tree to a file so I can combine them all
> > > afterwards. There are no memory issues when keep.forest=FALSE.
> > > But I think that's the bit I need for future predictions (right?).
> >
> > Yes, but what is your question? (Do you mean each *forest*, instead
> > of each *tree*?)
>
> I mean the component of the object that is created from randomForest
> that has the name "forest" (and takes up all the memory!).

Yes, the forest can take up quite a bit of space. You might consider
setting nodesize larger and see if that gives you sufficient space
saving w/o compromising prediction performance.

> > > A bit off the subject, but should the order in which rows (i.e.
> > > sets of explanatory variables) are passed to the randomForest
> > > function affect the result? I have noticed that if I pick a random
> > > unordered sample from my control data for training, the error rate
> > > is much lower than if I take an ordered sample. This remains true
> > > for all my cross-validation results.
> >
> > I'm not sure I understand. In randomForest() (as in other functions)
> > variables are in columns, rather than rows, so are you talking about
> > variables (columns) in different order or data (rows) in different
> > order?
>
> Yes, sorry I confused you. I mean the order in which data (rows) are
> passed, not columns.

Then I'm not sure what you mean by the difference in performance, even
in cross-validation. Perhaps you can show some example? Each tree in
the forest is grown on a random sample of the data, so the order of the
rows cannot matter.

> Finally, I see from
> http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#inter
> that there is a component in Breiman's implementation of randomForest
> that computes interactions between parameters. Has this been
> implemented in R yet?

No. Prof. Breiman told me that it is very experimental, and he wouldn't
mind if that doesn't make it into the R package. Since I have other
priorities for the package, that naturally went to the back burner.

Cheers,
Andy

> Many thanks for your time and help.
> Eleni Rapsomaniki
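To illustrate the nodesize suggestion (the value 50 is arbitrary; the point is only to compare how much memory the stored forest component takes):

library(randomForest)

rf.big.nodes <- randomForest(x = train.df[, -response_index],
                             y = train.df[, response_index],
                             nodesize = 50,   # default for classification is 1
                             keep.forest = TRUE)

## size (in bytes) of the forest component alone
object.size(rf.big.nodes$forest)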
Hello again,

The reason why I thought the order in which rows are passed to randomForest affects the error rate is because I get different results for different ways of splitting my positive/negative data.

First get the data (attached with this email):

pos.df = read.table("C:/Program Files/R/rw2011/pos.df", header=T)
neg.df = read.table("C:/Program Files/R/rw2011/neg.df", header=T)
library(randomForest)

# The first 2 columns are explanatory variables (which incidentally are not
# discriminative at all if one looks at their distributions); the 3rd is the
# class (pos or neg)
train2test.ratio = 8/10
min_len = min(nrow(pos.df), nrow(neg.df))
class_index = which(names(pos.df) == "class")  # is the same for neg.df
train_size = as.integer(min_len * train2test.ratio)

############ Way 1
train.indicesP = sample(1:nrow(pos.df), size=train_size, replace=FALSE)
train.indicesN = sample(1:nrow(neg.df), size=train_size, replace=FALSE)
trainP = pos.df[train.indicesP, ]
trainN = neg.df[train.indicesN, ]
testP = pos.df[-train.indicesP, ]
testN = neg.df[-train.indicesN, ]
mydata.rf <- randomForest(x=rbind(trainP, trainN)[, -class_index],
                          y=rbind(trainP, trainN)[, class_index],
                          xtest=rbind(testP, testN)[, -class_index],
                          ytest=rbind(testP, testN)[, class_index],
                          importance=TRUE, proximity=FALSE, keep.forest=FALSE)
mydata.rf$test$confusion

############ Way 2
ind <- sample(2, min(nrow(pos.df), nrow(neg.df)), replace=TRUE,
              prob=c(train2test.ratio, 1 - train2test.ratio))
trainP = pos.df[ind == 1, ]
trainN = neg.df[ind == 1, ]
testP = pos.df[ind == 2, ]
testN = neg.df[ind == 2, ]
mydata.rf <- randomForest(x=rbind(trainP, trainN)[, -class_index],
                          y=rbind(trainP, trainN)[, class_index],
                          xtest=rbind(testP, testN)[, -class_index],
                          ytest=rbind(testP, testN)[, class_index],
                          importance=TRUE, proximity=FALSE, keep.forest=FALSE)
mydata.rf$test$confusion

############ Way 3
subset_start = 1
subset_end = subset_start + train_size - 1
train_index = subset_start:subset_end
trainP = pos.df[train_index, ]
trainN = neg.df[train_index, ]
testP = pos.df[-train_index, ]
testN = neg.df[-train_index, ]
mydata.rf <- randomForest(x=rbind(trainP, trainN)[, -class_index],
                          y=rbind(trainP, trainN)[, class_index],
                          xtest=rbind(testP, testN)[, -class_index],
                          ytest=rbind(testP, testN)[, class_index],
                          importance=TRUE, proximity=FALSE, keep.forest=FALSE)
mydata.rf$test$confusion
############ end

The first two methods give me an abnormally low error rate (compared to what I get using the same data with a naiveBayes method), while the last one seems more realistic, but the difference in error rates is very significant. I need to use the last method to cross-validate subsets of my data sequentially (the first two methods use random rows throughout the length of the data), unless there is a better way to do it (?). Something must be very different between the first two methods and the last, but which is the correct one? I would greatly appreciate any suggestions on this!

Many thanks,
Eleni Rapsomaniki
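On the "better way" question above, one alternative to taking sequential blocks is to assign the rows of each class to random folds and cross-validate over those. A sketch only, reusing pos.df, neg.df and class_index from the code above (the fold count k = 5 is arbitrary):

library(randomForest)

k <- 5
foldsP <- sample(rep(1:k, length.out = nrow(pos.df)))
foldsN <- sample(rep(1:k, length.out = nrow(neg.df)))

err <- numeric(k)
for (i in 1:k) {
    train <- rbind(pos.df[foldsP != i, ], neg.df[foldsN != i, ])
    test  <- rbind(pos.df[foldsP == i, ], neg.df[foldsN == i, ])
    fit <- randomForest(x = train[, -class_index], y = train[, class_index],
                        xtest = test[, -class_index], ytest = test[, class_index],
                        keep.forest = FALSE)
    ## misclassification rate on the held-out fold
    err[i] <- mean(fit$test$predicted != test[, class_index])
}
mean(err)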