Dear all,

I am trying to train a randomForest using all my control data (12,000 cases, ~20 explanatory variables, 2 classes). Because of memory constraints, I have split my data into 7 subsets and trained a randomForest for each, hoping that using combine() afterwards would solve the memory issue. Unfortunately, combine() still runs out of memory. Is there anything else I can do? (I am not using the formula version.)

Many thanks,
Eleni Rapsomaniki
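A minimal sketch of the split-and-combine approach described above (the object names train.x and train.y and the chunking scheme are placeholders for illustration, not objects from the original post):

library(randomForest)

## split the rows into 7 roughly equal, randomly assigned chunks
chunk <- sample(rep(1:7, length.out = nrow(train.x)))

## grow a modest forest on each chunk, keeping the forest component
rf.list <- lapply(1:7, function(i)
    randomForest(x = train.x[chunk == i, ], y = train.y[chunk == i],
                 ntree = 100, keep.forest = TRUE))

## merge the seven forests into a single 700-tree forest
rf.all <- do.call(combine, rf.list)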
From: Eleni Rapsomaniki

> I'm using R (windows) version 2.1.1, randomForest version 4.15.
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^
Never seen such a version...

> I call randomForest like this:
>
> my.rf = randomForest(x=train.df[,-response_index],
>                      y=train.df[,response_index],
>                      xtest=test.df[,-response_index],
>                      ytest=test.df[,response_index],
>                      importance=TRUE, proximity=FALSE, keep.forest=TRUE)
>
> (where train.df and test.df are my train and test data.frames and
> response_index is the column number specifying the class)
>
> I then save each tree to a file so I can combine them all afterwards.
> There are no memory issues when keep.forest=FALSE. But I think that's
> the bit I need for future predictions (right?).

Yes, but what is your question? (Do you mean each *forest*, instead of
each *tree*?)

> I did check previous messages on memory issues, and thought that
> combining the trees afterwards would solve the problem. Since my
> cross-validation subsets give me a fairly stable error rate, I suppose
> I could just use a randomForest trained on just a subset of my data.
> But would I not be "wasting" data this way?

Perhaps, but see Jerry Friedman's ISLE, where he argued that RF with
very small trees grown on small random samples can give even better
results some of the time.

> A bit off the subject, but should the order in which rows (i.e. sets of
> explanatory variables) are passed to the randomForest function affect
> the result? I have noticed that if I pick a random unordered sample
> from my control data for training, the error rate is much lower than if
> I take an ordered sample. This remains true for all my cross-validation
> results.

I'm not sure I understand. In randomForest() (as in other functions)
variables are in columns, rather than rows, so are you talking about
variables (columns) in different order or data (rows) in different
order?

Andy

> I'm sorry for my many questions.
> Many thanks
> Eleni Rapsomaniki
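Along the lines of the ISLE idea mentioned above, something like the following would grow small trees on small random samples (a sketch only; the sampsize and maxnodes values are arbitrary, and maxnodes assumes a reasonably recent version of the randomForest package):

library(randomForest)

## each tree sees only 500 randomly drawn rows and is limited to 16
## terminal nodes, which also keeps the stored forest small
small.rf <- randomForest(x = train.df[, -response_index],
                         y = train.df[, response_index],
                         ntree = 500,
                         sampsize = 500,
                         maxnodes = 16,
                         keep.forest = TRUE)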
From: Eleni Rapsomaniki

> Hi Andy,
>
> > > I'm using R (windows) version 2.1.1, randomForest version 4.15.
> > >                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
> > Never seen such a version...
>
> Ooops! I meant 4.5-15
>
> > > I then save each tree to a file so I can combine them all
> > > afterwards. There are no memory issues when keep.forest=FALSE.
> > > But I think that's the bit I need for future predictions (right?).
> >
> > Yes, but what is your question? (Do you mean each *forest*, instead
> > of each *tree*?)
>
> I mean the component of the object that is created from randomForest
> that has the name "forest" (and takes up all the memory!).

Yes, the forest can take up quite a bit of space. You might consider
setting nodesize larger and see if that gives you sufficient space
saving w/o compromising prediction performance.

> > > A bit off the subject, but should the order in which rows (i.e.
> > > sets of explanatory variables) are passed to the randomForest
> > > function affect the result? I have noticed that if I pick a random
> > > unordered sample from my control data for training, the error rate
> > > is much lower than if I take an ordered sample. This remains true
> > > for all my cross-validation results.
> >
> > I'm not sure I understand. In randomForest() (as in other functions)
> > variables are in columns, rather than rows, so are you talking about
> > variables (columns) in different order or data (rows) in different
> > order?
>
> Yes, sorry I confused you. I mean the order in which data (rows) are
> passed, not columns.

Then I'm not sure what you mean by the difference in performance, even
in cross-validation. Perhaps you can show some example? Each tree in
the forest is grown on a random sample of the data, so the order of the
rows cannot matter.

> Finally, I see from
> http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#inter
> that there is a component in Breiman's implementation of randomForest
> that computes interactions between parameters. Has this been
> implemented in R yet?

No. Prof. Breiman told me that it is very experimental, and he wouldn't
mind if that doesn't make it into the R package. Since I have other
priorities for the package, that naturally went to the back burner.

Cheers,
Andy

> Many thanks for your time and help.
> Eleni Rapsomaniki
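To illustrate the nodesize suggestion (the value 50 is arbitrary; the point is only to compare how much memory the stored forest component takes):

library(randomForest)

rf.big.nodes <- randomForest(x = train.df[, -response_index],
                             y = train.df[, response_index],
                             nodesize = 50,   # default for classification is 1
                             keep.forest = TRUE)

## size (in bytes) of the forest component alone
object.size(rf.big.nodes$forest)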
Hello again,

The reason why I thought the order in which rows are passed to randomForest affects the error rate is because I get different results for different ways of splitting my positive/negative data.

First get the data (attached with this email):

pos.df = read.table("C:/Program Files/R/rw2011/pos.df", header=T)
neg.df = read.table("C:/Program Files/R/rw2011/neg.df", header=T)
library(randomForest)

# The first 2 columns are explanatory variables (which incidentally are not
# discriminative at all if one looks at their distributions); the 3rd is the
# class (pos or neg)
train2test.ratio = 8/10
min_len = min(nrow(pos.df), nrow(neg.df))
class_index = which(names(pos.df) == "class")  # is the same for neg.df
train_size = as.integer(min_len * train2test.ratio)

############ Way 1
train.indicesP = sample(1:nrow(pos.df), size=train_size, replace=FALSE)
train.indicesN = sample(1:nrow(neg.df), size=train_size, replace=FALSE)
trainP = pos.df[train.indicesP, ]
trainN = neg.df[train.indicesN, ]
testP = pos.df[-train.indicesP, ]
testN = neg.df[-train.indicesN, ]
mydata.rf <- randomForest(x=rbind(trainP, trainN)[, -class_index],
                          y=rbind(trainP, trainN)[, class_index],
                          xtest=rbind(testP, testN)[, -class_index],
                          ytest=rbind(testP, testN)[, class_index],
                          importance=TRUE, proximity=FALSE, keep.forest=FALSE)
mydata.rf$test$confusion

############ Way 2
ind <- sample(2, min(nrow(pos.df), nrow(neg.df)), replace=TRUE,
              prob=c(train2test.ratio, 1 - train2test.ratio))
trainP = pos.df[ind == 1, ]
trainN = neg.df[ind == 1, ]
testP = pos.df[ind == 2, ]
testN = neg.df[ind == 2, ]
mydata.rf <- randomForest(x=rbind(trainP, trainN)[, -class_index],
                          y=rbind(trainP, trainN)[, class_index],
                          xtest=rbind(testP, testN)[, -class_index],
                          ytest=rbind(testP, testN)[, class_index],
                          importance=TRUE, proximity=FALSE, keep.forest=FALSE)
mydata.rf$test$confusion

############ Way 3
subset_start = 1
subset_end = subset_start + train_size - 1
train_index = subset_start:subset_end
trainP = pos.df[train_index, ]
trainN = neg.df[train_index, ]
testP = pos.df[-train_index, ]
testN = neg.df[-train_index, ]
mydata.rf <- randomForest(x=rbind(trainP, trainN)[, -class_index],
                          y=rbind(trainP, trainN)[, class_index],
                          xtest=rbind(testP, testN)[, -class_index],
                          ytest=rbind(testP, testN)[, class_index],
                          importance=TRUE, proximity=FALSE, keep.forest=FALSE)
mydata.rf$test$confusion
############ end

The first two methods give me an abnormally low error rate (compared to what I get using the same data with a naiveBayes method), while the last one seems more realistic, but the difference in error rates is very significant. I need to use the last method to cross-validate subsets of my data sequentially (the first two methods use random rows throughout the length of the data), unless there is a better way to do it (?). Something must be very different between the first two methods and the last, but which is the correct one? I would greatly appreciate any suggestions on this!

Many thanks,
Eleni Rapsomaniki
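On the "better way" question above, one alternative to taking sequential blocks is to assign the rows of each class to random folds and cross-validate over those. A sketch only, reusing pos.df, neg.df and class_index from the code above (the fold count k = 5 is arbitrary):

library(randomForest)

k <- 5
foldsP <- sample(rep(1:k, length.out = nrow(pos.df)))
foldsN <- sample(rep(1:k, length.out = nrow(neg.df)))

err <- numeric(k)
for (i in 1:k) {
    train <- rbind(pos.df[foldsP != i, ], neg.df[foldsN != i, ])
    test  <- rbind(pos.df[foldsP == i, ], neg.df[foldsN == i, ])
    fit <- randomForest(x = train[, -class_index], y = train[, class_index],
                        xtest = test[, -class_index], ytest = test[, class_index],
                        keep.forest = FALSE)
    ## misclassification rate on the held-out fold
    err[i] <- mean(fit$test$predicted != test[, class_index])
}
mean(err)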