thr3ads.net - R help - [R] Can't seem to finish a randomForest.... Just goes and goes! [Apr 2004]

Liaw, Andy

2004-Apr-05 00:07 UTC

[R] Can't seem to finish a randomForest.... Just goes and goes!

When you have fairly large data, _do not use the formula interface_, as a
couple of copies of the data would be made.  Try simply:

Myforest.rf <- randomForest(Mydata[, -46], Mydata[,46], 
                            ntree=100, mtry=7)

[Note that you don't need to set proximity (not proximities) or importance
to FALSE, as that's the default already.]

You might also want to use do.trace=1 to see if trees are actually being
grown (assuming there's no output buffering as in Rgui on Windows, otherwise
you'll probably want to turn that off).
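
A minimal sketch of the x/y call with tracing switched on, run on synthetic data rather than the poster's file. The simulated `Mydata`, the seed, and the `requireNamespace()` guard are assumptions added for illustration; note that the package's argument for the number of trees is spelled `ntree`.

```r
# Synthetic stand-in for Mydata: 45 numeric predictors, response in column 46.
set.seed(1)
n <- 200
Mydata <- as.data.frame(matrix(rnorm(n * 45), nrow = n))  # columns V1..V45
Mydata$V46 <- factor(rbinom(n, 1, 0.5))  # factor response => classification

if (requireNamespace("randomForest", quietly = TRUE)) {
  # x/y interface: predictors and response are passed directly,
  # so no model frame is built from a formula.
  Myforest.rf <- randomForest::randomForest(
    x = Mydata[, -46], y = Mydata[, 46],
    ntree = 100, mtry = 7,
    do.trace = 10   # print running OOB error every 10 trees
  )
  print(Myforest.rf)
}
```

If nothing is printed for a long time with do.trace set, the first tree is probably still being grown, which points at the data rather than at memory.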

I have run randomForest on data sets much larger than that without problems,
so I don't imagine your data would be `difficult'.  (I have not used the
Mac, though.)

Andy
> From: David L. Van Brunt, Ph.D.
> 
> Playing with randomForest, samples run fine. But on real data, no go.
> 
> Here's the setup: OS X, same behavior whether I'm using R-Aqua 1.8.1 or the
> Fink compile-of-my-own with X-11, R version 1.8.1.
> 
> This is on OS X 10.3 (aka "Panther"), G4 800MHz with 512M physical RAM.
> 
> I have not altered the Startup options of R.
> 
> Data set is read in from a text file with "read.table", and has 46 variables
> and 1,855 cases. Trying the following:
> 
> The DV is categorical, 0 or 1. Most of the IV's are either continuous, or
> correctly read in as factors. The largest factor has 30 levels.... Only the
> DV seems to need identifying as a factor to force class trees over
> regression:
> 
> >Mydata$V46<-as.factor(Mydata$V46)
> >Myforest.rf<-randomForest(V46~.,data=Mydata,ntrees=100,mtry=7,
> >  proximities=FALSE, importance=FALSE)
> 
> 5 hours later, R.bin was still taking up 75% of my processor. When I've
> tried this with larger data, I get errors referring to the buffer (sorry,
> not in front of me right now).
> 
> Any ideas on this? The data don't seem horrifically large. Seems like there
> are a few options for setting memory size, but I'm not sure which of them
> to try tweaking, or if that's even the issue.
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
> 
>

David L. Van Brunt, Ph.D.

2004-Apr-05 00:14 UTC

[R] Can't seem to finish a randomForest.... Just goes and goes!

Thanks for the pointer!! Can't believe you got back to me so quickly on a
Sunday evening. I'll give that a shot and let you know how it goes.

-- 
David L. Van Brunt, Ph.D.
Outlier Consulting & Development
mailto: <ocd at well-wired.com>

Liaw, Andy

2004-Apr-06 00:35 UTC

[R] Can't seem to finish a randomForest.... Just goes and goes!

No, that's not the problem.  You still need to take Bill's and Torsten's
advice.  If the categorical variables (class labels included) are read into
R as factors, then the conversion to integers is automagic:  Factors in R
_are_ integers 1 through the number of levels, with the levels attribute.
randomForest() just takes advantage of that fact.
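
A small base-R illustration of that point, using a made-up day-of-week column (the values are hypothetical). One caveat for current readers: since R 4.0.0, read.table() no longer converts strings to factors by default, so `stringsAsFactors = TRUE` or an explicit factor() call is needed to get this behaviour.

```r
# Hypothetical text column standing in for "Monday", etc.
day <- c("Monday", "Tuesday", "Monday", "Friday")

day_f <- factor(day)   # convert the character vector to a factor
levels(day_f)          # levels are sorted alphabetically by default
as.integer(day_f)      # the integer codes stored underneath
typeof(day_f)          # a factor is an integer vector with a levels attribute
```

No manual recoding to numbers is needed; recoding a nominal variable as numeric would instead impose an arbitrary ordering on it.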

Andy
> From: David L. Van Brunt, Ph.D.
> 
> D'OH!
> 
> I clearly just needed to Re-RTFM!!!  I had a column still coded as TEXT
> (yup, "Monday", etc), and the randomForest manual by Breiman says they need
> to be numerically coded. Easy recode. I'll try running it RIGHT this time,
> and let you all know how this goes.  Grumble mumble mumble....
> 
> On 4/5/04 1:40, "Bill.Venables at csiro.au" <Bill.Venables at csiro.au> wrote:
> 
> > Alternatively, if you can arrive at a sensible ordering of the levels
> > you can declare them ordered factors and make the computation feasible
> > once again.
> > 
> > Bill Venables.
> > 
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch
> > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Torsten Hothorn
> > Sent: Monday, 5 April 2004 4:27 PM
> > To: David L. Van Brunt, Ph.D.
> > Cc: R-Help
> > Subject: Re: [R] Can't seem to finish a randomForest.... Just goes and
> > goes!
> > 
> > 
> > On Sun, 4 Apr 2004, David L. Van Brunt, Ph.D. wrote:
> > 
> >> The DV is categorical, 0 or 1. Most of the IV's are either continuous,
> >> or correctly read in as factors. The largest factor has 30 levels.... Only the
> >                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > This means: there are 2^(30-1) = 536,870,912 possible splits to be
> > evaluated every time this variable is picked up (minus something due to
> > empty levels). At least the last time I looked at the code, randomForest
> > used an exhaustive search over all possible splits. Try reducing the
> > number of levels to something reasonable (or for a first shot: remove
> > this variable from the learning sample).
> > 
> > Best,
> > 
> > Torsten
> > 
> > 
> 
> -- 
> David L. Van Brunt, Ph.D.
> Outlier Consulting & Development
> mailto: <ocd at well-wired.com>
> 
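
To make the arithmetic in Torsten's quoted reply concrete, here is a short sketch of how the number of candidate splits grows with the number of levels. The function names are invented for this example. The exact count of two-way partitions of k unordered levels is 2^(k-1) - 1, one less than the round figure quoted; an ordered factor, as Bill suggests, admits only k - 1 cut points, assuming the tree code treats ordered factors like numeric variables.

```r
# Number of distinct two-way partitions of k unordered levels.
n_unordered_splits <- function(k) 2^(k - 1) - 1

# An ordered factor with k levels only admits k - 1 cut points.
n_ordered_splits <- function(k) k - 1

k <- c(2, 7, 12, 30)
data.frame(levels    = k,
           unordered = n_unordered_splits(k),
           ordered   = n_ordered_splits(k))

# Declaring an ordering, where a sensible one exists (hypothetical example):
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
is.ordered(size)
```

A 7-level day-of-week factor is therefore cheap to search exhaustively, while a 30-level factor is not.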

Liaw, Andy

2004-Apr-06 02:15 UTC

[R] Can't seem to finish a randomForest.... Just goes and goes!

If that variable is a subject ID, and the data are repeated observations on
the subjects, then you might be treading on thin ice here.  A while back
someone at NCI got a data set with two reps per subject, and he was able to
modify the code so that the bootstrap is done on the subject basis, rather
than observations.  It's a bit of work trying to get a proximity matrix to
make sense, though.
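
For readers wondering what resampling "on the subject basis" could look like, here is a rough base-R sketch. It is not the modified randomForest code mentioned above; the data frame, the column names, and the seed are all made up for illustration.

```r
# Hypothetical repeated-measures data: 5 subjects, 3 observations each.
set.seed(42)
dat <- data.frame(subject = rep(paste0("S", 1:5), each = 3),
                  x       = rnorm(15))

# Draw subjects (not rows) with replacement.
subjects <- unique(dat$subject)
drawn    <- sample(subjects, length(subjects), replace = TRUE)

# Keep every row of each drawn subject, once per time it was drawn.
boot_rows <- unlist(lapply(drawn, function(s) which(dat$subject == s)))
boot      <- dat[boot_rows, ]

# Out-of-bag set: all rows of subjects that were never drawn.
oob <- dat[!dat$subject %in% drawn, ]
```

The point of drawing whole subjects is that no subject contributes rows to both the in-bag and the out-of-bag set, so the out-of-bag error is not flattered by within-subject correlation.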

I really have no idea how to take care of repeated measures type data (i.e.,
accounting for intra-subject correlations) in a classification problem.  I
suppose one can formulate it as a GLMM.  I guess it really depends on what
you are looking for; i.e., what's the goal?  I assume you want to predict
something, but is that over all subjects, or subject-specific?  I'd better
stop here, as this is out of my league...

Andy
> From: David L. Van Brunt, Ph.D. [mailto:dvanbrunt at well-wired.com] 
> 
> Removing that first 39 level variable, the trees ran just fine. I had also
> taken the shorter categoricals (day of week, for example) and read them in
> as numerics.
> 
> Still working on it. Need that 30 level puppy in there somehow, but it
> really is not anything like a rank... It is a nominal variable.
> 
> With numeric values, only assigning the outcome (last column) to be a factor
> using "as.factor()" it runs fine, and fast.
> 
> I may be misusing this analysis. That first column is indeed nominal, and I
> was including it because the data within that name are repeated observations
> of that subject. But I suppose there's no guarantee that that information
> would be selected, so what does that do to the forest?  Sigh. I'm not much
> of a lumberjack. Logistic regression is more my style, but this is pretty
> interesting stuff.
> 
> If interested, here's a link to the data;
> http://www.well-wired.com/reflibrary/uploads/1081216314.txt
> 
>  
> 
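
One generic way to act on the earlier suggestion of reducing the number of levels of a nominal variable is to keep the most frequent levels and lump the rest together. The sketch below uses simulated data, and the cut-off of five levels is arbitrary; whether lumping is sensible for a subject identifier is a separate question, as the reply above points out.

```r
# Simulated nominal variable with up to 30 levels, a few of them common.
set.seed(7)
id <- factor(sample(paste0("L", 1:30), 500, replace = TRUE,
                    prob = c(rep(10, 5), rep(1, 25))))

# Keep the 5 most frequent levels; everything else becomes "other".
counts <- table(id)
keep   <- names(sort(counts, decreasing = TRUE))[1:5]
lumped <- factor(ifelse(id %in% keep, as.character(id), "other"))

nlevels(id)       # number of levels before lumping
nlevels(lumped)   # number of levels after lumping
table(lumped)
```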
