Liaw, Andy
2004-Apr-05 00:07 UTC
[R] Can't seem to finish a randomForest.... Just goes and goes!
When you have fairly large data, _do not use the formula interface_, as a couple of copies of the data would be made. Try simply:

    Myforest.rf <- randomForest(Mydata[, -46], Mydata[,46], ntrees=100, mtry=7)

[Note that you don't need to set proximity (not proximities) or importance to FALSE, as that's the default already.]

You might also want to use do.trace=1 to see if trees are actually being grown (assuming there's no output buffering as in Rgui on Windows, otherwise you'll probably want to turn that off).

I have run randomForest on data sets much larger than that without problems, so I don't imagine your data would be `difficult'. (I have not used the Mac, though.)

Andy

> From: David L. Van Brunt, Ph.D.
>
> Playing with randomForest, samples run fine. But on real data, no go.
>
> Here's the setup: OS X, same behavior whether I'm using R-Aqua 1.8.1 or the
> Fink compile-of-my-own with X-11, R version 1.8.1.
>
> This is on OS X 10.3 (aka "Panther"), G4 800 MHz with 512M physical RAM.
>
> I have not altered the Startup options of R.
>
> Data set is read in from a text file with "read.table", and has 46 variables
> and 1,855 cases. Trying the following:
>
> The DV is categorical, 0 or 1. Most of the IV's are either continuous, or
> correctly read in as factors. The largest factor has 30 levels.... Only the
> DV seems to need identifying as a factor to force class trees over
> regression:
>
> > Mydata$V46<-as.factor(Mydata$V46)
> > Myforest.rf<-randomForest(V46~.,data=Mydata,ntrees=100,mtry=7,proximities=FALSE, importance=FALSE)
>
> 5 hours later, R.bin was still taking up 75% of my processor. When I've
> tried this with larger data, I get errors referring to the buffer (sorry,
> not in front of me right now).
>
> Any ideas on this? The data don't seem horrifically large. Seems like there
> are a few options for setting memory size, but I'm not sure which of them
> to try tweaking, or if that's even the issue.
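A minimal sketch of the non-formula call suggested above, for anyone who wants to try it without the original file. The data here are made up (45 random numeric predictors plus a 0/1 outcome in column 46), so only the shape of the call matches the post. One assumption worth flagging: the randomForest documentation spells the tree-count argument `ntree`, so the sketch uses that spelling rather than `ntrees`. The fit is skipped if the randomForest package is not installed.

```r
# Made-up stand-in for the poster's data: 45 numeric predictors and a
# binary outcome in column 46 (the real data are not reproduced here).
set.seed(1)
n <- 200
Mydata <- data.frame(matrix(rnorm(n * 45), nrow = n))
Mydata$V46 <- as.factor(rbinom(n, 1, 0.5))  # factor outcome => classification

# Pass predictors and response separately instead of using a formula.
x <- Mydata[, -46]
y <- Mydata[, 46]

if (requireNamespace("randomForest", quietly = TRUE)) {
  # do.trace = 10 prints running output every 10 trees, which shows
  # whether trees are actually being grown.
  Myforest.rf <- randomForest::randomForest(x, y, ntree = 100, mtry = 7,
                                            do.trace = 10)
  print(Myforest.rf)
}
```

If the trace lines stop appearing, the run is probably stuck inside a single tree rather than simply being slow overall.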
David L. Van Brunt, Ph.D.
2004-Apr-05 00:14 UTC
[R] Can't seem to finish a randomForest.... Just goes and goes!
Thanks for the pointer!! Can't believe you got back to me so quickly on a Sunday evening.

I'll give that a shot and let you know how it goes.

--
David L. Van Brunt, Ph.D.
Outlier Consulting & Development
mailto: <ocd at well-wired.com>
Liaw, Andy
2004-Apr-06 00:35 UTC
[R] Can't seem to finish a randomForest.... Just goes and goes!
No, that's not the problem. You still need to take Bill's and Torsten's advice.

If the categorical variables (class labels included) are read into R as factors, then the conversion to integers is automagic: factors in R _are_ integers 1 through the number of levels, with the levels attribute. randomForest() just takes advantage of that fact.

Andy

> From: David L. Van Brunt, Ph.D.
>
> D'OH!
>
> I clearly just needed to Re-RTFM!!! I had a column still coded as TEXT
> (yup, "Monday", etc), and the randomForest manual by Breiman says they need
> to be numerically coded. Easy recode. I'll try running it RIGHT this time,
> and let you all know how this goes. Grumble mumble mumble....
>
> On 4/5/04 1:40, "Bill.Venables at csiro.au" <Bill.Venables at csiro.au> wrote:
>
> > Alternatively, if you can arrive at a sensible ordering of the levels
> > you can declare them ordered factors and make the computation feasible
> > once again.
> >
> > Bill Venables.
> >
> > -----Original Message-----
> > From: r-help-bounces at stat.math.ethz.ch
> > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Torsten Hothorn
> > Sent: Monday, 5 April 2004 4:27 PM
> > To: David L. Van Brunt, Ph.D.
> > Cc: R-Help
> > Subject: Re: [R] Can't seem to finish a randomForest.... Just goes and goes!
> >
> > On Sun, 4 Apr 2004, David L. Van Brunt, Ph.D. wrote:
> >
> >> The DV is categorical, 0 or 1. Most of the IV's are either continuous,
> >> or correctly read in as factors. The largest factor has 30 levels....
> >                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > This means: there are 2^(30-1) = 536.870.912 possible splits to be
> > evaluated every time this variable is picked up (minus something due to
> > empty levels). At least the last time I looked at the code, randomForest
> > used an exhaustive search over all possible splits. Try reducing the
> > number of levels to something reasonable (or for a first shot: remove
> > this variable from the learning sample).
> >
> > Best,
> >
> > Torsten
>
> --
> David L. Van Brunt, Ph.D.
> Outlier Consulting & Development
> mailto: <ocd at well-wired.com>
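The two points in this exchange (factors are stored as integer codes, and a nominal factor with many levels produces a huge number of candidate splits) can be checked in base R. This is only an illustrative sketch: the `day` vector and the two helper functions are made up for the example, and the counts are the standard combinatorial ones for binary splits, not something read out of the randomForest source.

```r
# A text column such as day of week, converted to a factor.
day <- c("Monday", "Tuesday", "Monday", "Friday")
f <- factor(day)

levels(f)      # the level labels (sorted alphabetically by default)
as.integer(f)  # the underlying integer codes, 1..nlevels(f)
typeof(f)      # "integer": a factor is an integer vector plus a levels attribute

# Number of distinct binary splits for a predictor with k levels
# (hypothetical helper names, for illustration only):
#   nominal: ways to divide k levels into two non-empty groups
#   ordered: one cut point between each pair of adjacent levels
n_splits_nominal <- function(k) 2^(k - 1) - 1
n_splits_ordered <- function(k) k - 1

n_splits_nominal(30)  # 536870911
n_splits_ordered(30)  # 29
```

The nominal count is one less than the 2^(30-1) quoted above because the split that puts every level on one side is not a real split. The ordered count shows why declaring a sensible ordering makes the search cheap again.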
Liaw, Andy
2004-Apr-06 02:15 UTC
[R] Can't seem to finish a randomForest.... Just goes and goes!
If that variable is a subject ID, and the data are repeated observations on the subjects, then you might be treading on thin ice here. A while back someone at NCI got a data set with two reps per subject, and he was able to modify the code so that the bootstrap is done on a per-subject basis rather than on individual observations. It's a bit of work trying to get a proximity matrix to make sense, though.

I really have no idea how to take care of repeated-measures type data (i.e., accounting for intra-subject correlations) in a classification problem. I suppose one can formulate it as a GLMM. I guess it really depends on what you are looking for; i.e., what's the goal? I assume you want to predict something, but is that over all subjects, or subject-specific?

I'd better stop here, as this is out of my league...

Andy

> From: David L. Van Brunt, Ph.D. [mailto:dvanbrunt at well-wired.com]
>
> Removing that first 39 level variable, the trees ran just fine. I had also
> taken the shorter categoricals (day of week, for example) and read them in
> as numerics.
>
> Still working on it. Need that 30 level puppy in there somehow, but it
> really is not anything like a rank... It is a nominal variable.
>
> With numeric values, only assigning the outcome (last column) to be a factor
> using "as.factor()" it runs fine, and fast.
>
> I may be misusing this analysis. That first column is indeed nominal, and I
> was including it because the data within that name are repeated observations
> of that subject. But I suppose there's no guarantee that that information
> would be selected, so what does that do to the forest? Sigh. I'm not much
> of a lumberjack. Logistic regression is more my style, but this is pretty
> interesting stuff.
>
> If interested, here's a link to the data;
> http://www.well-wired.com/reflibrary/uploads/1081216314.txt
>
> --
> David L. Van Brunt, Ph.D.
> Outlier Consulting & Development
> mailto: <ocd at well-wired.com>
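The idea of bootstrapping subjects rather than observations can be sketched in base R. To be clear about the assumptions: this is not the modified code mentioned above (that is not shown in the thread), the data are made up, `subject_bootstrap` is a hypothetical helper, and nothing here is wired into randomForest. It only shows the resampling step: draw subject IDs with replacement, then keep every row belonging to each drawn subject, so repeated observations of one subject always stay together.

```r
set.seed(42)

# Made-up data: 5 subjects with 3 repeated observations each.
dat <- data.frame(
  subject = rep(paste0("S", 1:5), each = 3),
  x       = rnorm(15),
  y       = factor(rbinom(15, 1, 0.5)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: resample whole subjects, not individual rows.
subject_bootstrap <- function(data, id) {
  ids   <- unique(data[[id]])
  drawn <- sample(ids, length(ids), replace = TRUE)
  # A subject drawn twice contributes all of its rows twice.
  rows  <- unlist(lapply(drawn, function(s) which(data[[id]] == s)))
  data[rows, , drop = FALSE]
}

boot <- subject_bootstrap(dat, "subject")

# Subjects never drawn form the out-of-bag set for this resample.
oob <- dat[!dat$subject %in% boot$subject, , drop = FALSE]

table(boot$subject)
```

Because whole subjects land either in the bootstrap sample or in the out-of-bag set, an out-of-bag error computed this way would not be flattered by having other rows from the same subject in the training sample.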
Possibly Parallel Threads
- Can't seem to finish a randomForest.... Just goes and goes!
- Re: R-help Digest, Vol 14, Issue 3
- Party package: varimp(..., conditional=TRUE) error: term 1 would require 9e+12 columns
- random forest -optimising mtry
- Need Help! Poor performance about randomForest for large data