Bill.Venables@csiro.au
2004-Apr-05 06:40 UTC
[R] Can't seem to finish a randomForest.... Just goes and goes!
Alternatively, if you can arrive at a sensible ordering of the levels you can declare them ordered factors and make the computation feasible once again.

Bill Venables.

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Torsten Hothorn
Sent: Monday, 5 April 2004 4:27 PM
To: David L. Van Brunt, Ph.D.
Cc: R-Help
Subject: Re: [R] Can't seem to finish a randomForest.... Just goes and goes!

On Sun, 4 Apr 2004, David L. Van Brunt, Ph.D. wrote:

> Playing with randomForest, samples run fine. But on real data, no go.
>
> Here's the setup: OS X, same behavior whether I'm using R-Aqua 1.8.1
> or the Fink compile-of-my-own with X-11, R version 1.8.1.
>
> This is on OS X 10.3 (aka "Panther"), G4 800 MHz with 512M physical
> RAM.
>
> I have not altered the Startup options of R.
>
> Data set is read in from a text file with "read.table", and has 46
> variables and 1,855 cases. Trying the following:
>
> The DV is categorical, 0 or 1. Most of the IV's are either continuous,
> or correctly read in as factors. The largest factor has 30 levels....
> Only the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This means: there are 2^(30-1) = 536.870.912 possible splits to be evaluated every time this variable is picked up (minus something due to empty levels). At least the last time I looked at the code, randomForest used an exhaustive search over all possible splits. Try reducing the number of levels to something reasonable (or for a first shot: remove this variable from the learning sample).

Best,

Torsten

> DV seems to need identifying as a factor to force class trees over
> regression:
>
> >Mydata$V46<-as.factor(Mydata$V46)
> >Myforest.rf<-randomForest(V46~.,data=Mydata,ntrees=100,mtry=7,proximities=FALSE
> , importance=FALSE)
>
> 5 hours later, R.bin was still taking up 75% of my processor. When
> I've tried this with larger data, I get errors referring to the buffer
> (sorry, not in front of me right now).
>
> Any ideas on this? The data don't seem horrifically large. Seems like
> there are a few options for setting memory size, but I'm not sure
> which of them to try tweaking, or if that's even the issue.

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
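To put numbers on Torsten's count and Bill's suggestion, here is a small base-R sketch. The helper names `n_splits_unordered` and `n_splits_ordered` are made up for illustration and are not part of randomForest. The sketch assumes an exhaustive search over binary partitions of the levels, as Torsten describes, and assumes that an ordered factor is only ever cut between adjacent levels.

```r
# Hypothetical helpers for illustration; not part of the randomForest package.

# Ways to split k unordered levels into two non-empty groups.
# (Torsten's 2^(k-1) figure also counts the trivial "everything on one side" split.)
n_splits_unordered <- function(k) 2^(k - 1) - 1

# An ordered factor can only be cut between adjacent levels.
n_splits_ordered <- function(k) k - 1

n_splits_unordered(30)  # 536870911
n_splits_ordered(30)    # 29

# Declaring an ordering, assuming a sensible one exists (days of the week here).
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
x <- factor(c("Tue", "Sun", "Mon"), levels = days, ordered = TRUE)
is.ordered(x)  # TRUE
```

Whether an ordering is defensible depends on the variable: it suits something like day of week, but not a purely nominal identifier.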
David L. Van Brunt, Ph.D.
2004-Apr-06 00:16 UTC
[R] Can't seem to finish a randomForest.... Just goes and goes!
D'OH! I clearly just needed to Re-RTFM!!!

I had a column still coded as TEXT (yup, "Monday", etc), and the randomForest manual by Breiman says they need to be numerically coded. Easy recode. I'll try running it RIGHT this time, and let you all know how this goes.

Grumble mumble mumble....

On 4/5/04 1:40, "Bill.Venables at csiro.au" <Bill.Venables at csiro.au> wrote:

> Alternatively, if you can arrive at a sensible ordering of the levels
> you can declare them ordered factors and make the computation feasible
> once again.
>
> Bill Venables.

--
David L. Van Brunt, Ph.D.
Outlier Consulting & Development
mailto: <ocd at well-wired.com>
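A quick way to catch a column like this before fitting is to list each column's class and level count. The sketch below uses a toy data frame; the column names `subject`, `day`, and `score` are invented stand-ins (only `V46` comes from the original post), and the numeric recode via `match()` is just one way to do the "easy recode" mentioned above.

```r
# Toy data standing in for the real file; column names other than V46 are invented.
Mydata <- data.frame(
  subject = c("s01", "s02", "s03", "s01"),
  day     = c("Monday", "Tuesday", "Monday", "Friday"),
  score   = c(1.2, 3.4, 2.2, 0.7),
  V46     = c(0, 1, 1, 0),
  stringsAsFactors = TRUE
)

# Class of every column: text columns show up as "factor" (or "character").
col_class <- sapply(Mydata, function(col) class(col)[1])

# Number of levels per column (0 for non-factor columns).
n_lev <- sapply(Mydata, nlevels)

# Recode day of week to its position in the week (1 = Monday, ..., 7 = Sunday).
week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
Mydata$day_num <- match(as.character(Mydata$day), week)

# The outcome still has to be a factor to get classification trees.
Mydata$V46 <- as.factor(Mydata$V46)
```

Columns with a large `n_lev` value are the ones most likely to make an exhaustive split search expensive.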
David L. Van Brunt, Ph.D.
2004-Apr-06 01:58 UTC
[R] Can't seem to finish a randomForest.... Just goes and goes!
Removing that first 30 level variable, the trees ran just fine. I had also taken the shorter categoricals (day of week, for example) and read them in as numerics.

Still working on it. Need that 30 level puppy in there somehow, but it really is not anything like a rank... It is a nominal variable.

With numeric values, only assigning the outcome (last column) to be a factor using "as.factor()" it runs fine, and fast.

I may be misusing this analysis. That first column is indeed nominal, and I was including it because the data within that name are repeated observations of that subject. But I suppose there's no guarantee that that information would be selected, so what does that do to the forest?

Sigh. I'm not much of a lumberjack. Logistic regression is more my style, but this is pretty interesting stuff.

If interested, here's a link to the data:
http://www.well-wired.com/reflibrary/uploads/1081216314.txt

On 4/5/04 1:40, "Bill.Venables at csiro.au" <Bill.Venables at csiro.au> wrote:

> Alternatively, if you can arrive at a sensible ordering of the levels
> you can declare them ordered factors and make the computation feasible
> once again.
>
> Bill Venables.
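For a nominal variable that cannot be ordered, one way to follow Torsten's advice to reduce the number of levels is to keep the most frequent levels and pool the rest. The function below is a sketch only: `lump_rare_levels` and its arguments are hypothetical names, not part of randomForest, and pooling discards the identity of the rare levels, which may or may not be acceptable when the variable identifies subjects with repeated observations.

```r
# Hypothetical helper: keep the `keep` most frequent levels, pool the rest.
lump_rare_levels <- function(f, keep = 10, other = "other") {
  f <- as.factor(f)
  counts <- sort(table(f), decreasing = TRUE)
  top <- names(counts)[seq_len(min(keep, length(counts)))]
  out <- as.character(f)
  # Leave missing values alone; pool everything outside the top levels.
  out[!is.na(out) & !(out %in% top)] <- other
  factor(out)
}

# Small deterministic example: "c" and "d" are pooled into "other".
f <- c("a", "a", "a", "b", "b", "c", "d")
g <- lump_rare_levels(f, keep = 2)
levels(g)  # "a" "b" "other"
```

With `keep = 8`, a 30-level factor drops to at most 9 levels, which by the count discussed earlier in the thread leaves 2^(9-1) - 1 = 255 candidate splits instead of hundreds of millions.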