It looks like you are building a regression model. With such a large number of
rows, you should try to limit the size of the trees by setting nodesize to
something larger than the default (5). The issue, I suspect, is that the
largest possible tree has roughly 2*(n/nodesize) nodes, where n is the number
of rows, and each node takes a row in a matrix to store. With 500,000 rows and
the default nodesize of 5, that is up to about 200,000 nodes per tree;
multiply that by the 100 trees you are trying to build, and you can see how
the memory gets gobbled up quickly. Boosted trees don't usually run into this
problem because one usually boosts very small trees (usually no more than 10
terminal nodes per tree).
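
If you want to try it, something like the following should keep the trees
small (the nodesize and sampsize values here are only illustrative starting
points, not tuned recommendations):

library(randomForest)

## Larger nodesize caps how deep each tree can grow; sampsize (optional)
## additionally limits how many rows each tree is grown on.
rf <- randomForest(x = mydata1.clean[, 3:9],
                   y = mydata1.clean[, 10],
                   ntree    = 100,
                   nodesize = 50,      # default is 5 for regression
                   sampsize = 50000)   # e.g. 50k of the 500k rows per tree
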
Best,
Andy
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of John Foreman
> Sent: Wednesday, September 07, 2011 2:46 PM
> To: r-help at r-project.org
> Subject: [R] randomForest memory footprint
>
> Hello, I am attempting to train a random forest model using the
> randomForest package on 500,000 rows and 8 columns (7 predictors, 1
> response). The data set is the first block of data from the UCI
> Machine Learning Repo dataset "Record Linkage Comparison Patterns"
> with the slight modification that I dropped two columns with lots of
> NA's and I used knn imputation to fill in other gaps.
>
> When I load in my dataset, R uses no more than 100 megs of RAM. I'm
> running a 64-bit R with ~4 gigs of RAM available. When I execute the
> randomForest() function, however I get memory complaints. Example:
>
> > summary(mydata1.clean[,3:10])
>  cmp_fname_c1     cmp_lname_c1      cmp_sex           cmp_bd
>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
>  1st Qu.:0.2857   1st Qu.:0.1000   1st Qu.:1.0000   1st Qu.:0.0000
>  Median :1.0000   Median :0.1818   Median :1.0000   Median :0.0000
>  Mean   :0.7127   Mean   :0.3156   Mean   :0.9551   Mean   :0.2247
>  3rd Qu.:1.0000   3rd Qu.:0.4286   3rd Qu.:1.0000   3rd Qu.:0.0000
>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
>      cmp_bm           cmp_by          cmp_plz          is_match
>  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   FALSE:572820
>  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   TRUE :  2093
>  Median :0.0000   Median :0.0000   Median :0.00000
>  Mean   :0.4886   Mean   :0.2226   Mean   :0.00549
>  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000
>  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
> > mydata1.rf.model2 <- randomForest(x = mydata1.clean[,3:9], y = mydata1.clean[,10], ntree = 100)
> Error: cannot allocate vector of size 877.2 Mb
> In addition: Warning messages:
> 1: In dim(data) <- dim :
> Reached total allocation of 3992Mb: see help(memory.size)
> 2: In dim(data) <- dim :
> Reached total allocation of 3992Mb: see help(memory.size)
> 3: In dim(data) <- dim :
> Reached total allocation of 3992Mb: see help(memory.size)
> 4: In dim(data) <- dim :
> Reached total allocation of 3992Mb: see help(memory.size)
>
> Other techniques such as boosted trees handle the data size just fine.
> Are there any parameters I can adjust such that I can use a value of
> 100 or more for ntree?
>
> Thanks,
> John
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>