[Please CC me in any replies as I am not currently subscribed to the list. Thanks!]

Dear all,

I did a bit of searching on the question of large datasets but did not come to a definite conclusion. What I am trying to do is the following: I want to read in a dataset with approximately 100,000 rows and 150 columns. The file size is ~33 MB, which one would not deem too big a file for R. To speed up reading the file I do not use read.table() but a loop that reads with scan() into a buffer, does some preprocessing, and then appends the data to a data frame.

When I then want to run randomForest(), R complains that it cannot allocate a vector of size 313.0 MB. I am aware that randomForest needs all data in memory, but

1) why should that suddenly be ten times the size of the data (I acknowledge the need for some internal data in R, but ten times seems a bit much), and

2) there is still physical memory free on the machine (4 GB in total; even though R is limited to 2 GB if I remember the help pages correctly, 2 GB should still be enough).

It doesn't seem to work either with changed settings made via mem.limits() or with the run-time arguments --min-vsize and --max-vsize - what do these have to be set to in my case?

> rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb
> object.size(df)/1024/1024
[1] 129.5390

Any help would be greatly appreciated,

Florian

--
Florian Nigsch <fn211@cam.ac.uk>
Unilever Centre for Molecular Sciences Informatics
Department of Chemistry
University of Cambridge
http://www-mitchell.ch.cam.ac.uk/
Telephone: +44 (0)1223 763 073
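P.S. For completeness, the reading loop mentioned above is roughly of this form (heavily simplified: the file name, number of columns, chunk size and the preprocessing step are placeholders, not my actual values):

read_chunks <- function(path, ncols = 150, chunk = 10000) {
    # Read 'chunk' lines at a time with scan(), turn each buffer into a
    # data frame, and bind the pieces once at the end rather than growing
    # the data frame row by row.
    con <- file(path, open = "r")
    on.exit(close(con))
    pieces <- list()
    repeat {
        buf <- scan(con, what = numeric(), nlines = chunk, quiet = TRUE)
        if (length(buf) == 0) break
        # (preprocessing of 'buf' would happen here)
        pieces[[length(pieces) + 1]] <-
            as.data.frame(matrix(buf, ncol = ncols, byrow = TRUE))
    }
    do.call(rbind, pieces)
}

df <- read_chunks("dataset.txt")   # placeholder file name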
Florian,

The first thing that you should change is how you call randomForest. Instead of specifying the model via a formula, use the randomForest(x, y) interface. When a formula is used, a terms object is created so that a model matrix can be built for these and for future observations. That terms object can get big (I think it would be a matrix of size 151 x 150) and is diagonal. That might not solve it, but it should help.

Max
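P.S. A rough sketch of what I mean (it assumes, as in your formula, that V1 is the response and all other columns are predictors - adjust the names to your data):

# Formula-free interface: pass the predictors and the response directly,
# so no terms object / model matrix machinery is involved.
train <- df[trainindices, ]
rf <- randomForest(x = train[, names(train) != "V1"],
                   y = train$V1,
                   do.trace = 5)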
I compiled the newest R version on a Redhat Linux machine (uname -a = Linux .cam.ac.uk 2.4.21-50.ELsmp #1 SMP Tue May 8 17:18:29 EDT 2007 i686 i686 i386 GNU/Linux) with 4 GB of physical memory.

The step at which the whole script crashes is within the randomForest() routine. I know that because I want to time it, so I have it inside a system.time() call. This function exits with the error I posted earlier:

> rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb

When I call gc() directly before and after the call to randomForest(), I get this:

> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
Ncells   255416   6.9     899071  24.1    16800.0    818163   21.9
Vcells 17874469 136.4   90854072 693.2     4000.1 269266598 2054.4
> rf <- randomForest(V1 ~ ., data=df, subset=trainindices, do.trace=5)
Error: cannot allocate vector of size 626.1 Mb
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
Ncells   255441   6.9     899071  24.1    16800.0    818163   21.9
Vcells 17874541 136.4  112037674 854.8     4000.1 269266598 2054.4
>

So the only real difference is in the "gc trigger" column and the "(Mb)" column next to it. By the way, I am not running R in GUI mode.

On 27 Jul 2007, at 13:17, jim holtman wrote:

> At the max, you had 2GB of memory being used. What operating system
> are you running on and how much physical memory do you have on your
> system? For Windows, there are parameters on the command line to
> start RGUI that let you define how much memory can be used. I am not
> sure about Linux/UNIX. So you are probably hitting the 2GB max and then
> you don't have any more physical memory available. If the computation
> is a long script, you might put some gc() statements in the code to
> see which section is using the most memory.
>
> Your problem might have to be broken into parts to run.
>
> On 7/27/07, Florian Nigsch <fn211 at cam.ac.uk> wrote:
>> Hi Jim,
>>
>> Here is the output of gc() from the same session of R (which I still
>> have running...)
>>
>>> gc()
>>            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
>> Ncells   255416   6.9     899071  24.1    16800.0    818163   21.9
>> Vcells 17874469 136.4  113567591 866.5     4000.1 269266598 2054.4
>>
>> By increasing the limit of Vcells and Ncells to 1 GB (if that is
>> possible?!), would that perhaps solve my problem?
>>
>> Cheers,
>>
>> Florian
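P.S. Following Jim's suggestion, the instrumentation I will put around the heavy steps is roughly of this form (a simplified sketch; train_x and train_y stand for the predictors and the response split out of the data frame, as suggested above for the x/y interface):

library(randomForest)

gc(reset = TRUE)      # reset the "max used" counters before the step
print(gc())           # memory state just before the call
timing <- system.time(
    rf <- randomForest(x = train_x, y = train_y, do.trace = 5)
)
print(gc())           # memory state afterwards; "max used" shows the peak
print(timing)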