I have a program which reads in a very large data set, performs some analyses, and then repeats this process with another data set. As soon as the first set of analyses is complete, I remove the very large object and clean up to try and make memory available in order to run the second set of analyses. The process looks something like this:

1) read in data set 1 and perform analyses
rm(list=ls())
gc()
2) read in data set 2 and perform analyses
rm(list=ls())
gc()
...

But it appears that I am not making the memory that was consumed in step 1 available back to the OS, as R complains that it cannot allocate a vector of size X when the process tries to repeat in step 2.

So, I close and reopen R and then drop in the code to run the second analysis. When this is done, I close and reopen R and run the third analysis.

This is terribly inefficient. Instead I would rather just source in the R code and let the analyses run overnight.

Is there a way that I can use gc() or some other function more efficiently rather than having to close and reopen R at each iteration?

I'm using Windows XP and R 2.6.1.

Harold
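For reference, a minimal sketch of the pattern being described; the file names and the lm() call are only hypothetical stand-ins for the real data and analyses:

## hypothetical file names; lm() stands in for the actual analyses
dat <- read.csv("dataset1.csv")        # 1) read in data set 1
fit <- lm(y ~ x, data = dat)           #    perform analyses
save(fit, file = "results1.RData")     #    keep results on disk
rm(list = ls()); gc()                  #    clear the workspace, collect garbage
dat <- read.csv("dataset2.csv")        # 2) read in data set 2
fit <- lm(y ~ x, data = dat)
save(fit, file = "results2.RData")
rm(list = ls()); gc()                  #    frees the objects, but the heap may stay fragmented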
On 4 February 2008 at 20:45, Doran, Harold wrote:
| I have a program which reads in a very large data set, performs some analyses, and then repeats this process with another data set. As soon as the first set of analyses is complete, I remove the very large object and clean up to try and make memory available in order to run the second set of analyses. The process looks something like this:
|
| 1) read in data set 1 and perform analyses
| rm(list=ls())
| gc()
| 2) read in data set 2 and perform analyses
| rm(list=ls())
| gc()
| ...
|
| But, it appears that I am not making the memory that was consumed in step 1 available back to the OS as R complains that it cannot allocate a vector of size X as the process tries to repeat in step 2.
|
| So, I close and reopen R and then drop in the code to run the second analysis. When this is done, I close and reopen R and run the third analysis.
|
| This is terribly inefficient. Instead I would rather just source in the R code and let the analyses run overnight.
|
| Is there a way that I can use gc() or some other function more efficiently rather than having to close and reopen R at each iteration?

I haven't found one. Every (trading) day I process batches of data with R, and the only reliable way I have found is to use fresh R sessions. Otherwise, the fragmented memory will eventually result in the all-too-familiar 'cannot allocate X Mb' for rather small values of X relative to my total RAM. C'est la vie.

As gc() seems to help somewhat yet not 'sufficiently', fresh starts are the alternative that helps, and Rscript starts faster than the main R. Now, I happen to be partial to littler [1], which starts even faster, so I use that (on Linux; I am not sure if it can be built on Windows, as we embed R directly and hence start faster than Rscript). But either one should help you with some batch files -- giving you a way to run overnight.

And once you start batching things, it is only a small step to regain efficiency by parallel execution using something like MPI or NWS.

Hth, Dirk

[1] littler is the predecessor to Rscript by Jeff and myself. See either
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/LittleR or
http://dirk.eddelbuettel.com/code/littler.html
for more on littler, and feel free to email us.

--
Three out of two people have difficulties with fractions.
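To make the batch-file suggestion concrete, one standalone per-data-set script might look like the sketch below (file names and the analysis are hypothetical); run in its own fresh session, all of its memory is returned to the OS when the process exits:

## analysis1.R -- hypothetical standalone script, one per data set,
## launched in a fresh session with e.g.  Rscript analysis1.R
## (or  r analysis1.R  when using littler)
dat <- read.csv("dataset1.csv")        # read this script's data set
fit <- lm(y ~ x, data = dat)           # stand-in for the real analyses
save(fit, file = "results1.RData")     # persist results before the session ends
## when the script finishes, the whole process exits and its memory is released

On Windows, such Rscript calls can be listed one per line in a .bat file and left to run overnight.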
1) See ?"Memory-limits": it is almost certainly memory fragmentation. You don't need to give the memory back to the OS (and few OSes actually do so). 2) I've never seen this running a 64-bit version of R. 3) You can easily write a script to do this. Indeed, you could write an R script to run multiple R scripts in separate processes in turn (via system("Rscript fileN.R") ). For example. Uwe Ligges uses R to script building and testing of packages on Windows. On Mon, 4 Feb 2008, Doran, Harold wrote:> I have a program which reads in a very large data set, performs some > analyses, and then repeats this process with another data set. As soon > as the first set of analyses are complete, I remove the very large > object and clean up to try and make memory available in order to run the > second set of analyses. The process looks something like this: > > 1) read in data set 1 and perform analyses > rm(list=ls()) > gc() > 2) read in data set 2 and perform analyses > rm(list=ls()) > gc() > ... > > But, it appears that I am not making the memory that was consumed in > step 1 available back to the OS as R complains that it cannot allocate a > vector of size X as the process tries to repeat in step 2. > > So, I close and reopen R and then drop in the code to run the second > analysis. When this is done, I close and reopen R and run the third > analysis. > > This is terribly inefficient. Instead I would rather just source in the > R code and let the analyses run over night. > > Is there a way that I can use gc() or some other function more > efficiently rather than having to close and reopen R at each iteration? > > I'm using Windows XP and r 2.6.1 > > Harold > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595