I have a program which reads in a very large data set, performs some analyses, and then repeats this process with another data set. As soon as the first set of analyses is complete, I remove the very large object and clean up to try and make memory available in order to run the second set of analyses. The process looks something like this:

1) read in data set 1 and perform analyses
rm(list=ls())
gc()
2) read in data set 2 and perform analyses
rm(list=ls())
gc()
...

But it appears that I am not making the memory that was consumed in step 1 available back to the OS, as R complains that it cannot allocate a vector of size X when the process tries to repeat in step 2.

So I close and reopen R and then drop in the code to run the second analysis. When this is done, I close and reopen R and run the third analysis.

This is terribly inefficient. Instead I would rather just source in the R code and let the analyses run overnight.

Is there a way that I can use gc() or some other function more efficiently rather than having to close and reopen R at each iteration?

I'm using Windows XP and R 2.6.1

Harold
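For concreteness, a minimal sketch of the pattern described above, with hypothetical file names; read.table() and the commented line stand in for the actual reading and analysis code:

files <- c("dataset1.txt", "dataset2.txt", "dataset3.txt")  # hypothetical names

for (f in files) {
  dat <- read.table(f, header = TRUE)  # read in the very large data set
  ## ... perform the analyses on 'dat' ...
  rm(dat)   # remove the large object (rm(list = ls()) here would also drop 'files' and 'f')
  gc()      # ask R to collect the freed memory
}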
On 4 February 2008 at 20:45, Doran, Harold wrote:
| I have a program which reads in a very large data set, performs some analyses,
| and then repeats this process with another data set. As soon as the first set of
| analyses are complete, I remove the very large object and clean up to try and
| make memory available in order to run the second set of analyses. The process
| looks something like this:
|
| 1) read in data set 1 and perform analyses
| rm(list=ls())
| gc()
| 2) read in data set 2 and perform analyses
| rm(list=ls())
| gc()
| ...
|
| But, it appears that I am not making the memory that was consumed in step 1
| available back to the OS as R complains that it cannot allocate a vector of size
| X as the process tries to repeat in step 2.
|
| So, I close and reopen R and then drop in the code to run the second analysis.
| When this is done, I close and reopen R and run the third analysis.
|
| This is terribly inefficient. Instead I would rather just source in the R code
| and let the analyses run over night.
|
| Is there a way that I can use gc() or some other function more efficiently
| rather than having to close and reopen R at each iteration?
I haven't found one.

Every (trading) day I process batches of data with R, and the only reliable way
I have found is to use fresh R sessions. Otherwise, the fragmented memory will
eventually result in the all-too-familiar 'cannot allocate X mb' for rather
small values of X relative to my total RAM. C'est la vie.

As gc() seems to help somewhat yet not 'sufficiently', fresh starts are an
alternative help, and Rscript starts faster than the main R. Now, I happen to
be partial to littler [1] which starts even faster, so I use that (on Linux;
I am not sure if it can be built on Windows, as we embed R directly and hence
start faster than Rscript). But either one should help you with some batch
files -- giving you a way to run overnight. And once you start batching
things, it is only a small step to regain efficiency by parallel execution
using something like MPI or NWS.
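One possible shape for such a batch setup (a sketch only, with hypothetical file names, not Dirk's actual scripts): put each analysis in a self-contained script, pass the data file as a command-line argument, and launch it once per data set with Rscript (littler works similarly), so every run gets a fresh R process whose memory is returned to the OS when it exits.

## analysis.R -- hypothetical per-data-set script, run as e.g.
##   Rscript analysis.R dataset1.txt
args <- commandArgs(trailingOnly = TRUE)    # data file name passed on the command line
dat  <- read.table(args[1], header = TRUE)  # hypothetical input format
## ... perform the analyses on 'dat' and write the results to disk ...
## No cleanup needed: the process exits and all its memory goes back to the OS.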
Hth, Dirk
[1] littler is the predecessor to Rscript by Jeff and myself. See either
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/LittleR
or
http://dirk.eddelbuettel.com/code/littler.html
for more on littler and feel free to email us.
--
Three out of two people have difficulties with fractions.
1) See ?"Memory-limits": it is almost certainly memory fragmentation. You
don't need to give the memory back to the OS (and few OSes actually do so).

2) I've never seen this running a 64-bit version of R.

3) You can easily write a script to do this. Indeed, you could write an R
script to run multiple R scripts in separate processes in turn (via
system("Rscript fileN.R")). For example, Uwe Ligges uses R to script
building and testing of packages on Windows.
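A minimal sketch of the idea in point 3, with hypothetical script names (it assumes Rscript is on the PATH):

## Master script: run each analysis in its own R process, one after the other,
## so every analysis starts with a clean, unfragmented memory space.
scripts <- c("file1.R", "file2.R", "file3.R")   # hypothetical file names
for (s in scripts) {
  status <- system(paste("Rscript", s))         # separate process per script
  if (status != 0) warning("non-zero exit status from ", s)
}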
On Mon, 4 Feb 2008, Doran, Harold wrote:
> I have a program which reads in a very large data set, performs some
> analyses, and then repeats this process with another data set. As soon
> as the first set of analyses are complete, I remove the very large
> object and clean up to try and make memory available in order to run the
> second set of analyses. The process looks something like this:
>
> 1) read in data set 1 and perform analyses
> rm(list=ls())
> gc()
> 2) read in data set 2 and perform analyses
> rm(list=ls())
> gc()
> ...
>
> But, it appears that I am not making the memory that was consumed in
> step 1 available back to the OS as R complains that it cannot allocate a
> vector of size X as the process tries to repeat in step 2.
>
> So, I close and reopen R and then drop in the code to run the second
> analysis. When this is done, I close and reopen R and run the third
> analysis.
>
> This is terribly inefficient. Instead I would rather just source in the
> R code and let the analyses run over night.
>
> Is there a way that I can use gc() or some other function more
> efficiently rather than having to close and reopen R at each iteration?
>
> I'm using Windows XP and r 2.6.1
>
> Harold
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595