Michael Cassin
2007-Aug-09 17:15 UTC
[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
Hi,

I've been having similar experiences and haven't been able to substantially improve the efficiency using the guidance in the I/O Manual.

Could anyone advise on how to improve the following scan()? It is not based on my real file; please assume that I do need to read in characters and can't do any pre-processing of the file, etc.

## Create Sample File
write.csv(matrix(as.character(1:1e6),ncol=10,byrow=TRUE),"big.csv",row.names=FALSE)
q()

**New Session**
#R
system("ls -l big.csv")
system("free -m")
big1<-matrix(scan("big.csv",sep=",",what=character(0),skip=1,n=1e6),ncol=10,byrow=TRUE)
system("free -m")

The file is approximately 9MB, but reading it in uses approximately 50-60MB.

object.size(big1) is 56MB, or 56 bytes per string, which seems excessive.

Regards, Mike

Configuration info:

> sessionInfo()
R version 2.5.1 (2007-06-27)
x86_64-redhat-linux-gnu

locale:
C

attached base packages:
[1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
[7] "base"

# uname -a
Linux ***.com 2.6.9-023stab044.4-smp #1 SMP Thu May 24 17:20:37 MSD 2007 x86_64 x86_64 x86_64 GNU/Linux

====== Quoted Text ======
From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Tue, 26 Jun 2007 17:53:28 +0100 (BST)

The R Data Import/Export Manual points out several ways in which you can use read.csv more efficiently.

On Tue, 26 Jun 2007, ivo welch wrote:

> dear R experts:
>
> I am of course no R expert, but use it regularly. I thought I would
> share some experimentation with memory use. I run a linux machine
> with about 4GB of memory, and R 2.5.0.
>
> Upon startup, gc() reports
>
>          used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 268755 14.4     407500 21.8   350000 18.7
> Vcells 139137  1.1     786432  6.0   444750  3.4
>
> This is my baseline. linux 'top' reports 48MB as baseline. This
> includes some of my own routines that are always loaded. Good.
>
> Next, I created an s.csv file with 22 variables and 500,000
> observations, taking up 115MB of uncompressed disk space. The
> resulting object.size() after a read.csv() is 84,002,712 bytes (80MB).
>
>> s= read.csv("s.csv");
>> object.size(s);
>
> [1] 84002712
>
> Here is where things get more interesting. After the read.csv() is
> finished, gc() reports
>
>            used (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells   270505 14.5    8349948 446.0 11268682 601.9
> Vcells 10639515 81.2   34345544 262.1 42834692 326.9
>
> I was a bit surprised by this---R had 928MB of memory in use at its
> peak. More interestingly, this is also similar to what linux 'top'
> reports as memory use of the R process (919MB, probably 1024 vs. 1000
> B/MB), even after the read.csv() is finished and gc() has been run.
> Nothing seems to have been released back to the OS.
>
> Now,
>
>> rm(s)
>> gc()
>          used (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells 270541 14.5    6679958 356.8 11268755 601.9
> Vcells 139481  1.1   27476536 209.7 42807620 326.6
>
> linux 'top' now reports 650MB of memory use (though R itself uses only
> 15.6Mb). My guess is that it leaves the trigger memory of 567MB plus
> the base 48MB.
>
> There are two interesting observations for me here: first, to read a
> .csv file, I need to have at least 10-15 times as much memory as the
> file that I want to read---a lot more than the factor of 3-4 that I
> had expected. The moral is that IF R can read a .csv file, one need
> not worry too much about running into memory constraints later on.
> {R Developers---reducing read.csv's memory requirement a little would
> be nice. Of course, you have more than enough on your plate already.}
>
> Second, memory is not returned fully to the OS. This is not
> necessarily a bad thing, but good to know.
>
> Hope this helps...
>
> Sincerely,
>
> /iaw

--
Brian D. Ripley, ripley_at_stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
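As a rough cross-check on the "56 bytes per string" figure above, the per-element cost of short strings in R can be measured directly, independent of scan(). A minimal sketch, assuming a 64-bit build of R like the one shown in the sessionInfo() above:

x <- as.character(1:1e6)                 # the same million short strings
object.size(x)                           # total size of the character vector
as.numeric(object.size(x)) / length(x)   # approximate bytes per string

Each element carries an internal R string object (header plus character data) in addition to the 8-byte pointer stored in the vector itself, so a figure in the region of 50-60 bytes per short string on this platform is plausible rather than something specific to scan().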
Gabor Grothendieck
2007-Aug-09 17:33 UTC
[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
If we add quote = FALSE to the write.csv statement, it's twice as fast to read it in.

On 8/9/07, Michael Cassin <michael at cassin.name> wrote:
> Hi,
>
> I've been having similar experiences and haven't been able to
> substantially improve the efficiency using the guidance in the I/O
> Manual.
>
> Could anyone advise on how to improve the following scan()? It is not
> based on my real file; please assume that I do need to read in
> characters and can't do any pre-processing of the file, etc.
>
> ## Create Sample File
> write.csv(matrix(as.character(1:1e6),ncol=10,byrow=TRUE),"big.csv",row.names=FALSE)
> q()
>
> **New Session**
> #R
> system("ls -l big.csv")
> system("free -m")
> big1<-matrix(scan("big.csv",sep=",",what=character(0),skip=1,n=1e6),ncol=10,byrow=TRUE)
> system("free -m")
>
> The file is approximately 9MB, but reading it in uses approximately 50-60MB.
>
> object.size(big1) is 56MB, or 56 bytes per string, which seems excessive.
>
> Regards, Mike
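Gabor's quote = FALSE suggestion can be checked with a small timing comparison. A minimal sketch (the file names below are invented for illustration; absolute timings will vary by machine):

m <- matrix(as.character(1:1e6), ncol = 10, byrow = TRUE)
write.csv(m, "big_quoted.csv", row.names = FALSE)                   # default: every field quoted
write.csv(m, "big_unquoted.csv", row.names = FALSE, quote = FALSE)  # no quoting characters written
system.time(scan("big_quoted.csv", sep = ",", what = character(0), skip = 1, n = 1e6))
system.time(scan("big_unquoted.csv", sep = ",", what = character(0), skip = 1, n = 1e6))

With quote = FALSE the file is smaller and scan() has no quote characters to strip from each field, which is presumably where the factor-of-two difference comes from. Passing quote = "" to scan() as well, so that quote processing is skipped entirely, might help a little more, though that is not measured here.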