Michael Cassin
2007-Aug-09 17:15 UTC
[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
Hi,

I've been having similar experiences and haven't been able to substantially improve the efficiency using the guidance in the I/O Manual.

Could anyone advise on how to improve the following scan()? It is not based on my real file; please assume that I do need to read in characters and can't do any pre-processing of the file, etc.

## Create Sample File
write.csv(matrix(as.character(1:1e6),ncol=10,byrow=TRUE),"big.csv",row.names=FALSE)
q()

**New Session**
#R
system("ls -l big.csv")
system("free -m")
big1<-matrix(scan("big.csv",sep=",",what=character(0),skip=1,n=1e6),ncol=10,byrow=TRUE)
system("free -m")

The file is approximately 9MB, but reading it in uses approximately 50-60MB.

object.size(big1) is 56MB, or 56 bytes per string, which seems excessive.

Regards, Mike

Configuration info:

> sessionInfo()
R version 2.5.1 (2007-06-27)
x86_64-redhat-linux-gnu

locale:
C

attached base packages:
[1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
[7] "base"

# uname -a
Linux ***.com 2.6.9-023stab044.4-smp #1 SMP Thu May 24 17:20:37 MSD 2007 x86_64 x86_64 x86_64 GNU/Linux

====== Quoted Text ======
From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Tue, 26 Jun 2007 17:53:28 +0100 (BST)

The R Data Import/Export Manual points out several ways in which you can use read.csv more efficiently.

On Tue, 26 Jun 2007, ivo welch wrote:

> dear R experts:
>
> I am of course no R expert, but use it regularly. I thought I would
> share some experimentation with memory use. I run a linux machine
> with about 4GB of memory, and R 2.5.0.
>
> Upon startup, gc() reports
>
>          used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 268755 14.4     407500 21.8   350000 18.7
> Vcells 139137  1.1     786432  6.0   444750  3.4
>
> This is my baseline. linux 'top' reports 48MB as baseline. This
> includes some of my own routines that are always loaded. Good.
>
> Next, I created an s.csv file with 22 variables and 500,000
> observations, taking up 115MB of uncompressed disk space. The
> resulting object.size() after a read.csv() is 84,002,712 bytes (80MB).
>
>> s= read.csv("s.csv");
>> object.size(s);
>
> [1] 84002712
>
> Here is where things get more interesting. After the read.csv() is
> finished, gc() reports
>
>            used (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells   270505 14.5    8349948 446.0 11268682 601.9
> Vcells 10639515 81.2   34345544 262.1 42834692 326.9
>
> I was a bit surprised by this---R had 928MB of memory in use at its
> peak. More interestingly, this is also similar to what linux 'top'
> reports as memory use of the R process (919MB, probably 1024 vs. 1000
> B/MB), even after the read.csv() is finished and gc() has been run.
> Nothing seems to have been released back to the OS.
>
> Now,
>
>> rm(s)
>> gc()
>          used (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells 270541 14.5    6679958 356.8 11268755 601.9
> Vcells 139481  1.1   27476536 209.7 42807620 326.6
>
> linux 'top' now reports 650MB of memory use (though R itself uses only
> 15.6Mb). My guess is that it leaves the trigger memory of 567MB plus
> the base 48MB.
>
> There are two interesting observations for me here: first, to read a
> .csv file, I need to have at least 10-15 times as much memory as the
> file that I want to read---a lot more than the factor of 3-4 that I
> had expected. The moral is that IF R can read a .csv file, one need
> not worry too much about running into memory constraints later on.
> {R Developers---reducing read.csv's memory requirement a little would
> be nice. Of course, you have more than enough on your plate already.}
>
> Second, memory is not returned fully to the OS. This is not
> necessarily a bad thing, but good to know.
>
> Hope this helps...
>
> Sincerely,
>
> /iaw

--
Brian D. Ripley, ripley_at_stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
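As a rough cross-check on the "56 bytes per string" figure above, the per-element cost of short strings in R can be measured directly, independent of scan(). A minimal sketch, assuming a 64-bit build of R like the one shown in the sessionInfo() above:

x <- as.character(1:1e6)                 # the same million short strings
object.size(x)                           # total size of the character vector
as.numeric(object.size(x)) / length(x)   # approximate bytes per string

Each element carries an internal R string object (header plus character data) in addition to the 8-byte pointer stored in the vector itself, so a figure in the region of 50-60 bytes per short string on this platform is plausible rather than something specific to scan().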
Gabor Grothendieck
2007-Aug-09 17:33 UTC
[R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
If we add quote = FALSE to the write.csv statement, it's twice as fast to read it in.

On 8/9/07, Michael Cassin <michael at cassin.name> wrote:
> Hi,
>
> I've been having similar experiences and haven't been able to
> substantially improve the efficiency using the guidance in the I/O
> Manual.
>
> Could anyone advise on how to improve the following scan()? It is not
> based on my real file; please assume that I do need to read in
> characters and can't do any pre-processing of the file, etc.
>
> ## Create Sample File
> write.csv(matrix(as.character(1:1e6),ncol=10,byrow=TRUE),"big.csv",row.names=FALSE)
> q()
>
> **New Session**
> #R
> system("ls -l big.csv")
> system("free -m")
> big1<-matrix(scan("big.csv",sep=",",what=character(0),skip=1,n=1e6),ncol=10,byrow=TRUE)
> system("free -m")
>
> The file is approximately 9MB, but reading it in uses approximately 50-60MB.
>
> object.size(big1) is 56MB, or 56 bytes per string, which seems excessive.
>
> Regards, Mike
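Gabor's quote = FALSE suggestion can be checked with a small timing comparison. A minimal sketch (the file names below are invented for illustration; absolute timings will vary by machine):

m <- matrix(as.character(1:1e6), ncol = 10, byrow = TRUE)
write.csv(m, "big_quoted.csv", row.names = FALSE)                   # default: every field quoted
write.csv(m, "big_unquoted.csv", row.names = FALSE, quote = FALSE)  # no quoting characters written
system.time(scan("big_quoted.csv", sep = ",", what = character(0), skip = 1, n = 1e6))
system.time(scan("big_unquoted.csv", sep = ",", what = character(0), skip = 1, n = 1e6))

With quote = FALSE the file is smaller and scan() has no quote characters to strip from each field, which is presumably where the factor-of-two difference comes from. Passing quote = "" to scan() as well, so that quote processing is skipped entirely, might help a little more, though that is not measured here.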