Hello everyone,

I would like to get some advice on using R with some really large datasets.

I'm using R 1.8.1 on RH9 Linux for research involving a lot of numerical data. The datasets total around 200Mb (as shown by memory.size). During my data manipulation, system memory usage grew to 1.5Gb, which caused a lot of swapping on my 1Gb PC. This is just a small-scale experiment; the full-scale one will use data 30 times as large (on a 4Gb machine), so I can see that I'll need to deal with the memory usage problem very soon.

I notice that R keeps all datasets in memory at all times. Is there any way to instruct R to push some of the less-frequently-used data tables out of main memory, so as to free up memory for those that are actively in use? It would be even better if R could keep only part of a table in memory, and only when that part is needed. Using save & load could help, but I wonder whether R is intelligent enough to do this by itself, so that I don't need to keep track of memory usage at all times.

Another thought is to use a 64-bit machine (AMD64). I see there is a pre-compiled R for Fedora Linux on AMD64. Does anyone know whether this version of R runs as a 64-bit application? If so, will R be able to go beyond the 32-bit 4Gb memory limit?

Also, from the manual, I see that the RPgSQL package (for the PostgreSQL database) supports a "proxy data frame" feature. Does anyone have experience with this? Can a proxy data frame handle memory efficiently for very large datasets? Say, if I have a 6Gb database table defined as a proxy data frame, will R & RPgSQL be able to handle it with just 4Gb of memory?

Any comments will be useful. Many thanks.

Sunny Ho
(Hong Kong University of Science & Technology)
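(For illustration, a minimal sketch of the manual save/rm/load approach mentioned above; the object name big.table and the file names are placeholders, not anything from the thread:)

    big.table <- read.table("big_table.dat", header = TRUE)  # hypothetical large table

    save(big.table, file = "big_table.RData")  # write the object to disk
    rm(big.table)                              # drop it from the workspace
    gc()                                       # trigger garbage collection so the space can be reused

    ## ... work with other, smaller objects here ...

    load("big_table.RData")                    # bring the table back only when it is needed again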
As far as I know, R does compile on AMD Opterons and runs as a 64-bit application, so it can store objects larger than 4GB. However, I don't think R gets tested very often on 64-bit machines with such large objects, so there may be as-yet-undiscovered bugs.

-roger

Sunny Ho wrote:
> [...]
I was under the impression that R has been run on 64-bit Solaris (and other 64-bit Unices) as a 64-bit application for quite a while. We've been running 64-bit R on AMD64 for a few months (and have had quite a few opportunities to get R processes using over 8GB of RAM). Not much problem as far as I can see...

Best,
Andy

> From: Roger D. Peng
>
> [...]
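(A quick way to confirm that a given R build really is 64-bit is to look at the pointer size R reports; this is a minimal sketch, and the exact arch string may differ between builds:)

    .Machine$sizeof.pointer   # 8 on a 64-bit build, 4 on a 32-bit one
    R.version$arch            # e.g. "x86_64" on an AMD64 build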
On a dual Opteron 244 with 16GB ram, and

[andy at leo:cb1]% free
             total       used       free     shared    buffers     cached
Mem:      16278648   14552676    1725972          0     229420    3691824
-/+ buffers/cache:   10631432    5647216
Swap:      2096472      13428    2083044

... using freshly compiled R-1.9.0:

> system.time(x <- numeric(1e9))
[1]  3.60  8.09 15.11  0.00  0.00
> object.size(x)/1024^3
[1] 7.45058

Andy

> From: Peter Dalgaard
>
> "Roger D. Peng" <rpeng at jhsph.edu> writes:
>
> > I've been running R on 64-bit SuSE Linux on Opterons for a few months
> > now and it certainly runs fine in what I would call standard
> > situations. In particular there seems to be no problem with
> > workspaces > 4GB. But I seldom handle single objects (like matrices,
> > vectors) that are > 4GB. The only exception is lists, but I think
> > those are okay since they are composed of various sub-objects (like
> > Peter mentioned).
>
> I just tried, and x <- numeric(1e9) (~8GB) doesn't appear to be a
> problem, except that it takes "forever" since the machine in question
> has only 1GB of memory, and numeric() zero fills the allocated
> block...
>
> --
>    O__  ---- Peter Dalgaard             Blegdamsvej 3
>   c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
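(As an aside, the object.size() figure above is what one would expect from 1e9 doubles at 8 bytes each; a quick sanity check:)

    1e9 * 8 / 1024^3   # [1] 7.450581, i.e. roughly 7.45 GB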
Thank you guys for sharing your experiences with 64-bit R. They are very helpful for my planning work.

I wonder whether anyone has experience using a database interface with R. Is there a "preferred" choice, or any "hidden catch"? In our setup, we may be using MS SQL Server or Oracle to keep the data. I know that RODBC can talk to them directly. Could there be any 32-bit/64-bit compatibility issues, say, when a 32-bit Oracle is talking to a 64-bit R? Performance-wise, when used with R, how do MySQL or PostgreSQL compare to MS SQL Server or Oracle?

Any comments will be helpful. Thanks.

Sunny Ho
(Hong Kong University of Science & Technology)
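(For illustration, a minimal sketch of fetching a large table in chunks through RODBC rather than pulling it into memory at once. The DSN name "mydsn", the table name "bigtable", and the chunk size are placeholders, and the end-of-results handling may need checking against the RODBC documentation:)

    library(RODBC)

    ch <- odbcConnect("mydsn")                   # ODBC data source for SQL Server / Oracle
    odbcQuery(ch, "SELECT * FROM bigtable")      # submit the query without fetching rows yet

    repeat {
        chunk <- sqlGetResults(ch, max = 10000)  # fetch the next 10,000 rows
        if (!is.data.frame(chunk) || nrow(chunk) == 0)
            break                                # no more rows (or an error code): stop
        ## ... process 'chunk' here, e.g. accumulate running summaries ...
    }

    odbcClose(ch)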