babyfoxlove1 at sina.com
2010-Jul-23 16:10 UTC
[R] How to deal with more than 6GB dataset using R?
Hi there,

Sorry to bother those who are not interested in this problem.

I'm dealing with a large data set, a file of more than 6 GB, and running regressions on those data. I was wondering whether there are any more efficient ways to read the data than just using read.table()? BTW, I'm using a 64-bit desktop and a 64-bit version of R, and the desktop has enough memory for me to use.

Thanks.

--Gin
On 23/07/2010 12:10 PM, babyfoxlove1 at sina.com wrote:
> I was wondering whether there are any more efficient ways to read the data than just using read.table()?

You probably won't get much faster than read.table with all of the colClasses specified. It will be a lot slower if you leave that at the default NA setting, because then R needs to figure out the types by reading the values as character and examining all of them.

If the file is very consistently structured (e.g. the same number of characters in every value in every row) you might be able to write a C function to read it faster, but I'd guess the time spent writing that would be a lot more than the time saved.

Duncan Murdoch
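For reference, Duncan's colClasses suggestion might look something like the sketch below; the file name, separator, and column types are invented for illustration:

## A sketch of read.table with colClasses specified (hypothetical file
## name, separator, and column types).  Declaring the types up front
## avoids the read-as-character-and-guess pass described above.
dat <- read.table(
  "bigdata.txt",                                   # hypothetical path
  header       = TRUE,
  sep          = "\t",
  colClasses   = c("integer", "numeric", "numeric", "factor"),
  comment.char = "",                               # skip comment scanning
  nrows        = 40e6                              # rough upper bound on rows helps allocation
)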
Allan Engelhardt
2010-Jul-23 16:39 UTC
[R] How to deal with more than 6GB dataset using R?
read.table is not very inefficient IF you specify the colClasses= parameter. scan (with the what= parameter) is probably a little more efficient.

In either case, save the data using save() once you have it in the right structure, and it will be much more efficient to read next time. (In fact I often exit R at this stage and restart it with the .RData file before I start the analysis, to clear out the memory.)

I did a lot of testing on the types of (large) data structures I normally work with and found that

options("save.defaults" = list(compress = "bzip2", compression_level = 6, ascii = FALSE))

gave me the best trade-off between size and speed. Your mileage will undoubtedly vary, but if you do this a lot it may be worth getting hard data for your setup.

Hope this helps a little.

Allan

On 23/07/10 17:10, babyfoxlove1 at sina.com wrote:
> I was wondering whether there are any more efficient ways to read the data than just using read.table()?
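A sketch of the scan()/save() workflow Allan describes; the file name, what= template, and object names are assumptions for illustration:

## scan() with an explicit what= template reads typed columns directly;
## save() then writes a binary .RData that reloads quickly.
options(save.defaults = list(compress = "bzip2",
                             compression_level = 6,
                             ascii = FALSE))

cols <- scan("bigdata.txt",                        # hypothetical path
             what = list(id = integer(), x = numeric(), y = numeric()),
             sep = "\t", skip = 1)                 # skip = 1 if the file has a header row
dat  <- as.data.frame(cols)

save(dat, file = "bigdata.RData")                  # reload later with load("bigdata.RData")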
You may want to look at the biglm package as another way to fit regression models on very large data sets.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org On Behalf Of babyfoxlove1 at sina.com
> Sent: Friday, July 23, 2010 10:10 AM
> Subject: [R] How to deal with more than 6GB dataset using R?
>
> I was wondering whether there are any more efficient ways to read the data than just using read.table()?
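A rough sketch of what a chunked biglm fit can look like; the file name, column names, chunk size, and model formula here are all invented, not part of Greg's post:

## Fit on the first block, then fold the remaining blocks into the fit
## with update().  Everything file-specific (path, columns, formula)
## is hypothetical.
library(biglm)

cls  <- c("numeric", "numeric", "numeric")
nms  <- c("y", "x1", "x2")
con  <- file("bigdata.txt", open = "r")
invisible(readLines(con, n = 1))                   # discard the header line

chunk <- read.table(con, nrows = 1e6, header = FALSE,
                    col.names = nms, colClasses = cls)
fit <- biglm(y ~ x1 + x2, data = chunk)

repeat {
  chunk <- tryCatch(read.table(con, nrows = 1e6, header = FALSE,
                               col.names = nms, colClasses = cls),
                    error = function(e) NULL)      # read.table errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)                        # add this chunk to the running fit
}
close(con)
summary(fit)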
Jens Oehlschlägel
2010-Jul-28 08:51 UTC
[R] How to deal with more than 6GB dataset using R?
Matthew,

You might want to look at the function read.table.ffdf in the ff package, which can read large csv files in chunks and store the result in a binary format on disk that can be quickly accessed from R. ff allows you to access complete columns (returned as a vector or array) or subsets of the data identified by row positions (and a column selection, returned as a data.frame).

As Jim pointed out: it all depends on what you are doing with the data. If you want to access subsets not by row position but rather by search conditions, you are better off with an indexed database.

Please let me know if you write a fast read.fwf.ffdf - we would be happy to include it in the ff package.

Jens
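A minimal sketch of that route; the file name, column types, and chunk size are assumptions, not part of Jens's post:

## Read a large delimited file into an on-disk ffdf, then pull whole
## columns or row subsets back into RAM as needed.
library(ff)

big <- read.table.ffdf(file       = "bigdata.txt",  # hypothetical path
                       header     = TRUE,
                       sep        = "\t",
                       colClasses = c("integer", "numeric", "numeric"),
                       next.rows  = 500000)         # rows read per chunk

y_all <- big$y[]           # a complete column, materialised as an ordinary vector
big[1:10, ]                # a row subset, returned as a data.frame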
I tried several ways:

1. I used the scan() function; it can read the 6 GB file into memory without difficulty, it just took some time. But just reading it into memory was definitely not enough. When I moved to the next step, which was to plot() and then try to build the nonlinear regression model, it got stuck at the plot() part because it had already reached the memory limit, even though I have a 64-bit system and a huge amount of memory.

2. I tried the bigmemory package. It can read the dataset into memory as well, but since it stores the data as a big.matrix, the normal functions such as nls(), plot()... cannot work on it directly; that is the problem.

What should I do then? Or do I need to change to SAS? I believe there are a lot of people who are dealing with large datasets; what did you do in this situation?

Thanks.

2010/7/24 <babyfoxlove1@sina.com>
> You may want to look at the biglm package as another way to fit regression models on very large data sets.
>
> --
> Gregory (Greg) L. Snow Ph.D.

--
Best,
Jing Li
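Not an answer from the thread, but one common workaround consistent with the suggestions above is to plot and prototype on a random sample of rows rather than the full 6 GB; a rough sketch, with the object name, column names, and model formula all invented for illustration:

## Plot a random sample of rows instead of all of them; 'dat' is assumed
## to be a data.frame already read into memory (e.g. via scan()).
set.seed(1)
idx <- sample.int(nrow(dat), size = 1e5)           # 100,000 points is plenty for a scatterplot

plot(dat$x[idx], dat$y[idx], pch = ".", xlab = "x", ylab = "y")

## nls() on the sample gives starting values (and a sanity check) cheaply;
## whether the full-data fit is feasible still depends on available memory.
fit_sample <- nls(y ~ a * exp(b * x), data = dat[idx, ],
                  start = list(a = 1, b = 0.1))
coef(fit_sample)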