babyfoxlove1 at sina.com
2010-Jul-23 16:10 UTC
[R] How to deal with more than 6GB dataset using R?
Hi there,

Sorry to bother those who are not interested in this problem.

I'm dealing with a large data set, a file of more than 6 GB, and running regressions on those data. I was wondering whether there are any more efficient ways to read the data than just using read.table()? BTW, I'm using a 64-bit desktop and a 64-bit version of R, and the desktop has enough memory for me to use.

Thanks.

--Gin
On 23/07/2010 12:10 PM, babyfoxlove1 at sina.com wrote:
> I was wondering whether there are any more efficient ways to read the data than just using read.table()?

You probably won't get much faster than read.table with all of the colClasses specified. It will be a lot slower if you leave that at the default NA setting, because then R needs to figure out the types by reading the values as character and examining all of them.

If the file is very consistently structured (e.g. the same number of characters in every value in every row) you might be able to write a C function to read it faster, but I'd guess the time spent writing that would be a lot more than the time saved.

Duncan Murdoch
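For reference, Duncan's colClasses suggestion might look something like the sketch below; the file name, separator, and column types are invented for illustration:

## A sketch of read.table with colClasses specified (hypothetical file
## name, separator, and column types).  Declaring the types up front
## avoids the read-as-character-and-guess pass described above.
dat <- read.table(
  "bigdata.txt",                                   # hypothetical path
  header       = TRUE,
  sep          = "\t",
  colClasses   = c("integer", "numeric", "numeric", "factor"),
  comment.char = "",                               # skip comment scanning
  nrows        = 40e6                              # rough upper bound on rows helps allocation
)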
Allan Engelhardt
2010-Jul-23 16:39 UTC
[R] How to deal with more than 6GB dataset using R?
read.table is not very inefficient IF you specify the colClasses= parameter. scan (with the what= parameter) is probably a little more efficient.

In either case, save the data using save() once you have it in the right structure, and it will be much more efficient to read next time. (In fact I often exit R at this stage and restart it with the .RData file before I start the analysis, to clear out the memory.)

I did a lot of testing on the types of (large) data structures I normally work with and found that

options("save.defaults" = list(compress = "bzip2", compression_level = 6, ascii = FALSE))

gave me the best trade-off between size and speed. Your mileage will undoubtedly vary, but if you do this a lot it may be worth getting hard data for your setup.

Hope this helps a little.

Allan

On 23/07/10 17:10, babyfoxlove1 at sina.com wrote:
> I was wondering whether there are any more efficient ways to read the data than just using read.table()?
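A sketch of the scan()/save() workflow Allan describes; the file name, what= template, and object names are assumptions for illustration:

## scan() with an explicit what= template reads typed columns directly;
## save() then writes a binary .RData that reloads quickly.
options(save.defaults = list(compress = "bzip2",
                             compression_level = 6,
                             ascii = FALSE))

cols <- scan("bigdata.txt",                        # hypothetical path
             what = list(id = integer(), x = numeric(), y = numeric()),
             sep = "\t", skip = 1)                 # skip = 1 if the file has a header row
dat  <- as.data.frame(cols)

save(dat, file = "bigdata.RData")                  # reload later with load("bigdata.RData")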
You may want to look at the biglm package as another way to fit regression models on very large data sets.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org On Behalf Of babyfoxlove1 at sina.com
> Sent: Friday, July 23, 2010 10:10 AM
> Subject: [R] How to deal with more than 6GB dataset using R?
>
> I was wondering whether there are any more efficient ways to read the data than just using read.table()?
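A rough sketch of what a chunked biglm fit can look like; the file name, column names, chunk size, and model formula here are all invented, not part of Greg's post:

## Fit on the first block, then fold the remaining blocks into the fit
## with update().  Everything file-specific (path, columns, formula)
## is hypothetical.
library(biglm)

cls  <- c("numeric", "numeric", "numeric")
nms  <- c("y", "x1", "x2")
con  <- file("bigdata.txt", open = "r")
invisible(readLines(con, n = 1))                   # discard the header line

chunk <- read.table(con, nrows = 1e6, header = FALSE,
                    col.names = nms, colClasses = cls)
fit <- biglm(y ~ x1 + x2, data = chunk)

repeat {
  chunk <- tryCatch(read.table(con, nrows = 1e6, header = FALSE,
                               col.names = nms, colClasses = cls),
                    error = function(e) NULL)      # read.table errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)                        # add this chunk to the running fit
}
close(con)
summary(fit)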
Jens Oehlschlägel
2010-Jul-28 08:51 UTC
[R] How to deal with more than 6GB dataset using R?
Matthew,

You might want to look at the function read.table.ffdf in the ff package, which can read large csv files in chunks and store the result in a binary format on disk that can be quickly accessed from R. ff allows you to access complete columns (returned as a vector or array) or subsets of the data identified by row positions (and a column selection, returned as a data.frame).

As Jim pointed out: it all depends on what you are doing with the data. If you want to access subsets not by row position but rather by search conditions, you are better off with an indexed database.

Please let me know if you write a fast read.fwf.ffdf - we would be happy to include it in the ff package.

Jens
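A minimal sketch of that route; the file name, column types, and chunk size are assumptions, not part of Jens's post:

## Read a large delimited file into an on-disk ffdf, then pull whole
## columns or row subsets back into RAM as needed.
library(ff)

big <- read.table.ffdf(file       = "bigdata.txt",  # hypothetical path
                       header     = TRUE,
                       sep        = "\t",
                       colClasses = c("integer", "numeric", "numeric"),
                       next.rows  = 500000)         # rows read per chunk

y_all <- big$y[]           # a complete column, materialised as an ordinary vector
big[1:10, ]                # a row subset, returned as a data.frame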
I tried several ways:

1. I used the scan() function; it can read the 6 GB file into memory without difficulty, it just took some time. But just reading it into memory was definitely not enough. When I moved to the next step, which was to plot() and then try to build the nonlinear regression model, it got stuck at the plot() part because it had already reached the memory limit, even though I have a 64-bit system and a huge amount of memory.

2. I tried the bigmemory package. It can read the dataset into memory as well, but since it stores the data as a big.matrix, the normal functions such as nls(), plot()... cannot work on it directly; that is the problem.

What should I do then? Or do I need to change to SAS? I believe there are a lot of people who are dealing with large datasets; what did you do in this situation?

Thanks.

2010/7/24 <babyfoxlove1@sina.com>
> You may want to look at the biglm package as another way to fit regression models on very large data sets.
>
> --
> Gregory (Greg) L. Snow Ph.D.

--
Best,
Jing Li
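Not an answer from the thread, but one common workaround consistent with the suggestions above is to plot and prototype on a random sample of rows rather than the full 6 GB; a rough sketch, with the object name, column names, and model formula all invented for illustration:

## Plot a random sample of rows instead of all of them; 'dat' is assumed
## to be a data.frame already read into memory (e.g. via scan()).
set.seed(1)
idx <- sample.int(nrow(dat), size = 1e5)           # 100,000 points is plenty for a scatterplot

plot(dat$x[idx], dat$y[idx], pch = ".", xlab = "x", ylab = "y")

## nls() on the sample gives starting values (and a sanity check) cheaply;
## whether the full-data fit is feasible still depends on available memory.
fit_sample <- nls(y ~ a * exp(b * x), data = dat[idx, ],
                  start = list(a = 1, b = 0.1))
coef(fit_sample)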