Dear R-list,

Does somebody know how I can read a HUGE data set using R? It is a HapMap data set (txt format) which is around 4 GB. After reading it, I need to delete some specific rows and columns. I'm running R 2.6.2 patched on XP SP2, using a 2.4 GHz Core 2 Duo processor and 4 GB RAM. Any suggestion would be appreciated.

Thanks in advance,

Jorge
Depending on how many rows you will delete, and if you know in advance which ones they are, one approach is to use the "skip" argument of read.table. If you only need a fraction of the total number of rows, this will save a lot of RAM.

Mark

Mark W. Kimpel MD  ** Neuroinformatics **  Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)
mwkimpel<at>gmail<dot>com
******************************************************************
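To illustrate the skip/nrows idea, here is a minimal sketch; the file name "hapmap.txt" and the row counts are made up, and it assumes a whitespace-delimited file with a plain header line:

    ## grab the column names from the header line only
    hdr <- read.table("hapmap.txt", header = TRUE, nrows = 1)
    ## read a 100,000-row block, skipping the header plus the first 200,000 data rows
    block <- read.table("hapmap.txt", skip = 200001, nrows = 100000,
                        col.names = names(hdr))

Reading several such blocks and keeping only the wanted rows and columns from each keeps the peak memory use at roughly one block rather than the whole 4 GB file.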
I may be mistaken, but I believe R does all its work in memory. If that is so, you would really only have 2 options:

1. Get a lot of memory.

2. Figure out a way to do the desired operation on parts of the data at a time (a rough sketch of this follows below).

-Roy M.

**********************
"The contents of this message do not reflect any position of the U.S. Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS Environmental Research Division
Southwest Fisheries Science Center
1352 Lighthouse Avenue
Pacific Grove, CA 93950-2097
e-mail: Roy.Mendelssohn at noaa.gov (Note new e-mail address)
voice: (831)-648-9029
fax: (831)-648-8440
www: http://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
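One way to sketch option 2 in R, assuming a tab-delimited file with a header line; the file name, the chunk size, and the columns kept (1, 2 and 5) are invented for illustration:

    con <- file("hapmap.txt", open = "r")
    header <- strsplit(readLines(con, n = 1), "\t")[[1]]
    keep.cols <- c(1, 2, 5)                     # hypothetical columns to keep
    pieces <- list()
    repeat {
        ## read.table continues from the connection's current position;
        ## it signals an error once no lines are left, which we treat as EOF
        chunk <- tryCatch(read.table(con, header = FALSE, sep = "\t",
                                     nrows = 100000, col.names = header,
                                     comment.char = ""),
                          error = function(e) NULL)
        if (is.null(chunk)) break
        pieces[[length(pieces) + 1]] <- chunk[, keep.cols]
    }
    close(con)
    reduced <- do.call("rbind", pieces)         # much smaller than the raw file

This only works if the reduced result (not the raw file) fits comfortably in memory.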
On Wed, 27-Feb-2008 at 09:13PM -0800, Roy Mendelssohn wrote:

|> I may be mistaken, but I believe R does all its work in memory. If
|> that is so, you would really only have 2 options:
|>
|> 1. Get a lot of memory

But with a 32-bit operating system, 4 GB is all the memory that can be addressed (including by the operating system), so your chances of getting all the data into R seem very slim.

|> 2. Figure out a way to do the desired operation on parts of the data
|> at a time.

That might involve using a database which you can query from R, or you might be able to use a Perl script to select what you require. I have heard of people using Perl with Windows. Someone once asked me to plot some SAS output which was several hundred MB; in that case, a simple Perl script cut it down to 3 MB. You might be lucky too.

Good luck.

--
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
   ___    Patrick Connolly
 {~._.~}         Great minds discuss ideas
 _( Y )_        Middle minds discuss events
(:_~*~_:)        Small minds discuss people
 (_)-(_)                            ..... Anon
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
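If Perl (or awk) is not to hand, the same pre-filtering can be done from R itself by streaming the file line by line and writing a much smaller file that read.table can then handle. It will be slower than Perl, but it never holds the full 4 GB in memory. The file names, the tab delimiter, the kept columns, and the "rs..." row filter below are all invented for illustration:

    infile  <- file("hapmap.txt", open = "r")
    outfile <- file("hapmap_small.txt", open = "w")
    keep.cols <- c(1, 2, 5)                        # hypothetical columns to keep
    while (length(line <- readLines(infile, n = 1)) > 0) {
        fields <- strsplit(line, "\t")[[1]]
        if (substr(fields[1], 1, 2) != "rs") next  # hypothetical row filter
        writeLines(paste(fields[keep.cols], collapse = "\t"), outfile)
    }
    close(infile)
    close(outfile)
    small <- read.table("hapmap_small.txt", sep = "\t")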
Jorge Iván Vélez wrote:
> Does somebody know how I can read a HUGE data set using R? It is a HapMap
> data set (txt format) which is around 4 GB. After reading it, I need to
> delete some specific rows and columns. I'm running R 2.6.2 patched on XP SP2
> using a 2.4 GHz Core 2 Duo processor and 4 GB RAM. Any suggestion would be
> appreciated.

Hmmm... Unless you're running a 64-bit version of XP, you might be SOL (notwithstanding the astounding feats of the R Core Team, which managed to make about 3.5 GB of memory usable under 32-bit Windows): your *raw* data will eat more than the available memory. You might be lucky if some of them can be abstracted (e.g. long character strings that can be reduced to vectors), or unlucky (large R storage overhead for nonreducible data).

You might consider changing machines: get a 64-bit machine with gobs of memory and cross your fingers. Note that, since R pointers would then be 64 bits wide instead of 32, data storage needs will inflate...

Depending on the real meaning of your data and the processing they need, you might also consider storing your raw data in an SQL DBMS, reducing them in SQL, and reading into R only the relevant part(s).

There are also some contributed packages that might help in special situations: biglm, birch.

HTH,

Emmanuel Charpentier
Sounds like you want to use the filehash package, which was written for just such problems:

http://yusung.blogspot.com/2007/09/dealing-with-large-data-set-in-r.html
http://cran.r-project.org/web/packages/filehash/index.html

or maybe the ff package:

http://cran.r-project.org/web/packages/ff/index.html
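A minimal filehash sketch of the idea (the database name, the key, and the chunked read are made up; see the links above for the package's real worked examples): pieces of the data live on disk in a filehash database and are pulled back into RAM only when needed.

    library(filehash)
    dbCreate("hapmap.db")                 # one-time: create the on-disk database
    db <- dbInit("hapmap.db")
    ## store pieces of the file under their own keys, e.g. a first chunk
    chunk1 <- read.table("hapmap.txt", header = TRUE, nrows = 100000)
    dbInsert(db, "chunk1", chunk1)
    rm(chunk1)                            # the in-RAM copy is no longer needed
    ## later, fetch back only the piece you want to work on
    x <- dbFetch(db, "chunk1")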
Hi,

Jorge Iván Vélez wrote:
> Does somebody know how I can read a HUGE data set using R? It is a HapMap
> data set (txt format) which is around 4 GB. After reading it, I need to
> delete some specific rows and columns. I'm running R 2.6.2 patched on XP SP2

In such a case, I would recommend not using R at the beginning. Try to use awk [1] to cut out the correct rows and columns. If the resulting data are still very large, I would suggest reading them into a database system. My experience is limited in that respect: I have only used SQLite, but in conjunction with the RSQLite package I managed all my "big data problems".

Check http://www.ibm.com/developerworks/library/l-awk1.html to get smoothly started with awk.

I hope this helps,

Roland

[1] I think the gawk implementation offers the most options (e.g. for timing), but I recently used mawk on Windows XP and it was way faster (or was it nawk?). If you don't have experience in a language such as Perl, I'd say it is much easier to learn awk than Perl.
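A rough RSQLite sketch of that second step (the database and table names, the file name, and the column names in the query are all invented): append the already-reduced data to an on-disk SQLite table, then pull back only the rows and columns you need.

    library(RSQLite)
    con <- dbConnect(SQLite(), dbname = "hapmap.sqlite")
    ## load in manageable chunks; repeat this for each chunk of the file
    chunk <- read.table("hapmap_small.txt", header = TRUE, sep = "\t",
                        nrows = 100000)
    dbWriteTable(con, "hapmap", chunk, append = TRUE, row.names = FALSE)
    ## query back only the relevant part (hypothetical column names)
    sub <- dbGetQuery(con,
        "SELECT rsid, chrom, position FROM hapmap WHERE chrom = 'chr22'")
    dbDisconnect(con)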
read.table's colClasses= argument can take "NULL" for those columns that you want ignored. Also see the skip= argument; see ?read.table.

The sqldf package can read a subset of rows and columns (actually any SQL operation) from a file larger than R can otherwise handle. It will automatically set up a temporary SQLite database for you, load the file into the database without going through R, extract just the data you want into R, and then automatically delete the database. All this can be done in 2 lines of code. See example 6 on the home page:

http://sqldf.googlecode.com
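A hedged sketch of both ideas (the file name, the 10-column layout, and the column names in the SQL are made up; the sqldf call uses the read.csv.sql convenience wrapper found in later versions of the package, so check example 6 on the sqldf page for the exact idiom current at the time):

    ## 1. read.table: drop unwanted columns at read time with colClasses = "NULL"
    ##    (here only columns 1, 2 and 5 of a hypothetical 10-column file are kept;
    ##     NA means "let read.table guess the type")
    cls <- rep("NULL", 10)
    cls[c(1, 2, 5)] <- NA
    x <- read.table("hapmap.txt", header = TRUE, colClasses = cls)

    ## 2. sqldf: let SQLite do the row/column selection outside of R
    library(sqldf)
    y <- read.csv.sql("hapmap.txt", header = TRUE, sep = "\t",
                      sql = "select rsid, chrom, position from file
                             where chrom = 'chr22'")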
On 2/28/08, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> The sqldf package can read a subset of rows and columns (actually any
> SQL operation) from a file larger than R can otherwise handle. It will
> automatically set up a temporary SQLite database for you, load the file
> into the database without going through R, extract just the data you
> want into R, and then automatically delete the database. All this can
> be done in 2 lines of code.

Is it realistic to use this approach for datasets as big as 30-40 GB?

Liviu