Dear R-list,

Does somebody know how I can read a HUGE data set using R? It is a HapMap data set (txt format) which is around 4 GB. After reading it, I need to delete some specific rows and columns. I'm running R 2.6.2 patched on XP SP2, using a 2.4 GHz Core 2 Duo processor and 4 GB RAM. Any suggestion would be appreciated.

Thanks in advance,

Jorge
Depending on how many rows you will delete, and if you know in advance which ones they are, one approach is to use the "skip" argument of read.table. If you only need a fraction of the total number of rows, this will save a lot of RAM.

Mark

Mark W. Kimpel MD  ** Neuroinformatics **  Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work & Mobile & VoiceMail
(317) 204-4202 Home (no voice mail please)
mwkimpel<at>gmail<dot>com
******************************************************************
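To illustrate the skip/nrows idea, here is a minimal sketch; the file name "hapmap.txt" and the row counts are made up, and it assumes a whitespace-delimited file with a plain header line:

    ## grab the column names from the header line only
    hdr <- read.table("hapmap.txt", header = TRUE, nrows = 1)
    ## read a 100,000-row block, skipping the header plus the first 200,000 data rows
    block <- read.table("hapmap.txt", skip = 200001, nrows = 100000,
                        col.names = names(hdr))

Reading several such blocks and keeping only the wanted rows and columns from each keeps the peak memory use at roughly one block rather than the whole 4 GB file.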
I may be mistaken, but I believe R does all its work in memory. If that is so, you would really only have 2 options:

1. Get a lot of memory.

2. Figure out a way to do the desired operation on parts of the data at a time (a rough sketch of this follows below).

-Roy M.

**********************
"The contents of this message do not reflect any position of the U.S. Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS Environmental Research Division
Southwest Fisheries Science Center
1352 Lighthouse Avenue
Pacific Grove, CA 93950-2097
e-mail: Roy.Mendelssohn at noaa.gov (Note new e-mail address)
voice: (831)-648-9029
fax: (831)-648-8440
www: http://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
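One way to sketch option 2 in R, assuming a tab-delimited file with a header line; the file name, the chunk size, and the columns kept (1, 2 and 5) are invented for illustration:

    con <- file("hapmap.txt", open = "r")
    header <- strsplit(readLines(con, n = 1), "\t")[[1]]
    keep.cols <- c(1, 2, 5)                     # hypothetical columns to keep
    pieces <- list()
    repeat {
        ## read.table continues from the connection's current position;
        ## it signals an error once no lines are left, which we treat as EOF
        chunk <- tryCatch(read.table(con, header = FALSE, sep = "\t",
                                     nrows = 100000, col.names = header,
                                     comment.char = ""),
                          error = function(e) NULL)
        if (is.null(chunk)) break
        pieces[[length(pieces) + 1]] <- chunk[, keep.cols]
    }
    close(con)
    reduced <- do.call("rbind", pieces)         # much smaller than the raw file

This only works if the reduced result (not the raw file) fits comfortably in memory.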
On Wed, 27-Feb-2008 at 09:13PM -0800, Roy Mendelssohn wrote:

|> I may be mistaken, but I believe R does all its work in memory. If
|> that is so, you would really only have 2 options:
|>
|> 1. Get a lot of memory

But with a 32-bit operating system, 4 GB is all the memory that can be addressed (including by the operating system), so your chances of getting all the data into R seem very slim.

|> 2. Figure out a way to do the desired operation on parts of the data
|> at a time.

That might involve using a database which you can query from R, or you might be able to use a Perl script to select what you require. I have heard of people using Perl with Windows. Someone once asked me to plot some SAS output which was several hundred MB; in that case, a simple Perl script cut it down to 3 MB. You might be lucky too.

Good luck.

--
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
   ___    Patrick Connolly
 {~._.~}         Great minds discuss ideas
 _( Y )_        Middle minds discuss events
(:_~*~_:)        Small minds discuss people
 (_)-(_)                            ..... Anon
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
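If Perl (or awk) is not to hand, the same pre-filtering can be done from R itself by streaming the file line by line and writing a much smaller file that read.table can then handle. It will be slower than Perl, but it never holds the full 4 GB in memory. The file names, the tab delimiter, the kept columns, and the "rs..." row filter below are all invented for illustration:

    infile  <- file("hapmap.txt", open = "r")
    outfile <- file("hapmap_small.txt", open = "w")
    keep.cols <- c(1, 2, 5)                        # hypothetical columns to keep
    while (length(line <- readLines(infile, n = 1)) > 0) {
        fields <- strsplit(line, "\t")[[1]]
        if (substr(fields[1], 1, 2) != "rs") next  # hypothetical row filter
        writeLines(paste(fields[keep.cols], collapse = "\t"), outfile)
    }
    close(infile)
    close(outfile)
    small <- read.table("hapmap_small.txt", sep = "\t")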
Jorge Iván Vélez wrote:
> Does somebody know how I can read a HUGE data set using R? It is a HapMap
> data set (txt format) which is around 4 GB. After reading it, I need to
> delete some specific rows and columns. I'm running R 2.6.2 patched on XP SP2
> using a 2.4 GHz Core 2 Duo processor and 4 GB RAM. Any suggestion would be
> appreciated.

Hmmm... Unless you're running a 64-bit version of XP, you might be SOL (notwithstanding the astounding feats of the R Core Team, which managed to make about 3.5 GB of memory usable under 32-bit Windows): your *raw* data will eat more than the available memory. You might be lucky if some of them can be abstracted (e.g. long character strings that can be reduced to vectors), or unlucky (large R storage overhead for nonreducible data).

You might consider changing machines: get a 64-bit machine with gobs of memory and cross your fingers. Note that, since R pointers would then be 64 bits wide instead of 32, data storage needs will inflate...

Depending on the real meaning of your data and the processing they need, you might also consider storing your raw data in an SQL DBMS, reducing them in SQL, and reading into R only the relevant part(s).

There are also some contributed packages that might help in special situations: biglm, birch.

HTH,

Emmanuel Charpentier
Sounds like you want to use the filehash package, which was written for just such problems:

http://yusung.blogspot.com/2007/09/dealing-with-large-data-set-in-r.html
http://cran.r-project.org/web/packages/filehash/index.html

or maybe the ff package:

http://cran.r-project.org/web/packages/ff/index.html
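A minimal filehash sketch of the idea (the database name, the key, and the chunked read are made up; see the links above for the package's real worked examples): pieces of the data live on disk in a filehash database and are pulled back into RAM only when needed.

    library(filehash)
    dbCreate("hapmap.db")                 # one-time: create the on-disk database
    db <- dbInit("hapmap.db")
    ## store pieces of the file under their own keys, e.g. a first chunk
    chunk1 <- read.table("hapmap.txt", header = TRUE, nrows = 100000)
    dbInsert(db, "chunk1", chunk1)
    rm(chunk1)                            # the in-RAM copy is no longer needed
    ## later, fetch back only the piece you want to work on
    x <- dbFetch(db, "chunk1")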
Hi,

Jorge Iván Vélez wrote:
> Does somebody know how I can read a HUGE data set using R? It is a HapMap
> data set (txt format) which is around 4 GB. After reading it, I need to
> delete some specific rows and columns. I'm running R 2.6.2 patched on XP SP2

In such a case, I would recommend not using R at the beginning. Try to use awk [1] to cut out the correct rows and columns. If the resulting data are still very large, I would suggest reading them into a database system. My experience is limited in that respect: I have only used SQLite, but in conjunction with the RSQLite package I managed all my "big data problems".

Check http://www.ibm.com/developerworks/library/l-awk1.html to get smoothly started with awk.

I hope this helps,

Roland

[1] I think the gawk implementation offers the most options (e.g. for timing), but I recently used mawk on Windows XP and it was way faster (or was it nawk?). If you don't have experience in a language such as Perl, I'd say it is much easier to learn awk than Perl.
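A rough RSQLite sketch of that second step (the database and table names, the file name, and the column names in the query are all invented): append the already-reduced data to an on-disk SQLite table, then pull back only the rows and columns you need.

    library(RSQLite)
    con <- dbConnect(SQLite(), dbname = "hapmap.sqlite")
    ## load in manageable chunks; repeat this for each chunk of the file
    chunk <- read.table("hapmap_small.txt", header = TRUE, sep = "\t",
                        nrows = 100000)
    dbWriteTable(con, "hapmap", chunk, append = TRUE, row.names = FALSE)
    ## query back only the relevant part (hypothetical column names)
    sub <- dbGetQuery(con,
        "SELECT rsid, chrom, position FROM hapmap WHERE chrom = 'chr22'")
    dbDisconnect(con)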
read.table's colClasses= argument can take "NULL" for those columns that you want ignored. Also see the skip= argument; see ?read.table.

The sqldf package can read a subset of rows and columns (actually any SQL operation) from a file larger than R can otherwise handle. It will automatically set up a temporary SQLite database for you, load the file into the database without going through R, extract just the data you want into R, and then automatically delete the database. All this can be done in 2 lines of code. See example 6 on the home page:

http://sqldf.googlecode.com
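A hedged sketch of both ideas (the file name, the 10-column layout, and the column names in the SQL are made up; the sqldf call uses the read.csv.sql convenience wrapper found in later versions of the package, so check example 6 on the sqldf page for the exact idiom current at the time):

    ## 1. read.table: drop unwanted columns at read time with colClasses = "NULL"
    ##    (here only columns 1, 2 and 5 of a hypothetical 10-column file are kept;
    ##     NA means "let read.table guess the type")
    cls <- rep("NULL", 10)
    cls[c(1, 2, 5)] <- NA
    x <- read.table("hapmap.txt", header = TRUE, colClasses = cls)

    ## 2. sqldf: let SQLite do the row/column selection outside of R
    library(sqldf)
    y <- read.csv.sql("hapmap.txt", header = TRUE, sep = "\t",
                      sql = "select rsid, chrom, position from file
                             where chrom = 'chr22'")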
On 2/28/08, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> The sqldf package can read a subset of rows and columns (actually any
> SQL operation) from a file larger than R can otherwise handle. It will
> automatically set up a temporary SQLite database for you, load the file
> into the database without going through R, extract just the data you
> want into R, and then automatically delete the database. All this can
> be done in 2 lines of code.

Is it realistic to use this approach for datasets as big as 30-40 GB?

Liviu