Dupuis, Robert
2015-Jul-14 22:21 UTC
[R] Parsing large amounts of csv data with limited RAM
I'm relatively new to using R, and I am trying to find a decent solution for my current dilemma.

Right now, I am trying to parse one-second data from 7 months of CSV files. This is over 10GB of data, and I've run into memory issues loading it all into a single dataset to be plotted. If possible, I'd really like to keep both the one-second resolution and all 100 or so columns intact to make things easier on myself.

The problem is that the machine running this script only has 8GB of RAM. I've had issues parsing the files with lapply and various CSV readers. So far I've tried read.csv, readr::read_table, and data.table::fread, with only fread offering any sort of memory management (fread seems to crash on me, however). The basic approach I am using is as follows:

# Get the data
files <- list.files(pattern = "*.csv")
set <- lapply(files, function(x) fread(x, header = TRUE, sep = ","))
# replace fread with something that can parse csv data

# Handle the data (do my plotting down here)
...

This approach works with smaller data sets, but in the worst case I would like to be able to parse through a year of data, which would be around 20GB.

Thank you for your time,
Robert Dupuis
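A minimal sketch of one way to keep memory bounded while staying with fread: read only the columns the plot needs (fread's select argument) and reduce each file to a summary before moving on, so only one raw file is in RAM at a time. This is untested, and the column names ts and power are placeholders, not names from the original data.

library(data.table)

files <- list.files(pattern = "\\.csv$")

# Read and reduce one file at a time so only one raw file is in memory at once
summaries <- lapply(files, function(f) {
  dt <- fread(f, header = TRUE, sep = ",",
              select = c("ts", "power"))        # only the columns the plot needs
  out <- dt[, .(mean_power = mean(power)),
            by = .(minute = substr(ts, 1, 16))] # collapse 1-second rows to 1-minute means
  rm(dt); gc()                                  # free the raw file before reading the next
  out
})
set <- rbindlist(summaries)                     # small enough to plot in one go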
Jeff Newmiller
2015-Jul-15 02:27 UTC
[R] Parsing large amounts of csv data with limited RAM
You seem to want to have your cake and eat it too. Not unexpected, but you may have your work cut out for you learning the price of having it all.

Plotting: it is pretty silly to stick gigabytes of data into your plots. Some kind of aggregation seems required here, with the raw data being a stepping stone toward that goal.

Loading: if you don't have enough RAM, buy more or use one of the disk-based solutions. There are proprietary solutions for a fee, and there are packages like ff. When I have dealt with large data sets I have used sqldf or RODBC (which I think works best for read-only access), so I cannot advise you on ff.

Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
Sent from my phone. Please excuse my brevity.
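A rough sketch of the disk-backed route mentioned above, using the ff package: the rows are stored on disk and pulled into RAM a chunk at a time for aggregation. Untested, and the file name january.csv is a placeholder; a loop over all files would still be needed.

library(ff)

# Disk-backed data frame: rows live on disk, not in RAM
fdf <- read.csv.ffdf(file = "january.csv", header = TRUE,
                     next.rows = 100000)   # read the file in 100k-row chunks

# Work through it a chunk at a time for aggregation/plotting
for (idx in chunk(fdf)) {
  piece <- fdf[idx, ]    # an ordinary in-memory data.frame for this chunk
  # ... aggregate 'piece' here and accumulate the summaries ...
}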
Jim Holtman
[R] Parsing large amounts of csv data with limited RAM

Take a look at the sqldf package, because it has the ability to load a csv file into a database, from which you can then process the data in pieces.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
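A small sketch of that suggestion using sqldf's read.csv.sql, which loads the csv into a temporary SQLite database and returns only the query result to R. Untested; the file name january.csv and the column names ts and power are placeholders.

library(sqldf)

# The whole csv goes into a temporary SQLite database; only the aggregated
# result of the query comes back into R.
daily <- read.csv.sql("january.csv",
                      sql = "select substr(ts, 1, 10) as day,
                                    avg(power) as mean_power
                             from file
                             group by day",
                      header = TRUE, sep = ",")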