Here are some possibilities using existing tools:
If you create a file connection and open it explicitly before reading from
it (or writing to it), then functions like read.table and read.csv (and
write.table for a writable connection) will use the open connection without
closing and resetting it, so each call picks up where the previous one left
off. This means that you could open 2 files, one for reading and one for
writing, then read in a chunk, process it, write it out, then read in the
next chunk, etc.
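
Here is a rough, untested sketch of that pattern (the file names, the
10,000-row chunk size, and the pass-through "processing" step are just
placeholders):

infile  <- file("big.csv", open = "r")
outfile <- file("results.csv", open = "w")

nms <- strsplit(readLines(infile, n = 1), ",")[[1]]  # grab the header

repeat {
    ## the open connection is not reset, so each read.csv call resumes
    ## where the previous one stopped; it errors once input runs out
    chunk <- tryCatch(
        read.csv(infile, header = FALSE, nrows = 10000, col.names = nms),
        error = function(e) NULL)
    if (is.null(chunk)) break
    processed <- chunk                  # real per-chunk work goes here
    write.table(processed, outfile, sep = ",",
                row.names = FALSE, col.names = FALSE)
}

close(infile)
close(outfile)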
Another option would be to read the data into an ff object (ff package) or
into a database (SQLite for one), either of which lets you access the data
in chunks, possibly even in parallel.
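
With the database route, DBI lets you pull a result set back in pieces.
A minimal sketch using RSQLite (the file name, table name, and chunk size
are assumptions for illustration):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "big.sqlite")
res <- dbSendQuery(con, "SELECT * FROM mydata ORDER BY date")

while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = 10000)   # next 10,000 rows
    ## process 'chunk' here and write the results out as above
}

dbClearResult(res)
dbDisconnect(con)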
On Mon, Jun 3, 2013 at 4:59 PM, ivo welch
<ivo.welch@anderson.ucla.edu> wrote:
> dear R wizards---
>
> I presume this is a common problem, so I thought I would ask whether
> this solution already exists and if not, suggest it. say, a user has
> a data set of x GB, where x is very big---say, greater than RAM.
> fortunately, data often come sequentially in groups, and there is a
> need to process contiguous subsets of them and write the results to a
> new file. read.csv and write.csv only work on FULL data sets.
> read.csv has the ability to skip n lines and read only m lines, but
> this can cross the subsets. the useful solution here would be a
> "filter" function that understands chunks:
>
> filter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>
> a chunk would not exactly be a factor, because normal R factors can be
> non-sequential in the data frame. the filter.csv makes it very simple
> to work on large data sets...almost SAS simple:
>
> filter.csv( pipe('bzcat infile.csv.bz2'), "results.csv", "date",
>             function(d) colMeans(d) )
> or
> filter.csv( pipe('bzcat infile.csv.bz2'),
>             pipe("bzip2 -c > results.csv.bz2"), "date",
>             function(d) d[ !duplicated(d$date), ] )
> ## filter out observations that have the same date again later
>
> or some reasonable variant of this.
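
Something close to this could be put together from the connection trick
above. A completely untested sketch; the carry-over logic for a group
that straddles a read boundary is my guess at the intended semantics:

filter.csv <- function(in.con, out.con, chunk, FUNprocess, nrows = 10000) {
    if (is.character(in.con))  in.con  <- file(in.con)
    if (is.character(out.con)) out.con <- file(out.con)
    open(in.con, "r"); open(out.con, "w")
    on.exit({close(in.con); close(out.con)})

    nms <- strsplit(readLines(in.con, n = 1), ",")[[1]]
    leftover <- NULL
    repeat {
        d <- tryCatch(read.csv(in.con, header = FALSE, nrows = nrows,
                               col.names = nms),
                      error = function(e) NULL)
        done <- is.null(d)
        d <- rbind(leftover, d)
        if (is.null(d) || nrow(d) == 0) break
        if (!done) {
            ## hold back the last (possibly incomplete) group so no
            ## group is split across two reads
            hold <- d[[chunk]] == d[[chunk]][nrow(d)]
            leftover <- d[hold, , drop = FALSE]
            d <- d[!hold, , drop = FALSE]
        }
        ## process groups in order of first appearance
        for (g in split(d, factor(d[[chunk]], unique(d[[chunk]]))))
            write.table(FUNprocess(g), out.con, sep = ",",
                        row.names = FALSE, col.names = FALSE)
        if (done) break
    }
    invisible(NULL)
}

This only shows the chunking mechanics; a real version would also want to
handle quoted fields and column names on output.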
>
> now that I can have many small chunks, it would be nice if this were
> threadsafe, so
>
> mcfilter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>
> with 'library(parallel)' could feed multiple cores the FUNprocess, and
> make sure that the processes don't step on one another. (why did R
> not use a dot after "mc" for parallel lapply?) presumably, to keep it
> simple, mcfilter.csv would keep a counter of read chunks and block
> write chunks until the next sequential chunk in order arrives.
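
mclapply already returns its results in input order, so the ordered-write
bookkeeping comes almost for free once the groups exist. A sketch, assuming
a hypothetical read.chunks() helper that splits the input into per-group
data frames the same way filter.csv above does:

library(parallel)

mcfilter.csv <- function(in.con, out.con, chunk, FUNprocess) {
    ## read.chunks() is hypothetical: it would reuse the chunked-read
    ## logic above to return a list of per-group data frames
    groups  <- read.chunks(in.con, chunk)
    results <- mclapply(groups, FUNprocess, mc.cores = detectCores())
    for (res in results)               # results come back in input order
        write.table(res, out.con, sep = ",", row.names = FALSE)
}

Note this holds every group in memory at once, giving up the
larger-than-RAM benefit; a real version would read and dispatch a window
of chunks at a time.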
>
> just a suggestion...
>
> /iaw
>
> ----
> Ivo Welch (ivo.welch@gmail.com)
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Gregory (Greg) L. Snow Ph.D.
538280@gmail.com