I can't speak for ff and filehash, but bigmemory's data structure doesn't allow "clever" merges (for genuinely good reasons). Even so, merging is probably less painful (and faster) than the other options. We don't implement it ourselves; we leave it to the user, because the details vary from case to case and the code is trivial (a sketch appears in the P.S. below):

- Allocate an empty new filebacked big.matrix of the proper size.

- Fill it in chunks: typically a column at a time if you can afford the RAM overhead, or a portion of a column at a time otherwise. Column operations are more efficient than row operations (again, because of the internals of the data structure).

- Because you'll be using file backings, RAM limitations won't matter beyond the overhead of copying each chunk.

I should note: if you used separated=TRUE, each column would have a separate binary file, and a "smart" cbind() would be possible simply by manipulating the descriptor file (see the second P.S.). Again, not something we advise or formally provide, but it wouldn't be hard.

Jay

--
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
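P.S. A minimal sketch of the chunked fill, assuming two existing filebacked matrices x and y with the same number of rows that you want to cbind; the dimensions and file names here are made up for illustration:

    library(bigmemory)

    x <- filebacked.big.matrix(1e6, 10, backingfile = "x.bin",
                               descriptorfile = "x.desc")
    y <- filebacked.big.matrix(1e6, 5, backingfile = "y.bin",
                               descriptorfile = "y.desc")

    # Allocate an empty new filebacked big.matrix of the proper size.
    z <- filebacked.big.matrix(nrow(x), ncol(x) + ncol(y),
                               backingfile = "z.bin",
                               descriptorfile = "z.desc")

    # Fill it in chunks, a column at a time; if a full column is too much
    # RAM overhead, loop over row blocks within each column instead.
    for (j in 1:ncol(x)) z[, j] <- x[, j]
    for (j in 1:ncol(y)) z[, ncol(x) + j] <- y[, j]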
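P.P.S. On the separated=TRUE idea: by default the descriptor file is a dput()-style text representation of the matrix's description (dimensions, type, backing file information), and with separated=TRUE there is one binary backing file per column. A "smart" cbind() would amount to writing a new descriptor that references the union of the two sets of column files. I won't spell that out here, since it leans on descriptor internals that may change, but you can inspect the pieces yourself; names again are just for illustration:

    library(bigmemory)

    # With separated = TRUE, each column gets its own backing file.
    s <- filebacked.big.matrix(1e6, 3, separated = TRUE,
                               backingfile = "s.bin",
                               descriptorfile = "s.desc")

    # The descriptor is plain text; this is what a "smart" cbind()
    # would need to rewrite.
    file.show("s.desc")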