This is the kind of job where the sort/merge utilities on the mainframe
have excelled for the last 40 years. If you cannot send it to a
mainframe, you can look at the SyncSort package, which runs on UNIX
machines.
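If neither a mainframe nor SyncSort is an option, the GNU 'sort'
utility performs the same disk-based merge sort and is worth
benchmarking first: it keeps its in-memory buffer under a limit you
set and spills the rest to temporary merge runs on disk. A rough
sketch driven from R (the file names, the 40 GB cap, the scratch
directory, and the core count below are placeholders, not tested
values):

## Sketch: external merge sort of a gzipped, whitespace-delimited
## file on its first three numeric columns using GNU coreutils sort.
## File names, memory cap, temp dir, and core count are assumptions.
infile  <- "geno.txt.gz"          # hypothetical gzipped input
outfile <- "geno.sorted.txt.gz"   # hypothetical sorted output

cmd <- paste(
  "zcat", infile,
  "| sort -k1,1n -k2,2n -k3,3n",  # numeric sort keys: columns 1-3
  "-S 40G",                       # cap the in-memory buffer
  "-T /scratch/tmp",              # put merge runs on fast local disk
  "--parallel=8",                 # needs GNU coreutils >= 8.6
  "| gzip >", outfile
)
system(cmd)

Because -S caps the buffer and -T points the overflow at disk, the
job stays under your 50 GB RAM budget no matter how large the input
is.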
On Mon, Jul 30, 2012 at 12:25 PM, Matthew Keller <mckellercran at gmail.com> wrote:
> Hello all,
>
> I have some genetic datasets (gzipped) that contain 6 columns and
> upwards of tens of billions of rows. The largest dataset is about 16 GB
> on disk, gzipped (!). I need to sort them on columns 1, 2, and 3. The
> setkey() function in the data.table package does this quickly, but of
> course we're limited by R not being able to index vectors with more
> than 2^31 elements, and reading in only the parts of the dataset we
> need is not an option here.
>
> I'm asking for practical advice from people who've done this or who
> have ideas. We'd like to be able to sort the biggest datasets in hours
> rather than days (or weeks!), and no single process can use more than
> 50 GB of RAM (we'd prefer less so we can parallelize).
>
> Relational databases seem too slow, but maybe I'm wrong. A quick look
> at the bigmemory package doesn't turn up a way to sort like this, but
> again, maybe I've missed something. The programmer I work with writes
> in C++, so if you have ideas in C++, that works too.
>
> Any help would be much appreciated... Thanks!
>
> Matt
>
>
> --
> Matthew C Keller
> Asst. Professor of Psychology
> University of Colorado at Boulder
> www.matthewckeller.com
>
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.