This is the kind of job where the sort/merge utilities on the mainframe
have excelled for the last 40 years. If you cannot send it to a
mainframe, you can look at the SyncSort package, which runs on UNIX
machines.
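If neither a mainframe nor SyncSort is an option, the GNU 'sort'
utility performs the same disk-based merge sort and is worth
benchmarking first: it keeps its in-memory buffer under a limit you
set and spills the rest to temporary merge runs on disk. A rough
sketch driven from R (the file names, the 40 GB cap, the scratch
directory, and the core count below are placeholders, not tested
values):

## Sketch: external merge sort of a gzipped, whitespace-delimited
## file on its first three numeric columns using GNU coreutils sort.
## File names, memory cap, temp dir, and core count are assumptions.
infile  <- "geno.txt.gz"          # hypothetical gzipped input
outfile <- "geno.sorted.txt.gz"   # hypothetical sorted output

cmd <- paste(
  "zcat", infile,
  "| sort -k1,1n -k2,2n -k3,3n",  # numeric sort keys: columns 1-3
  "-S 40G",                       # cap the in-memory buffer
  "-T /scratch/tmp",              # put merge runs on fast local disk
  "--parallel=8",                 # needs GNU coreutils >= 8.6
  "| gzip >", outfile
)
system(cmd)

Because -S caps the buffer and -T points the overflow at disk, the
job stays under your 50 GB RAM budget no matter how large the input
is.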
On Mon, Jul 30, 2012 at 12:25 PM, Matthew Keller <mckellercran at gmail.com> wrote:
> Hello all,
>
> I have some genetic datasets (gzipped) that contain 6 columns and
> upwards of tens of billions of rows. The largest dataset is about 16 GB
> on disk, gzipped (!). I need to sort them on columns 1, 2, and 3. The
> setkey() function in the data.table package does this quickly, but of
> course we're limited by R not being able to index vectors with more
> than 2^31 elements, and reading in only the parts of the dataset we
> need is not an option here.
>
> I'm asking for practical advice from people who've done this or who
> have ideas. We'd like to be able to sort the biggest datasets in hours
> rather than days (or weeks!), and no single process can use more than
> 50 GB of RAM (we'd prefer less so we can parallelize).
>
> Relational databases seem too slow, but maybe I'm wrong. A quick look
> at the bigmemory package doesn't turn up a way to sort like this, but
> again, maybe I've missed something. The programmer I work with writes
> in C++, so if you have ideas in C++, that works too.
>
> Any help would be much appreciated... Thanks!
>
> Matt
>
>
> --
> Matthew C Keller
> Asst. Professor of Psychology
> University of Colorado at Boulder
> www.matthewckeller.com
>
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.