thr3ads.net - R help - [R] merging pre-sorted data frames [Jan 2015]

Mike Miller

2015-Jan-14 00:55 UTC

[R] merging pre-sorted data frames

I have many pairs of data frames, each frame with about 15 million records 
and each pair with about 10 million records in common.  They are sorted by 
two of their fields and will be merged on those same fields.

The fact that the data are sorted could be used to greatly speed up a 
merge, but I have the impression that merge() cannot "know" in advance
that the fields are already sorted.

I'm sure that I can use merge(), but I suspect that it is doing a lot of 
unnecessary work and that it will take much more time than the job really 
should require.  Is that correct?  Can anything be done about it?

The inspiration for my question comes partly from the way GNU comm works.

If you have any ideas about this, I'd love to hear them.

Thanks in advance.

Mike

-- 
Michael B. Miller, Ph.D.
University of Minnesota
http://scholar.google.com/citations?user=EV_phq4AAAAJ

Jeff Newmiller

2015-Jan-14 01:07 UTC

On Tue, 13 Jan 2015, Mike Miller wrote:
> I have many pairs of data frames, each frame with about 15 million records
> and each pair with about 10 million records in common.  They are sorted by
> two of their fields and will be merged on those same fields.
>
> The fact that the data are sorted could be used to greatly speed up a
> merge, but I have the impression that merge() cannot "know" in advance
> that the fields are already sorted.
There are different versions of "merge". This sounds like a job for the 
data.table package, which has its own way of doing merges that is likely 
to be useful here. However, be warned that data.table takes some getting 
used to, and if it can't figure out from your use of it how to use the 
fast techniques then it will often fall back on the slower data.frame 
approaches. [1] covers the single-column case, but joining on multiple 
columns is quite doable.
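
A minimal sketch of a two-column keyed join in data.table, on toy data. The table names A and B and the column names id1, id2, x and y are made up for illustration; they are not from the original question. As far as I understand it, setkey() checks whether the rows are already in key order before reordering anything, so pre-sorted input should make that step cheap, but it is worth timing on real data.

```r
library(data.table)

# Hypothetical toy tables, already sorted by (id1, id2).
A <- data.table(id1 = c(1L, 1L, 2L, 3L),
                id2 = c(1L, 2L, 1L, 1L),
                x   = c(10, 11, 12, 13))
B <- data.table(id1 = c(1L, 2L, 2L, 3L),
                id2 = c(2L, 1L, 2L, 1L),
                y   = c(20, 21, 22, 23))

# Declare the two sort fields as the key of each table.
setkey(A, id1, id2)
setkey(B, id1, id2)

# Inner join on both key columns, written as a merge() call ...
AB <- merge(A, B, by = c("id1", "id2"))

# ... and the same inner join in data.table's X[Y] form.
AB2 <- A[B, nomatch = 0L]

print(AB)
```

Both forms keep only the records that the two tables have in common; passing all = TRUE to merge() would keep the unmatched records as well.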

You might also find sqldf helpful if you are more comfortable with SQL 
than data.table's way of doing things.
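
A rough sketch of the same two-column inner join through sqldf. The data frame names dfa and dfb and the columns k1, k2, a and b are assumptions for the example. sqldf copies the data frames into a temporary database (SQLite by default), so I would not assume it takes advantage of the existing sort order.

```r
library(sqldf)

# Hypothetical toy data frames sharing the two key columns k1 and k2.
dfa <- data.frame(k1 = c(1L, 1L, 2L, 3L),
                  k2 = c(1L, 2L, 1L, 1L),
                  a  = c(10, 11, 12, 13))
dfb <- data.frame(k1 = c(1L, 2L, 2L, 3L),
                  k2 = c(2L, 1L, 2L, 1L),
                  b  = c(20, 21, 22, 23))

# Inner join on both key columns; ORDER BY makes the row order explicit.
res <- sqldf("
  SELECT dfa.k1, dfa.k2, dfa.a, dfb.b
  FROM dfa
  INNER JOIN dfb ON dfa.k1 = dfb.k1 AND dfa.k2 = dfb.k2
  ORDER BY dfa.k1, dfa.k2
")

print(res)
```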

[1] http://stackoverflow.com/questions/17331684/fast-exists-in-data-table
> I'm sure that I can use merge(), but I suspect that it is doing a lot of
> unnecessary work and that it will take much more time than the job really
> should require.  Is that correct?  Can anything be done about it?
>
> The inspiration for my question comes partly from the way GNU comm works.
Not familiar with that.
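
For context, GNU comm reads two sorted files in a single pass and reports which lines are only in the first, only in the second, or in both. The sketch below applies the same single-pass idea to two data frames sorted by two key columns. The names cmp_key, sorted_match, k1, k2, a and b are hypothetical, and keys are assumed to be unique within each data frame. It only illustrates the algorithm: an interpreted R loop like this would be far too slow for 15 million rows.

```r
# Compare key (a1, a2) with key (b1, b2) lexicographically: -1, 0, or 1.
cmp_key <- function(a1, a2, b1, b2) {
  if (a1 != b1) return(if (a1 < b1) -1L else 1L)
  if (a2 != b2) return(if (a2 < b2) -1L else 1L)
  0L
}

# One pass over x and y, both sorted by (k1, k2) with unique keys.
# Returns the row indices of the matching records in x and in y.
sorted_match <- function(x, y) {
  i <- 1L; j <- 1L; n <- 0L
  ix <- integer(min(nrow(x), nrow(y)))
  iy <- integer(length(ix))
  while (i <= nrow(x) && j <= nrow(y)) {
    s <- cmp_key(x$k1[i], x$k2[i], y$k1[j], y$k2[j])
    if (s < 0L) {
      i <- i + 1L                 # key occurs only in x: skip it
    } else if (s > 0L) {
      j <- j + 1L                 # key occurs only in y: skip it
    } else {
      n <- n + 1L                 # key occurs in both: record the pair
      ix[n] <- i; iy[n] <- j
      i <- i + 1L; j <- j + 1L
    }
  }
  list(x = ix[seq_len(n)], y = iy[seq_len(n)])
}

x <- data.frame(k1 = c(1L, 1L, 2L, 3L), k2 = c(1L, 2L, 1L, 1L),
                a = c(10, 11, 12, 13))
y <- data.frame(k1 = c(1L, 2L, 2L, 3L), k2 = c(2L, 1L, 2L, 1L),
                b = c(20, 21, 22, 23))

m <- sorted_match(x, y)
merged <- cbind(x[m$x, ], b = y$b[m$y])
print(merged)
```

Each row of either input is visited at most once, which is the saving a sort-aware merge can offer over one that has to hash or re-sort the keys first.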
> If you have any ideas about this, I'd love to hear them.
>
> Thanks in advance.
>
> Mike
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k

Mike Miller

2015-Jan-15 03:17 UTC

Thanks, Jeff.  You really know the packages.  I searched, but I guess I 
didn't use the right terms.  That package seems to do exactly what I 
wanted.

Mike


