thr3ads.net - R help - [R] Alternatives to merge for large data sets? [Sep 2006]


Adam D. I. Kramer

2006-Sep-07 06:12 UTC

[R] Alternatives to merge for large data sets?

Hello,

I am trying to merge two very large data sets, via

pubbounds.prof <-
merge(x=pubbounds,y=prof,by.x="user",by.y="userid",all=TRUE,sort=FALSE)

which gives me an error of

Error: cannot allocate vector of size 2962 Kb

I am reasonably sure that this is correct syntax.

The trouble is that pubbounds and prof are large; they are data frames which
take up 70M and 11M respectively when saved as .Rdata files.

I understand from various archive searches that "merge can't handle that,"
because merge takes n^2 memory, which I do not have.

My question is whether there is an alternative to merge which would carry
out the process in a slower, iterative manner...or if I should just bite the
bullet, write.table, and use a perl script to do the job.

Thankful as always,
Adam D. I. Kramer

Prof Brian Ripley

2006-Sep-07 08:57 UTC


[R] Alternatives to merge for large data sets?

Which version of R?

Please try 2.4.0 alpha, as it has a different and more efficient 
algorithm for the case of 1-1 matches.

On Wed, 6 Sep 2006, Adam D. I. Kramer wrote:
> Hello,
> 
> I am trying to merge two very large data sets, via
> 
> pubbounds.prof <-
> merge(x=pubbounds,y=prof,by.x="user",by.y="userid",all=TRUE,sort=FALSE)
> 
> which gives me an error of
> 
> Error: cannot allocate vector of size 2962 Kb
> 
> I am reasonably sure that this is correct syntax.
> 
> The trouble is that pubbounds and prof are large; they are data frames which
> take up 70M and 11M respectively when saved as .Rdata files.
> 
> I understand from various archive searches that "merge can't handle that,"
> because merge takes n^2 memory, which I do not have.
Not really true (it has been changed since those days).  Of course, if you 
have multiple matches it must do so.
> My question is whether there is an alternative to merge which would carry
> out the process in a slower, iterative manner...or if I should just bite
the
> bullet, write.table, and use a perl script to do the job.
> 
> Thankful as always,
> Adam D. I. Kramer
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

bogdan romocea

2006-Sep-07 19:05 UTC


[R] Alternatives to merge for large data sets?

One obvious alternative is an SQL join, which you could do directly in
a DBMS, or from R via RMySQL / RSQLite /... Keep in mind that creating
indexes on user/userid before the join may save a lot of time.
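
As a rough illustration of the SQL route, here is a minimal sketch using the
DBI and RSQLite packages on toy data. The column names pub and dept, the index
names, and the in-memory database are assumptions made for the example; only
user and userid come from the thread. It performs a LEFT JOIN, which
corresponds to all.x=TRUE; reproducing all=TRUE would need a full outer join
(supported only by newer SQLite builds) or a UNION of two left joins.

```r
library(DBI)

# Toy stand-ins for the real data frames (hypothetical columns pub and dept).
pubbounds <- data.frame(user = c("a", "b", "a", "c"), pub = 1:4,
                        stringsAsFactors = FALSE)
prof <- data.frame(userid = c("a", "b", "d"),
                   dept = c("psych", "stats", "math"),
                   stringsAsFactors = FALSE)

# An in-memory database is used here; a file path would keep the tables
# on disk instead of in R's workspace.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "pubbounds", pubbounds)
dbWriteTable(con, "prof", prof)

# Index the join keys before joining, as suggested above.
dbExecute(con, 'CREATE INDEX idx_pub_user ON pubbounds ("user")')
dbExecute(con, 'CREATE INDEX idx_prof_userid ON prof (userid)')

# Keep every pubbounds row and attach the matching prof columns.
joined <- dbGetQuery(con, '
  SELECT p.*, f.dept
  FROM pubbounds AS p
  LEFT JOIN prof AS f ON p."user" = f.userid')

dbDisconnect(con)
```

For data that does not fit comfortably in memory, the result could be written
to a new table with CREATE TABLE ... AS SELECT rather than pulled back into R
in one piece.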

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Adam
> D. I. Kramer
> Sent: Thursday, September 07, 2006 2:46 PM
> To: Prof Brian Ripley
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Alternatives to merge for large data sets?
>
>
> On Thu, 7 Sep 2006, Prof Brian Ripley wrote:
>
> > Which version of R?
>
> Previously, 2.3.1.
>
> > Please try 2.4.0 alpha, as it has a different and more efficient
> > algorithm for the case of 1-1 matches.
>
> I downloaded and installed R-latest, but got the same error message:
>
> Error: cannot allocate vector of size 7301 Kb
>
> ...though at least the too-big size was larger this time.
>
> My data set is not exactly 1-1; every item in "prof" may have one or more
> matches in "pubbounds," though every item in "pubbounds" corresponds only to
> one "prof."
>
> --Adam
>
>
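Since Adam notes above that every row of pubbounds corresponds to only one row
of prof, the join is many-to-one, and a plain lookup with match() is one
possible base-R alternative to merge. The sketch below uses toy data; the
columns pub and dept are hypothetical, and whether this actually fits in memory
for data of the size described has not been tested. Row order may differ from
what merge returns.

```r
# Toy stand-ins for the real data frames (hypothetical columns pub and dept).
pubbounds <- data.frame(user = c("a", "b", "a", "c"), pub = 1:4,
                        stringsAsFactors = FALSE)
prof <- data.frame(userid = c("a", "b", "d"),
                   dept = c("psych", "stats", "math"),
                   stringsAsFactors = FALSE)

# For each pubbounds row, find the position of its (single) prof row.
# Rows with no match get NA, which indexes to a row of NAs below.
idx <- match(pubbounds$user, prof$userid)

# Pull the non-key prof columns in pubbounds order and bind them on.
extra <- prof[idx, setdiff(names(prof), "userid"), drop = FALSE]
rownames(extra) <- NULL
joined <- cbind(pubbounds, extra)

# To mimic all=TRUE, append the prof rows that never appear in pubbounds.
unmatched <- prof[!(prof$userid %in% pubbounds$user), , drop = FALSE]
if (nrow(unmatched) > 0) {
  # Start from all-NA rows with the same columns as joined.
  filler <- joined[rep(NA_integer_, nrow(unmatched)), , drop = FALSE]
  filler$user <- unmatched$userid
  filler[names(extra)] <- unmatched[names(extra)]
  joined <- rbind(joined, filler)
  rownames(joined) <- NULL
}
```

This relies on userid being unique in prof; match() returns only the first
match, so duplicated keys in prof would be silently dropped.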
