-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of dms
Sent: Wednesday, March 02, 2011 3:16 PM
To: r-help at r-project.org
Subject: [R] merge( , by='row.names') slowness
I noticed that joining two data.frames in R with the "merge" function
is substantially slower with by='row.names' than when just joining
on a common index column.
With data frames of ~10,000 rows, the by='row.names' case can take as
long as 10 minutes, versus merely 1 second using an index column.
Beyond the 10^6 range, it's unusably slow.
n <- 5
a <- data.frame(id=as.character(1:10^n), x=rnorm(10^n))
rownames(a) <- a$id
b <- data.frame(id=as.character(1:10^n + 10^(n-1)), y=rnorm(10^n))
rownames(b) <- b$id
date()
fast <- merge(a, b, all=TRUE)                  # joins on the common column "id"
date()
slow <- merge(a, b, all=TRUE, by='row.names')  # joins on the row names
date()
Has anybody else noticed this?
_________________________________________________
Hi DMS,
Well, first off, they don't give the same answer... in fact, not even the
same dimension.
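To make that concrete, here is a small sketch on a toy version of your data (n = 1, so 10 rows each; the object names `by_id` and `by_rn` are mine, just for illustration). In the row-name merge, `id` is no longer the key, so it is kept from both inputs with suffixes, and a `Row.names` column is added:

```r
n <- 1
a <- data.frame(id = as.character(1:10^n), x = rnorm(10^n))
rownames(a) <- a$id
b <- data.frame(id = as.character(1:10^n + 10^(n - 1)), y = rnorm(10^n))
rownames(b) <- b$id

by_id <- merge(a, b, all = TRUE)                    # key: the shared column "id"
by_rn <- merge(a, b, all = TRUE, by = "row.names")  # key: the row names

names(by_id)  # "id" "x" "y"
names(by_rn)  # "Row.names" "id.x" "x" "id.y" "y"
dim(by_id)    # 11 3
dim(by_rn)    # 11 5
```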
Even so, from looking at merge.data.frame, it's not immediately obvious what
would make a difference of this magnitude.
The answer might be buried in the internal merge.
Here for n=3:
> system.time(print(dim(merge(a,b,all=T))))
[1] 1100 3
   user  system elapsed
   0.01    0.00    0.01
> system.time(print(dim(merge(a,b,all=T,by=1))))
[1] 1100 3
   user  system elapsed
   0.01    0.00    0.02
> system.time(print(dim(merge(a,b,all=T,by=0))))
[1] 1100 5
   user  system elapsed
   3.26    0.00    3.17
> system.time(print(dim(merge(a,b,all=T,by="row.names"))))
[1] 1100 5
   user  system elapsed
   3.17    0.00    3.17
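If the row-name path really is the culprit, one possible workaround is to copy the row names into an ordinary column and merge on that column instead. This is only a sketch, checked for equivalence at n = 3 and not timed at your sizes; the column name `rn` and the objects `a2`, `b2`, `via_col`, `via_rn` are made up for illustration. Given the column-key timings above, I would expect it to take the fast route, but please measure it on your own data:

```r
n <- 3
a <- data.frame(id = as.character(1:10^n), x = rnorm(10^n))
rownames(a) <- a$id
b <- data.frame(id = as.character(1:10^n + 10^(n - 1)), y = rnorm(10^n))
rownames(b) <- b$id

# Copy the row names into a regular key column (hypothetical name "rn")
a2 <- a
a2$rn <- rownames(a)
b2 <- b
b2$rn <- rownames(b)

via_col <- merge(a2, b2, by = "rn", all = TRUE)        # key is a plain column
via_rn  <- merge(a, b, by = "row.names", all = TRUE)   # key is the row names

dim(via_col)  # should match dim(via_rn)

# Compare the payload columns after ordering both results by their key
o1 <- order(via_col$rn)
o2 <- order(as.character(via_rn$Row.names))
all.equal(via_col$x[o1], via_rn$x[o2])
all.equal(via_col$y[o1], via_rn$y[o2])

# Time the column-key variant the same way as above
system.time(merge(a2, b2, by = "rn", all = TRUE))
```

The only difference I would expect in the result is the name of the key column (`rn` instead of `Row.names`).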