> Is there a way to avoid the degradation in performance in 2.9.1?
If the example is there to demonstrate a difference between R versions that you
really need to get to the bottom of, then read no further. However, if the
example is what you actually want to do, then you can speed it up by using a
data.table, as follows, reducing the 26 secs to about 1 sec.
Time on my PC at home (quite old now!) :

> system.time(Out <- merge(X, Y, by="mon", all=TRUE))
   user  system elapsed
  25.63    0.58   26.98
Using a data.table instead (with library(data.table) loaded, and N <- 100000
as in your quoted code below) :

> X <- data.table(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N),
+                 key="mon")
> Y <- data.table(mon=month.abb, letter=letters[1:12], key="mon")
> tables()
     NAME       NROW COLS       KEY
[1,] X     1,200,000 group,mon  mon
[2,] Y            12 mon,letter mon
> system.time(X$letter <- Y[X,letter])  # Y[X] is the syntax for a merge of two data.tables
   user  system elapsed
   0.98    0.11    1.10
> identical(Out$letter, X$letter)
[1] TRUE
> identical(Out$mon, X$mon)
[1] TRUE
> identical(Out$group, X$group)
[1] TRUE
To do the multi-column equi-join of X and Z, set a key of 2 columns. 'nomatch'
is the data.table equivalent of merge's 'all' argument, and can be set to 0
(inner join) or NA (outer join).
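For anyone following along, here is a minimal sketch of that two-column case. Only the column names (mon, group, letter) come from this thread; the data values are invented for illustration, and I have added an unmatched row so the effect of 'nomatch' is visible:

```r
library(data.table)

# Z keyed on the two join columns; 'letter' is the payload we look up.
Z <- data.table(mon    = c("Jan", "Feb", "Mar"),
                group  = 1:3,
                letter = letters[1:3],
                key    = c("mon", "group"))

# X keyed on the same two columns; the ("Dec", 9) row has no match in Z.
X <- data.table(mon   = c("Jan", "Jan", "Feb", "Dec"),
                group = c(1L, 1L, 2L, 9L),
                key   = c("mon", "group"))

Z[X, nomatch = 0]   # inner join: the unmatched ("Dec", 9) row is dropped
Z[X]                # nomatch = NA (the default): row kept, letter is NA
```

setkey(DT, mon, group) can equally set the 2-column key after construction.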
"Adrian Dragulescu" <adrian_d at eskimo.com> wrote in message
news:Pine.LNX.4.64.0907090953580.1125 at
shell.eskimo.com...>
> I have noticed a significant performance degradation using merge in 2.9.1
> relative to 2.8.1. Here is what I observed:
>
> N <- 100000
> X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb),
> each=N))
> X$mon <- as.character(X$mon)
> Y <- data.frame(mon=month.abb, letter=letters[1:12])
> Y$mon <- as.character(Y$mon)
>
> Z <- cbind(Y, group=1:12)
>
> system.time(Out <- merge(X, Y, by="mon", all=TRUE))
> # R 2.8.1 is 17% faster than R 2.9.1 for N=100000
>
> system.time(Out <- merge(X, Z, by=c("mon", "group"), all=TRUE))
> # R 2.8.1 is 16% faster than R 2.9.1 for N=100000
>
> Here is the head of summaryRprof() for 2.8.1
> $by.self
> self.time self.pct total.time total.pct
> sort.list 4.60 56.5 4.60 56.5
> make.unique 1.68 20.6 2.18 26.8
> as.character 0.50 6.1 0.50 6.1
> duplicated.default 0.50 6.1 0.50 6.1
> merge.data.frame 0.20 2.5 8.02 98.5
> [.data.frame 0.16 2.0 7.10 87.2
>
> and for 2.9.1
> $by.self
> self.time self.pct total.time total.pct
> sort.list 4.66 39.2 4.66 39.2
> nchar 3.28 27.6 3.28 27.6
> make.unique 1.42 12.0 1.92 16.2
> as.character 0.50 4.2 0.50 4.2
> data.frame 0.46 3.9 4.12 34.7
> [.data.frame 0.44 3.7 7.28 61.3
>
> As you notice the 2.9.1 has an nchar entry that is quite time consuming.
>
> Is there a way to avoid the degradation in performance in 2.9.1?
>
> Thank you,
> Adrian
>
> As an aside, I got interested in testing merge in 2.9.1 by reading the
> r-devel message from 30-May-2009 "Degraded performance with rank()" by Tim
> Bergsma, as he mentions doing merges, but only today decided to test.
>