thr3ads.net - R help - [R] hwo to speed up "aggregate" [Jan 2011]

If this information is useful, please help other people find it:
Share via:

analyst41 at hotmail.com

2011-Jan-26 10:39 UTC

[R] hwo to speed up "aggregate"

I have
> df   quantity branch client       date  name
1        10      1      1 2010-01-01   one
2        20      2      1 2010-01-01   one
3        30      3      2 2010-01-01   two
4        15      4      1 2010-01-01   one
5        10      5      2 2010-01-01   two
6        20      6      3 2010-01-01 three
7      1000      1      1 2011-01-01   one
8      2000      2      1 2011-01-01   one
9      3000      3      2 2011-01-01   two
10     1500      4      1 2011-01-01   one
11     1000      5      2 2011-01-01   two
12     2000      6      3 2011-01-01 three

I want to aggregate away the branch. I followed a suggestion by Gabor
(thanks) and did
>
aggregate(list(quantity=df$quantity),list(client=df$client,date=df$date),sum)  client       date quantity
1      1 2010-01-01       45
2      2 2010-01-01       40
3      3 2010-01-01       20
4      1 2011-01-01     4500
5      2 2011-01-01     4000
6      3 2011-01-01     2000

I want df$name also in the output and did what looked obvious:
>
aggregate(list(quantity=df$quantity),list(client=df$client,date=df$date,name=df$name),sum)  client       date  name quantity
1      1 2010-01-01   one       45
2      1 2011-01-01   one     4500
3      3 2010-01-01 three       20
4      3 2011-01-01 three     2000
5      2 2010-01-01   two       40
6      2 2011-01-01   two     4000

It seems to work, but slows down tremendously for a dataframe with
around a 1000 rows.

Could anyone explain what is going on and suggest a way out?

Thanks.

Dennis Murphy

2011-Jan-26 12:14 UTC

head link

[R] hwo to speed up "aggregate"

If you have a large data frame, one option is package data.table. Try the
following:

library(data.table)
dt <- data.table(df)
dt[, list(qsum = sum(quantity)), by ='client, date, name']
     client       date  name qsum
[1,]      1 2010-01-01   one   45
[2,]      1 2011-01-01   one 4500
[3,]      2 2010-01-01   two   40
[4,]      2 2011-01-01   two 4000
[5,]      3 2010-01-01 three   20
[6,]      3 2011-01-01 three 2000

BTW, the leading comma after the opening bracket is not a typo :)

For R versions >= 2.11.x, aggregate() has a formula interface that saves a
fair bit of typing:

aggregate(quantity ~ client + date + name, data = df, FUN = sum)
  client       date  name quantity
1      1 2010-01-01   one       45
2      1 2011-01-01   one     4500
3      3 2010-01-01 three       20
4      3 2011-01-01 three     2000
5      2 2010-01-01   two       40
6      2 2011-01-01   two     4000

A third option is package plyr and function ddply():

library(plyr)
ddply(df, .(client, date, name), summarise, qsum = sum(quantity))
# same output as data.table

Hopefully one or more of these will improve your processing time.

Dennis

On Wed, Jan 26, 2011 at 2:39 AM, analyst41@hotmail.com <
analyst41@hotmail.com> wrote:
> I have
>
> > df
>   quantity branch client       date  name
> 1        10      1      1 2010-01-01   one
> 2        20      2      1 2010-01-01   one
> 3        30      3      2 2010-01-01   two
> 4        15      4      1 2010-01-01   one
> 5        10      5      2 2010-01-01   two
> 6        20      6      3 2010-01-01 three
> 7      1000      1      1 2011-01-01   one
> 8      2000      2      1 2011-01-01   one
> 9      3000      3      2 2011-01-01   two
> 10     1500      4      1 2011-01-01   one
> 11     1000      5      2 2011-01-01   two
> 12     2000      6      3 2011-01-01 three
>
> I want to aggregate away the branch. I followed a suggestion by Gabor
> (thanks) and did
>
> >
>
aggregate(list(quantity=df$quantity),list(client=df$client,date=df$date),sum)
>  client       date quantity
> 1      1 2010-01-01       45
> 2      2 2010-01-01       40
> 3      3 2010-01-01       20
> 4      1 2011-01-01     4500
> 5      2 2011-01-01     4000
> 6      3 2011-01-01     2000
>
> I want df$name also in the output and did what looked obvious:
>
> >
>
aggregate(list(quantity=df$quantity),list(client=df$client,date=df$date,name=df$name),sum)
>  client       date  name quantity
> 1      1 2010-01-01   one       45
> 2      1 2011-01-01   one     4500
> 3      3 2010-01-01 three       20
> 4      3 2011-01-01 three     2000
> 5      2 2010-01-01   two       40
> 6      2 2011-01-01   two     4000
>
> It seems to work, but slows down tremendously for a dataframe with
> around a 1000 rows.
>
> Could anyone explain what is going on and suggest a way out?
>
> Thanks.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

Henrique Dallazuanna

2011-Jan-26 12:36 UTC

head link

[R] hwo to speed up "aggregate"

Try this:

 unique(transform(df, quantity = ave(quantity, client, date, name, FUN sum),
branch = NULL))

On Wed, Jan 26, 2011 at 8:39 AM, analyst41@hotmail.com <
analyst41@hotmail.com> wrote:
> I have
>
> > df
>   quantity branch client       date  name
> 1        10      1      1 2010-01-01   one
> 2        20      2      1 2010-01-01   one
> 3        30      3      2 2010-01-01   two
> 4        15      4      1 2010-01-01   one
> 5        10      5      2 2010-01-01   two
> 6        20      6      3 2010-01-01 three
> 7      1000      1      1 2011-01-01   one
> 8      2000      2      1 2011-01-01   one
> 9      3000      3      2 2011-01-01   two
> 10     1500      4      1 2011-01-01   one
> 11     1000      5      2 2011-01-01   two
> 12     2000      6      3 2011-01-01 three
>
> I want to aggregate away the branch. I followed a suggestion by Gabor
> (thanks) and did
>
> >
>
aggregate(list(quantity=df$quantity),list(client=df$client,date=df$date),sum)
>  client       date quantity
> 1      1 2010-01-01       45
> 2      2 2010-01-01       40
> 3      3 2010-01-01       20
> 4      1 2011-01-01     4500
> 5      2 2011-01-01     4000
> 6      3 2011-01-01     2000
>
> I want df$name also in the output and did what looked obvious:
>
> >
>
aggregate(list(quantity=df$quantity),list(client=df$client,date=df$date,name=df$name),sum)
>  client       date  name quantity
> 1      1 2010-01-01   one       45
> 2      1 2011-01-01   one     4500
> 3      3 2010-01-01 three       20
> 4      3 2011-01-01 three     2000
> 5      2 2010-01-01   two       40
> 6      2 2011-01-01   two     4000
>
> It seems to work, but slows down tremendously for a dataframe with
> around a 1000 rows.
>
> Could anyone explain what is going on and suggest a way out?
>
> Thanks.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

	[[alternative HTML version deleted]]

Possibly Parallel Threads

Search for more maybe matching threads

R help - Jan 2011 - hwo to speed up "aggregate"

[R] hwo to speed up "aggregate"

[R] hwo to speed up "aggregate"

[R] hwo to speed up "aggregate"

Possibly Parallel Threads