thr3ads.net - R help - [R] aggregate / collapse big data frame efficiently [Dec 2012]

If this information is useful, please help other people find it:
Share via:

Martin Batholdy

2012-Dec-25 16:34 UTC

[R] aggregate / collapse big data frame efficiently

Hi,


I need to aggregate rows of a data.frame by computing the mean for rows with the
same factor-level on one factor-variable;

here is the sample code:


x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52))

aggregate(x, list(x[,1]), mean)


Now my problem is, that the actual data-set is much bigger (120 rows and
approximately 100.000 columns) ? and it takes very very long (actually at some
point I just stopped it).

Is there anything that can be done to make the aggregate routine more efficient?
Or is there a different approach that would work faster?


Thanks for any suggestions!

Jeff Newmiller

2012-Dec-25 17:00 UTC

head link

[R] aggregate / collapse big data frame efficiently

You might consider using the sqldf package.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live
Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

Martin Batholdy <batholdy at googlemail.com> wrote:
>Hi,
>
>
>I need to aggregate rows of a data.frame by computing the mean for rows
>with the same factor-level on one factor-variable;
>
>here is the sample code:
>
>
>x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52))
>
>aggregate(x, list(x[,1]), mean)
>
>
>Now my problem is, that the actual data-set is much bigger (120 rows
>and approximately 100.000 columns) ? and it takes very very long
>(actually at some point I just stopped it).
>
>Is there anything that can be done to make the aggregate routine more
>efficient?
>Or is there a different approach that would work faster?
>
>
>Thanks for any suggestions!
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

arun

2012-Dec-25 17:08 UTC

head link

[R] aggregate / collapse big data frame efficiently

Hi,
You could use library(data.table) 
x <- data.frame(A=rep(letters,2), B=rnorm(52), C=rnorm(52), D=rnorm(52))
res<- with(x,aggregate(cbind(B,C,D),by=list(A),mean))
colnames(res)[1]<-"A"

?x1<-data.table(x)
res2<- x1[,list(B=mean(B),C=mean(C),D=mean(D)),by=A]
?identical(res,data.frame(res2))
#[1] TRUE

Just for comparison:
set.seed(25)
xnew<-data.frame(A=rep(letters,1500),B=rnorm(39000),C=rnorm(39000),D=rnorm(39000))
system.time(resnew<-with(xnew,aggregate(cbind(B,C,D),by=list(A),mean)))
?#user? system elapsed 
?# 0.152?? 0.000?? 0.152 

xnew1<-data.table(xnew)
system.time(resnew1<- xnew1[,list(B=mean(B),C=mean(C),D=mean(D)),by=A])
# user? system elapsed 
?# 0.004?? 0.000?? 0.005 



A.K.




----- Original Message -----
From: Martin Batholdy <batholdy at googlemail.com>
To: "r-help at r-project.org" <r-help at r-project.org>
Cc: 
Sent: Tuesday, December 25, 2012 11:34 AM
Subject: [R] aggregate / collapse big data frame efficiently

Hi,


I need to aggregate rows of a data.frame by computing the mean for rows with the
same factor-level on one factor-variable;

here is the sample code:


x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52))

aggregate(x, list(x[,1]), mean)


Now my problem is, that the actual data-set is much bigger (120 rows and
approximately 100.000 columns) ? and it takes very very long (actually at some
point I just stopped it).

Is there anything that can be done to make the aggregate routine more efficient?
Or is there a different approach that would work faster?


Thanks for any suggestions!

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Patrick Burns

2012-Dec-25 18:12 UTC

head link

[R] aggregate / collapse big data frame efficiently

I'd suggest the 'data.table' package.  That is
one of the prime uses it was created for.

Pat

On 25/12/2012 16:34, Martin Batholdy wrote:> Hi,
>
>
> I need to aggregate rows of a data.frame by computing the mean for rows
with the same factor-level on one factor-variable;
>
> here is the sample code:
>
>
> x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52))
>
> aggregate(x, list(x[,1]), mean)
>
>
> Now my problem is, that the actual data-set is much bigger (120 rows and
approximately 100.000 columns) ? and it takes very very long (actually at some
point I just stopped it).
>
> Is there anything that can be done to make the aggregate routine more
efficient?
> Or is there a different approach that would work faster?
>
>
> Thanks for any suggestions!
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
-- 
Patrick Burns
pburns at pburns.seanet.com
twitter: @portfolioprobe
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of 'Some hints for the R beginner'
and 'The R Inferno')

jim holtman

2012-Dec-25 18:20 UTC

head link

[R] aggregate / collapse big data frame efficiently

According to the way that you have used 'aggregate', you are taking
the column means.  Couple of suggestions for faster processing:


1. use matrices instead of data.frames ( i converted your example just
before using it)
2, use the 'colMeans'

I created a 120 x 100000 matrix with 10 levels and its does the
computation in less than 2 seconds:

>  n <- 100000
> nLevels <- 10
> nRows <- 120
> Cols <- list(rep(list(sample(nRows)), n))
> df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
> colnames(df)[-1] <- paste0('col', 1:n)
>
> # convert to matrix for faster processing
> df.m <- as.matrix(df[, -1])  # remove levels column
> str(df.m) int [1:120, 1:100000] 111 13 106 61 16 39 25 94 53 38 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:100000] "col1" "col2" "col3"
"col4" ...> system.time({+ # split the indices of rows for each level
+ x <- split(seq(nrow(df)), df$levels)
+ result <- sapply(x, function(a) colMeans(df.m[a, ]))
+ })
   user  system elapsed
   1.33    0.00    1.35> str(result) num [1:100000, 1:10] 57 57 57 57 57 57 57 57 57 57 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:100000] "col1" "col2" "col3"
"col4" ...
  ..$ : chr [1:10] "1" "2" "3" "4"
...>

On Tue, Dec 25, 2012 at 11:34 AM, Martin Batholdy
<batholdy at googlemail.com> wrote:> Hi,
>
>
> I need to aggregate rows of a data.frame by computing the mean for rows
with the same factor-level on one factor-variable;
>
> here is the sample code:
>
>
> x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52))
>
> aggregate(x, list(x[,1]), mean)
>
>
> Now my problem is, that the actual data-set is much bigger (120 rows and
approximately 100.000 columns) ? and it takes very very long (actually at some
point I just stopped it).
>
> Is there anything that can be done to make the aggregate routine more
efficient?
> Or is there a different approach that would work faster?
>
>
> Thanks for any suggestions!
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

Wensui Liu

2012-Dec-26 03:32 UTC

head link

[R] aggregate / collapse big data frame efficiently

aggregate() is not efficient. try by().


On Tue, Dec 25, 2012 at 11:34 AM, Martin Batholdy
<batholdy@googlemail.com>wrote:
> Hi,
>
>
> I need to aggregate rows of a data.frame by computing the mean for rows
> with the same factor-level on one factor-variable;
>
> here is the sample code:
>
>
> x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52))
>
> aggregate(x, list(x[,1]), mean)
>
>
> Now my problem is, that the actual data-set is much bigger (120 rows and
> approximately 100.000 columns) – and it takes very very long (actually at
> some point I just stopped it).
>
> Is there anything that can be done to make the aggregate routine more
> efficient?
> Or is there a different approach that would work faster?
>
>
> Thanks for any suggestions!
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
=============================WenSui Liu
Credit Risk Manager, 53 Bancorp
wensui.liu@53.com
513-295-4370
=============================
	[[alternative HTML version deleted]]

Maybe Matching Threads

Search for more apparently analagous threads

R help - Dec 2012 - aggregate / collapse big data frame efficiently

[R] aggregate / collapse big data frame efficiently

[R] aggregate / collapse big data frame efficiently

[R] aggregate / collapse big data frame efficiently

[R] aggregate / collapse big data frame efficiently

[R] aggregate / collapse big data frame efficiently

[R] aggregate / collapse big data frame efficiently

Maybe Matching Threads