Hi, I need to aggregate rows of a data.frame by computing the mean for rows with the same factor-level on one factor-variable; here is the sample code: x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52)) aggregate(x, list(x[,1]), mean) Now my problem is, that the actual data-set is much bigger (120 rows and approximately 100.000 columns) ? and it takes very very long (actually at some point I just stopped it). Is there anything that can be done to make the aggregate routine more efficient? Or is there a different approach that would work faster? Thanks for any suggestions!
You might consider using the sqldf package. --------------------------------------------------------------------------- Jeff Newmiller The ..... ..... Go Live... DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go... Live: OO#.. Dead: OO#.. Playing Research Engineer (Solar/Batteries O.O#. #.O#. with /Software/Embedded Controllers) .OO#. .OO#. rocks...1k --------------------------------------------------------------------------- Sent from my phone. Please excuse my brevity. Martin Batholdy <batholdy at googlemail.com> wrote:>Hi, > > >I need to aggregate rows of a data.frame by computing the mean for rows >with the same factor-level on one factor-variable; > >here is the sample code: > > >x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52)) > >aggregate(x, list(x[,1]), mean) > > >Now my problem is, that the actual data-set is much bigger (120 rows >and approximately 100.000 columns) ? and it takes very very long >(actually at some point I just stopped it). > >Is there anything that can be done to make the aggregate routine more >efficient? >Or is there a different approach that would work faster? > > >Thanks for any suggestions! > >______________________________________________ >R-help at r-project.org mailing list >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide >http://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code.
Hi, You could use library(data.table) x <- data.frame(A=rep(letters,2), B=rnorm(52), C=rnorm(52), D=rnorm(52)) res<- with(x,aggregate(cbind(B,C,D),by=list(A),mean)) colnames(res)[1]<-"A" ?x1<-data.table(x) res2<- x1[,list(B=mean(B),C=mean(C),D=mean(D)),by=A] ?identical(res,data.frame(res2)) #[1] TRUE Just for comparison: set.seed(25) xnew<-data.frame(A=rep(letters,1500),B=rnorm(39000),C=rnorm(39000),D=rnorm(39000)) system.time(resnew<-with(xnew,aggregate(cbind(B,C,D),by=list(A),mean))) ?#user? system elapsed ?# 0.152?? 0.000?? 0.152 xnew1<-data.table(xnew) system.time(resnew1<- xnew1[,list(B=mean(B),C=mean(C),D=mean(D)),by=A]) # user? system elapsed ?# 0.004?? 0.000?? 0.005 A.K. ----- Original Message ----- From: Martin Batholdy <batholdy at googlemail.com> To: "r-help at r-project.org" <r-help at r-project.org> Cc: Sent: Tuesday, December 25, 2012 11:34 AM Subject: [R] aggregate / collapse big data frame efficiently Hi, I need to aggregate rows of a data.frame by computing the mean for rows with the same factor-level on one factor-variable; here is the sample code: x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52)) aggregate(x, list(x[,1]), mean) Now my problem is, that the actual data-set is much bigger (120 rows and approximately 100.000 columns) ? and it takes very very long (actually at some point I just stopped it). Is there anything that can be done to make the aggregate routine more efficient? Or is there a different approach that would work faster? Thanks for any suggestions! ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
I'd suggest the 'data.table' package. That is one of the prime uses it was created for. Pat On 25/12/2012 16:34, Martin Batholdy wrote:> Hi, > > > I need to aggregate rows of a data.frame by computing the mean for rows with the same factor-level on one factor-variable; > > here is the sample code: > > > x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52)) > > aggregate(x, list(x[,1]), mean) > > > Now my problem is, that the actual data-set is much bigger (120 rows and approximately 100.000 columns) ? and it takes very very long (actually at some point I just stopped it). > > Is there anything that can be done to make the aggregate routine more efficient? > Or is there a different approach that would work faster? > > > Thanks for any suggestions! > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Patrick Burns pburns at pburns.seanet.com twitter: @portfolioprobe http://www.portfolioprobe.com/blog http://www.burns-stat.com (home of 'Some hints for the R beginner' and 'The R Inferno')
According to the way that you have used 'aggregate', you are taking the column means. Couple of suggestions for faster processing: 1. use matrices instead of data.frames ( i converted your example just before using it) 2, use the 'colMeans' I created a 120 x 100000 matrix with 10 levels and its does the computation in less than 2 seconds:> n <- 100000 > nLevels <- 10 > nRows <- 120 > Cols <- list(rep(list(sample(nRows)), n)) > df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols) > colnames(df)[-1] <- paste0('col', 1:n) > > # convert to matrix for faster processing > df.m <- as.matrix(df[, -1]) # remove levels column > str(df.m)int [1:120, 1:100000] 111 13 106 61 16 39 25 94 53 38 ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...> system.time({+ # split the indices of rows for each level + x <- split(seq(nrow(df)), df$levels) + result <- sapply(x, function(a) colMeans(df.m[a, ])) + }) user system elapsed 1.33 0.00 1.35> str(result)num [1:100000, 1:10] 57 57 57 57 57 57 57 57 57 57 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ... ..$ : chr [1:10] "1" "2" "3" "4" ...>On Tue, Dec 25, 2012 at 11:34 AM, Martin Batholdy <batholdy at googlemail.com> wrote:> Hi, > > > I need to aggregate rows of a data.frame by computing the mean for rows with the same factor-level on one factor-variable; > > here is the sample code: > > > x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52)) > > aggregate(x, list(x[,1]), mean) > > > Now my problem is, that the actual data-set is much bigger (120 rows and approximately 100.000 columns) ? and it takes very very long (actually at some point I just stopped it). > > Is there anything that can be done to make the aggregate routine more efficient? > Or is there a different approach that would work faster? > > > Thanks for any suggestions! > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.
aggregate() is not efficient. try by(). On Tue, Dec 25, 2012 at 11:34 AM, Martin Batholdy <batholdy@googlemail.com>wrote:> Hi, > > > I need to aggregate rows of a data.frame by computing the mean for rows > with the same factor-level on one factor-variable; > > here is the sample code: > > > x <- data.frame(rep(letters,2), rnorm(52), rnorm(52), rnorm(52)) > > aggregate(x, list(x[,1]), mean) > > > Now my problem is, that the actual data-set is much bigger (120 rows and > approximately 100.000 columns) – and it takes very very long (actually at > some point I just stopped it). > > Is there anything that can be done to make the aggregate routine more > efficient? > Or is there a different approach that would work faster? > > > Thanks for any suggestions! > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- =============================WenSui Liu Credit Risk Manager, 53 Bancorp wensui.liu@53.com 513-295-4370 ============================= [[alternative HTML version deleted]]