Dear R experts,

I would really appreciate any ideas on how to use the aggregate function more efficiently.

More specifically, I would like to calculate the mean of certain values in a data frame, grouped by various attributes, and then create a new column in the data frame that holds the corresponding mean for every row. Here is part of my code:

    matchMean <- function(ind, dataTable, aggrTable)
    {
      index <- which((aggrTable[,1] == dataTable[["Attr1"]][ind]) &
                     (aggrTable[,2] == dataTable[["Attr2"]][ind]))
      as.numeric(aggrTable[index, 3])
    }

    avgDur <- aggregate(ap.dat[["Dur"]], by = list(ap.dat[["Attr1"]],
                        ap.dat[["Attr2"]]), FUN = "mean")
    meanDur <- sapply((1:length(ap.dat[,1])), FUN = matchMean, ap.dat, avgDur)
    ap.dat <- cbind(ap.dat, meanDur)

As I am dealing with a very large dataset, my matching function takes a long time to run, so I would be really grateful for any ideas on how to speed up this matching step.

Thank you very much in advance!

Kind regards,
Stella

--
Stella Pachidi
Master in Business Informatics student
Utrecht University
It's easiest for us to help if you give us a reproducible example. We don't have your dataset (ap.dat), so we can't run your code. It's easy to create sample data with the random number generators in R, or to use ?dput to give us a sample of your actual data frame.

I would guess your problem is solved by ?ave, though.
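
To make the ?ave suggestion concrete, here is a minimal sketch. The sample data is made up (only the column names Attr1, Attr2 and Dur come from the code in the original post), so treat it as an illustration under those assumptions rather than a tested solution for the real ap.dat:

```r
# Made-up sample data; the column names follow the original post
set.seed(1)
ap.dat <- data.frame(
  Attr1 = rep(c("A", "B"), each = 50),
  Attr2 = rep(c("A", "B"), times = 50),
  Dur   = rnorm(100)
)

# ave() returns a vector as long as Dur that holds, for every row,
# the mean of Dur within that row's Attr1/Attr2 group
ap.dat$meanDur <- ave(ap.dat$Dur, ap.dat$Attr1, ap.dat$Attr2, FUN = mean)

head(ap.dat)
```

ave() does the grouping and the matching back to rows in one vectorised call, so the row-by-row sapply() lookup should not be needed. Note that FUN has to be passed by name; an unnamed extra argument would be treated as another grouping variable.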
Take a look at ?split (and ?unsplit), e.g.:

    Dur <- rnorm(100)
    Attr1 <- rep(c("A", "B"), each = 50)
    Attr2 <- rep(c("A", "B"), times = 50)
    ap.dat <- data.frame(Attr1, Attr2, Dur)

    # one grouping key per row, built from both attributes
    split.fact <- paste(ap.dat$Attr1, ap.dat$Attr2)
    ap.list <- split(ap.dat, split.fact)
    # add the group mean as a column to each piece
    ap.mean <- lapply(ap.list, function(x) {
      x$meanDur <- rep(mean(x$Dur), nrow(x))
      return(x)
    })
    # reassemble the pieces in the original row order
    ap.dat.fast <- unsplit(ap.mean, split.fact)

system.time on 1000 replicates gives:

    > system.time(replicate(1000,{
    + split.fact <- paste(ap.dat$Attr1,ap.dat$Attr2)
    + ap.list <-split(ap.dat,split.fact)
    + ap.mean <-lapply(ap.list,functi .... [TRUNCATED]
       user  system elapsed
       4.88    0.00    4.88

    > system.time(replicate(1000,{
    + avgDur <- aggregate(ap.dat[["Dur"]], by = list(ap.dat[["Attr1"]],
    + ap.dat[["Attr2"]]), FUN="mean")
    + meanDur <- sapp .... [TRUNCATED]
       user  system elapsed
      58.00    0.11   58.13

So on this sample the split/unsplit version is more than ten times faster (4.88 s versus 58.13 s elapsed).

Cheers
Joris

On Tue, Jun 1, 2010 at 4:48 PM, Stella Pachidi wrote:
> As I deal with very large dataset, it takes long time to run my
> matching function, so if you had an idea on how to automate more this
> matching process I would be really grateful.

--
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
Joris.Meys@Ugent.be

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.