Titus von der Malsburg
2009-Dec-08 12:50 UTC
[R] conditionally merging adjacent rows in a data frame
Hi, I have a data frame and want to merge adjacent rows if some condition is
met. There's an obvious solution using a loop but it is prohibitively slow
because my data frame is large. Is there an efficient canonical solution for
that?

> head(d)
     rt dur tid  mood roi  x
55 5523 200   4  subj   9  5
56 5523  52   4  subj   7 31
57 5523 209   4  subj   4  9
58 5523 188   4  subj   4  7
70 4016 264   5 indic   9 51
71 4016 195   5 indic   4 14

The desired result would have consecutive rows with the same roi value merged.
dur values should be added and x values averaged; the other values don't differ
in these rows and should stay the same.

> head(result)
     rt dur tid  mood roi  x
55 5523 200   4  subj   9  5
56 5523  52   4  subj   7 31
57 5523 397   4  subj   4  8
70 4016 264   5 indic   9 51
71 4016 195   5 indic   4 14

There's also a solution using reshape. It uses an index for blocks

  d$index <- cumsum(c(TRUE, diff(d$roi) != 0))

then melts and casts for every column using an appropriate fun.aggregate.
However, this is a bit cumbersome, and I'm also not sure how to make sure that
I get the original order of rows.

Thanks for any suggestion.

  Titus
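For reference, the reshape approach sketched above might look roughly like this
(my sketch, not code from the thread; the melt/cast calls follow the old
reshape package's interface, and the per-column split via `subset` is an
assumption):

```r
library(reshape)  # the package mentioned above (predecessor of reshape2)

# Block index: a new block starts whenever roi changes.
d$index <- cumsum(c(TRUE, diff(d$roi) != 0))

# Melt, then cast each measure column with its own fun.aggregate.
m    <- melt(d, id = c("index", "rt", "tid", "mood", "roi"))
durs <- cast(m, index + rt + tid + mood + roi ~ ., sum,
             subset = variable == "dur")
xs   <- cast(m, index + rt + tid + mood + roi ~ ., mean,
             subset = variable == "x")
```

Since index increases down the data frame, casting with index as the leading
id variable should keep the blocks in their original order.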
Gray Calhoun
2009-Dec-08 14:57 UTC
[R] Fwd: conditionally merging adjacent rows in a data frame
I think I forgot to send the original to the mailing list, so I'm forwarding
it (see below). Sorry about that (and sorry if I did remember and this is a
duplicate).

After a few more minutes of thought, I realized that you should probably make
sure that rt, tid, and mood are also the same in consecutive rows when
constructing the 'consecutiveROI' vector (just as an additional error check).
There may be a built-in function that could replace the first three lines as
well.

Best, Gray

---------- Forwarded message ----------
From: Gray Calhoun <gray.calhoun at gmail.com>
Date: Tue, Dec 8, 2009 at 6:42 AM
Subject: Re: [R] conditionally merging adjacent rows in a data frame
To: Titus von der Malsburg <malsburg at gmail.com>

Hi Titus,

This solution isn't great and will probably need some work on your part. The
basic idea is to create a new index that is shared by consecutive rows with
the same value of roi, then just aggregate by the new index:

> consecutiveROI <- d$roi[-1] == d$roi[1:(length(d$roi) - 1)]
> newindex <- 1:dim(d)[1]
> newindex[c(consecutiveROI, FALSE)] <- newindex[c(FALSE, consecutiveROI)]
> a <- aggregate(d$x, list(newindex = newindex), mean)

And the same for dur. You can get the unique rows of d with

> d[a$newindex,]

There may be bugs, but I think this general approach will work well.

--Gray

On Tue, Dec 8, 2009 at 6:50 AM, Titus von der Malsburg
<malsburg at gmail.com> wrote:
> [original question quoted above]

--
Gray Calhoun
Assistant Professor of Economics
Iowa State University
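The built-in function Gray alludes to for building the block index might be
rle; a short sketch (my addition, not part of the original message):

```r
# rle() collapses runs of equal values; repeating each run's number by its
# length yields a block index shared by consecutive rows with equal roi.
r   <- rle(d$roi)
idx <- rep(seq_along(r$lengths), times = r$lengths)
# idx can then be used as the grouping variable in aggregate() or tapply().
```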
How about creating an index using multiple columns.

a <- with(d, aggregate(dur, list(rt=rt, tid=tid, mood=mood, roi=roi), sum))
b <- with(d, aggregate(x,   list(rt=rt, tid=tid, mood=mood, roi=roi), mean))
c <- merge(a, b, by=c("rt", "tid", "mood", "roi"))

I suppose one could save some time by not running aggregate twice on the same
dataset, but I am not sure how.

Nikhil

On 8 Dec 2009, at 7:50 AM, Titus von der Malsburg wrote:
> [original question quoted above]
Titus von der Malsburg
2009-Dec-08 22:13 UTC
[R] conditionally merging adjacent rows in a data frame
On Tue, Dec 8, 2009 at 5:19 PM, Nikhil Kaza <nikhil.list at gmail.com> wrote:
> I suppose that is true, but the example data seem to suggest that it is
> sorted by rt.

I was not very clear on that. Sorry.

> d$count <- 1
> a <- with(d, aggregate(subset(d, select=c("dur", "x", "count")),
>                        list(rt=rt, tid=tid, mood=mood, roi=roi), sum))
> a$x <- a$x/a$count

This is neat!

> But it would still be nice to get a generic way that uses different
> functions on different columns much like excel's pivot table.

I guess the most straightforward thing would be to extend aggregate to also
accept, instead of a single FUN, a list of FUNs where the first is applied to
the first column of x (the data frame), the second to the second column, and
so on.

  Titus
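Something like that per-column dispatch can already be sketched with base R
(a hypothetical helper, not a proposal that made it into aggregate; the
column/function pairing via a named list is my assumption):

```r
# Apply a different function to each measure column within blocks defined
# by idx, then reattach the constant columns from each block's first row.
idx  <- cumsum(c(TRUE, diff(d$roi) != 0))
funs <- list(dur = sum, x = mean)   # one aggregation function per column
agg  <- as.data.frame(Map(function(col, f) tapply(d[[col]], idx, f),
                          names(funs), funs))
d2   <- cbind(d[!duplicated(idx), c("rt", "tid", "mood", "roi")], agg)
```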
Gabor Grothendieck
2009-Dec-08 23:11 UTC
[R] conditionally merging adjacent rows in a data frame
Here are a couple of solutions. The first uses by and the second sqldf:

> Lines <- "  rt dur tid  mood roi  x
+ 55 5523 200   4  subj   9  5
+ 56 5523  52   4  subj   7 31
+ 57 5523 209   4  subj   4  9
+ 58 5523 188   4  subj   4  7
+ 70 4016 264   5 indic   9 51
+ 71 4016 195   5 indic   4 14"
> d <- read.table(textConnection(Lines), header = TRUE)
>
> # solution 1 - see ?by and ?transform
>
> idx <- cumsum( c(TRUE, diff(d$roi) != 0) )
> do.call(rbind, by(d, idx, function(x)
+   transform(x, dur = sum(dur), x = mean(x))[1, , drop = FALSE]))
    rt dur tid  mood roi  x
1 5523 200   4  subj   9  5
2 5523  52   4  subj   7 31
3 5523 397   4  subj   4  8
4 4016 264   5 indic   9 51
5 4016 195   5 indic   4 14
>
> # solution 2 - see http://sqldf.googlecode.com
>
> dd <- data.frame(d, idx)  # idx computed above
> library(sqldf)
> sqldf("select rt, sum(dur) dur, tid, mood, roi, avg(x) x from dd group by idx")
    rt dur tid  mood roi  x
1 5523 200   4  subj   9  5
2 5523  52   4  subj   7 31
3 5523 397   4  subj   4  8
4 4016 264   5 indic   9 51
5 4016 195   5 indic   4 14

On Tue, Dec 8, 2009 at 7:50 AM, Titus von der Malsburg
<malsburg at gmail.com> wrote:
> [original question quoted above]
Titus von der Malsburg
2009-Dec-09 12:59 UTC
[R] conditionally merging adjacent rows in a data frame
On Wed, Dec 9, 2009 at 12:11 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Here are a couple of solutions. The first uses by and the second sqldf:

Brilliant! Now I have a whole collection of solutions. I did a simple
performance comparison with a data frame that has 7929 lines.

The results were as follows (loading the appropriate packages is not included
in the measurements):

  times <- c(0.248, 0.551, 41.080, 0.16, 0.190)
  names(times) <- c("aggregate", "summaryBy", "by+transform", "sqldf", "tapply")
  barplot(times, log="y", ylab="log(s)")

So sqldf clearly wins, followed by tapply and aggregate. summaryBy is slower
than necessary because it computes both mean /and/ sum for both x and dur.
by+transform presumably suffers from the construction of many intermediate
data frames.

Are there any canonical places where R recipes are collected? If yes, I would
write up a summary.

These were the competitors:

# Gray's and Nikhil's aggregate solution:
aggregate.fixations1 <- function(d) {
  idx <- c(TRUE, diff(d$roi) != 0)
  d2  <- d[idx,]
  idx <- cumsum(idx)
  d2$dur <- aggregate(d$dur, list(idx), sum)[[2]]
  d2$x   <- aggregate(d$x,   list(idx), mean)[[2]]
  d2
}

# Marek's summaryBy:
library(doBy)
aggregate.fixations2 <- function(d) {
  idx   <- c(TRUE, diff(d$roi) != 0)
  d2    <- d[idx,]
  d$idx <- cumsum(idx)
  d2$r  <- summaryBy(dur + x ~ idx, data = d,
                     FUN = c(sum, mean))[c("dur.sum", "x.mean")]
  d2
}

# Gabor's by+transform solution:
aggregate.fixations3 <- function(d) {
  idx <- cumsum(c(TRUE, diff(d$roi) != 0))
  d2  <- do.call(rbind, by(d, idx, function(x)
    transform(x, dur = sum(dur), x = mean(x))[1, , drop = FALSE]))
  d2
}

# Gabor's sqldf solution:
library(sqldf)
aggregate.fixations4 <- function(d) {
  idx   <- c(TRUE, diff(d$roi) != 0)
  d2    <- d[idx,]
  d$idx <- cumsum(idx)
  d2$r  <- sqldf("select sum(dur), avg(x) x from d group by idx")
  d2
}

# Titus' solution using plain old tapply:
aggregate.fixations5 <- function(d) {
  idx <- c(TRUE, diff(d$roi) != 0)
  d2  <- d[idx,]
  idx <- cumsum(idx)
  d2$dur <- tapply(d$dur, idx, sum)
  d2$x   <- tapply(d$x, idx, mean)
  d2
}
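A harness for reproducing timings like the ones above might look as follows
(my sketch; the original 7929-row data frame is not available, so a random
stand-in with a similar shape is an assumption):

```r
# Hypothetical timing harness: build a stand-in data frame and time one of
# the competitors, e.g. the tapply version, with system.time().
set.seed(1)
n <- 8000
d <- data.frame(rt   = rep(1:1000, each = 8),
                dur  = sample(50:300, n, replace = TRUE),
                tid  = 1,
                mood = "subj",
                roi  = sample(1:9, n, replace = TRUE),
                x    = sample(1:60, n, replace = TRUE))
system.time(aggregate.fixations5(d))
```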
Gabor Grothendieck
2009-Dec-09 13:12 UTC
[R] conditionally merging adjacent rows in a data frame
On Wed, Dec 9, 2009 at 7:59 AM, Titus von der Malsburg
<malsburg at gmail.com> wrote:
> [benchmark results quoted above]
>
> Are there any canonical places where R recipes are collected? If yes I
> would write up a summary.

If you google for R wiki it's the first hit.