Titus von der Malsburg
2009-Dec-08 12:50 UTC
[R] conditionally merging adjacent rows in a data frame
Hi, I have a data frame and want to merge adjacent rows if some condition is
met. There's an obvious solution using a loop but it is prohibitively slow
because my data frame is large. Is there an efficient canonical solution for
that?

> head(d)
     rt dur tid  mood roi  x
55 5523 200   4  subj   9  5
56 5523  52   4  subj   7 31
57 5523 209   4  subj   4  9
58 5523 188   4  subj   4  7
70 4016 264   5 indic   9 51
71 4016 195   5 indic   4 14

The desired result would have consecutive rows with the same roi value merged.
dur values should be added and x values averaged; the other values don't differ
in these rows and should stay the same.

> head(result)
     rt dur tid  mood roi  x
55 5523 200   4  subj   9  5
56 5523  52   4  subj   7 31
57 5523 397   4  subj   4  8
70 4016 264   5 indic   9 51
71 4016 195   5 indic   4 14

There's also a solution using reshape. It uses an index for blocks

  d$index <- cumsum(c(TRUE, diff(d$roi) != 0))

then melts and casts for every column using an appropriate fun.aggregate.
However, this is a bit cumbersome, and I'm also not sure how to make sure that
I get the original order of rows.

Thanks for any suggestion.

  Titus
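For reference, the reshape approach sketched above might look roughly like this
(my sketch, not code from the thread; the melt/cast calls follow the old
reshape package's interface, and the per-column split via `subset` is an
assumption):

```r
library(reshape)  # the package mentioned above (predecessor of reshape2)

# Block index: a new block starts whenever roi changes.
d$index <- cumsum(c(TRUE, diff(d$roi) != 0))

# Melt, then cast each measure column with its own fun.aggregate.
m    <- melt(d, id = c("index", "rt", "tid", "mood", "roi"))
durs <- cast(m, index + rt + tid + mood + roi ~ ., sum,
             subset = variable == "dur")
xs   <- cast(m, index + rt + tid + mood + roi ~ ., mean,
             subset = variable == "x")
```

Since index increases down the data frame, casting with index as the leading
id variable should keep the blocks in their original order.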
Gray Calhoun
2009-Dec-08 14:57 UTC
[R] Fwd: conditionally merging adjacent rows in a data frame
I think I forgot to send the original to the mailing list, so I'm forwarding
it (see below). Sorry about that (and sorry if I did remember and this is a
duplicate).

After a few more minutes of thought, I realized that you should probably make
sure that rt, tid, and mood are also the same in consecutive rows when
constructing the 'consecutiveROI' vector (just as an additional error check).
There may be a built-in function that could replace the first three lines as
well.

Best, Gray

---------- Forwarded message ----------
From: Gray Calhoun <gray.calhoun at gmail.com>
Date: Tue, Dec 8, 2009 at 6:42 AM
Subject: Re: [R] conditionally merging adjacent rows in a data frame
To: Titus von der Malsburg <malsburg at gmail.com>

Hi Titus,

This solution isn't great and will probably need some work on your part. The
basic idea is to create a new index that is shared by consecutive rows with
the same value of roi, then just aggregate by the new index:

> consecutiveROI <- d$roi[-1] == d$roi[1:(length(d$roi) - 1)]
> newindex <- 1:dim(d)[1]
> newindex[c(consecutiveROI, FALSE)] <- newindex[c(FALSE, consecutiveROI)]
> a <- aggregate(d$x, list(newindex = newindex), mean)

And the same for dur. You can get the unique rows of d with

> d[a$newindex,]

There may be bugs, but I think this general approach will work well.

--Gray

On Tue, Dec 8, 2009 at 6:50 AM, Titus von der Malsburg
<malsburg at gmail.com> wrote:
> [original question quoted above]

--
Gray Calhoun
Assistant Professor of Economics
Iowa State University
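The built-in function Gray alludes to for building the block index might be
rle; a short sketch (my addition, not part of the original message):

```r
# rle() collapses runs of equal values; repeating each run's number by its
# length yields a block index shared by consecutive rows with equal roi.
r   <- rle(d$roi)
idx <- rep(seq_along(r$lengths), times = r$lengths)
# idx can then be used as the grouping variable in aggregate() or tapply().
```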
How about creating an index using multiple columns.

a <- with(d, aggregate(dur, list(rt=rt, tid=tid, mood=mood, roi=roi), sum))
b <- with(d, aggregate(x,   list(rt=rt, tid=tid, mood=mood, roi=roi), mean))
c <- merge(a, b, by=c("rt", "tid", "mood", "roi"))

I suppose one could save some time by not running aggregate twice on the same
dataset, but I am not sure how.

Nikhil

On 8 Dec 2009, at 7:50 AM, Titus von der Malsburg wrote:
> [original question quoted above]
Titus von der Malsburg
2009-Dec-08 22:13 UTC
[R] conditionally merging adjacent rows in a data frame
On Tue, Dec 8, 2009 at 5:19 PM, Nikhil Kaza <nikhil.list at gmail.com> wrote:
> I suppose that is true, but the example data seem to suggest that it is
> sorted by rt.

I was not very clear on that. Sorry.

> d$count <- 1
> a <- with(d, aggregate(subset(d, select=c("dur", "x", "count")),
>                        list(rt=rt, tid=tid, mood=mood, roi=roi), sum))
> a$x <- a$x/a$count

This is neat!

> But it would still be nice to get a generic way that uses different
> functions on different columns much like excel's pivot table.

I guess the most straightforward thing would be to extend aggregate to also
accept, instead of a single FUN, a list of FUNs where the first is applied to
the first column of x (the data frame), the second to the second column, and
so on.

  Titus
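Something like that per-column dispatch can already be sketched with base R
(a hypothetical helper, not a proposal that made it into aggregate; the
column/function pairing via a named list is my assumption):

```r
# Apply a different function to each measure column within blocks defined
# by idx, then reattach the constant columns from each block's first row.
idx  <- cumsum(c(TRUE, diff(d$roi) != 0))
funs <- list(dur = sum, x = mean)   # one aggregation function per column
agg  <- as.data.frame(Map(function(col, f) tapply(d[[col]], idx, f),
                          names(funs), funs))
d2   <- cbind(d[!duplicated(idx), c("rt", "tid", "mood", "roi")], agg)
```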
Gabor Grothendieck
2009-Dec-08 23:11 UTC
[R] conditionally merging adjacent rows in a data frame
Here are a couple of solutions. The first uses by and the second sqldf:

> Lines <- "  rt dur tid  mood roi  x
+ 55 5523 200   4  subj   9  5
+ 56 5523  52   4  subj   7 31
+ 57 5523 209   4  subj   4  9
+ 58 5523 188   4  subj   4  7
+ 70 4016 264   5 indic   9 51
+ 71 4016 195   5 indic   4 14"
> d <- read.table(textConnection(Lines), header = TRUE)
>
> # solution 1 - see ?by and ?transform
>
> idx <- cumsum( c(TRUE, diff(d$roi) != 0) )
> do.call(rbind, by(d, idx, function(x)
+   transform(x, dur = sum(dur), x = mean(x))[1, , drop = FALSE]))
    rt dur tid  mood roi  x
1 5523 200   4  subj   9  5
2 5523  52   4  subj   7 31
3 5523 397   4  subj   4  8
4 4016 264   5 indic   9 51
5 4016 195   5 indic   4 14
>
> # solution 2 - see http://sqldf.googlecode.com
>
> dd <- data.frame(d, idx)  # idx computed above
> library(sqldf)
> sqldf("select rt, sum(dur) dur, tid, mood, roi, avg(x) x from dd group by idx")
    rt dur tid  mood roi  x
1 5523 200   4  subj   9  5
2 5523  52   4  subj   7 31
3 5523 397   4  subj   4  8
4 4016 264   5 indic   9 51
5 4016 195   5 indic   4 14

On Tue, Dec 8, 2009 at 7:50 AM, Titus von der Malsburg
<malsburg at gmail.com> wrote:
> [original question quoted above]
Titus von der Malsburg
2009-Dec-09 12:59 UTC
[R] conditionally merging adjacent rows in a data frame
On Wed, Dec 9, 2009 at 12:11 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Here are a couple of solutions. The first uses by and the second sqldf:

Brilliant! Now I have a whole collection of solutions. I did a simple
performance comparison with a data frame that has 7929 lines.

The results were as follows (loading the appropriate packages is not included
in the measurements):

  times <- c(0.248, 0.551, 41.080, 0.16, 0.190)
  names(times) <- c("aggregate", "summaryBy", "by+transform", "sqldf", "tapply")
  barplot(times, log="y", ylab="log(s)")

So sqldf clearly wins, followed by tapply and aggregate. summaryBy is slower
than necessary because it computes both mean /and/ sum for both x and dur.
by+transform presumably suffers from the construction of many intermediate
data frames.

Are there any canonical places where R recipes are collected? If yes, I would
write up a summary.

These were the competitors:

# Gray's and Nikhil's aggregate solution:
aggregate.fixations1 <- function(d) {
  idx <- c(TRUE, diff(d$roi) != 0)
  d2  <- d[idx,]
  idx <- cumsum(idx)
  d2$dur <- aggregate(d$dur, list(idx), sum)[[2]]
  d2$x   <- aggregate(d$x,   list(idx), mean)[[2]]
  d2
}

# Marek's summaryBy:
library(doBy)
aggregate.fixations2 <- function(d) {
  idx   <- c(TRUE, diff(d$roi) != 0)
  d2    <- d[idx,]
  d$idx <- cumsum(idx)
  d2$r  <- summaryBy(dur + x ~ idx, data = d,
                     FUN = c(sum, mean))[c("dur.sum", "x.mean")]
  d2
}

# Gabor's by+transform solution:
aggregate.fixations3 <- function(d) {
  idx <- cumsum(c(TRUE, diff(d$roi) != 0))
  d2  <- do.call(rbind, by(d, idx, function(x)
    transform(x, dur = sum(dur), x = mean(x))[1, , drop = FALSE]))
  d2
}

# Gabor's sqldf solution:
library(sqldf)
aggregate.fixations4 <- function(d) {
  idx   <- c(TRUE, diff(d$roi) != 0)
  d2    <- d[idx,]
  d$idx <- cumsum(idx)
  d2$r  <- sqldf("select sum(dur), avg(x) x from d group by idx")
  d2
}

# Titus' solution using plain old tapply:
aggregate.fixations5 <- function(d) {
  idx <- c(TRUE, diff(d$roi) != 0)
  d2  <- d[idx,]
  idx <- cumsum(idx)
  d2$dur <- tapply(d$dur, idx, sum)
  d2$x   <- tapply(d$x, idx, mean)
  d2
}
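A harness for reproducing timings like the ones above might look as follows
(my sketch; the original 7929-row data frame is not available, so a random
stand-in with a similar shape is an assumption):

```r
# Hypothetical timing harness: build a stand-in data frame and time one of
# the competitors, e.g. the tapply version, with system.time().
set.seed(1)
n <- 8000
d <- data.frame(rt   = rep(1:1000, each = 8),
                dur  = sample(50:300, n, replace = TRUE),
                tid  = 1,
                mood = "subj",
                roi  = sample(1:9, n, replace = TRUE),
                x    = sample(1:60, n, replace = TRUE))
system.time(aggregate.fixations5(d))
```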
Gabor Grothendieck
2009-Dec-09 13:12 UTC
[R] conditionally merging adjacent rows in a data frame
On Wed, Dec 9, 2009 at 7:59 AM, Titus von der Malsburg
<malsburg at gmail.com> wrote:
> [benchmark results quoted above]
>
> Are there any canonical places where R recipes are collected? If yes I
> would write up a summary.

If you google for R wiki it's the first hit.