George Nachman
2006-Dec-12 23:34 UTC
[R] How to sum one column in a data frame keyed on other columns
I have a data frame that looks like this: url time somethingirrelevant visits www.foo.com 1:00 xxx 100 www.foo.com 1:00 yyy 50 www.foo.com 2:00 xyz 25 www.bar.com 1:00 xxx 200 www.bar.com 1:00 zzz 200 www.foo.com 2:00 xxx 500 I'd like to write some code that takes this as input and outputs something like this: url time total_vists www.foo.com 1:00 150 www.foo.com 2:00 525 www.bar.com 1:00 400 In other words, I need to calculate the sum of visits for each unique tuple of (url,time). I can do it with this code, but it's very slow, and doesn't seem like the right approach: keys = list() getkey = function(m,cols,index) { paste(m[index,cols],collapse=",") } for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] = 0 } for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] keys[[getkey(data,1:2,i)]] + data[i,4] } I'm sure there's a more functional-programming approach to this problem! Any ideas?
Simon Blomberg
2006-Dec-12 23:56 UTC
[R] How to sum one column in a data frame keyed on other columns
You could look at the reshape package, using sum as the aggregate function. HTH, Simon. George Nachman wrote:> I have a data frame that looks like this: > > url time somethingirrelevant visits > www.foo.com 1:00 xxx 100 > www.foo.com 1:00 yyy 50 > www.foo.com 2:00 xyz 25 > www.bar.com 1:00 xxx 200 > www.bar.com 1:00 zzz 200 > www.foo.com 2:00 xxx 500 > > I'd like to write some code that takes this as input and outputs > something like this: > > url time total_vists > www.foo.com 1:00 150 > www.foo.com 2:00 525 > www.bar.com 1:00 400 > > In other words, I need to calculate the sum of visits for each unique > tuple of (url,time). > > I can do it with this code, but it's very slow, and doesn't seem like > the right approach: > > keys = list() > getkey = function(m,cols,index) { paste(m[index,cols],collapse=",") } > for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] = 0 } > for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] > keys[[getkey(data,1:2,i)]] + data[i,4] } > > I'm sure there's a more functional-programming approach to this > problem! Any ideas? > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > >-- Simon Blomberg, B.Sc.(Hons.), Ph.D, M.App.Stat. Centre for Resource and Environmental Studies The Australian National University Canberra ACT 0200 Australia T: +61 2 6125 7800 email: Simon.Blomberg_at_anu.edu.au F: +61 2 6125 0757 CRICOS Provider # 00120C The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. - John Tukey.
Marc Schwartz
2006-Dec-13 00:05 UTC
[R] How to sum one column in a data frame keyed on other columns
On Tue, 2006-12-12 at 15:34 -0800, George Nachman wrote:> I have a data frame that looks like this: > > url time somethingirrelevant visits > www.foo.com 1:00 xxx 100 > www.foo.com 1:00 yyy 50 > www.foo.com 2:00 xyz 25 > www.bar.com 1:00 xxx 200 > www.bar.com 1:00 zzz 200 > www.foo.com 2:00 xxx 500 > > I'd like to write some code that takes this as input and outputs > something like this: > > url time total_vists > www.foo.com 1:00 150 > www.foo.com 2:00 525 > www.bar.com 1:00 400 > > In other words, I need to calculate the sum of visits for each unique > tuple of (url,time). > > I can do it with this code, but it's very slow, and doesn't seem like > the right approach: > > keys = list() > getkey = function(m,cols,index) { paste(m[index,cols],collapse=",") } > for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] = 0 } > for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] > keys[[getkey(data,1:2,i)]] + data[i,4] } > > I'm sure there's a more functional-programming approach to this > problem! Any ideas?See ?aggregate If your dataframe is called 'DF':> aggregate(DF$visits, list(DF$url, DF$time), sum)Group.1 Group.2 x 1 www.bar.com 1:00 400 2 www.foo.com 1:00 150 3 www.foo.com 2:00 525 HTH, Marc Schwartz
jim holtman
2006-Dec-13 00:07 UTC
[R] How to sum one column in a data frame keyed on other columns
> xurl time somethingirrelevant visits 1 www.foo.com 1:00 xxx 100 2 www.foo.com 1:00 yyy 50 3 www.foo.com 2:00 xyz 25 4 www.bar.com 1:00 xxx 200 5 www.bar.com 1:00 zzz 200 6 www.foo.com 2:00 xxx 500> ?aggregate > aggregate(x$visits, list(x$url, x$time), sum)Group.1 Group.2 x 1 www.bar.com 1:00 400 2 www.foo.com 1:00 150 3 www.foo.com 2:00 525>On 12/12/06, George Nachman <gnachman@llamas.org> wrote:> > I have a data frame that looks like this: > > url time somethingirrelevant visits > www.foo.com 1:00 xxx 100 > www.foo.com 1:00 yyy 50 > www.foo.com 2:00 xyz 25 > www.bar.com 1:00 xxx 200 > www.bar.com 1:00 zzz 200 > www.foo.com 2:00 xxx 500 > > I'd like to write some code that takes this as input and outputs > something like this: > > url time total_vists > www.foo.com 1:00 150 > www.foo.com 2:00 525 > www.bar.com 1:00 400 > > In other words, I need to calculate the sum of visits for each unique > tuple of (url,time). > > I can do it with this code, but it's very slow, and doesn't seem like > the right approach: > > keys = list() > getkey = function(m,cols,index) { paste(m[index,cols],collapse=",") } > for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] = 0 } > for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] > keys[[getkey(data,1:2,i)]] + data[i,4] } > > I'm sure there's a more functional-programming approach to this > problem! Any ideas? > > ______________________________________________ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve? [[alternative HTML version deleted]]
Bill.Venables at csiro.au
2006-Dec-13 00:32 UTC
[R] How to sum one column in a data frame keyed on other columns
Here is an elementary way of doing it:> daturl time somethingirrelevant visits 1 www.foo.com 1:00 xxx 100 2 www.foo.com 1:00 yyy 50 3 www.foo.com 2:00 xyz 25 4 www.bar.com 1:00 xxx 200 5 www.bar.com 1:00 zzz 200 6 www.foo.com 2:00 xxx 500> dat <- transform(dat, key = paste(url, time)) > total_visits <- with(dat, tapply(visits, key, sum)) > m <- match(names(total_visits), dat$key) > tdat <- cbind(dat[m, c("url", "time")], total_visits) > tdaturl time total_visits 4 www.bar.com 1:00 400 1 www.foo.com 1:00 150 3 www.foo.com 2:00 525>This should not be too difficult to morph into a fairly general function. Here's what I might do [warning: somewhat obscure code follows] sumUp <- function(dat, key_list, sum_list) { key <- with(dat, do.call("paste", dat[, key_list, drop = FALSE])) totals <- as.matrix(sapply(dat[, sum_list, drop = FALSE], tapply, key, sum)) dimnames(totals)[[2]] <- paste("total", sum_list, sep = "_") m <- match(dimnames(totals)[[1]], key) cbind(dat[m, key_list, drop = FALSE], totals) } check:> sumUp(dat, c("url", "time"), "visits")url time total_visits 4 www.bar.com 1:00 400 1 www.foo.com 1:00 150 3 www.foo.com 2:00 525> sumUp(dat, "url", "visits")url total_visits 4 www.bar.com 400 1 www.foo.com 675 Question for the reader: why to you need 'drop = FALSE' (in three places)? Bill Venables. -----Original Message----- From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of George Nachman Sent: Wednesday, 13 December 2006 9:35 AM To: r-help at stat.math.ethz.ch Subject: [R] How to sum one column in a data frame keyed on other columns I have a data frame that looks like this: url time somethingirrelevant visits www.foo.com 1:00 xxx 100 www.foo.com 1:00 yyy 50 www.foo.com 2:00 xyz 25 www.bar.com 1:00 xxx 200 www.bar.com 1:00 zzz 200 www.foo.com 2:00 xxx 500 I'd like to write some code that takes this as input and outputs something like this: url time total_vists www.foo.com 1:00 150 www.foo.com 2:00 525 www.bar.com 1:00 400 In other words, I need to calculate the sum of visits for each unique tuple of (url,time). I can do it with this code, but it's very slow, and doesn't seem like the right approach: keys = list() getkey = function(m,cols,index) { paste(m[index,cols],collapse=",") } for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] = 0 } for (i in 1:nrow(data)) { keys[[getkey(data,1:2,i)]] keys[[getkey(data,1:2,i)]] + data[i,4] } I'm sure there's a more functional-programming approach to this problem! Any ideas? ______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.