Vincent Vinh-Hung
2009-Jul-04  10:21 UTC
[R] A question built on How to sum one column in a data frame keyed on other columns
Dear List: I have a question related to a previous discussion How to sum one column in a data frame keyed on other columns https://stat.ethz.ch/pipermail/r-help/2006-December/122141.html (George Nachman, Bill Venables) The original query was to calculate the sum of visits for each unique tuple of (url, time) from the data frame dat:> daturl time somethingirrelevant visits 1 www.foo.com 1:00 xxx 100 2 www.foo.com 1:00 yyy 50 3 www.foo.com 2:00 xyz 25 4 www.bar.com 1:00 xxx 200 5 www.bar.com 1:00 zzz 200 6 www.foo.com 2:00 xxx 500 The response gave:> tdaturl time total_visits 4 www.bar.com 1:00 400 1 www.foo.com 1:00 150 3 www.foo.com 2:00 525 My question is how can I build a similar data frame but having also rows for combinations of (url, time) that were not in dat? In this example, the (url, time) without record of visit would be (www.bar.com, 2:00):> ndaturl time total_visits www.bar.com 1:00 400 www.bar.com 2:00 0 www.foo.com 1:00 150 www.foo.com 2:00 525 I have tried to build a data frame with> adat <- data.frame (url = rep(unique(dat$url), each=length(unique(dat$time))),time=unique(dat$time), alt_visits=0)> ndat <- merge (adat, tdat, by=c("url", "time"), all = TRUE)then replace the NA and remove the alt_visits column. But this appears clumsy and quite slow with 10000 rows and 4 columns, I would be most grateful for other suggestions, Vincent
David Winsemius
2009-Jul-04  13:40 UTC
[R] A question built on How to sum one column in a data frame keyed on other columns
On Jul 4, 2009, at 6:21 AM, Vincent Vinh-Hung wrote:> Dear List: > > I have a question related to a previous discussion > How to sum one column in a data frame keyed on other columns > https://stat.ethz.ch/pipermail/r-help/2006-December/122141.html > (George Nachman, Bill Venables) > > The original query was to calculate the sum of visits for each unique > tuple of (url, time) from the data frame dat: > >> dat > url time somethingirrelevant visits > 1 www.foo.com 1:00 xxx 100 > 2 www.foo.com 1:00 yyy 50 > 3 www.foo.com 2:00 xyz 25 > 4 www.bar.com 1:00 xxx 200 > 5 www.bar.com 1:00 zzz 200 > 6 www.foo.com 2:00 xxx 500 > > The response gave: > >> tdat > url time total_visits > 4 www.bar.com 1:00 400 > 1 www.foo.com 1:00 150 > 3 www.foo.com 2:00 525 > > My question is how can I build a similar data frame but having also > rows for > combinations of (url, time) that were not in dat? > In this example, the (url, time) without record of visit would be > (www.bar.com, 2:00): > >> ndat > url time total_visits > www.bar.com 1:00 400 > www.bar.com 2:00 0 > www.foo.com 1:00 150 > www.foo.com 2:00 525 > > I have tried to build a data frame with >> adat <- data.frame (url = rep(unique(dat$url), >> each=length(unique(dat$time))), > time=unique(dat$time), alt_visits=0) >> ndat <- merge (adat, tdat, by=c("url", "time"), all = TRUE) > then replace the NA and remove the alt_visits column. > But this appears clumsy and quite slow with 10000 rows and 4 columns, > I would be most grateful for other suggestions,?expand.grid Perhaps: > fulldat <- expand.grid( unique(dat$url), unique(dat$time) ) > names(fulldat) <- c("url", "time") > merge(fulldat, tdat) url time total_visits 1 www.bar.com 1:00 400 2 www.foo.com 1:00 150 3 www.foo.com 2:00 525 > merge(fulldat, tdat, all=TRUE) # notice that default is to drop missing levels url time total_visits 1 www.bar.com 1:00 400 2 www.bar.com 2:00 NA 3 www.foo.com 1:00 150 4 www.foo.com 2:00 525 If they are not ordered, then you might want to sort the elements before they get sent to expand.grid for purposes of readability. David Winsemius, MD Heritage Laboratories West Hartford, CT