thr3ads.net - R help - [R] A question built on How to sum one column in a data frame keyed on other columns [Jul 2009]

If this information is useful, please help other people find it:
Share via:

Vincent Vinh-Hung

2009-Jul-04 10:21 UTC

[R] A question built on How to sum one column in a data frame keyed on other columns

Dear List:

I have a question related to a previous discussion
How to sum one column in a data frame keyed on other columns
https://stat.ethz.ch/pipermail/r-help/2006-December/122141.html
(George Nachman, Bill Venables)

The original query was to calculate the sum of visits for each unique
tuple of (url, time) from the data frame dat:
> dat          url time somethingirrelevant visits
1 www.foo.com 1:00                 xxx    100
2 www.foo.com 1:00                 yyy     50
3 www.foo.com 2:00                 xyz     25
4 www.bar.com 1:00                 xxx    200
5 www.bar.com 1:00                 zzz    200
6 www.foo.com 2:00                 xxx    500

The response gave:
> tdat          url time total_visits
4 www.bar.com 1:00          400
1 www.foo.com 1:00          150
3 www.foo.com 2:00          525

My question is how can I build a similar data frame but having also rows for
combinations of (url, time) that were not in dat?
In this example, the (url, time) without record of visit would be
(www.bar.com, 2:00):
> ndat          url time total_visits
www.bar.com 1:00          400
www.bar.com 2:00            0
www.foo.com 1:00          150
www.foo.com 2:00          525

I have tried to build a data frame with> adat <- data.frame (url = rep(unique(dat$url),
each=length(unique(dat$time))),
	time=unique(dat$time), alt_visits=0)> ndat <- merge (adat, tdat, by=c("url", "time"), all
= TRUE)then replace the NA and remove the alt_visits column.
But this appears clumsy and quite slow with 10000 rows and 4 columns,
I would be most grateful for other suggestions,

Vincent

David Winsemius

2009-Jul-04 13:40 UTC

head link

[R] A question built on How to sum one column in a data frame keyed on other columns

On Jul 4, 2009, at 6:21 AM, Vincent Vinh-Hung wrote:
> Dear List:
>
> I have a question related to a previous discussion
> How to sum one column in a data frame keyed on other columns
> https://stat.ethz.ch/pipermail/r-help/2006-December/122141.html
> (George Nachman, Bill Venables)
>
> The original query was to calculate the sum of visits for each unique
> tuple of (url, time) from the data frame dat:
>
>> dat
>          url time somethingirrelevant visits
> 1 www.foo.com 1:00                 xxx    100
> 2 www.foo.com 1:00                 yyy     50
> 3 www.foo.com 2:00                 xyz     25
> 4 www.bar.com 1:00                 xxx    200
> 5 www.bar.com 1:00                 zzz    200
> 6 www.foo.com 2:00                 xxx    500
>
> The response gave:
>
>> tdat
>          url time total_visits
> 4 www.bar.com 1:00          400
> 1 www.foo.com 1:00          150
> 3 www.foo.com 2:00          525
>
> My question is how can I build a similar data frame but having also  
> rows for
> combinations of (url, time) that were not in dat?
> In this example, the (url, time) without record of visit would be
> (www.bar.com, 2:00):
>
>> ndat
>          url time total_visits
> www.bar.com 1:00          400
> www.bar.com 2:00            0
> www.foo.com 1:00          150
> www.foo.com 2:00          525
>
> I have tried to build a data frame with
>> adat <- data.frame (url = rep(unique(dat$url),  
>> each=length(unique(dat$time))),
> 	time=unique(dat$time), alt_visits=0)
>> ndat <- merge (adat, tdat, by=c("url", "time"),
all = TRUE)
> then replace the NA and remove the alt_visits column.
> But this appears clumsy and quite slow with 10000 rows and 4 columns,
> I would be most grateful for other suggestions,
?expand.grid

Perhaps:

 > fulldat <- expand.grid( unique(dat$url), unique(dat$time) )
 > names(fulldat) <- c("url", "time")
 > merge(fulldat, tdat)
           url time total_visits
1 www.bar.com 1:00          400
2 www.foo.com 1:00          150
3 www.foo.com 2:00          525
 > merge(fulldat, tdat, all=TRUE)  # notice that default is to drop  
missing levels
           url time total_visits
1 www.bar.com 1:00          400
2 www.bar.com 2:00           NA
3 www.foo.com 1:00          150
4 www.foo.com 2:00          525

If they are not ordered, then you might want to sort the elements  
before they get sent to expand.grid for purposes of readability.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT

R help - Jul 2009 - A question built on How to sum one column in a data frame keyed on other columns

[R] A question built on How to sum one column in a data frame keyed on other columns

[R] A question built on How to sum one column in a data frame keyed on other columns