thr3ads.net - R devel - [Rd] by() processing on a dataframe [Sep 2005]

If this information is useful, please help other people find it:
Share via:

Duncan Murdoch

2005-Sep-30 17:22 UTC

[Rd] by() processing on a dataframe

I want to calculate a statistic on a number of subgroups of a dataframe, 
then put the results into a dataframe.  (What SAS PROC MEANS does, I 
think, though it's been years since I used it.)

This is possible using by(), but it seems cumbersome and fragile.  Is 
there a more straightforward way than this?

Here's a simple example showing my current strategy:

 > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, 
c(2,2,2,2)), value = rnorm(8))
 > dataset
   gp1 gp2      value
1   1   1  0.9493232
2   1   1 -0.0474712
3   1   2 -0.6808021
4   1   2  1.9894999
5   2   3  2.0154786
6   2   3  0.4333056
7   2   4 -0.4746228
8   2   4  0.6017522
 >
 > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
+ gp2 = subset$gp2[1], statistic = mean(subset$value))
 >
 > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup)
 >
 > result <- do.call('rbind', bylist)
 > result
    gp1 gp2  statistic
1    1   1 0.45092598
11   1   2 0.65434890
12   2   3 1.22439210
13   2   4 0.06356469

tapply() is inappropriate because I don't have all possible combinations 
of gp1 and gp2 values, only some of them:

 > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
          1         2        3          4
1 0.450926 0.6543489       NA         NA
2       NA        NA 1.224392 0.06356469



In the real case, I only have a very sparse subset of all the 
combinations, and tapply() and by() both die for lack of memory.

Any suggestions on how to do what I want, without using SAS?

Duncan Murdoch

hadley wickham

2005-Sep-30 17:41 UTC

head link

[Rd] by() processing on a dataframe

I'm not entirely sure what you want, but maybe this does the trick?

data.frame.by <- function(data, variables, fun, ...) {
	if (length(variables) == 0 ) {
		df <- data.frame(results = 0)
		df$results <- list(fun(data$value, ...))
		return(df)
	}

	sorted <- sort.df(data, variables)[,c(variables), drop=FALSE]
	duplicates <- duplicated(sorted[,variables, drop=FALSE])
	index <- cumsum(!duplicates)

	results <- by(data, index, fun, ...)

	cols <- sorted[!duplicates,variables, drop=FALSE]
	cols$results <- array(results)
	cols
}


sort.df <- function(data, vars) {
	data[do.call("order", data[,vars, drop=FALSE]), ,drop=FALSE]
}


dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
c(2,2,2,2)), value = rnorm(8))

data.frame.by(dataset, c("gp1", "gp2"), function(data)
mean(data$value))
data.frame.by(dataset, "gp1", function(data) tapply(data$value,
data$gp2, mean))
data.frame.by(dataset, "gp1", function(data) lm(gp2 ~ value, data)) #
doesn't print, but everything is there ok

(note that the results column will be a list if necessary - this may
be a serious abuse of data frames, but I'm not sure and no one replied
when I queried the list)

Hadley

Peter Dalgaard

2005-Sep-30 17:41 UTC

head link

[Rd] by() processing on a dataframe

Duncan Murdoch <murdoch at stats.uwo.ca> writes:
> I want to calculate a statistic on a number of subgroups of a dataframe, 
> then put the results into a dataframe.  (What SAS PROC MEANS does, I 
> think, though it's been years since I used it.)
> 
> This is possible using by(), but it seems cumbersome and fragile.  Is 
> there a more straightforward way than this?
> 
> Here's a simple example showing my current strategy:
> 
>  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, 
> c(2,2,2,2)), value = rnorm(8))
>  > dataset
>    gp1 gp2      value
> 1   1   1  0.9493232
> 2   1   1 -0.0474712
> 3   1   2 -0.6808021
> 4   1   2  1.9894999
> 5   2   3  2.0154786
> 6   2   3  0.4333056
> 7   2   4 -0.4746228
> 8   2   4  0.6017522
>  >
>  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>  >
>  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2),
handleonegroup)
>  >
>  > result <- do.call('rbind', bylist)
>  > result
>     gp1 gp2  statistic
> 1    1   1 0.45092598
> 11   1   2 0.65434890
> 12   2   3 1.22439210
> 13   2   4 0.06356469
> 
> tapply() is inappropriate because I don't have all possible
combinations
> of gp1 and gp2 values, only some of them:
> 
>  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>           1         2        3          4
> 1 0.450926 0.6543489       NA         NA
> 2       NA        NA 1.224392 0.06356469
> 
> 
> 
> In the real case, I only have a very sparse subset of all the 
> combinations, and tapply() and by() both die for lack of memory.
> 
> Any suggestions on how to do what I want, without using SAS?
Have you tried aggregate()?

Alternatively, you migth split on interaction(...., drop=TRUE)

-- 
   O__  ---- Peter Dalgaard             ?ster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907

Gabor Grothendieck

2005-Sep-30 17:45 UTC

head link

[Rd] by() processing on a dataframe

Check out summaryBy in the doBy package at:

   http://genetics.agrsci.dk/~sorenh/misc

e.g.

   summaryBy(value ~ gp1 + gp2, data = dataset)



On 9/30/05, Duncan Murdoch <murdoch at stats.uwo.ca>
wrote:> I want to calculate a statistic on a number of subgroups of a dataframe,
> then put the results into a dataframe.  (What SAS PROC MEANS does, I
> think, though it's been years since I used it.)
>
> This is possible using by(), but it seems cumbersome and fragile.  Is
> there a more straightforward way than this?
>
> Here's a simple example showing my current strategy:
>
>  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4,
> c(2,2,2,2)), value = rnorm(8))
>  > dataset
>   gp1 gp2      value
> 1   1   1  0.9493232
> 2   1   1 -0.0474712
> 3   1   2 -0.6808021
> 4   1   2  1.9894999
> 5   2   3  2.0154786
> 6   2   3  0.4333056
> 7   2   4 -0.4746228
> 8   2   4  0.6017522
>  >
>  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>  >
>  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2),
handleonegroup)
>  >
>  > result <- do.call('rbind', bylist)
>  > result
>    gp1 gp2  statistic
> 1    1   1 0.45092598
> 11   1   2 0.65434890
> 12   2   3 1.22439210
> 13   2   4 0.06356469
>
> tapply() is inappropriate because I don't have all possible
combinations
> of gp1 and gp2 values, only some of them:
>
>  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>          1         2        3          4
> 1 0.450926 0.6543489       NA         NA
> 2       NA        NA 1.224392 0.06356469
>
>
>
> In the real case, I only have a very sparse subset of all the
> combinations, and tapply() and by() both die for lack of memory.
>
> Any suggestions on how to do what I want, without using SAS?
>
> Duncan Murdoch
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Marc Schwartz (via MN)

2005-Sep-30 17:48 UTC

head link

[Rd] by() processing on a dataframe

On Fri, 2005-09-30 at 13:22 -0400, Duncan Murdoch wrote:> I want to calculate a statistic on a number of subgroups of a dataframe, 
> then put the results into a dataframe.  (What SAS PROC MEANS does, I 
> think, though it's been years since I used it.)
> 
> This is possible using by(), but it seems cumbersome and fragile.  Is 
> there a more straightforward way than this?
> 
> Here's a simple example showing my current strategy:
> 
>  > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, 
> c(2,2,2,2)), value = rnorm(8))
>  > dataset
>    gp1 gp2      value
> 1   1   1  0.9493232
> 2   1   1 -0.0474712
> 3   1   2 -0.6808021
> 4   1   2  1.9894999
> 5   2   3  2.0154786
> 6   2   3  0.4333056
> 7   2   4 -0.4746228
> 8   2   4  0.6017522
>  >
>  > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1],
> + gp2 = subset$gp2[1], statistic = mean(subset$value))
>  >
>  > bylist <- by(dataset, list(dataset$gp1, dataset$gp2),
handleonegroup)
>  >
>  > result <- do.call('rbind', bylist)
>  > result
>     gp1 gp2  statistic
> 1    1   1 0.45092598
> 11   1   2 0.65434890
> 12   2   3 1.22439210
> 13   2   4 0.06356469
> 
> tapply() is inappropriate because I don't have all possible
combinations
> of gp1 and gp2 values, only some of them:
> 
>  > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean)
>           1         2        3          4
> 1 0.450926 0.6543489       NA         NA
> 2       NA        NA 1.224392 0.06356469
> 
> 
> 
> In the real case, I only have a very sparse subset of all the 
> combinations, and tapply() and by() both die for lack of memory.
> 
> Any suggestions on how to do what I want, without using SAS?
> 
> Duncan Murdoch
Duncan,

Does this do what you want?
> set.seed(1)
 > df <- data.frame(gp1 = rep(1:2, c(4,4)),                    gp2 = rep(1:4, c(2,2,2,2)), 
                   value = rnorm(8))
 > df  gp1 gp2      value
1   1   1 -0.6264538
2   1   1  0.1836433
3   1   2 -0.8356286
4   1   2  1.5952808
5   2   3  0.3295078
6   2   3 -0.8204684
7   2   4  0.4874291
8   2   4  0.7383247
> means <- aggregate(df$value, list(gp1 = df$gp1, gp2 = df$gp2), mean)
 > means  gp1 gp2          x
1   1   1 -0.2214052
2   1   2  0.3798261
3   2   3 -0.2454803
4   2   4  0.6128769

> merge(df, means, by = c("gp1", "gp2"))  gp1 gp2      value          x
1   1   1 -0.6264538 -0.2214052
2   1   1  0.1836433 -0.2214052
3   1   2 -0.8356286  0.3798261
4   1   2  1.5952808  0.3798261
5   2   3  0.3295078 -0.2454803
6   2   3 -0.8204684 -0.2454803
7   2   4  0.4874291  0.6128769
8   2   4  0.7383247  0.6128769


HTH,

Marc Schwartz

Seemingly Similar Threads

Search for more seemingly similar threads

R devel - Sep 2005 - by() processing on a dataframe

[Rd] by() processing on a dataframe

[Rd] by() processing on a dataframe

[Rd] by() processing on a dataframe

[Rd] by() processing on a dataframe

[Rd] by() processing on a dataframe

Seemingly Similar Threads