Hi:
For this particular task, the aggregate() function and the doBy package
provide nicely formatted output, but you may have to do some renaming.
Let's try a more expansive toy example with which one can do a bit more.
df <- data.frame(grp1 = rep(c('x', 'y'), each = 40),
                 grp2 = rep(rep(c('x', 'y'), each = 20), 2),
                 grp3 = rep(rep(c('x', 'y'), each = 10), 4),
                 a = 1:80, b = 81:160, d = 161:240)
library(doBy)
summaryBy(a + b + d ~ grp1 + grp2 + grp3, data = df, FUN = range)
  grp1 grp2 grp3 a.FUN1 a.FUN2 b.FUN1 b.FUN2 d.FUN1 d.FUN2
1    x    x    x      1     10     81     90    161    170
2    x    x    y     11     20     91    100    171    180
3    x    y    x     21     30    101    110    181    190
4    x    y    y     31     40    111    120    191    200
5    y    x    x     41     50    121    130    201    210
6    y    x    y     51     60    131    140    211    220
7    y    y    x     61     70    141    150    221    230
8    y    y    y     71     80    151    160    231    240
aggregate(cbind(a, b, d) ~ grp1 + grp2 + grp3, data = df, FUN = range)
  grp1 grp2 grp3 a.1 a.2 b.1 b.2 d.1 d.2
1    x    x    x   1  10  81  90 161 170
2    y    x    x  41  50 121 130 201 210
3    x    y    x  21  30 101 110 181 190
4    y    y    x  61  70 141 150 221 230
5    x    x    y  11  20  91 100 171 180
6    y    x    y  51  60 131 140 211 220
7    x    y    y  31  40 111 120 191 200
8    y    y    y  71  80 151 160 231 240
Each pair of columns associated with variables a, b and d corresponds to the
min and max of the values of that variable in each group. (range(x)
returns a vector composed of the min and max of x, in that order.)
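On the renaming point: as far as I can tell, aggregate() stores each pair of summaries as a two-column matrix inside the data frame, so the result really has six columns rather than nine. Here is a minimal sketch of one way to flatten and rename it; the object names res and flat and the min/max suffixes are just my choices for illustration.

```r
# Toy data from this message
df <- data.frame(grp1 = rep(c('x', 'y'), each = 40),
                 grp2 = rep(rep(c('x', 'y'), each = 20), 2),
                 grp3 = rep(rep(c('x', 'y'), each = 10), 4),
                 a = 1:80, b = 81:160, d = 161:240)

res <- aggregate(cbind(a, b, d) ~ grp1 + grp2 + grp3, data = df, FUN = range)

# a, b and d are each stored as a two-column matrix
dim(res)

# Flatten the matrix columns into ordinary columns
flat <- do.call(data.frame, res)

# Replace the .1/.2 suffixes with more descriptive ones
names(flat)[4:9] <- paste(rep(c('a', 'b', 'd'), each = 2),
                          c('min', 'max'), sep = '.')
names(flat)
```

After this, flat is an ordinary data frame with one column per summary.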
Let's broaden our goals a bit. Suppose we want multiple summaries for
multiple variables by multiple groups; for example, the min, max, mean,
standard deviation and CV (coefficient of variation, sd/mean) for each group.
Here is a simple function that takes a numeric vector x as input and returns
a (named) vector with the above summaries.
f <- function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x),
                   cv = sd(x)/mean(x))
# This function can be used directly in summaryBy() or aggregate():
summaryBy(a + b + d ~ grp1 + grp2 + grp3, data = df, FUN = f)
aggregate(cbind(a, b, d) ~ grp1 + grp2 + grp3, data = df, FUN = f)
# [Output omitted for the sake of brevity. The difference is in the variable
# names.]
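As a quick sanity check, here is a sketch of what f returns on a tiny vector and how its names carry over into aggregate(). It assumes the CV is taken as sd/mean, as in the plyr and data.table functions further down; the object names s and res are only for illustration.

```r
# Toy data from this message
df <- data.frame(grp1 = rep(c('x', 'y'), each = 40),
                 grp2 = rep(rep(c('x', 'y'), each = 20), 2),
                 grp3 = rep(rep(c('x', 'y'), each = 10), 4),
                 a = 1:80, b = 81:160, d = 161:240)

# Summary function; CV assumed to be sd/mean
f <- function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x),
                   cv = sd(x)/mean(x))

# On c(2, 4, 6): min 2, max 6, mean 4, sd 2, cv 0.5
s <- f(c(2, 4, 6))
s

# The names of the returned vector become the column names of each
# summary matrix in the aggregate() result
res <- aggregate(cbind(a, b, d) ~ grp1 + grp2 + grp3, data = df, FUN = f)
colnames(res$a)
```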
To get this to work in plyr or data.table, two packages with numerous
facilities for manipulating and summarizing data, we have to modify the
function so that it returns, in a single row, what the two calls above
produce. Because ddply() operates on data frames, whereas data tables are a
slightly different kind of object, the function has to be written slightly
differently for each case (it returns a named vector for ddply() and a list
for data.table). If you are new to packages: these have to be installed
before you can load and use them:
# Uncomment the next line if you need to install these packages
# install.packages(c('data.table', 'plyr'))
# ddply() in package plyr
library(plyr)
g <- function(df) {
  c(a.min = min(df$a), a.max = max(df$a), a.mean = mean(df$a), a.sd = sd(df$a),
    a.cv = sd(df$a)/mean(df$a),
    b.min = min(df$b), b.max = max(df$b), b.mean = mean(df$b), b.sd = sd(df$b),
    b.cv = sd(df$b)/mean(df$b),
    d.min = min(df$d), d.max = max(df$d), d.mean = mean(df$d), d.sd = sd(df$d),
    d.cv = sd(df$d)/mean(df$d))
}
ddply(df, .(grp1, grp2, grp3), g)
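If typing out every name by hand gets tedious, something along these lines might work instead. It is only a sketch: g2 and its vars argument are hypothetical names of mine, f is the single-vector summary function (with CV taken as sd/mean), and the ddply() call is guarded so it only runs if plyr is installed.

```r
# Toy data from this message
df <- data.frame(grp1 = rep(c('x', 'y'), each = 40),
                 grp2 = rep(rep(c('x', 'y'), each = 20), 2),
                 grp3 = rep(rep(c('x', 'y'), each = 10), 4),
                 a = 1:80, b = 81:160, d = 161:240)

f <- function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x),
                   cv = sd(x)/mean(x))

# Apply f to each chosen column of one group's data frame;
# unlist() pastes the names together as a.min, a.max, ...
g2 <- function(piece, vars = c('a', 'b', 'd')) unlist(lapply(piece[vars], f))

# Check on the first group (rows 1 to 10) using base R only
one <- g2(df[1:10, ])
names(one)

if (requireNamespace('plyr', quietly = TRUE)) {
  out <- plyr::ddply(df, c('grp1', 'grp2', 'grp3'), g2)
  print(out)
}
```

The advantage is that adding another variable or another summary only means changing vars or f.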
# package data.table
library(data.table)
dt <- data.table(df, key = c('grp1', 'grp2', 'grp3'))
h <- function(df) {
list(a.min = min(df$a), a.max = max(df$a), a.mean = mean(df$a),
a.sd = sd(df$a), a.cv = sd(df$a)/mean(df$a),
b.min = min(df$b), b.max = max(df$b), b.mean = mean(df$b),
b.sd = sd(df$b), b.cv = sd(df$b)/mean(df$b),
d.min = min(df$d), d.max = max(df$d), d.mean = mean(df$d),
d.sd = sd(df$d), d.cv = sd(df$d)/mean(df$d))
}
dt[, h(.SD), by = list(grp1, grp2, grp3)]
# Note: the .SD as the argument of h() in the data table is a special
# 'sub-data' construct; see the package's vignette and FAQ for further details
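The same shortcut can be sketched for data.table, this time returning a list and limiting .SD to the columns of interest with .SDcols. Again h2 is a hypothetical helper of mine and f is the single-vector summary function (CV taken as sd/mean); the data.table part only runs if the package is installed.

```r
# Toy data from this message
df <- data.frame(grp1 = rep(c('x', 'y'), each = 40),
                 grp2 = rep(rep(c('x', 'y'), each = 20), 2),
                 grp3 = rep(rep(c('x', 'y'), each = 10), 4),
                 a = 1:80, b = 81:160, d = 161:240)

f <- function(x) c(min = min(x), max = max(x), mean = mean(x), sd = sd(x),
                   cv = sd(x)/mean(x))

# Apply f to every column of the subset and return a named list,
# which is what data.table wants for a multi-column result
h2 <- function(sub) as.list(unlist(lapply(sub, f)))

# Check on the first group using base R only
one <- h2(df[1:10, c('a', 'b', 'd')])
names(one)

if (requireNamespace('data.table', quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(df)
  out <- dt[, h2(.SD), by = list(grp1, grp2, grp3),
            .SDcols = c('a', 'b', 'd')]
  print(out)
}
```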
R has a rich array of functions and a number of packages to summarize data.
(I might also mention that sqldf is another package using SQL syntax on R
data frames that would have worked well here, too.) Hopefully this gives you
some idea of what can be done.
All of the functions shown above return data frames by default (the
data.table call returns a data.table, which is also a data frame).
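For completeness, the sqldf route mentioned above might look roughly like this for the min and max. Treat it as a sketch: I use underscores in the output names because dots would need quoting in SQL, and the call only runs if sqldf is installed.

```r
# Toy data from this message
df <- data.frame(grp1 = rep(c('x', 'y'), each = 40),
                 grp2 = rep(rep(c('x', 'y'), each = 20), 2),
                 grp3 = rep(rep(c('x', 'y'), each = 10), 4),
                 a = 1:80, b = 81:160, d = 161:240)

if (requireNamespace('sqldf', quietly = TRUE)) {
  library(sqldf)
  # One output row per combination of the three grouping variables
  out <- sqldf('select grp1, grp2, grp3,
                       min(a) as a_min, max(a) as a_max,
                       min(b) as b_min, max(b) as b_max,
                       min(d) as d_min, max(d) as d_max
                from df
                group by grp1, grp2, grp3')
  print(out)
}
```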
I might also suggest, given the nature of your questions, that you take the
time to read the manual "An Introduction to R", which explains many of the
basic concepts and features in R.
HTH,
Dennis
On Mon, Feb 7, 2011 at 8:29 PM, Al Roark <hrbuilder@hotmail.com> wrote:
>
> I'd like to summarize several variables in a data frame, for multiple
> groups, and store the results in a data.frame. To do so, I'm using by(). For
> example:
>
> df<-data.frame(a=1:10,b=11:20,c=21:30,grp1=c("x","y"),grp2=c("x","y"),grp3=c("x","y"))
> dfsum<-by(df[c("a","b","c")], df[c("grp1","grp2","grp3")], range)
>
> The result has a class of "by" and a mode of "list". I'm new to R and can't
> find any documentation on this class, and don't see methods for it
> associated with the as.data.frame. How should I go about coercing this to a
> data frame? Is there a comprehensive source that I might be missing,
> which can tell me such things?
>
> Cheers