Trying to find a quick/slick/easily interpretable way to
collapse a data set.
Suppose I have a data set that looks like this:
h <- structure(list(INDEX = structure(1:6, .Label = c("1",
"2", "3",
"4", "5", "6"), class = "factor"), TICKS
= c(0, 0, 0, 0, 0, 3
), BROOD = structure(c(1L, 1L, 2L, 3L, 3L, 3L), .Label = c("501",
"502", "503"), class = "factor"), HEIGHT = c(465,
465, 472, 475,
475, 475), YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label =
c("95",
"96", "97"), class = "factor"), LOCATION =
structure(c(1L, 1L,
2L, 3L, 3L, 3L), .Label = c("32", "36", "37"),
class = "factor")), .Names c("INDEX",
"TICKS", "BROOD", "HEIGHT", "YEAR",
"LOCATION"), row.names = c(NA,
6L), class = "data.frame")
i.e., > h
INDEX TICKS BROOD HEIGHT YEAR LOCATION
1 1 0 501 465 95 32
2 2 0 501 465 95 32
3 3 0 502 472 95 36
4 4 0 503 475 95 37
5 5 0 503 475 95 37
6 6 3 503 475 95 37
I want a data set that looks like this:
BROOD TICKS.mean HEIGHT YEAR LOCATION
501 0 465 95 32
502 0 472 95 36
503 1 475 95 37
(for example). I.e., I want to collapse it to a dataset by brood,
taking the mean of TICKS and reducing each of
the other variables (would be nice to allow multiple summary
statistics, e.g. TICKS.mean and TICKS.sd ...)
In some ways, this is the opposite of a database join/merge
operation -- I want to collapse the data frame back down.
If I had the "unmerged" (i.e., the brood table) handy I could
use it.
I know I can construct this table a bit at a time,
using tapply() or by() or aggregate() to get the means.
Here's a solution that takes the first element of each factor
and the mean of each numeric variable. I can imagine there
are more general/flexible solutions. (One might want to
specify more than one summary function, or specify that
factors that vary within group should be dropped.)
vtype = sapply(h,class) ## variable types [numeric or factor]
vtypes = unique(vtype) ## possible types
v2 = lapply(vtypes,function(z) which(vtype==z)) ## which are which?
cfuns = list(factor=function(z)z[1],numeric=mean)## functions to apply
m = mapply(function(w,f) { aggregate(h[w],list(h$BROOD),f) },
v2,cfuns,SIMPLIFY=FALSE)
data.frame(m[[1]],m[[2]][-1])
My question is whether this is re-inventing the wheel. Is there
some function or package that performs this task?
cheers
Ben Bolker
--
View this message in context:
http://www.nabble.com/collapsing-a-data-frame-tf4614195.html#a13177053
Sent from the R help mailing list archive at Nabble.com.
Here is one way. Not sure what you wanted done with some of the other variables, so I just chose the first one; you could do max/min:> z <- by(h, h$BROOD, function(x){+ # take first value of elements you don't want to change + data.frame(BROOD=x$BROOD[1], TICKS.mean=mean(x$TICKS), TICKS.sd=sd(x$TICKS), + HEIGHT=x$HEIGHT[1], YEAR=x$YEAR[1], LOCATION=x$LOCATION[1]) + })> do.call('rbind', z)BROOD TICKS.mean TICKS.sd HEIGHT YEAR LOCATION 501 501 0 0.000000 465 95 32 502 502 0 NA 472 95 36 503 503 1 1.732051 475 95 37 On 10/12/07, Ben Bolker <bolker at ufl.edu> wrote:> > Trying to find a quick/slick/easily interpretable way to > collapse a data set. > > Suppose I have a data set that looks like this: > > h <- structure(list(INDEX = structure(1:6, .Label = c("1", "2", "3", > "4", "5", "6"), class = "factor"), TICKS = c(0, 0, 0, 0, 0, 3 > ), BROOD = structure(c(1L, 1L, 2L, 3L, 3L, 3L), .Label = c("501", > "502", "503"), class = "factor"), HEIGHT = c(465, 465, 472, 475, > 475, 475), YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("95", > "96", "97"), class = "factor"), LOCATION = structure(c(1L, 1L, > 2L, 3L, 3L, 3L), .Label = c("32", "36", "37"), class = "factor")), .Names > c("INDEX", > "TICKS", "BROOD", "HEIGHT", "YEAR", "LOCATION"), row.names = c(NA, > 6L), class = "data.frame") > > i.e., > > h > INDEX TICKS BROOD HEIGHT YEAR LOCATION > 1 1 0 501 465 95 32 > 2 2 0 501 465 95 32 > 3 3 0 502 472 95 36 > 4 4 0 503 475 95 37 > 5 5 0 503 475 95 37 > 6 6 3 503 475 95 37 > > I want a data set that looks like this: > BROOD TICKS.mean HEIGHT YEAR LOCATION > 501 0 465 95 32 > 502 0 472 95 36 > 503 1 475 95 37 > > (for example). I.e., I want to collapse it to a dataset by brood, > taking the mean of TICKS and reducing each of > the other variables (would be nice to allow multiple summary > statistics, e.g. TICKS.mean and TICKS.sd ...) > In some ways, this is the opposite of a database join/merge > operation -- I want to collapse the data frame back down. > If I had the "unmerged" (i.e., the brood table) handy I could > use it. > > I know I can construct this table a bit at a time, > using tapply() or by() or aggregate() to get the means. > > Here's a solution that takes the first element of each factor > and the mean of each numeric variable. I can imagine there > are more general/flexible solutions. (One might want to > specify more than one summary function, or specify that > factors that vary within group should be dropped.) > > vtype = sapply(h,class) ## variable types [numeric or factor] > vtypes = unique(vtype) ## possible types > v2 = lapply(vtypes,function(z) which(vtype==z)) ## which are which? > cfuns = list(factor=function(z)z[1],numeric=mean)## functions to apply > m = mapply(function(w,f) { aggregate(h[w],list(h$BROOD),f) }, > v2,cfuns,SIMPLIFY=FALSE) > data.frame(m[[1]],m[[2]][-1]) > > My question is whether this is re-inventing the wheel. Is there > some function or package that performs this task? > > cheers > Ben Bolker > > -- > View this message in context: http://www.nabble.com/collapsing-a-data-frame-tf4614195.html#a13177053 > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve?
On 10/12/07, Ben Bolker <bolker at ufl.edu> wrote:> > Trying to find a quick/slick/easily interpretable way to > collapse a data set. > > Suppose I have a data set that looks like this: > > h <- structure(list(INDEX = structure(1:6, .Label = c("1", "2", "3", > "4", "5", "6"), class = "factor"), TICKS = c(0, 0, 0, 0, 0, 3 > ), BROOD = structure(c(1L, 1L, 2L, 3L, 3L, 3L), .Label = c("501", > "502", "503"), class = "factor"), HEIGHT = c(465, 465, 472, 475, > 475, 475), YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("95", > "96", "97"), class = "factor"), LOCATION = structure(c(1L, 1L, > 2L, 3L, 3L, 3L), .Label = c("32", "36", "37"), class = "factor")), .Names > c("INDEX", > "TICKS", "BROOD", "HEIGHT", "YEAR", "LOCATION"), row.names = c(NA, > 6L), class = "data.frame") > > i.e., > > h > INDEX TICKS BROOD HEIGHT YEAR LOCATION > 1 1 0 501 465 95 32 > 2 2 0 501 465 95 32 > 3 3 0 502 472 95 36 > 4 4 0 503 475 95 37 > 5 5 0 503 475 95 37 > 6 6 3 503 475 95 37 > > I want a data set that looks like this: > BROOD TICKS.mean HEIGHT YEAR LOCATION > 501 0 465 95 32 > 502 0 472 95 36 > 503 1 475 95 37 > > (for example). I.e., I want to collapse it to a dataset by brood, > taking the mean of TICKS and reducing each of > the other variables (would be nice to allow multiple summary > statistics, e.g. TICKS.mean and TICKS.sd ...) > In some ways, this is the opposite of a database join/merge > operation -- I want to collapse the data frame back down. > If I had the "unmerged" (i.e., the brood table) handy I could > use it. > > I know I can construct this table a bit at a time, > using tapply() or by() or aggregate() to get the means. > > Here's a solution that takes the first element of each factor > and the mean of each numeric variable. I can imagine there > are more general/flexible solutions. (One might want to > specify more than one summary function, or specify that > factors that vary within group should be dropped.) > > vtype = sapply(h,class) ## variable types [numeric or factor] > vtypes = unique(vtype) ## possible types > v2 = lapply(vtypes,function(z) which(vtype==z)) ## which are which? > cfuns = list(factor=function(z)z[1],numeric=mean)## functions to apply > m = mapply(function(w,f) { aggregate(h[w],list(h$BROOD),f) }, > v2,cfuns,SIMPLIFY=FALSE) > data.frame(m[[1]],m[[2]][-1]) > > My question is whether this is re-inventing the wheel. Is there > some function or package that performs this task?Maybe the reshape package? http://had.co.nz/reshape hm <- melt(h, m = "TICKS") cast(hm, BROOD + HEIGHT + YEAR + LOCATION ~ ., mean) cast(hm, BROOD + HEIGHT + LOCATION ~ YEAR, mean) cast(hm, BROOD ~ HEIGHT ~ YEAR, mean) You should be able to create just about any data structure you need, and if you can't let me know. Hadley -- http://had.co.nz/
As far as I know you will have to do some work yourself but
if we leverage some of it off the summaryBy function in doBy
its basically just a lapply over the columns.
The first arg to collapse is a formula whose rhs contains the
by variables and whose lhs is ignored. data is our data frame,
fac.fun is a function to be applied to each factor column and should
return a scalar whlie cont.fun is a function to be applied to each
continuous column and should return a named vector.
The first two lines of collapse get the by.names and remaining names
and then we lapply over the remaining names invoking the appropriate
function making use of summaryBy to reduce our effort on the
continuous ones.
Finally we cbind the result together and remove duplicate columns
that the procedure produced.
library(summaryBy)
collapse <- function(fo, data,
fac.fun = function(x) if (length(unique(x)) == 1) x[1] else NA,
cont.fun = function(x) c(mean = mean(x), sd = sd(x))) {
by.names <- tail(all.vars(fo), 1)
other.names <- setdiff(names(data), by.names)
f <- function(nm) {
if (is.factor(data[[nm]])) aggregate(data[nm], h[by.names], fac.fun)
else summaryBy(. ~ ., data = data[c(nm, by.names)], FUN = cont.fun)
}
y <- do.call(cbind, lapply(other.names, f))
y[!duplicated(names(y))]
}
collapse(~ BROOD, h)
Running it using your h gives:
> collapse(~ BROOD, h)
BROOD INDEX TICKS.mean TICKS.sd HEIGHT.mean HEIGHT.sd YEAR LOCATION
1 501 1 0 0.000000 465 0 95 32
2 502 3 0 NA 472 NA 95 36
3 503 4 1 1.732051 475 0 95 37
On 10/12/07, Ben Bolker <bolker at ufl.edu> wrote:>
> Trying to find a quick/slick/easily interpretable way to
> collapse a data set.
>
> Suppose I have a data set that looks like this:
>
> h <- structure(list(INDEX = structure(1:6, .Label = c("1",
"2", "3",
> "4", "5", "6"), class = "factor"),
TICKS = c(0, 0, 0, 0, 0, 3
> ), BROOD = structure(c(1L, 1L, 2L, 3L, 3L, 3L), .Label = c("501",
> "502", "503"), class = "factor"), HEIGHT =
c(465, 465, 472, 475,
> 475, 475), YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label =
c("95",
> "96", "97"), class = "factor"), LOCATION =
structure(c(1L, 1L,
> 2L, 3L, 3L, 3L), .Label = c("32", "36",
"37"), class = "factor")), .Names > c("INDEX",
> "TICKS", "BROOD", "HEIGHT", "YEAR",
"LOCATION"), row.names = c(NA,
> 6L), class = "data.frame")
>
> i.e.,
> > h
> INDEX TICKS BROOD HEIGHT YEAR LOCATION
> 1 1 0 501 465 95 32
> 2 2 0 501 465 95 32
> 3 3 0 502 472 95 36
> 4 4 0 503 475 95 37
> 5 5 0 503 475 95 37
> 6 6 3 503 475 95 37
>
> I want a data set that looks like this:
> BROOD TICKS.mean HEIGHT YEAR LOCATION
> 501 0 465 95 32
> 502 0 472 95 36
> 503 1 475 95 37
>
> (for example). I.e., I want to collapse it to a dataset by brood,
> taking the mean of TICKS and reducing each of
> the other variables (would be nice to allow multiple summary
> statistics, e.g. TICKS.mean and TICKS.sd ...)
> In some ways, this is the opposite of a database join/merge
> operation -- I want to collapse the data frame back down.
> If I had the "unmerged" (i.e., the brood table) handy I could
> use it.
>
> I know I can construct this table a bit at a time,
> using tapply() or by() or aggregate() to get the means.
>
> Here's a solution that takes the first element of each factor
> and the mean of each numeric variable. I can imagine there
> are more general/flexible solutions. (One might want to
> specify more than one summary function, or specify that
> factors that vary within group should be dropped.)
>
> vtype = sapply(h,class) ## variable types [numeric or factor]
> vtypes = unique(vtype) ## possible types
> v2 = lapply(vtypes,function(z) which(vtype==z)) ## which are which?
> cfuns = list(factor=function(z)z[1],numeric=mean)## functions to apply
> m = mapply(function(w,f) { aggregate(h[w],list(h$BROOD),f) },
> v2,cfuns,SIMPLIFY=FALSE)
> data.frame(m[[1]],m[[2]][-1])
>
> My question is whether this is re-inventing the wheel. Is there
> some function or package that performs this task?
>
> cheers
> Ben Bolker
>
> --
> View this message in context:
http://www.nabble.com/collapsing-a-data-frame-tf4614195.html#a13177053
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>