Consider the following:
> set.seed(42)
> ff <- factor(sample(c(1,3,5),42,TRUE),levels=1:5)
> x <- runif(42)
> tapply(x,ff,sum)
       1        2        3        4        5
3.675436       NA 7.519675       NA 9.094210
I got bitten by those NAs in the result of tapply(). Effectively
one is summing over the empty set, and consequently (according to what
I learned as a child) I thought that the result would be 0.
And that's what one gets if one does the sum "by hand":
> sum(x[ff==1])
[1] 3.675436
> sum(x[ff==2])
[1] 0
> sum(x[ff==4])
[1] 0
On reflection I realized that since tapply() needs to work with arbitrary
functions, and since there is no way to determine what an arbitrary function
will do to the empty set, this is the Way It's Got to Be.
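To make the point about arbitrary functions concrete, here is a minimal sketch of what a few base R functions return for an empty numeric vector. The variable name `empty` is only for illustration; the point is that no single fill-in value would be right for every function.

```r
# An empty numeric vector, standing in for an empty cell.
empty <- numeric(0)

s  <- sum(empty)                     # 0, the empty sum
p  <- prod(empty)                    # 1, the empty product
m  <- mean(empty)                    # NaN (0 divided by 0)
mx <- suppressWarnings(max(empty))   # -Inf; max() warns on empty input

print(c(sum = s, prod = p, mean = m, max = mx))
```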
But it's a trap for young players, and so I thought I'd post my experience
as a warning to others to be careful about this.
To work around the problem one ***could*** do something like
> result <- tapply(x,ff,sum)
> result[is.na(result)] <- 0
but that's very infra dig in my book. I figured out something I like
much better:
> sapply(tapply(x,ff,I,simplify=FALSE),sum)
The simplify=FALSE is needed to cover the case where there is at most one
entry of x for each level of ff. In that case, unless simplify=FALSE is
specified, tapply will return an array with NAs in it rather than a list
with NULL entries corresponding to the empty cells.
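A minimal sketch of that corner case, assuming a small made-up data set (`x2` and `ff2` are hypothetical names) in which each non-empty level has exactly one entry:

```r
# One value for each of levels 1, 3 and 5; levels 2 and 4 are empty.
x2  <- c(10, 20, 30)
ff2 <- factor(c(1, 3, 5), levels = 1:5)

# Default simplify=TRUE: every cell result has length one, so tapply()
# returns an array, and the empty cells are filled with NA.
bad <- sapply(tapply(x2, ff2, I), sum)

# simplify=FALSE: tapply() returns a list with NULL for the empty cells,
# and sum(NULL) is 0.
good <- sapply(tapply(x2, ff2, I, simplify = FALSE), sum)

print(bad)
print(good)
```

With the default, the NAs from the empty cells are passed on to sum(); with simplify=FALSE the empty cells reach sum() as NULL and come out as 0.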
cheers,
Rolf Turner
On Tue, 1 Dec 2009 14:10:17 +1300 Rolf Turner <r.turner at auckland.ac.nz> wrote:
> Consider the following:
>
> > set.seed(42)
> > ff <- factor(sample(c(1,3,5),42,TRUE),levels=1:5)
> > x <- runif(42)
> > tapply(x,ff,sum)
>        1        2        3        4        5
> 3.675436       NA 7.519675       NA 9.094210
>
> I got bitten by those NAs in the result of tapply(). Effectively
> one is summing over the empty set, and consequently (according to what
> I learned as a child) I thought that the result would be 0.

Note that this *is* documented on the help page for 'tapply', actually,
in its description:

    Apply a function to each cell of a ragged array, that is to each
    (non-empty) group of values given by a unique combination of the
    levels of certain factors.

One might expect that, basically (ignoring some details), 'tapply' does:

    sapply(split(x, ff), sum)

which actually *does* give you 0 for levels 2 and 4. The reason you get
NAs instead is that (again ignoring some details) 'tapply' does:

    sapply(split(x, as.numeric(ff)), sum)

which only looks at the actual values of 'ff', not its levels.

Note that the value zero is not a special case. For instance,

    sapply(split(x, ff), prod)

gives the 'empty product', i.e., 1.

Exercise for the reader: Note that

    sapply(split(x, ff, drop=TRUE), sum)

gives you the values of (just) the non-empty levels. Now, why does

    sapply(split(x, ff), sum, drop=TRUE)

give the wrong value (1) for the empty levels, while

    sapply(split(x, ff), sum, drop=FALSE)

gives the correct value? (The answer should be fairly obvious,
but it's an easy mistake to make.)

-- 
Karl Ove Hufthammer
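
For anyone who wants to try the comparisons from the reply without depending on the random seed, here is a small sketch on made-up data (`x3` and `ff3` are hypothetical names). It leaves the exercise itself untouched.

```r
# Levels 2 and 4 of ff3 have no observations.
x3  <- c(1, 2, 3, 4)
ff3 <- factor(c(1, 1, 3, 5), levels = 1:5)

via_tapply <- tapply(x3, ff3, sum)                      # NA for levels 2 and 4
via_split  <- sapply(split(x3, ff3), sum)               # 0 for levels 2 and 4
via_prod   <- sapply(split(x3, ff3), prod)              # 1 for levels 2 and 4
non_empty  <- sapply(split(x3, ff3, drop = TRUE), sum)  # levels 1, 3 and 5 only

print(via_tapply)
print(via_split)
print(via_prod)
print(non_empty)
```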