thr3ads.net - R help - [R] Remark on tapply(). [Dec 2009]

Rolf Turner

2009-Dec-01 01:10 UTC

[R] Remark on tapply().

Consider the following:

 > set.seed(42)
 > ff <- factor(sample(c(1,3,5),42,TRUE),levels=1:5)
 > x <- runif(42)
 > tapply(x,ff,sum)
        1        2        3        4        5
3.675436       NA 7.519675       NA 9.094210

I got bitten by those NAs in the result of tapply().  Effectively
one is summing over the empty set, and consequently (according to what
I learned as a child) I thought that the result would be 0.

And that's what one gets if one does the sum "by hand":

 > sum(x[ff==1])
[1] 3.675436
 > sum(x[ff==2])
[1] 0
 > sum(x[ff==4])
[1] 0

On reflection I realized that since tapply() needs to work with arbitrary
functions, and since there is no way to determine what an arbitrary
function will do to the empty set, this is the Way It's Got to Be.
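
To illustrate the point, here is a small sketch (not part of the original
session) of what a few common functions return for an empty numeric vector.
The choice of functions and the variable names are purely illustrative:

```r
empty <- numeric(0)

empty_sum  <- sum(empty)   # 0, the additive identity
empty_prod <- prod(empty)  # 1, the multiplicative identity
empty_mean <- mean(empty)  # NaN, since it amounts to 0/0

# Each function has its own answer, so no single fill-in value
# would be right for every FUN that tapply() might be given.
print(c(sum = empty_sum, prod = empty_prod, mean = empty_mean))
```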

But it's a trap for young players, and so I thought I'd post my experience
as a warning to others to be careful about this.

To work around the problem one ***could*** do something like

 > result <- tapply(x,ff,sum)
 > result[is.na(result)] <- 0

but that's very infra dig in my book.  I figured out something I like
much better:

	sapply(tapply(x,ff,I,simplify=FALSE),sum)

That simplify=FALSE is needed just in case there is at most one entry of
x for each level of ff.  In that case, unless simplify=FALSE is specified,
tapply will return an array with NAs in it, rather than a list with NULL
entries corresponding to empty cells.
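
The two steps of that workaround can be spelled out as follows.  This is only
a sketch: the factor below is a deterministic stand-in for the sampled one
(so that only levels 1, 3 and 5 occur), and the names cells, totals and
totals_default are made up for illustration.  The last line assumes
R 3.4.0 or later, where tapply() has a 'default' argument; it is not
available in the R version used in this thread.

```r
# Deterministic stand-in for the sampled factor: levels 2 and 4 are empty.
ff <- factor(rep(c(1, 3, 5), length.out = 42), levels = 1:5)
x  <- seq_len(42) / 42

# Step 1: collect the raw values per cell; empty cells come back as NULL.
cells <- tapply(x, ff, I, simplify = FALSE)

# Step 2: sum each cell; sum(NULL) is 0, so empty cells give 0, not NA.
totals <- sapply(cells, sum)

# Assumption: R >= 3.4.0, where the fill-in value can be set directly.
totals_default <- tapply(x, ff, sum, default = 0)
```

Both totals and totals_default should then have 0 in positions 2 and 4.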

	cheers,

		Rolf Turner

Karl Ove Hufthammer

2009-Dec-01 07:32 UTC


On Tue, 1 Dec 2009 14:10:17 +1300 Rolf Turner <r.turner at auckland.ac.nz>
wrote:
> Consider the following:
> 
>  > set.seed(42)
>  > ff <- factor(sample(c(1,3,5),42,TRUE),levels=1:5)
>  > x <- runif(42)
>  > tapply(x,ff,sum)
>         1        2        3        4        5
> 3.675436       NA 7.519675       NA 9.094210
> 
> I got bitten by those NAs in the result of tapply().  Effectively
> one is summing over the empty set, and consequently (according to what
> I learned as a child) I thought that the result would be 0.

Note that this *is* documented on the help page for 'tapply', actually,
in its description:

  Apply a function to each cell of a ragged array, that is to each
  (non-empty) group of values given by a unique combination of the 
  levels of certain factors. 

One might expect that (ignoring some details) 'tapply' does:

  sapply(split(x, ff), sum)

But that actually *does* give you 0 for levels 2 and 4. The reason for the
NAs is that (again ignoring some details) 'tapply' really does:

  sapply(split(x, as.numeric(ff)), sum)

which only looks at the actual values of 'ff', not its levels.
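
The difference between the two split() calls can be seen from the group
names alone.  A minimal sketch, using a deterministic stand-in for the
sampled factor (the names by_levels and by_codes are just for illustration):

```r
# Deterministic stand-in for the sampled factor: levels 2 and 4 are empty.
ff <- factor(rep(c(1, 3, 5), length.out = 42), levels = 1:5)
x  <- seq_len(42) / 42

by_levels <- split(x, ff)              # one group per level, used or not
by_codes  <- split(x, as.numeric(ff))  # one group per value that occurs

names(by_levels)  # "1" "2" "3" "4" "5"
names(by_codes)   # "1" "3" "5"

sapply(by_levels, sum)  # the empty groups 2 and 4 sum to 0
sapply(by_codes, sum)   # groups 2 and 4 are simply absent
```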

Note that the value zero is not a special case. For instance,

  sapply(split(x, ff), prod)

gives the 'empty product', i.e., 1.

Exercise for the reader:

Note that
  sapply(split(x, ff, drop=TRUE), sum)
gives you the values of (just) the non-empty levels.

Now, why does
  sapply(split(x, ff), sum, drop=TRUE)
give the wrong value (1) for the empty levels, while
  sapply(split(x, ff), sum, drop=FALSE)
gives the correct value?

(The answer should be fairly obvious, but it's an easy mistake to make.)
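
For readers who want to check their answer (spoiler ahead): extra arguments
to sapply() are handed on to 'sum', not to 'split'.  The sketch below uses a
deterministic stand-in for the sampled factor, and the names plain and
with_drop are made up for illustration:

```r
# Deterministic stand-in for the sampled factor: levels 2 and 4 are empty.
ff <- factor(rep(c(1, 3, 5), length.out = 42), levels = 1:5)
x  <- seq_len(42) / 42

# 'sum' has no 'drop' argument, so drop=TRUE is just one more value to add.
sum(numeric(0), drop = TRUE)   # 1, since TRUE is coerced to 1
sum(numeric(0), drop = FALSE)  # 0, since FALSE is coerced to 0

plain     <- sapply(split(x, ff), sum)
with_drop <- sapply(split(x, ff), sum, drop = TRUE)

# Every total is off by one, not only those of the empty levels.
with_drop - plain
```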

-- 
Karl Ove Hufthammer
