Dear R-developers,

When tapply() is invoked on factors that have empty levels, it returns NA for those levels. This behaviour is in accord with the tapply documentation, and is reasonable in many cases. However, when FUN is sum, it would also seem reasonable to return 0 instead of NA, because "the sum of an empty set is zero, by definition."

I'd like to raise a discussion of the possibility of an amendment to tapply.

The attached patch changes the function so that it checks whether there are any empty levels and, if there are, replaces the corresponding NA values with the result of applying FUN to the empty set. E.g., in the case of sum, it replaces the NA with 0, whereas with mean, it replaces the NA with NA and issues a warning.

This change has the following advantage: tapply and sum work better together. Arguably, tapply and any other function that has a non-NA response to the empty set will also work better together. Furthermore, tapply shows a warning if FUN would normally show a warning upon being evaluated on an empty set. That deviates from current behaviour, which might be bad, but it also provides information that might be useful to the user, which would be good.

The attached script provides the new function in full, and demonstrates its application in some simple test cases.

Best wishes,

Andrew

--
Andrew Robinson
Department of Mathematics and Statistics        Tel: +61-3-8344-9763
University of Melbourne, VIC 3010 Australia     Fax: +61-3-8344-4599
http://www.ms.unimelb.edu.au/~andrewpr
http://blogs.mbs.edu/fishing-in-the-bay/

-------------- next part --------------
## The new function

my.tapply <- function (X, INDEX, FUN=NULL, ..., simplify=TRUE)
{
    FUN <- if (!is.null(FUN)) match.fun(FUN)
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    nI <- length(INDEX)
    namelist <- vector("list", nI)
    names(namelist) <- names(INDEX)
    extent <- integer(nI)
    nx <- length(X)
    one <- as.integer(1)
    group <- rep.int(one, nx)#- to contain the splitting vector
    ngroup <- one
    for (i in seq.int(INDEX)) {
        index <- as.factor(INDEX[[i]])
        if (length(index) != nx)
            stop("arguments must have same length")
        namelist[[i]] <- levels(index)#- all of them, yes !
        extent[i] <- nlevels(index)
        group <- group + ngroup * (as.integer(index) - one)
        ngroup <- ngroup * nlevels(index)
    }
    if (is.null(FUN)) return(group)
    ans <- lapply(split(X, group), FUN, ...)
    index <- as.numeric(names(ans))
    if (simplify && all(unlist(lapply(ans, length)) == 1)) {
        ansmat <- array(dim=extent, dimnames=namelist)
        ans <- unlist(ans, recursive = FALSE)
    } else {
        ansmat <- array(vector("list", prod(extent)),
                        dim=extent, dimnames=namelist)
    }
    ## old : ansmat[as.numeric(names(ans))] <- ans
    names(ans) <- NULL
    ansmat[index] <- ans
    if (sum(table(INDEX) < 1) > 0)
        ansmat[table(INDEX) < 1] <- do.call(FUN, list(c(NULL), ...))
    ansmat
}

## Check its utility

group <- factor(c(1,1,3,3), levels=c("1","2","3"))
x <- c(1,2,3,4)

## Ok with mean?

tapply(x, group, mean)
my.tapply(x, group, mean)

## Ok with sum?

tapply(x, group, sum)
my.tapply(x, group, sum)

## Check that other arguments are carried through

x <- c(NA,2,3,10)

tapply(x, group, sum, na.rm=TRUE)
tapply(x, group, mean, na.rm=TRUE)

my.tapply(x, group, sum, na.rm=TRUE)
my.tapply(x, group, mean, na.rm=TRUE)

## Check that listed groups work ok also

group.2 <- factor(c(1,2,3,3), levels=c("1","2","3"))

tapply(x, list(group, group.2), sum, na.rm=TRUE)
tapply(x, list(group, group.2), mean, na.rm=TRUE)

my.tapply(x, list(group, group.2), sum, na.rm=TRUE)
my.tapply(x, list(group, group.2), mean, na.rm=TRUE)
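
For readers who want to see the behaviour under discussion without running the whole script, the following is a minimal sketch in base R. It reuses the group and x values from the script; the variable names res, empty_sum, empty_mean and null_mean_warning are introduced here purely for illustration. It also shows why the patch reports NA with a warning for mean: the patch evaluates FUN on c(NULL), and mean() treats NULL as a non-numeric argument, whereas a zero-length numeric vector would give NaN silently.

```r
## Same data as in the script: level "2" has no observations.
group <- factor(c(1, 1, 3, 3), levels = c("1", "2", "3"))
x <- c(1, 2, 3, 4)

## Stock tapply(): the cell for the empty level is NA.
res <- tapply(x, group, sum)
print(res)

## What the functions themselves return on empty input.
empty_sum  <- sum(numeric(0))    # the empty sum is 0
empty_mean <- mean(numeric(0))   # NaN, and no warning

## mean(NULL) is what the patch evaluates via c(NULL); it returns NA and
## signals a warning. The warning text is captured here instead of emitted.
null_mean_warning <- tryCatch(mean(NULL),
                              warning = function(w) conditionMessage(w))

print(empty_sum)
print(empty_mean)
print(null_mean_warning)
```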
Bill.Venables at csiro.au
2007-Nov-06 06:53 UTC
[Rd] A suggestion for an amendment to tapply
Unfortunately I think it would break too much existing code. tapply() is an old function and many people have gotten used to the way it works now.

This is not to suggest there could not be another argument added at the end to indicate that you want the new behaviour, though. e.g.

    tapply <- function (X, INDEX, FUN=NULL, ..., simplify=TRUE,
                        handle.empty.levels = FALSE)

but this raises the question of what sort of time penalty the modification might entail. Probably not much for most situations, I suppose. (I know this argument name looks long, but you do need a fairly specific argument name, or it will start to impinge on the ... argument.)

Just some thoughts.

Bill Venables.

Bill Venables
CSIRO Laboratories
PO Box 120, Cleveland, 4163
AUSTRALIA
Office Phone (email preferred): +61 7 3826 7251
Fax (if absolutely necessary): +61 7 3826 7304
Mobile: +61 4 8819 4402
Home Phone: +61 7 3286 7700
mailto:Bill.Venables at csiro.au
http://www.cmis.csiro.au/bill.venables/

-----Original Message-----
From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Andrew Robinson
Sent: Tuesday, 6 November 2007 3:10 PM
To: R-Devel
Subject: [Rd] A suggestion for an amendment to tapply
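
One way the opt-in argument suggested above might look is sketched below, as a wrapper around the stock tapply() rather than a change to it. This is an illustration under stated assumptions, not the patch from the thread: the function name tapply_empty is hypothetical, handle.empty.levels is taken from the signature proposed in the reply, and the fill value is computed by calling FUN on a zero-length slice of X (X[0]) rather than on c(NULL) as the attached script does, so mean would give NaN without a warning here.

```r
## Hypothetical wrapper; tapply_empty() is not part of base R.
## handle.empty.levels = FALSE keeps the stock tapply() result unchanged.
tapply_empty <- function(X, INDEX, FUN = NULL, ..., simplify = TRUE,
                         handle.empty.levels = FALSE) {
    ans <- tapply(X, INDEX, FUN, ..., simplify = simplify)
    if (is.null(FUN) || !handle.empty.levels) return(ans)

    if (!is.list(INDEX)) INDEX <- list(INDEX)
    ## Cells with no observations; the table has the same shape as ans.
    empty <- table(lapply(INDEX, as.factor)) < 1
    if (any(empty)) {
        ## Assumption of this sketch: evaluate FUN on a zero-length
        ## vector of the same type as X to get the fill value.
        fill <- match.fun(FUN)(X[0], ...)
        if (is.list(ans)) {
            ans[empty] <- list(fill)
        } else if (length(fill) == 1L) {
            ans[empty] <- fill
        }
    }
    ans
}

group   <- factor(c(1, 1, 3, 3), levels = c("1", "2", "3"))
group.2 <- factor(c(1, 2, 3, 3), levels = c("1", "2", "3"))
x <- c(1, 2, 3, 4)

res_default <- tapply_empty(x, group, sum)   # empty level stays NA
res_filled  <- tapply_empty(x, group, sum, handle.empty.levels = TRUE)
res_two_way <- tapply_empty(x, list(group, group.2), sum,
                            handle.empty.levels = TRUE)
print(res_default)
print(res_filled)
print(res_two_way)
```

Because the new argument defaults to FALSE and comes after ..., existing calls behave as before, which is the compatibility concern raised in the reply. The extra cost when the option is switched on is one call to table() and one extra call to FUN.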