george.leigh@dpi.qld.gov.au
2004-Mar-15 09:29 UTC
[Rd] Bug in tapply with factors containing NAs (PR#6672)
Full_Name: George Leigh
Version: 1.8.1
OS: Windows 2000
Submission from: (NULL) (203.25.1.208)

The following example gives the correct answer when the first argument of tapply
is a numeric vector, but an incorrect answer when it is a factor. If the
function used by tapply is "length", the type and contents of the first argument
should make no difference, provided it has the same length as the second
argument.

> x = c(NA, 1)
> y = factor(x)
> tapply(x, y, length)
1
1
> tapply(y, y, length)
1
2
>
Prof Brian D Ripley
2004-Mar-15 12:18 UTC
[Rd] Bug in tapply with factors containing NAs (PR#6672)
On Mon, 15 Mar 2004 george.leigh@dpi.qld.gov.au wrote:

> Full_Name: George Leigh
> Version: 1.8.1
> OS: Windows 2000
> Submission from: (NULL) (203.25.1.208)
>
>
> The following example gives the correct answer when the first argument of tapply
> is a numeric vector, but an incorrect answer when it is a factor. If the
> function used by tapply is "length", the type and contents of the first argument
> should make no difference, provided it has the same length as the second
> argument.

Not so:

> split(x, y)
$"1"
[1] 1

> split(y, y)
$"1"
[1] <NA> 1
Levels: 1

Note that as there is only one level, NA must be 1 in y, whereas it does
not have to be in x. So the answer for a factor in your problem is
definitely correct, if fortuitous. R does the same as S in this example.

If there were more than one level in y, the issue is less clearcut.
Probably

    y[[k]] <- x[f == k]

in split.default should be

    x[f %in% k]

Note too

> z <- x; class(z) <- "foo"
> split(z, y)
$"1"
[1] NA 1

> > x = c(NA, 1)
> > y = factor(x)
> > tapply(x, y, length)
> 1
> 1
> > tapply(y, y, length)
> 1
> 2

--
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
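
[Editor's sketch, not part of the original message.] The difference between the two indexing expressions discussed above can be checked in isolation. The snippet below only exercises the subsetting expressions themselves, not split() or tapply(), whose results may differ in later R versions. The variable names f and k follow the quoted split.default code; everything else is illustrative.

```r
x <- c(NA, 1)
f <- factor(x)      # a single level, "1"; the first element is NA
k <- "1"

# Comparing an NA factor element with a level yields NA, not FALSE,
# and an NA logical index selects an NA element.
eq_index <- f == k          # NA TRUE
x_eq <- x[eq_index]         # two elements: NA and 1

# %in% is built on match() and never returns NA, so the NA entry is dropped.
in_index <- f %in% k        # FALSE TRUE
x_in <- x[in_index]         # one element: 1

print(x_eq)
print(x_in)
```

With f == k the NA element is carried into the group, which is why length reports 2 for the classed (factor) case; with f %in% k only the genuine member of level "1" remains.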
Peter Dalgaard
2004-Mar-15 12:21 UTC
[Rd] Bug in tapply with factors containing NAs (PR#6672)
george.leigh@dpi.qld.gov.au writes:

> Full_Name: George Leigh
> Version: 1.8.1
> OS: Windows 2000
> Submission from: (NULL) (203.25.1.208)
>
>
> The following example gives the correct answer when the first argument of tapply
> is a numeric vector, but an incorrect answer when it is a factor. If the
> function used by tapply is "length", the type and contents of the first argument
> should make no difference, provided it has the same length as the second
> argument.
>
> > x = c(NA, 1)
> > y = factor(x)
> > tapply(x, y, length)
> 1
> 1
> > tapply(y, y, length)
> 1
> 2
> >

The core of this is that

> split(y,y)
$"1"
[1] <NA> 1
Levels: 1

> split(x,y)
$"1"
[1] 1

which in turn comes from the innards of split.default:

    ...
    if (is.null(attr(x, "class")) && is.null(names(x)))
        return(.Internal(split(x, f)))
    lf <- levels(f)
    y <- vector("list", length(lf))
    names(y) <- lf
    for (k in lf) y[[k]] <- x[f == k]
    y

Factors have a class attribute, so you don't use the internal code in
that case and

> y[y=="1"]
[1] <NA> 1
Levels: 1

I think the line in split.default needs to read

    for (k in lf) y[[k]] <- x[!is.na(f) & f == k]

--
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
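
[Editor's sketch, not part of the original message.] The R-level loop quoted above can be reproduced as a small standalone function to compare the original indexing with the proposed one. split_loop and its drop_na argument are hypothetical names introduced here for illustration; this is not the actual split.default, and it omits the .Internal fast path entirely.

```r
# Hypothetical helper that mirrors only the R-level loop quoted from
# split.default. drop_na = FALSE uses the original index (f == k);
# drop_na = TRUE uses the proposed index (!is.na(f) & f == k).
split_loop <- function(x, f, drop_na = TRUE) {
    lf <- levels(f)
    out <- vector("list", length(lf))
    names(out) <- lf
    for (k in lf) {
        keep <- if (drop_na) !is.na(f) & f == k else f == k
        out[[k]] <- x[keep]
    }
    out
}

x <- c(NA, 1)
y <- factor(x)

# Original indexing: the NA element is carried into the group for level "1".
old_len <- length(split_loop(y, y, drop_na = FALSE)[["1"]])

# Proposed indexing: NA & anything-FALSE is FALSE, so the NA element is dropped.
new_len <- length(split_loop(y, y, drop_na = TRUE)[["1"]])

print(c(old = old_len, new = new_len))
```

The guard works because !is.na(f) is FALSE exactly where f == k would be NA, and FALSE & NA evaluates to FALSE in R, so no NA ever reaches the subscript.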