Suharto Anggono Suharto Anggono
2017-Jan-27 16:36 UTC
[Rd] RFC: tapply(*, ..., init.value = NA)
The "no factor combination" case is distinguishable by 'tapply' with simplify=FALSE.> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3) > D2 <- D2[-c(1,5), ] > DN <- D2; DN[1,"N"] <- NA > with(DN, tapply(N, list(n,L), FUN=sum, simplify=FALSE))A B C D E F 1 NA 6 NULL NULL NULL NULL 2 NULL NULL 3 6 NULL NULL 3 NULL NULL NULL NULL 6 6 There is an old related discussion starting on https://stat.ethz.ch/pipermail/r-devel/2007-November/047338.html . ---------------------------------- Last week, we've talked here about "xtabs(), factors and NAs", -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html In the mean time, I've spent several hours on the issue and also committed changes to R-devel "in two iterations". In the case there is a *Left* hand side part to xtabs() formula, see the help page example using 'esoph', it uses tapply(..., FUN = sum) and I now think there is a missing feature in tapply() there, which I am proposing to change. Look at a small example:> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2), , N = 3 L n A B C D E F 1 1 2 0 0 0 0 2 0 0 1 2 0 0 3 0 0 0 0 2 2> DN <- D2; DN[1,"N"] <- NA; DNn L N 2 1 A NA 3 1 B 3 4 1 B 3 6 2 C 3 7 2 D 3 8 2 D 3 9 3 E 3 10 3 E 3 11 3 F 3 12 3 F 3> with(DN, tapply(N, list(n,L), FUN=sum))A B C D E F 1 NA 6 NA NA NA NA 2 NA NA 3 6 NA NA 3 NA NA NA NA 6 6>and as you can see, the resulting matrix has NAs, all the same NA_real_, but semantically of two different kinds: 1) at ["1", "A"], the NA comes from the NA in 'N' 2) all other NAs come from the fact that there is no such factor combination *and* from the fact that tapply() uses array(dim = .., dimnames = ...) i.e., initializes the array with NAs (see definition of 'array'). My proposition is the following patch to tapply(), adding a new option 'init.value': ----------------------------------------------------------------------------- -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE) +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) { FUN <- if (!is.null(FUN)) match.fun(FUN) if (!is.list(INDEX)) INDEX <- list(INDEX) @@ -44,7 +44,7 @@ index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L ans <- lapply(X = ans[index], FUN = FUN, ...) if (simplify && all(lengths(ans) == 1L)) { - ansmat <- array(dim = extent, dimnames = namelist) + ansmat <- array(init.value, dim = extent, dimnames = namelist) ans <- unlist(ans, recursive = FALSE) } else { ansmat <- array(vector("list", prod(extent)), ----------------------------------------------------------------------------- With that, I can set the initial value to '0' instead of array's default of NA :> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))A B C D E F 1 NA 6 0 0 0 0 2 0 0 3 6 0 0 3 0 0 0 0 6 6>which now has 0 counts and NA as is desirable to be used inside xtabs(). All fine... and would not be worth a posting to R-devel, except for this: The change will not be 100% back compatible -- by necessity: any new argument for tapply() will make that argument name not available to be specified (via '...') for 'FUN'. The new function would be> str(tapply)function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) where the '...' are passed FUN(), and with the new signature, 'init.value' then won't be passed to FUN "anymore" (compared to R <= 3.3.x). For that reason, we could use 'INIT.VALUE' instead (possibly decreasing the probability the arg name is used in other functions). Opinions? Thank you in advance, Martin
>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org> >>>>> on Fri, 27 Jan 2017 16:36:59 +0000 writes:> The "no factor combination" case is distinguishable by 'tapply' with simplify=FALSE. >> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3) >> D2 <- D2[-c(1,5), ] >> DN <- D2; DN[1,"N"] <- NA >> with(DN, tapply(N, list(n,L), FUN=sum, simplify=FALSE)) > A B C D E F > 1 NA 6 NULL NULL NULL NULL > 2 NULL NULL 3 6 NULL NULL > 3 NULL NULL NULL NULL 6 6 Yes, I know that simplify=FALSE behaves differently, it returns a list with dim & dimnames, sometimes also called a "list - matrix" ... and it *can* be used instead, but to be useful needs to be post processed and that overall is somewhat inefficient and ugly. > There is an old related discussion starting on https://stat.ethz.ch/pipermail/r-devel/2007-November/047338.html . Thank you, indeed, for finding that. There Andrew Robinson did raise the same issue, but his proposed solution was not much back compatible and I think was primarily dismissed because of that. Martin > ---------------------------------- > Last week, we've talked here about "xtabs(), factors and NAs", -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html > In the mean time, I've spent several hours on the issue > and also committed changes to R-devel "in two iterations". > In the case there is a *Left* hand side part to xtabs() formula, > see the help page example using 'esoph', > it uses tapply(..., FUN = sum) and > I now think there is a missing feature in tapply() there, which > I am proposing to change. > Look at a small example: >> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2) > , , N = 3 > L > n A B C D E F > 1 1 2 0 0 0 0 > 2 0 0 1 2 0 0 > 3 0 0 0 0 2 2 >> DN <- D2; DN[1,"N"] <- NA; DN > n L N > 2 1 A NA > 3 1 B 3 > 4 1 B 3 > 6 2 C 3 > 7 2 D 3 > 8 2 D 3 > 9 3 E 3 > 10 3 E 3 > 11 3 F 3 > 12 3 F 3 >> with(DN, tapply(N, list(n,L), FUN=sum)) > A B C D E F > 1 NA 6 NA NA NA NA > 2 NA NA 3 6 NA NA > 3 NA NA NA NA 6 6 >> > and as you can see, the resulting matrix has NAs, all the same > NA_real_, but semantically of two different kinds: > 1) at ["1", "A"], the NA comes from the NA in 'N' > 2) all other NAs come from the fact that there is no such factor combination > *and* from the fact that tapply() uses > array(dim = .., dimnames = ...) > i.e., initializes the array with NAs (see definition of 'array'). > My proposition is the following patch to tapply(), adding a new > option 'init.value': > ----------------------------------------------------------------------------- > -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE) > +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) > { > FUN <- if (!is.null(FUN)) match.fun(FUN) > if (!is.list(INDEX)) INDEX <- list(INDEX) > @@ -44,7 +44,7 @@ > index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L > ans <- lapply(X = ans[index], FUN = FUN, ...) > if (simplify && all(lengths(ans) == 1L)) { > - ansmat <- array(dim = extent, dimnames = namelist) > + ansmat <- array(init.value, dim = extent, dimnames = namelist) > ans <- unlist(ans, recursive = FALSE) > } else { > ansmat <- array(vector("list", prod(extent)), > ----------------------------------------------------------------------------- > With that, I can set the initial value to '0' instead of array's > default of NA : >> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0)) > A B C D E F > 1 NA 6 0 0 0 0 > 2 0 0 3 6 0 0 > 3 0 0 0 0 6 6 >> > which now has 0 counts and NA as is desirable to be used inside > xtabs(). > All fine... and would not be worth a posting to R-devel, > except for this: > The change will not be 100% back compatible -- by necessity: any new argument for > tapply() will make that argument name not available to be > specified (via '...') for 'FUN'. The new function would be >> str(tapply) > function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) > where the '...' are passed FUN(), and with the new signature, > 'init.value' then won't be passed to FUN "anymore" (compared to > R <= 3.3.x). > For that reason, we could use 'INIT.VALUE' instead (possibly decreasing > the probability the arg name is used in other functions). > Opinions? > Thank you in advance, > Martin > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel