Last week, we've talked here about "xtabs(), factors and NAs", -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html In the mean time, I've spent several hours on the issue and also committed changes to R-devel "in two iterations". In the case there is a *Left* hand side part to xtabs() formula, see the help page example using 'esoph', it uses tapply(..., FUN = sum) and I now think there is a missing feature in tapply() there, which I am proposing to change. Look at a small example:> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2), , N = 3 L n A B C D E F 1 1 2 0 0 0 0 2 0 0 1 2 0 0 3 0 0 0 0 2 2> DN <- D2; DN[1,"N"] <- NA; DNn L N 2 1 A NA 3 1 B 3 4 1 B 3 6 2 C 3 7 2 D 3 8 2 D 3 9 3 E 3 10 3 E 3 11 3 F 3 12 3 F 3> with(DN, tapply(N, list(n,L), FUN=sum))A B C D E F 1 NA 6 NA NA NA NA 2 NA NA 3 6 NA NA 3 NA NA NA NA 6 6>and as you can see, the resulting matrix has NAs, all the same NA_real_, but semantically of two different kinds: 1) at ["1", "A"], the NA comes from the NA in 'N' 2) all other NAs come from the fact that there is no such factor combination *and* from the fact that tapply() uses array(dim = .., dimnames = ...) i.e., initializes the array with NAs (see definition of 'array'). My proposition is the following patch to tapply(), adding a new option 'init.value': ----------------------------------------------------------------------------- -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE) +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) { FUN <- if (!is.null(FUN)) match.fun(FUN) if (!is.list(INDEX)) INDEX <- list(INDEX) @@ -44,7 +44,7 @@ index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L ans <- lapply(X = ans[index], FUN = FUN, ...) if (simplify && all(lengths(ans) == 1L)) { - ansmat <- array(dim = extent, dimnames = namelist) + ansmat <- array(init.value, dim = extent, dimnames = namelist) ans <- unlist(ans, recursive = FALSE) } else { ansmat <- array(vector("list", prod(extent)), ----------------------------------------------------------------------------- With that, I can set the initial value to '0' instead of array's default of NA :> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))A B C D E F 1 NA 6 0 0 0 0 2 0 0 3 6 0 0 3 0 0 0 0 6 6>which now has 0 counts and NA as is desirable to be used inside xtabs(). All fine... and would not be worth a posting to R-devel, except for this: The change will not be 100% back compatible -- by necessity: any new argument for tapply() will make that argument name not available to be specified (via '...') for 'FUN'. The new function would be> str(tapply)function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) where the '...' are passed FUN(), and with the new signature, 'init.value' then won't be passed to FUN "anymore" (compared to R <= 3.3.x). For that reason, we could use 'INIT.VALUE' instead (possibly decreasing the probability the arg name is used in other functions). Opinions? Thank you in advance, Martin
It would be cool if the default for tapply's init.value could be FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for FUN=all, -Inf for FUN=max, etc. But that would take time and would break code for which FUN did not work on length-0 objects. Bill Dunlap TIBCO Software wdunlap tibco.com On Thu, Jan 26, 2017 at 2:42 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:> Last week, we've talked here about "xtabs(), factors and NAs", > -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html > > In the mean time, I've spent several hours on the issue > and also committed changes to R-devel "in two iterations". > > In the case there is a *Left* hand side part to xtabs() formula, > see the help page example using 'esoph', > it uses tapply(..., FUN = sum) and > I now think there is a missing feature in tapply() there, which > I am proposing to change. > > Look at a small example: > >> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2) > , , N = 3 > > L > n A B C D E F > 1 1 2 0 0 0 0 > 2 0 0 1 2 0 0 > 3 0 0 0 0 2 2 > >> DN <- D2; DN[1,"N"] <- NA; DN > n L N > 2 1 A NA > 3 1 B 3 > 4 1 B 3 > 6 2 C 3 > 7 2 D 3 > 8 2 D 3 > 9 3 E 3 > 10 3 E 3 > 11 3 F 3 > 12 3 F 3 >> with(DN, tapply(N, list(n,L), FUN=sum)) > A B C D E F > 1 NA 6 NA NA NA NA > 2 NA NA 3 6 NA NA > 3 NA NA NA NA 6 6 >> > > and as you can see, the resulting matrix has NAs, all the same > NA_real_, but semantically of two different kinds: > > 1) at ["1", "A"], the NA comes from the NA in 'N' > 2) all other NAs come from the fact that there is no such factor combination > *and* from the fact that tapply() uses > > array(dim = .., dimnames = ...) > > i.e., initializes the array with NAs (see definition of 'array'). > > My proposition is the following patch to tapply(), adding a new > option 'init.value': > > ----------------------------------------------------------------------------- > > -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE) > +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) > { > FUN <- if (!is.null(FUN)) match.fun(FUN) > if (!is.list(INDEX)) INDEX <- list(INDEX) > @@ -44,7 +44,7 @@ > index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L > ans <- lapply(X = ans[index], FUN = FUN, ...) > if (simplify && all(lengths(ans) == 1L)) { > - ansmat <- array(dim = extent, dimnames = namelist) > + ansmat <- array(init.value, dim = extent, dimnames = namelist) > ans <- unlist(ans, recursive = FALSE) > } else { > ansmat <- array(vector("list", prod(extent)), > > ----------------------------------------------------------------------------- > > With that, I can set the initial value to '0' instead of array's > default of NA : > >> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0)) > A B C D E F > 1 NA 6 0 0 0 0 > 2 0 0 3 6 0 0 > 3 0 0 0 0 6 6 >> > > which now has 0 counts and NA as is desirable to be used inside > xtabs(). > > All fine... and would not be worth a posting to R-devel, > except for this: > > The change will not be 100% back compatible -- by necessity: any new argument for > tapply() will make that argument name not available to be > specified (via '...') for 'FUN'. The new function would be > >> str(tapply) > function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) > > where the '...' are passed FUN(), and with the new signature, > 'init.value' then won't be passed to FUN "anymore" (compared to > R <= 3.3.x). > > For that reason, we could use 'INIT.VALUE' instead (possibly decreasing > the probability the arg name is used in other functions). > > > Opinions? > > Thank you in advance, > Martin > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
On a related note, the storage mode should try to match ans[[1]] (or unlist:ed and) when allocating 'ansmat' to avoid coercion and hence a full copy. Henrik On Jan 26, 2017 07:50, "William Dunlap via R-devel" <r-devel at r-project.org> wrote: It would be cool if the default for tapply's init.value could be FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for FUN=all, -Inf for FUN=max, etc. But that would take time and would break code for which FUN did not work on length-0 objects. Bill Dunlap TIBCO Software wdunlap tibco.com On Thu, Jan 26, 2017 at 2:42 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:> Last week, we've talked here about "xtabs(), factors and NAs", > -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html > > In the mean time, I've spent several hours on the issue > and also committed changes to R-devel "in two iterations". > > In the case there is a *Left* hand side part to xtabs() formula, > see the help page example using 'esoph', > it uses tapply(..., FUN = sum) and > I now think there is a missing feature in tapply() there, which > I am proposing to change. > > Look at a small example: > >> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),N=3)[-c(1,5), ]; xtabs(~., D2)> , , N = 3 > > L > n A B C D E F > 1 1 2 0 0 0 0 > 2 0 0 1 2 0 0 > 3 0 0 0 0 2 2 > >> DN <- D2; DN[1,"N"] <- NA; DN > n L N > 2 1 A NA > 3 1 B 3 > 4 1 B 3 > 6 2 C 3 > 7 2 D 3 > 8 2 D 3 > 9 3 E 3 > 10 3 E 3 > 11 3 F 3 > 12 3 F 3 >> with(DN, tapply(N, list(n,L), FUN=sum)) > A B C D E F > 1 NA 6 NA NA NA NA > 2 NA NA 3 6 NA NA > 3 NA NA NA NA 6 6 >> > > and as you can see, the resulting matrix has NAs, all the same > NA_real_, but semantically of two different kinds: > > 1) at ["1", "A"], the NA comes from the NA in 'N' > 2) all other NAs come from the fact that there is no such factorcombination> *and* from the fact that tapply() uses > > array(dim = .., dimnames = ...) > > i.e., initializes the array with NAs (see definition of 'array'). > > My proposition is the following patch to tapply(), adding a new > option 'init.value': > > -----------------------------------------------------------------------------> > -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE) > +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify= TRUE)> { > FUN <- if (!is.null(FUN)) match.fun(FUN) > if (!is.list(INDEX)) INDEX <- list(INDEX) > @@ -44,7 +44,7 @@ > index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L > ans <- lapply(X = ans[index], FUN = FUN, ...) > if (simplify && all(lengths(ans) == 1L)) { > - ansmat <- array(dim = extent, dimnames = namelist) > + ansmat <- array(init.value, dim = extent, dimnames = namelist) > ans <- unlist(ans, recursive = FALSE) > } else { > ansmat <- array(vector("list", prod(extent)), > > -----------------------------------------------------------------------------> > With that, I can set the initial value to '0' instead of array's > default of NA : > >> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0)) > A B C D E F > 1 NA 6 0 0 0 0 > 2 0 0 3 6 0 0 > 3 0 0 0 0 6 6 >> > > which now has 0 counts and NA as is desirable to be used inside > xtabs(). > > All fine... and would not be worth a posting to R-devel, > except for this: > > The change will not be 100% back compatible -- by necessity: any newargument for> tapply() will make that argument name not available to be > specified (via '...') for 'FUN'. The new function would be > >> str(tapply) > function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) > > where the '...' are passed FUN(), and with the new signature, > 'init.value' then won't be passed to FUN "anymore" (compared to > R <= 3.3.x). > > For that reason, we could use 'INIT.VALUE' instead (possibly decreasing > the probability the arg name is used in other functions). > > > Opinions? > > Thank you in advance, > Martin > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel [[alternative HTML version deleted]]
If xtabs is enhanced then as.data.frame.table may also need to be modified so that it continues to be usable as an inverse, at least to the degree feasible. On Thu, Jan 26, 2017 at 5:42 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:> Last week, we've talked here about "xtabs(), factors and NAs", > -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html > > In the mean time, I've spent several hours on the issue > and also committed changes to R-devel "in two iterations". > > In the case there is a *Left* hand side part to xtabs() formula, > see the help page example using 'esoph', > it uses tapply(..., FUN = sum) and > I now think there is a missing feature in tapply() there, which > I am proposing to change. > > Look at a small example: > >> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2) > , , N = 3 > > L > n A B C D E F > 1 1 2 0 0 0 0 > 2 0 0 1 2 0 0 > 3 0 0 0 0 2 2 > >> DN <- D2; DN[1,"N"] <- NA; DN > n L N > 2 1 A NA > 3 1 B 3 > 4 1 B 3 > 6 2 C 3 > 7 2 D 3 > 8 2 D 3 > 9 3 E 3 > 10 3 E 3 > 11 3 F 3 > 12 3 F 3 >> with(DN, tapply(N, list(n,L), FUN=sum)) > A B C D E F > 1 NA 6 NA NA NA NA > 2 NA NA 3 6 NA NA > 3 NA NA NA NA 6 6 >> > > and as you can see, the resulting matrix has NAs, all the same > NA_real_, but semantically of two different kinds: > > 1) at ["1", "A"], the NA comes from the NA in 'N' > 2) all other NAs come from the fact that there is no such factor combination > *and* from the fact that tapply() uses > > array(dim = .., dimnames = ...) > > i.e., initializes the array with NAs (see definition of 'array'). > > My proposition is the following patch to tapply(), adding a new > option 'init.value': > > ----------------------------------------------------------------------------- > > -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE) > +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) > { > FUN <- if (!is.null(FUN)) match.fun(FUN) > if (!is.list(INDEX)) INDEX <- list(INDEX) > @@ -44,7 +44,7 @@ > index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L > ans <- lapply(X = ans[index], FUN = FUN, ...) > if (simplify && all(lengths(ans) == 1L)) { > - ansmat <- array(dim = extent, dimnames = namelist) > + ansmat <- array(init.value, dim = extent, dimnames = namelist) > ans <- unlist(ans, recursive = FALSE) > } else { > ansmat <- array(vector("list", prod(extent)), > > ----------------------------------------------------------------------------- > > With that, I can set the initial value to '0' instead of array's > default of NA : > >> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0)) > A B C D E F > 1 NA 6 0 0 0 0 > 2 0 0 3 6 0 0 > 3 0 0 0 0 6 6 >> > > which now has 0 counts and NA as is desirable to be used inside > xtabs(). > > All fine... and would not be worth a posting to R-devel, > except for this: > > The change will not be 100% back compatible -- by necessity: any new argument for > tapply() will make that argument name not available to be > specified (via '...') for 'FUN'. The new function would be > >> str(tapply) > function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) > > where the '...' are passed FUN(), and with the new signature, > 'init.value' then won't be passed to FUN "anymore" (compared to > R <= 3.3.x). > > For that reason, we could use 'INIT.VALUE' instead (possibly decreasing > the probability the arg name is used in other functions). > > > Opinions? > > Thank you in advance, > Martin > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel-- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com