On a related note, the storage mode should try to match ans[[1]] (or unlist:ed and) when allocating 'ansmat' to avoid coercion and hence a full copy. Henrik On Jan 26, 2017 07:50, "William Dunlap via R-devel" <r-devel at r-project.org> wrote: It would be cool if the default for tapply's init.value could be FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for FUN=all, -Inf for FUN=max, etc. But that would take time and would break code for which FUN did not work on length-0 objects. Bill Dunlap TIBCO Software wdunlap tibco.com On Thu, Jan 26, 2017 at 2:42 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:> Last week, we've talked here about "xtabs(), factors and NAs", > -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html > > In the mean time, I've spent several hours on the issue > and also committed changes to R-devel "in two iterations". > > In the case there is a *Left* hand side part to xtabs() formula, > see the help page example using 'esoph', > it uses tapply(..., FUN = sum) and > I now think there is a missing feature in tapply() there, which > I am proposing to change. > > Look at a small example: > >> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),N=3)[-c(1,5), ]; xtabs(~., D2)> , , N = 3 > > L > n A B C D E F > 1 1 2 0 0 0 0 > 2 0 0 1 2 0 0 > 3 0 0 0 0 2 2 > >> DN <- D2; DN[1,"N"] <- NA; DN > n L N > 2 1 A NA > 3 1 B 3 > 4 1 B 3 > 6 2 C 3 > 7 2 D 3 > 8 2 D 3 > 9 3 E 3 > 10 3 E 3 > 11 3 F 3 > 12 3 F 3 >> with(DN, tapply(N, list(n,L), FUN=sum)) > A B C D E F > 1 NA 6 NA NA NA NA > 2 NA NA 3 6 NA NA > 3 NA NA NA NA 6 6 >> > > and as you can see, the resulting matrix has NAs, all the same > NA_real_, but semantically of two different kinds: > > 1) at ["1", "A"], the NA comes from the NA in 'N' > 2) all other NAs come from the fact that there is no such factorcombination> *and* from the fact that tapply() uses > > array(dim = .., dimnames = ...) > > i.e., initializes the array with NAs (see definition of 'array'). > > My proposition is the following patch to tapply(), adding a new > option 'init.value': > > -----------------------------------------------------------------------------> > -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE) > +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify= TRUE)> { > FUN <- if (!is.null(FUN)) match.fun(FUN) > if (!is.list(INDEX)) INDEX <- list(INDEX) > @@ -44,7 +44,7 @@ > index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L > ans <- lapply(X = ans[index], FUN = FUN, ...) > if (simplify && all(lengths(ans) == 1L)) { > - ansmat <- array(dim = extent, dimnames = namelist) > + ansmat <- array(init.value, dim = extent, dimnames = namelist) > ans <- unlist(ans, recursive = FALSE) > } else { > ansmat <- array(vector("list", prod(extent)), > > -----------------------------------------------------------------------------> > With that, I can set the initial value to '0' instead of array's > default of NA : > >> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0)) > A B C D E F > 1 NA 6 0 0 0 0 > 2 0 0 3 6 0 0 > 3 0 0 0 0 6 6 >> > > which now has 0 counts and NA as is desirable to be used inside > xtabs(). > > All fine... and would not be worth a posting to R-devel, > except for this: > > The change will not be 100% back compatible -- by necessity: any newargument for> tapply() will make that argument name not available to be > specified (via '...') for 'FUN'. The new function would be > >> str(tapply) > function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) > > where the '...' are passed FUN(), and with the new signature, > 'init.value' then won't be passed to FUN "anymore" (compared to > R <= 3.3.x). > > For that reason, we could use 'INIT.VALUE' instead (possibly decreasing > the probability the arg name is used in other functions). > > > Opinions? > > Thank you in advance, > Martin > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel [[alternative HTML version deleted]]
> On Jan 26, 2017 07:50, "William Dunlap via R-devel" <r-devel at r-project.org>> wrote: > It would be cool if the default for tapply's init.value could be > FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for > FUN=all, -Inf for FUN=max, etc. But that would take time and would > break code for which FUN did not work on length-0 objects. > Bill Dunlap > TIBCO Software > wdunlap tibco.com I had the same idea (after my first post), so I agree that would be nice. One could argue it would take time only if the user is too lazy to specify the value, and we could use tryCatch(FUN(X[0]), error = NA) to safeguard against those functions that fail for 0 length arg. But I think the main reason for _not_ setting such a default is back-compatibility. In my proposal, the new argument would not be any change by default and so all current uses of tapply() would remain unchanged.>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com> >>>>> on Thu, 26 Jan 2017 07:57:08 -0800 writes:> On a related note, the storage mode should try to match ans[[1]] (or > unlist:ed and) when allocating 'ansmat' to avoid coercion and hence a full > copy. Yes, related indeed; and would fall "in line" with Bill's idea. OTOH, it could be implemented independently, by something like if(missing(init.value)) init.value <- if(length(ans)) as.vector(NA, mode=storage.mode(ans[[1]])) else NA ............. A colleague proposed to use the shorter argument name 'default' instead of 'init.value' which indeed maybe more natural and still not too often used as "non-first" argument in FUN(.). Thank you for the constructive feedback! Martin > On Thu, Jan 26, 2017 at 2:42 AM, Martin Maechler > <maechler at stat.math.ethz.ch> wrote: >> Last week, we've talked here about "xtabs(), factors and NAs", -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html >> >> In the mean time, I've spent several hours on the issue >> and also committed changes to R-devel "in two iterations". >> >> In the case there is a *Left* hand side part to xtabs() formula, >> see the help page example using 'esoph', >> it uses tapply(..., FUN = sum) and >> I now think there is a missing feature in tapply() there, which >> I am proposing to change. >> >> Look at a small example: >> >>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), > N=3)[-c(1,5), ]; xtabs(~., D2) >> , , N = 3 >> >> L >> n A B C D E F >> 1 1 2 0 0 0 0 >> 2 0 0 1 2 0 0 >> 3 0 0 0 0 2 2 >> >>> DN <- D2; DN[1,"N"] <- NA; DN >> n L N >> 2 1 A NA >> 3 1 B 3 >> 4 1 B 3 >> 6 2 C 3 >> 7 2 D 3 >> 8 2 D 3 >> 9 3 E 3 >> 10 3 E 3 >> 11 3 F 3 >> 12 3 F 3 >>> with(DN, tapply(N, list(n,L), FUN=sum)) >> A B C D E F >> 1 NA 6 NA NA NA NA >> 2 NA NA 3 6 NA NA >> 3 NA NA NA NA 6 6 >>> >> >> and as you can see, the resulting matrix has NAs, all the same >> NA_real_, but semantically of two different kinds: >> >> 1) at ["1", "A"], the NA comes from the NA in 'N' >> 2) all other NAs come from the fact that there is no such factor > combination >> *and* from the fact that tapply() uses >> >> array(dim = .., dimnames = ...) >> >> i.e., initializes the array with NAs (see definition of 'array'). >> >> My proposition is the following patch to tapply(), adding a new >> option 'init.value': >> >> ------------------------------------------------------------ > ----------------- >> >> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE) >> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify > = TRUE) >> { >> FUN <- if (!is.null(FUN)) match.fun(FUN) >> if (!is.list(INDEX)) INDEX <- list(INDEX) >> @@ -44,7 +44,7 @@ >> index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L >> ans <- lapply(X = ans[index], FUN = FUN, ...) >> if (simplify && all(lengths(ans) == 1L)) { >> - ansmat <- array(dim = extent, dimnames = namelist) >> + ansmat <- array(init.value, dim = extent, dimnames = namelist) >> ans <- unlist(ans, recursive = FALSE) >> } else { >> ansmat <- array(vector("list", prod(extent)), >> >> ------------------------------------------------------------ > ----------------- >> >> With that, I can set the initial value to '0' instead of array's >> default of NA : >> >>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0)) >> A B C D E F >> 1 NA 6 0 0 0 0 >> 2 0 0 3 6 0 0 >> 3 0 0 0 0 6 6 >>> >> >> which now has 0 counts and NA as is desirable to be used inside >> xtabs(). >> >> All fine... and would not be worth a posting to R-devel, >> except for this: >> >> The change will not be 100% back compatible -- by necessity: any new > argument for >> tapply() will make that argument name not available to be >> specified (via '...') for 'FUN'. The new function would be >> >>> str(tapply) >> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) >> >> where the '...' are passed FUN(), and with the new signature, >> 'init.value' then won't be passed to FUN "anymore" (compared to >> R <= 3.3.x). >> >> For that reason, we could use 'INIT.VALUE' instead (possibly decreasing >> the probability the arg name is used in other functions). >> >> >> Opinions? >> >> Thank you in advance, >> Martin >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]]
On Fri, Jan 27, 2017 at 12:34 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:> > > On Jan 26, 2017 07:50, "William Dunlap via R-devel" <r-devel at r-project.org> > > wrote: > > > It would be cool if the default for tapply's init.value could be > > FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for > > FUN=all, -Inf for FUN=max, etc. But that would take time and would > > break code for which FUN did not work on length-0 objects. > > > Bill Dunlap > > TIBCO Software > > wdunlap tibco.com > > I had the same idea (after my first post), so I agree that would > be nice. One could argue it would take time only if the user is too lazy > to specify the value, and we could use > tryCatch(FUN(X[0]), error = NA) > to safeguard against those functions that fail for 0 length arg. > > But I think the main reason for _not_ setting such a default is > back-compatibility. In my proposal, the new argument would not > be any change by default and so all current uses of tapply() > would remain unchanged. > >>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com> >>>>>> on Thu, 26 Jan 2017 07:57:08 -0800 writes: > > > On a related note, the storage mode should try to match ans[[1]] (or > > unlist:ed and) when allocating 'ansmat' to avoid coercion and hence a full > > copy. > > Yes, related indeed; and would fall "in line" with Bill's idea. > OTOH, it could be implemented independently, > by something like > > if(missing(init.value)) > init.value <- > if(length(ans)) as.vector(NA, mode=storage.mode(ans[[1]])) > else NAI would probably do something like: ans <- unlist(ans, recursive = FALSE, use.names = FALSE) if (length(ans)) storage.mode(init.value) <- storage.mode(ans[[1]]) ansmat <- array(init.value, dim = extent, dimnames = namelist) instead. That completely avoids having to use missing() and the value of 'init.value' will be coerced later if not done upfront. use.names = FALSE speeds up unlist(). /Henrik> > ............. > > A colleague proposed to use the shorter argument name 'default' > instead of 'init.value' which indeed maybe more natural and > still not too often used as "non-first" argument in FUN(.). > > Thank you for the constructive feedback! > Martin > > > On Thu, Jan 26, 2017 at 2:42 AM, Martin Maechler > > <maechler at stat.math.ethz.ch> wrote: > >> Last week, we've talked here about "xtabs(), factors and NAs", > -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html > >> > >> In the mean time, I've spent several hours on the issue > >> and also committed changes to R-devel "in two iterations". > >> > >> In the case there is a *Left* hand side part to xtabs() formula, > >> see the help page example using 'esoph', > >> it uses tapply(..., FUN = sum) and > >> I now think there is a missing feature in tapply() there, which > >> I am proposing to change. > >> > >> Look at a small example: > >> > >>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), > > N=3)[-c(1,5), ]; xtabs(~., D2) > >> , , N = 3 > >> > >> L > >> n A B C D E F > >> 1 1 2 0 0 0 0 > >> 2 0 0 1 2 0 0 > >> 3 0 0 0 0 2 2 > >> > >>> DN <- D2; DN[1,"N"] <- NA; DN > >> n L N > >> 2 1 A NA > >> 3 1 B 3 > >> 4 1 B 3 > >> 6 2 C 3 > >> 7 2 D 3 > >> 8 2 D 3 > >> 9 3 E 3 > >> 10 3 E 3 > >> 11 3 F 3 > >> 12 3 F 3 > >>> with(DN, tapply(N, list(n,L), FUN=sum)) > >> A B C D E F > >> 1 NA 6 NA NA NA NA > >> 2 NA NA 3 6 NA NA > >> 3 NA NA NA NA 6 6 > >>> > >> > >> and as you can see, the resulting matrix has NAs, all the same > >> NA_real_, but semantically of two different kinds: > >> > >> 1) at ["1", "A"], the NA comes from the NA in 'N' > >> 2) all other NAs come from the fact that there is no such factor > > combination > >> *and* from the fact that tapply() uses > >> > >> array(dim = .., dimnames = ...) > >> > >> i.e., initializes the array with NAs (see definition of 'array'). > >> > >> My proposition is the following patch to tapply(), adding a new > >> option 'init.value': > >> > >> ------------------------------------------------------------ > > ----------------- > >> > >> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE) > >> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify > > = TRUE) > >> { > >> FUN <- if (!is.null(FUN)) match.fun(FUN) > >> if (!is.list(INDEX)) INDEX <- list(INDEX) > >> @@ -44,7 +44,7 @@ > >> index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L > >> ans <- lapply(X = ans[index], FUN = FUN, ...) > >> if (simplify && all(lengths(ans) == 1L)) { > >> - ansmat <- array(dim = extent, dimnames = namelist) > >> + ansmat <- array(init.value, dim = extent, dimnames = namelist) > >> ans <- unlist(ans, recursive = FALSE) > >> } else { > >> ansmat <- array(vector("list", prod(extent)), > >> > >> ------------------------------------------------------------ > > ----------------- > >> > >> With that, I can set the initial value to '0' instead of array's > >> default of NA : > >> > >>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0)) > >> A B C D E F > >> 1 NA 6 0 0 0 0 > >> 2 0 0 3 6 0 0 > >> 3 0 0 0 0 6 6 > >>> > >> > >> which now has 0 counts and NA as is desirable to be used inside > >> xtabs(). > >> > >> All fine... and would not be worth a posting to R-devel, > >> except for this: > >> > >> The change will not be 100% back compatible -- by necessity: any new > > argument for > >> tapply() will make that argument name not available to be > >> specified (via '...') for 'FUN'. The new function would be > >> > >>> str(tapply) > >> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE) > >> > >> where the '...' are passed FUN(), and with the new signature, > >> 'init.value' then won't be passed to FUN "anymore" (compared to > >> R <= 3.3.x). > >> > >> For that reason, we could use 'INIT.VALUE' instead (possibly decreasing > >> the probability the arg name is used in other functions). > >> > >> > >> Opinions? > >> > >> Thank you in advance, > >> Martin > >> > >> ______________________________________________ > >> R-devel at r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-devel > > > ______________________________________________ > > R-devel at r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > > > [[alternative HTML version deleted]] >