thr3ads.net - R devel - [Rd] RFC: tapply(*, ..., init.value = NA) [Jan 2017]

If this information is useful, please help other people find it:
Share via:

Suharto Anggono Suharto Anggono

2017-Jan-27 16:36 UTC

[Rd] RFC: tapply(*, ..., init.value = NA)

The "no factor combination" case is distinguishable by
'tapply' with simplify=FALSE.
> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)
> D2 <- D2[-c(1,5), ]
> DN <- D2; DN[1,"N"] <- NA
> with(DN, tapply(N, list(n,L), FUN=sum, simplify=FALSE))  A    B    C    D    E    F
1 NA   6    NULL NULL NULL NULL
2 NULL NULL 3    6    NULL NULL
3 NULL NULL NULL NULL 6    6


There is an old related discussion starting on
https://stat.ethz.ch/pipermail/r-devel/2007-November/047338.html .

----------------------------------
Last week, we've talked here about "xtabs(), factors and NAs",
 ->  https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html

In the mean time, I've spent several hours on the issue
and also committed changes to R-devel "in two iterations".

In the case there is a *Left* hand side part to xtabs() formula,
see the help page example using 'esoph',
it uses  tapply(...,  FUN = sum)   and
I now think there is a missing feature in tapply() there, which
I am proposing to change. 

Look at a small example:
> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),
N=3)[-c(1,5), ]; xtabs(~., D2), , N = 3

   L
n   A B C D E F
  1 1 2 0 0 0 0
  2 0 0 1 2 0 0
  3 0 0 0 0 2 2
> DN <- D2; DN[1,"N"] <- NA; DN   n L  N
2  1 A NA
3  1 B  3
4  1 B  3
6  2 C  3
7  2 D  3
8  2 D  3
9  3 E  3
10 3 E  3
11 3 F  3
12 3 F  3> with(DN, tapply(N, list(n,L), FUN=sum))   A  B  C  D  E  F
1 NA  6 NA NA NA NA
2 NA NA  3  6 NA NA
3 NA NA NA NA  6  6>  
and as you can see, the resulting matrix has NAs, all the same
NA_real_, but semantically of two different kinds:

1) at ["1", "A"], the  NA  comes from the NA in 'N'
2) all other NAs come from the fact that there is no such factor combination
   *and* from the fact that tapply() uses

   array(dim = .., dimnames = ...)

i.e., initializes the array with NAs  (see definition of 'array').

My proposition is the following patch to  tapply(), adding a new
option 'init.value':

-----------------------------------------------------------------------------
 
-tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
+tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify =
TRUE)
 {
     FUN <- if (!is.null(FUN)) match.fun(FUN)
     if (!is.list(INDEX)) INDEX <- list(INDEX)
@@ -44,7 +44,7 @@
     index <- as.logical(lengths(ans))  # equivalently, lengths(ans) > 0L
     ans <- lapply(X = ans[index], FUN = FUN, ...)
     if (simplify && all(lengths(ans) == 1L)) {
-	ansmat <- array(dim = extent, dimnames = namelist)
+	ansmat <- array(init.value, dim = extent, dimnames = namelist)
 	ans <- unlist(ans, recursive = FALSE)
     } else {
 	ansmat <- array(vector("list", prod(extent)),

-----------------------------------------------------------------------------

With that, I can set the initial value to '0' instead of array's
default of NA :
> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))   A B C D E F
1 NA 6 0 0 0 0
2  0 0 3 6 0 0
3  0 0 0 0 6 6> 
which now has 0 counts and NA  as is desirable to be used inside
xtabs().

All fine... and would not be worth a posting to R-devel,
except for this:

The change will not be 100% back compatible -- by necessity: any new argument
for
tapply() will make that argument name not available to be
specified (via '...') for 'FUN'.  The new function would be
> str(tapply)function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)  

where the '...' are passed FUN(),  and with the new signature,
'init.value' then won't be passed to FUN  "anymore"
(compared to
R <= 3.3.x).

For that reason, we could use   'INIT.VALUE' instead (possibly
decreasing
the probability the arg name is used in other functions).


Opinions?

Thank you in advance,
Martin

Martin Maechler

2017-Jan-27 17:01 UTC

head link

[Rd] RFC: tapply(*, ..., init.value = NA)

>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at
r-project.org>
>>>>>     on Fri, 27 Jan 2017 16:36:59 +0000 writes:
    > The "no factor combination" case is distinguishable by
'tapply' with simplify=FALSE.
    >> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),
N=3)
    >> D2 <- D2[-c(1,5), ]
    >> DN <- D2; DN[1,"N"] <- NA
    >> with(DN, tapply(N, list(n,L), FUN=sum, simplify=FALSE))
    > A    B    C    D    E    F
    > 1 NA   6    NULL NULL NULL NULL
    > 2 NULL NULL 3    6    NULL NULL
    > 3 NULL NULL NULL NULL 6    6

Yes, I know that simplify=FALSE  behaves differently, it returns
a list with dim & dimnames, sometimes also called a "list -
matrix"
... and it *can* be used instead, but to be useful needs to be
post processed and that overall is somewhat inefficient and ugly.


    > There is an old related discussion starting on
https://stat.ethz.ch/pipermail/r-devel/2007-November/047338.html .

Thank you, indeed, for finding that. There Andrew Robinson did
raise the same issue, but his proposed solution was not much
back compatible and I think was primarily dismissed because of that.

Martin

    > ----------------------------------
    > Last week, we've talked here about "xtabs(), factors and
NAs",
    -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html

    > In the mean time, I've spent several hours on the issue
    > and also committed changes to R-devel "in two iterations".

    > In the case there is a *Left* hand side part to xtabs() formula,
    > see the help page example using 'esoph',
    > it uses  tapply(...,  FUN = sum)   and
    > I now think there is a missing feature in tapply() there, which
    > I am proposing to change. 

    > Look at a small example:

    >> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),
N=3)[-c(1,5), ]; xtabs(~., D2)
    > , , N = 3

    > L
    > n   A B C D E F
    > 1 1 2 0 0 0 0
    > 2 0 0 1 2 0 0
    > 3 0 0 0 0 2 2

    >> DN <- D2; DN[1,"N"] <- NA; DN
    > n L  N
    > 2  1 A NA
    > 3  1 B  3
    > 4  1 B  3
    > 6  2 C  3
    > 7  2 D  3
    > 8  2 D  3
    > 9  3 E  3
    > 10 3 E  3
    > 11 3 F  3
    > 12 3 F  3
    >> with(DN, tapply(N, list(n,L), FUN=sum))
    > A  B  C  D  E  F
    > 1 NA  6 NA NA NA NA
    > 2 NA NA  3  6 NA NA
    > 3 NA NA NA NA  6  6
    >> 

    > and as you can see, the resulting matrix has NAs, all the same
    > NA_real_, but semantically of two different kinds:

    > 1) at ["1", "A"], the  NA  comes from the NA in
'N'
    > 2) all other NAs come from the fact that there is no such factor
combination
    > *and* from the fact that tapply() uses

    > array(dim = .., dimnames = ...)

    > i.e., initializes the array with NAs  (see definition of
'array').

    > My proposition is the following patch to  tapply(), adding a new
    > option 'init.value':

    >
-----------------------------------------------------------------------------
 
    > -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
    > +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA,
simplify = TRUE)
    > {
    > FUN <- if (!is.null(FUN)) match.fun(FUN)
    > if (!is.list(INDEX)) INDEX <- list(INDEX)
    > @@ -44,7 +44,7 @@
    > index <- as.logical(lengths(ans))  # equivalently, lengths(ans) >
0L
    > ans <- lapply(X = ans[index], FUN = FUN, ...)
    > if (simplify && all(lengths(ans) == 1L)) {
    > -	ansmat <- array(dim = extent, dimnames = namelist)
    > +	ansmat <- array(init.value, dim = extent, dimnames = namelist)
    > ans <- unlist(ans, recursive = FALSE)
    > } else {
    > ansmat <- array(vector("list", prod(extent)),

    >
-----------------------------------------------------------------------------

    > With that, I can set the initial value to '0' instead of
array's
    > default of NA :

    >> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
    > A B C D E F
    > 1 NA 6 0 0 0 0
    > 2  0 0 3 6 0 0
    > 3  0 0 0 0 6 6
    >> 

    > which now has 0 counts and NA  as is desirable to be used inside
    > xtabs().

    > All fine... and would not be worth a posting to R-devel,
    > except for this:

    > The change will not be 100% back compatible -- by necessity: any new
argument for
    > tapply() will make that argument name not available to be
    > specified (via '...') for 'FUN'.  The new function
would be

    >> str(tapply)
    > function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)

    > where the '...' are passed FUN(),  and with the new signature,
    > 'init.value' then won't be passed to FUN 
"anymore" (compared to
    > R <= 3.3.x).

    > For that reason, we could use   'INIT.VALUE' instead (possibly
decreasing
    > the probability the arg name is used in other functions).


    > Opinions?

    > Thank you in advance,
    > Martin

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

Reasonably Related Threads

Search for more maybe matching threads

R devel - Jan 2017 - RFC: tapply(*, ..., init.value = NA)

[Rd] RFC: tapply(*, ..., init.value = NA)

[Rd] RFC: tapply(*, ..., init.value = NA)

Reasonably Related Threads