thr3ads.net - R devel - [Rd] RFC: tapply(*, ..., init.value = NA) [Jan 2017]

If this information is useful, please help other people find it:
Share via:

Martin Maechler

2017-Jan-26 10:42 UTC

[Rd] RFC: tapply(*, ..., init.value = NA)

Last week, we've talked here about "xtabs(), factors and NAs",
 ->  https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html

In the mean time, I've spent several hours on the issue
and also committed changes to R-devel "in two iterations".

In the case there is a *Left* hand side part to xtabs() formula,
see the help page example using 'esoph',
it uses  tapply(...,  FUN = sum)   and
I now think there is a missing feature in tapply() there, which
I am proposing to change. 

Look at a small example:
> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),
N=3)[-c(1,5), ]; xtabs(~., D2), , N = 3

   L
n   A B C D E F
  1 1 2 0 0 0 0
  2 0 0 1 2 0 0
  3 0 0 0 0 2 2
> DN <- D2; DN[1,"N"] <- NA; DN   n L  N
2  1 A NA
3  1 B  3
4  1 B  3
6  2 C  3
7  2 D  3
8  2 D  3
9  3 E  3
10 3 E  3
11 3 F  3
12 3 F  3> with(DN, tapply(N, list(n,L), FUN=sum))   A  B  C  D  E  F
1 NA  6 NA NA NA NA
2 NA NA  3  6 NA NA
3 NA NA NA NA  6  6>  
and as you can see, the resulting matrix has NAs, all the same
NA_real_, but semantically of two different kinds:

1) at ["1", "A"], the  NA  comes from the NA in 'N'
2) all other NAs come from the fact that there is no such factor combination
   *and* from the fact that tapply() uses

   array(dim = .., dimnames = ...)

i.e., initializes the array with NAs  (see definition of 'array').

My proposition is the following patch to  tapply(), adding a new
option 'init.value':

-----------------------------------------------------------------------------
 
-tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
+tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify =
TRUE)
 {
     FUN <- if (!is.null(FUN)) match.fun(FUN)
     if (!is.list(INDEX)) INDEX <- list(INDEX)
@@ -44,7 +44,7 @@
     index <- as.logical(lengths(ans))  # equivalently, lengths(ans) > 0L
     ans <- lapply(X = ans[index], FUN = FUN, ...)
     if (simplify && all(lengths(ans) == 1L)) {
-	ansmat <- array(dim = extent, dimnames = namelist)
+	ansmat <- array(init.value, dim = extent, dimnames = namelist)
 	ans <- unlist(ans, recursive = FALSE)
     } else {
 	ansmat <- array(vector("list", prod(extent)),

-----------------------------------------------------------------------------

With that, I can set the initial value to '0' instead of array's
default of NA :
> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))   A B C D E F
1 NA 6 0 0 0 0
2  0 0 3 6 0 0
3  0 0 0 0 6 6> 
which now has 0 counts and NA  as is desirable to be used inside
xtabs().

All fine... and would not be worth a posting to R-devel,
except for this:

The change will not be 100% back compatible -- by necessity: any new argument
for
tapply() will make that argument name not available to be
specified (via '...') for 'FUN'.  The new function would be
> str(tapply)function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)  

where the '...' are passed FUN(),  and with the new signature,
'init.value' then won't be passed to FUN  "anymore"
(compared to
R <= 3.3.x).

For that reason, we could use   'INIT.VALUE' instead (possibly
decreasing
the probability the arg name is used in other functions).


Opinions?

Thank you in advance,
Martin

William Dunlap

2017-Jan-26 15:47 UTC

head link

[Rd] RFC: tapply(*, ..., init.value = NA)

It would be cool if the default for tapply's init.value could be
FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for
FUN=all, -Inf for FUN=max, etc.  But that would take time and would
break code for which FUN did not work on length-0 objects.
Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, Jan 26, 2017 at 2:42 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:> Last week, we've talked here about "xtabs(), factors and
NAs",
>  ->  https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html
>
> In the mean time, I've spent several hours on the issue
> and also committed changes to R-devel "in two iterations".
>
> In the case there is a *Left* hand side part to xtabs() formula,
> see the help page example using 'esoph',
> it uses  tapply(...,  FUN = sum)   and
> I now think there is a missing feature in tapply() there, which
> I am proposing to change.
>
> Look at a small example:
>
>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),
N=3)[-c(1,5), ]; xtabs(~., D2)
> , , N = 3
>
>    L
> n   A B C D E F
>   1 1 2 0 0 0 0
>   2 0 0 1 2 0 0
>   3 0 0 0 0 2 2
>
>> DN <- D2; DN[1,"N"] <- NA; DN
>    n L  N
> 2  1 A NA
> 3  1 B  3
> 4  1 B  3
> 6  2 C  3
> 7  2 D  3
> 8  2 D  3
> 9  3 E  3
> 10 3 E  3
> 11 3 F  3
> 12 3 F  3
>> with(DN, tapply(N, list(n,L), FUN=sum))
>    A  B  C  D  E  F
> 1 NA  6 NA NA NA NA
> 2 NA NA  3  6 NA NA
> 3 NA NA NA NA  6  6
>>
>
> and as you can see, the resulting matrix has NAs, all the same
> NA_real_, but semantically of two different kinds:
>
> 1) at ["1", "A"], the  NA  comes from the NA in
'N'
> 2) all other NAs come from the fact that there is no such factor
combination
>    *and* from the fact that tapply() uses
>
>    array(dim = .., dimnames = ...)
>
> i.e., initializes the array with NAs  (see definition of 'array').
>
> My proposition is the following patch to  tapply(), adding a new
> option 'init.value':
>
>
-----------------------------------------------------------------------------
>
> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA,
simplify = TRUE)
>  {
>      FUN <- if (!is.null(FUN)) match.fun(FUN)
>      if (!is.list(INDEX)) INDEX <- list(INDEX)
> @@ -44,7 +44,7 @@
>      index <- as.logical(lengths(ans))  # equivalently, lengths(ans)
> 0L
>      ans <- lapply(X = ans[index], FUN = FUN, ...)
>      if (simplify && all(lengths(ans) == 1L)) {
> -       ansmat <- array(dim = extent, dimnames = namelist)
> +       ansmat <- array(init.value, dim = extent, dimnames = namelist)
>         ans <- unlist(ans, recursive = FALSE)
>      } else {
>         ansmat <- array(vector("list", prod(extent)),
>
>
-----------------------------------------------------------------------------
>
> With that, I can set the initial value to '0' instead of
array's
> default of NA :
>
>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
>    A B C D E F
> 1 NA 6 0 0 0 0
> 2  0 0 3 6 0 0
> 3  0 0 0 0 6 6
>>
>
> which now has 0 counts and NA  as is desirable to be used inside
> xtabs().
>
> All fine... and would not be worth a posting to R-devel,
> except for this:
>
> The change will not be 100% back compatible -- by necessity: any new
argument for
> tapply() will make that argument name not available to be
> specified (via '...') for 'FUN'.  The new function would be
>
>> str(tapply)
> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
>
> where the '...' are passed FUN(),  and with the new signature,
> 'init.value' then won't be passed to FUN  "anymore"
(compared to
> R <= 3.3.x).
>
> For that reason, we could use   'INIT.VALUE' instead (possibly
decreasing
> the probability the arg name is used in other functions).
>
>
> Opinions?
>
> Thank you in advance,
> Martin
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Henrik Bengtsson

2017-Jan-26 15:57 UTC

head link

[Rd] RFC: tapply(*, ..., init.value = NA)

On a related note, the storage mode should try to match ans[[1]] (or
unlist:ed and) when allocating 'ansmat' to avoid coercion and hence a
full
copy.

Henrik


On Jan 26, 2017 07:50, "William Dunlap via R-devel" <r-devel at
r-project.org>
wrote:

It would be cool if the default for tapply's init.value could be
FUN(X[0]), so it would be 0 for FUN=sum or FUN=length, TRUE for
FUN=all, -Inf for FUN=max, etc.  But that would take time and would
break code for which FUN did not work on length-0 objects.
Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, Jan 26, 2017 at 2:42 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:> Last week, we've talked here about "xtabs(), factors and
NAs",
>  ->  https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html
>
> In the mean time, I've spent several hours on the issue
> and also committed changes to R-devel "in two iterations".
>
> In the case there is a *Left* hand side part to xtabs() formula,
> see the help page example using 'esoph',
> it uses  tapply(...,  FUN = sum)   and
> I now think there is a missing feature in tapply() there, which
> I am proposing to change.
>
> Look at a small example:
>
>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),
N=3)[-c(1,5), ]; xtabs(~., D2)> , , N = 3
>
>    L
> n   A B C D E F
>   1 1 2 0 0 0 0
>   2 0 0 1 2 0 0
>   3 0 0 0 0 2 2
>
>> DN <- D2; DN[1,"N"] <- NA; DN
>    n L  N
> 2  1 A NA
> 3  1 B  3
> 4  1 B  3
> 6  2 C  3
> 7  2 D  3
> 8  2 D  3
> 9  3 E  3
> 10 3 E  3
> 11 3 F  3
> 12 3 F  3
>> with(DN, tapply(N, list(n,L), FUN=sum))
>    A  B  C  D  E  F
> 1 NA  6 NA NA NA NA
> 2 NA NA  3  6 NA NA
> 3 NA NA NA NA  6  6
>>
>
> and as you can see, the resulting matrix has NAs, all the same
> NA_real_, but semantically of two different kinds:
>
> 1) at ["1", "A"], the  NA  comes from the NA in
'N'
> 2) all other NAs come from the fact that there is no such factor
combination>    *and* from the fact that tapply() uses
>
>    array(dim = .., dimnames = ...)
>
> i.e., initializes the array with NAs  (see definition of 'array').
>
> My proposition is the following patch to  tapply(), adding a new
> option 'init.value':
>
> ------------------------------------------------------------
----------------->
> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA,
simplify
= TRUE)>  {
>      FUN <- if (!is.null(FUN)) match.fun(FUN)
>      if (!is.list(INDEX)) INDEX <- list(INDEX)
> @@ -44,7 +44,7 @@
>      index <- as.logical(lengths(ans))  # equivalently, lengths(ans)
> 0L
>      ans <- lapply(X = ans[index], FUN = FUN, ...)
>      if (simplify && all(lengths(ans) == 1L)) {
> -       ansmat <- array(dim = extent, dimnames = namelist)
> +       ansmat <- array(init.value, dim = extent, dimnames = namelist)
>         ans <- unlist(ans, recursive = FALSE)
>      } else {
>         ansmat <- array(vector("list", prod(extent)),
>
> ------------------------------------------------------------
----------------->
> With that, I can set the initial value to '0' instead of
array's
> default of NA :
>
>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
>    A B C D E F
> 1 NA 6 0 0 0 0
> 2  0 0 3 6 0 0
> 3  0 0 0 0 6 6
>>
>
> which now has 0 counts and NA  as is desirable to be used inside
> xtabs().
>
> All fine... and would not be worth a posting to R-devel,
> except for this:
>
> The change will not be 100% back compatible -- by necessity: any new
argument for> tapply() will make that argument name not available to be
> specified (via '...') for 'FUN'.  The new function would be
>
>> str(tapply)
> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
>
> where the '...' are passed FUN(),  and with the new signature,
> 'init.value' then won't be passed to FUN  "anymore"
(compared to
> R <= 3.3.x).
>
> For that reason, we could use   'INIT.VALUE' instead (possibly
decreasing
> the probability the arg name is used in other functions).
>
>
> Opinions?
>
> Thank you in advance,
> Martin
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

	[[alternative HTML version deleted]]

Gabor Grothendieck

2017-Jan-27 15:20 UTC

head link

[Rd] RFC: tapply(*, ..., init.value = NA)

If xtabs is enhanced then as.data.frame.table may also need to be
modified so that it continues to be usable as an inverse, at least to
the degree feasible.


On Thu, Jan 26, 2017 at 5:42 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:> Last week, we've talked here about "xtabs(), factors and
NAs",
>  ->  https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html
>
> In the mean time, I've spent several hours on the issue
> and also committed changes to R-devel "in two iterations".
>
> In the case there is a *Left* hand side part to xtabs() formula,
> see the help page example using 'esoph',
> it uses  tapply(...,  FUN = sum)   and
> I now think there is a missing feature in tapply() there, which
> I am proposing to change.
>
> Look at a small example:
>
>> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]),
N=3)[-c(1,5), ]; xtabs(~., D2)
> , , N = 3
>
>    L
> n   A B C D E F
>   1 1 2 0 0 0 0
>   2 0 0 1 2 0 0
>   3 0 0 0 0 2 2
>
>> DN <- D2; DN[1,"N"] <- NA; DN
>    n L  N
> 2  1 A NA
> 3  1 B  3
> 4  1 B  3
> 6  2 C  3
> 7  2 D  3
> 8  2 D  3
> 9  3 E  3
> 10 3 E  3
> 11 3 F  3
> 12 3 F  3
>> with(DN, tapply(N, list(n,L), FUN=sum))
>    A  B  C  D  E  F
> 1 NA  6 NA NA NA NA
> 2 NA NA  3  6 NA NA
> 3 NA NA NA NA  6  6
>>
>
> and as you can see, the resulting matrix has NAs, all the same
> NA_real_, but semantically of two different kinds:
>
> 1) at ["1", "A"], the  NA  comes from the NA in
'N'
> 2) all other NAs come from the fact that there is no such factor
combination
>    *and* from the fact that tapply() uses
>
>    array(dim = .., dimnames = ...)
>
> i.e., initializes the array with NAs  (see definition of 'array').
>
> My proposition is the following patch to  tapply(), adding a new
> option 'init.value':
>
>
-----------------------------------------------------------------------------
>
> -tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
> +tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA,
simplify = TRUE)
>  {
>      FUN <- if (!is.null(FUN)) match.fun(FUN)
>      if (!is.list(INDEX)) INDEX <- list(INDEX)
> @@ -44,7 +44,7 @@
>      index <- as.logical(lengths(ans))  # equivalently, lengths(ans)
> 0L
>      ans <- lapply(X = ans[index], FUN = FUN, ...)
>      if (simplify && all(lengths(ans) == 1L)) {
> -       ansmat <- array(dim = extent, dimnames = namelist)
> +       ansmat <- array(init.value, dim = extent, dimnames = namelist)
>         ans <- unlist(ans, recursive = FALSE)
>      } else {
>         ansmat <- array(vector("list", prod(extent)),
>
>
-----------------------------------------------------------------------------
>
> With that, I can set the initial value to '0' instead of
array's
> default of NA :
>
>> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
>    A B C D E F
> 1 NA 6 0 0 0 0
> 2  0 0 3 6 0 0
> 3  0 0 0 0 6 6
>>
>
> which now has 0 counts and NA  as is desirable to be used inside
> xtabs().
>
> All fine... and would not be worth a posting to R-devel,
> except for this:
>
> The change will not be 100% back compatible -- by necessity: any new
argument for
> tapply() will make that argument name not available to be
> specified (via '...') for 'FUN'.  The new function would be
>
>> str(tapply)
> function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
>
> where the '...' are passed FUN(),  and with the new signature,
> 'init.value' then won't be passed to FUN  "anymore"
(compared to
> R <= 3.3.x).
>
> For that reason, we could use   'INIT.VALUE' instead (possibly
decreasing
> the probability the arg name is used in other functions).
>
>
> Opinions?
>
> Thank you in advance,
> Martin
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Apparently Analagous Threads

Search for more possibly parallel threads

R devel - Jan 2017 - RFC: tapply(*, ..., init.value = NA)

[Rd] RFC: tapply(*, ..., init.value = NA)

[Rd] RFC: tapply(*, ..., init.value = NA)

[Rd] RFC: tapply(*, ..., init.value = NA)

[Rd] RFC: tapply(*, ..., init.value = NA)

Apparently Analagous Threads