thr3ads.net - R devel - [Rd] Undefined behavior of head() and tail() with n = 0 [Jan 2017]


Florent Angly

2017-Jan-25 15:31 UTC

[Rd] Undefined behavior of head() and tail() with n = 0

Hi all,

The documentation for head() and tail() describes the behavior of
these generic functions when n is strictly positive (n > 0) and
strictly negative (n < 0). How these functions work when given a zero
value is not defined.

Both GNU command-line utilities head and tail behave differently with +0 and -0:
http://man7.org/linux/man-pages/man1/head.1.html
http://man7.org/linux/man-pages/man1/tail.1.html

Since R supports signed zeros (1/+0 != 1/-0) and the R head() and
tail() functions are modeled after their GNU counterparts, I would
expect the R functions to distinguish between +0 and -0:

> tail(1:5, n=0)
integer(0)
> tail(1:5, n=1)
[1] 5
> tail(1:5, n=2)
[1] 4 5
> tail(1:5, n=-2)
[1] 3 4 5
> tail(1:5, n=-1)
[1] 2 3 4 5
> tail(1:5, n=-0)
integer(0)  # expected 1:5

> head(1:5, n=0)
integer(0)
> head(1:5, n=1)
[1] 1
> head(1:5, n=2)
[1] 1 2
> head(1:5, n=-2)
[1] 1 2 3
> head(1:5, n=-1)
[1] 1 2 3 4
> head(1:5, n=-0)
integer(0)  # expected 1:5

For both head() and tail(), I expected 1:5 as output but got
integer(0). I obtained similar results using a data.frame and a
function as x argument.
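
The output above is what one would expect if the method only tests the sign of n with an ordinary comparison. A minimal sketch, assuming a simplified default method along the lines of the one in the utils package of R 3.3.x (the name head_sketch is hypothetical and the real method may differ in detail):

```r
# Hypothetical, simplified stand-in for the default head() method.
head_sketch <- function(x, n = 6L) {
  stopifnot(length(n) == 1L)
  # A negative n means "all but the last |n| elements".
  n <- if (n < 0L) max(length(x) + n, 0L) else min(n, length(x))
  x[seq_len(n)]
}

# -0 is not less than 0, so it takes the same branch as +0.
-0 < 0                    # FALSE
head_sketch(1:5, n = -0)  # integer(0), same as n = 0
head_sketch(1:5, n = -2)  # 1 2 3
```

Under this assumption, supporting -0 would need an explicit sign test on the zero (for example via 1/n), not just a change to the comparison.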

An easy fix would be to explicitly state in the documentation what n 0 does, and
that there is no practical difference between -0 and +0.
However, in my eyes, the better approach would be implement support
for -0 and document it. What do you think?

Best,

Florent


PS/ My sessionInfo() gives:
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Switzerland.1252
LC_CTYPE=German_Switzerland.1252
LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
 LC_TIME=German_Switzerland.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Martin Maechler

2017-Jan-26 09:53 UTC


[Rd] Undefined behavior of head() and tail() with n = 0

>>>>> Florent Angly <florent.angly at gmail.com>
>>>>>     on Wed, 25 Jan 2017 16:31:45 +0100 writes:
    > Hi all,
    > The documentation for head() and tail() describes the behavior of
    > these generic functions when n is strictly positive (n > 0) and
    > strictly negative (n < 0). How these functions work when given a zero
    > value is not defined.

    > Both GNU command-line utilities head and tail behave differently with +0 and -0:
    > http://man7.org/linux/man-pages/man1/head.1.html
    > http://man7.org/linux/man-pages/man1/tail.1.html

    > Since R supports signed zeros (1/+0 != 1/-0) 

whoa, whoa, .. slow down --  The above is misleading!

Rather read in  ?Arithmetic (*the* reference to consult for such issues),
where the 2nd part of the following section

 || Implementation limits:
 || 
 ||      [..............]
 || 
 ||      Another potential issue is signed zeroes: on IEC 60559 platforms
 ||      there are two zeroes with internal representations differing by
 ||      sign.  Where possible R treats them as the same, but for example
 ||      direct output from C code often does not do so and may output
 ||      '-0.0' (and on Windows whether it does so or not depends on the
 ||      version of Windows).  One place in R where the difference might be
 ||      seen is in division by zero: '1/x' is 'Inf' or '-Inf' depending on
 ||      the sign of zero 'x'.  Another place is 'identical(0, -0, num.eq =
 ||      FALSE)'.

says the *contrary* ( __Where possible R treats them as the same__ ):
We do _not_ want to distinguish -0 and +0,
but there are cases where it is unavoidable.

And there are good reasons (mathematics !!) for this.

I'm pretty sure that it would be quite a mistake to start
differentiating it here...  but of course we can continue
discussing here if you like.
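
The cases named in the ?Arithmetic excerpt above can be checked directly. A short sketch (the variable names pos and neg are only for illustration):

```r
pos <- 0
neg <- -0

pos == neg                            # TRUE: comparison ignores the sign
identical(pos, neg)                   # TRUE: num.eq = TRUE is the default
identical(pos, neg, num.eq = FALSE)   # FALSE: bitwise comparison
c(1 / pos, 1 / neg)                   # Inf -Inf
```

Only the last two lines expose the sign; ordinary comparisons treat both zeroes as equal.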

Martin Maechler
ETH Zurich and R Core


    > and the R head() and tail() functions are modeled after
    > their GNU counterparts, I would expect the R functions to
    > distinguish between +0 and -0

    >> tail(1:5, n=0)
    > integer(0)
    >> tail(1:5, n=1)
    > [1] 5
    >> tail(1:5, n=2)
    > [1] 4 5

    >> tail(1:5, n=-2)
    > [1] 3 4 5
    >> tail(1:5, n=-1)
    > [1] 2 3 4 5
    >> tail(1:5, n=-0)
    > integer(0)  # expected 1:5

    >> head(1:5, n=0)
    > integer(0)
    >> head(1:5, n=1)
    > [1] 1
    >> head(1:5, n=2)
    > [1] 1 2

    >> head(1:5, n=-2)
    > [1] 1 2 3
    >> head(1:5, n=-1)
    > [1] 1 2 3 4
    >> head(1:5, n=-0)
    > integer(0)  # expected 1:5

    > For both head() and tail(), I expected 1:5 as output but got
    > integer(0). I obtained similar results using a data.frame and a
    > function as x argument.

    > An easy fix would be to explicitly state in the documentation what n = 0
    > does, and that there is no practical difference between -0 and +0.
    > However, in my eyes, the better approach would be implement support
    > for -0 and document it. What do you think?

    > Best,

    > Florent


    > PS/ My sessionInfo() gives:
    > R version 3.3.2 (2016-10-31)
    > Platform: x86_64-w64-mingw32/x64 (64-bit)
    > Running under: Windows 7 x64 (build 7601) Service Pack 1

    > locale:
    > [1] LC_COLLATE=German_Switzerland.1252
    > LC_CTYPE=German_Switzerland.1252
    > LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
    > LC_TIME=German_Switzerland.1252

    > attached base packages:
    > [1] stats     graphics  grDevices utils     datasets  methods   base

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

Martin Maechler

2017-Jan-26 10:42 UTC


[Rd] RFC: tapply(*, ..., init.value = NA)

Last week, we talked here about "xtabs(), factors and NAs",
 ->  https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html

In the meantime, I've spent several hours on the issue
and also committed changes to R-devel "in two iterations".

In the case where there is a *left*-hand side in the xtabs() formula
(see the help page example using 'esoph'),
it uses  tapply(...,  FUN = sum),  and
I now think there is a missing feature in tapply() there, which
I am proposing to add.

Look at a small example:
> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2)
, , N = 3

   L
n   A B C D E F
  1 1 2 0 0 0 0
  2 0 0 1 2 0 0
  3 0 0 0 0 2 2

> DN <- D2; DN[1,"N"] <- NA; DN
   n L  N
2  1 A NA
3  1 B  3
4  1 B  3
6  2 C  3
7  2 D  3
8  2 D  3
9  3 E  3
10 3 E  3
11 3 F  3
12 3 F  3
> with(DN, tapply(N, list(n,L), FUN=sum))
   A  B  C  D  E  F
1 NA  6 NA NA NA NA
2 NA NA  3  6 NA NA
3 NA NA NA NA  6  6

and as you can see, the resulting matrix has NAs, all the same
NA_real_, but semantically of two different kinds:

1) at ["1", "A"], the  NA  comes from the NA in 'N'
2) all other NAs come from the fact that there is no such factor combination
   *and* from the fact that tapply() uses

   array(dim = .., dimnames = ...)

i.e., initializes the array with NAs  (see definition of 'array').
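
The two kinds of NA can be told apart by comparing the tapply() result with the cell counts. A minimal sketch, rebuilding the DN data from the example above (the helper names sums, counts, from_data and from_empty are only for illustration):

```r
DN <- data.frame(n = gl(3, 4), L = gl(6, 2, labels = LETTERS[1:6]), N = 3)[-c(1, 5), ]
DN[1, "N"] <- NA

sums   <- with(DN, tapply(N, list(n, L), FUN = sum))
counts <- unclass(with(DN, table(n, L)))   # rows per (n, L) combination

# Kind 1: the cell has observations, but their sum is NA.
from_data  <- is.na(sums) & counts > 0
# Kind 2: the cell has no observations, so the array() fill value remains.
from_empty <- is.na(sums) & counts == 0

which(from_data, arr.ind = TRUE)   # only cell ["1", "A"]
sum(from_empty)                    # 12 empty combinations
```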

My proposition is the following patch to  tapply(), adding a new
option 'init.value':

-----------------------------------------------------------------------------
 
-tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
+tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
 {
     FUN <- if (!is.null(FUN)) match.fun(FUN)
     if (!is.list(INDEX)) INDEX <- list(INDEX)
@@ -44,7 +44,7 @@
     index <- as.logical(lengths(ans))  # equivalently, lengths(ans) > 0L
     ans <- lapply(X = ans[index], FUN = FUN, ...)
     if (simplify && all(lengths(ans) == 1L)) {
-	ansmat <- array(dim = extent, dimnames = namelist)
+	ansmat <- array(init.value, dim = extent, dimnames = namelist)
 	ans <- unlist(ans, recursive = FALSE)
     } else {
 	ansmat <- array(vector("list", prod(extent)),

-----------------------------------------------------------------------------

With that, I can set the initial value to '0' instead of array's
default of NA :
> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
   A B C D E F
1 NA 6 0 0 0 0
2  0 0 3 6 0 0
3  0 0 0 0 6 6

which now has 0 counts and NA, as is desirable for use inside
xtabs().
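
The same result can be approximated without patching tapply(). A rough sketch, assuming a wrapper that overwrites the cells with no observations after the fact (the name tapply_init is hypothetical, and this is not the proposed patch itself):

```r
# Hypothetical wrapper emulating the proposed 'init.value' argument.
tapply_init <- function(X, INDEX, FUN, ..., init.value = NA) {
  if (!is.list(INDEX)) INDEX <- list(INDEX)
  ans <- tapply(X, INDEX, FUN, ...)
  # Cells without any observation have a zero count in the cross-tabulation.
  empty <- unclass(do.call(table, unname(INDEX))) == 0
  ans[empty] <- init.value
  ans
}

DN <- data.frame(n = gl(3, 4), L = gl(6, 2, labels = LETTERS[1:6]), N = 3)[-c(1, 5), ]
DN[1, "N"] <- NA

res <- with(DN, tapply_init(N, list(n, L), FUN = sum, init.value = 0))
res   # NA only at ["1", "A"]; empty combinations are 0
```

Unlike the patch, this fills the cells after FUN has run, so it costs an extra table() call.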

All fine... and would not be worth a posting to R-devel,
except for this:

The change will not be 100% backward compatible, by necessity: any new
argument for tapply() will make that argument name unavailable to be
specified (via '...') for 'FUN'.  The new function would be

> str(tapply)
function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)

where the '...' are passed to FUN(), and with the new signature,
'init.value' then won't be passed to FUN "anymore" (compared to
R <= 3.3.x).

For that reason, we could use 'INIT.VALUE' instead (possibly decreasing
the probability that the arg name is used in other functions).
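
The compatibility concern can be illustrated with two toy functions standing in for the old and new signatures. A sketch only; old_sig, new_sig and offset_sum are hypothetical names, not part of R:

```r
# A FUN whose own argument happens to be called 'init.value'.
offset_sum <- function(x, init.value = 0) init.value + sum(x)

# Hypothetical stand-ins for the old and the proposed tapply() signatures.
old_sig <- function(X, FUN, ...) FUN(X, ...)
new_sig <- function(X, FUN, ..., init.value = NA) FUN(X, ...)

old_sig(1:3, offset_sum, init.value = 10)  # 16: forwarded to FUN via '...'
new_sig(1:3, offset_sum, init.value = 10)  # 6: captured by the wrapper
```

The second call fails silently: FUN falls back to its own default and no error or warning is raised, which is why a less common name such as 'INIT.VALUE' lowers the risk.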


Opinions?

Thank you in advance,
Martin

William Dunlap

2017-Jan-26 15:51 UTC


[Rd] Undefined behavior of head() and tail() with n = 0

In addition, signed zeroes only exist for floating point numbers - the
bit patterns for as.integer(0) and as.integer(-0) are identical.
Bill Dunlap
TIBCO Software
wdunlap tibco.com
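
The point about integers can be checked in a few lines. A short sketch:

```r
identical(as.integer(0), as.integer(-0))  # TRUE
identical(0L, -0L, num.eq = FALSE)        # TRUE: integers have a single zero

1 / -0    # -Inf: double negative zero keeps its sign
1 / -0L   # Inf: the integer zero carries no sign
```

Any sign on a zero is therefore lost as soon as n is supplied as, or coerced to, an integer.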


