Florent Angly
2017-Jan-25 15:31 UTC
[Rd] Undefined behavior of head() and tail() with n = 0
Hi all,

The documentation for head() and tail() describes the behavior of these generic functions when n is strictly positive (n > 0) and strictly negative (n < 0). How these functions work when given a zero value is not defined.

Both GNU command-line utilities head and tail behave differently with +0 and -0:
http://man7.org/linux/man-pages/man1/head.1.html
http://man7.org/linux/man-pages/man1/tail.1.html

Since R supports signed zeros (1/+0 != 1/-0) and the R head() and tail() functions are modeled after their GNU counterparts, I would expect the R functions to distinguish between +0 and -0:

> tail(1:5, n=0)
integer(0)
> tail(1:5, n=1)
[1] 5
> tail(1:5, n=2)
[1] 4 5

> tail(1:5, n=-2)
[1] 3 4 5
> tail(1:5, n=-1)
[1] 2 3 4 5
> tail(1:5, n=-0)
integer(0)  # expected 1:5

> head(1:5, n=0)
integer(0)
> head(1:5, n=1)
[1] 1
> head(1:5, n=2)
[1] 1 2

> head(1:5, n=-2)
[1] 1 2 3
> head(1:5, n=-1)
[1] 1 2 3 4
> head(1:5, n=-0)
integer(0)  # expected 1:5

For both head() and tail(), I expected 1:5 as output but got integer(0). I obtained similar results using a data.frame and a function as x argument.

An easy fix would be to explicitly state in the documentation what n = 0 does, and that there is no practical difference between -0 and +0. However, in my eyes, the better approach would be to implement support for -0 and document it. What do you think?

Best,

Florent

PS/ My sessionInfo() gives:
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Switzerland.1252
    LC_CTYPE=German_Switzerland.1252
    LC_MONETARY=German_Switzerland.1252
    LC_NUMERIC=C
    LC_TIME=German_Switzerland.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base
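To make the request above concrete, here is a minimal sketch of what "support for -0" could look like if written as a wrapper. The names is_negative_zero(), head_signed() and tail_signed() are hypothetical and are not part of R; the sketch only assumes that a negative zero can be detected through the sign of 1/n, as mentioned in the message.

```r
# Hypothetical helpers for illustration only; none of these exist in R.

# A negative zero is a double that compares equal to 0 but whose
# reciprocal is -Inf.
is_negative_zero <- function(n) {
  is.double(n) && length(n) == 1L && !is.na(n) && n == 0 && 1 / n < 0
}

# "All but the last 0 elements": return x unchanged for n = -0,
# otherwise defer to the regular head().
head_signed <- function(x, n = 6L) {
  if (is_negative_zero(n)) x else utils::head(x, n)
}

# "All but the first 0 elements": same idea for tail().
tail_signed <- function(x, n = 6L) {
  if (is_negative_zero(n)) x else utils::tail(x, n)
}

head_signed(1:5, n = -0)  # every element of 1:5
head_signed(1:5, n = 0)   # no elements, like head(1:5, n = 0)
tail_signed(1:5, n = -0)  # every element of 1:5
```

Note that such a wrapper can only work when n is a double: an integer n carries no sign on zero, so head_signed(1:5, n = -0L) would behave like n = 0.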
Martin Maechler
2017-Jan-26 09:53 UTC
[Rd] Undefined behavior of head() and tail() with n = 0
>>>>> Florent Angly <florent.angly at gmail.com>
>>>>>     on Wed, 25 Jan 2017 16:31:45 +0100 writes:

    > Hi all,
    > The documentation for head() and tail() describes the behavior of
    > these generic functions when n is strictly positive (n > 0) and
    > strictly negative (n < 0). How these functions work when given a zero
    > value is not defined.

    > Both GNU command-line utilities head and tail behave differently with +0 and -0:
    > http://man7.org/linux/man-pages/man1/head.1.html
    > http://man7.org/linux/man-pages/man1/tail.1.html

    > Since R supports signed zeros (1/+0 != 1/-0)

whoa, whoa, .. slow down -- The above is misleading!

Rather read in ?Arithmetic (*the* reference to consult for such issues), where the 2nd part of the following section

 || Implementation limits:
 ||
 ||   [..............]
 ||
 ||   Another potential issue is signed zeroes: on IEC 60559 platforms
 ||   there are two zeroes with internal representations differing by
 ||   sign. Where possible R treats them as the same, but for example
 ||   direct output from C code often does not do so and may output
 ||   '-0.0' (and on Windows whether it does so or not depends on the
 ||   version of Windows). One place in R where the difference might be
 ||   seen is in division by zero: '1/x' is 'Inf' or '-Inf' depending on
 ||   the sign of zero 'x'. Another place is 'identical(0, -0, num.eq =
 ||   FALSE)'.

says the *contrary* ( __Where possible R treats them as the same__ ):
We do _not_ want to distinguish -0 and +0, but there are cases where it is unavoidable.

And there are good reasons (mathematics !!) for this.

I'm pretty sure that it would be quite a mistake to start differentiating it here... but of course we can continue discussing here if you like.
Martin Maechler
ETH Zurich and R Core

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
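The behavior that the ?Arithmetic passage describes can be checked at the console. The following is only a small illustration using base R; the variable names pos and neg are made up for the example.

```r
pos <- 0
neg <- -0

# Where possible, R treats the two zeroes as the same value.
pos == neg           # TRUE
identical(pos, neg)  # TRUE

# The two places named in ?Arithmetic where the sign becomes visible.
1 / pos                              # Inf
1 / neg                              # -Inf
identical(pos, neg, num.eq = FALSE)  # FALSE
```

Because 0 == -0 is TRUE, any code that branches on n == 0, n > 0 or n < 0 cannot tell the two zeroes apart.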
Last week, we talked here about "xtabs(), factors and NAs",
 -> https://stat.ethz.ch/pipermail/r-devel/2017-January/073621.html

In the meantime, I've spent several hours on the issue and also committed changes to R-devel "in two iterations".

In the case where there is a *Left* hand side part to the xtabs() formula (see the help page example using 'esoph'), it uses tapply(..., FUN = sum), and I now think there is a missing feature in tapply() there, which I am proposing to change.

Look at a small example:

> D2 <- data.frame(n = gl(3,4), L = gl(6,2, labels=LETTERS[1:6]), N=3)[-c(1,5), ]; xtabs(~., D2)
, , N = 3

   L
n   A B C D E F
  1 1 2 0 0 0 0
  2 0 0 1 2 0 0
  3 0 0 0 0 2 2

> DN <- D2; DN[1,"N"] <- NA; DN
   n L  N
2  1 A NA
3  1 B  3
4  1 B  3
6  2 C  3
7  2 D  3
8  2 D  3
9  3 E  3
10 3 E  3
11 3 F  3
12 3 F  3
> with(DN, tapply(N, list(n,L), FUN=sum))
   A  B  C  D  E  F
1 NA  6 NA NA NA NA
2 NA NA  3  6 NA NA
3 NA NA NA NA  6  6

and as you can see, the resulting matrix has NAs, all the same NA_real_, but semantically of two different kinds:

1) at ["1", "A"], the NA comes from the NA in 'N'
2) all other NAs come from the fact that there is no such factor combination *and* from the fact that tapply() uses array(dim = .., dimnames = ...), i.e., initializes the array with NAs (see definition of 'array').

My proposition is the following patch to tapply(), adding a new option 'init.value':

-----------------------------------------------------------------------------
-tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
+tapply <- function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)
 {
     FUN <- if (!is.null(FUN)) match.fun(FUN)
     if (!is.list(INDEX)) INDEX <- list(INDEX)
@@ -44,7 +44,7 @@
     index <- as.logical(lengths(ans)) # equivalently, lengths(ans) > 0L
     ans <- lapply(X = ans[index], FUN = FUN, ...)
     if (simplify && all(lengths(ans) == 1L)) {
-        ansmat <- array(dim = extent, dimnames = namelist)
+        ansmat <- array(init.value, dim = extent, dimnames = namelist)
         ans <- unlist(ans, recursive = FALSE)
     } else {
         ansmat <- array(vector("list", prod(extent)),
-----------------------------------------------------------------------------

With that, I can set the initial value to '0' instead of array's default of NA:

> with(DN, tapply(N, list(n,L), FUN=sum, init.value=0))
   A B C D E F
1 NA 6 0 0 0 0
2  0 0 3 6 0 0
3  0 0 0 0 6 6

which now has 0 counts and NA, as is desirable to be used inside xtabs().

All fine... and would not be worth a posting to R-devel, except for this:

The change will not be 100% back compatible -- by necessity: any new argument for tapply() will make that argument name not available to be specified (via '...') for 'FUN'. The new function would be

> str(tapply)
function (X, INDEX, FUN = NULL, ..., init.value = NA, simplify = TRUE)

where the '...' are passed to FUN(), and with the new signature, 'init.value' then won't be passed to FUN "anymore" (compared to R <= 3.3.x).

For that reason, we could use 'INIT.VALUE' instead (possibly decreasing the probability the arg name is used in other functions).

Opinions?

Thank you in advance,
Martin
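For readers who want to try the idea without patching tapply(), here is a rough sketch that emulates the proposed behavior from the outside. The wrapper tapply_init() and the toy functions apply_like() and scale_by() are hypothetical names invented for this illustration; the sketch assumes the simplified (array) result of tapply() and uses table() to find the empty factor combinations.

```r
D2 <- data.frame(n = gl(3, 4), L = gl(6, 2, labels = LETTERS[1:6]), N = 3)[-c(1, 5), ]
DN <- D2
DN[1, "N"] <- NA

# Hypothetical wrapper: fill only the cells that have no observations,
# leaving NAs produced by FUN itself untouched.
tapply_init <- function(X, INDEX, FUN, ..., init.value = NA) {
  if (!is.list(INDEX)) INDEX <- list(INDEX)
  res <- tapply(X, INDEX, FUN, ...)
  empty <- as.vector(table(INDEX)) == 0  # same cell order as 'res'
  res[empty] <- init.value
  res
}

res <- with(DN, tapply_init(N, list(n, L), sum, init.value = 0))
res

# The compatibility concern: a formal argument placed after '...'
# is matched by the outer function and never reaches FUN.
apply_like <- function(x, FUN, ..., init.value = NA) FUN(x, ...)
scale_by <- function(x, init.value = 1) x * init.value
apply_like(2, scale_by, init.value = 10)  # scale_by() falls back to its default
```

In the result, only the ["1", "A"] cell stays NA, because that NA comes from the data, while the cells for missing factor combinations become 0. The last call shows why a less common argument name such as 'INIT.VALUE' lowers the risk of clashing with an argument of FUN.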
William Dunlap
2017-Jan-26 15:51 UTC
[Rd] Undefined behavior of head() and tail() with n = 0
In addition, signed zeroes only exist for floating point numbers - the bit patterns for as.integer(0) and as.integer(-0) are identical.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
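A quick console check of this point, using only base R; the variable names int_pos, int_neg and dbl_neg are made up for the example.

```r
int_pos <- as.integer(0)
int_neg <- as.integer(-0)
dbl_neg <- -0

# The integer zeroes are one and the same value.
identical(int_pos, int_neg)  # TRUE

# Division shows that the sign survives only in the double.
1 / int_neg  # Inf
1 / dbl_neg  # -Inf
```

So once a zero has been coerced to integer, no later code can recover whether it was written as 0 or -0.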