Martin Maechler
2016-May-10 14:08 UTC
[Rd] complex NA's match(), etc: not back-compatible change proposal
This is an RFC / announcement related to the 2nd part of PR#16885
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885
about complex NA's.

The (somewhat rare) incompatibility in R 3.3.0's match() behavior for the case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0 patched in the meantime} triggered some more comprehensive "research".

I found that we have had a long-standing inconsistency at least between the documented and the real behavior. I am claiming that the documented behavior is desirable and hence R's current "real" behavior is buggy, and I am proposing to change it, in R-devel (to be 3.4.0) for now.

In help(match) we have been saying

 | Exactly what matches what is to some extent a matter of definition.
 | For all types, \code{NA} matches \code{NA} and no other value.
 | For real and complex values, \code{NaN} values are regarded
 | as matching any other \code{NaN} value, but not matching \code{NA}.

for at least 10 years. But we don't do that at all in the complex case (and AFAIK never got a bug report about it).

Also, e.g., print(.) or format(.) do simply use "NA" for all the different complex NA-containing numbers, where OTOH, non-NA NaN's { <=> !is.nan(z) & is.na(z) } in format() or print() do show the NaN in real and/or imaginary parts; for an example, look at the "format" column of the matrix below, after 'print(cbind' ...

The current match()---and duplicated(), unique() which are based on the same C code---*do* distinguish almost all complex NA / NaN's which is NOT according to documentation. I have found that this is just because our hashing function for the complex case, chash() in R/src/main/unique.c, is bogus in the sense that it is not compatible with the above documentation and also not with the cequal() function (in the same file unique.c) for checking equality of complex numbers.

As I have found, a *simplified* version of the chash() function to make it compatible with cequal() does solve all the problems I've indicated, and the current plan is to commit that change --- after some discussion time, here on R-devel --- to the code base.

My change passes 'make check-all' fine, but I'm 100% sure that there will be effects in package-space. ... one reason for this posting.

As mentioned above, note that the chash() function has been in use for all three functions
   match()
   duplicated()
   unique()
and the change will affect all three --- but just for the case of complex vectors with NA or NaN's.

To show more, a small R session -- using my version of R-devel == the proposition:

The R script ('complex-NA-short.R') for (a bit more than) the session is attached {{you can attach text/plain easily}}:

> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
> ##          --- = NA_real_ but that does not exist e.g., in R 2.3.1
> ## similarly, '1L', '2L', .. do not exist e.g., in R 2.3.1
> (z <- z[is.na(z)])
 [1]       NA  NaN+ 0i       NA  NaN+ 1i       NA       NA       NA       NA
 [9]   0+NaNi   1+NaNi       NA NaN+NaNi
> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
+     r <- matrix( , length(x), length(y))
+     for(i in seq(along=x))
+         for(j in seq(along=y))
+             r[i,j] <- identical(x[i], y[j], ...)
+     r
+ }
> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
> ## a version that works in older versions of R, where identical() had fewer arguments!
> outerID.picky <- function(x,y) {
+     nF <- length(formals(identical)) - 2
+     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
+ }
> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is a wild guess
> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
 [1,] | . . . . . . . . . . .
 [2,] . | . . . . . . . . . .
 [3,] . . | . . . . . . . . .
 [4,] . . . | . . . . . . . .
 [5,] . . . . | . . . . . . .
 [6,] . . . . . | . . . . . .
 [7,] . . . . . . | . . . . .
 [8,] . . . . . . . | . . . .
 [9,] . . . . . . . . | . . .
[10,] . . . . . . . . . | . .
[11,] . . . . . . . . . . | .
[12,] . . . . . . . . . . . |
> try(# for older R versions
+     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
+ )
> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
 [1] 1 2 1 2 1 1 1 1 2 2 1 2
> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
      format   Re   Im   mz
 [1,] NA       <NA> 0    1
 [2,] NaN+ 0i  NaN  0    2
 [3,] NA       <NA> 1    1
 [4,] NaN+ 1i  NaN  1    2
 [5,] NA       0    <NA> 1
 [6,] NA       1    <NA> 1
 [7,] NA       <NA> <NA> 1
 [8,] NA       NaN  <NA> 1
 [9,] 0+NaNi   0    NaN  2
[10,] 1+NaNi   1    NaN  2
[11,] NA       <NA> NaN  1
[12,] NaN+NaNi NaN  NaN  2
>

-------------------------------

Note that 'mz <- match(z, z)' and hence the last column of the matrix above are very different in current R, distinguishing most kinds of NA / NaN against the documentation (and the real/numeric case).

Martin Maechler
R Core Team

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: complex-NA-short.R
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20160510/8867c00e/attachment.pl>
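The point about chash() having to agree with cequal() can be illustrated with a small, self-contained C sketch. It is not the code in unique.c: the type `cplx` and the functions `make_na`, `is_na`, `classify`, `cplx_equal`, `cplx_hash` and `check_hash_matches_equality` are hypothetical names made up for illustration. The one R-specific fact it relies on is that NA_real_ is a NaN whose low 32 bits hold 1954. The sketch assumes that a value with an NA in either part counts as "NA", which is what the `mz` column above shows for rows 8 and 11.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for R's complex type; not the real Rcomplex. */
typedef struct { double r, i; } cplx;

/* Stand-in for NA_real_: a NaN whose low 32 bits hold 1954 (0x7A2). */
double make_na(void)
{
    uint64_t bits = UINT64_C(0x7FF00000000007A2);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* True only for the NA flavour of NaN. */
int is_na(double x)
{
    uint64_t bits;
    if (!isnan(x)) return 0;
    memcpy(&bits, &x, sizeof bits);
    return (bits & UINT64_C(0xFFFFFFFF)) == 1954;
}

enum kind { ORDINARY, KIND_NAN, KIND_NA };

/* Assumption of this sketch: an NA in either part wins over a NaN. */
enum kind classify(cplx z)
{
    if (is_na(z.r) || is_na(z.i)) return KIND_NA;
    if (isnan(z.r) || isnan(z.i)) return KIND_NAN;
    return ORDINARY;
}

/* Documented rule: NA matches NA, NaN matches NaN, never each other. */
int cplx_equal(cplx a, cplx b)
{
    enum kind ka = classify(a), kb = classify(b);
    if (ka != kb) return 0;
    if (ka != ORDINARY) return 1;
    return a.r == b.r && a.i == b.i;
}

/* Hash the canonical class, never the raw bits of an NA/NaN value. */
uint64_t cplx_hash(cplx z)
{
    enum kind k = classify(z);
    double re, im;
    uint64_t a, b;
    if (k == KIND_NA)  return 1;
    if (k == KIND_NAN) return 2;
    re = (z.r == 0.0) ? 0.0 : z.r;   /* fold -0.0 into 0.0 */
    im = (z.i == 0.0) ? 0.0 : z.i;
    memcpy(&a, &re, sizeof a);
    memcpy(&b, &im, sizeof b);
    return (a ^ (b << 32 | b >> 32)) * UINT64_C(0x9E3779B97F4A7C15);
}

/* Aborts if two equal values ever receive different hashes. */
void check_hash_matches_equality(const cplx *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (cplx_equal(v[i], v[j]))
                assert(cplx_hash(v[i]) == cplx_hash(v[j]));
}
```

A hash that mixes in the raw bits of the non-missing part (the 0 or 1 next to an NA, say) would send equal values to different buckets, so a hash-table lookup would never get as far as the equality test. That is the kind of mismatch described above.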
Martin Maechler
2016-May-11 08:00 UTC
[Rd] complex NA's match(), etc: not back-compatible change proposal
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 10 May 2016 16:08:39 +0200 writes:

> This is an RFC / announcement related to the 2nd part of PR#16885
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885
> about complex NA's.

> The (somewhat rare) incompatibility in R 3.3.0's match() behavior for the
> case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0
> patched in the meantime} triggered some more comprehensive "research".

> I found that we have had a long-standing inconsistency at least between the
> documented and the real behavior. I am claiming that the documented
> behavior is desirable and hence R's current "real" behavior is buggy, and
> I am proposing to change it, in R-devel (to be 3.4.0) for now.

After the "roaring unanimous" assent (one private msg encouraging me to go forward, no dissenting voice, hence an "odds ratio" of +Inf in favor ;-) I have now committed my proposal to R-devel (svn rev. 70597) and some of us will be seeing the effect in package space within a day or so, in the CRAN checks against R-devel (not for Bioconductor AFAIK; their checks use R-devel only when it is less than ca. 6 months from release).

It's still worthwhile to discuss the issue, if you come late to it, notably as ---paraphrasing Dirk on the R-package-devel list--- the release of 3.4.0 is almost a year away, and so now is the best time to tinker with the API, in other words, consider breaking rarely used legacy APIs.

Martin

> In help(match) we have been saying

>  | Exactly what matches what is to some extent a matter of definition.
>  | For all types, \code{NA} matches \code{NA} and no other value.
>  | For real and complex values, \code{NaN} values are regarded
>  | as matching any other \code{NaN} value, but not matching \code{NA}.

> for at least 10 years. But we don't do that at all in the
> complex case (and AFAIK never got a bug report about it).

> Also, e.g., print(.) or format(.) do simply use "NA" for all
> the different complex NA-containing numbers, where OTOH,
> non-NA NaN's { <=> !is.nan(z) & is.na(z) }
> in format() or print() do show the NaN in real and/or imaginary
> parts; for an example, look at the "format" column of the matrix
> below, after 'print(cbind' ...

> The current match()---and duplicated(), unique() which are based on the same
> C code---*do* distinguish almost all complex NA / NaN's which is
> NOT according to documentation. I have found that this is just because
> our hashing function for the complex case, chash() in R/src/main/unique.c,
> is bogus in the sense that it is not compatible with the above documentation
> and also not with the cequal() function (in the same file unique.c) for checking
> equality of complex numbers.

> As I have found, a *simplified* version of the chash() function
> to make it compatible with cequal() does solve all the problems I've
> indicated, and the current plan is to commit that change --- after some
> discussion time, here on R-devel --- to the code base.

> My change passes 'make check-all' fine, but I'm 100% sure that there will
> be effects in package-space. ... one reason for this posting.

> As mentioned above, note that the chash() function has been in
> use for all three functions
>    match()
>    duplicated()
>    unique()
> and the change will affect all three --- but just for the case of complex
> vectors with NA or NaN's.

> To show more, a small R session -- using my version of R-devel
> == the proposition:

> The R script ('complex-NA-short.R') for (a bit more than) the
> session is attached {{you can attach text/plain easily}}:

> > x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
> > ##          --- = NA_real_ but that does not exist e.g., in R 2.3.1
> > ## similarly, '1L', '2L', .. do not exist e.g., in R 2.3.1
> > (z <- z[is.na(z)])
>  [1]       NA  NaN+ 0i       NA  NaN+ 1i       NA       NA       NA       NA
>  [9]   0+NaNi   1+NaNi       NA NaN+NaNi
> > outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
> +     r <- matrix( , length(x), length(y))
> +     for(i in seq(along=x))
> +         for(j in seq(along=y))
> +             r[i,j] <- identical(x[i], y[j], ...)
> +     r
> + }
> > ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
> > ## a version that works in older versions of R, where identical() had fewer arguments!
> > outerID.picky <- function(x,y) {
> +     nF <- length(formals(identical)) - 2
> +     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
> + }
> > oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is a wild guess
> > symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
>  [1,] | . . . . . . . . . . .
>  [2,] . | . . . . . . . . . .
>  [3,] . . | . . . . . . . . .
>  [4,] . . . | . . . . . . . .
>  [5,] . . . . | . . . . . . .
>  [6,] . . . . . | . . . . . .
>  [7,] . . . . . . | . . . . .
>  [8,] . . . . . . . | . . . .
>  [9,] . . . . . . . . | . . .
> [10,] . . . . . . . . . | . .
> [11,] . . . . . . . . . . | .
> [12,] . . . . . . . . . . . |
> > try(# for older R versions
> +     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
> + )
> > (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
>  [1] 1 2 1 2 1 1 1 1 2 2 1 2
> > zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
> > print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
>       format   Re   Im   mz
>  [1,] NA       <NA> 0    1
>  [2,] NaN+ 0i  NaN  0    2
>  [3,] NA       <NA> 1    1
>  [4,] NaN+ 1i  NaN  1    2
>  [5,] NA       0    <NA> 1
>  [6,] NA       1    <NA> 1
>  [7,] NA       <NA> <NA> 1
>  [8,] NA       NaN  <NA> 1
>  [9,] 0+NaNi   0    NaN  2
> [10,] 1+NaNi   1    NaN  2
> [11,] NA       <NA> NaN  1
> [12,] NaN+NaNi NaN  NaN  2
> >

> -------------------------------

> Note that 'mz <- match(z, z)' and hence the last column of the matrix above
> are very different in current R,
> distinguishing most kinds of NA / NaN against the documentation (and the
> real/numeric case).

> Martin Maechler
> R Core Team

> ### Basically a shortened version of the PR#16885 -- complex part b)
> ### of R/tests/reg-tests-1c.R

> ## b) complex 'x' with different kinds of NaN
> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
> ##          --- = NA_real_ but that does not exist e.g., in R 2.3.1
> ## similarly, '1L', '2L', .. do not exist e.g., in R 2.3.1
> (z <- z[is.na(z)])
> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
>     r <- matrix( , length(x), length(y))
>     for(i in seq(along=x))
>         for(j in seq(along=y))
>             r[i,j] <- identical(x[i], y[j], ...)
>     r
> }
> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
> ## a version that works in older versions of R, where identical() had fewer arguments!
> outerID.picky <- function(x,y) {
>     nF <- length(formals(identical)) - 2
>     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
> }
> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is a wild guess
> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
> try(# for older R versions
>     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
> )
> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
> ## compute match(z[i], z) , for i = 1,2,..,12 :
> (m1z <- sapply(z, match, table = z))
> ##  1  2  1  2  2  2  1  2  2  2  1  2  # R 1.2.3  (2001-04-26)
> ##  1  2  3  4  1  3  7  8  2  4  8  7  # R 1.4.1  (2002-01-30)
> ##  1  2  3  4  1  3  7  8  2  4  8 12  # R 1.5.1  (2002-06-17)
> ##  1  2  3  4  1  3  7  8  2  4  8 12  # R 1.8.1  (2003-11-21)
> ##  1  2  3  4  1  3  7  8  2  4  8 12  # R 2.0.1  (2004-11-15)
> ##  1  2  3  4  1  3  7  4  2  4  4 12  # R 2.1.1  (2005-06-20)
> ##  1  2  3  4  1  3  7  4  2  4  4 12  # R 2.3.1  (2006-06-01)
> ##  1  2  3  4  1  3  7  8  2  4  8 12  # R 2.5.1  (2007-06-27)
> ##  1  2  3  4  1  3  7  4  2  4  4 12  # R 2.10.1 (2009-12-14)
> ##  1  2  3  4  1  3  7  4  2  4  4 12  # R 3.1.1  (2014-07-10)
> ##  1  2  3  4  1  3  7  4  2  4  4 12  # R 3.2.5 -- and 3.3.0 patched
> ##  1  2  1  2  1  1  1  1  2  2  1  2  # <<-- Martin's R-devel and proposed future R
> if(!exists("anyNA", mode="function")) anyNA <- function(x) any(is.na(x))
> stopifnot(apply(zRI, 2, anyNA)) # *all* are NA *or* NaN (or both)
> is.NA <- function(.) is.na(.) & !is.nan(.)
> (iNaN <- apply(zRI, 2, function(.) any(is.nan(.))))
> (iNA <- apply(zRI, 2, function(.) any(is.NA (.)))) # has non-NaN NA's
> ## In Martin's version of R-devel :
> stopifnot(identical(m1z == 1, iNA),
>           identical(m1z == 2, !iNA))
> ## m1z uses match(x, *) with length(x) == 1 and failed in R 3.3.0
> stopifnot(identical(m1z, mz))

> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
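The last row of the version table (the proposed behavior) can be reproduced outside R with a plain linear scan, as a rough cross-check of the rule "a value with an NA in either part matches the first NA-containing value, any other NaN-containing value matches the first NaN-only value". This C sketch is only an illustration under that assumption and does not reflect how match() is implemented. The names `zval`, `na_value`, `part_is_na`, `na_class`, `same_value`, `first_match`, `fill_session_values` and `check_against_session` are hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical complex type for this sketch. */
typedef struct { double re, im; } zval;

/* Stand-in for NA_real_: a NaN whose low 32 bits hold 1954 (0x7A2). */
double na_value(void)
{
    uint64_t bits = UINT64_C(0x7FF00000000007A2);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

int part_is_na(double x)
{
    uint64_t bits;
    if (!isnan(x)) return 0;
    memcpy(&bits, &x, sizeof bits);
    return (bits & UINT64_C(0xFFFFFFFF)) == 1954;
}

/* 2: some part is NA; 1: some part is NaN but none is NA; 0: ordinary. */
int na_class(zval z)
{
    if (part_is_na(z.re) || part_is_na(z.im)) return 2;
    if (isnan(z.re) || isnan(z.im)) return 1;
    return 0;
}

int same_value(zval a, zval b)
{
    int ca = na_class(a), cb = na_class(b);
    if (ca != cb) return 0;
    return ca != 0 || (a.re == b.re && a.im == b.im);
}

/* 1-based position of the first table entry equal to x, 0 if none. */
size_t first_match(zval x, const zval *table, size_t n)
{
    for (size_t j = 0; j < n; j++)
        if (same_value(x, table[j])) return j + 1;
    return 0;
}

/* The 12 values z[is.na(z)] of the session, in R's column-major order. */
void fill_session_values(zval z[12])
{
    const double x0[4] = { 0.0, 1.0, na_value(), NAN };
    size_t k = 0;
    for (size_t j = 0; j < 4; j++)          /* imaginary part, slowest */
        for (size_t i = 0; i < 4; i++) {    /* real part, fastest */
            zval v = { x0[i], x0[j] };
            if (isnan(v.re) || isnan(v.im)) z[k++] = v;
        }
}

/* Compares first_match() with the mz vector printed in the session. */
void check_against_session(void)
{
    static const size_t mz[12] = { 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2 };
    zval z[12];
    fill_session_values(z);
    for (size_t k = 0; k < 12; k++)
        assert(first_match(z[k], z, 12) == mz[k]);
}
```

Calling `check_against_session()` from a `main()` exercises all 12 values. The older rows of the table (e.g. `1 2 3 4 1 3 7 4 2 4 4 12`) cannot be obtained from any two-class rule of this kind, which is consistent with them being a by-product of the old hashing and not of the documented semantics.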