Suharto Anggono Suharto Anggono
2016-May-28 09:34 UTC
[Rd] complex NA's match(), etc: not back-compatible change proposal
On 'factor', I meant the case where 'levels' is not specified, where 'unique' is called.

> factor(c(complex(real=NaN), complex(imaginary=NaN)))
[1] NaN+0i <NA>
Levels: NaN+0i

Look at <NA> in the result above. Yes, it happens in earlier versions of R, too.

On matching both NA and NaN, another consequence is that length(unique(.)) may depend on order. Example using R devel r70604:

> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
> (z <- z[is.na(z)])
 [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
 [9]   0+NaNi   1+NaNi       NA NaN+NaNi
> length(print(unique(z)))
[1]     NA NaN+0i
[1] 2
> length(print(unique(c(z[8], z[-8]))))
[1] NA
[1] 1

--------------------------------------------
On Mon, 23/5/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

 Subject: Re: [Rd] complex NA's match(), etc: not back-compatible change proposal
 Cc: R-devel at r-project.org
 Date: Monday, 23 May, 2016, 11:06 PM

>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>     on Fri, 13 May 2016 16:33:05 +0000 writes:

    > That, for example, complex(real=NaN) and complex(imaginary=NaN) are regarded as equal makes it possible that
    >  length(unique(as.character(x))) > length(unique(x))
    > (current code of function 'factor' doesn't expect it).

Thank you, that is an interesting remark - but is already true,
in [[elided Yahoo spam]]
.. and of course this is because we do *print*   0+NaNi  etc,
i.e., we differentiate the  non-NA-but-NaN complex values in
formatting / printing but not in match(), unique() ...

and indeed, with the  'z'  example below,
  fz <- factor(z,z)
gives a warning about duplicated levels and gives such warnings
also in current (and previous) versions of R, at least for the slightly
larger z  I've used in the tests/reg-tests-1c.R example.

For the moment I can live with that warning, as I don't think
factor()s are constructed from complex numbers "often"...
and the performance of factor() in the more regular cases is important.

    > Yes, an argument for the behavior is that NA and NaN are of one kind.

    > On my system, using 32-bit R for Windows from binary from CRAN, the result of sapply(z, match, table = z) (not in current R-devel) may be different from below:
    > 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.10.1, different from below
    > 1 2 3 4 1 3 7 8 2 4 8 12  # R 3.2.5, different from below

interesting, thank you... and another reason why the change
(currently only in R-devel) may have been a good one: More uniformity.

    > I noticed that, by function 'cequal' in unique.c, a complex number that has both NA and NaN matches NA and also matches NaN.

    >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
    >> (z <- z[is.na(z)])
    >  [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
    >  [9]   0+NaNi   1+NaNi       NA NaN+NaNi
    >> sapply(z, match, table = z[8])
    >  [1] 1 1 1 1 1 1 1 1 1 1 1 1
    >> match(z, z[8])
    >  [1] 1 1 1 1 1 1 1 1 1 1 1 1

Yes, I see the same. But isn't it what we expect:
All of our z[] entries have at least one NA or a NaN in their real
or imaginary part, and since z[8] has both, it does match all
z[]'s, either because of the NA or because of the NaN in common.

Hence, currently, I don't think this needs to be changed...
but if there are other reasons / arguments ...

Thank you again,
Martin Maechler

    >> sessionInfo()
    > R Under development (unstable) (2016-05-12 r70604)
    > Platform: i386-w64-mingw32/i386 (32-bit)
    > Running under: Windows XP (build 2600) Service Pack 2

    > locale:
    > [1] LC_COLLATE=English_United States.1252
    > [2] LC_CTYPE=English_United States.1252
    > [3] LC_MONETARY=English_United States.1252
    > [4] LC_NUMERIC=C
    > [5] LC_TIME=English_United States.1252

    > attached base packages:
    > [1] stats     graphics  grDevices utils     datasets  methods   base

    > -----------------

>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 10 May 2016 16:08:39 +0200 writes:

    >> This is an RFC / announcement related to the 2nd part of PR#16885
    >> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16885
    >> about  complex NA's.

    >> The (somewhat rare) incompatibility in R's 3.3.0 match() behavior for the
    >> case of complex numbers with NA & NaN's {which has been fixed for R 3.3.0
    >> patched in the mean time} triggered some more comprehensive "research".

    >> I found that we have had a long-standing inconsistency at least between the
    >> documented and the real behavior.  I am claiming that the documented
    >> behavior is desirable and hence R's current "real" behavior is buggy, and
    >> I am proposing to change it, in R-devel (to be 3.4.0) for now.

    > After the  "roaring unanimous" assent  (one private msg
    > encouraging me to go forward, no dissenting voice, hence an
    > "odds ratio" of  +Inf  in favor ;-)
    > I have now committed my proposal to R-devel (svn rev. 70597) and
    > some of us will be seeing the effect in package space within a
    > day or so, in the CRAN checks against R-devel (not for
    > Bioconductor AFAIK; their checks use R-devel only when it is less
    > than ca 6 months from release).

    > It's still worthwhile to discuss the issue, if you come late
    > to it, notably as ---paraphrasing Dirk on the R-package-devel list---
    > the release of 3.4.0 is almost a year away, and so now is the
    > best time to tinker with the API, in other words, consider breaking
    > rarely used legacy APIs..

    > Martin

    >> In help(match) we have been saying

    >> |  Exactly what matches what is to some extent a matter of definition.
    >> |  For all types, \code{NA} matches \code{NA} and no other value.
    >> |  For real and complex values, \code{NaN} values are regarded
    >> |  as matching any other \code{NaN} value, but not matching \code{NA}.

    >> for at least 10 years.  But we don't do that at all in the
    >> complex case (and AFAIK never got a bug report about it).

    >> Also, e.g., print(.) or format(.) do simply use  "NA" for all
    >> the different complex NA-containing numbers, where OTOH,
    >> non-NA NaN's { <=>  !is.nan(z) & is.na(z) }
    >> in format() or print() do show the NaN in real and/or imaginary
    >> parts; for an example, look at the "format" column of the matrix
    >> below, after 'print(cbind' ...

    >> The current match()---and duplicated(), unique() which are based on the same
    >> C code---*do* distinguish almost all complex NA / NaN's which is
    >> NOT according to documentation. I have found that this is just because
    >> our hashing function for the complex case, chash() in R/src/main/unique.c,
    >> is bogus in the sense that it is not compatible with the above documentation
    >> and also not with the cequal() function (in the same file unique.c) for checking
    >> equality of complex numbers.

    >> As I have found, a *simplified* version of the chash() function
    >> to make it compatible with cequal() does solve all the problems I've
    >> indicated,  and the current plan is to commit that change --- after some
    >> discussion time, here on R-devel ---  to the code base.

    >> My change passes  'make check-all' fine, but I'm 100% sure that there will
    >> be effects in package-space. ... one reason for this posting.

    >> As mentioned above, note that the chash() function has been in
    >> use for all three functions
    >> match()
    >> duplicated()
    >> unique()
    >> and the change will affect all three --- but just for the case of complex
    >> vectors with NA or NaN's.

    >> To show more, a small R session -- using my version of R-devel
    >> == the proposition:
    >> The R script ('complex-NA-short.R') for (a bit more than) the
    >> session is attached {{you can attach  text/plain easily}}:

    >>> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
    >>> ##           --- = NA_real_  but that does not exist e.g., in R 2.3.1
    >>> ##                   similarly,  '1L', '2L', .. do not exist e.g., in R 2.3.1
    >>> (z <- z[is.na(z)])
    >>  [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
    >>  [9]   0+NaNi   1+NaNi       NA NaN+NaNi
    >>> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
    >> +     r <- matrix( , length(x), length(y))
    >> +     for(i in seq(along=x))
    >> +         for(j in seq(along=y))
    >> +             r[i,j] <- identical(z[i], z[j], ...)
    >> +     r
    >> + }
    >>> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
    >>> ## a version that works in older versions of R, where identical() had fewer arguments!
    >>> outerID.picky <- function(x,y) {
    >> +     nF <- length(formals(identical)) - 2
    >> +     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
    >> + }
    >>> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is  a wild guess
    >>> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
    >>
    >>  [1,] | . . . . . . . . . . .
    >>  [2,] . | . . . . . . . . . .
    >>  [3,] . . | . . . . . . . . .
    >>  [4,] . . . | . . . . . . . .
    >>  [5,] . . . . | . . . . . . .
    >>  [6,] . . . . . | . . . . . .
    >>  [7,] . . . . . . | . . . . .
    >>  [8,] . . . . . . . | . . . .
    >>  [9,] . . . . . . . . | . . .
    >> [10,] . . . . . . . . . | . .
    >> [11,] . . . . . . . . . . | .
    >> [12,] . . . . . . . . . . . |
    >>> try(# for older R versions
    >> + stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
    >> + )
    >>> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
    >>  [1] 1 2 1 2 1 1 1 1 2 2 1 2
    >>> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
    >>> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)
    >>       format   Re   Im   mz
    >>  [1,]       NA <NA> 0    1
    >>  [2,] NaN+  0i NaN  0    2
    >>  [3,]       NA <NA> 1    1
    >>  [4,] NaN+  1i NaN  1    2
    >>  [5,]       NA 0    <NA> 1
    >>  [6,]       NA 1    <NA> 1
    >>  [7,]       NA <NA> <NA> 1
    >>  [8,]       NA NaN  <NA> 1
    >>  [9,]   0+NaNi 0    NaN  2
    >> [10,]   1+NaNi 1    NaN  2
    >> [11,]       NA <NA> NaN  1
    >> [12,] NaN+NaNi NaN  NaN  2
    >>>

    >> -------------------------------

    >> Note that 'mz <- match(z, z)' and hence the last column of the matrix above
    >> are very different in current R,
    >> distinguishing most kinds of NA / NaN  against the documentation (and the
    >> real/numeric case).

    >> Martin Maechler
    >> R Core Team

    >> ### Basically a shortened version of  the PR#16885 -- complex part b)
    >> ### of  R/tests/reg-tests-1c.R

    >> ## b) complex 'x' with different kinds of NaN
    >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
    >> ##           --- = NA_real_  but that does not exist e.g., in R 2.3.1
    >> ##                   similarly,  '1L', '2L', .. do not exist e.g., in R 2.3.1
    >> (z <- z[is.na(z)])
    >> outerID <- function(x,y, ...) { ## ugly; can we get outer() to work ?
    >>     r <- matrix( , length(x), length(y))
    >>     for(i in seq(along=x))
    >>         for(j in seq(along=y))
    >>             r[i,j] <- identical(z[i], z[j], ...)
    >>     r
    >> }
    >> ## Very strictly - in the sense of identical() -- these 12 complex numbers all differ:
    >> ## a version that works in older versions of R, [[elided Yahoo spam]]
    >> outerID.picky <- function(x,y) {
    >>     nF <- length(formals(identical)) - 2
    >>     do.call("outerID", c(list(x, y), as.list(rep(FALSE, nF))))
    >> }
    >> oldR <- !exists("getRversion") || getRversion() < "3.0.0" ## << FIXME: 3.0.0 is  a wild guess
    >> symnum(id.z <- outerID.picky(z,z)) ## == Diagonal matrix [newer versions of R]
    >> try(# for older R versions
    >>     stopifnot(identical(id.z, outerID(z,z)), oldR || identical(id.z, diag(12) == 1))
    >> )

    >> (mz <- match(z, z)) # currently different {NA,NaN} patterns differ - not in print()/format() _FIXME_
    >> zRI <- rbind(Re=Re(z), Im=Im(z)) # and see the pattern :
    >> print(cbind(format = format(z), t(zRI), mz), quote=FALSE)

    >> ## compute  match(z[i], z) , for  i = 1,2,..,12  :
    >> (m1z <- sapply(z, match, table = z))
    >> ## 1 2 1 2 2 2 1 2 2 2 1 2   # R 1.2.3  (2001-04-26)
    >> ## 1 2 3 4 1 3 7 8 2 4 8 7   # R 1.4.1  (2002-01-30)
    >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.5.1  (2002-06-17)
    >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 1.8.1  (2003-11-21)
    >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.0.1  (2004-11-15)
    >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.1.1  (2005-06-20)
    >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.3.1  (2006-06-01)
    >> ## 1 2 3 4 1 3 7 8 2 4 8 12  # R 2.5.1  (2007-06-27)
    >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 2.10.1 (2009-12-14)
    >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.1.1  (2014-07-10)
    >> ## 1 2 3 4 1 3 7 4 2 4 4 12  # R 3.2.5 -- and 3.3.0 patched
    >> ## 1 2 1 2 1 1 1 1 2 2 1 2   # <<-- Martin's R-devel and proposed future R

    >> if(!exists("anyNA", mode="function")) anyNA <- function(x) any(is.na(x))
    >> stopifnot(apply(zRI, 2, anyNA)) # *all* are  NA *or* NaN (or both)
    >> is.NA <- function(.) is.na(.) & !is.nan(.)
    >> (iNaN <- apply(zRI, 2, function(.) any(is.nan(.))))
    >> (iNA <-  apply(zRI, 2, function(.) any(is.NA (.)))) # has non-NaN NA's
    >> ## In Martin's version of R-devel :
    >> stopifnot(identical(m1z == 1, iNA),
    >>           identical(m1z == 2, !iNA))
    >> ## m1z uses match(x, *) with length(x) == 1 and failed in R 3.3.0
    >> stopifnot(identical(m1z, mz))

    >> ______________________________________________
    >> R-devel at r-project.org mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-devel
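
The rule quoted from help(match) in the message above, and the "real/numeric case" it is compared against, can be checked on plain double vectors. The following is a minimal sketch that uses only base R functions on doubles, so it does not depend on how a given R version treats complex values; the variable names are chosen here for illustration.

```r
## Documented rule, checked on plain double vectors (no complex values involved).
x <- c(NA, NaN, NA, NaN)

## NA matches only NA, and NaN matches only NaN:
m <- match(c(NA, NaN), c(NaN, NA))
print(m)                       # 2 1

## NA and NaN never match each other here, so the number of
## distinct values does not depend on the order of the input:
n_fwd <- length(unique(x))
n_rev <- length(unique(rev(x)))
print(c(n_fwd, n_rev))         # 2 2
```

This is the behavior the thread treats as the reference point when discussing what complex vectors should do.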
Martin Maechler
2016-May-30 10:48 UTC
[Rd] complex NA's match(), etc: not back-compatible change proposal
>>>>> Suharto Anggono
>>>>>     on Sat, 28 May 2016 09:34:08 +0000 writes:

    > On 'factor', I meant the case where 'levels' is not
    > specified, where 'unique' is called.

I see, thank you.

    >> factor(c(complex(real=NaN), complex(imaginary=NaN)))
    > [1] NaN+0i <NA>
    > Levels: NaN+0i

    > Look at <NA> in the result above. Yes, it happens in
    > earlier versions of R, too.

Yes; let's call this "problem 1"

    > On matching both NA and NaN, another consequence is that
    > length(unique(.)) may depend on order.
    > Example using R devel r70604:

    >> x0 <- c(0,1, NA, NaN); z <- outer(x0,x0, complex, length.out=1); rm(x0)
    >> (z <- z[is.na(z)])
    >  [1]       NA NaN+  0i       NA NaN+  1i       NA       NA       NA       NA
    >  [9]   0+NaNi   1+NaNi       NA NaN+NaNi
    >> length(print(unique(z)))
    > [1]     NA NaN+0i
    > [1] 2
    >> length(print(unique(c(z[8], z[-8]))))
    > [1] NA
    > [1] 1

    > --------------------------------------------

Thank you, Suharto.
I agree these are even more convincing reasons to consider changing.
Let's call this ("matching both NA and NaN") "problem 2".

I think we agree that R-devel -- compared to previous
versions -- *is* consistent in its (C level) functions cequal() and chash()
and also is consistent with the documentation of match()/unique()/duplicated().

Hence I think a change would have to affect all of the above,
including a change of documentation.

Also, resolution of "problem 1" and "problem 2" are related, but
--I think-- almost separate.
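
"Problem 2" can be reproduced without relying on any particular R version by modelling the matching rule as the thread describes it for R-devel r70604: two NA/NaN-containing values match if both contain an NA, or both contain a NaN. The sketch below is a hypothetical plain-R model, not the C code in unique.c; the names is_NA, has_NA, has_NaN, eq_model and unique_model are invented for illustration, each "complex" value is represented as a numeric pair c(re, im), and a linear first-seen scan stands in for the hash table.

```r
## Hypothetical model (plain R, not the C code in unique.c): every "complex"
## value is a numeric pair c(re, im).
is_NA   <- function(v) is.na(v) & !is.nan(v)   # NA but not NaN
has_NA  <- function(p) any(is_NA(p))
has_NaN <- function(p) any(is.nan(p))

## Equality as described in the thread for R-devel r70604: two values that
## contain NA/NaN match if both contain an NA, or both contain a NaN.
eq_model <- function(a, b) {
  if (!anyNA(a) && !anyNA(b)) return(all(a == b))
  (has_NA(a) && has_NA(b)) || (has_NaN(a) && has_NaN(b))
}

## First-seen scan: keep an element only if it matches nothing kept so far.
unique_model <- function(pairs) {
  kept <- integer(0)
  for (i in seq_along(pairs)) {
    dup <- FALSE
    for (k in kept) if (eq_model(pairs[[i]], pairs[[k]])) { dup <- TRUE; break }
    if (!dup) kept <- c(kept, i)
  }
  pairs[kept]
}

## The 12 values of 'z' from the example above, as (re, im) pairs.
x0 <- c(0, 1, NA, NaN)
re <- rep(x0, times = 4); im <- rep(x0, each = 4)
keep <- is.na(re) | is.na(im)
z <- Map(c, re[keep], im[keep])

n_orig  <- length(unique_model(z))                 # 2
n_moved <- length(unique_model(c(z[8], z[-8])))    # 1
print(c(n_orig, n_moved))
```

In this model z[[1]] (NA only) and z[[2]] (NaN only) do not match each other, yet both match z[[8]], which carries an NA and a NaN. The relation is therefore not transitive, and that is why the count depends on which element is seen first.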

For the following, let's use a vector notation for complex numbers, say

    (a, b) :== complex(real = a, imaginary = b)

With R (showing relevant examples):

##------------------------------------------------------------------------------
options(width = max(85, getOption("width"))) # so 'z' prints in one line
p.z <- function(z) print(noquote(paste0("(",Re(z),",",Im(z),")")))
z <- c(1,NA,NaN); z <- outer(z,z, complex, length.out=1); (z <- z[is.na(z)])
## NA NaN+ 1i NA NA NA 1+NaNi NA NaN+NaNi
p.z(z)
## (NA,1) (NaN,1) (1,NA) (NA,NA) (NaN,NA) (1,NaN) (NA,NaN) (NaN,NaN)

length(p.z(unique(z[ 1:8 ])))
## [1] (NA,1) (NaN,1)
## [1] 2
length(p.z(unique(z[ c(8,1:7) ])))
## [1] (NaN,NaN) (NA,1)
## [1] 2
length(p.z(unique(z[ c(7:8,1:6) ])))
## [1] (NA,NaN)
## [1] 1
##------------------------------------------------------------------------------

Problem 1:

To me, at the moment, it would seem most "natural" to consider a change where the
match()/unique()/duplicated() behavior matched the behavior of
print()/format()/as.character() for such complex vectors.

I think this would automatically solve the issue that sometimes
   length(unique(as.character(x))) > length(unique(x))

There are principally two solutions to this:

 A: change match()/unique()/duplicated()
 B: change print()/format()/as.character()

For A -- which seems "less disruptive" and more desirable to me -- we would
have to change cequal() {and chash()!} and say that complex numbers
with NA|NaN "match" if they have any NA, but otherwise, both the
regular (r,i) and the NaN must be at the exact same places
(and *different* NaNs should match, of course).

Problem 2: unique(z[i]) depends on the permutation 'i'

What should a change be here ... notably after the "proposed"
(rather only "considered") change '1 A' above ?
Can "the" new behavior easily be described in words (if '1 A' above is already assumed)?

At the moment, I would not tackle Problem 2.
It would become less problematic once Problem 1 is solved
according to '1 A', because at least length(unique(.)) would not change:
It would contain *one* z[] with an NA, and all the other z[]s.

Opinions ?

Thank you in advance for chiming in..

Martin Maechler,
ETH Zurich
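
One way to read option '1 A' is as a canonical key per value: anything containing an NA collapses to a single key, and otherwise the key records each part with every NaN written the same way. The sketch below is only an illustration under that reading, not a proposed patch to cequal() or chash(); the names is_NA and key_1A are hypothetical, and values are again represented by separate numeric vectors re and im rather than by complex numbers.

```r
## Hypothetical key function for option '1 A' (illustration only).
is_NA <- function(v) is.na(v) & !is.nan(v)     # NA but not NaN

key_1A <- function(re, im) {
  ## paste0() writes every NaN as "NaN", so different NaNs share a key.
  key <- paste0("(", re, ",", im, ")")
  ## Any value with an NA in either part collapses to the single key "NA".
  key[is_NA(re) | is_NA(im)] <- "NA"
  key
}

## The 12 NA/NaN-containing values built from x0 = c(0, 1, NA, NaN).
x0 <- c(0, 1, NA, NaN)
re <- rep(x0, times = 4); im <- rep(x0, each = 4)
keep <- is.na(re) | is.na(im)
k <- key_1A(re[keep], im[keep])

n_keys     <- length(unique(k))        # 6: "NA" plus five NaN patterns
n_keys_rev <- length(unique(rev(k)))   # 6 again, whatever the order
print(c(n_keys, n_keys_rev))
```

Because equality of keys is an ordinary equivalence relation, the number of distinct keys cannot depend on the order of the input. For these 12 values the six keys correspond to the six distinct strings that format() shows in the thread (NA, NaN+0i, NaN+1i, 0+NaNi, 1+NaNi, NaN+NaNi).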