Ben Bolker
2023-Jun-03 17:06 UTC
[Rd] infelicity in `na.print = ""` for numeric columns of data frames/formatting numeric values
format(c(1:2, NA)) gives the last value as "NA" rather than preserving it as NA, even if na.encode = FALSE (which does the 'expected' thing for character vectors, but not numeric vectors).

This was already brought up in 2008 in https://bugs.r-project.org/show_bug.cgi?id=12318 where Gregor Gorjanc pointed out the issue. Documentation was added and the bug closed as invalid. GG ended with:

> IMHO it would be better that na.encode argument would also have an effect for numeric like vectors. Nearly any function in R returns NA values and I expected the same for format, at least when na.encode=FALSE.

I agree!

I encountered this in the context of printing a data frame with na.print = "", which works as expected when printing the individual columns but not when printing the whole data frame (because print.data.frame calls format.data.frame, which calls format.default ...). Example below.

It's also different from what you would get if you converted to character before formatting and printing:

    print(format(as.character(c(1:2, NA)), na.encode=FALSE), na.print ="")

Everything about this is documented (if you look carefully enough), but IMO it violates the principle of least surprise https://en.wikipedia.org/wiki/Principle_of_least_astonishment , so I would call it at least an 'infelicity' (sensu Bill Venables).

Is there any chance that this design decision could be revisited?
cheers
Ben Bolker

---

Consider

    dd <- data.frame(f = factor(1:2), c = as.character(1:2),
                     n = as.numeric(1:2), i = 1:2)
    dd[3,] <- rep(NA, 4)
    print(dd, na.print = "")

      f c  n  i
    1 1 1  1  1
    2 2 2  2  2
    3     NA NA

This is in fact as documented (see below), but seems suboptimal given that printing the columns separately with na.print = "" would successfully print the NA entries as blank even in the numeric columns:

    invisible(lapply(dd, print, na.print = ""))

    [1] 1 2
    Levels: 1 2
    [1] "1" "2"
    [1] 1 2
    [1] 1 2

* ?print.data.frame documents that it calls format() for each column before printing
* the code of print.data.frame() shows that it calls format.data.frame() with na.encode = FALSE
* ?format.data.frame specifically notes that na.encode "only applies to elements of character vectors, not to numerical, complex nor logical 'NA's, which are always encoded as '"NA"'".

So the NA values in the numeric columns become "NA" rather than remaining as NA values, and are thus printed rather than being affected by the na.print argument.
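The behaviour described above can be checked directly, and one possible workaround is to restore the NA entries after formatting. The following is only a minimal sketch: `print_na_blank()` is a hypothetical helper introduced here for illustration, not part of base R, and it assumes that mirroring the `as.matrix(format(...))` step of print.data.frame() is acceptable for the data frame at hand.

```r
# Data frame from the example above.
dd <- data.frame(f = factor(1:2), c = as.character(1:2),
                 n = as.numeric(1:2), i = 1:2)
dd[3, ] <- rep(NA, 4)

# format() on a numeric vector turns NA into the string "NA" ...
fn <- format(c(1:2, NA), na.encode = FALSE)
is.na(fn)   # no element is NA any more

# ... whereas a character vector keeps its NA when na.encode = FALSE.
fc <- format(as.character(c(1:2, NA)), na.encode = FALSE)
is.na(fc)   # only the last element is NA

# Hypothetical helper (an assumption, not base R): format the data frame
# the way print.data.frame() does, then set every cell that was NA in the
# input back to NA so that na.print applies to all columns.
print_na_blank <- function(x, na.print = "", ...) {
  m <- as.matrix(format(x, na.encode = FALSE))
  m[is.na(x)] <- NA_character_
  print(m, quote = FALSE, right = TRUE, na.print = na.print, ...)
  invisible(x)
}

print_na_blank(dd)
```

Because the formatted strings are computed before the NA cells are blanked, the numeric columns keep the width they would have had with "NA" in them; only the displayed text of the missing cells changes.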
Martin Maechler
2023-Jun-05 13:27 UTC
[Rd] infelicity in `na.print = ""` for numeric columns of data frames/formatting numeric values
>>>>> Ben Bolker
>>>>>     on Sat, 3 Jun 2023 13:06:41 -0400 writes:

> format(c(1:2, NA)) gives the last value as "NA" rather than
> preserving it as NA, even if na.encode = FALSE (which does the
> 'expected' thing for character vectors, but not numeric vectors).

> This was already brought up in 2008 in
> https://bugs.r-project.org/show_bug.cgi?id=12318 where Gregor Gorjanc
> pointed out the issue. Documentation was added and the bug closed as
> invalid. GG ended with:

>> IMHO it would be better that na.encode argument would also have an
> effect for numeric like vectors. Nearly any function in R returns NA
> values and I expected the same for format, at least when na.encode=FALSE.

> I agree!

I do too, at least "in principle", keeping in mind that backward compatibility is also an important principle ...

Not sure if the 'na.encode' argument should matter or possibly a new optional argument, but "in principle" I think that

    format(c(1:2, NA, 4))

should preserve is.na(.) even by default.

> I encountered this in the context of printing a data frame with
> na.print = "", which works as expected when printing the individual
> columns but not when printing the whole data frame (because
> print.data.frame calls format.data.frame, which calls format.default
> ...). Example below.

> It's also different from what you would get if you converted to
> character before formatting and printing:

> print(format(as.character(c(1:2, NA)), na.encode=FALSE), na.print ="")

> Everything about this is documented (if you look carefully enough),
> but IMO it violates the principle of least surprise
> https://en.wikipedia.org/wiki/Principle_of_least_astonishment , so I
> would call it at least an 'infelicity' (sensu Bill Venables)

> Is there any chance that this design decision could be revisited?

We'd have to hear other opinions / gut feelings.
Also, someone (not me) would ideally volunteer to run 'R CMD check <pkg>' for a few 1000 (not necessarily all) CRAN & BioC packages with an accordingly patched version of R-devel (I might volunteer to create such a branch, e.g., a bit before the R Sprint 2023 end of August).

> cheers
> Ben Bolker

> ---

The following issue you are raising may really be a *different* one, as it involves format() and print() methods for "data.frame", i.e.,

    format.data.frame() vs
    print.data.frame()

which is quite a bit related, of course, to how 'numeric' columns are formatted -- as you note yourself below; I vaguely recall that the data.frame method could be an even "harder problem" .. but I don't remember the details.

It may also be that there are no changes necessary to the *.data.frame() methods, and only the documentation (you mention) should be updated ...

Martin

> Consider

> dd <- data.frame(f = factor(1:2), c = as.character(1:2), n =
> as.numeric(1:2), i = 1:2)
> dd[3,] <- rep(NA, 4)
> print(dd, na.print = "")

>   f c  n  i
> 1 1 1  1  1
> 2 2 2  2  2
> 3     NA NA

> This is in fact as documented (see below), but seems suboptimal given
> that printing the columns separately with na.print = "" would
> successfully print the NA entries as blank even in the numeric columns:

> invisible(lapply(dd, print, na.print = ""))
> [1] 1 2
> Levels: 1 2
> [1] "1" "2"
> [1] 1 2
> [1] 1 2

> * ?print.data.frame documents that it calls format() for each column
> before printing
> * the code of print.data.frame() shows that it calls format.data.frame()
> with na.encode = FALSE
> * ?format.data.frame specifically notes that na.encode "only applies to
> elements of character vectors, not to numerical, complex nor logical
> 'NA's, which are always encoded as '"NA"'".

> So the NA values in the numeric columns become "NA" rather than
> remaining as NA values, and are thus printed rather than being affected
> by the na.print argument.
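To make concrete the idea that format(c(1:2, NA, 4)) should preserve is.na(.), here is a rough sketch of the intended semantics done at user level. `format_keep_na()` is a hypothetical wrapper named here for illustration only; it is not a patch to format.default(), and it says nothing about how much existing package code would break if this became the default.

```r
# Hypothetical wrapper (an assumption for illustration, not base R):
# format as usual, then put NA back wherever the input was NA, so that
# is.na() gives the same answer before and after formatting.
format_keep_na <- function(x, ...) {
  out <- format(x, ...)
  out[is.na(x)] <- NA_character_
  out
}

x <- c(1:2, NA, 4)
y <- format_keep_na(x)

# The missing-value pattern of the input survives formatting.
identical(is.na(y), is.na(x))

# na.print now takes effect for the formerly numeric NA.
print(y, quote = FALSE, na.print = "")
```

A change inside format.default() itself would also affect every caller that currently relies on getting the string "NA" back, which is why checking CRAN and BioC packages against a patched R-devel is suggested above.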