Suharto Anggono Suharto Anggono
2016-Aug-07 15:32 UTC
[Rd] table(exclude = NULL) always includes NA
This is an example from https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .

With R 2.7.2:

> a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> table(a, b, exclude = NULL)
      b
a      1 2
  1    1 1
  2    2 0
  3    1 0
  <NA> 1 0

With R 3.3.1:

> a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> table(a, b, exclude = NULL)
      b
a      1 2 <NA>
  1    1 1    0
  2    2 0    0
  3    1 0    0
  <NA> 1 0    0
> table(a, b, useNA = "ifany")
      b
a      1 2
  1    1 1
  2    2 0
  3    1 0
  <NA> 1 0
> table(a, b, exclude = NULL, useNA = "ifany")
      b
a      1 2 <NA>
  1    1 1    0
  2    2 0    0
  3    1 0    0
  <NA> 1 0    0

For the example, in R 3.3.1, the result of 'table' with exclude = NULL includes NA even if NA is not present. It is different from R 2.7.2, that comes from factor(exclude = NULL), that includes NA only if NA is present.

From R 3.3.1 help on 'table', in "Details" section:
'useNA' controls if the table includes counts of 'NA' values: the allowed values correspond to never, only if the count is positive and even for zero counts. This is overridden by specifying 'exclude = NULL'.

Specifying 'exclude = NULL' overrides 'useNA' to what value? The documentation doesn't say. Looking at the code of function 'table', the value is "always".

For the example, in R 3.3.1, the result like in R 2.7.2 can be obtained with useNA = "ifany" and 'exclude' unspecified.

The result of 'summary' of a logical vector is affected. As mentioned in http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels , in the code of function 'summary.default', for logical, table(object, exclude = NULL) is used.

With R 2.7.2:

> log <- c(NA, logical(4), NA, !logical(2), NA)
> summary(log)
   Mode   FALSE    TRUE    NA's
logical       4       2       3
> summary(log[!is.na(log)])
   Mode   FALSE    TRUE
logical       4       2
> summary(TRUE)
   Mode    TRUE
logical       1

With R 3.3.1:

> log <- c(NA, logical(4), NA, !logical(2), NA)
> summary(log)
   Mode   FALSE    TRUE    NA's
logical       4       2       3
> summary(log[!is.na(log)])
   Mode   FALSE    TRUE    NA's
logical       4       2       0
> summary(TRUE)
   Mode    TRUE    NA's
logical       1       0

In R 3.3.1, "NA's" is always in the result of 'summary' of a logical vector. It is unlike 'summary' of a numeric vector.
On the other hand, in R 3.3.1, FALSE is not in the result of 'summary' of a logical vector that doesn't contain FALSE.

I prefer the result of 'summary' of a logical vector like in R 2.7.2, or, alternatively, the result that always includes all possible values: FALSE, TRUE, NA.
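
A result that does not depend on how 'exclude = NULL' is interpreted can be had by passing 'useNA' explicitly and leaving 'exclude' unspecified, as noted above for useNA = "ifany". The following is only a minimal sketch: the helper name count_logical is hypothetical and merely illustrates the "always include FALSE, TRUE, NA" alternative; it is not how 'summary.default' is implemented.

```r
a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)

# NA category appears only for the variables that actually contain NA
t_ifany  <- table(a, b, useNA = "ifany")
# NA category appears for every variable, even with a zero count
t_always <- table(a, b, useNA = "always")

dim(t_ifany)   # 4 rows (1, 2, 3, NA) by 2 columns (1, 2)
dim(t_always)  # 4 rows by 3 columns (1, 2, NA)

# Hypothetical helper (an assumption for illustration only): counts for a
# logical vector that always report FALSE, TRUE and NA, in that order.
count_logical <- function(x) {
  stopifnot(is.logical(x))
  table(factor(x, levels = c(FALSE, TRUE)), useNA = "always")
}

count_logical(TRUE)
```

Fixing the factor levels up front is what keeps FALSE in the result even when the vector contains no FALSE.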
>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>     on Sun, 7 Aug 2016 15:32:19 +0000 writes:

> This is an example from https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .
>
> With R 2.7.2:
>
> > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > table(a, b, exclude = NULL)
>       b
> a      1 2
>   1    1 1
>   2    2 0
>   3    1 0
>   <NA> 1 0
>
> With R 3.3.1:
>
> > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > table(a, b, exclude = NULL)
>       b
> a      1 2 <NA>
>   1    1 1    0
>   2    2 0    0
>   3    1 0    0
>   <NA> 1 0    0
> > table(a, b, useNA = "ifany")
>       b
> a      1 2
>   1    1 1
>   2    2 0
>   3    1 0
>   <NA> 1 0
> > table(a, b, exclude = NULL, useNA = "ifany")
>       b
> a      1 2 <NA>
>   1    1 1    0
>   2    2 0    0
>   3    1 0    0
>   <NA> 1 0    0
>
> For the example, in R 3.3.1, the result of 'table' with
> exclude = NULL includes NA even if NA is not present. It is
> different from R 2.7.2, that comes from factor(exclude = NULL),
> that includes NA only if NA is present.

I agree that this (R 3.3.1 behavior) seems undesirable and looks
wrong, and the old (<= 2.7.2) behavior for table(a,b, exclude=NULL)
seems desirable to me.

> From R 3.3.1 help on 'table', in "Details" section:
> 'useNA' controls if the table includes counts of 'NA' values: the allowed values correspond to never, only if the count is positive and even for zero counts. This is overridden by specifying 'exclude = NULL'.
>
> Specifying 'exclude = NULL' overrides 'useNA' to what value? The documentation doesn't say. Looking at the code of function 'table', the value is "always".

Yes, it should be documented what happens for this case,
(but read on ...)

> For the example, in R 3.3.1, the result like in R 2.7.2 can be obtained with useNA = "ifany" and 'exclude' unspecified.

Yes. What should we do?
I currently think that we'd want to change the line

    useNA <- if (!missing(exclude) && is.null(exclude)) "always"

to

    useNA <- if (!missing(exclude) && is.null(exclude)) "ifany" # was "always"

which would not even contradict documentation, as indeed you
mentioned above, the exact action here had not been documented.

The change above at least does not break any of the standard R
tests ('make check-all', i.e., including the recommended
packages), which for me confirms that it may be "what is
best"...

----

Thank you for mentioning the important consequence for summary(<logical>).
It helps give insight into what a "probably best" behavior should
be for these cases of table().

Martin Maechler,
ETH Zurich

> The result of 'summary' of a logical vector is affected. As mentioned in http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels , in the code of function 'summary.default', for logical, table(object, exclude = NULL) is used.
>
> With R 2.7.2:
>
> > log <- c(NA, logical(4), NA, !logical(2), NA)
> > summary(log)
>    Mode   FALSE    TRUE    NA's
> logical       4       2       3
> > summary(log[!is.na(log)])
>    Mode   FALSE    TRUE
> logical       4       2
> > summary(TRUE)
>    Mode    TRUE
> logical       1
>
> With R 3.3.1:
>
> > log <- c(NA, logical(4), NA, !logical(2), NA)
> > summary(log)
>    Mode   FALSE    TRUE    NA's
> logical       4       2       3
> > summary(log[!is.na(log)])
>    Mode   FALSE    TRUE    NA's
> logical       4       2       0
> > summary(TRUE)
>    Mode    TRUE    NA's
> logical       1       0
>
> In R 3.3.1, "NA's" is always in the result of 'summary' of a logical vector. It is unlike 'summary' of a numeric vector.
> On the other hand, in R 3.3.1, FALSE is not in the result of 'summary' of a logical vector that doesn't contain FALSE.
>
> I prefer the result of 'summary' of a logical vector like in R 2.7.2, or, alternatively, the result that always includes all possible values: FALSE, TRUE, NA.

I tend to agree, and strongly prefer the 'R(<=2.7.2)'-behavior
for table() {and hence summary(<logical>)}.
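
The effect of the one-line change proposed in this message can be tried in isolation. The sketch below is not the source of 'table': resolve_useNA_old and resolve_useNA_new are hypothetical helpers, and the 'else match.arg(useNA)' branch is an assumption, since the quoted line shows only the 'if' part.

```r
# Hypothetical helpers that mirror only the 'useNA' resolution line.
# The else-branch is an assumption; the message quotes just the 'if' part.
resolve_useNA_old <- function(exclude, useNA = c("no", "ifany", "always")) {
  if (!missing(exclude) && is.null(exclude)) "always"
  else match.arg(useNA)
}

resolve_useNA_new <- function(exclude, useNA = c("no", "ifany", "always")) {
  if (!missing(exclude) && is.null(exclude)) "ifany"  # was "always"
  else match.arg(useNA)
}

resolve_useNA_old(exclude = NULL)                 # "always"
resolve_useNA_new(exclude = NULL)                 # "ifany"
resolve_useNA_new(exclude = NULL, useNA = "no")   # "ifany": exclude = NULL still wins
resolve_useNA_new()                               # "no": default when nothing is given
```

In both variants an explicitly supplied 'useNA' is ignored whenever 'exclude = NULL' is given.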
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 9 Aug 2016 15:35:41 +0200 writes:

>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>     on Sun, 7 Aug 2016 15:32:19 +0000 writes:

> > This is an example from https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .
> >
> > With R 2.7.2:
> >
> > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > > table(a, b, exclude = NULL)
> >       b
> > a      1 2
> >   1    1 1
> >   2    2 0
> >   3    1 0
> >   <NA> 1 0
> >
> > With R 3.3.1:
> >
> > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > > table(a, b, exclude = NULL)
> >       b
> > a      1 2 <NA>
> >   1    1 1    0
> >   2    2 0    0
> >   3    1 0    0
> >   <NA> 1 0    0
> > > table(a, b, useNA = "ifany")
> >       b
> > a      1 2
> >   1    1 1
> >   2    2 0
> >   3    1 0
> >   <NA> 1 0
> > > table(a, b, exclude = NULL, useNA = "ifany")
> >       b
> > a      1 2 <NA>
> >   1    1 1    0
> >   2    2 0    0
> >   3    1 0    0
> >   <NA> 1 0    0
> >
> > For the example, in R 3.3.1, the result of 'table' with
> > exclude = NULL includes NA even if NA is not present. It is
> > different from R 2.7.2, that comes from factor(exclude = NULL),
> > that includes NA only if NA is present.
>
> I agree that this (R 3.3.1 behavior) seems undesirable and looks
> wrong, and the old (<= 2.7.2) behavior for table(a,b,
> exclude=NULL) seems desirable to me.
>
> > From R 3.3.1 help on 'table', in "Details" section:
> > 'useNA' controls if the table includes counts of 'NA' values: the allowed values correspond to never, only if the count is positive and even for zero counts. This is overridden by specifying 'exclude = NULL'.
> >
> > Specifying 'exclude = NULL' overrides 'useNA' to what value? The documentation doesn't say. Looking at the code of function 'table', the value is "always".
>
> Yes, it should be documented what happens for this case,
> (but read on ...)

and it is *not* true that the documentation does not say: since
2013, it has contained

 exclude: levels to remove for all factors in '...'.  If set to 'NULL',
          it implies 'useNA = "always"'.  See 'Details' for its
          interpretation for non-factor arguments.

> > For the example, in R 3.3.1, the result like in R 2.7.2 can be obtained with useNA = "ifany" and 'exclude' unspecified.
>
> Yes. What should we do?
> I currently think that we'd want to change the line
>
>     useNA <- if (!missing(exclude) && is.null(exclude)) "always"
>
> to
>
>     useNA <- if (!missing(exclude) && is.null(exclude)) "ifany" # was "always"
>
> which would not even contradict documentation, as indeed you
> mentioned above, the exact action here had not been documented.

The last part ("which ..") above is wrong, as mentioned earlier.

The above change entails behaviour which looks better to me;
however, the change *is* "against the current documentation",
and after experimentation (a "complete factorial design" of
argument settings), I'm not entirely happy with the result.

One reason is that 'exclude = NULL' and (e.g.) 'exclude = c()'
are (still) handled differently: Under the usual interpretation,
both should mean

   "do not exclude any factor entries (and levels) from tabulation"

but one of the two changes the default of 'useNA' and the other
does not.

If we want a change anyway (and have to update the doc), it could
be "more logical" to replace the line above by

    useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"

notably, replacing 'useNA' *only* if it has not been specified,
which seems much closer to "typically expected" behavior.

> The change above at least does not break any of the standard R
> tests ('make check-all', i.e., including the recommended
> packages), which for me confirms that it may be "what is
> best"...
>
> ----
>
> Thank you for mentioning the important consequence for summary(<logical>).
> It helps give insight into what a "probably best" behavior should
> be for these cases of table().
>
> Martin Maechler,
> ETH Zurich
>
> > The result of 'summary' of a logical vector is affected.
> > As mentioned in http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels , in the code of function 'summary.default', for logical, table(object, exclude = NULL) is used.
> >
> > With R 2.7.2:
> >
> > > log <- c(NA, logical(4), NA, !logical(2), NA)
> > > summary(log)
> >    Mode   FALSE    TRUE    NA's
> > logical       4       2       3
> > > summary(log[!is.na(log)])
> >    Mode   FALSE    TRUE
> > logical       4       2
> > > summary(TRUE)
> >    Mode    TRUE
> > logical       1
> >
> > With R 3.3.1:
> >
> > > log <- c(NA, logical(4), NA, !logical(2), NA)
> > > summary(log)
> >    Mode   FALSE    TRUE    NA's
> > logical       4       2       3
> > > summary(log[!is.na(log)])
> >    Mode   FALSE    TRUE    NA's
> > logical       4       2       0
> > > summary(TRUE)
> >    Mode    TRUE    NA's
> > logical       1       0
> >
> > In R 3.3.1, "NA's" is always in the result of 'summary' of a logical vector. It is unlike 'summary' of a numeric vector.
> > On the other hand, in R 3.3.1, FALSE is not in the result of 'summary' of a logical vector that doesn't contain FALSE.
> >
> > I prefer the result of 'summary' of a logical vector like in R 2.7.2, or, alternatively, the result that always includes all possible values: FALSE, TRUE, NA.
>
> I tend to agree, and strongly prefer the 'R(<=2.7.2)'-behavior
> for table() {and hence summary(<logical>)}.
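
The second rule suggested in this message, which replaces 'useNA' only when it has not been specified, can be explored the same way. This is a sketch under stated assumptions rather than the code of 'table': resolve_useNA_alt is a hypothetical helper, and the 'else match.arg(useNA)' branch is assumed because the quoted line shows only the 'if' part. Note that in R 'c()' itself evaluates to NULL, so character(0) is used here as an example of an empty 'exclude' that is not NULL.

```r
# Hypothetical helper mirroring the alternative 'useNA' resolution line.
# The else-branch is an assumption; the message quotes just the 'if' part.
resolve_useNA_alt <- function(exclude, useNA = c("no", "ifany", "always")) {
  if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"
  else match.arg(useNA)
}

resolve_useNA_alt(exclude = NULL)                    # "always"
resolve_useNA_alt(exclude = character(0))            # "always": same as NULL
resolve_useNA_alt(exclude = NULL, useNA = "ifany")   # "ifany": explicit useNA is kept
resolve_useNA_alt(exclude = c(NA, NaN))              # "no": NA is to be excluded
resolve_useNA_alt()                                  # "no": default when nothing is given
```

Under this rule every 'exclude' that does not contain NA is treated alike, and an explicit 'useNA' is never overridden.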