Suharto Anggono Suharto Anggono
2016-Aug-11 16:19 UTC
[Rd] table(exclude = NULL) always includes NA
I stand corrected. The part "If set to 'NULL', it implies 'useNA="always"'." is even in the documentation in R 2.8.0. It was my fault not to check carefully.

I wonder, why "always" was chosen for 'useNA' for exclude = NULL.

Why exclude = NULL is so special? What about another 'exclude' of length zero, like character(0) (not c(), because c() is NULL)? I thought that, too. But then, I have no opinion about making it general.

It fits my expectation to override 'useNA' only if the user doesn't explicitly specify 'useNA'.

Thank you for looking into this.

My points:
Could R 2.7.2 behavior of table(<non-factor>, exclude = NULL) be brought back? But R 3.3.1 behavior is in R since version 2.8.0, rather long.
If not, I suggest changing summary(<logical>).

--------------------------------------------
On Thu, 11/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

 Subject: Re: [Rd] table(exclude = NULL) always includes NA
 @r-project.org
 Cc: "Martin Maechler" <maechler at stat.math.ethz.ch>
 Date: Thursday, 11 August, 2016, 12:39 AM

>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Tue, 9 Aug 2016 15:35:41 +0200 writes:

>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>     on Sun, 7 Aug 2016 15:32:19 +0000 writes:

> > This is an example from https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .
> >
> > With R 2.7.2:
> >
> > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > > table(a, b, exclude = NULL)
> >       b
> > a      1 2
> >   1    1 1
> >   2    2 0
> >   3    1 0
> >   <NA> 1 0
> >
> > With R 3.3.1:
> >
> > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
> > > table(a, b, exclude = NULL)
> >       b
> > a      1 2 <NA>
> >   1    1 1    0
> >   2    2 0    0
> >   3    1 0    0
> >   <NA> 1 0    0
> > > table(a, b, useNA = "ifany")
> >       b
> > a      1 2
> >   1    1 1
> >   2    2 0
> >   3    1 0
> >   <NA> 1 0
> > > table(a, b, exclude = NULL, useNA = "ifany")
> >       b
> > a      1 2 <NA>
> >   1    1 1    0
> >   2    2 0    0
> >   3    1 0    0
> >   <NA> 1 0    0
> >
> > For the example, in R 3.3.1, the result of 'table' with
> > exclude = NULL includes NA even if NA is not present. It is
> > different from R 2.7.2, that comes from factor(exclude = NULL),
> > that includes NA only if NA is present.
>
> I agree that this (R 3.3.1 behavior) seems undesirable and looks
> wrong, and the old (<= 2.7.2) behavior for table(a,b,
> exclude=NULL) seems desirable to me.
>
> >
> > From R 3.3.1 help on 'table', in "Details" section:
> > 'useNA' controls if the table includes counts of 'NA' values: the allowed values correspond to never, only if the count is positive and even for zero counts. This is overridden by specifying 'exclude = NULL'.
> >
> > Specifying 'exclude = NULL' overrides 'useNA' to what value? The documentation doesn't say. Looking at the code of function 'table', the value is "always".
>
> Yes, it should be documented what happens for this case,
> (but read on ...)

and it is *not* true that the documentation does not say, since
2013, it has contained

 exclude: levels to remove for all factors in '...'.  If set to 'NULL',
          it implies 'useNA = "always"'.  See 'Details' for its
          interpretation for non-factor arguments.

> > For the example, in R 3.3.1, the result like in R 2.7.2 can be obtained with useNA = "ifany" and 'exclude' unspecified.
>
> Yes.  What should we do?
> I currently think that we'd want to change the line
>
>      useNA <- if (!missing(exclude) && is.null(exclude)) "always"
>
> to
>
>      useNA <- if (!missing(exclude) && is.null(exclude)) "ifany" # was "always"
>
> which would not even contradict documentation, as indeed you
> mentioned above, the exact action here had not been documented.

The last part ("which ..") above is wrong, as mentioned earlier.

The above change entails behaviour which looks better to me;
however, the change *is* "against the current documentation".
And after experimentation (a "complete factorial design" of
argument settings), I'm not entirely happy with the result.
One reason is that 'exclude = NULL' and (e.g.) 'exclude = c()'
are (still) handled differently: from a usual interpretation,
both should mean
   "do not exclude any factor entries (and levels) from tabulation"
but one of the two changes the default of 'useNA' and the other
does not. If we want a change anyway (and have to update the doc),
it could be "more logical" to replace the line above by

     useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"

notably, replacing 'useNA' *only* if it has not been specified,
which seems much closer to "typically expected" behavior.

> The change above at least does not break any of the standard R
> tests ('make check-all', i.e., including the recommended
> packages), which for me confirms that it may be "what is
> best"...
>
> ----
>
> Thank you for mentioning the important consequence for summary(<logical>).
> It can help give insight into what a "probably best" behavior should
> be for these cases of table().
>
> Martin Maechler,
> ETH Zurich
>
> > The result of 'summary' of a logical vector is affected. As mentioned in http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels , in the code of function 'summary.default', for logical, table(object, exclude = NULL) is used.
> >
> > With R 2.7.2:
> >
> > > log <- c(NA, logical(4), NA, !logical(2), NA)
> > > summary(log)
> >    Mode   FALSE    TRUE    NA's
> > logical       4       2       3
> > > summary(log[!is.na(log)])
> >    Mode   FALSE    TRUE
> > logical       4       2
> > > summary(TRUE)
> >    Mode    TRUE
> > logical       1
> >
> > With R 3.3.1:
> >
> > > log <- c(NA, logical(4), NA, !logical(2), NA)
> > > summary(log)
> >    Mode   FALSE    TRUE    NA's
> > logical       4       2       3
> > > summary(log[!is.na(log)])
> >    Mode   FALSE    TRUE    NA's
> > logical       4       2       0
> > > summary(TRUE)
> >    Mode    TRUE    NA's
> > logical       1       0
> >
> > In R 3.3.1, "NA's" is always in the result of 'summary' of a logical vector. It is unlike 'summary' of a numeric vector.
> > On the other hand, in R 3.3.1, FALSE is not in the result of 'summary' of a logical vector that doesn't contain FALSE.
> >
> > I prefer the result of 'summary' of a logical vector like in R 2.7.2, or, alternatively, the result that always includes all possible values: FALSE, TRUE, NA.
>
> I tend to agree, and strongly prefer the 'R(<=2.7.2)'-behavior
> for table() {and hence summary(<logical>)}.
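
The two behaviours compared in this message can be reproduced without relying on what exclude = NULL means in a given release, by passing 'useNA' explicitly. The following is a minimal sketch, assuming an R version that has the 'useNA' argument of table(); it reuses the vectors 'a' and 'b' from the example above, and the object names 'tab_ifany' and 'tab_always' are only illustrative.

```r
# Vectors from the example in the thread: 'a' contains one NA, 'b' contains none.
a <- c(1, 1, 2, 2, NA, 3)
b <- c(2, 1, 1, 1, 1, 1)

# "ifany": an <NA> level appears only in a dimension that really has NAs.
# This corresponds to the R 2.7.2 output quoted above.
tab_ifany <- table(a, b, useNA = "ifany")

# "always": an <NA> level appears in every dimension, even with zero counts.
# This corresponds to what R 3.3.1 prints for exclude = NULL.
tab_always <- table(a, b, useNA = "always")

dim(tab_ifany)   # 4 rows (1, 2, 3, <NA>) and 2 columns (1, 2)
dim(tab_always)  # 4 rows and 3 columns (1, 2, <NA>)
```

Writing 'useNA' explicitly keeps the result independent of how a particular R version interprets exclude = NULL.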
>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>     on Thu, 11 Aug 2016 16:19:49 +0000 writes:

> I stand corrected. The part "If set to 'NULL', it implies
> 'useNA="always"'." is even in the documentation in R
> 2.8.0. It was my fault not to check carefully.  I wonder,
> why "always" was chosen for 'useNA' for exclude = NULL.

Me too. "ifany" would seem more logical, and I am considering
changing to that as a 2nd step (if the 1st step, below, proves to
be feasible).

> Why exclude = NULL is so special? What about another
> 'exclude' of length zero, like character(0) (not c(),
> because c() is NULL)? I thought that, too. But then, I
> have no opinion about making it general.

As mentioned, I entirely agree with that {and you are right
about c() !!}.

> It fits my expectation to override 'useNA' only if the
> user doesn't explicitly specify 'useNA'.

> Thank you for looking into this.

You are welcome.
As first step, I plan to commit the change to (*)

     useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"

as proposed yesterday, and I'll eventually see / be notified
about the effect in CRAN space.

--
(*) slightly more efficiently, I'll be using match() directly instead of %in%

> My points:
> Could R 2.7.2 behavior of table(<non-factor>, exclude = NULL) be brought back? But R 3.3.1 behavior is in R since version 2.8.0, rather long.

You are right... but then, the places / cases where the behavior
would change back should be quite rare.

> If not, I suggest changing summary(<logical>).

> --------------------------------------------

Thank you for your feedback, Suharto!
Martin
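
The two outcomes Suharto prefers for summary(<logical>) can be approximated with table() directly. This is only a sketch, not the code of 'summary.default': the names 'x', 'counts_ifany', 'counts_clean' and 'counts_all' are illustrative, and the explicit 'useNA' values are chosen so that the result does not depend on how exclude = NULL is handled.

```r
# Logical vector from the thread: four FALSE, two TRUE, three NA.
x <- c(NA, logical(4), NA, !logical(2), NA)

# Variant 1 (like R 2.7.2): report NA only when NA is actually present.
counts_ifany <- table(x, useNA = "ifany")              # FALSE 4, TRUE 2, <NA> 3
counts_clean <- table(x[!is.na(x)], useNA = "ifany")   # FALSE 4, TRUE 2

# Variant 2: always report all possible values FALSE, TRUE and NA.
# Fixing the factor levels keeps FALSE in the result even when it is absent.
counts_all <- table(factor(TRUE, levels = c(FALSE, TRUE)), useNA = "always")
counts_all                                             # FALSE 0, TRUE 1, <NA> 0
```

Variant 2 fixes the levels up front, which is why a zero count for FALSE is shown for the single value TRUE.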
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Fri, 12 Aug 2016 10:12:01 +0200 writes:

>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>>     on Thu, 11 Aug 2016 16:19:49 +0000 writes:

>> I stand corrected. The part "If set to 'NULL', it implies
>> 'useNA="always"'." is even in the documentation in R
>> 2.8.0. It was my fault not to check carefully.  I wonder,
>> why "always" was chosen for 'useNA' for exclude = NULL.

> me too.  "ifany" would seem more logical, and I am
> considering changing to that as a 2nd step (if the 1st
> step, below) shows to be feasible.

>> Why exclude = NULL is so special? What about another
>> 'exclude' of length zero, like character(0) (not c(),
>> because c() is NULL)? I thought that, too. But then, I
>> have no opinion about making it general.

> As mentioned, I entirely agree with that {and you are
> right about c() !!}.

>> It fits my expectation to override 'useNA' only if the
>> user doesn't explicitly specify 'useNA'.

>> Thank you for looking into this.

> you are welcome.  As first step, I plan to commit the
> change to (*)

>      useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"

> as proposed yesterday, and I'll eventually see / be
> notified about the effect in CRAN space.

and as I'm finding now, 20 minutes too late, doing step 1 without
doing step 2 is not feasible.
It gives many 0 counts for <NA> e.g. for exclude = "foo".

> --
> (*) slightly more efficiently, I'll be using match()
> directly instead of %in%

>> My points: Could R 2.7.2 behavior of table(<non-factor>,
>> exclude = NULL) be brought back? But R 3.3.1 behavior is
>> in R since version 2.8.0, rather long.

> you are right... but then, the places / cases where the
> behavior would change back should be quite rare.

>> If not, I suggest changing summary(<logical>).

>> --------------------------------------------

> Thank you for your feedback, Suharto!
Martin

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
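
The "step 1" rule quoted in this thread can be tried out in isolation to see why it yields zero <NA> counts for something like exclude = "foo". The sketch below wraps the quoted condition in a hypothetical helper, 'resolve_useNA', which is not part of R; the default values for 'exclude' and 'useNA' and the fall-back to match.arg() are assumptions made only so that the helper is self-contained.

```r
# Hypothetical helper (not an R function): returns the 'useNA' setting that
# the proposed "step 1" rule would select for a given call.
resolve_useNA <- function(exclude = c(NA, NaN),
                          useNA = c("no", "ifany", "always")) {
  if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) {
    "always"            # rule from the thread: no NA in a user-given 'exclude'
  } else {
    match.arg(useNA)    # otherwise keep the user's choice (or the default "no")
  }
}

resolve_useNA()                                  # "no"
resolve_useNA(exclude = NULL)                    # "always"
resolve_useNA(exclude = character(0))            # "always", same as NULL
resolve_useNA(exclude = "foo")                   # "always": zero <NA> counts appear
resolve_useNA(exclude = NA)                      # "no"
resolve_useNA(exclude = NULL, useNA = "ifany")   # "ifany": explicit value wins
```

Under this rule any 'exclude' that lacks NA switches 'useNA' to "always", which is consistent with the many zero <NA> counts reported above for exclude = "foo" and with the suggestion to move on to "ifany" as the second step.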