Suharto Anggono Suharto Anggono
2016-Aug-14 03:42 UTC
[Rd] table(exclude = NULL) always includes NA
useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "ifany"

An example where it changes the 'table' result for non-factor input, from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html :

x <- c(1,2,3,3,NA)
table(as.integer(x), exclude=NaN)

I bring the example up in case the change in result is not intended.

--------------------------------------------
On Sat, 13/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

Subject: Re: [Rd] table(exclude = NULL) always includes NA
To: "Martin Maechler" <maechler at stat.math.ethz.ch> @r-project.org
Date: Saturday, 13 August, 2016, 4:29 AM

>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>> on Fri, 12 Aug 2016 10:12:01 +0200 writes:

>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>>>>> on Thu, 11 Aug 2016 16:19:49 +0000 writes:

>> I stand corrected. The part "If set to 'NULL', it implies
>> 'useNA="always"'." is even in the documentation in R
>> 2.8.0. It was my fault not to check carefully. I wonder
>> why "always" was chosen for 'useNA' for exclude = NULL.

> me too. "ifany" would seem more logical, and I am
> considering changing to that as a 2nd step (if the 1st
> step, below, proves to be feasible).

>> Why is exclude = NULL so special? What about another
>> 'exclude' of length zero, like character(0) (not c(),
>> because c() is NULL)? I thought that, too. But then, I
>> have no opinion about making it general.

> As mentioned, I entirely agree with that {and you are
> right about c() !!}.

>> It fits my expectation to override 'useNA' only if the
>> user doesn't explicitly specify 'useNA'.

>> Thank you for looking into this.

> you are welcome. As first step, I plan to commit the
> change to (*)

> useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"

> as proposed yesterday, and I'll eventually see / be
> notified about the effect in CRAN space.

and as I'm finding now, 20 minutes too late, doing step 1 without doing step 2 is not feasible. It gives many 0 counts for <NA> e.g. for exclude = "foo".

> --
> (*) slightly more efficiently, I'll be using match()
> directly instead of %in%

>> My points: Could R 2.7.2 behavior of table(<non-factor>,
>> exclude = NULL) be brought back? But R 3.3.1 behavior is
>> in R since version 2.8.0, rather long.

> you are right... but then, the places / cases where the
> behavior would change back should be quite rare.

>> If not, I suggest changing summary(<logical>).

>> --------------------------------------------

> Thank you for your feedback, Suharto! Martin

>> On Thu, 11/8/16, Martin Maechler
>> <maechler at stat.math.ethz.ch> wrote:
>>
>> Subject: Re: [Rd] table(exclude = NULL) always includes NA
>>
>> @r-project.org
>> Cc: "Martin Maechler" <maechler at stat.math.ethz.ch>
>> Date: Thursday, 11 August, 2016, 12:39 AM
>>
>> >>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>> >>>>> on Tue, 9 Aug 2016 15:35:41 +0200 writes:
>>
>> >>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org>
>> >>>>> on Sun, 7 Aug 2016 15:32:19 +0000 writes:
>>
>> > > This is an example from
>> > > https://stat.ethz.ch/pipermail/r-help/2007-May/132573.html .
>> >
>> > > With R 2.7.2:
>> >
>> > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
>> > > > table(a, b, exclude = NULL)
>> > >       b
>> > > a      1 2
>> > >   1    1 1
>> > >   2    2 0
>> > >   3    1 0
>> > >   <NA> 1 0
>> >
>> > > With R 3.3.1:
>> >
>> > > > a <- c(1, 1, 2, 2, NA, 3); b <- c(2, 1, 1, 1, 1, 1)
>> > > > table(a, b, exclude = NULL)
>> > >       b
>> > > a      1 2 <NA>
>> > >   1    1 1    0
>> > >   2    2 0    0
>> > >   3    1 0    0
>> > >   <NA> 1 0    0
>> > > > table(a, b, useNA = "ifany")
>> > >       b
>> > > a      1 2
>> > >   1    1 1
>> > >   2    2 0
>> > >   3    1 0
>> > >   <NA> 1 0
>> > > > table(a, b, exclude = NULL, useNA = "ifany")
>> > >       b
>> > > a      1 2 <NA>
>> > >   1    1 1    0
>> > >   2    2 0    0
>> > >   3    1 0    0
>> > >   <NA> 1 0    0
>> >
>> > > For the example, in R 3.3.1, the result of 'table' with
>> > > exclude = NULL includes NA even if NA is not present. It is
>> > > different from R 2.7.2, that comes from factor(exclude = NULL),
>> > > that includes NA only if NA is present.
>> >
>> > I agree that this (R 3.3.1 behavior) seems undesirable and looks
>> > wrong, and the old (<= 2.7.2) behavior for table(a,b,
>> > exclude=NULL) seems desirable to me.
>> >
>> > > From R 3.3.1 help on 'table', in "Details" section:
>> > >
>> > > 'useNA' controls if the table includes counts of 'NA'
>> > > values: the allowed values correspond to never, only if
>> > > the count is positive and even for zero counts. This is
>> > > overridden by specifying 'exclude = NULL'.
>> >
>> > > Specifying 'exclude = NULL' overrides 'useNA' to what
>> > > value? The documentation doesn't say. Looking at the code
>> > > of function 'table', the value is "always".
>> >
>> > Yes, it should be documented what happens for this case,
>> > (but read on ...)
>>
>> and it is *not* true that the documentation does not say,
>> since 2013, it has contained
>>
>> exclude: levels to remove for all factors in '...'. If
>> set to 'NULL', it implies 'useNA = "always"'. See
>> 'Details' for its interpretation for non-factor
>> arguments.
>>
>> > > For the example, in R 3.3.1, the result like in R
>> > > 2.7.2 can be obtained with useNA = "ifany" and 'exclude'
>> > > unspecified.
>> >
>> > Yes. What should we do?
>> > I currently think that we'd want to change the line
>> >
>> > useNA <- if (!missing(exclude) && is.null(exclude)) "always"
>> >
>> > to
>> >
>> > useNA <- if (!missing(exclude) && is.null(exclude)) "ifany" # was "always"
>> >
>> > which would not even contradict documentation, as indeed you
>> > mentioned above, the exact action here had not been documented.
>>
>> The last part ("which ..") above is wrong, as mentioned
>> earlier.
>>
>> The above change entails behaviour which looks better to
>> me; however, the change *is* "against the current
>> documentation", and after experimentation (a "complete
>> factorial design" of argument settings), I'm not entirely
>> happy with the result.... and one reason is that 'exclude
>> = NULL' and (e.g.) 'exclude = c()' are (still) handled
>> differently: From a usual interpretation, both should mean
>> "do not exclude any factor entries (and levels) from
>> tabulation" but one of the two changes the default of
>> 'useNA' and the other does not. If we want a change
>> anyway (and have to update the doc), it could be "more
>> logical" to replace the line above by
>>
>> useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "always"
>>
>> notably, replacing 'useNA' *only* if it has not been
>> specified, which seems much closer to "typically
>> expected" behavior..
>>
>> > The change above at least does not break any of the
>> > standard R tests ('make check-all', i.e., including the
>> > recommended packages), which for me confirms that it
>> > may be "what is best"...
>> >
>> > ----
>> >
>> > Thank you for mentioning the important consequence for
>> > summary(<logical>).
>> > It can help give insight into what a "probably best"
>> > behavior should be for these cases of table().
>> >
>> > Martin Maechler,
>> > ETH Zurich
>> >
>> > > The result of 'summary' of a logical vector is
>> > > affected. As mentioned in
>> > > http://stackoverflow.com/questions/26775501/r-dropping-nas-in-logical-column-levels ,
>> > > in the code of function 'summary.default', for logical,
>> > > table(object, exclude = NULL) is used.
>> >
>> > > With R 2.7.2:
>> >
>> > > > log <- c(NA, logical(4), NA, !logical(2), NA)
>> > > > summary(log)
>> > >    Mode   FALSE    TRUE    NA's
>> > > logical       4       2       3
>> > > > summary(log[!is.na(log)])
>> > >    Mode   FALSE    TRUE
>> > > logical       4       2
>> > > > summary(TRUE)
>> > >    Mode    TRUE
>> > > logical       1
>> >
>> > > With R 3.3.1:
>> >
>> > > > log <- c(NA, logical(4), NA, !logical(2), NA)
>> > > > summary(log)
>> > >    Mode   FALSE    TRUE    NA's
>> > > logical       4       2       3
>> > > > summary(log[!is.na(log)])
>> > >    Mode   FALSE    TRUE    NA's
>> > > logical       4       2       0
>> > > > summary(TRUE)
>> > >    Mode    TRUE    NA's
>> > > logical       1       0
>> >
>> > > In R 3.3.1, "NA's" is always in the result of
>> > > 'summary' of a logical vector. It is unlike 'summary' of
>> > > a numeric vector.
>> > > On the other hand, in R 3.3.1,
>> > > FALSE is not in the result of 'summary' of a logical
>> > > vector that doesn't contain FALSE.
>> >
>> > > I prefer the result of 'summary' of a logical vector
>> > > like in R 2.7.2, or, alternatively, the result that
>> > > always includes all possible values: FALSE, TRUE, NA.
>> >
>> > I tend to agree, and strongly prefer the
>> > 'R(<=2.7.2)'-behavior
>> > for table() {and hence summary(<logical>)}.
>> >
>> > ______________________________________________
>> > R-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-devel
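
The rule debated in this message can be tried out in isolation. The following is a minimal sketch, not the code of table() itself: the helper name default_useNA is hypothetical and only mirrors the one-line condition quoted in the thread, with match.arg() assumed as the fallback when the condition does not apply.

```r
# Hypothetical helper (not part of base R): returns the 'useNA' value that the
# rule quoted in the thread would select for a given combination of arguments.
default_useNA <- function(exclude, useNA = c("no", "ifany", "always")) {
  if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) {
    "ifany"            # 'exclude' supplied without NA, 'useNA' not supplied
  } else {
    match.arg(useNA)   # otherwise the explicit or default 'useNA' is used
  }
}

default_useNA()                                  # "no"
default_useNA(exclude = NULL)                    # "ifany"
default_useNA(exclude = character(0))            # "ifany"
default_useNA(exclude = "foo")                   # "ifany"
default_useNA(exclude = c(NA, NaN))              # "no"
default_useNA(exclude = NULL, useNA = "always")  # "always"
```

Under this sketch, NULL and character(0) are treated alike, and an explicitly supplied 'useNA' is never overridden, which are the two properties asked for in the discussion.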
>>>>> Suharto Anggono Suharto Anggono <suharto_anggono at yahoo.com>
>>>>> on Sun, 14 Aug 2016 03:42:08 +0000 writes:

> useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "ifany"

> An example where it changes the 'table' result for non-factor input, from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html :

> x <- c(1,2,3,3,NA)
> table(as.integer(x), exclude=NaN)

> I bring the example up in case the change in result is not intended.

Thanks a lot, Suharto.

To me, the example is convincing that the change (I committed Friday), svn rev 71087 & 71088, is a clear improvement:

(As you surely know, but not all the other readers:)
Before the change, the above example gave *different* results for 'x' and 'as.integer(x)', the integer case *not* counting the NAs, whereas with the change in effect, they are the same:

> x <- as.integer(dx <- c(1,2,3,3,NA))
> table(x, exclude=NaN); table(dx, exclude=NaN)
x
   1    2    3 <NA>
   1    1    2    1
dx
   1    2    3 <NA>
   1    1    2    1
>

--
But the change has affected 6-8 (of the 8000+) CRAN packages, which I am investigating now, and I will probably be in contact with the package maintainers after that.

Martin
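
Why 'exclude = NaN' triggers the new default can be checked directly: match() and %in% treat NA and NaN as distinct values, so the condition !(NA %in% exclude) holds. The snippet below is only an illustration of that point. The two table() calls are printed without an expected result, because their output depends on the R version in use, as the message above describes.

```r
# NA and NaN do not match each other, so exclude = NaN does not "contain" NA.
NA %in% NaN           # FALSE
NA %in% c(NA, NaN)    # TRUE

dx <- c(1, 2, 3, 3, NA)   # double input
x  <- as.integer(dx)      # integer input with the same values

# Version-dependent output: compare the two tables in your own R session.
print(table(dx, exclude = NaN))
print(table(x,  exclude = NaN))
```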
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>> on Mon, 15 Aug 2016 11:07:43 +0200 writes:

>>>>> Suharto Anggono Suharto Anggono <suharto_anggono at yahoo.com>
>>>>> on Sun, 14 Aug 2016 03:42:08 +0000 writes:

>> useNA <- if (missing(useNA) && !missing(exclude) && !(NA %in% exclude)) "ifany"

>> An example where it changes the 'table' result for non-factor input, from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html :

>> x <- c(1,2,3,3,NA)
>> table(as.integer(x), exclude=NaN)

>> I bring the example up in case the change in result is not intended.

> Thanks a lot, Suharto.

> To me, the example is convincing that the change (I committed
> Friday), svn rev 71087 & 71088, is a clear improvement:

> (As you surely know, but not all the other readers:)
> Before the change, the above example gave *different* results
> for 'x' and 'as.integer(x)', the integer case *not* counting the NAs,
> whereas with the change in effect, they are the same:

> > x <- as.integer(dx <- c(1,2,3,3,NA))
> > table(x, exclude=NaN); table(dx, exclude=NaN)
> x
>    1    2    3 <NA>
>    1    1    2    1
> dx
>    1    2    3 <NA>
>    1    1    2    1
> >

> --
> But the change has affected 6-8 (of the 8000+) CRAN packages
> which I am investigating now and probably will be in contact with the
> package maintainers after that.

There has been another bug in table(), since the time 'useNA' was introduced, which gives (in released R, R-patched, or R-devel):

> table(1:3, exclude = 1, useNA = "ifany")

   2    3 <NA>
   1    1    1
>

and that bug now (in R-devel, after my changes) triggers in cases it did not previously, notably in

table(1:3, exclude = 1)

which now does set 'useNA = "ifany"' and so gives the same silly result as the one above.

The reason for this bug is that addNA(..) is called (in all R versions mentioned) in this case, but it should not be.

I'm currently testing yet another amendment..

Martin
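
The mechanism named above can be reproduced outside table(). This is a sketch of the effect only, not of table()'s internal code: excluding the value 1 via factor() turns that element into NA, and a subsequent addNA(ifany = TRUE) then sees an NA and adds an NA level, so the excluded value gets counted under <NA>. tabulate() is used for the counting so that the result does not depend on which table() version is installed.

```r
# Excluding 1 leaves levels "2" and "3"; the first element becomes NA.
f <- factor(1:3, exclude = 1)
levels(f)    # "2" "3"
anyNA(f)     # TRUE

# addNA() with ifany = TRUE adds an NA level because an NA is now present.
g <- addNA(f, ifany = TRUE)
levels(g)    # "2" "3" NA

# Counting per level: the excluded value is tallied under the NA level.
counts <- tabulate(as.integer(g), nbins = nlevels(g))
counts       # 1 1 1
```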