Suharto Anggono Suharto Anggono
2016-Sep-02 16:10 UTC
[Rd] Coercion of 'exclude' in function 'factor' (was 'droplevels' inappropriate change)
I am basically fine with the change. How about using just the following? if(!is.character(exclude)) exclude <- as.vector(exclude, typeof(x)) # may result in NA x <- as.character(x) It looks simpler and is, more or less, equivalent. In factor.Rd, in description of argument 'exclude', "(when \code{x} is a \code{factor} already)" can be removed. A larger change that, I think, is reasonable is entirely removing the code exclude <- as.vector(exclude, typeof(x)) # may result in NA The explicit coercion of 'exclude' is not necessary. Function 'factor' works without it. The coercion of 'exclude' may leads to a surprise because it "may result in NA". Example from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html : factor(as.integer(c(1,2,3,3,NA)), exclude=NaN) excludes NA. As a bonus, without the coercion of 'exclude', 'exclude' can be a factor if 'x' is a factor. This part of an example in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html works. cc <- c("x","y","NA") ff <- factor(cc) factor(ff,exclude=ff[3]) However, the coercion of 'exclude' has been in function 'factor' in R "forever". -------------------------------------------- On Wed, 31/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote: Subject: Re: [Rd] 'droplevels' inappropriate change Cc: "Martin Maechler" <maechler at stat.math.ethz.ch> Date: Wednesday, 31 August, 2016, 2:51 PM>>>>> Martin Maechler <maechler at stat.math.ethz.ch> >>>>> on Sat, 27 Aug 2016 18:55:37 +0200 writes:>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org> >>>>> on Sat, 27 Aug 2016 03:17:32 +0000 writes:>> In R devel r71157, 'droplevels' documentation, in "Arguments" section, says this about argument 'exclude'. >> passed to factor(); factor levels which should be excluded from the result even if present. Note that this was implicitly NA in R <= 3.3.1 which did drop NA levels even when present in x, contrary to the documentation. The current default is compatible with x[ , drop=FALSE]. >> The part >> x[ , drop=FALSE] >> should be >> x[ , drop=TRUE] [[elided Yahoo spam]] > a "typo" by me. .. fixed now. >> Saying that 'exclude' is factor levels is not quite true for NA element. NA may be not an original level, but NA in 'exclude' affects the result. >> For a factor 'x', factor(x, exclude = exclude) doesn't really work for excluding in general. See, for example, https://stat.ethz.ch/pipermail/r-help/2005-September/079336.html . >> factor(factor(c("a","b","c")), exclude="c") >> However, this excludes "2": >> factor(factor(2:3), exclude=2) >> Rather unexpectedly, this excludes NA: >> factor(factor(c("a",NA), exclude=NULL), exclude="c") >> For a factor 'x', factor(x, exclude = exclude) can only exclude integer-like or NA levels. An explanation is in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html . > Well, Peter Dalgaard (in that R-devel e-mail, a bit more than 5 > years ago) is confirming the problem there, and suggesting (as > you, right?) that actually `factor()` is not behaving > correctly here. > And your persistence is finally getting close to convince me > that it is not just droplevels(), but factor() itself which > needs care here. > Interestingly, the following patch *does* pass 'make check-all' > (after small change in tests/reg-tests-1b.R which is ok), > and leads to behavior which is much closer to the documentation, > notably for your two examples above would give what one would > expect. > (( If the R-Hub would support experiments with branches of R-devel > from R-core members, I could just create such a branch and R Hub > would run 'R CMD check <pkg>' for thousands of CRAN packages > and provide a web page with the *differences* in the package > check results ... so we could see ... )) > I do agree that we should strongly consider such a change. as nobody has commented, I've been liberal and have taken these no comments as consent. Hence I have committed ------------------------------------------------------------------------ r71178 | maechler | 2016-08-31 09:45:40 +0200 (Wed, 31 Aug 2016) | 1 line Changed paths: M /trunk/doc/NEWS.Rd M /trunk/src/library/base/R/factor.R M /trunk/src/library/base/man/factor.Rd M /trunk/tests/reg-tests-1b.R M /trunk/tests/reg-tests-1c.R factor(x, exclude) more "rational" when x or exclude are character ------------------------------------------------------------------------ which apart from documentation, examples, and regression tests is just the patch below. Martin Maechler ETH Zurich > --- factor.R (revision 71157) > +++ factor.R (working copy) > @@ -28,8 +28,12 @@ > levels <- unique(y[ind]) > } > force(ordered) # check if original x is an ordered factor > - exclude <- as.vector(exclude, typeof(x)) # may result in NA > - x <- as.character(x) > + if(!is.character(x)) { > + if(!is.character(exclude)) > + exclude <- as.vector(exclude, typeof(x)) # may result in NA > + x <- as.character(x) > + } else > + exclude <- as.vector(exclude, typeof(x)) # may result in NA > ## levels could be a long vectors, but match will not handle that. > levels <- levels[is.na(match(levels, exclude))] > f <- match(x, levels) Delete Reply Reply All Forward Apply
Martin Maechler
2016-Sep-13 16:33 UTC
[Rd] Coercion of 'exclude' in function 'factor' (was 'droplevels' inappropriate change)
>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org> >>>>> on Fri, 2 Sep 2016 16:10:00 +0000 writes:> I am basically fine with the change. > How about using just the following? > if(!is.character(exclude)) > exclude <- as.vector(exclude, typeof(x)) # may result in NA > x <- as.character(x) > It looks simpler and is, more or less, equivalent. yes, but the current code should be slightly faster.. > In factor.Rd, in description of argument 'exclude', "(when \code{x} is a \code{factor} already)" can be removed. > A larger change that, I think, is reasonable is entirely removing the code > exclude <- as.vector(exclude, typeof(x)) # may result in NA > The explicit coercion of 'exclude' is not necessary. > Function 'factor' works without it. > The coercion of 'exclude' may lead to a surprise because it "may result in NA". > Example from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html : > factor(as.integer(c(1,2,3,3,NA)), exclude=NaN) > excludes NA. > As a bonus, without the coercion of 'exclude', 'exclude' can be a factor if 'x' is a factor. This part of an example in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html works. > cc <- c("x","y","NA") > ff <- factor(cc) > factor(ff,exclude=ff[3]) Yes, two good reasons for a change. factor() would finally behave according to the documentation which has been mentioning that 'exclude' could be a factor: ((Until my R-devel changes of a few weeks ago, i.e. at least in all recent released versions of R)), the help page for factor has said || If 'exclude' is used it should either be a factor with the same || level set as 'x' or a set of codes for the levels to be excluded. > However, the coercion of 'exclude' has been in function 'factor' in R "forever". Indeed: On March 6, 1998, svn rev. 845, when the R source files got a '.R' appended, and quite a long time before R 1.0.0, the factor function was as short as (but using an .Internal(.) !) function (x, levels = sort(unique(x), na.last = TRUE), labels, exclude = NA, ordered = FALSE) { if (length(x) == 0) return(character(0)) exclude <- as.vector(exclude, typeof(x)) levels <- levels[is.na(match(levels, exclude))] x <- .Internal(factor(match(x, levels), length(levels), ordered)) if (missing(labels)) levels(x) <- levels else levels(x) <- labels x } and already contained that line. Nevertheless, simplifying factor() by removing that line (or those 2 lines, now!) seems to only have advantages.... I'm not yet committing to anything, but currently would strongly consider it .. though *after* the '<length-1-array> OP <non-array>' issue has settled a bit. Martin > -------------------------------------------- > On Wed, 31/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote: > Subject: Re: [Rd] 'droplevels' inappropriate change > Cc: "Martin Maechler" <maechler at stat.math.ethz.ch> > Date: Wednesday, 31 August, 2016, 2:51 PM>>>>> Martin Maechler <maechler at stat.math.ethz.ch> >>>>> on Sat, 27 Aug 2016 18:55:37 +0200 writes:>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org> >>>>> on Sat, 27 Aug 2016 03:17:32 +0000 writes:>>> In R devel r71157, 'droplevels' documentation, in "Arguments" section, says this about argument 'exclude'. >>> passed to factor(); factor levels which should be excluded from the result even if present. Note that this was implicitly NA in R <= 3.3.1 which did drop NA levels even when present in x, contrary to the documentation. The current default is compatible with x[ , drop=FALSE]. >>> The part >>> x[ , drop=FALSE] >>> should be >>> x[ , drop=TRUE] > [[elided Yahoo spam]] >> a "typo" by me. .. fixed now. >>> Saying that 'exclude' is factor levels is not quite true for NA element. NA may be not an original level, but NA in 'exclude' affects the result. >>> For a factor 'x', factor(x, exclude = exclude) doesn't really work for excluding in general. See, for example, https://stat.ethz.ch/pipermail/r-help/2005-September/079336.html . >>> factor(factor(c("a","b","c")), exclude="c") >>> However, this excludes "2": >>> factor(factor(2:3), exclude=2) >>> Rather unexpectedly, this excludes NA: >>> factor(factor(c("a",NA), exclude=NULL), exclude="c") >>> For a factor 'x', factor(x, exclude = exclude) can only exclude integer-like or NA levels. An explanation is in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html . >> Well, Peter Dalgaard (in that R-devel e-mail, a bit more than 5 >> years ago) is confirming the problem there, and suggesting (as >> you, right?) that actually `factor()` is not behaving >> correctly here. >> And your persistence is finally getting close to convince me >> that it is not just droplevels(), but factor() itself which >> needs care here. >> Interestingly, the following patch *does* pass 'make check-all' >> (after small change in tests/reg-tests-1b.R which is ok), >> and leads to behavior which is much closer to the documentation, >> notably for your two examples above would give what one would >> expect. >> (( If the R-Hub would support experiments with branches of R-devel >> from R-core members, I could just create such a branch and R Hub >> would run 'R CMD check <pkg>' for thousands of CRAN packages >> and provide a web page with the *differences* in the package >> check results ... so we could see ... )) >> I do agree that we should strongly consider such a change. > as nobody has commented, I've been liberal and have taken these > no comments as consent. > Hence I have committed > ------------------------------------------------------------------------ > r71178 | maechler | 2016-08-31 09:45:40 +0200 (Wed, 31 Aug 2016) | 1 line > Changed paths: > M /trunk/doc/NEWS.Rd > M /trunk/src/library/base/R/factor.R > M /trunk/src/library/base/man/factor.Rd > M /trunk/tests/reg-tests-1b.R > M /trunk/tests/reg-tests-1c.R > factor(x, exclude) more "rational" when x or exclude are character > ------------------------------------------------------------------------ > which apart from documentation, examples, and regression tests > is just the patch below. > Martin Maechler > ETH Zurich >> --- factor.R (revision 71157) >> +++ factor.R (working copy) >> @@ -28,8 +28,12 @@ >> levels <- unique(y[ind]) >> } >> force(ordered) # check if original x is an ordered factor >> - exclude <- as.vector(exclude, typeof(x)) # may result in NA >> - x <- as.character(x) >> + if(!is.character(x)) { >> + if(!is.character(exclude)) >> + exclude <- as.vector(exclude, typeof(x)) # may result in NA >> + x <- as.character(x) >> + } else >> + exclude <- as.vector(exclude, typeof(x)) # may result in NA >> ## levels could be a long vectors, but match will not handle that. >> levels <- levels[is.na(match(levels, exclude))] >> f <- match(x, levels) > Delete Reply Reply All Forward Apply > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
Martin Maechler
2016-Sep-30 12:51 UTC
[Rd] Coercion of 'exclude' in function 'factor' (was 'droplevels' inappropriate change)
>>>>> Martin Maechler <maechler at stat.math.ethz.ch> >>>>> on Tue, 13 Sep 2016 18:33:35 +0200 writes:>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org> >>>>> on Fri, 2 Sep 2016 16:10:00 +0000 writes:>> I am basically fine with the change. >> How about using just the following? >> if(!is.character(exclude)) >> exclude <- as.vector(exclude, typeof(x)) # may result in NA >> x <- as.character(x) >> It looks simpler and is, more or less, equivalent. > yes, but the current code should be slightly faster.. >> In factor.Rd, in description of argument 'exclude', "(when \code{x} is a \code{factor} already)" can be removed. >> A larger change that, I think, is reasonable is entirely removing the code >> exclude <- as.vector(exclude, typeof(x)) # may result in NA >> The explicit coercion of 'exclude' is not necessary. >> Function 'factor' works without it. >> The coercion of 'exclude' may lead to a surprise because it "may result in NA". >> Example from https://stat.ethz.ch/pipermail/r-help/2005-April/069053.html : >> factor(as.integer(c(1,2,3,3,NA)), exclude=NaN) >> excludes NA. >> As a bonus, without the coercion of 'exclude', 'exclude' can be a factor if 'x' is a factor. This part of an example in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html works. >> cc <- c("x","y","NA") >> ff <- factor(cc) >> factor(ff,exclude=ff[3]) > Yes, two good reasons for a change. factor() would finally > behave according to the documentation which has been mentioning > that 'exclude' could be a factor: ((Until my R-devel changes of a > few weeks ago, i.e. at least in all recent released versions of R)), > the help page for factor has said > || If 'exclude' is used it should either be a factor with the same > || level set as 'x' or a set of codes for the levels to be excluded. >> However, the coercion of 'exclude' has been in function 'factor' in R "forever". > Indeed: On March 6, 1998, svn rev. 845, when the R source files got a > '.R' appended, and quite a long time before R 1.0.0, > the factor function was as short as (but using an .Internal(.) !) > function (x, levels = sort(unique(x), na.last = TRUE), labels, exclude = NA, > ordered = FALSE) > { > if (length(x) == 0) > return(character(0)) > exclude <- as.vector(exclude, typeof(x)) > levels <- levels[is.na(match(levels, exclude))] > x <- .Internal(factor(match(x, levels), length(levels), > ordered)) > if (missing(labels)) > levels(x) <- levels > else levels(x) <- labels > x > } > and already contained that line. > Nevertheless, simplifying factor() by removing that line (or those > 2 lines, now!) seems to only have advantages.... > I'm not yet committing to anything, but currently would strongly > consider it .. though *after* the > '<length-1-array> OP <non-array>' > issue has settled a bit. (Which it has; the decision has been to keep it.) I have now committed Suharto's proposal above, to entirely drop the exclude <- as.vector(exclude, typeof(x)) parts in the factor() function... which has the two advantages mentioned above and simplifies the code (and documentation). ------------------------------------------------------------------------ r71424 | maechler | 2016-09-30 14:38:43 +0200 (Fri, 30 Sep 2016) | 1 line Changed paths: M /trunk/doc/NEWS.Rd M /trunk/src/library/base/R/factor.R M /trunk/src/library/base/man/factor.Rd M /trunk/tests/reg-tests-1c.R simplify factor(), allowing 'exclude= <factor>' as documented in R <= 3.3.x ------------------------------------------------------------------------ I do expect some "reaction" in CRAN/Bioconductor package space, so the final word has not been spoken on this, but the new code is more aestethic to me. Thank you for the suggestion, Martin >> -------------------------------------------- >> On Wed, 31/8/16, Martin Maechler <maechler at stat.math.ethz.ch> wrote: >> Subject: Re: [Rd] 'droplevels' inappropriate change >> Cc: "Martin Maechler" <maechler at stat.math.ethz.ch> >> Date: Wednesday, 31 August, 2016, 2:51 PM>>>>> Martin Maechler <maechler at stat.math.ethz.ch> >>>>> on Sat, 27 Aug 2016 18:55:37 +0200 writes:>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at r-project.org> >>>>> on Sat, 27 Aug 2016 03:17:32 +0000 writes:>>>> In R devel r71157, 'droplevels' documentation, in "Arguments" section, says this about argument 'exclude'. >>>> passed to factor(); factor levels which should be excluded from the result even if present. Note that this was implicitly NA in R <= 3.3.1 which did drop NA levels even when present in x, contrary to the documentation. The current default is compatible with x[ , drop=FALSE]. >>>> The part >>>> x[ , drop=FALSE] >>>> should be >>>> x[ , drop=TRUE] >> [[elided Yahoo spam]] >>> a "typo" by me. .. fixed now. >>>> Saying that 'exclude' is factor levels is not quite true for NA element. NA may be not an original level, but NA in 'exclude' affects the result. >>>> For a factor 'x', factor(x, exclude = exclude) doesn't really work for excluding in general. See, for example, https://stat.ethz.ch/pipermail/r-help/2005-September/079336.html . >>>> factor(factor(c("a","b","c")), exclude="c") >>>> However, this excludes "2": >>>> factor(factor(2:3), exclude=2) >>>> Rather unexpectedly, this excludes NA: >>>> factor(factor(c("a",NA), exclude=NULL), exclude="c") >>>> For a factor 'x', factor(x, exclude = exclude) can only exclude integer-like or NA levels. An explanation is in https://stat.ethz.ch/pipermail/r-help/2011-April/276274.html . >>> Well, Peter Dalgaard (in that R-devel e-mail, a bit more than 5 >>> years ago) is confirming the problem there, and suggesting (as >>> you, right?) that actually `factor()` is not behaving >>> correctly here. >>> And your persistence is finally getting close to convince me >>> that it is not just droplevels(), but factor() itself which >>> needs care here. >>> Interestingly, the following patch *does* pass 'make check-all' >>> (after small change in tests/reg-tests-1b.R which is ok), >>> and leads to behavior which is much closer to the documentation, >>> notably for your two examples above would give what one would >>> expect. >>> (( If the R-Hub would support experiments with branches of R-devel >>> from R-core members, I could just create such a branch and R Hub >>> would run 'R CMD check <pkg>' for thousands of CRAN packages >>> and provide a web page with the *differences* in the package >>> check results ... so we could see ... )) >>> I do agree that we should strongly consider such a change. >> as nobody has commented, I've been liberal and have taken these >> no comments as consent. >> Hence I have committed >> ------------------------------------------------------------------------ >> r71178 | maechler | 2016-08-31 09:45:40 +0200 (Wed, 31 Aug 2016) | 1 line >> Changed paths: >> M /trunk/doc/NEWS.Rd >> M /trunk/src/library/base/R/factor.R >> M /trunk/src/library/base/man/factor.Rd >> M /trunk/tests/reg-tests-1b.R >> M /trunk/tests/reg-tests-1c.R >> factor(x, exclude) more "rational" when x or exclude are character >> ------------------------------------------------------------------------ >> which apart from documentation, examples, and regression tests >> is just the patch below. >> Martin Maechler >> ETH Zurich >>> --- factor.R (revision 71157) >>> +++ factor.R (working copy) >>> @@ -28,8 +28,12 @@ >>> levels <- unique(y[ind]) >>> } >>> force(ordered) # check if original x is an ordered factor >>> - exclude <- as.vector(exclude, typeof(x)) # may result in NA >>> - x <- as.character(x) >>> + if(!is.character(x)) { >>> + if(!is.character(exclude)) >>> + exclude <- as.vector(exclude, typeof(x)) # may result in NA >>> + x <- as.character(x) >>> + } else >>> + exclude <- as.vector(exclude, typeof(x)) # may result in NA >>> ## levels could be a long vectors, but match will not handle that. >>> levels <- levels[is.na(match(levels, exclude))] >>> f <- match(x, levels) >> Delete Reply Reply All Forward Apply >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel