>>>>> Paul Johnson <pauljohn32 at gmail.com>
>>>>>     on Fri, 16 Jun 2017 11:02:34 -0500 writes:

    > On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys <jorismeys at gmail.com> wrote:
    >> To extend on Martin's explanation:
    >>
    >> In factor(), levels are the unique input values and labels the unique output
    >> values. So the function levels() actually displays the labels.
    >>

    > Dear Joris

    > I think we agree. Currently, factor insists both levels and labels be unique.
    > I wish that it would accept nonunique labels. I also understand it
    > is impractical to change this now in base R.

    > I don't think I succeeded in explaining why this would be nicer.
    > Here's another example. Fairly often, we see input data like

    > x <- c("Male", "Man", "male", "Man", "Female")

    > The first four represent the same value. I'd like to go in one step
    > to a new factor variable with enumerated types "Male" and "Female".
    > This fails

    > xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
    >              labels = c("Male", "Male", "Male", "Female"))

    > Instead, we need 2 steps.

    > xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
    > levels(xf) <- c("Male", "Male", "Male", "Female")

    > I think it is quirky that `levels<-.factor` allows the duplicated
    > labels, whereas factor does not.

    > I wrote a function rockchalk::combineLevels to simplify combining
    > levels, but most of the students here like plyr::mapvalues to do it.
    > The use of levels() can be tricky because one must enumerate all
    > values, not just the ones being changed.

    > But I do understand Martin's point. It's been this way 25 years, it
    > won't change. :).

Well.. the above is a bit out of context.

Your first example really did not make a point to me (and Joris)
and I showed that you could use even two different simple factor() calls to
produce what you wanted

  yc <- factor(c("1",NA,NA,"4","4","4"))
  yn <- factor(c( 1, NA,NA, 4, 4, 4))

Your new example is indeed much more convincing !

(Note though that the two steps that are needed can be written
 more shortly

The "been this way 25 years" is one reason to be very
cautious(*) with changes, but not a reason for no changes!

(*) Indeed as some of you have noted we really should not "break behavior".
This means to me we cannot accept a change there which gives
an error or a different result in cases the old behavior gave a valid factor.

I'm looking at a possible change currently
[not promising that a change will happen ...]

Martin

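For readers who want to try the recoding discussed in this message, here is a minimal sketch in R. It repeats Paul's two-step version and then shows the named-list form of `levels<-`, which is documented in ?levels. It is offered as an illustration only and is not necessarily the shorter form that the unfinished note above referred to. The variable names xf and xg are made up for this example.

```r
x <- c("Male", "Man", "male", "Man", "Female")

# Two-step form from Paul's message: build the factor, then merge levels
# by assigning a character vector with repeated entries.
xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
levels(xf) <- c("Male", "Male", "Male", "Female")

# Alternative: the named-list form of `levels<-`, where each list name is
# a new level and its value lists the old levels to be merged into it.
xg <- factor(x)
levels(xg) <- list(Male = c("Male", "Man", "male"), Female = "Female")

print(xf)
print(xg)
```

The two factors agree as character vectors, but the order of their levels may differ, because factor(x) sorts the levels before they are merged.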
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Thu, 22 Jun 2017 11:43:59 +0200 writes:

>>>>> Paul Johnson <pauljohn32 at gmail.com>
>>>>>     on Fri, 16 Jun 2017 11:02:34 -0500 writes:

    >> On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys <jorismeys at gmail.com> wrote:
    >>> To extend on Martin's explanation:
    >>>
    >>> In factor(), levels are the unique input values and labels the unique output
    >>> values. So the function levels() actually displays the labels.
    >>>

    >> Dear Joris

    >> I think we agree. Currently, factor insists both levels and labels be unique.
    >> I wish that it would accept nonunique labels. I also understand it
    >> is impractical to change this now in base R.

    >> I don't think I succeeded in explaining why this would be nicer.
    >> Here's another example. Fairly often, we see input data like

    >> x <- c("Male", "Man", "male", "Man", "Female")

    >> The first four represent the same value. I'd like to go in one step
    >> to a new factor variable with enumerated types "Male" and "Female".
    >> This fails

    >> xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
    >>              labels = c("Male", "Male", "Male", "Female"))

    >> Instead, we need 2 steps.

    >> xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
    >> levels(xf) <- c("Male", "Male", "Male", "Female")

    >> I think it is quirky that `levels<-.factor` allows the duplicated
    >> labels, whereas factor does not.

    >> I wrote a function rockchalk::combineLevels to simplify combining
    >> levels, but most of the students here like plyr::mapvalues to do it.
    >> The use of levels() can be tricky because one must enumerate all
    >> values, not just the ones being changed.

    >> But I do understand Martin's point. It's been this way 25 years, it
    >> won't change. :).

    > Well.. the above is a bit out of context.

    > Your first example really did not make a point to me (and Joris)
    > and I showed that you could use even two different simple factor() calls to
    > produce what you wanted

    >   yc <- factor(c("1",NA,NA,"4","4","4"))
    >   yn <- factor(c( 1, NA,NA, 4, 4, 4))

    > Your new example is indeed much more convincing !

    > (Note though that the two steps that are needed can be written
    >  more shortly

    > The "been this way 25 years" is one reason to be very
    > cautious(*) with changes, but not a reason for no changes!

    > (*) Indeed as some of you have noted we really should not "break behavior".
    > This means to me we cannot accept a change there which gives
    > an error or a different result in cases the old behavior gave a valid factor.

    > I'm looking at a possible change currently
    > [not promising that a change will happen ...]

In the end, I've liked the change (after 2-3 iterations), and
have now been brave enough to commit it to R-devel (svn 72845).

With the change, I had to disable one of our own regression
checks (tests/reg-tests-1b.R, line 726):

The following is now (in R-devel -> R 3.5.0) valid:

  > factor(1:2, labels = c("A","A"))
  [1] A A
  Levels: A
  >

I wonder how many CRAN package checks will "break" from
this (my guess is on the order of a dozen), but I hope
that these breakages will be benign, e.g., similar to the above
case where before an error was expected via tools::assertError(.)

Martin

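As a sketch of what the committed change means for Paul's example: assuming R 3.5.0 or later, the one-step call with duplicated labels should be accepted and should merge the three spellings into a single level. The version guard is an addition for this illustration only, so that the snippet does nothing on older versions instead of stopping with an error.

```r
x <- c("Male", "Man", "male", "Man", "Female")

# Duplicated labels are accepted only from R 3.5.0 on (per the message above),
# so the call is skipped on older versions.
if (getRversion() >= "3.5.0") {
  xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
               labels = c("Male", "Male", "Male", "Female"))
  print(levels(xf))  # the three "Male" spellings share one level
  print(table(xf))   # counts per merged level
}
```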
Hmm, the danger in this is that duplicated factor levels _used_ to be
allowed (i.e. multiple codes with the same level). Disallowing it is
what broke read.spss() on some files, because SPSS's concept of value
labels is not 1-to-1 with factors.

Reallowing it with different semantics could be premature. I mean, if
we hadn't had the "forbidden" step, read.spss() could have changed
behaviour unnoticed. So what if there is code relying on duplicate
factor levels, which hasn't been run for some time?

-pd

> On 23 Jun 2017, at 10:42 , Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>
>>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>>    on Thu, 22 Jun 2017 11:43:59 +0200 writes:
>
>>>>>> Paul Johnson <pauljohn32 at gmail.com>
>>>>>>    on Fri, 16 Jun 2017 11:02:34 -0500 writes:
>
>>> On Fri, Jun 16, 2017 at 2:35 AM, Joris Meys <jorismeys at gmail.com> wrote:
>>>> To extend on Martin's explanation:
>>>>
>>>> In factor(), levels are the unique input values and labels the unique output
>>>> values. So the function levels() actually displays the labels.
>>>>
>
>>> Dear Joris
>
>>> I think we agree. Currently, factor insists both levels and labels be unique.
>>> I wish that it would accept nonunique labels. I also understand it
>>> is impractical to change this now in base R.
>
>>> I don't think I succeeded in explaining why this would be nicer.
>>> Here's another example. Fairly often, we see input data like
>
>>> x <- c("Male", "Man", "male", "Man", "Female")
>
>>> The first four represent the same value. I'd like to go in one step
>>> to a new factor variable with enumerated types "Male" and "Female".
>>> This fails
>
>>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"),
>>>              labels = c("Male", "Male", "Male", "Female"))
>
>>> Instead, we need 2 steps.
>
>>> xf <- factor(x, levels = c("Male", "Man", "male", "Female"))
>>> levels(xf) <- c("Male", "Male", "Male", "Female")
>
>>> I think it is quirky that `levels<-.factor` allows the duplicated
>>> labels, whereas factor does not.
>
>>> I wrote a function rockchalk::combineLevels to simplify combining
>>> levels, but most of the students here like plyr::mapvalues to do it.
>>> The use of levels() can be tricky because one must enumerate all
>>> values, not just the ones being changed.
>
>>> But I do understand Martin's point. It's been this way 25 years, it
>>> won't change. :).
>
>> Well.. the above is a bit out of context.
>
>> Your first example really did not make a point to me (and Joris)
>> and I showed that you could use even two different simple factor() calls to
>> produce what you wanted
>>   yc <- factor(c("1",NA,NA,"4","4","4"))
>>   yn <- factor(c( 1, NA,NA, 4, 4, 4))
>
>> Your new example is indeed much more convincing !
>
>> (Note though that the two steps that are needed can be written
>>  more shortly
>
>> The "been this way 25 years" is one reason to be very
>> cautious(*) with changes, but not a reason for no changes!
>
>> (*) Indeed as some of you have noted we really should not "break behavior".
>> This means to me we cannot accept a change there which gives
>> an error or a different result in cases the old behavior gave a valid factor.
>
>> I'm looking at a possible change currently
>> [not promising that a change will happen ...]
>
> In the end, I've liked the change (after 2-3 iterations), and
> have now been brave enough to commit it to R-devel (svn 72845).
>
> With the change, I had to disable one of our own regression
> checks (tests/reg-tests-1b.R, line 726):
>
> The following is now (in R-devel -> R 3.5.0) valid:
>
>> factor(1:2, labels = c("A","A"))
> [1] A A
> Levels: A
>>
>
> I wonder how many CRAN package checks will "break" from
> this (my guess is on the order of a dozen), but I hope
> that these breakages will be benign, e.g., similar to the above
> case where before an error was expected via tools::assertError(.)
>
> Martin

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com