thr3ads.net - R devel - [Rd] duplicated factor labels. [Jun 2017]

Paul Johnson

2017-Jun-15 00:00 UTC

[Rd] duplicated factor labels.

Dear R devel

I've been wondering about this for a while. I am sorry to ask for your
time, but can one of you help me understand this?

This concerns duplicated labels, not levels, in the factor function.

I find it hard to understand why factor() fails, but assigning the
same labels with levels() afterwards does not:
> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  factor level [3] is duplicated
> y <- factor(x, levels = xlevels)
> levels(y) <- xlabels
> y
[1] 1    <NA> <NA> 4    4    4
Levels: 1 4

If the latter use of levels() causes a good, expected result, couldn't
factor(..., labels = xlabels) be made to do the same thing?
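
The two-step route can be reproduced on its own. The following is a minimal sketch that uses only the values from the transcript above; it shows that assigning duplicated or NA labels to an existing factor merges the duplicates and turns the NA-labelled entries into missing values.

```r
# Build the factor first, then relabel it with levels<-.
x <- 1:6
y <- factor(x, levels = 1:6)
levels(y) <- c(1, NA, NA, 4, 4, 4)   # dispatches to `levels<-.factor`

levels(y)       # "1" "4"
as.integer(y)   # 1 NA NA 2 2 2  (internal codes after merging)
```

Only two levels remain, and the three elements labelled 4 share a single code.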

That's the gist of it. To signal to you that I've been trying to
figure this out on my own, here is a revision I've tested in R's
factor function which "seems" to fix the matter. (Of course, it
probably causes lots of other problems I don't understand; that's why
I'm writing to you now.)

In the factor function, the class of f is assigned *after* levels(f) is called

    levels(f) <- ## nl == nL or 1
    if (nl == nL) as.character(labels)
    else paste0(labels, seq_along(levels))
    class(f) <- c(if(ordered) "ordered", "factor")

At that point, f is a plain integer vector, and `levels<-` is a primitive:
> `levels<-`
function (x, value)  .Primitive("levels<-")

That's what generates the error.  I don't understand well what
.Primitive means here. I need to walk past that detail.
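
A small sketch of the dispatch difference described here, under the assumption that `levels<-` behaves as an internal generic: on an unclassed integer vector the primitive's own handling applies, while on an object of class "factor" the S3 method `levels<-.factor` is used. The vectors `f` and `g` are illustrative names, and the outcome for the unclassed case is captured rather than assumed.

```r
# `levels<-` itself is a primitive.
is.primitive(`levels<-`)
has_method <- is.function(utils::getS3method("levels<-", "factor"))

# Unclassed integer vector: no S3 method applies. Capture what happens
# instead of assuming it (this is where the thread's error arises).
f <- c(1L, 2L, 3L)
plain <- tryCatch({ levels(f) <- c("a", "b", "b"); "no error" },
                  error = function(e) conditionMessage(e))
plain

# Same codes, but classed first: `levels<-.factor` is dispatched and
# merges the duplicated label "b".
g <- c(1L, 2L, 3L)
class(g) <- "factor"
levels(g) <- c("a", "b", "b")
levels(g)       # "a" "b"
as.integer(g)   # 1 2 2
```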

Suppose I revise the factor function to put the class(f) line before
the levels(f) assignment. Then `levels<-.factor` is called and all
seems well.

factor <- function (x = character(), levels, labels = levels, exclude = NA,
    ordered = is.ordered(x), nmax = NA)
{
    if (is.null(x))
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x))
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx))
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL)))
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
            nl, nL), domain = NA)
    ## class() moved up 3 rows, so that `levels<-.factor` is dispatched
    class(f) <- c(if (ordered) "ordered", "factor")
    levels(f) <- if (nl == nL)
        as.character(labels)
    else paste0(labels, seq_along(levels))
    f
}
> assignInNamespace("factor", factor, "base")
> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
> y
[1] 1    <NA> <NA> 4    4    4
Levels: 1 4
> attributes(y)
$class
[1] "factor"

$levels
[1] "1" "4"
That's a "good" answer for me.

But I broke your function. I eliminated the check for duplicated levels.
> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
> y
[1] 1    4    <NA> <NA> <NA> <NA>
Levels: 1 4

Rather than have factor return the "duplicated levels" error when
there are duplicated values in labels, I wonder why it is not better
to have a check for duplicated levels directly. For example, insert a
new else in this stanza

    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    } else {  ## next is new part
        levels <- unique(levels)
    }

That will cause an error when there are duplicated levels because
there are more labels than levels:
> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
Error in factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels) :
  invalid 'labels'; length 6 should be 1 or 2
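
For comparison, a check aimed at the levels themselves could be sketched as below. `check_levels()` is a hypothetical helper written for illustration, not part of base R and not the change proposed above; it reports the first duplicated level instead of relying on a length mismatch with the labels.

```r
# Hypothetical helper (not in base R): fail early on duplicated levels.
check_levels <- function(levels) {
  dup <- anyDuplicated(levels)        # index of first duplicate, or 0
  if (dup > 0L)
    stop(sprintf("level [%d] is duplicated", dup), call. = FALSE)
  invisible(levels)
}

ok  <- tryCatch(check_levels(1:6), error = function(e) e)
bad <- tryCatch(check_levels(c(1, 1, 1, 2, 2, 2)), error = function(e) e)

inherits(bad, "error")    # TRUE
conditionMessage(bad)     # "level [2] is duplicated"
```

An explicit check of this kind names the offending level, whereas the unique(levels) approach surfaces the problem indirectly as an "invalid 'labels'" length error.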

So, in conclusion, if levels() can work after creating a factor, I
wish the equivalent labels argument would be accepted. What is your
opinion?

pj
-- 
Paul E. Johnson   http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.

Martin Maechler

2017-Jun-15 15:15 UTC

[Rd] duplicated factor labels.

>>>>> Paul Johnson <pauljohn32 at gmail.com>
>>>>>     on Wed, 14 Jun 2017 19:00:11 -0500 writes:
    > Dear R devel
    > I've been wondering about this for a while. I am sorry to ask for your
    > time, but can one of you help me understand this?

    > This concerns duplicated labels, not levels, in the factor function.

    > I think it is hard to understand that factor() fails, but levels()
    > after does not

    >> x <- 1:6
    >> xlevels <- 1:6
    >> xlabels <- c(1, NA, NA, 4, 4, 4)
    >> y <- factor(x, levels = xlevels, labels = xlabels)
    > Error in `levels<-`(`*tmp*`, value = if (nl == nL)
    > as.character(labels) else paste0(labels,  :
    > factor level [3] is duplicated
    >> y <- factor(x, levels = xlevels)
    >> levels(y) <- xlabels
    >> y
    > [1] 1    <NA> <NA> 4    4    4
    > Levels: 1 4

    > If the latter use of levels() causes a good, expected result, couldn't
    > factor(..., labels = xlabels) be made to the same thing?

I may misunderstand, but I think you are confusing 'labels' and
'levels' here (and you are not alone in this!), mostly because R's
factor() function treats them as arguments in a way that can be
confusing (but I don't think we'd want to change that; it's been
documented and in use for > 25 years in S, S+, and R).

Note that after the above,
> dput(y)
structure(c(1L, NA, NA, 2L, 2L, 2L), .Label = c("1", "4"), class = "factor")

and that of course _is_ a valid factor .. which you can easily
get directly via e.g.
> identical(y, factor(c(1,NA,NA,4,4,4)))
[1] TRUE

or also via
> identical(y, factor(c("1",NA,NA,"4","4","4")))
[1] TRUE
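
The equivalence can also be checked against the internal representation. This is a sketch only: `y1`, `y2` and `y3` are illustrative names, and the hand-built object simply mirrors the dput() output shown above (which spells the levels attribute `.Label`).

```r
# Route 1: create the factor, then relabel it.
y1 <- factor(1:6, levels = 1:6)
levels(y1) <- c(1, NA, NA, 4, 4, 4)

# Route 2: integer codes plus a "levels" attribute, built by hand.
y2 <- structure(c(1L, NA, NA, 2L, 2L, 2L),
                levels = c("1", "4"), class = "factor")

# Route 3: construct directly from the desired values.
y3 <- factor(c(1, NA, NA, 4, 4, 4))

identical(y1, y2)   # TRUE
identical(y1, y3)   # TRUE
```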

I really don't see a need for a change of factor().
It should remain as simple as possible (but not simpler :-).

Martin

Joris Meys

2017-Jun-16 07:35 UTC

[Rd] duplicated factor labels.

To extend on Martin's explanation:

In factor(), levels are the unique input values and labels the unique
output values. So the function levels() actually displays the labels.
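
A short illustration of that distinction, using made-up data (the values "lo", "mid", "hi" and their labels are assumptions chosen for the example): 'levels' names the values expected in the input, 'labels' names what they are called in the result, and levels() on the result returns the labels.

```r
x <- c("lo", "hi", "lo", "mid")
f <- factor(x,
            levels = c("lo", "mid", "hi"),          # values found in x
            labels = c("Low", "Medium", "High"))    # names used in the result

levels(f)         # "Low" "Medium" "High"
as.character(f)   # "Low" "High" "Low" "Medium"
```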

Cheers
Joris


______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
