Dear R devel
I've been wondering about this for a while. I am sorry to ask for your
time, but can one of you help me understand this?
This concerns duplicated labels, not levels, in the factor function.
I find it hard to understand why factor() fails when the labels contain
duplicates, while assigning the same labels with levels() afterwards does not:
> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
Error in `levels<-`(`*tmp*`, value = if (nl == nL)
as.character(labels) else paste0(labels, :
  factor level [3] is duplicated
> y <- factor(x, levels = xlevels)
> levels(y) <- xlabels
> y
[1] 1 <NA> <NA> 4 4 4
Levels: 1 4
If the latter use of levels() gives a good, expected result, couldn't
factor(..., labels = xlabels) be made to do the same thing?
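Until then, the two-step sequence shown above can be wrapped in a small helper. This is only a sketch: `factor_relabel` is a hypothetical name, not part of base R, and it assumes that `levels` itself contains no duplicates.

```r
# Hypothetical helper (not in base R): build the factor from unique levels
# first, then let `levels<-.factor` merge duplicated labels and drop NA labels.
factor_relabel <- function(x, levels, labels) {
    stopifnot(length(labels) == length(levels),
              anyDuplicated(levels) == 0L)
    y <- factor(x, levels = levels)
    levels(y) <- labels
    y
}

x <- 1:6
y <- factor_relabel(x, levels = 1:6, labels = c(1, NA, NA, 4, 4, 4))
levels(y)
```

The helper only repeats the console session above, so it should give the same two levels, "1" and "4".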
That's the gist of it. To signal to you that I've been trying to
figure this out on my own, here is a revision I've tested in R's
factor function which "seems" to fix the matter. (Of course, probably
causes lots of other problems I don't understand, that's why I'm
writing to you now.)
In the factor function, the class of f is assigned *after* levels(f) is called:
    levels(f) <- ## nl == nL or 1
        if (nl == nL) as.character(labels)
        else paste0(labels, seq_along(levels))
    class(f) <- c(if(ordered) "ordered", "factor")
At that point, f is a plain integer vector, and `levels<-` is a primitive:
> `levels<-`
function (x, value) .Primitive("levels<-")
That's what generates the error. I don't understand well what
.Primitive means here. I need to walk past that detail.
Suppose I revise the factor function to put the class(f) line before
the levels(f) assignment. Then `levels<-.factor` is called and all seems well.
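The dispatch difference can be tried out without modifying base. The sketch below uses made-up object names (`labs`, `f`, `g`, `res_plain`) and compares a bare integer vector with one whose class is set first. In the R version discussed here the bare-integer call is the one that raises the "duplicated" error; the `tryCatch()` simply records whatever happens rather than assuming it.

```r
labs <- c("1", NA, NA, "4", "4", "4")

# The replacement function itself is a primitive (an internal generic).
is.primitive(`levels<-`)

# Bare integer vector: no class attribute, so no S3 method is dispatched.
# tryCatch() records either "ok" or the error message, whichever occurs.
f <- 1:6
res_plain <- tryCatch({ levels(f) <- labs; "ok" },
                      error = function(e) conditionMessage(e))
res_plain

# Same codes, but with the class set first: `levels<-.factor` is dispatched,
# which merges duplicated labels and drops NA labels.
g <- 1:6
class(g) <- "factor"
levels(g) <- labs
levels(g)
```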
factor <- function (x = character(), levels, labels = levels, exclude = NA,
    ordered = is.ordered(x), nmax = NA)
{
    if (is.null(x))
        x <- character()
    nx <- names(x)
    if (missing(levels)) {
        y <- unique(x, nmax = nmax)
        ind <- sort.list(y)
        levels <- unique(as.character(y)[ind])
    }
    force(ordered)
    if (!is.character(x))
        x <- as.character(x)
    levels <- levels[is.na(match(levels, exclude))]
    f <- match(x, levels)
    if (!is.null(nx))
        names(f) <- nx
    nl <- length(labels)
    nL <- length(levels)
    if (!any(nl == c(1L, nL)))
        stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
            nl, nL), domain = NA)
    ## class() moved up 3 rows
    class(f) <- c(if (ordered) "ordered", "factor")
    levels(f) <- if (nl == nL)
        as.character(labels)
    else paste0(labels, seq_along(levels))
    f
}
> assignInNamespace("factor", factor, "base")
> x <- 1:6
> xlevels <- 1:6
> xlabels <- c(1, NA, NA, 4, 4, 4)
> y <- factor(x, levels = xlevels, labels = xlabels)
> y
[1] 1 <NA> <NA> 4 4 4
Levels: 1 4
> attributes(y)
$class
[1] "factor"

$levels
[1] "1" "4"
That's a "good" answer for me.
But I broke your function. I eliminated the check for duplicated levels.
> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
> y
[1] 1 4 <NA> <NA> <NA> <NA>
Levels: 1 4
Rather than have factor return the "duplicated levels" error when
there are duplicated values in labels, I wonder why it is not better
to have a check for duplicated levels directly. For example, insert a
new else in this stanza:
if (missing(levels)) {
    y <- unique(x, nmax = nmax)
    ind <- sort.list(y)
    levels <- unique(as.character(y)[ind])
} ## next is new part
else {
    levels <- unique(levels)
}
That will cause an error when there are duplicated levels because
there are more labels than levels:
> y <- factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels)
Error in factor(x, levels = c(1, 1, 1, 2, 2, 2), labels = xlabels) :
invalid 'labels'; length 6 should be 1 or 2
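The length mismatch reports the problem only indirectly. A more explicit check on the levels could look like the sketch below. `check_levels` is a hypothetical helper, not an existing base function, and its message merely imitates the wording of the current error.

```r
# Hypothetical helper (not in base R): fail early, with a direct message,
# when the supplied levels contain duplicates.
check_levels <- function(levels) {
    d <- anyDuplicated(levels)   # index of the first duplicate, 0 if none
    if (d > 0L)
        stop(sprintf("factor level [%d] is duplicated", d), call. = FALSE)
    invisible(levels)
}

ok  <- check_levels(1:3)
msg <- tryCatch(check_levels(c(1, 1, 1, 2, 2, 2)),
                error = function(e) conditionMessage(e))
msg
```

Such a check would report duplicated levels by position, independently of how the labels are later merged.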
So, in conclusion, if levels() can work after creating a factor, I
wish an equivalent labels argument would be accepted. What is your
Paul E. Johnson
Director, Center for Research Methods and Data Analysis
To write to me directly, please address me at pauljohn at