Hi,

Given factors x and y, c(x,y) does not seem to return a useful result:

> x
[1] a b c d e
Levels: a b c d e
> y
[1] d e f g h
Levels: d e f g h
> c(x,y)
[1] 1 2 3 4 5 1 2 3 4 5
>

Is there a case for a new method c.factor as follows? Does something similar exist already? Is there a better way to write the function?

> c.factor = function(x,y)
{
    newlevels = union(levels(x),levels(y))
    m = match(levels(y), newlevels)
    ans = c(unclass(x),m[unclass(y)])
    levels(ans) = newlevels
    class(ans) = "factor"
    ans
}
> c(x,y)
[1] a b c d e d e f g h
Levels: a b c d e f g h
> as.integer(c(x,y))
[1] 1 2 3 4 5 4 5 6 7 8
>

Regards,
Matthew

> version
               _
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          2
minor          4.0
year           2006
month          10
day            03
svn rev        39566
language       R
version.string R version 2.4.0 (2006-10-03)
On Tue, 2006-11-14 at 16:36 +0000, Matthew Dowle wrote:
> Hi,
>
> Given factors x and y, c(x,y) does not seem to return a useful result :
> > x
> [1] a b c d e
> Levels: a b c d e
> > y
> [1] d e f g h
> Levels: d e f g h
> > c(x,y)
> [1] 1 2 3 4 5 1 2 3 4 5
> >
>
> Is there a case for a new method c.factor as follows? Does something
> similar exist already? Is there a better way to write the function?
>
> > c.factor = function(x,y)
> {
>     newlevels = union(levels(x),levels(y))
>     m = match(levels(y), newlevels)
>     ans = c(unclass(x),m[unclass(y)])
>     levels(ans) = newlevels
>     class(ans) = "factor"
>     ans
> }
> > c(x,y)
> [1] a b c d e d e f g h
> Levels: a b c d e f g h
> > as.integer(c(x,y))
> [1] 1 2 3 4 5 4 5 6 7 8
> >
>
> Regards,
> Matthew

I'll defer to others as to whether or not there is a basis for c.factor,
however:

c.factor <- function(...)
{
  args <- list(...)

  # this could be optional
  if (!all(sapply(args, is.factor)))
    stop("All arguments must be factors")

  factor(unlist(lapply(args, function(x) as.character(x))))
}

x <- factor(letters[1:5])
y <- factor(letters[4:8])
z <- factor(letters[9:14])

> x
[1] a b c d e
Levels: a b c d e

> y
[1] d e f g h
Levels: d e f g h

> z
[1] i j k l m n
Levels: i j k l m n

> c(x, y)
 [1] a b c d e d e f g h
Levels: a b c d e f g h

> c(x, y, z)
 [1] a b c d e d e f g h i j k l m n
Levels: a b c d e f g h i j k l m n

> c(x, 1:5)
Error in c.factor(x, 1:5) : All arguments must be factors

HTH,

Marc Schwartz
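The two variants posted so far agree on the example data, but they need not produce the same level order in general: factor() sorts the levels of its result, whereas the code-based version keeps levels in order of first appearance across the arguments. The following is a minimal sketch of that difference, not either author's code; the variable names (via_character, via_codes) are made up for illustration, and c() on the factors themselves is deliberately avoided.

```r
# Two small factors whose combined levels are not in sorted order.
x <- factor(c("d", "e"))
y <- factor(c("a", "b"))

# Character-based approach: factor() re-sorts the levels of the result.
via_character <- factor(c(as.character(x), as.character(y)))

# Code-based approach: remap y's integer codes onto the union of the levels,
# which keeps the levels in order of first appearance.
new_levels <- union(levels(x), levels(y))
codes <- c(as.integer(x), match(levels(y), new_levels)[as.integer(y)])
via_codes <- structure(codes, levels = new_levels, class = "factor")

levels(via_character)  # "a" "b" "d" "e"
levels(via_codes)      # "d" "e" "a" "b"
```

Both results hold the same values; only the level order, and therefore the integer codes, differ.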
Prof Ripley,

> Well, R has managed without a factor method for c() for most of its decade
> of existence (not that it originally had factors as we know them).

R has managed without other things too for most of its decade. For example, row names in data frames have very recently been made efficient. That is an example how R was managing for a decade but an improvement has still been made. As we become aware of what we believe is missing in R, I believe the correct approach, the approach you advocate, is to contribute back to the list. This is what I did. I also contributed a potential solution in the form of working source code.

I stand by my statement that the current result of c(x,y) when x and y are factors is not useful. It is a specific statement about a specific operation, not any general criticism of R.

I agree with you that factors are best viewed as an enumeration type, but I would argue further that c() of 2 enumerated types should return an enumerated type, retaining the powerful feature of enumerated types in R. However, currently R ignores the fact that x and y are enumerated. It silently ignores the levels information, and returns an integer vector whose integers are, well, not useful. Or, if you prefer, not as useful as the proposal I posted.

I have a solution which works for me, and I have contributed it. One other person has shown some interest, and taken it further to work with multiple arguments which looks like a nice improvement.

The only thing I would comment, if c.factor does go further, is to please avoid the use of as.character in the implementation. One key advantage of the factor type is precisely that it is enumerated, and therefore is efficient for categorical data sets. Intermediate coercion to character is inefficient in this case, which is why I avoided it in the solution I posted.
Regards,
Matthew

> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> Sent: 14 November 2006 18:23
> To: Marc Schwartz
> Cc: Matthew Dowle; r-devel at r-project.org
> Subject: Re: [Rd] c.factor
>
>
> Well, R has managed without a factor method for c() for most of its
> decade of existence (not that it originally had factors as we know them).
>
> I would argue that factors are best viewed as an enumeration type, and
> anything which silently changes their level set is a bad idea.  I can see
> a case for a c() method for factors that combines factors with the same
> level sets, but I can also see this is best done by users who know the
> level sets are same (c.factor would have to expend a considerable effort
> to check).
>
> You also need to consider the dispatch rules.  c.factor will be called
> whenever the first argument is a factor, whatever the others are.  S4 (I
> think, definitely S4-based versions of S-PLUS) has an alternative concat()
> that works differently (recursively) and seems a more natural model.
>
>
> On Tue, 14 Nov 2006, Marc Schwartz wrote:
>
> > On Tue, 2006-11-14 at 11:51 -0600, Marc Schwartz wrote:
> >> On Tue, 2006-11-14 at 16:36 +0000, Matthew Dowle wrote:
> >>> Hi,
> >>>
> >>> Given factors x and y, c(x,y) does not seem to return a useful
> >>> result :
> >>>> x
> >>> [1] a b c d e
> >>> Levels: a b c d e
> >>>> y
> >>> [1] d e f g h
> >>> Levels: d e f g h
> >>>> c(x,y)
> >>> [1] 1 2 3 4 5 1 2 3 4 5
> >>>>
> >>>
> >>> Is there a case for a new method c.factor as follows?  Does
> >>> something similar exist already?  Is there a better way to write the
> >>> function?
> >>>
> >>>> c.factor = function(x,y)
> >>> {
> >>>     newlevels = union(levels(x),levels(y))
> >>>     m = match(levels(y), newlevels)
> >>>     ans = c(unclass(x),m[unclass(y)])
> >>>     levels(ans) = newlevels
> >>>     class(ans) = "factor"
> >>>     ans
> >>> }
> >>>> c(x,y)
> >>> [1] a b c d e d e f g h
> >>> Levels: a b c d e f g h
> >>>> as.integer(c(x,y))
> >>> [1] 1 2 3 4 5 4 5 6 7 8
> >>>>
> >>>
> >>> Regards,
> >>> Matthew
> >>
> >> I'll defer to others as to whether or not there is a basis for
> >> c.factor,
> >> however:
> >>
> >> c.factor <- function(...)
> >> {
> >>   args <- list(...)
> >>
> >>   # this could be optional
> >>   if (!all(sapply(args, is.factor)))
> >>     stop("All arguments must be factors")
> >>
> >>   factor(unlist(lapply(args, function(x) as.character(x))))
> >> }
> >
> >
> > That last line can even be cleaned up, as I was doing something else
> > initially:
> >
> > c.factor <- function(...)
> > {
> >   args <- list(...)
> >
> >   if (!all(sapply(args, is.factor)))
> >     stop("All arguments must be factors")
> >
> >   factor(unlist(lapply(args, as.character)))
> > }
> >
> >
> > Marc
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>
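The dispatch point raised in the quoted message (an S3 method is selected by the class of the first argument only) can be illustrated without touching c() itself. The sketch below uses a toy generic; the names combine, combine.default and combine.factor are hypothetical and exist only for this illustration.

```r
# Toy S3 generic (hypothetical name) that dispatches on its first argument,
# in the same way an S3 method for c() would.
combine <- function(...) UseMethod("combine")

# Methods that only report which one was selected.
combine.default <- function(...) "default method"
combine.factor  <- function(...) "factor method"

f <- factor(c("a", "b"))

# First argument is a factor, so the factor method runs even though
# the second argument is an integer vector.
first_is_factor <- combine(f, 1:5)

# First argument is an integer vector, so the default method runs even
# though a factor appears later in the argument list.
first_is_integer <- combine(1:5, f)
```

A factor method therefore has to decide for itself what to do with non-factor arguments, which is what the is.factor check in the posted c.factor does by stopping with an error.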
> Just for clarification, my interest was only to provide an
> alternative that provided for a more generic approach, at
> least in a narrow application, not that I was advocating its
> need.

Understood, apologies for falsely implying your advocation.

> I would agree with Prof. Ripley's comments, that one needs to
> be quite careful in manipulating factors in this fashion, not
> the least of which would be ordered factors, which have not
> been considered here and could/would introduce their own
> idiosyncrasies in this context.

I agree with that of course.

> If something like this gets implemented (ie. concat()) one
> would need to fully understand the behavioral subtleties and
> the nature of the results, lest naive users find themselves
> in the potential presence of cascading errors.

Agree with that too.
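One conservative way to respect both concerns raised in this thread (no silent change of the level set, and no surprises with ordered factors) is to combine factors only when their levels match exactly. The following is a rough sketch under that assumption, not a proposal from any of the posters; the helper name combine_same_levels is hypothetical.

```r
# Hypothetical helper: concatenate factors only if every argument has
# exactly the same levels in exactly the same order.
combine_same_levels <- function(...) {
  args <- list(...)
  if (length(args) == 0L || !all(vapply(args, is.factor, logical(1))))
    stop("all arguments must be factors")

  lev <- levels(args[[1]])
  same <- vapply(args, function(f) identical(levels(f), lev), logical(1))
  if (!all(same))
    stop("all arguments must have identical levels, in the same order")

  # With identical level sets the integer codes are directly comparable,
  # so they can be concatenated without any remapping.
  codes <- unlist(lapply(args, as.integer), use.names = FALSE)

  # The result is ordered only if every input was ordered.
  all_ordered <- all(vapply(args, is.ordered, logical(1)))
  cls <- if (all_ordered) c("ordered", "factor") else "factor"
  structure(codes, levels = lev, class = cls)
}

sizes <- c("small", "medium", "large")
a <- factor(c("small", "large"), levels = sizes, ordered = TRUE)
b <- factor(c("medium", "medium"), levels = sizes, ordered = TRUE)
ab <- combine_same_levels(a, b)
```

The design choice here is to fail loudly rather than guess: when the level sets differ, the caller has to decide how to reconcile them before combining.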
I just noticed that a new feature in R 2.4 is that unlist of a list of factors already does the operation that I proposed :

> x = factor(letters[1:5])
> y = factor(letters[4:8])
> unlist(list(x,y))
 [1] a b c d e d e f g h
Levels: a b c d e f g h
>

Therefore, does it not make sense that c(x,y) should return the same as unlist(list(x,y)) ?

Also, the specific "if" for factors inside the definition of unlist, not surprisingly, uses a very similar method to those previously posted. However, it first coerces the factors with as.character, before matching to the new level set. This is inefficient. Here is the c.factor method again that I proposed, which avoids the as.character and is therefore more efficient. Leaving aside the discussion about c.factor, or concat, or whatever, could 'unlist' be changed to use this method instead ? After all one of the key advantages of factors is to save main memory, anything which coerces back to character is going to defeat the benefit.

> c.factor = function(...)
{
    args <- list(...)
    if (!all(sapply(args, is.factor))) stop("all arguments must be factor")
    newlevels = unique(unlist(lapply(args,levels)))
    ans = unlist(lapply(args, function(x) {
        m = match(levels(x), newlevels)
        m[as.integer(x)]
    }))
    levels(ans) = newlevels
    class(ans) = "factor"
    ans
}
> identical(c(x,y), unlist(list(x,y)))
[1] TRUE
> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          4.0
year           2006
month          10
day            03
svn rev        39566
language       R
version.string R version 2.4.0 (2006-10-03)
>

"Brian Ripley" <ripley at stats.ox.ac.uk> wrote in message
news:Pine.LNX.4.64.0611150926070.19618 at auk.stats...
> On Tue, 14 Nov 2006, Bill Dunlap wrote:
>
>> On Tue, 14 Nov 2006, Prof Brian Ripley wrote:
>>
>>> Well, R has managed without a factor method for c() for most of its
>>> decade
>>> of existence (not that it originally had factors as we know them).
>>>
>>> I would argue that factors are best viewed as an enumeration type, and
>>> anything which silently changes their level set is a bad idea.  I can
>>> see
>>> a case for a c() method for factors that combines factors with the same
>>> level sets, but I can also see this is best done by users who know the
>>> level sets are same (c.factor would have to expend a considerable effort
>>> to check).
>>>
>>> You also need to consider the dispatch rules.  c.factor will be called
>>> whenever the first argument is a factor, whatever the others are.  S4 (I
>>> think, definitely S4-based versions of S-PLUS) has an alternative
>>> concat()
>>> that works differently (recursively) and seems a more natural model.
>>
>> In addition, c() has always had a double meaning of
>>   (a) turning an object into a simple "vector" (an object
>>       without "attributes"), as in
>>       > c(factor(c("Cat","Dog","Cat")))
>>       [1] 1 2 1
>>       > c(data.frame(x=1:2,y=c("Dog","Cat")))
>>       $x
>>       [1] 1 2
>>
>>       $y
>>       [1] Dog Cat
>>       Levels: Cat Dog
>
> To my surprise that was not documented at all on the R help page, and I've
> clarified it.  (BTW, at least in R it does not remove names, just all
> other attributes.)
>
>>   (b) concatenating several such vectors into one.
>>
>> The proposed c.factor does only (b).
>
> (Strictly not, as a factor is not a vector.)
>
> But the help page explicitly only describes the default method, and some
> of the other methods do preserve some attributes, AFAIR.
>
>> Should we just
>> throw c() into the ash heap and use as.vector() or
>> concat() instead?
>>
>> The whole concept of concatenating objects of disparate
>> types is suspect.
>
> I think working on a concat() for R would be helpful.  I vaguely recalled
> something like it in the Green Book, but the index does not help (but then
> it is not very complete).
>
> Brian
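The remark that the default c() method drops all attributes except names can be checked on a plain vector. The sketch below deliberately uses an integer vector rather than a factor, because the result of c() on factors depends on the R version (as far as I know, a factor method was added to base R in 4.1.0). The "units" attribute is made up for the illustration.

```r
# A plain integer vector carrying names plus one extra attribute
# (the "units" attribute is invented for this example).
v <- structure(1:3, names = c("a", "b", "c"), units = "cm")

# Passing a single object through the default c() method keeps the names
# but removes every other attribute.
w <- c(v)

names(w)       # the names are still there
attributes(w)  # only the names attribute remains
```

This is meaning (a) in the quoted message: c() used on one object as a way of reducing it to a simple vector, as opposed to meaning (b), concatenating several objects.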