thr3ads.net - R devel - [Rd] c.factor [Nov 2006]


Matthew Dowle

2006-Nov-14 16:36 UTC

[Rd] c.factor

Hi,

Given factors x and y, c(x,y) does not seem to return a useful result:

> x
[1] a b c d e
Levels: a b c d e
> y
[1] d e f g h
Levels: d e f g h
> c(x,y)
 [1] 1 2 3 4 5 1 2 3 4 5

Is there a case for a new method c.factor as follows?  Does something
similar exist already?  Is there a better way to write the function?

> c.factor = function(x,y)
{
    newlevels = union(levels(x),levels(y))
    m = match(levels(y), newlevels)
    ans = c(unclass(x),m[unclass(y)])
    levels(ans) = newlevels
    class(ans) = "factor"
    ans
}
> c(x,y)
 [1] a b c d e d e f g h
Levels: a b c d e f g h
> as.integer(c(x,y))
 [1] 1 2 3 4 5 4 5 6 7 8

Regards,
Matthew

> version
               _                           
platform       x86_64-unknown-linux-gnu    
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          2                           
minor          4.0                         
year           2006                        
month          10                          
day            03                          
svn rev        39566                       
language       R                           
version.string R version 2.4.0 (2006-10-03)
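
To make the remapping step concrete, here is a minimal trace of the intermediate values used by the function above. It assumes x and y were built as factor(letters[1:5]) and factor(letters[4:8]) (the post shows only their printed form), and it does not redefine c(); the variable names are illustrative only.

```r
# Assumed construction of the two factors shown in the post.
x <- factor(letters[1:5])
y <- factor(letters[4:8])

# Combined level set, in order of first appearance.
newlevels <- union(levels(x), levels(y))

# For each level of y, its position in the combined level set.
m <- match(levels(y), newlevels)

# y's integer codes re-expressed against the combined level set.
y_codes <- m[as.integer(y)]

# Attach the new level set to the concatenated codes.
combined <- structure(c(as.integer(x), y_codes),
                      levels = newlevels, class = "factor")

print(m)
print(combined)
```

Only integer codes and the two small level vectors are touched; the data values themselves are never converted to character.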

Marc Schwartz

2006-Nov-14 17:51 UTC


[Rd] c.factor

On Tue, 2006-11-14 at 16:36 +0000, Matthew Dowle wrote:
> Hi,
> 
> Given factors x and y,  c(x,y) does not seem to return a useful result :
> > x
> [1] a b c d e
> Levels: a b c d e
> > y
> [1] d e f g h
> Levels: d e f g h
> > c(x,y)
>  [1] 1 2 3 4 5 1 2 3 4 5
> > 
> 
> Is there a case for a new method c.factor as follows?  Does something
> similar exist already?  Is there a better way to write the function?
> 
> > c.factor = function(x,y)
> {
>     newlevels = union(levels(x),levels(y))
>     m = match(levels(y), newlevels)
>     ans = c(unclass(x),m[unclass(y)])
>     levels(ans) = newlevels
>     class(ans) = "factor"
>     ans
> }
> > c(x,y)
>  [1] a b c d e d e f g h
> Levels: a b c d e f g h
> > as.integer(c(x,y))
>  [1] 1 2 3 4 5 4 5 6 7 8
> >
> 
> Regards,
> Matthew
I'll defer to others as to whether or not there is a basis for c.factor,
however:

c.factor <- function(...)
{
  args <- list(...)

  # this could be optional
  if (!all(sapply(args, is.factor)))
   stop("All arguments must be factors")

  factor(unlist(lapply(args, function(x) as.character(x))))
}


x <- factor(letters[1:5])
y <- factor(letters[4:8])
z <- factor(letters[9:14])
> x
[1] a b c d e
Levels: a b c d e
> y
[1] d e f g h
Levels: d e f g h
> z
[1] i j k l m n
Levels: i j k l m n

> c(x, y)
 [1] a b c d e d e f g h
Levels: a b c d e f g h

> c(x, y, z)
 [1] a b c d e d e f g h i j k l m n
Levels: a b c d e f g h i j k l m n

> c(x, 1:5)
Error in c.factor(x, 1:5) : All arguments must be factors
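
One thing worth checking with the as.character() route is what happens to the level set itself. Below is a small sketch, assuming factors whose levels are not in alphabetical order and which include an unused level. `via_character` and `via_codes` are hypothetical helper names, not functions from this thread or from base R; the second follows the integer-code idea from the original post.

```r
# Rebuild the factor from character values (the approach shown above).
via_character <- function(...) {
  args <- list(...)
  factor(unlist(lapply(args, as.character)))
}

# Remap integer codes against the union of the level sets.
via_codes <- function(...) {
  args <- list(...)
  newlevels <- unique(unlist(lapply(args, levels)))
  codes <- unlist(lapply(args, function(f) {
    match(levels(f), newlevels)[as.integer(f)]
  }))
  structure(codes, levels = newlevels, class = "factor")
}

x <- factor(c("low", "high"), levels = c("low", "mid", "high"))
y <- factor("extreme")

levels(via_character(x, y))  # rebuilt by factor(): sorted, unused "mid" dropped
levels(via_codes(x, y))      # original order kept, "mid" retained
```

Both helpers return the same values; they differ only in which level set the result carries, which matters if level order or unused levels are meaningful downstream.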


HTH,

Marc Schwartz

Matthew Dowle

2006-Nov-15 12:51 UTC


[Rd] c.factor

Prof Ripley,
> Well, R has managed without a factor method for c() for most of its
> decade of existence (not that it originally had factors as we know them).

R has managed without other things too for most of its decade. For
example, row names in data frames have very recently been made
efficient. That is an example of how R managed for a decade, and yet an
improvement was still made. As we become aware of what we believe
is missing in R, I believe the correct approach, the approach you
advocate, is to contribute back to the list. This is what I did. I also
contributed a potential solution in the form of working source code. I
stand by my statement that the current result of c(x,y) when x and y are
factors is not useful. It is a specific statement about a specific
operation, not any general criticism of R. I agree with you that factors
are best viewed as an enumeration type, but I would argue further that
c() of 2 enumerated types should return an enumerated type, retaining
the powerful feature of enumerated types in R. However, currently R
ignores the fact that x and y are enumerated. It silently ignores the
levels information, and returns an integer vector whose integers are,
well, not useful. Or, if you prefer, not as useful as the proposal I
posted.

I have a solution which works for me, and I have contributed it. One
other person has shown some interest, and taken it further to work with
multiple arguments which looks like a nice improvement.

The only thing I would comment, if c.factor does go further, is to
please avoid the use of as.character in the implementation. One key
advantage of the factor type is precisely that it is enumerated, and
therefore is efficient for categorical data sets. Intermediate coercion
to character is inefficient in this case, which is why I avoided it in
the solution I posted.
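
The efficiency argument rests on how a factor is stored: one integer code per element plus a single copy of the level labels. A quick way to see this, as a sketch on made-up data (the vector size and the labels are arbitrary assumptions):

```r
set.seed(1)
labels <- c("north", "south", "east", "west")
f <- factor(sample(labels, 1e5, replace = TRUE), levels = labels)

typeof(f)        # storage type of the underlying codes
str(unclass(f))  # integer codes with a "levels" attribute

# Memory held by the factor versus the same data coerced to character.
# Exact figures depend on platform and R version, so none are quoted here.
print(object.size(f))
print(object.size(as.character(f)))
```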

Regards,
Matthew

> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk] 
> Sent: 14 November 2006 18:23
> To: Marc Schwartz
> Cc: Matthew Dowle; r-devel at r-project.org
> Subject: Re: [Rd] c.factor
> 
> 
> Well, R has managed without a factor method for c() for most of its
> decade of existence (not that it originally had factors as we know them).
>
> I would argue that factors are best viewed as an enumeration type, and
> anything which silently changes their level set is a bad idea.  I can see
> a case for a c() method for factors that combines factors with the same
> level sets, but I can also see this is best done by users who know the
> level sets are same (c.factor would have to expend a considerable effort
> to check).
>
> You also need to consider the dispatch rules.  c.factor will be called
> whenever the first argument is a factor, whatever the others are. S4 (I
> think, definitely S4-based versions of S-PLUS) has an alternative concat()
> that works differently (recursively) and seems a more natural model.
> 
> 
> On Tue, 14 Nov 2006, Marc Schwartz wrote:
> 
> > On Tue, 2006-11-14 at 11:51 -0600, Marc Schwartz wrote:
> >> On Tue, 2006-11-14 at 16:36 +0000, Matthew Dowle wrote:
> >>> Hi,
> >>>
> >>> Given factors x and y,  c(x,y) does not seem to return a useful
> >>> result :
> >>>> x
> >>> [1] a b c d e
> >>> Levels: a b c d e
> >>>> y
> >>> [1] d e f g h
> >>> Levels: d e f g h
> >>>> c(x,y)
> >>>  [1] 1 2 3 4 5 1 2 3 4 5
> >>>>
> >>>
> >>> Is there a case for a new method c.factor as follows?  Does
> >>> something similar exist already?  Is there a better way to write the
> >>> function?
> >>>
> >>>> c.factor = function(x,y)
> >>> {
> >>>     newlevels = union(levels(x),levels(y))
> >>>     m = match(levels(y), newlevels)
> >>>     ans = c(unclass(x),m[unclass(y)])
> >>>     levels(ans) = newlevels
> >>>     class(ans) = "factor"
> >>>     ans
> >>> }
> >>>> c(x,y)
> >>>  [1] a b c d e d e f g h
> >>> Levels: a b c d e f g h
> >>>> as.integer(c(x,y))
> >>>  [1] 1 2 3 4 5 4 5 6 7 8
> >>>>
> >>>
> >>> Regards,
> >>> Matthew
> >>
> >> I'll defer to others as to whether or not there is a basis for
> >> c.factor,
> >> however:
> >>
> >> c.factor <- function(...)
> >> {
> >>   args <- list(...)
> >>
> >>   # this could be optional
> >>   if (!all(sapply(args, is.factor)))
> >>    stop("All arguments must be factors")
> >>
> >>   factor(unlist(lapply(args, function(x) as.character(x))))
> >> }
> >
> >
> > That last line can even be cleaned up, as I was doing something else
> > initially:
> >
> > c.factor <- function(...)
> > {
> >  args <- list(...)
> >
> >  if (!all(sapply(args, is.factor)))
> >   stop("All arguments must be factors")
> >
> >  factor(unlist(lapply(args, as.character)))
> > }
> >
> >
> > Marc
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list 
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> 
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 
>
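
The dispatch point in the quoted message can be seen with a toy S3 generic. This is only a sketch: `combine` is a hypothetical generic defined here so that nothing in base R is overridden, and it stands in for how an S3 method on c() would be selected.

```r
# A hypothetical generic with the same (...) signature as c().
combine <- function(...) UseMethod("combine")

combine.factor  <- function(...) "factor method"
combine.default <- function(...) "default method"

x <- factor(letters[1:5])

combine(x, 1:5)  # first argument is a factor
combine(1:5, x)  # first argument is an integer vector
```

Only the class of the first argument decides which method runs, so a factor method has to cope with, or reject, whatever follows it, and it is never reached when the factor is not first.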

Matthew Dowle

2006-Nov-15 15:30 UTC


[Rd] c.factor

> Just for clarification, my interest was only to provide an
> alternative that provided for a more generic approach, at
> least in a narrow application, not that I was advocating its
> need.

Understood, apologies for falsely implying your advocacy.

> I would agree with Prof. Ripley's comments, that one needs to
> be quite careful in manipulating factors in this fashion, not
> the least of which would be ordered factors, which have not
> been considered here and could/would introduce their own
> idiosyncrasies in this context.

I agree with that of course.

> If something like this gets implemented (ie. concat()) one
> would need to fully understand the behavioral subtleties and
> the nature of the results, lest naive users find themselves
> in the potential presence of cascading errors.

Agree with that too.
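
The ordered-factor caveat can be made concrete. In the sketch below, two ordered factors share the same labels but rank them in opposite directions, so a union of level sets would have to pick one ordering. `combine_unordered` is a hypothetical helper that simply refuses ordered input rather than guessing, and it leans on unlist() for the actual combining.

```r
combine_unordered <- function(...) {
  args <- list(...)
  if (!all(sapply(args, is.factor)))
    stop("all arguments must be factors")
  if (any(sapply(args, is.ordered)))
    stop("ordered factors are not supported: level orderings may conflict")
  unlist(args)
}

x <- factor(c("low", "high"), levels = c("low", "high"), ordered = TRUE)
y <- factor(c("low", "high"), levels = c("high", "low"), ordered = TRUE)

# The two orderings disagree, so no merged ordering is obviously right.
x[1] < x[2]
y[1] < y[2]

# Capture the error message instead of stopping the script.
res <- tryCatch(combine_unordered(x, y),
                error = function(e) conditionMessage(e))
print(res)
```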

Matthew Dowle

2006-Nov-22 16:29 UTC


[Rd] c.factor

I just noticed that a new feature in R 2.4 is that unlist of a list of
factors already does the operation that I proposed:

> x = factor(letters[1:5])
> y = factor(letters[4:8])
> unlist(list(x,y))
 [1] a b c d e d e f g h
Levels: a b c d e f g h

Therefore, does it not make sense that c(x,y) should return the same as
unlist(list(x,y))?

Also, the specific "if" for factors inside the definition of unlist, not
surprisingly, uses a very similar method to those previously posted.
However, it first coerces the factors with as.character, before matching to
the new level set. This is inefficient. Here is the c.factor method again
that I proposed, which avoids the as.character and is therefore more
efficient. Leaving aside the discussion about c.factor, or concat, or
whatever, could 'unlist' be changed to use this method instead? After
all, one of the key advantages of factors is to save main memory; anything
which coerces back to character is going to defeat the benefit.

> c.factor = function(...) {
    args <- list(...)
    if (!all(sapply(args, is.factor))) stop("all arguments must be factor")
    newlevels = unique(unlist(lapply(args,levels)))
    ans = unlist(lapply(args, function(x) {
        m = match(levels(x), newlevels)
        m[as.integer(x)]
    }))
    levels(ans) = newlevels
    class(ans) = "factor"
    ans
}
> identical(c(x,y), unlist(list(x,y)))
[1] TRUE
> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          4.0
year           2006
month          10
day            03
svn rev        39566
language       R
version.string R version 2.4.0 (2006-10-03)
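
The comparison with unlist() can be repeated without masking c(). A sketch under that assumption: `combine_factors` is a hypothetical name carrying the same integer-code logic as the function above, and `z` is an extra factor, including a missing value, added purely for illustration.

```r
combine_factors <- function(...) {
  args <- list(...)
  if (!all(sapply(args, is.factor)))
    stop("all arguments must be factors")
  # Union of all level sets, in order of first appearance.
  newlevels <- unique(unlist(lapply(args, levels)))
  # Remap each factor's codes onto the combined level set.
  codes <- unlist(lapply(args, function(f) {
    m <- match(levels(f), newlevels)
    m[as.integer(f)]
  }))
  structure(codes, levels = newlevels, class = "factor")
}

x <- factor(letters[1:5])
y <- factor(letters[4:8])
z <- factor(c("a", NA, "z"))

res <- combine_factors(x, y, z)
ref <- unlist(list(x, y, z))

# Compare labels, level sets and codes separately.
identical(as.character(res), as.character(ref))
identical(levels(res), levels(ref))
identical(as.integer(res), as.integer(ref))
```

Checking the three components separately makes it easier to see where the two results diverge, should they ever differ on a given R version.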

"Brian Ripley" <ripley at stats.ox.ac.uk> wrote in message
news:Pine.LNX.4.64.0611150926070.19618 at auk.stats...
> On Tue, 14 Nov 2006, Bill Dunlap wrote:
>
>> On Tue, 14 Nov 2006, Prof Brian Ripley wrote:
>>
>>> Well, R has managed without a factor method for c() for most of its
>>> decade of existence (not that it originally had factors as we know
>>> them).
>>>
>>> I would argue that factors are best viewed as an enumeration type, and
>>> anything which silently changes their level set is a bad idea. I can
>>> see a case for a c() method for factors that combines factors with the
>>> same level sets, but I can also see this is best done by users who know
>>> the level sets are same (c.factor would have to expend a considerable
>>> effort to check).
>>>
>>> You also need to consider the dispatch rules. c.factor will be called
>>> whenever the first argument is a factor, whatever the others are. S4
>>> (I think, definitely S4-based versions of S-PLUS) has an alternative
>>> concat() that works differently (recursively) and seems a more natural
>>> model.
>>
>> In addition, c() has always had a double meaning of
>> (a) turning an object into a simple "vector" (an object
>> without "attributes"), as in
>> > c(factor(c("Cat","Dog","Cat")))
>> [1] 1 2 1
>> > c(data.frame(x=1:2,y=c("Dog","Cat")))
>> $x
>> [1] 1 2
>>
>> $y
>> [1] Dog Cat
>> Levels: Cat Dog
>
> To my surprise that was not documented at all on the R help page, and
> I've clarified it. (BTW, at least in R it does not remove names, just all
> other attributes.)
>
>> (b) concatenating several such vectors into one.
>>
>> The proposed c.factor does only (b).
>
> (Strictly not, as a factor is not a vector.)
>
> But the help page explicitly only describes the default method, and
> some of the other methods do preserve some attributes, AFAIR.
>
>> Should we just
>> throw c() into the ash heap and use as.vector() or
>> concat() instead?
>>
>> The whole concept of concatenating objects of disparate
>> types is suspect.
>
> I think working on a concat() for R would be helpful. I vaguely
> recalled something like it in the Green Book, but the index does not
> help (but then it is not very complete).
>
> Brian
