thr3ads.net - R help - [R] Opinion: Why I find factors convenient to use [Aug 2012]

Bert Gunter

2012-Aug-17 17:32 UTC

[R] Opinion: Why I find factors convenient to use

Folks:

Over the years, many people -- including some who I would consider
real expeRts -- have criticized factors and advocated the use
(sometimes exclusively) of character vectors instead. I would just
like to point out that, for me, factors provide one feature that I
find to be very convenient: ordering of levels. **

As an example, suppose one has a character vector of labels "small",
"medium", and "large". Then most R functions (e.g. tapply()) will
display results involving this vector in alphabetical order, which I
think most would view as undesirable. By converting to a factor with
levels in the logical order, displays will automatically be "logical."
For example:
> x <- sample(c("small","medium","large"),12,rep=TRUE)
> table(x)
x
 large medium  small
     2      3      7
> y <- factor(x,lev=c("small","medium","large"))
##ordered() also would do, but is not necessary for this
> table(y)
y
 small medium  large
     7      3      2
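
The same ordering effect applies to tapply(), which the paragraph above mentions. A minimal sketch, assuming made-up data (the names size, value, and size_f are illustrative and not from the original post):

```r
# Hypothetical data: a size label and a made-up measurement per item.
size  <- c("small", "large", "medium", "small", "large", "medium")
value <- c(1, 9, 5, 3, 11, 7)

# Character labels: tapply() groups them in sorted (alphabetical) order.
by_chr <- tapply(value, size, mean)
names(by_chr)

# Factor with explicit levels: groups follow the level order instead.
size_f <- factor(size, levels = c("small", "medium", "large"))
by_fac <- tapply(value, size_f, mean)
names(by_fac)
```

Only the order of the groups changes; the group means themselves are the same in both results.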

Naturally, this is just my opinion, and I understand why lots of smart
people find factors irritating (at least!). So contrary opinions
cheerily welcomed. But perhaps these comments might be helpful to
those who have been "bitten" by factors or just wonder what all the
fuss is about.

** Another advantage is reduced storage space, I believe. Please
correct if wrong.

Cheers,
Bert

-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

PIKAL Petr

2012-Aug-17 17:53 UTC

I second Bert's opinion. Factors can be confusing, but they have quite
nice features which cannot be easily mimicked by plain character vectors. I
find the ability to manipulate their levels extremely useful.
> fac<-factor(sample(letters[1:5], 20, replace=TRUE))
> fac
 [1] e e d d e e c e a e a e b b d e c c d b
Levels: a b c d e
> levels(fac)[2:4]<- "new.level"
> fac
 [1] e         e         new.level new.level e         e         new.level
 [8] e         a         e         a         e         new.level new.level
[15] new.level e         new.level new.level new.level new.level
Levels: a new.level e
Regards
Petr


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Jeff Newmiller

2012-Aug-17 17:58 UTC

I don't know if my recent post on this prompted your post, but I don't
see much to argue with in your discussion. I find factors to be useful for
managing display and some kinds of analysis.

However, I find them mostly a handicap when importing, merging, and handling
data QC. Therefore I delay conversion until late in the game... but usually I do
eventually convert in most cases.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.




Rui Barradas

2012-Aug-17 22:38 UTC

Hello,

On 17-08-2012 20:27, Bert Gunter wrote:
> ... so it may be just the way object.size() counts in the two cases, right?

Or maybe the way character vectors and factors are coded.
(64 bit Windows 7 or ubuntu 12.04) 80k for the character vector seems to
be 8 * 1e4 for pointers plus room for the strings themselves, and 40k
for the factor seems more like 32 bit ints * 1e4 in consecutive memory
locations. I confess to being too lazy to go check the sources, but if
this is the case then it's another point in favor of factors: they are
indeed more efficient memory-wise.
And 64 bit OSs are becoming more and more widely used; processors aren't
becoming worse.
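
The size comparison discussed above can be repeated on any machine. A minimal sketch, assuming only base R; the seed is arbitrary, and the exact byte counts depend on the R version and on whether the build is 32-bit or 64-bit, so none are shown here:

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
x <- sample(c("small", "medium", "large"), 1e4, replace = TRUE)
y <- factor(x)

.Machine$sizeof.pointer  # 8 on a 64-bit build, 4 on a 32-bit build
typeof(y)                # a factor stores its codes as integers

size_chr <- object.size(x)  # roughly one pointer per element, plus the strings
size_fac <- object.size(y)  # roughly one 4-byte integer per element, plus levels
size_chr
size_fac
```

On a 64-bit build the factor should come out at about half the size of the character vector, which is consistent with the 80184 and 40576 byte figures quoted in this thread.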

There is also the statistical side of it. Factors are the natural way of 
coding nominal or categorical variables. The small/medium/large example 
is a good one. Or seasons, we like to see Fall or Autumn after Spring 
and Summer, not before. (btw, does anyone know why M/F?) And this has 
nothing to do with the usefulness of characters; I like persons' names
to be names, alphabetic.

I've also made a simple check. Apparently, character vectors are kept as
a vector of pointers and a vector of unique strings. If we change one of
the strings, even to something smaller that occupies fewer bytes,
object.size will report an increase in size. Try x[1] <- "a" and see the
new size of x. It's bigger, and the number of pointers to strings is the
same.
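
The check described above can be sketched as follows. This assumes that object.size() counts each distinct string only once, which matches the behaviour reported here; the variable names before and after are illustrative:

```r
# 1e4 elements but only three distinct strings.
x <- rep(c("small", "medium", "large"), length.out = 1e4)
before <- object.size(x)

# Replace one element with a shorter string that did not occur before.
x[1] <- "a"
after <- object.size(x)

length(x)          # still 1e4 elements, so the same number of pointers
length(unique(x))  # but now four distinct strings
as.numeric(after) - as.numeric(before)  # extra bytes for the new string
```

The vector did not get longer; the reported size grows because there is one more distinct string to account for.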

For 32 and 64 bit Windows 7 and for 64 bit ubuntu 12.04, R was:
 > R.version
[...]
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows

Rui Barradas

> -- Bert
>
> On Fri, Aug 17, 2012 at 11:42 AM, Peter Langfelder
> <peter.langfelder at gmail.com> wrote:
>> On Fri, Aug 17, 2012 at 11:34 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
>>> Hello,
>>>
>>> No, factors may use less memory. System dependent?
>> I think it's a 32-bit vs. 64-bit distinction - I get Rui's results on
>> 64-bit Windows and Linux installation, but Bert's result on a 32-bit
>> Linux machine.
>>
>> Peter
>>
>>>> x <- sample(c("small","medium","large"),1e4,rep=TRUE)
>>>> y <- factor(x)
>>>> object.size(x)
>>> 80184 bytes
>>>> object.size(y)
>>> 40576 bytes

Jim Lemon

2012-Aug-18 08:27 UTC

On 08/18/2012 03:32 AM, Bert Gunter wrote:
> Folks:
> ...
> So contrary opinions
> cheerily welcomed. But perhaps these comments might be helpful to
> those who have been "bitten" by factors or just wonder what all the
> fuss is about.
>

I tend to use stringsAsFactors=FALSE quite a bit, as I am often
manipulating character strings, and that

Error in strsplit(bugga, "") : non-character argument

is so annoying. Almost as annoying as printing out a list of selected 
cases with some of the fields turning up as integers rather than the 
strings I expected. That said, I often convert the results to factors so 
that some other function will work properly. So I must express my 
gratitude for motivating me to add

options(stringsAsFactors=FALSE)

to that wonderful .First function that makes my life a little happier 
every day.
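
The error mentioned above is easy to reproduce. A minimal sketch, assuming a made-up data frame df with a column id (both names are hypothetical). Since R 4.0.0 the default is stringsAsFactors = FALSE, so the sketch requests factors explicitly to mimic the 2012 behaviour:

```r
# Hypothetical data frame; ask for factor conversion explicitly.
df <- data.frame(id = c("a-1", "b-2"), stringsAsFactors = TRUE)
is.factor(df$id)

# strsplit() refuses a factor; capture the error message instead of stopping.
res <- tryCatch(strsplit(df$id, "-"),
                error = function(e) conditionMessage(e))
res

# The stored values are integer codes, which is why fields can show up as integers.
as.integer(df$id)

# Converting back to character makes string functions work again.
parts <- strsplit(as.character(df$id), "-")
parts[[1]]
```

Passing stringsAsFactors = FALSE to data.frame() or read.table() avoids the conversion in the first place, which is the effect of the option Jim sets in .First.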

Jim

S Ellison

2012-Aug-20 11:30 UTC

> -----Original Message-----
> Over the years, many people -- including some who I would 
> consider real expeRts -- have criticized factors and 
> advocated the use (sometimes exclusively) of character 
> vectors instead. 
Exclusive use of character vectors is not going to do the job.

The concept of a factor is fundamental to a lot of statistics; a programming
environment that does not implement factors and their associated special
behaviour is probably not a statistical programming language.

Special behaviours I have in mind include:
- Level order can be arbitrarily specified for display purposes
- A control level can be intentionally chosen for contrasts
- the option of "ordered" factors (for example, for polr and the like)
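
The second and third behaviours in the list above can be sketched briefly. This assumes base R with the default contrast options; the treatment names and the variables trt, trt2, and size are made up for illustration:

```r
# Hypothetical treatment labels; by default the levels are sorted alphabetically.
trt <- factor(c("drugA", "placebo", "drugB", "placebo"))
levels(trt)

# relevel() moves the chosen control level to the front, so it becomes
# the baseline under the default treatment contrasts.
trt2 <- relevel(trt, ref = "placebo")
levels(trt2)
contrasts(trt2)  # the placebo row is all zeros

# An ordered factor supports comparisons that follow the level order.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
size < "large"
```

Neither operation has a direct equivalent for a plain character vector, where comparisons fall back to alphabetical order.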

So I think the language does and will require a 'factor' type in one
form or another.

_When_ you decide to convert a character input to a factor is, of course, up to
the user, and for cleanup it's very often better to stick with character
early and convert to factor a bit later. But personally, I think that there is
sufficient control over the coding of data to allow user discretion, and on the
whole, it seems to me that character input gets used as factor data so much of
the time when it is used at all that stringsAsFactors=TRUE seems the more
sensible default.

S Ellison
