Folks:

Over the years, many people -- including some whom I would consider real expeRts -- have criticized factors and advocated the use (sometimes exclusively) of character vectors instead. I would just like to point out that, for me, factors provide one feature that I find very convenient: ordering of levels. **

As an example, suppose one has a character vector of labels "small", "medium", and "large". Then most R functions (e.g. tapply()) will display results involving this vector in alphabetical order, which I think most would view as undesirable. By converting to a factor with levels in the logical order, displays will automatically be "logical." For example:

> x <- sample(c("small","medium","large"), 12, replace=TRUE)
> table(x)
x
 large medium  small
     2      3      7
> y <- factor(x, levels=c("small","medium","large")) ## ordered() also would do, but is not necessary for this
> table(y)
y
 small medium  large
     7      3      2

Naturally, this is just my opinion, and I understand why lots of smart people find factors irritating (at least!). So contrary opinions cheerily welcomed. But perhaps these comments might be helpful to those who have been "bitten" by factors or just wonder what all the fuss is about.

** Another advantage is reduced storage space, I believe. Please correct if wrong.

Cheers,
Bert

--
Bert Gunter
Genentech Nonclinical Biostatistics
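The post names tapply() but only demonstrates table(). A minimal sketch of the same effect with tapply() follows; the `value` numbers and variable names are made up purely for illustration.

```r
# Hypothetical data: size labels and an invented numeric measurement.
size  <- c("small", "large", "medium", "small", "large", "medium")
value <- c(1, 9, 5, 3, 11, 7)

# Grouping by the character vector: groups come out alphabetically.
by_chr <- tapply(value, size, mean)
names(by_chr)   # "large" "medium" "small"

# Grouping by a factor with explicit levels: groups follow the level order.
size_f <- factor(size, levels = c("small", "medium", "large"))
by_fac <- tapply(value, size_f, mean)
names(by_fac)   # "small" "medium" "large"
```

The group means themselves are identical in both cases; only the order in which they are reported changes.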
I second Bert's opinion: factors can be confusing, but they have quite nice features which cannot be easily mimicked by plain character vectors. I find the possibility of manipulating the levels extremely useful.

> fac <- factor(sample(letters[1:5], 20, replace=TRUE))
> fac
 [1] e e d d e e c e a e a e b b d e c c d b
Levels: a b c d e
> levels(fac)[2:4] <- "new.level"
> fac
 [1] e         e         new.level new.level e         e         new.level
 [8] e         a         e         a         e         new.level new.level
[15] new.level e         new.level new.level new.level new.level
Levels: a new.level e

Regards
Petr

________________________________________
From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] on behalf of Bert Gunter [gunter.berton at gene.com]
Sent: 17 August 2012 19:32
To: r-help at r-project.org
Subject: [R] Opinion: Why I find factors convenient to use

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
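As a possible complement to the level-merging example in this reply, the same merge can be written with the named-list form of `levels<-`, which spells out which old levels end up in which new one. This is only a sketch on a small hand-made factor, not the sampled data from the message.

```r
# Small hand-made factor so the result is deterministic.
fac <- factor(c("a", "b", "c", "d", "e", "b", "d"))

# Each list name is a new level; each value lists the old levels it absorbs.
levels(fac) <- list(a = "a", new.level = c("b", "c", "d"), e = "e")

levels(fac)        # "a" "new.level" "e"
as.character(fac)  # old "b", "c" and "d" now all read "new.level"
```

With a plain character vector the equivalent would be an explicit replacement such as `x[x %in% c("b", "c", "d")] <- "new.level"`.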
I don't know if my recent post on this prompted your post, but I don't see much to argue with in your discussion. I find factors to be useful for managing display and some kinds of analysis. However, I find them mostly a handicap when importing, merging, and handling data QC. Therefore I delay conversion until late in the game... but in most cases I do eventually convert.

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
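One way to picture the "convert late" workflow described in this reply is sketched below. The raw labels are invented, and the cleaning steps (trimming whitespace, lower-casing) are only assumed examples of the QC meant here.

```r
# Hypothetical raw input with inconsistent case and stray whitespace.
raw <- c("small", " Medium", "large ", "SMALL", "medium")

# Converting too early freezes the dirty labels as five distinct levels.
nlevels(factor(raw))   # 5

# Clean while the data are still character, then convert with the
# intended level order.
clean <- tolower(trimws(raw))
size  <- factor(clean, levels = c("small", "medium", "large"))
table(size)

# Any label not listed in `levels` would have become NA, which gives
# a cheap QC check after conversion.
anyNA(size)   # FALSE
```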
Hello,

On 17-08-2012 20:27, Bert Gunter wrote:
> ... so it may be just the way object.size() counts in the two cases, right?

Or maybe the way character vectors and factors are coded (64-bit Windows 7 or Ubuntu 12.04). 80k for the character vector seems to be 8 * 1e4 for pointers plus room for the strings themselves, and 40k for the factor seems more like 32-bit ints * 1e4 in consecutive memory locations. I confess to being too lazy to go check the sources, but if this is the case then it's another point in favor of factors: they are indeed more efficient memory-wise. And 64-bit OSs are becoming more and more widely used; processors aren't becoming worse.

There is also the statistical side of it. Factors are the natural way of coding nominal or categorical variables. The small/medium/large example is a good one. Or seasons: we like to see Fall or Autumn after Spring and Summer, not before. (btw, does anyone know why M/F?) And this has nothing to do with the usefulness of characters; I like persons' names to be names, alphabetic.

I've also made a simple check: apparently, character vectors are kept as a vector of pointers and a vector of unique strings. If we change one of the strings, even to something smaller, occupying fewer bytes, object.size() will report an increase in size. Try x[1] <- "a" and see the new size of x. It's bigger, and the number of pointers to strings is the same.

For 32- and 64-bit Windows 7 and for 64-bit Ubuntu 12.04, R was:

> R.version
[...]
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows

Rui Barradas

>
> -- Bert
>
> On Fri, Aug 17, 2012 at 11:42 AM, Peter Langfelder
> <peter.langfelder at gmail.com> wrote:
>> On Fri, Aug 17, 2012 at 11:34 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
>>> Hello,
>>>
>>> No, factors may use less memory. System dependent?
>> I think it's a 32-bit vs. 64-bit distinction - I get Rui's results on
>> 64-bit Windows and Linux installation, but Bert's result on a 32-bit
>> Linux machine.
>>
>> Peter
>>
>>>> x <-sample(c("small","medium","large"),1e4,rep=TRUE)
>>>> y <- factor(x)
>>>> object.size(x)
>>> 80184 bytes
>>>> object.size(y)
>>> 40576 bytes
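The size comparison and the x[1] <- "a" experiment discussed in this message can be rerun with the sketch below. The seed is an arbitrary choice, and the byte counts are deliberately not shown because, as the thread itself notes, they depend on the platform (and possibly on the R version).

```r
set.seed(1)  # arbitrary seed, only so reruns draw the same sample
x <- sample(c("small", "medium", "large"), 1e4, replace = TRUE)
y <- factor(x)

typeof(x)   # "character": one pointer per element
typeof(y)   # "integer": one integer code per element, plus the levels

object.size(x)
object.size(y)

# Introduce one new, shorter string and compare the reported sizes;
# the vector length (number of pointers) is unchanged.
before <- object.size(x)
x[1] <- "a"
after <- object.size(x)
c(before = before, after = after)
```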
On 08/18/2012 03:32 AM, Bert Gunter wrote:
> Folks:
> ...
> So contrary opinions
> cheerily welcomed. But perhaps these comments might be helpful to
> those who have been "bitten" by factors or just wonder what all the
> fuss is about.
>

I tend to use stringsAsFactors=FALSE quite a bit, as I am often manipulating character strings, and that

Error in strsplit(bugga, "") : non-character argument

is so annoying. Almost as annoying as printing out a list of selected cases with some of the fields turning up as integers rather than the strings I expected. That said, I often convert the results to factors so that some other function will work properly.

So I must express my gratitude for motivating me to add options(stringsAsFactors=FALSE) to that wonderful .First function that makes my life a little happier every day.

Jim
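The two annoyances mentioned in this message can be reproduced with a small sketch; the factor `f` below is a made-up stand-in for the `bugga` object in the quoted error. Note that since R 4.0.0, data.frame() and read.table() default to stringsAsFactors = FALSE, so this situation arises less often by accident than it did in 2012.

```r
# Made-up factor standing in for a column that was converted on import.
f <- factor(c("ab", "cd"))

# strsplit() refuses a factor; tryCatch() captures the error message.
msg <- tryCatch(strsplit(f, ""), error = function(e) conditionMessage(e))
msg

# Converting back to character first makes the call work.
strsplit(as.character(f), "")

# The "fields turning up as integers" effect: the underlying level codes.
as.integer(f)   # 1 2
```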
> -----Original Message-----
> Over the years, many people -- including some who I would
> consider real expeRts -- have criticized factors and
> advocated the use (sometimes exclusively) of character
> vectors instead.

Exclusive use of character vectors is not going to do the job. The concept of a factor is fundamental to a lot of statistics; a programming environment that does not implement factors and their associated special behaviour is probably not a statistical programming language.

Special behaviours I have in mind include:
- Level order can be arbitrarily specified for display purposes
- A control level can be intentionally chosen for contrasts
- The option of "ordered" factors (for example, for polr and the like)

So I think the language does and will require a 'factor' type in one form or another.

_When_ you decide to convert a character input to a factor is, of course, up to the user, and for cleanup it's very often better to stick with character early and convert to factor a bit later. But personally, I think that there is sufficient control over the coding of data to allow user discretion, and on the whole it seems to me that character input is used as factor data so much of the time, when it is used at all, that stringsAsFactors=TRUE seems the more sensible default.

S Ellison
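The second and third behaviours in the list above can be illustrated with a short sketch. The treatment labels are hypothetical, and relevel() is just one common way of choosing the control level, not necessarily the one the author had in mind.

```r
# Hypothetical treatment labels; "placebo" is the intended control.
trt <- factor(c("drugA", "placebo", "drugB", "placebo", "drugA"))
levels(trt)      # "drugA" "drugB" "placebo" (alphabetical by default)

# Make "placebo" the reference level used by treatment contrasts.
trt <- relevel(trt, ref = "placebo")
levels(trt)      # "placebo" "drugA" "drugB"
contrasts(trt)   # with default options, one column per non-reference level

# An ordered factor additionally supports comparisons between levels.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
size < "large"   # TRUE FALSE TRUE
```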