Tal Galili
2010-Dec-11 22:48 UTC
[R] Why do we have to turn factors into characters for various functions?
Hello dear R-help mailing list,

My question is *not* about how factors are implemented in R (which is, if I understand correctly, that factors keep numbers and assign levels to them).
My question *is* about why so many functions that work on factors don't treat them as characters by default.

Here are two simple examples.

Example one, turning the characters inside a factor into numeric:

    x <- factor(4:6)
    as.numeric(x)                # output: 1 2 3
    as.numeric(as.character(x))  # output: 4 5 6  # isn't this what we wanted?

Example two, using strsplit on a factor:

    x <- factor(paste(letters[4:6], 4:6, sep="A"))
    strsplit(x, "A")                # will result in an error:
                                    # Error in strsplit(x, "A") : non-character argument
    strsplit(as.character(x), "A")  # will work and split

So what is the reason this is the case?
Is it that switching factors to characters by default in some of the basic functions would cause old code to break?
Is it a better design in some other way?

I am curious to know the reason for this.

Thank you for reading,
Tal
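For readers following along, the two examples in the question can be run side by side as a minimal sketch. The variable names `codes`, `labels`, `values`, and `parts` are illustrative only and do not come from the original post.

```r
# Example one: integer codes versus the labels a factor displays.
x <- factor(4:6)
codes  <- as.integer(x)        # position of each value among the levels
labels <- as.character(x)      # the labels themselves, as strings
values <- as.numeric(labels)   # the numbers the labels spell out

# Example two: strsplit() needs a character vector, so convert first.
y <- factor(paste(letters[4:6], 4:6, sep = "A"))
parts <- strsplit(as.character(y), "A")

print(codes)    # 1 2 3
print(values)   # 4 5 6
print(parts[[1]])  # "d" "4"
```

The explicit `as.character()` step is what both of the working calls in the question have in common.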
Joshua Wiley
2010-Dec-12 00:13 UTC
[R] Why do we have to turn factors into characters for various functions?
Hi Tal,

I always think of factors as a way of imposing (however arbitrarily) order on some variable. To that extent, the key aspect is first, second, third, etc., represented numerically in factors as 1, 2, 3, etc. The labels are for convenience and interpretation. Consider:

    x <- factor(c(5, 4, 6))
    y <- factor(c(6, 5, 7))
    as.numeric(x)
    as.numeric(y)

Is the numeric or character value of 5 more important? Or is its relative position?

If you have character data that you might want to split and manipulate, store it as a string variable (you can set an option so that stringsAsFactors = FALSE is the default in read.table()). If your factor labels are numeric, that suggests the data might have been better stored as numeric in the first place. Generally, when I find myself converting factors to numeric or character class data, it means I've been using factor() to recode data (which is not its intended purpose).

My 2 cents.

Cheers,
Josh

--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/
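A small sketch of the two points in this reply: the factors `x` and `y` share the same codes even though their labels differ, and text that will be split later can be read as character from the start. The inline data in `raw` and the column names `id` and `label` are made up for illustration.

```r
# Same relative positions, different labels: the codes are identical.
x <- factor(c(5, 4, 6))
y <- factor(c(6, 5, 7))
codes_x <- as.numeric(x)
codes_y <- as.numeric(y)
print(identical(codes_x, codes_y))  # TRUE, both are 2 1 3
print(levels(x))                    # "4" "5" "6"
print(levels(y))                    # "5" "6" "7"

# Reading text columns as character, so no conversion is needed later.
# 'raw' is hypothetical data standing in for a file on disk.
raw <- "id label\n1 dA4\n2 eA5\n3 fA6"
df <- read.table(text = raw, header = TRUE, stringsAsFactors = FALSE)
print(class(df$label))              # "character"
pieces <- strsplit(df$label, "A")
```

Passing `stringsAsFactors = FALSE` per call, as here, keeps the choice visible at the point where the data is read.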
Erik Iverson
2010-Dec-12 18:16 UTC
[R] Why do we have to turn factors into characters for various functions?
On 12/11/2010 04:48 PM, Tal Galili wrote:
> Here are two simple examples:
> Example one turning the characters inside a factor into numeric:
>
> x <- factor(4:6)
> as.numeric(x) # output: 1 2 3
> as.numeric(as.character(x)) # output: 4 5 6 # isn't this what we wanted?

But your example of 'x' is a very special case. Most factors will not have numeric levels as you have constructed. Most levels will be categorical such as Sex, Race, Country of Origin, Treatment, etc. These are stored as numeric codes (R's enumerated type class), and most modeling functions treat variables of class factor differently.

So, as.numeric(x) will just return the numeric codes regardless of the levels of the factor, which is fine. It seems you may be silently suggesting that *if* the levels of the factor are themselves able to be coerced to numeric, then as.numeric(x) should return that instead of the underlying numeric codes.

Of course, having functions do different things depending on the particular input is dangerous; thus we have the behavior as it is currently implemented.
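To illustrate the point about input-dependent behaviour, here is a hedged sketch of an explicit, opt-in conversion. The helper `factor_to_numeric()` is hypothetical and not part of base R; it refuses to guess when the levels are not numeric, rather than silently falling back to the codes.

```r
# Hypothetical helper (not a base R function): convert a factor to the
# numbers its labels spell out, and fail loudly if any label is not numeric.
factor_to_numeric <- function(f) {
  stopifnot(is.factor(f))
  vals <- suppressWarnings(as.numeric(levels(f)))
  if (anyNA(vals)) {
    stop("factor levels are not all numeric")
  }
  vals[as.integer(f)]
}

sex <- factor(c("male", "female", "female"))
print(as.integer(sex))                    # 2 1 1, the codes
print(factor_to_numeric(factor(4:6)))     # 4 5 6
res <- try(factor_to_numeric(sex), silent = TRUE)
print(inherits(res, "try-error"))         # TRUE: categorical labels are rejected
```

The caller has to ask for the label-based conversion by name, so `as.numeric()` itself keeps one meaning for every factor.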
Petr Savicky
2010-Dec-12 19:12 UTC
[R] Why do we have to turn factors into characters for various functions?
On Sun, Dec 12, 2010 at 12:48:30AM +0200, Tal Galili wrote:
> My question *is* about why so many functions that work on factors don't
> treat them as characters by default?

Personally, I try to use factors only when there is a specific reason for this and character type otherwise. Factors are natural in the data used for construction of a classification model or for categorical attributes, and also for preparing input to the table() function and related things.

> Here are two simple examples:
> Example one turning the characters inside a factor into numeric:
>
> x <- factor(4:6)
> as.numeric(x) # output: 1 2 3
> as.numeric(as.character(x)) # output: 4 5 6 # isn't this what we wanted?

If you are concerned with computing time, then applying as.numeric() only to the levels is probably better:

    x <- factor(rep(4:6, times=1000000))
    cpu1 <- system.time( out1 <- as.numeric(as.character(x)) )
    cpu2 <- system.time( out2 <- as.numeric(levels(x))[as.integer(x)] )
    rbind(cpu1, cpu2)

         user.self sys.self elapsed user.child sys.child
    cpu1     0.570    0.031   0.601          0         0
    cpu2     0.042    0.027   0.070          0         0

> Is it that implementing a switch of factors to characters as the default in
> some of the basic function will cause old code to break?

I think that this is an important part of the reason.

Petr Savicky.
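A quick check that the two conversions compared above give the same result, as a sketch on a smaller vector. Timings vary by machine and R version, so none are shown here.

```r
# Smaller input than in the message, just to confirm the results agree.
x <- factor(rep(4:6, times = 1000))

out1 <- as.numeric(as.character(x))           # convert every element's label
out2 <- as.numeric(levels(x))[as.integer(x)]  # convert 3 levels, then index

print(identical(out1, out2))  # TRUE
print(length(levels(x)))      # 3: only this many strings are parsed for out2

# Re-run the timing locally if the difference matters for your data.
timing <- rbind(
  per_element = system.time(as.numeric(as.character(x))),
  per_level   = system.time(as.numeric(levels(x))[as.integer(x)])
)
```

The second form does the string-to-number parsing once per level rather than once per element, which is where the saving reported in the message would come from.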
Heinz Tuechler
2010-Dec-12 20:00 UTC
[R] Why do we have to turn factors into characters for various functions?
At 12.12.2010 00:48 +0200, Tal Galili wrote:
>So what is the reason this is the case?
>Is it that implementing a switch of factors to characters as the default in
>some of the basic function will cause old code to break?
>Is it a better design in some other way?
>
>I am curious to know the reason for this.

In my view the answer can be found implicitly in the language definition.

"Factors are currently implemented using an integer array to specify the actual levels and a second array of names that are mapped to the integers. Rather unfortunately users often make use of the implementation in order to make some calculations easier."

It is the "unfortunate" use of factors that seems generally accepted, even if the language definition continues:

"This, however, is an implementation issue and is not guaranteed to hold in all implementations of R."

Personally, like some others, I avoid factors, except in cases where they represent a statistical concept.

Certainly I would agree with you that, if only reading the "R Language Definition" and not the documentation of the function factor, one would rather expect functions like as.numeric or strsplit to operate on the levels of a factor and not on the underlying, implementation-specific integer array.

Heinz
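The implementation described in the quoted passage can be inspected directly. This is a minimal sketch; the factor `f` and its labels are invented for illustration.

```r
# An invented factor with an explicit level order.
f <- factor(c("low", "high", "low", "mid"), levels = c("low", "mid", "high"))

print(typeof(f))            # "integer": the underlying storage
print(unclass(f))           # the integer codes, carrying a "levels" attribute
print(names(attributes(f))) # includes "levels" and "class"

codes  <- as.integer(f)     # 1 3 1 2
labels <- as.vector(f)      # character labels: "low" "high" "low" "mid"
print(f)                    # printing shows labels, not codes
```

Functions that look at `typeof()` see the integers, while `print()` and `as.vector()` work from the labels, which is the inconsistency this message describes.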
Petr PIKAL
2010-Dec-15 10:45 UTC
[R] Why do we have to turn factors into characters for various functions?
Hi Heinz

OK, point taken. I must say I do not concatenate factors very often, so this feature does not bother me much.

Best regards
Petr

Heinz Tuechler <tuechler at gmx.at> wrote on 13.12.2010 13:52:17:

> Hello Petr,
>
> don't want to convince you. If you like the following:
>
> > x <- factor(1:4, labels=c("one", "two", "three", "four"))
> > y <- factor(3:5, labels=c("three", "four", "five"))
> > data.frame(character=c(as.character(x), as.character(y)), numeric=c(x,y))
>   character numeric
> 1       one       1
> 2       two       2
> 3     three       3
> 4      four       4
> 5     three       1
> 6      four       2
> 7      five       3
>
> For me the behaviour of character vectors is easier to follow and
> less error prone.
>
> > cx <- c("one", "two", "three", "four")
> > cy <- c("three", "four", "five")
> > c(cx, cy)
> [1] "one"   "two"   "three" "four"  "three" "four"  "five"
>
> >Anyway it is maybe more about personal habits than about bad factor
> >"features"
>
> I agree with you regarding personal habits. It's not the features of
> factors. For me it's the rather inconsistent use in functions like
> c() or print().
> If you print a factor, you see its levels, but if you combine it
> using c(), you combine the famous implementation-specific underlying
> integer vector.
>
> best regards,
>
> Heinz
>
> At 13.12.2010 08:50 +0100, Petr PIKAL wrote:
> >Hi
> >
> >r-help-bounces at r-project.org wrote on 12.12.2010 21:00:37:
> >
> > > Personally, like some others, I avoid factors, except in cases where
> > > they represent a statistical concept.
> >
> >On the contrary, I find factors quite useful. Consider the possibility
> >of changing their levels:
> >
> > > set.seed(111)
> > > x <- factor(sample(1:4, 20, replace=T), labels=c("one", "two","three", "four"))
> > > x
> > [1] three three two   three two   two   one   three two   one   three three
> >[13] one   one   one   two   one   four  two   three
> >Levels: one two three four
> > > levels(x)[3:4] <- "more"
> > > x
> > [1] more more two  more two  two  one  more two  one  more more one  one  one
> >[16] two  one  more two  more
> >Levels: one two more
> >
> >I believe that if x is character, it can also be done, but the factor way
> >seems to me more convenient. I also use point distinction in plots by
> >pch=as.numeric(some.factor) quite often.
> >
> >Anyway it is maybe more about personal habits than about bad factor
> >"features"
> >
> >Regards
> >Petr
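The concatenation and level-merging examples exchanged above can be sketched in a way that does not depend on how c() treats factors. As far as I know, c() on factors returned the bare integer codes at the time of this thread and returns a factor in R 4.1.0 and later, so going through the labels explicitly sidesteps the difference. The names `combined` and `z` are illustrative.

```r
x <- factor(1:4, labels = c("one", "two", "three", "four"))
y <- factor(3:5, labels = c("three", "four", "five"))

# Combine by label, then rebuild a factor over the union of the levels.
combined <- factor(c(as.character(x), as.character(y)),
                   levels = union(levels(x), levels(y)))
print(levels(combined))      # "one" "two" "three" "four" "five"
print(as.integer(combined))  # 1 2 3 4 3 4 5

# Merging levels in place: assign the same label to several positions.
z <- x
levels(z)[3:4] <- "more"
print(levels(z))             # "one" "two" "more"
print(as.character(z))       # "one" "two" "more" "more"
```

Building the combined factor from the labels keeps "three" and "four" mapped to the same codes in both inputs, which is the behaviour the character-vector example in the quoted message gets for free.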