Terry Therneau
2007-Apr-23 12:59 UTC
[R] stringsAsFactor global option (was "character coerced to a factor")
--- Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Just one caveat. I personally would try to avoid using
> global options since it can cause conflicts when
> two different programs assume two different settings
> of the same global option and need to interact.

I see this argument often, and don't buy it. In any case, for this particular option, the Mayo biostatistics group (~120 users) has had stringsAsFactors=F as a global default for 15+ years now with no ill effects. It is much less confusing for both new and old users.

John Kane asked "Any idea what the rationale was for setting the option to TRUE?" When factors were first introduced, there was no option to turn them off. Reading between the lines of the white book (Statistical Models in S) that introduced them, this is my guess: they made perfect sense for the particular data sets that were being analysed by the authors at the time. Many of the defaults in the survival package, which I wrote, have exactly the same rationale --- so let us not be too harsh on an author for not foreseeing all the future consequences of a default!

A place where factors really are a pain is when the patient id is a character string. When, for instance, you subset the data to do an analysis of only the females, having the data set `remember' all of the male ids (the original levels) is non-productive in dozens of ways. For other variables factors work well and have some nice properties. In general, I've found in my work (medical research) that factors are beneficial for about 1/5 of the character variables, a PITA for 1/4, and a wash for the rest; so I prefer to do any transformations myself.

For the historically curious:
In Splus, one originally fixed this with an override of the function
    as.data.frame.character <- as.data.frame.vector
before they added the global option. In R, unfortunately, this override didn't work due to namespaces, and we had to wait for the option to be added. (Another damned-if-you-do damned-if-you-don't issue. Normally you don't want users to be able to override a base function, because 9 times out of 10 they did it by accident and don't want it either. But when a user really does want to do so ...)

Terry Therneau
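The patient-id complaint above can be reproduced with a small sketch. The data frame, its column names, and the helper make_patients() are hypothetical and not taken from the thread. The stringsAsFactors argument is passed explicitly so the result does not depend on the session's global option (the default conversion was TRUE when this thread was written and has been FALSE since R 4.0.0).

```r
# Hypothetical toy data; the column names and ids are illustrative only.
make_patients <- function(as_factors) {
  data.frame(
    id  = c("P001", "P002", "P003", "P004"),
    sex = c("F", "M", "F", "M"),
    stringsAsFactors = as_factors
  )
}

# With factor conversion, subsetting to the females keeps every original level.
pf <- make_patients(TRUE)
females_f <- pf[pf$sex == "F", ]
levels_after_subset <- levels(females_f$id)   # still lists the male ids
counts_after_subset <- table(females_f$id)    # includes zero counts for the males

# Re-creating the factor drops the unused levels.
levels_refactored <- levels(factor(females_f$id))

# With character columns there are no levels to carry along.
pc <- make_patients(FALSE)
females_c <- pc[pc$sex == "F", ]
ids_character <- unique(females_c$id)

print(levels_after_subset)
print(levels_refactored)
print(ids_character)
```

Passing the argument per call also sidesteps the conflict Gabor describes, since the behaviour no longer depends on which global setting another program assumed.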
hadley wickham
2007-Apr-23 13:30 UTC
[R] stringsAsFactor global option (was "character coerced to a factor")
> A place where factors really are a pain is when the patient id is a character
> string. When, for instance, you subset the data to do an analysis of only
> the females, having the data set `remember' all of the male id's (the original
> levels) is non-productive in dozens of ways. For other variables factors
> work well and have some nice properties. In general, I've found in my work
> (medical research) that factors are beneficial for about 1/5 of the character
> variables, a PITA for 1/4, and a wash for the rest; so prefer to do any
> transformations myself.

It seems to me that the most important difference between factors and character vectors is that factors also store the range of the variable. You could imagine doing something similar for continuous variables. This would have the interesting property that plots of subsets would have the same range as plots of the original data. I'd imagine, just as with factors, this would be useful and frustrating in equal parts.

In terms of which should be the default, I can imagine two arguments:

* keep to the original format of the data as closely as possible: character vectors should be the default

* maintain as much information about the original data as possible: factors should be the default.

Hadley
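A rough sketch of the "factors store the range" idea, under the assumption that a continuous variable's range is simply computed once and carried alongside the data by hand (base R has no built-in mechanism for this). The variable names and values are hypothetical.

```r
# A factor keeps its full set of levels after subsetting.
sex <- factor(c("F", "M", "F", "M"))
only_f <- sex[sex == "F"]
print(levels(only_f))          # both levels survive the subset

# For a continuous variable the analogue has to be done manually:
# record the range of the full data and reuse it for the subset.
weight <- c(61.5, 83.2, 58.9, 90.4)   # hypothetical values
full_range <- range(weight)
weight_f <- weight[sex == "F"]

# A plot of the subset could then share the axis limits of the full data:
# plot(seq_along(weight_f), weight_f, ylim = full_range)
print(full_range)
print(range(weight_f))
```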
Prof Brian Ripley
2007-Apr-23 13:48 UTC
[R] stringsAsFactor global option (was "character coerced to a factor")
On Mon, 23 Apr 2007, Terry Therneau wrote:

> [...]
>
> For the historically curious:
> In Splus, one originally fixed this with an override of the function
>     as.data.frame.character <- as.data.frame.vector
> before they added the global option. In R, unfortunately, this override
> didn't work due to namespaces, and we had to wait for the option to be
> added. (Another damned-if-you-do damned-if-you-don't issue. Normally you
> don't want users to be able to override a base function, because 9 times out
> of 10 they did it by accident and don't want it either. But when a user really
> does want to do so ...)

That is what 'assignInNamespace' is for (and it came in with namespaces).

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595