Terry Therneau
2007-Feb-12 14:36 UTC
[R] Problem with factor state when subset()ing a data frame
The solution to most "factors" questions on the R mailing list is to set the global option stringsAsFactors to FALSE. Make it part of your default R startup. Even better, do what we have done at Mayo for the last 10+ years and make it the default for your whole unit (150+ users, 20+ years of S experience). We were one of the groups that whined to Insightful until they added this feature, which unfortunately did not become a part of R until fairly recently.

For some character variables the factor logic makes sense; for others it does not. If you set the option above, then you can use an explicit
    mydata$variable <- factor(mydata$variable)
for the variables that should be factors. In my experience, with a wide variety of data analysis, that is about 1/10 of my character variables. Others may disagree about the fraction, but one of the really bad aspects of the default design is that it forces 100% conversion of characters to another class, which is certainly not the best state of affairs. (Street address, for instance, never makes sense as a factor.)

When factors are the right thing, they are very useful. I would agree with Peter Dalgaard's assessment of past discussion about automatically dropping unused levels: there is no approach that always works best, and the current behavior has been extensively talked over and appears to be the best available default. Factors most certainly should not disappear from the language, or have major changes without a lot of discussion.

Terry Therneau
Biostatistics, Mayo Clinic
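
As a minimal sketch of the workflow described in the message above: the data frame, its column names, and its values are hypothetical and not from the original post. The sketch passes stringsAsFactors = FALSE directly to data.frame() so it behaves the same whatever the session's global option is. Setting options(stringsAsFactors = FALSE) in a startup file is the global form the message refers to.

```r
# Hypothetical data; column names and values are illustrative only.
mydata <- data.frame(
  address = c("12 Elm St", "9 Oak Ave", "4 Pine Rd", "7 Ash Ln"),
  group   = c("control", "treated", "treated", "placebo"),
  stringsAsFactors = FALSE  # per-call form of options(stringsAsFactors = FALSE)
)

# Both columns stay character; nothing is converted automatically.
is.character(mydata$address)  # TRUE
is.character(mydata$group)    # TRUE

# Convert only the variable that is genuinely categorical.
mydata$group <- factor(mydata$group)

# subset() keeps every level, including ones no longer present in the rows.
sub <- subset(mydata, group != "placebo")
levels(sub$group)  # "control" "placebo" "treated"

# Re-applying factor() drops the unused level explicitly.
sub$group <- factor(sub$group)
levels(sub$group)  # "control" "treated"
```

The address column is never touched, so it remains plain character data, while the decision to drop unused levels after subsetting stays an explicit step rather than something done automatically.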