thr3ads.net - R devel - [Rd] stringsAsFactors [Feb 2013]

Terry Therneau

2013-Feb-11 13:50 UTC

[Rd] stringsAsFactors

I think your idea to remove the warnings is excellent, and a good compromise.
Characters already work fine in modeling functions except for the silly warning.

It is interesting how often the defaults for a program reflect the data sets in
use at the time the defaults were chosen.  There are some such in my own
survival package whose proper value is no longer as "obvious" as it was when I
chose them.  Factors are very handy for variables which have only a few levels
and will be used in modeling.  Every character variable of every dataset in
"Statistical Models in S", which introduced factors, is of this type, so
auto-transformation made a lot of sense.  The "solder" data set there is one
for which Helmert contrasts are proper, so guess what the default contrast
option was?  (I think there are only a few data sets in the world for which
Helmert makes sense, however, and R eventually changed the default.)

For character variables that should not be factors, such as a street address,
stringsAsFactors can be a real PITA, and I expect that people's preference for
the option depends almost entirely on how often these arise in their own work.
As long as there is an option that can be overridden I'm okay.  Yes, I'd prefer
FALSE as the default, partly because the current value is a tripwire in the
hallway that eventually catches every new user.

Terry Therneau

On 02/11/2013 05:00 AM, r-devel-request at r-project.org wrote:
> Both of these were discussed by R Core.  I think it's unlikely the
> default for stringsAsFactors will be changed (some R Core members like
> the current behaviour), but it's fairly likely the show.signif.stars
> default will change.  (That's if someone gets around to it:  I
> personally don't care about that one.  P-values are commonly used
> statistics, and the stars are just a simple graphical display of them.
> I find some p-values to be useful, and the display to be harmless.)
>
> I think it's really unlikely the more extreme changes (i.e. dropping
> show.signif.stars completely, or dropping p-values) will happen.
>
> Regarding stringsAsFactors:  I'm not going to defend keeping it as is,
> I'll let the people who like it defend it.  What I will likely do is
> make a few changes so that character vectors are automatically changed
> to factors in modelling functions, so that operating with
> stringsAsFactors=FALSE doesn't trigger silly warnings.

William Dunlap

2013-Feb-11 17:13 UTC

[Rd] stringsAsFactors

Note that changing this does not just mean getting rid of "silly warnings".
Currently, predict.lm() can give wrong answers when stringsAsFactors is FALSE.

  > d <- data.frame(x=1:10, f=rep(c("A","B","C"), c(4,3,3)),
  +                 y=c(1:4, 15:17, 28.1,28.8,30.1))
  > fit_ab <- lm(y ~ x + f, data = d, subset = f!="B")
  Warning message:
  In model.matrix.default(mt, mf, contrasts) :
    variable 'f' converted to a factor
  > predict(fit_ab, newdata=d)
   1  2  3  4  5  6  7  8  9 10
   1  2  3  4 25 26 27  8  9 10
  Warning messages:
  1: In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
    variable 'f' converted to a factor
  2: In predict.lm(fit_ab, newdata = d) :
    prediction from a rank-deficient fit may be misleading

fit_ab is not rank-deficient and the predict should report
   1 2 3 4 NA NA NA 28 29 30 
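
A minimal sketch of one way to avoid the silent mis-prediction, assuming f is
converted to a factor explicitly before fitting (this is a workaround, not the
change to the modelling functions discussed in this thread). The names fit_fac,
res, ok and pred_ok are illustrative only. With a factor, predict() refuses the
level it never saw instead of returning numbers:

```r
# Same data as above, but with 'f' stored as a factor up front.
d <- data.frame(x = 1:10,
                f = factor(rep(c("A", "B", "C"), c(4, 3, 3))),
                y = c(1:4, 15:17, 28.1, 28.8, 30.1))

# lm() drops the unused level "B" and records the levels it actually
# saw in fit_fac$xlevels.
fit_fac <- lm(y ~ x + f, data = d, subset = f != "B")

# Predicting for rows with the unseen level "B" signals an error
# rather than quietly returning a wrong number.
res <- tryCatch(predict(fit_fac, newdata = d), error = function(e) e)
inherits(res, "error")

# Restricting newdata to the levels used in the fit works as expected.
ok <- d[d$f != "B", ]
pred_ok <- predict(fit_fac, newdata = ok)
```

An error is stricter than the NA values suggested above, but either way the
problem is visible rather than hidden in plausible-looking output.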

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

Brian Diggs

2013-Feb-11 20:15 UTC

[Rd] stringsAsFactors

On 2/11/2013 5:50 AM, Terry Therneau wrote:
> I think your idea to remove the warnings is excellent, and a good
> compromise.  Characters already work fine in modeling functions except
> for the silly warning.
>
> It is interesting how often the defaults for a program reflect the data
> sets in use at the time the defaults were chosen.  There are some such
> in my own survival package whose proper value is no longer as "obvious"
> as it was when I chose them.  Factors are very handy for variables which
> have only a few levels and will be used in modeling.  Every character
> variable of every dataset in "Statistical Models in S", which introduced
> factors, is of this type so auto-transformation made a lot of sense.
> The "solder" data set there is one for which Helmert contrasts are
> proper so guess what the default contrast option was?  (I think there
> are only a few data sets in the world for which Helmert makes sense,
> however, and R eventually changed the default.)
>
> For character variables that should not be factors such as a street
> address stringsAsFactors can be a real PITA, and I expect that people's
> preference for the option depends almost entirely on how often these
> arise in their own work.  As long as there is an option that can be
> overridden I'm okay.  Yes, I'd prefer FALSE as the default, partly
> because the current value is a tripwire in the hallway that eventually
> catches every new user.

I also agree that stringsAsFactors should not be TRUE, at least by 
default. I do not change the default in my .Rprofile because I have seen 
examples where people have gotten tripped up having changed this and 
forgotten about it or when sharing code and getting different, 
unexpected, results. However, my code is littered with this additional 
argument so that I get, to me, the more sensible behavior.

My preference follows from my conceptualization of what a factor is. To 
me, a factor is the representation of a data type which has a fixed, 
finite, set of values which it can take which are known a priori. In 
terms of sample and population, a variable could only be a factor if all 
possible values that the variable could take in the population are known 
(not just those in the given sample). Automatic conversion of strings to 
factors assumes that the values that are present constitute the complete 
and exclusive set of values which that variable could ever have, an 
assumption which is often not correct in my experience. Names, street 
addresses, and unique alphanumeric identifiers are all examples where it 
fails.

In contrast, a character variable is just a vector of arbitrary-length 
character strings; it makes no further assumptions.
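
The distinction can be sketched in a few lines. The variable names (size,
size2, street) and values are made up for illustration:

```r
# A factor carries a closed set of levels, declared here up front,
# including one ("medium") that does not occur in the sample.
size <- factor(c("small", "large"), levels = c("small", "medium", "large"))

# Assigning a value outside that set produces NA (R also warns;
# the warning is suppressed here to keep the sketch quiet).
size2 <- size
suppressWarnings(size2[2] <- "huge")
is.na(size2[2])

# A character vector makes no such assumption and accepts any string.
street <- c("12 Elm St", "7 Oak Ave")
street[2] <- "99 Birch Rd"
```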

A secondary reason why I don't like a default conversion of strings to 
factors is, on importing data, I often have to do some data cleaning 
(unifying case, noting specific missing value encoding, collapsing 
redundant entries) before I have a clean set of possible values to 
convert to a factor. Once I have converted those variables which should 
be factors to factors and left those that are just character strings as 
character strings, I don't want later functions changing those choices 
on me.
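
As a rough sketch of that workflow, with a made-up data frame and a made-up
missing-value code ("-99"), neither of which comes from real data:

```r
# Hypothetical raw import: inconsistent case, stray whitespace,
# and "-99" used as a missing-value code. Strings stay as strings.
raw <- data.frame(id  = c("a01", "a02", "a03", "a04"),
                  sex = c("M", "f ", "F", "-99"),
                  stringsAsFactors = FALSE)

# Clean the strings first: trim whitespace, unify case, recode missing.
sex <- toupper(trimws(raw$sex))
sex[sex == "-99"] <- NA

# Only then convert the column that really is categorical,
# with its full set of levels stated explicitly.
raw$sex <- factor(sex, levels = c("F", "M"))

# 'id' is an identifier, so it is left as a character vector.
str(raw)
```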

I realize that, historically, a factor was also a more efficient storage 
mechanism for strings (store each unique string only once and then 
record an index to that string), but with the global string table, my 
understanding is that that is no longer the case.

Finally, stringsAsFactors being TRUE by default effectively says that 
there is no place for character vectors; all character vectors should be 
converted to factors as soon as possible. Taking this to the (absurd) 
extreme, why even have a character vector type, then? The (appropriate) 
existence of both a factor type and a character vector type is a further 
argument that the latter should not be converted to the former automatically.

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University
