thr3ads.net - R devel - [Rd] irrelevant warning message [Jan 2009]


Terry Therneau

2009-Jan-12 23:12 UTC

[Rd] irrelevant warning message

Context: 
  R version 2.7.1 (2008-06-23)
  I don't know when this was upgraded in the department; I just ran into the
aberrant behavior today.
  
Problem:
  
  Our group BY CHOICE does not change character variables into factors by 
default.  I can get into a long argument as to why later, and will give one 
example of why below.
  
  The default behavior of S, Splus and R has been to create dummy variables for 
factor, character, and logical variables.  This is good.  
  
  Why has R suddenly gotten a compulsion to put out a warning message for any 
model where we do this?  I contend that it is
  - confusing
  - unnecessary
  - and wrong.
  
  It is certainly confusing, as it implies a behavior change when there has been
no change.
  The fact that the "factor" command was used behind the scenes is irrelevant to
anyone - who cares HOW the rules are implemented?  Is there going to soon
be a message of WARNING: logical variable turned into numeric?  It is the
sensible next step.
  Wrong because the data element in question is not converted - not at the user
level at least.  It would be much more proper to say "was treated as a factor by
model.matrix"; but that is a semantic issue.
  
  
  In any case- to the real question.
  
  1. What is the easiest way to eliminate this?  I would prefer not to have to 
change the source code and recompile the local versions.  Because of namespaces 
it is not as easy as adding a repaired copy to our local library and loading it 
first.  I remember some discussion about forcing a change into another name 
space but I've lost the link to it. 
  I'll do this if we must, but it is such a hassle to keep updating the change
with new releases of the package.  It might still be less than dealing with the 
training/answer questions burden for our group, which is quite large.
  
  2. Is there any hope of undoing this?   Its only real impact is to annoy, and 
I've always disliked systems that preach at me.  (Detested is more like it; I
still remember how hard it was to delete certain files in Digital's TOPS os,
which was sure you "didn't actually want to do that".)  Allowing some global
"stop preaching" option would be a fix, or in the same vein to have it look at
the existing stringsAsFactors option for a "this person knows so don't bother
him" hint.
    If this addition followed on some discussion, please point me to it.
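
  For question 1, a minimal sketch of one approach that needs no patched package: 
wrap the call in a handler that muffles only the unwanted warning and lets every 
other warning through.  The helper name muffle_matching, the stand-in function 
noisy_fit, and the exact message text are assumptions for illustration, not part 
of R or of any package.

```r
# Hypothetical helper: evaluate `expr`, muffling only warnings whose
# message matches `pattern`; all other warnings propagate as usual.
muffle_matching <- function(expr, pattern) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(pattern, conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Stand-in for a model fit that warns; the message text is an assumption.
noisy_fit <- function() {
  warning("variable 'x' converted to a factor")
  "fit"
}

result <- muffle_matching(noisy_fit(), "converted to a factor")
```

  In real use the first argument would be the modelling call itself.  Matching on 
message text is fragile if the wording changes between releases, so this is a 
stopgap rather than a fix.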

  The reaction of other experienced users in our group was the same when I
pointed this out.  So I am not alone in asking "why?"
  
  	Terry Therneau
  	

PS. Here are two interrelated reasons we don't autoconvert:

  1. Subject id.  Factors give no advantage for a unique id, and some clear 
problems.  In particular when one creates a subset - everyone over 60 say - 
there is no good reason to remember all the ids you didn't select.
  2. Subject id.  I work on a lot of studies of fractures and fracture risk.  A 
time-trend model might be
  	gam(fracture ~ subject + x1 + x2 + ..., subset=(sex=='F'))
  
  Fracture risk for males and females is so different that separate models are 
the sensible thing.  If subject is a factor before the call, then my model has a
zillion unneeded levels.  There are other ways out of this issue, but avoiding 
factors is the easiest.
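
  The level-retention point in reason 1 can be seen in a small sketch.  The ids 
and ages below are made up for illustration.

```r
# Hypothetical data: six subjects with ages; ids are illustrative only.
id_chr <- c("s01", "s02", "s03", "s04", "s05", "s06")
age    <- c(45, 62, 71, 58, 66, 39)
id_fac <- factor(id_chr)

over60 <- age > 60

# Character vector: the subset holds only the selected ids.
chr_sub <- id_chr[over60]

# Factor: the subset still carries the levels of every original subject.
fac_sub <- id_fac[over60]

length(unique(chr_sub))   # 3
nlevels(fac_sub)          # 6
nlevels(factor(fac_sub))  # 3, re-creating the factor drops unused levels
```

  Calling factor() again on the subset is one of the "other ways out" mentioned 
above; keeping ids as character avoids the extra step.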

hadley wickham

2009-Jan-13 00:21 UTC

[Rd] irrelevant warning message

> PS. Here are two interrelated reasons we don't autoconvert:
>
>  1. Subject id.  Factors give no advantage for a unique id, and some clear
> problems.  In particular when one creates a subset - everyone over 60 say -
> there is no good reason to remember all the ids you didn't select.
>  2. Subject id.  I work on a lot of studies of fractures and fracture risk.  A
> time-trend model might be
>        gam(fracture ~ subject + x1 + x2 + ..., subset=(sex=='F'))
>
>  Fracture risk for males and females is so different that separate models are
> the sensible thing.  If subject is a factor before the call, then my model has a
> zillion unneeded levels.  There are other ways out of this issue, but avoiding
> factors is the easiest.

3.  Factors take up more memory than character vectors.

(This is tongue-in-cheek, but in recent versions of R, factor
variables take up (very very slightly) more memory than character
variables. It's a common myth that the opposite is true)
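
A rough way to check this on a given installation is to compare object.size()
for the two representations.  This is only a sketch: the data are made up, and
which representation comes out larger may depend on the R version and on
whether the build is 32-bit or 64-bit, so no expected output is shown.

```r
set.seed(1)
# Illustrative data: 100,000 values drawn from four distinct strings.
x_chr <- sample(letters[1:4], 1e5, replace = TRUE)
x_fac <- factor(x_chr)

# object.size() gives an estimate of the memory used by each object.
size_chr <- object.size(x_chr)
size_fac <- object.size(x_fac)

print(size_chr, units = "Kb")
print(size_fac, units = "Kb")
```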

I think R's handling of character vectors has progressed to the point
where they should be the norm, not the exception.  Maybe others will
have different views.

Hadley

-- 
http://had.co.nz/

Duncan Murdoch

2009-Jan-13 00:33 UTC

[Rd] irrelevant warning message

On 12/01/2009 6:12 PM, Terry Therneau wrote:
> Context: 
>   R version 2.7.1 (2008-06-23)
>   I don't know when this was upgraded in the department, I just ran into the
> aberrant behavior today.
>   
> Problem:
>   
>   Our group BY CHOICE does not change character variables into factors by 
> default.  I can get into a long argument as to why later, and will give one
> example of why below.
>   
>   The default behavior of S, Splus and R has been to create dummy variables for
> factor, character, and logical variables.  This is good.  
>   
>   Why has R suddenly gotten a compulsion to put out a warning message for any
> model where we do this?  I contend that it is

I think you need to be more specific.  I just tried

x <- sample(letters[1:4], 100, replace = TRUE)  # character covariate
y <- rnorm(length(x))
lm(y ~ x)

and got a warning in all R versions I tried back to 2.4.1.  In 2.3.1 
this was an error.

So I suspect the change you saw was to some other modelling function 
besides lm(), and I would guess that it came from making it consistent 
with lm().
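
A sketch of one workaround that needs no change to R: do the conversion
explicitly in the formula, so the modelling code is handed a factor rather than
a character variable.  This assumes the warning is raised only when a character
variable is converted internally; the stored data remain character.

```r
set.seed(42)
x <- sample(letters[1:4], 100, replace = TRUE)  # stays a character vector
y <- rnorm(length(x))

# Explicit conversion inside the formula; x itself is not modified.
fit <- lm(y ~ factor(x))

# With the default treatment contrasts: intercept plus one dummy
# variable for each level other than the reference level.
length(coef(fit))
is.character(x)
```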

But it would help if you told us which function, and which version 
you're comparing 2.7.1 with.

Now, it probably does make sense to suppress that warning.  I guess it 
was probably introduced because we used to give an error, and someone 
was being conservative and didn't think error-ful behaviour should go to 
accepted behaviour in one step.  But maybe it's time for the second step.

Duncan Murdoch

Terry Therneau

2009-Jan-13 13:28 UTC

[Rd] irrelevant warning message

Thanks for the replies:

Duncan:
> and got a warning in all R versions I tried back to 2.4.1.  In 2.3.1 
> this was an error.
  It seems I have egg on my face wrt this point.  A more accurate synopsis of what I
saw would have been that 1. I've never noticed this in R before and 2. until
recently I did all my modeling in Splus or Bell S, and character vectors always 
worked there.  (My survival routines were always more up to date in Splus 
because that's what I use for the source code.  But conversion from a local cvs
archive to Rforge is nearly done -- just a survexp.us ratetable issue remains --
so R will become my most current version in another day or two.)  Possibly I 
don't have any character variables as covariates in the survival test suite.
  
 
Hadley:
> I think R's handling of character vectors has progressed to the point
> where they should be the norm, not the exception.  Maybe others will
> have different views.
  Factors are very useful when there is a small discrete number of levels, and I
use them moderately often.  For that case, most of the default behavior of 
factors makes perfect sense, e.g., retention of levels.  I'm very sure that 
adding stringsAsFactors to the system options was a good thing, not as sure that
defaulting it to FALSE is the best thing for all users.  
   In my world most of the data comes from formal processes: clinical trials, 
data bases, large studies that use dedicated keyed entry, etc.  The most common 
character variables are things like id, name, and address for which the factor 
paradigm doesn't work, and most of the variables I get that are actually 
'factors' come to me as small integers; I turn them into factors using both the
levels and labels arguments.  Thus autoconversion is just a PITA. But my world 
is not everyone's. 
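
   As a sketch of the integer-code workflow just described, using factor() with 
both the levels and labels arguments; the variable name and the coding scheme 
are made up for illustration.

```r
# Hypothetical coded variable: 1 = placebo, 2 = low dose, 3 = high dose.
arm_code <- c(2, 1, 3, 3, 1, 2, 2)

# `levels` fixes which codes are valid and their order;
# `labels` attaches a readable name to each code.
arm <- factor(arm_code, levels = 1:3, labels = c("placebo", "low", "high"))

levels(arm)
table(arm)
```

   Any code not listed in levels becomes NA, which doubles as a check on the 
incoming data.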
   My main complaint with factors has always been the assumption that everything
should be turned into one.  I fought that battle with Splus.  Default behavior 
is often a reflection of the data sets being analysed at the time the code was 
written, and factors reflect the data sets in the Chambers & Hastie book.  But then,
my survival code has some defaults with exactly the same origin...
