thr3ads.net - R devel - [Rd] difference of m1 <- lm(f, data) and update(m1, formula=f) [Aug 2021]


Tim Taylor

2021-Aug-11 08:45 UTC

[R] Formula compared to call within model call

Manipulating formulas within different models I notice the following:

m1 <- lm(formula = hp ~ cyl, data = mtcars)
m2 <- update(m1, formula. = hp ~ cyl)
all.equal(m1, m2)
#> [1] TRUE
identical(m1, m2)
#> [1] FALSE
waldo::compare(m1, m2)
#> `old$call[[2]]` is a call
#> `new$call[[2]]` is an S3 object of class <formula>, a call

I'm aware formulas are a form of call but what I'm unsure of is whether
there is meaningful difference between the two versions of the models? Any
clarification, even just on the relation between formulas and calls would be
useful.

Martin Maechler

2021-Aug-11 09:51 UTC


[R] Formula compared to call within model call

>>>>> Tim Taylor 
>>>>>     on Wed, 11 Aug 2021 08:45:48 +0000 writes:
    > Manipulating formulas within different models I notice the following:

    > m1 <- lm(formula = hp ~ cyl, data = mtcars)
    > m2 <- update(m1, formula. = hp ~ cyl)
    > all.equal(m1, m2)
    > #> [1] TRUE
    > identical(m1, m2)
    > #> [1] FALSE
    > waldo::compare(m1, m2)
    > #> `old$call[[2]]` is a call
    > #> `new$call[[2]]` is an S3 object of class <formula>, a call

    > I'm aware formulas are a form of call but what I'm unsure
    > of is whether there is meaningful difference between the
    > two versions of the models? 

A good question.
In principle, the promise of an update()  method should be to
produce the *same* result as calling the original model-creation
(or more generally object-creation) function call.

So, already with identical(), you've shown that this is not
quite the case for simple lm(),
and indeed that is a bit undesirable.

To answer your question re "meaningful" difference,
given what I say above is:
No, there shouldn't be any relevant difference, and if there is,
that may be considered a bug in the respective update() method,
here update.lm.

More about this in the following  R code snippet :

## MM: indeed,
identical(m1$call, m2$call) #> [1] FALSE
noCall <- function(x) x[setdiff(names(x), "call")]
identical(noCall(m1), noCall(m2))# TRUE!
## look closer:
c1 <- m1$call
c2 <- m2$call
str(as.list(c1))
## List of 3
##  $        : symbol lm
##  $ formula: language hp ~ cyl
##  $ data   : symbol mtcars

str(as.list(c2))
## List of 3
##  $        : symbol lm
##  $ formula:Class 'formula'  language hp ~ cyl
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
##  $ data   : symbol mtcars

identical(c1[-2], c2[-2]) # TRUE
## ==> so, indeed the difference is *only* in the formula ( = [2]) component
f1 <- c1$formula
f2 <- c2$formula
all.equal(f1,f2) # TRUE
identical(f1,f2) # FALSE
## Note that this is typically *not* visible if the user uses the accessor
## functions:
identical(formula(m1), formula(m2)) # TRUE !
## and indeed, the formula() method for 'lm'  does set the environment:
stats:::formula.lm

--
Martin Maechler
ETH Zurich   and  R Core

Martin Maechler

2021-Aug-11 13:15 UTC


[Rd] difference of m1 <- lm(f, data) and update(m1, formula=f)

I'm diverting this from R-help to R-devel,
because I'm asking / musing whether, and if so where, we should / could
change R here (see below).

>>>>> Martin Maechler on 11 Aug 2021 11:51:25 +0200
>>>>> Tim Taylor .. on 08:45:48 +0000 writes:
    >> Manipulating formulas within different models I notice the following:

    >> m1 <- lm(formula = hp ~ cyl, data = mtcars)
    >> m2 <- update(m1, formula. = hp ~ cyl)
    >> all.equal(m1, m2)
    >> #> [1] TRUE
    >> identical(m1, m2)
    >> #> [1] FALSE
    >> waldo::compare(m1, m2)
    >> #> `old$call[[2]]` is a call
    >> #> `new$call[[2]]` is an S3 object of class <formula>, a call

    >> I'm aware formulas are a form of call but what I'm unsure
    >> of is whether there is meaningful difference between the
    >> two versions of the models? 

    > A good question.
    > In principle, the promise of an update()  method should be to
    > produce the *same* result as calling the original model-creation
    > (or more generally object-creation) function call.

    > So, already with identical(), you've shown that this is not
    > quite the case for simple lm(),
    > and indeed that is a bit undesirable.

    > To answer your question re "meaningful" difference,
    > given what I say above is:
    > No, there shouldn't be any relevant difference, and if there is,
    > that may be considered a bug in the respective update() method,
    > here update.lm.

    > More about this in the following  R code snippet :

Again, a repr.ex.:

---0<-------0<-------0<-------0<-------0<-------0<-------0<----

m1 <- lm(formula = hp ~ cyl, data = mtcars)
m2  <- update(m1, formula. = hp ~ cyl)
m2a <- update(m1)
identical(m1, m2a)#>  TRUE !
## ==> calling update() & explicitly specifying the formula is "the problem"

identical(m1$call, m2$call) #> [1] FALSE
noCall <- function(x) x[setdiff(names(x), "call")]
identical(noCall(m1), noCall(m2))# TRUE!
## look closer:
c1 <- m1$call
c2 <- m2$call
str(as.list(c1))
## List of 3
##  $        : symbol lm
##  $ formula: language hp ~ cyl
##  $ data   : symbol mtcars

str(as.list(c2))
## List of 3
##  $        : symbol lm
##  $ formula:Class 'formula'  language hp ~ cyl
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
##  $ data   : symbol mtcars

identical(c1[-2], c2[-2]) # TRUE
## ==> so, indeed the difference is *only* in the formula ( = [2]) component
f1 <- c1$formula
f2 <- c2$formula
all.equal(f1,f2) # TRUE
identical(f1,f2) # FALSE

## Note that this is typically *not* visible if the user uses
## the accessor functions they should :
identical(formula(m1), formula(m2)) # TRUE !
## and indeed, the formula() method for 'lm'  does set the environment:
stats:::formula.lm

---0<-------0<-------0<-------0<-------0<-------0<-------0<----

We know that it has been important in R that formulas have an
environment, and that this has been the only R-core recommended way to
do non-standard evaluation (!! .. but let's skip that for now !!).

OTOH we have also kept the convention that a formula without
environment implicitly means its environment
is .GlobalEnv aka globalenv().

Currently, I think formula() methods then *should* always return
a formula *with* an environment .. even though that's not
claimed in the reference, i.e., ?formula.
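
The contrast between the formula stored in the call and the one returned by the accessor can be checked directly. This is a small sketch; `f_call` and `f_acc` are illustrative names, and which environment is reported depends on where the model was fitted.

```r
m1 <- lm(formula = hp ~ cyl, data = mtcars)

f_call <- m1$call$formula  # the unevaluated `~` call kept in the call
f_acc  <- formula(m1)      # what the accessor returns

## The call component carries neither a class nor an environment attribute
class(f_call)
attr(f_call, ".Environment")

## The accessor returns a proper formula *with* an environment
class(f_acc)
environment(f_acc)
```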

Also, the print() method for formulas by default does *not* show the
environment if it is .GlobalEnv, as you can see on that help
already in the "Usage" section:

     ## S3 method for class 'formula'
     print(x, showEnv = !identical(e, .GlobalEnv), ...)
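
The effect of `showEnv` can be seen by giving a formula a non-global environment. This is a minimal sketch; the formula `y ~ x` and the names `f`, `out_default` and `out_hidden` are made up for illustration.

```r
f <- y ~ x
environment(f) <- new.env()  # deliberately *not* .GlobalEnv

## Default: the environment is printed on a second line, as it is not .GlobalEnv
out_default <- capture.output(print(f))

## Explicitly suppressing it leaves only the formula itself
out_hidden <- capture.output(print(f, showEnv = FALSE))

length(out_default)
length(out_hidden)
```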
     
Now, I've looked at the update() here, which is update.default();
that in turn calls update.formula() when a formula is given,
and the source code of the latter currently is

update.formula <- function (old, new, ...)
{
    tmp <- .Call(C_updateform, as.formula(old), as.formula(new))
    ## FIXME?: terms.formula() with "large" unneeded attributes:
    formula(terms.formula(tmp, simplify = TRUE))
}

where the important part is the "FIXME" comment (seen in the R
sources, but no longer in the R function after installation).

My current "idea" is to formalize what we see working here:
namely allow  update.formula() to *not* set the environment of
its result *if* that environment would be .GlobalEnv ..

--> I'm starting to test my proposal
but would still be *very* glad for comments, also contradicting
ones!

Martin
