Manipulating formulas within different models I notice the following: m1 <- lm(formula = hp ~ cyl, data = mtcars) m2 <- update(m1, formula. = hp ~ cyl) all.equal(m1, m2) #> [1] TRUE identical(m1, m2) #> [1] FALSE waldo::compare(m1, m2) #> `old$call[[2]]` is a call #> `new$call[[2]]` is an S3 object of class <formula>, a call I'm aware formulas are a form of call but what I'm unsure of is whether there is meaningful difference between the two versions of the models? Any clarification, even just on the relation between formulas and calls would be useful.
>>>>> Tim Taylor >>>>> on Wed, 11 Aug 2021 08:45:48 +0000 writes:> Manipulating formulas within different models I notice the following: > m1 <- lm(formula = hp ~ cyl, data = mtcars) > m2 <- update(m1, formula. = hp ~ cyl) > all.equal(m1, m2) > #> [1] TRUE > identical(m1, m2) > #> [1] FALSE > waldo::compare(m1, m2) > #> `old$call[[2]]` is a call > #> `new$call[[2]]` is an S3 object of class <formula>, a call > I'm aware formulas are a form of call but what I'm unsure > of is whether there is meaningful difference between the > two versions of the models? A good question. In principle, the promise of an update() method should be to produce the *same* result as calling the original model-creation (or more generally object-creation) function call. So, already with identical(), you've shown that this is not quite the case for simple lm(), and indeed that is a bit undesirable. To answer your question re "meaningful" difference, given what I say above is: No, there shouldn't be any relevant difference, and if there is, that may considered a bug in the respective update() method, here update.lm. More about this in the following R code snippet : ## MM: indeed, identical(m1$call, m2$call) #> [1] FALSE noCall <- function(x) x[setdiff(names(x), "call")] identical(noCall(m1), noCall(m2))# TRUE! ## look closer: c1 <- m1$call c2 <- m2$call str(as.list(c1)) ## List of 3 ## $ : symbol lm ## $ formula: language hp ~ cyl ## $ data : symbol mtcars str(as.list(c2)) ## List of 3 ## $ : symbol lm ## $ formula:Class 'formula' language hp ~ cyl ## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> ## $ data : symbol mtcars identical(c1[-2], c2[-2]) # TRUE ==> so, indeed the difference is *only* in the formula ( = [2]) component f1 <- c1$formula f2 <- c2$formula all.equal(f1,f2) # TRUE identical(f1,f2) # FALSE ## Note that this is typically *not* visible if the user uses the accessor functions: identical(formula(m1), formula(m2)) # TRUE ! ## and indeed, the formula() method for 'lm' does set the environment: stats:::formula.lm -- Martin Maechler ETH Zurich and R Core
Martin Maechler
2021-Aug-11 13:15 UTC
[Rd] difference of m1 <- lm(f, data) and update(m1, formula=f)
I'm diverting this from R-help to R-devel, because I'm asking / musing if and if where we should / could change R here (see below).>>>>> Martin Maechler on 11 Aug 2021 11:51:25 +0200>>>>> Tim Taylor .. on 08:45:48 +0000 writes:>> Manipulating formulas within different models I notice the following: >> m1 <- lm(formula = hp ~ cyl, data = mtcars) >> m2 <- update(m1, formula. = hp ~ cyl) >> all.equal(m1, m2) >> #> [1] TRUE >> identical(m1, m2) >> #> [1] FALSE >> waldo::compare(m1, m2) >> #> `old$call[[2]]` is a call >> #> `new$call[[2]]` is an S3 object of class <formula>, a call >> I'm aware formulas are a form of call but what I'm unsure >> of is whether there is meaningful difference between the >> two versions of the models? > A good question. > In principle, the promise of an update() method should be to > produce the *same* result as calling the original model-creation > (or more generally object-creation) function call. > So, already with identical(), you've shown that this is not > quite the case for simple lm(), > and indeed that is a bit undesirable. > To answer your question re "meaningful" difference, > given what I say above is: > No, there shouldn't be any relevant difference, and if there is, > that may considered a bug in the respective update() method, > here update.lm. > More about this in the following R code snippet : Again, a repr.ex.: ---0<-------0<-------0<-------0<-------0<-------0<-------0<---- m1 <- lm(formula = hp ~ cyl, data = mtcars) m2 <- update(m1, formula. = hp ~ cyl) m2a <- update(m1) identical(m1, m2a)#> TRUE ! ## ==> calling update() & explicitly specifying the formula is "the problem" identical(m1$call, m2$call) #> [1] FALSE noCall <- function(x) x[setdiff(names(x), "call")] identical(noCall(m1), noCall(m2))# TRUE! ## look closer: c1 <- m1$call c2 <- m2$call str(as.list(c1)) ## List of 3 ## $ : symbol lm ## $ formula: language hp ~ cyl ## $ data : symbol mtcars str(as.list(c2)) ## List of 3 ## $ : symbol lm ## $ formula:Class 'formula' language hp ~ cyl ## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> ## $ data : symbol mtcars identical(c1[-2], c2[-2]) # TRUE ==> so, indeed the difference is *only* in the formula ( = [2]) component f1 <- c1$formula f2 <- c2$formula all.equal(f1,f2) # TRUE identical(f1,f2) # FALSE ## Note that this is typically *not* visible if the user uses ## the accessor functions they should : identical(formula(m1), formula(m2)) # TRUE ! ## and indeed, the formula() method for 'lm' does set the environment: stats:::formula.lm ---0<-------0<-------0<-------0<-------0<-------0<-------0<---- We know that it has been important in R the formulas have an environment and that's been the only R-core recommended way to do non-standard evaluation (!! .. but let's skip that for now !!). OTOH we have also kept the convention that a formula without environment implicitly means its environment is .GlobalEnv aka globalenv(). Currently, I think formula() methods then *should* always return a formula *with* an environment .. even though that's not claimed in the reference, i.e., ?formula. Also, the print() method for formulas by default does *not* show the environment if it is .GlobalEnv, as you can see on that help already in the "Usage" section: ## S3 method for class 'formula' print(x, showEnv = !identical(e, .GlobalEnv), ...) Now, I've looked at the update() here, which is update.default() and the source code of that currently is update.formula <- function (old, new, ...) { tmp <- .Call(C_updateform, as.formula(old), as.formula(new)) ## FIXME?: terms.formula() with "large" unneeded attributes: formula(terms.formula(tmp, simplify = TRUE)) } where the important part is the "FIXME" comment (seen in the R sources, but no longer in the R function after installation). My current "idea" is to formalize what we see working here: namely allow update.formula() to *not* set the environment of its result *if* that environment would be .GlobalEnv .. --> I'm starting to test my proposal but would still be *very* glad for comments, also contradicting ones! Martin