Doesn't this preclude "y ~ ." style notations?

> On Aug 9, 2020, at 11:56 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>
> This is fairly clearly documented in ?lm:
>
> "All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula."
>
> There are lots of possible places to look for weights, but this seems to me like a pretty sensible search order. In most cases the environment of the formula will have a parent environment chain that eventually leads to the global environment, so (with no conflicts) your strategy of defining w there will sometimes work, but looks pretty unreliable.
>
> When you say you want to work around this search order, I think the obvious way is to add your w vector to your d dataframe. That way it is guaranteed to be found even if there's a conflicting variable in the formula environment, or the global environment.
>
> Duncan Murdoch
>
> On 09/08/2020 2:13 p.m., John Mount wrote:
>> I know programmers can reason this out from R's late parameter evaluation rules PLUS the explicit match.call()/eval() lm() does to work with the passed-in formula and data frame. But from a statistical user's point of view this seems to be counter-productive. At best it works as if the user is passing in the name of the weights variable instead of values (I know this is the obvious consequence of NSE).
>>
>> lm() takes instance weights from the formula environment. Usually that environment is the interactive environment or a close child of the interactive environment, and we are lucky enough to have no intervening name collisions, so we don't have a problem. However, it makes programming over formulas for lm() a bit tricky. Here is an example of the issue.
>>
>> Is there any recommended discussion on this and how to work around it? In my own work I explicitly set the formula environment and put the weights in that environment.
>>
>> d <- data.frame(x = 1:3, y = c(3, 3, 4))
>> w <- c(1, 5, 1)
>> # works
>> lm(y ~ x, data = d, weights = w)
>> # fails, as weights are taken from formula environment
>> fn <- function() {  # deliberately set up formula with bad value in environment
>>   w <- c(-1, -1, -1, -1)  # bad weights
>>   f <- as.formula(y ~ x)  # captures bad weights with as.formula(env = parent.frame()) default
>>   return(f)
>> }
>> lm(fn(), data = d, weights = w)
>> # Error in model.frame.default(formula = fn(), data = d, weights = w, drop.unused.levels = TRUE) :
>> #   variable lengths differ (found for '(weights)')
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
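The two workarounds mentioned in this exchange (adding w to the data frame, and explicitly setting the formula environment) might look roughly like the following. This is only a sketch built on the d, w, and fn() from the quoted example; the names d_w, env, fit_data, fit_env, and fit_ref are made up for illustration, and using globalenv() as the parent of the new environment is an assumption rather than anything stated in the thread.

```r
d <- data.frame(x = 1:3, y = c(3, 3, 4))
w <- c(1, 5, 1)

fn <- function() {
  w <- c(-1, -1, -1, -1)  # conflicting value that ends up in the formula environment
  y ~ x                   # the formula's environment is this function's frame
}

# Reference fit: the formula is created at top level, so w is found there.
fit_ref <- lm(y ~ x, data = d, weights = w)

# Workaround 1: put the weights into the data frame, which is searched
# before the environment of the formula.
d_w <- cbind(d, w = w)  # hypothetical name for the augmented data frame
fit_data <- lm(fn(), data = d_w, weights = w)

# Workaround 2: replace the formula environment with one that holds the
# intended weights (parent chosen as globalenv() here by assumption).
f <- fn()
env <- new.env(parent = globalenv())
assign("w", w, envir = env)
environment(f) <- env
fit_env <- lm(f, data = d, weights = w)

# Both workarounds should reproduce the reference coefficients.
all.equal(coef(fit_data), coef(fit_ref))
all.equal(coef(fit_env), coef(fit_ref))
```

The first variant does not depend on where the formula was created, which seems to be the point of the "guaranteed to be found" remark; the second keeps the data frame untouched at the cost of managing an environment by hand.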
On 09/08/2020 3:01 p.m., John Mount wrote:
> Doesn't this preclude "y ~ ." style notations?

Yes, but you can use "y ~ . - w".

Duncan Murdoch
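For illustration, the "y ~ . - w" suggestion could be tried like this once w has been stored as a column of d. This is a small sketch reusing the data from the original example; fit_dot and fit_ref are names invented here.

```r
d <- data.frame(x = 1:3, y = c(3, 3, 4))
d$w <- c(1, 5, 1)  # weights stored as a column, so they are found in data first

# "." expands to every column other than the response (here x and w);
# "- w" then removes the weights column from the predictors.
fit_dot <- lm(y ~ . - w, data = d, weights = w)

# Same model written out explicitly, for comparison.
fit_ref <- lm(y ~ x, data = d, weights = w)

names(coef(fit_dot))
all.equal(coef(fit_dot), coef(fit_ref))
```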
On 09/08/2020 3:07 p.m., Duncan Murdoch wrote:
> On 09/08/2020 3:01 p.m., John Mount wrote:
>> Doesn't this preclude "y ~ ." style notations?
>
> Yes, but you can use "y ~ . - w".

And as was pointed out to me offline, often one doesn't have a simple vector w giving the weights; instead, one computes the weights from the predictors. So if weights = f(pred), the original "y ~ ." would be fine.

Duncan Murdoch
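A sketch of that last case, with weights written as an expression in the predictors: the weighting 1 / x is purely a made-up example of "f(pred)", not something proposed in the thread, and fit is a name chosen here.

```r
d <- data.frame(x = 1:3, y = c(3, 3, 4))

# The weights expression is evaluated in data first, so x refers to d$x
# and no separate w vector (or extra column) is needed.
fit <- lm(y ~ ., data = d, weights = 1 / x)  # 1 / x is a hypothetical weighting

names(coef(fit))
weights(fit)
```

Because nothing is added to d, "." still expands to the predictors only, and there is no weights column to subtract from the formula.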