Thomas J. Leeper
2017-Mar-30 14:01 UTC
[Rd] get_all_vars() does not handle rhs matrices in formulae
Hello again,

It appears that get_all_vars() incorrectly handles model formulae that use a right-hand side (rhs) matrix. For example, consider these two substantively identical models:

    # model using named variables
    mpg <- mtcars$mpg
    wt <- mtcars$wt
    hp <- mtcars$hp
    m1 <- lm(mpg ~ wt + hp)

    # model using matrix
    y <- mtcars$mpg
    x <- cbind(mtcars$wt, mtcars$hp)
    m2 <- lm(y ~ x)

For the first, get_all_vars() returns the correct data frame:

    str(get_all_vars(m1, .GlobalEnv))
    ## 'data.frame': 32 obs. of 3 variables:
    ## $ mpg: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    ## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
    ## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...

which could, for example, be passed on to predict() just like the output from model.frame():

    str(predict(m1, model.frame(m1)))
    ## Named num [1:32] 23.6 22.6 25.3 21.3 18.3 ...
    ## - attr(*, "names")= chr [1:32] "1" "2" "3" "4" ...

    str(predict(m1, get_all_vars(m1)))
    ## Named num [1:32] 23.6 22.6 25.3 21.3 18.3 ...
    ## - attr(*, "names")= chr [1:32] "1" "2" "3" "4" ...

For the model specified with a rhs matrix, however, get_all_vars() splits the matrix into separate columns. It returns a three-column data frame in which the second matrix column appears as an unnamed third column:

    str(get_all_vars(m2, .GlobalEnv))
    ## 'data.frame': 32 obs. of 3 variables:
    ## $ y : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    ## $ x : num 2.62 2.88 2.32 3.21 3.44 ...
    ## $ NA: num 110 110 93 110 175 105 245 62 95 123 ...

This means that attempts to use this data structure in predict() fail:

    str(predict(m2, get_all_vars(m2)))
    ## Error: variable 'x' was fitted with type "nmatrix.2" but type "numeric" was supplied

For that call to succeed, the structure needs to resemble the following, with the matrix kept as a single column:

    newdat <- data.frame(y = y)
    newdat$x <- x
    str(newdat)
    ## 'data.frame': 32 obs. of 2 variables:
    ## $ y: num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    ## $ x: num [1:32, 1:2] 2.62 2.88 2.32 3.21 3.44 ...

    str(predict(m2, newdat))
    ## Named num [1:32] 23.6 22.6 25.3 21.3 18.3 ...
    ## - attr(*, "names")= chr [1:32] "1" "2" "3" "4" ...

The correct structure is basically what model.frame() returns in cases involving a rhs matrix:

    all.equal(newdat, model.frame(m2), check.attributes = FALSE)
    ## [1] TRUE

The issue seems to be in one of the very last lines of get_all_vars():

    x <- setNames(as.data.frame(c(variables, extras), optional = TRUE),
                  c(varnames, extranames))

This coerces `variables` to the wrong structure (a three-column data frame instead of a two-column one) and, as a result, misnames the resulting columns.

Unfortunately, I don't know the most sensible/general way to solve this; otherwise I would submit a patch. Does anyone know how to fix this last line?

Best,
-Thomas

Thomas J. Leeper
http://www.thomasleeper.com
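
One possible direction for that last line, offered only as a minimal sketch and not as a tested patch: rather than calling as.data.frame() on the whole list (which splits a matrix into one column per matrix column), the data frame could be assembled one variable at a time, so that a matrix stays a single matrix-valued column. The helper name build_vars_df() below is hypothetical and not part of base R. It assumes every element of `variables` has the same number of rows, and that any `extras` would simply be passed in as part of the same list.

```r
# Hypothetical helper (an assumption for illustration, not part of base R):
# assemble a data frame from a list of evaluated variables, keeping any
# matrix-valued element as a single column instead of splitting it.
build_vars_df <- function(variables, varnames) {
  # Start from a zero-column data frame with the right number of rows.
  n <- NROW(variables[[1L]])
  out <- data.frame(row.names = seq_len(n))
  # Assigning with [[<- stores a matrix as one matrix-valued column,
  # the same way newdat$x <- x does in the example above.
  for (i in seq_along(variables)) {
    out[[varnames[i]]] <- variables[[i]]
  }
  out
}

# Usage with the rhs-matrix model from the report
y <- mtcars$mpg
x <- cbind(mtcars$wt, mtcars$hp)
m2 <- lm(y ~ x)

newdat <- build_vars_df(list(y, x), c("y", "x"))
str(newdat)
str(predict(m2, newdat))
```

Whether a loop like this is general enough for everything get_all_vars() has to handle (for example, variables of differing lengths or non-standard extras) is an open question that this sketch does not address.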