Paolo Agnolucci
2010-May-05 10:16 UTC
[R] Predict when regressors are passed through a data matrix
Hi everyone,

This should be pretty basic, but I need to ask for help as I got stuck.

I am running simple linear regression models in R with k regressors, where k > 1. In order to automate my code, I packed all the regressors in a matrix X so that lm(y ~ X) will always produce the results I want regardless of the variables in X. I am new to R, but I found this advice somewhere, so I guess it is relatively standard practice. This works very well until I need to forecast using the estimated model.

I cannot pass a matrix to predict(). When I pass a data frame I get the fitted values, which leads me to think that R doesn't see the data frame I pass to predict().

Thanks in advance,
Paolo

# REPRODUCIBLE CODE
x <- matrix(rnorm(30), ncol = 2)
y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
new_x <- matrix(rnorm(2), ncol = 2)
new_x.d <- data.frame(new_x)

# fitted values
predict(lm(y ~ x))

# same as fitted values
predict(lm(y ~ x), new_x.d)

# error
predict(lm(y ~ x), new_x)
Dennis Murphy
2010-May-05 14:21 UTC
[R] Predict when regressors are passed through a data matrix
Hi:

The problem arises because the names of the explanatory variables in the newdata data frame used in predict() have to match those in the fitted model object. Interestingly, using a matrix for the right hand side of the model formula in lm() creates problems for predict(). Using your code,

> x <- matrix(rnorm(30), ncol = 2)
> y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
> m0 <- lm(y ~ x)
> m0
...
Coefficients:
(Intercept)           x1           x2
   0.590281     4.868230    -0.007012

> new_x <- matrix(rnorm(2), ncol = 2)
> new_x.d <- data.frame(new_x)
> new_x.d
         X1        X2
1 0.1225315 0.8099963

The covariates in the model are named x1 and x2, whereas those in the data frame you want to use in predict() are named X1 and X2, creating a name mismatch. The apparent 'solution' is to change the names in new_x.d to lower case, but interesting things happen...

> names(new_x.d) <- c('x1', 'x2')
> predict(m0, new_x.d)
         1          2          3          4          5          6          7
 1.1734885 -5.5551829  3.5652911  7.9607333 -9.4959770  4.3378850 -3.5098720
         8          9         10         11         12         13         14
-2.1571867  3.8502343  5.8451436 -6.7490334  0.2203290 -4.2810391  0.4988267
        15
 6.8596084
Warning message:
'newdata' had 1 rows but variable(s) found have 15 rows
> new_x.d
         x1        x2
1 0.1225315 0.8099963

Even though the names (apparently) match now, predict() returns the predicted values from the original input *matrix*, and that turns out to matter...

Let's go back to x and put some column names on it, refit the model and try predict() again:

> colnames(x) <- c('x1', 'x2')
> class(x)
[1] "matrix"
> m1 <- lm(y ~ x)
> predict(m1, new_x.d)
# Same as above...

Although the variable names in the input matrix and new_x.d now match, predict() still 'misbehaves'. To see why,

> m1
...
Coefficients:
(Intercept)          xx1          xx2
   0.590281     4.868230    -0.007012

lm() prepends the name of the matrix, x, to its column names, thus causing another mismatch with the variable names passed to predict().
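The coefficient labels shown above can be traced back to the formula and the design matrix. The following is a minimal sketch, assuming data simulated as in the original post; the seed and the helper object names (vars0, coef0, coef1) are illustrative and not part of the thread. It suggests that the formula holds a single regressor variable, x, and that the coefficient labels are that name pasted onto the column index or column name of the matrix:

```r
set.seed(123)                          # illustrative seed, not from the thread
x <- matrix(rnorm(30), ncol = 2)       # regressor matrix without column names
y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)

m0 <- lm(y ~ x)
vars0 <- all.vars(formula(m0))         # variables the formula refers to: "y" "x"
coef0 <- names(coef(m0))               # "(Intercept)" "x1" "x2"

colnames(x) <- c("x1", "x2")           # now give the matrix column names
m1 <- lm(y ~ x)
coef1 <- names(coef(m1))               # "(Intercept)" "xx1" "xx2"

print(vars0)
print(coef0)
print(coef1)
```

In both fits the only regressor variable the model knows about is x itself, so a newdata column called x1 or xx1 has nothing to match.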
Now, combine x and y into a data frame, refit the model and try predict() again:

> xx <- data.frame(y, x)
# verify that it's a data frame with the right variable names...
> str(xx)
'data.frame':   15 obs. of  3 variables:
 $ y : num  0.236 -6.069 2.687 7.323 -10.028 ...
 $ x1: num  0.12 -1.261 0.611 1.514 -2.069 ...
 $ x2: num  0.367 1.192 -0.102 0.117 1.66 ...
# Refit the model and run predict() again:
> m2 <- lm(y ~ ., data = xx)
> predict(m2, new_x.d)
       1
1.181113

Now it works. Evidently, inputting a matrix for the right hand side of the model formula in lm() creates problems for predict(). According to the help page, the first argument of predict.lm() is an object of class lm, whereas the second argument is a data frame. As it turns out, the key phrase needed to understand what's going on is the following:

    predict.lm produces predicted values, obtained by evaluating the
    regression function in the frame newdata (which defaults to
    model.frame(object)).

The names of the model.frame() objects in the three models are:

> names(model.frame(m0))   # x is a matrix, no colnames
[1] "y" "x"
> names(model.frame(m1))   # x is a matrix with colnames
[1] "y" "x"
> names(model.frame(m2))   # x1 and x2 are variables in a data frame
[1] "y"  "x1" "x2"

Notice that these are the same as the objects given in the respective model formulas. Moreover,

> head(model.frame(m0), 1)
          y       x.1       x.2
1 0.2355153 0.1203279 0.3674401
> head(model.frame(m1), 1)
          y      x.x1      x.x2
1 0.2355153 0.1203279 0.3674401
> head(model.frame(m2), 1)
          y        x1        x2
1 0.2355153 0.1203279 0.3674401

Now, one can see that the names that model.frame() displays for the covariates when x is a matrix depend on the column names assigned to the input matrix. Does this help?
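The difference between names(model.frame(m0)) and the header printed by head() can be checked directly. The sketch below assumes data simulated as in the original post; the seed and the names mf, frame_names, x_is_matrix, x_dim and has_x1 are illustrative. It suggests that the model frame stores the regressors as one matrix-valued column called x, and that labels such as x.1 and x.2 appear only when the frame is printed:

```r
set.seed(123)                          # illustrative seed, not from the thread
x <- matrix(rnorm(30), ncol = 2)
y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
m0 <- lm(y ~ x)

mf <- model.frame(m0)
frame_names <- names(mf)               # the frame has two variables: "y" and "x"
x_is_matrix <- is.matrix(mf$x)         # the x column is itself a matrix
x_dim       <- dim(mf$x)               # 15 rows, 2 columns
has_x1      <- "x.1" %in% names(mf)    # "x.1" is not a variable in the frame

print(frame_names)
print(x_is_matrix)
print(x_dim)
print(has_x1)
```

If that reading is right, renaming the columns of newdata to x.1 and x.2 cannot be expected to work, because no variable with those names exists in the model frame.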
Let's copy new_x.d to another data frame object and rename the variables for prediction with m0:

> new0 <- new_x.d
> names(new0) <- c('x.1', 'x.2')
> predict(m0, new0)
         1          2          3          4          5          6          7
 1.1734885 -5.5551829  3.5652911  7.9607333 -9.4959770  4.3378850 -3.5098720
         8          9         10         11         12         13         14
-2.1571867  3.8502343  5.8451436 -6.7490334  0.2203290 -4.2810391  0.4988267
        15
 6.8596084
Warning message:
'newdata' had 1 rows but variable(s) found have 15 rows
> new0
        x.1       x.2
1 0.1225315 0.8099963

That doesn't help, either. predict() is not recognizing x.1 and x.2 as variable names in the model frame of m0, which is consistent with names(model.frame(m0)).

The moral seems to be: to use predict() predictably, make sure that the inputs to lm() are in a data frame. One experiences far fewer headaches that way. A clearer, pithier explanation of why this phenomenon occurs would be welcome, too :)

HTH,
Dennis
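One possible explanation of the behaviour in this thread, offered as a sketch rather than a definitive account: the formula y ~ x refers to a single variable named x. When predict() builds the model frame for newdata, it looks for x in newdata and, if it is not found there, in the environment of the formula, where the original 15-row matrix still lives. That would account for both the fitted values and the warning about 15 rows. Under that reading, newdata needs one column named x that is itself a matrix, which I() makes possible inside data.frame(). The data are simulated as in the original post; the seed and the names fit, new_dat, pred and manual are illustrative:

```r
set.seed(42)                           # illustrative seed, not from the thread
x <- matrix(rnorm(30), ncol = 2)
y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
fit <- lm(y ~ x)

new_x <- matrix(rnorm(2), ncol = 2)    # one new observation, two regressors

# I() keeps new_x as a single matrix-valued column named x,
# instead of letting data.frame() split it into columns X1 and X2.
new_dat <- data.frame(x = I(new_x))
pred <- predict(fit, newdata = new_dat)

# Cross-check against the regression function evaluated by hand.
manual <- as.numeric(cbind(1, new_x) %*% coef(fit))

print(names(new_dat))
print(pred)
print(manual)
```

The name of the column in newdata has to be the name of the matrix used in the formula. With the data frame route recommended above, no such wrapping is needed.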