thr3ads.net - R help - [R] Predict when regressors are passed through a data matrix [May 2010]

Paolo Agnolucci

2010-May-05 10:16 UTC

[R] Predict when regressors are passed through a data matrix

Hi everyone,

this should be pretty basic, but I need to ask for help as I got stuck.

I am running simple linear regression models in R with k regressors, where
k > 1. In order to automate my code I packed all the regressors into a matrix X
so that lm(y~X) will always produce the results I want regardless of the
variables in X. I am new to R but I found this advice somewhere, so I guess
it is relatively standard practice. This works very well until I need to
forecast using the estimated model.

I cannot pass a matrix to predict. When I pass a data frame I get the
fitted values, which leads me to think that R doesn't see the data frame I
pass to predict.

Thanks in advance,

Paolo



# REPRODUCIBLE CODE
x <- matrix(rnorm(30), ncol =2)
y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
new_x <- matrix(rnorm(2), ncol =2)
new_x.d <- data.frame(new_x)

# fitted values
predict(lm(y ~ x))

# same as fitted values
predict(lm(y ~ x), new_x.d)

# error
predict(lm(y ~ x), new_x)


Dennis Murphy

2010-May-05 14:21 UTC


[R] Predict when regressors are passed through a data matrix

Hi:

The problem arises because the variable names of the explanatory variables
in the newdata data frame used in predict() have to match those in the fitted
model object. Interestingly, using a matrix for the right hand side of the
model formula in lm() creates problems for predict().

Using your code,

> x <- matrix(rnorm(30), ncol =2)
> y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
> m0 <- lm(y ~ x)
> m0
...
Coefficients:
(Intercept)           x1           x2
   0.590281     4.868230    -0.007012
> new_x <- matrix(rnorm(2), ncol =2)
> new_x.d <- data.frame(new_x)
> new_x.d
         X1        X2
1 0.1225315 0.8099963

The covariates in the model are named x1 and x2, whereas those in the
data frame you want to use in predict() are X1 and X2, creating a name
mismatch.

The apparent 'solution' is to change the names in new_x.d to lower case,
but interesting things happen...

> names(new_x.d) <- c('x1', 'x2')
> predict(m0, new_x.d)
         1          2          3          4          5          6          7
 1.1734885 -5.5551829  3.5652911  7.9607333 -9.4959770  4.3378850 -3.5098720
         8          9         10         11         12         13         14
-2.1571867  3.8502343  5.8451436 -6.7490334  0.2203290 -4.2810391  0.4988267
        15
 6.8596084
Warning message:
'newdata' had 1 rows but variable(s) found have 15 rows
> new_x.d
         x1        x2
1 0.1225315 0.8099963

Even though the names (apparently) match now, predict() returns the
predicted values from the original
input *matrix*, and that turns out to matter...

Let's go back to x and put some column names on it, refit the model and try
predict() again:

> colnames(x) <- c('x1', 'x2')
> class(x)
[1] "matrix"
> m1 <- lm(y ~ x)
> predict(m1, new_x.d)   # Same as above...

Although the variable names in the input matrix and new_x.d now match,
predict() still 'misbehaves'.  To see why,

> m1
...
Coefficients:
(Intercept)          xx1          xx2
   0.590281     4.868230    -0.007012

lm() prefixes the matrix name x to the column names, giving xx1 and xx2,
thus causing another mismatch with the variable names in predict().

Now, combine x and y into a data frame, refit the model and try predict()
again:

> xx <- data.frame(y, x)
# verify that it's a data frame with the right variable names...
> str(xx)
'data.frame':   15 obs. of  3 variables:
 $ y : num  0.236 -6.069 2.687 7.323 -10.028 ...
 $ x1: num  0.12 -1.261 0.611 1.514 -2.069 ...
 $ x2: num  0.367 1.192 -0.102 0.117 1.66 ...

# Refit the model and run predict() again:
> m2 <- lm(y ~ ., data = xx)
> predict(m2, new_x.d)
       1
1.181113

Now it works.
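
For the original goal of automating the fit for any number of regressors, the same data frame approach can be wrapped up as follows. This is only a sketch: the seed, the paste0() naming scheme and the object names train, new_d, fit and pred are assumptions made for illustration, not part of the original code.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility only
x <- matrix(rnorm(30), ncol = 2)
y <- 1 + 3 * x[, 1] + 2 * x[, 1] + rnorm(15)  # formula kept exactly as posted
new_x <- matrix(rnorm(2), ncol = 2)

# Give the training and the new matrix the same column names (x1, ..., xk).
colnames(x) <- paste0("x", seq_len(ncol(x)))
colnames(new_x) <- colnames(x)

# Fit on a data frame; "." stands for every column other than y.
train <- data.frame(y = y, x)
fit <- lm(y ~ ., data = train)

# newdata now carries variables with the names the model expects.
new_d <- data.frame(new_x)
pred <- predict(fit, newdata = new_d)
print(pred)
```

Because the formula is y ~ ., the call does not change when columns are added to or dropped from x.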

Evidently, inputting a matrix for the right hand side of the model formula
in lm() creates problems for predict(). According to the help page, the first
argument of predict.lm() is an object of class lm, whereas the second argument
is a data frame. As it turns out, the key phrase needed to understand what's
going on is the following:

predict.lm produces predicted values, obtained by evaluating the regression
function in the frame newdata (which defaults to model.frame(object)).
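
One way to see what "evaluating the regression function in the frame newdata" means for a matrix fit is to ask the fitted model which variables it will look up. The sketch below assumes an arbitrary seed and the hypothetical names needed, nd and p; delete.response() and all.vars() are standard R functions.

```r
set.seed(42)                                  # arbitrary seed, for reproducibility only
x <- matrix(rnorm(30), ncol = 2)
y <- 1 + 3 * x[, 1] + 2 * x[, 1] + rnorm(15)  # formula kept exactly as posted
m0 <- lm(y ~ x)

# Predictor variables that predict() will try to find in newdata.
needed <- all.vars(delete.response(terms(m0)))
print(needed)

# newdata has columns x1 and x2 but no variable called "x", so the lookup
# falls back to the x in the workspace. The resulting warning about row
# counts is suppressed here.
nd <- data.frame(x1 = 0.1, x2 = 0.8)
p <- suppressWarnings(predict(m0, newdata = nd))

# One value per training row, equal to the fitted values.
print(length(p))
print(isTRUE(all.equal(unname(p), unname(fitted(m0)))))
```

This matches the behaviour seen above, where the warning reports that newdata had 1 row but the variable found has 15.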

The names of the model.frame() objects in the three models are:

> names(model.frame(m0))    # x is a matrix, no colnames
[1] "y" "x"
> names(model.frame(m1))     # x is a matrix with colnames
[1] "y" "x"
> names(model.frame(m2))    # x1 and x2 are variables in a data frame
[1] "y"  "x1" "x2"

Notice that these are the same as the objects given in the respective model
formulas.

Moreover,

> head(model.frame(m0), 1)
          y       x.1       x.2
1 0.2355153 0.1203279 0.3674401
> head(model.frame(m1), 1)
          y      x.x1      x.x2
1 0.2355153 0.1203279 0.3674401
> head(model.frame(m2), 1)
          y        x1        x2
1 0.2355153 0.1203279 0.3674401

Now, one can see that the names assigned to the covariates by model.frame()
when x is a
matrix depend on the column names assigned to the input matrix. Does this
help?

Let's copy new_x.d to another data frame object and rename the variables for
prediction with m0:

> new0 <- new_x.d
> names(new0) <- c('x.1', 'x.2')
> predict(m0, new0)
         1          2          3          4          5          6          7
 1.1734885 -5.5551829  3.5652911  7.9607333 -9.4959770  4.3378850 -3.5098720
         8          9         10         11         12         13         14
-2.1571867  3.8502343  5.8451436 -6.7490334  0.2203290 -4.2810391  0.4988267
        15
 6.8596084
Warning message:
'newdata' had 1 rows but variable(s) found have 15 rows
> new0
        x.1       x.2
1 0.1225315 0.8099963

That doesn't help, either. predict() is not recognizing x.1 and x.2 as
variable names in the model frame of m0, whose only variables are y and x,
as seen in names(model.frame(m0)).
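
Since the model frame of m0 holds a single variable named x that is a two-column matrix, one thing that should work with the matrix fit is a newdata data frame containing a matrix column of that same name. The sketch below is an assumption-laden illustration rather than a recommended recipe: the seed and the names nd, p and by_hand are made up here, and I() is used only to stop data.frame() from splitting the matrix.

```r
set.seed(42)                                  # arbitrary seed, for reproducibility only
x <- matrix(rnorm(30), ncol = 2)
y <- 1 + 3 * x[, 1] + 2 * x[, 1] + rnorm(15)  # formula kept exactly as posted
new_x <- matrix(rnorm(2), ncol = 2)
m0 <- lm(y ~ x)

# I() keeps new_x as one matrix-valued column, so newdata has a single
# variable called "x", just like model.frame(m0).
nd <- data.frame(x = I(new_x))
p <- predict(m0, newdata = nd)

# Compare with the prediction computed by hand from the coefficients.
by_hand <- sum(coef(m0) * c(1, new_x))
print(c(predict = unname(p), by_hand = by_hand))
```

The new matrix must have the same number of columns, in the same order, as the one used in the fit.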

The moral seems to be: to use predict() predictably, make sure that the
inputs to lm() are in a data frame. One experiences far fewer headaches
that way.

A clearer, pithier explanation of why this phenomenon occurs would be
welcome, too :)

HTH,
Dennis

