thr3ads.net - R help - [R] newdata for predict.lm() ?? [Nov 2020]


Boris Steipe

2020-Nov-04 09:50 UTC

[R] newdata for predict.lm() ??

Can't get data from a data frame into predict() without a detour that seems
quite unnecessary ...

Reprex:

# Data frame with simulated data in columns "h" (independent) and
# "w" (dependent)
DAT <- structure(list(h = c(2.174, 2.092, 2.059, 1.952, 2.216, 2.118, 
                            1.755, 2.060, 2.136, 2.126, 1.792, 1.574,
                            2.117, 1.741, 2.295, 1.526, 1.666, 1.581,
                            1.522, 1.995), 
                      w = c(90.552, 89.518, 84.124, 94.685, 94.710, 82.429,
                            87.176, 90.318, 76.873, 84.183, 57.890, 62.005,
                            84.258, 78.317,101.304, 64.982, 71.237, 77.124,
                            65.010, 81.413)),
                 row.names = c( "1",  "2",  "3",  "4",  "5",  "6",  "7",
                                "8",  "9", "10", "11", "12", "13", "14",
                               "15", "16", "17", "18", "19", "20"),
                 class = "data.frame")


myFit <- lm(DAT$w ~ DAT$h)
coef(myFit)

# (Intercept)       DAT$h 
#   11.76475    35.92002 


# Create 50 x-values with seq() to plot confidence intervals
myNew <- data.frame(seq(min(DAT$h), max(DAT$h), length.out = 50))

pc <- predict(myFit, newdata = myNew, interval = "confidence")

# Warning message:
# 'newdata' had 50 rows but variables found have 20 rows 

# Problem: predict() was not able to take the single column in myNew
# as the independent variable.

# Ugly workaround: but with that everything works as expected.
xx <- DAT$h
yy <- DAT$w
myFit <- lm(yy ~ xx)
coef(myFit)

myNew <- data.frame(seq(min(DAT$h), max(DAT$h), length.out = 50))
colnames(myNew) <- "xx"  # This fixes it!

pc <- predict(myFit, newdata = myNew, interval = "confidence")
str(pc)

# So: specifying the column in newdata to have same name as the coefficient
# name should work, right?
# Back to the original ...

myFit <- lm(DAT$w ~ DAT$h)
colnames(myNew) <- "`DAT$h`"
# ... same warning

colnames(myNew) <- "h"
# ... same warning again.

Bottom line: how can I properly specify newdata? The documentation is opaque. It
seems the algorithm is trying to EXACTLY match the text of the RHS of the
formula, which is unlikely to result in a useful column name, unless I assign to
an intermediate variable. There must be a better way ...



Thanks!
Boris
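
The guess about exact matching can be checked by looking at what the fitted model stores. The sketch below uses a small made-up data frame (not the data from the post) and compares the term labels and the symbols that have to be found again when predict() rebuilds the model frame. With the $ form the symbol needed is DAT itself, which newdata does not contain, so the lookup presumably falls through to the original 20-row object in the workspace.

```r
# Made-up illustration data; the values are assumptions, not from the post
DAT <- data.frame(h = c(1.5, 1.7, 1.9, 2.1, 2.3),
                  w = c(62, 70, 81, 85, 97))

fit_dollar <- lm(DAT$w ~ DAT$h)       # data hard-wired into the formula
fit_plain  <- lm(w ~ h, data = DAT)   # variables looked up in 'data'

# Term labels are the literal right-hand-side expressions
label_dollar <- attr(terms(fit_dollar), "term.labels")
label_plain  <- attr(terms(fit_plain),  "term.labels")

# Symbols that must be resolvable when the model frame is rebuilt
vars_dollar <- all.vars(delete.response(terms(fit_dollar)))
vars_plain  <- all.vars(delete.response(terms(fit_plain)))

print(list(label_dollar = label_dollar, vars_dollar = vars_dollar,
           label_plain  = label_plain,  vars_plain  = vars_plain))
```

For fit_plain the only symbol needed is h, so a newdata column named h is enough.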

peter dalgaard

2020-Nov-04 09:56 UTC


[R] newdata for predict.lm() ??

Don't use $ notation in lm() formulas. Use lm(w ~ h, data=DAT).

-pd
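
A minimal sketch of that advice applied to the reprex, reusing the data from the original post. The only changes are the formula, the data argument, and naming the newdata column h so that it matches the variable in the formula.

```r
# Data from the original post
DAT <- data.frame(
  h = c(2.174, 2.092, 2.059, 1.952, 2.216, 2.118, 1.755, 2.060, 2.136, 2.126,
        1.792, 1.574, 2.117, 1.741, 2.295, 1.526, 1.666, 1.581, 1.522, 1.995),
  w = c(90.552, 89.518, 84.124, 94.685, 94.710, 82.429, 87.176, 90.318,
        76.873, 84.183, 57.890, 62.005, 84.258, 78.317, 101.304, 64.982,
        71.237, 77.124, 65.010, 81.413))

# Formula names the variables; 'data' says where to find them
myFit <- lm(w ~ h, data = DAT)

# newdata needs a column called "h", matching the formula
myNew <- data.frame(h = seq(min(DAT$h), max(DAT$h), length.out = 50))

pc <- predict(myFit, newdata = myNew, interval = "confidence")
str(pc)  # one row per value in myNew, columns fit / lwr / upr
```

No intermediate variables such as xx and yy are needed with this form.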
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Achim Zeileis

2020-Nov-04 10:05 UTC


[R] newdata for predict.lm() ??

On Wed, 4 Nov 2020, peter dalgaard wrote:
> Don't use $ notation in lm() formulas. Use lm(w ~ h, data=DAT).
...or in any other formula for that matter!

Let me expand a bit on Peter's comment because this is really a pet peeve 
of mine:

The idea is that the formula "w ~ h" describes the relationships between
the variables involved, independent of the data set it should be applied
to. In contrast, "DAT$w ~ DAT$h" hard-wires the data into the formula and
prevents the formula from being applied to another data set.

Hope that helps,
Achim
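
To illustrate the point about reusing a formula, here is a small sketch with two made-up data frames (DAT1 and DAT2 are hypothetical, not from the thread). Because the formula only names variables, the same model specification can be fitted to either data set, or refitted with update().

```r
# Two made-up data sets with the same column names (assumed values)
DAT1 <- data.frame(h = c(1.5, 1.7, 1.9, 2.1, 2.3),
                   w = c(62, 70, 81, 85, 97))
DAT2 <- data.frame(h = c(1.6, 1.8, 2.0, 2.2),
                   w = c(66, 75, 80, 93))

# The same formula works for both because it does not mention a data set
fit1 <- lm(w ~ h, data = DAT1)
fit2 <- lm(w ~ h, data = DAT2)

# Refit the first model on the second data set without retyping the formula
fit1b <- update(fit1, data = DAT2)

print(all.equal(coef(fit2), coef(fit1b)))
```

With lm(DAT1$w ~ DAT1$h) neither the second fit nor the update() call would be possible without rewriting the formula.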

