Can't get data from a data frame into predict() without a detour that seems quite unnecessary ...

Reprex:

# Data frame with simulated data in columns "h" (independent) and "w" (dependent)
DAT <- structure(list(h = c(2.174, 2.092, 2.059, 1.952, 2.216, 2.118,
                            1.755, 2.060, 2.136, 2.126, 1.792, 1.574,
                            2.117, 1.741, 2.295, 1.526, 1.666, 1.581,
                            1.522, 1.995),
                      w = c(90.552, 89.518, 84.124, 94.685, 94.710, 82.429,
                            87.176, 90.318, 76.873, 84.183, 57.890, 62.005,
                            84.258, 78.317,101.304, 64.982, 71.237, 77.124,
                            65.010, 81.413)),
                 row.names = c( "1",  "2",  "3",  "4",  "5",  "6",  "7",
                                "8",  "9", "10", "11", "12", "13", "14",
                               "15", "16", "17", "18", "19", "20"),
                 class = "data.frame")


myFit <- lm(DAT$w ~ DAT$h)
coef(myFit)

# (Intercept)       DAT$h
#    11.76475    35.92002


# Create 50 x-values with seq() to plot confidence intervals
myNew <- data.frame(seq(min(DAT$h), max(DAT$h), length.out = 50))

pc <- predict(myFit, newdata = myNew, interval = "confidence")

# Warning message:
# 'newdata' had 50 rows but variables found have 20 rows

# Problem: predict() was not able to take the single column in myNew
# as the independent variable.

# Ugly workaround: but with that everything works as expected.
xx <- DAT$h
yy <- DAT$w
myFit <- lm(yy ~ xx)
coef(myFit)

myNew <- data.frame(seq(min(DAT$h), max(DAT$h), length.out = 50))
colnames(myNew) <- "xx"  # This fixes it!

pc <- predict(myFit, newdata = myNew, interval = "confidence")
str(pc)

# So: specifying the column in newdata to have same name as the coefficient
# name should work, right?
# Back to the original ...

myFit <- lm(DAT$w ~ DAT$h)
colnames(myNew) <- "`DAT$h`"
# ... same error

colnames(myNew) <- "h"
# ... same error again.

Bottom line: how can I properly specify newdata? The documentation is opaque. It seems the algorithm is trying to EXACTLY match the text of the RHS of the formula, which is unlikely to result in a useful column name, unless I assign to an intermediate variable. There must be a better way ...

Thanks!
Boris

Don't use $ notation in lm() formulas. Use lm(w ~ h, data=DAT).

-pd

> On 4 Nov 2020, at 10:50, Boris Steipe <boris.steipe at utoronto.ca> wrote:
>
> Bottom line: how can I properly specify newdata?
>
> Thanks!
> Boris
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
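
For readers following along, here is a minimal sketch of what the suggestion above amounts to, reusing the data and the object names (DAT, myFit, myNew, pc) from the original post. The data frame is built with data.frame() instead of structure() for brevity; that substitution is mine, not part of the thread. The key point is that the newdata column carries the same name as the variable in the formula, "h".

```r
# Same simulated data as in the original post, built with data.frame()
DAT <- data.frame(
  h = c(2.174, 2.092, 2.059, 1.952, 2.216, 2.118, 1.755, 2.060, 2.136, 2.126,
        1.792, 1.574, 2.117, 1.741, 2.295, 1.526, 1.666, 1.581, 1.522, 1.995),
  w = c(90.552, 89.518, 84.124, 94.685, 94.710, 82.429, 87.176, 90.318,
        76.873, 84.183, 57.890, 62.005, 84.258, 78.317, 101.304, 64.982,
        71.237, 77.124, 65.010, 81.413)
)

# Name the variables in the formula and pass the data frame separately
myFit <- lm(w ~ h, data = DAT)
coef(myFit)   # coefficient names are now "(Intercept)" and "h"

# newdata needs a column called "h", matching the variable in the formula
myNew <- data.frame(h = seq(min(DAT$h), max(DAT$h), length.out = 50))

pc <- predict(myFit, newdata = myNew, interval = "confidence")
str(pc)       # 50 x 3 matrix with columns "fit", "lwr", "upr"
```

With this form, predict() looks up "h" in myNew first, so all 50 rows are used and no warning about mismatched row counts should appear.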

On Wed, 4 Nov 2020, peter dalgaard wrote:

> Don't use $ notation in lm() formulas. Use lm(w ~ h, data=DAT).

...or in any other formula for that matter!

Let me expand a bit on Peter's comment because this is really a pet peeve of mine: The idea is that the formula "w ~ h" describes the relationship between the variables involved, independent of the data set it is applied to. In contrast, "DAT$w ~ DAT$h" hard-wires the data into the formula and prevents the formula from being applied to another data set.

Hope that helps,
Achim
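
To illustrate the point about formulas being independent of the data, here is a small sketch with two simulated data sets. All names (d1, d2, f, fit1, fit2, bad, p_ok, p_bad) and the simulated values are hypothetical and chosen only for this example; they do not come from the thread. It shows one formula fitted to two data frames, and what happens to predict() when the data frame is hard-wired into the formula.

```r
set.seed(1)  # simulated data, purely for illustration

# Two hypothetical data sets with the same column names
d1 <- data.frame(h = runif(20, 1.5, 2.3))
d1$w <- 12 + 36 * d1$h + rnorm(20, sd = 8)
d2 <- data.frame(h = runif(30, 1.5, 2.3))
d2$w <- 12 + 36 * d2$h + rnorm(30, sd = 8)

# One formula, applied to either data set via the data argument
f <- w ~ h
fit1 <- lm(f, data = d1)
fit2 <- lm(f, data = d2)

# Prediction on other data works because "h" is looked up in newdata
p_ok <- predict(fit1, newdata = d2)
length(p_ok)    # 30, one prediction per row of d2

# Hard-wired version: the term is literally the expression d1$h
bad <- lm(d1$w ~ d1$h)
attr(terms(bad), "term.labels")   # "d1$h"

# d1$h is evaluated as an expression, so the 20 values from d1 are
# used again and the "h" column in newdata is ignored (with a warning)
p_bad <- suppressWarnings(predict(bad, newdata = d2))
length(p_bad)   # 20, not 30
```

This is also why renaming the newdata column to "h" or "`DAT$h`" did not help in the original post: the term is an expression that gets evaluated, not a column name that gets matched.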