Mark van der Loo
2018-Mar-16 12:09 UTC
[Rd] Apparent bug in behavior of formulas with '-' operator for lm
Joris, the point is that 'z' is NOT used as a predictor in the model. Therefore it should not affect predictions. Also, I find it suspicious that the error only occurs when the response variable conitains missings and 'z' is unique (I have tested several other cases to confirm this). -Mark Op vr 16 mrt. 2018 om 13:03 schreef Joris Meys <jorismeys at gmail.com>:> It's not a bug per se. It's the effect of removing all observations linked > to a certain level in your data frame. So the output of lm() doesn't > contain a coefficient for level a of z, but your new data contains that > level a. With a small addition, this works again: > > d <- data.frame(x=rnorm(12),y=rnorm(12),z=rep(letters[1:6],2)) > > d$x[1] <- NA > m <- lm(x ~ . -z, data=d) > p <- predict(m, newdata=d) > > This is linked to another discussion earlier on stackoverflow : > https://stackoverflow.com/questions/48461980/prediction-in-r-glmm > which lead to an update to lme4 : https://github.com/lme4/lme4/issues/452 > > The point being that factors in your newdata should have the same levels > as factors in the original data that was used to fit the model. If you add > levels to these factors, it's impossible to use that model to predict for > these new data. > > Cheers > Joris > > On Fri, Mar 16, 2018 at 10:21 AM, Mark van der Loo < > mark.vanderloo at gmail.com> wrote: > >> Dear R-developers, >> >> In the 'lm' documentation, the '-' operator is only specified to be used >> with -1 (to remove the intercept from the model). >> >> However, the documentation also refers to the 'formula' help file, which >> indicates that it is possible to subtract any term. Indeed, the following >> works with no problems (the period '.' stands for 'all terms except the >> lhs'): >> >> d <- data.frame(x=rnorm(6), y=rnorm(6), z=letters[1:2]) >> m <- lm(x ~ . -z, data=d) >> p <- predict(m,newdata=d) >> >> Now, if I change 'z' so that it has only unique values, and I introduce an >> NA in the predicted variable, the following happens: >> >> d <- data.frame(x=rnorm(6),y=rnorm(6),z=letters[1:6]) >> d$x[1] <- NA >> m <- lm(x ~ . -z, data=d) >> p <- predict(m, newdata=d) >> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev >> object$xlevels) : factor z has new levels a >> >> It seems a bug to me, although one could argue that 'lm's documentation >> does not allow one to expect that the '-' operator should work generally. >> >> If it is a bug I'm happy to report it to bugzilla. >> >> Thanks for all your efforts, >> Mark >> >> ps: I was not able to test this on R3.4.4 yet, but the NEWS does not >> mention fixes related to 'lm' or 'predict'. >> >> >> > sessionInfo() >> R version 3.4.3 (2017-11-30) >> Platform: x86_64-pc-linux-gnu (64-bit) >> Running under: Ubuntu 16.04.4 LTS >> >> Matrix products: default >> BLAS: /usr/lib/libblas/libblas.so.3.6.0 >> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0 >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8 >> LC_PAPER=nl_NL.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> loaded via a namespace (and not attached): >> [1] compiler_3.4.3 tools_3.4.3 yaml_2.1.16 >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > > > > -- > Joris Meys > Statistical consultant > > Department of Data Analysis and Mathematical Modelling > Ghent University > Coupure Links 653, B-9000 Gent (Belgium) > > <https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g> > > ----------- > Biowiskundedagen 2017-2018 > http://www.biowiskundedagen.ugent.be/ > > ------------------------------- > Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php >[[alternative HTML version deleted]]
Joris Meys
2018-Mar-16 12:22 UTC
[Rd] Apparent bug in behavior of formulas with '-' operator for lm
Technically it is used as a predictor in the model. The information is contained in terms :> terms(x ~ . - z, data = d)x ~ (y + z) - z attr(,"variables") list(x, y, z) attr(,"factors") y x 0 y 1 z 0 attr(,"term.labels") [1] "y" attr(,"order") [1] 1 attr(,"intercept") [1] 1 attr(,"response") [1] 1 attr(,".Environment") And the model.frame contains it :> head(model.frame(x ~ . - z, data = d))x y z 2 -0.06022984 -0.4483109 b 3 1.25293390 0.2687065 c 4 -1.11811090 0.8016076 d 5 -0.75521720 -0.7484931 e 6 0.93037156 0.4128456 f 7 1.32052028 -1.6609043 a It is at the construction of the model.matrix that z disappears, but the contrasts information for z is still attached :> attr(model.matrix(x ~ . - z, data = d),"contrasts")$z [1] "contr.treatment" As you can see from the error you printed, it is model.frame() complaining about it. In this case it wouldn't be necessary, but it is documented behaviour of model.frame. Which is why I didn't say "this is not a bug", but "this is not a bug per se". Meaning that this is not optimal behaviour and might not what you expect, but it follows the documentation of the underlying functions. Solving it would require a bypass of model.frame() to construct the correct model,matrix for the new predictions, and that's far from trivial as model.matrix() itself depends on model.frame(). Cheers Joris On Fri, Mar 16, 2018 at 1:09 PM, Mark van der Loo <mark.vanderloo at gmail.com> wrote:> Joris, the point is that 'z' is NOT used as a predictor in the model. > Therefore it should not affect predictions. Also, I find it suspicious that > the error only occurs when the response variable conitains missings and 'z' > is unique (I have tested several other cases to confirm this). > > -Mark > > Op vr 16 mrt. 2018 om 13:03 schreef Joris Meys <jorismeys at gmail.com>: > >> It's not a bug per se. It's the effect of removing all observations >> linked to a certain level in your data frame. So the output of lm() doesn't >> contain a coefficient for level a of z, but your new data contains that >> level a. With a small addition, this works again: >> >> d <- data.frame(x=rnorm(12),y=rnorm(12),z=rep(letters[1:6],2)) >> >> d$x[1] <- NA >> m <- lm(x ~ . -z, data=d) >> p <- predict(m, newdata=d) >> >> This is linked to another discussion earlier on stackoverflow : >> https://stackoverflow.com/questions/48461980/prediction-in-r-glmm >> which lead to an update to lme4 : https://github.com/lme4/lme4/issues/452 >> >> The point being that factors in your newdata should have the same levels >> as factors in the original data that was used to fit the model. If you add >> levels to these factors, it's impossible to use that model to predict for >> these new data. >> >> Cheers >> Joris >> >> On Fri, Mar 16, 2018 at 10:21 AM, Mark van der Loo < >> mark.vanderloo at gmail.com> wrote: >> >>> Dear R-developers, >>> >>> In the 'lm' documentation, the '-' operator is only specified to be used >>> with -1 (to remove the intercept from the model). >>> >>> However, the documentation also refers to the 'formula' help file, which >>> indicates that it is possible to subtract any term. Indeed, the following >>> works with no problems (the period '.' stands for 'all terms except the >>> lhs'): >>> >>> d <- data.frame(x=rnorm(6), y=rnorm(6), z=letters[1:2]) >>> m <- lm(x ~ . -z, data=d) >>> p <- predict(m,newdata=d) >>> >>> Now, if I change 'z' so that it has only unique values, and I introduce >>> an >>> NA in the predicted variable, the following happens: >>> >>> d <- data.frame(x=rnorm(6),y=rnorm(6),z=letters[1:6]) >>> d$x[1] <- NA >>> m <- lm(x ~ . -z, data=d) >>> p <- predict(m, newdata=d) >>> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev >>> >>> object$xlevels) : factor z has new levels a >>> >>> It seems a bug to me, although one could argue that 'lm's documentation >>> does not allow one to expect that the '-' operator should work generally. >>> >>> If it is a bug I'm happy to report it to bugzilla. >>> >>> Thanks for all your efforts, >>> Mark >>> >>> ps: I was not able to test this on R3.4.4 yet, but the NEWS does not >>> mention fixes related to 'lm' or 'predict'. >>> >>> >>> > sessionInfo() >>> R version 3.4.3 (2017-11-30) >>> Platform: x86_64-pc-linux-gnu (64-bit) >>> Running under: Ubuntu 16.04.4 LTS >>> >>> Matrix products: default >>> BLAS: /usr/lib/libblas/libblas.so.3.6.0 >>> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0 >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> LC_PAPER=nl_NL.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> loaded via a namespace (and not attached): >>> [1] compiler_3.4.3 tools_3.4.3 yaml_2.1.16 >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-devel at r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >>> >> >> >> >> -- >> Joris Meys >> Statistical consultant >> >> Department of Data Analysis and Mathematical Modelling >> Ghent University >> Coupure Links 653, B-9000 Gent (Belgium) >> >> <https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g> >> >> ----------- >> Biowiskundedagen 2017-2018 >> http://www.biowiskundedagen.ugent.be/ >> >> ------------------------------- >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php >> >-- Joris Meys Statistical consultant Department of Data Analysis and Mathematical Modelling Ghent University Coupure Links 653, B-9000 Gent (Belgium) <https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g> ----------- Biowiskundedagen 2017-2018 http://www.biowiskundedagen.ugent.be/ ------------------------------- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php [[alternative HTML version deleted]]
Mark van der Loo
2018-Mar-16 13:21 UTC
[Rd] Apparent bug in behavior of formulas with '-' operator for lm
Thanks, Joris, This clarifies at least where exactly it comes from. I still find the high-level behavior of 'predict' very counter-intuitive as the estimated model contains no coefficients in 'z', but I think we agree on that. I am not sure how much trouble it would be to improve this behavior, but perhaps one of the core authors can have a look at it. Best, Mark Op vr 16 mrt. 2018 om 13:22 schreef Joris Meys <jorismeys at gmail.com>:> Technically it is used as a predictor in the model. The information is > contained in terms : > > > terms(x ~ . - z, data = d) > x ~ (y + z) - z > attr(,"variables") > list(x, y, z) > attr(,"factors") > y > x 0 > y 1 > z 0 > attr(,"term.labels") > [1] "y" > attr(,"order") > [1] 1 > attr(,"intercept") > [1] 1 > attr(,"response") > [1] 1 > attr(,".Environment") > > And the model.frame contains it : > > > head(model.frame(x ~ . - z, data = d)) > x y z > 2 -0.06022984 -0.4483109 b > 3 1.25293390 0.2687065 c > 4 -1.11811090 0.8016076 d > 5 -0.75521720 -0.7484931 e > 6 0.93037156 0.4128456 f > 7 1.32052028 -1.6609043 a > > It is at the construction of the model.matrix that z disappears, but the > contrasts information for z is still attached : > > > attr(model.matrix(x ~ . - z, data = d),"contrasts") > $z > [1] "contr.treatment" > > As you can see from the error you printed, it is model.frame() complaining > about it. In this case it wouldn't be necessary, but it is documented > behaviour of model.frame. Which is why I didn't say "this is not a bug", > but "this is not a bug per se". Meaning that this is not optimal behaviour > and might not what you expect, but it follows the documentation of the > underlying functions. > > Solving it would require a bypass of model.frame() to construct the > correct model,matrix for the new predictions, and that's far from trivial > as model.matrix() itself depends on model.frame(). > > Cheers > Joris > > > > > > On Fri, Mar 16, 2018 at 1:09 PM, Mark van der Loo < > mark.vanderloo at gmail.com> wrote: > >> Joris, the point is that 'z' is NOT used as a predictor in the model. >> Therefore it should not affect predictions. Also, I find it suspicious that >> the error only occurs when the response variable conitains missings and 'z' >> is unique (I have tested several other cases to confirm this). >> >> -Mark >> >> Op vr 16 mrt. 2018 om 13:03 schreef Joris Meys <jorismeys at gmail.com>: >> >>> It's not a bug per se. It's the effect of removing all observations >>> linked to a certain level in your data frame. So the output of lm() doesn't >>> contain a coefficient for level a of z, but your new data contains that >>> level a. With a small addition, this works again: >>> >>> d <- data.frame(x=rnorm(12),y=rnorm(12),z=rep(letters[1:6],2)) >>> >>> d$x[1] <- NA >>> m <- lm(x ~ . -z, data=d) >>> p <- predict(m, newdata=d) >>> >>> This is linked to another discussion earlier on stackoverflow : >>> https://stackoverflow.com/questions/48461980/prediction-in-r-glmm >>> which lead to an update to lme4 : >>> https://github.com/lme4/lme4/issues/452 >>> >>> The point being that factors in your newdata should have the same levels >>> as factors in the original data that was used to fit the model. If you add >>> levels to these factors, it's impossible to use that model to predict for >>> these new data. >>> >>> Cheers >>> Joris >>> >>> On Fri, Mar 16, 2018 at 10:21 AM, Mark van der Loo < >>> mark.vanderloo at gmail.com> wrote: >>> >>>> Dear R-developers, >>>> >>>> In the 'lm' documentation, the '-' operator is only specified to be used >>>> with -1 (to remove the intercept from the model). >>>> >>>> However, the documentation also refers to the 'formula' help file, which >>>> indicates that it is possible to subtract any term. Indeed, the >>>> following >>>> works with no problems (the period '.' stands for 'all terms except the >>>> lhs'): >>>> >>>> d <- data.frame(x=rnorm(6), y=rnorm(6), z=letters[1:2]) >>>> m <- lm(x ~ . -z, data=d) >>>> p <- predict(m,newdata=d) >>>> >>>> Now, if I change 'z' so that it has only unique values, and I introduce >>>> an >>>> NA in the predicted variable, the following happens: >>>> >>>> d <- data.frame(x=rnorm(6),y=rnorm(6),z=letters[1:6]) >>>> d$x[1] <- NA >>>> m <- lm(x ~ . -z, data=d) >>>> p <- predict(m, newdata=d) >>>> Error in model.frame.default(Terms, newdata, na.action = na.action, >>>> xlev >>>> object$xlevels) : factor z has new levels a >>>> >>>> It seems a bug to me, although one could argue that 'lm's documentation >>>> does not allow one to expect that the '-' operator should work >>>> generally. >>>> >>>> If it is a bug I'm happy to report it to bugzilla. >>>> >>>> Thanks for all your efforts, >>>> Mark >>>> >>>> ps: I was not able to test this on R3.4.4 yet, but the NEWS does not >>>> mention fixes related to 'lm' or 'predict'. >>>> >>>> >>>> > sessionInfo() >>>> R version 3.4.3 (2017-11-30) >>>> Platform: x86_64-pc-linux-gnu (64-bit) >>>> Running under: Ubuntu 16.04.4 LTS >>>> >>>> Matrix products: default >>>> BLAS: /usr/lib/libblas/libblas.so.3.6.0 >>>> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0 >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> LC_PAPER=nl_NL.UTF-8 LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> loaded via a namespace (and not attached): >>>> [1] compiler_3.4.3 tools_3.4.3 yaml_2.1.16 >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________________________ >>>> R-devel at r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>> >>> >>> >>> >>> -- >>> Joris Meys >>> Statistical consultant >>> >>> Department of Data Analysis and Mathematical Modelling >>> Ghent University >>> Coupure Links 653, B-9000 Gent (Belgium) >>> >>> <https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g> >>> >>> ----------- >>> Biowiskundedagen 2017-2018 >>> http://www.biowiskundedagen.ugent.be/ >>> >>> ------------------------------- >>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php >>> >> > > > -- > Joris Meys > Statistical consultant > > Department of Data Analysis and Mathematical Modelling > Ghent University > Coupure Links 653, B-9000 Gent (Belgium) > > <https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g> > > ----------- > Biowiskundedagen 2017-2018 > http://www.biowiskundedagen.ugent.be/ > > ------------------------------- > Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php >[[alternative HTML version deleted]]
Apparently Analagous Threads
- Apparent bug in behavior of formulas with '-' operator for lm
- Apparent bug in behavior of formulas with '-' operator for lm
- download.file does not process gz files correctly (truncates them?)
- Date class shows Inf as NA; this confuses the use of is.na()
- truncation/rounding bug with write.csv