Hadley Wickham
2018-Apr-27 13:25 UTC
[Rd] predict.glm returns different results for the same model
Hi all, Very surprising (to me!) and mystifying result from predict.glm(): the predictions vary depending on whether or not I use ns() or splines::ns(). Reprex follows: library(splines) set.seed(12345) dat <- data.frame(claim = rbinom(1000, 1, 0.5)) mns <- c(3.4, 3.6) sds <- c(0.24, 0.35) dat$wind <- exp(rnorm(nrow(dat), mean = mns[dat$claim + 1], sd sds[dat$claim + 1])) dat <- dat[order(dat$wind), ] m1 <- glm(claim ~ ns(wind, df = 5), data = dat, family = binomial) m2 <- glm(claim ~ splines::ns(wind, df = 5), data = dat, family = binomial) # The model coefficients are the same unname(coef(m1)) #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 unname(coef(m2)) #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 # But the predictions are not! newdf <- data.frame(wind = seq(min(dat$wind), max(dat$wind), length = 5)) unname(predict(m1, newdata = newdf)) #> [1] 0.51947119 0.03208719 2.82548847 3.90883496 4.06743266 unname(predict(m2, newdata = newdf)) #> [1] 0.5194712 -0.5666554 -0.1731268 2.8134844 3.9295814 Is this a bug? (Motivating issue from this ggplot2 bug report: https://github.com/tidyverse/ggplot2/issues/2426) Thanks! Hadley -- http://hadley.nz
Avraham Adler
2018-Apr-27 14:12 UTC
[Rd] predict.glm returns different results for the same model
Can?t copy from my computer as gmail is blocked at work but if it helps, the ?predvars? attribute if the terms object is not the same. Avi On Fri, Apr 27, 2018 at 9:25 AM Hadley Wickham <h.wickham at gmail.com> wrote:> Hi all, > > Very surprising (to me!) and mystifying result from predict.glm(): the > predictions vary depending on whether or not I use ns() or > splines::ns(). Reprex follows: > > library(splines) > > set.seed(12345) > dat <- data.frame(claim = rbinom(1000, 1, 0.5)) > mns <- c(3.4, 3.6) > sds <- c(0.24, 0.35) > dat$wind <- exp(rnorm(nrow(dat), mean = mns[dat$claim + 1], sd > sds[dat$claim + 1])) > dat <- dat[order(dat$wind), ] > > m1 <- glm(claim ~ ns(wind, df = 5), data = dat, family = binomial) > m2 <- glm(claim ~ splines::ns(wind, df = 5), data = dat, family = binomial) > > # The model coefficients are the same > unname(coef(m1)) > #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 > unname(coef(m2)) > #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 > > # But the predictions are not! > newdf <- data.frame(wind = seq(min(dat$wind), max(dat$wind), length = 5)) > unname(predict(m1, newdata = newdf)) > #> [1] 0.51947119 0.03208719 2.82548847 3.90883496 4.06743266 > unname(predict(m2, newdata = newdf)) > #> [1] 0.5194712 -0.5666554 -0.1731268 2.8134844 3.9295814 > > Is this a bug? > > (Motivating issue from this ggplot2 bug report: > https://github.com/tidyverse/ggplot2/issues/2426) > > Thanks! > > Hadley > > > > -- > http://hadley.nz > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >-- Sent from Gmail Mobile [[alternative HTML version deleted]]
Joris Meys
2018-Apr-27 14:23 UTC
[Rd] predict.glm returns different results for the same model
Hi Hadley, This is related to how the terms are constructed. If you look at terms(m1) versus terms(m2), you see that in the case of m1 the knots are added to the attribute predvars. Contrary, when using splines::ns() this doesn't happen. Compare: mf <- model.frame(claim ~ ns(wind, df = 5), data = dat) mt <- terms(mf) with mf2 <- model.frame(claim ~ splines::ns(wind, df = 5), data = dat) mt2 <- terms(mf) The culprit is actually in makepredictcall.ns, specifically: if (as.character(call)[1L] != "ns") return(call) Obviously the call starting with predict::ns() will cause the function to return the call unaltered. In the case of ns() it processes the knot information. And that difference causes the difference in predictions. So it is a bug imho. Especially since I don't understand the logic. The variable has class 'ns', so makepredictcall() dispatches correctly. That extra check in the function makepredictcall.ns seems unnecessary to me and might be simply removed. Cheers Joris On Fri, Apr 27, 2018 at 3:25 PM, Hadley Wickham <h.wickham at gmail.com> wrote:> Hi all, > > Very surprising (to me!) and mystifying result from predict.glm(): the > predictions vary depending on whether or not I use ns() or > splines::ns(). Reprex follows: > > library(splines) > > set.seed(12345) > dat <- data.frame(claim = rbinom(1000, 1, 0.5)) > mns <- c(3.4, 3.6) > sds <- c(0.24, 0.35) > dat$wind <- exp(rnorm(nrow(dat), mean = mns[dat$claim + 1], sd > sds[dat$claim + 1])) > dat <- dat[order(dat$wind), ] > > m1 <- glm(claim ~ ns(wind, df = 5), data = dat, family = binomial) > m2 <- glm(claim ~ splines::ns(wind, df = 5), data = dat, family = binomial) > > # The model coefficients are the same > unname(coef(m1)) > #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 > unname(coef(m2)) > #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 > > # But the predictions are not! > newdf <- data.frame(wind = seq(min(dat$wind), max(dat$wind), length = 5)) > unname(predict(m1, newdata = newdf)) > #> [1] 0.51947119 0.03208719 2.82548847 3.90883496 4.06743266 > unname(predict(m2, newdata = newdf)) > #> [1] 0.5194712 -0.5666554 -0.1731268 2.8134844 3.9295814 > > Is this a bug? > > (Motivating issue from this ggplot2 bug report: > https://github.com/tidyverse/ggplot2/issues/2426) > > Thanks! > > Hadley > > > > -- > http://hadley.nz > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >-- Joris Meys Statistical consultant Department of Data Analysis and Mathematical Modelling Ghent University Coupure Links 653, B-9000 Gent (Belgium) <https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g> ----------- Biowiskundedagen 2017-2018 http://www.biowiskundedagen.ugent.be/ ------------------------------- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php [[alternative HTML version deleted]]
Duncan Murdoch
2018-Apr-27 14:28 UTC
[Rd] predict.glm returns different results for the same model
On 27/04/2018 9:25 AM, Hadley Wickham wrote:> Hi all, > > Very surprising (to me!) and mystifying result from predict.glm(): the > predictions vary depending on whether or not I use ns() or > splines::ns(). Reprex follows: > > library(splines) > > set.seed(12345) > dat <- data.frame(claim = rbinom(1000, 1, 0.5)) > mns <- c(3.4, 3.6) > sds <- c(0.24, 0.35) > dat$wind <- exp(rnorm(nrow(dat), mean = mns[dat$claim + 1], sd > sds[dat$claim + 1])) > dat <- dat[order(dat$wind), ] > > m1 <- glm(claim ~ ns(wind, df = 5), data = dat, family = binomial) > m2 <- glm(claim ~ splines::ns(wind, df = 5), data = dat, family = binomial) > > # The model coefficients are the same > unname(coef(m1)) > #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 > unname(coef(m2)) > #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 > > # But the predictions are not! > newdf <- data.frame(wind = seq(min(dat$wind), max(dat$wind), length = 5)) > unname(predict(m1, newdata = newdf)) > #> [1] 0.51947119 0.03208719 2.82548847 3.90883496 4.06743266 > unname(predict(m2, newdata = newdf)) > #> [1] 0.5194712 -0.5666554 -0.1731268 2.8134844 3.9295814 > > Is this a bug?The two objects m1 and m2 differ more than they should, so this looks like a problem in glm, not just in predict.glm. > attr(m1$terms, "predvars") list(claim, ns(wind, knots = c(25.4756277492997, 30.2270250736796, 35.4093171222502, 43.038645381669), Boundary.knots = c(12.9423820390783, 108.071583734075), intercept = FALSE)) > attr(m2$terms, "predvars") list(claim, splines::ns(wind, df = 5)) This appears to be due to a bug in the splines package. There, the function splines:::makepredictcall.ns looks like this: makepredictcall.ns <- function(var, call) { if(as.character(call)[1L] != "ns") return(call) at <- attributes(var)[c("knots", "Boundary.knots", "intercept")] xxx <- call[1L:2L] xxx[names(at)] <- at xxx } The test fails for m2, because as.character(call)[1L] is "splines::ns" instead of "ns". I'll see if I can work out a better test and submit a patch. Duncan Murdoch> > (Motivating issue from this ggplot2 bug report: > https://github.com/tidyverse/ggplot2/issues/2426) > > Thanks! > > Hadley
Hadley Wickham
2018-Apr-27 14:36 UTC
[Rd] predict.glm returns different results for the same model
On Fri, Apr 27, 2018 at 7:28 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:> On 27/04/2018 9:25 AM, Hadley Wickham wrote: >> >> Hi all, >> >> Very surprising (to me!) and mystifying result from predict.glm(): the >> predictions vary depending on whether or not I use ns() or >> splines::ns(). Reprex follows: > >> library(splines) >> >> set.seed(12345) >> dat <- data.frame(claim = rbinom(1000, 1, 0.5)) >> mns <- c(3.4, 3.6) >> sds <- c(0.24, 0.35) >> dat$wind <- exp(rnorm(nrow(dat), mean = mns[dat$claim + 1], sd >> sds[dat$claim + 1])) >> dat <- dat[order(dat$wind), ] >> >> m1 <- glm(claim ~ ns(wind, df = 5), data = dat, family = binomial) >> m2 <- glm(claim ~ splines::ns(wind, df = 5), data = dat, family >> binomial) >> >> # The model coefficients are the same >> unname(coef(m1)) >> #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 >> unname(coef(m2)) >> #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 >> >> # But the predictions are not! >> newdf <- data.frame(wind = seq(min(dat$wind), max(dat$wind), length = 5)) >> unname(predict(m1, newdata = newdf)) >> #> [1] 0.51947119 0.03208719 2.82548847 3.90883496 4.06743266 >> unname(predict(m2, newdata = newdf)) >> #> [1] 0.5194712 -0.5666554 -0.1731268 2.8134844 3.9295814 >> >> Is this a bug? > > > The two objects m1 and m2 differ more than they should, so this looks like a > problem in glm, not just in predict.glm. > >> attr(m1$terms, "predvars") > list(claim, ns(wind, knots = c(25.4756277492997, 30.2270250736796, > 35.4093171222502, 43.038645381669), Boundary.knots = c(12.9423820390783, > 108.071583734075), intercept = FALSE)) > >> attr(m2$terms, "predvars") > list(claim, splines::ns(wind, df = 5)) > > This appears to be due to a bug in the splines package. There, the function > splines:::makepredictcall.ns looks like this: > > makepredictcall.ns <- function(var, call) > { > if(as.character(call)[1L] != "ns") return(call) > at <- attributes(var)[c("knots", "Boundary.knots", "intercept")] > xxx <- call[1L:2L] > xxx[names(at)] <- at > xxx > } > > The test fails for m2, because as.character(call)[1L] is "splines::ns" > instead of "ns". I'll see if I can work out a better test and submit a > patch.Great, thanks! -- http://hadley.nz
Martin Maechler
2018-Apr-27 16:51 UTC
[Rd] predict.glm returns different results for the same model
>>>>> Duncan Murdoch <murdoch.duncan at gmail.com> >>>>> on Fri, 27 Apr 2018 10:28:16 -0400 writes:> On 27/04/2018 9:25 AM, Hadley Wickham wrote: >> Hi all, >> >> Very surprising (to me!) and mystifying result from predict.glm(): the >> predictions vary depending on whether or not I use ns() or >> splines::ns(). Reprex follows: > >> library(splines) >> >> set.seed(12345) >> dat <- data.frame(claim = rbinom(1000, 1, 0.5)) >> mns <- c(3.4, 3.6) >> sds <- c(0.24, 0.35) >> dat$wind <- exp(rnorm(nrow(dat), mean = mns[dat$claim + 1], sd >> sds[dat$claim + 1])) >> dat <- dat[order(dat$wind), ] >> >> m1 <- glm(claim ~ ns(wind, df = 5), data = dat, family = binomial) >> m2 <- glm(claim ~ splines::ns(wind, df = 5), data = dat, family = binomial) >> >> # The model coefficients are the same >> unname(coef(m1)) >> #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 >> unname(coef(m2)) >> #> [1] 0.5194712 -0.8687737 -0.6803954 4.0838947 2.3908674 4.1564128 >> >> # But the predictions are not! >> newdf <- data.frame(wind = seq(min(dat$wind), max(dat$wind), length = 5)) >> unname(predict(m1, newdata = newdf)) >> #> [1] 0.51947119 0.03208719 2.82548847 3.90883496 4.06743266 >> unname(predict(m2, newdata = newdf)) >> #> [1] 0.5194712 -0.5666554 -0.1731268 2.8134844 3.9295814 >> >> Is this a bug? > The two objects m1 and m2 differ more than they should, so this looks > like a problem in glm, not just in predict.glm. >> attr(m1$terms, "predvars") > list(claim, ns(wind, knots = c(25.4756277492997, 30.2270250736796, > 35.4093171222502, 43.038645381669), Boundary.knots = c(12.9423820390783, > 108.071583734075), intercept = FALSE)) >> attr(m2$terms, "predvars") > list(claim, splines::ns(wind, df = 5)) > This appears to be due to a bug in the splines package. There, the > function splines:::makepredictcall.ns looks like this: > makepredictcall.ns <- function(var, call) > { > if(as.character(call)[1L] != "ns") return(call) > at <- attributes(var)[c("knots", "Boundary.knots", "intercept")] > xxx <- call[1L:2L] > xxx[names(at)] <- at > xxx > } > The test fails for m2, because as.character(call)[1L] is "splines::ns" > instead of "ns". I'll see if I can work out a better test and submit a > patch. > Duncan Murdoch Thank you, Duncan, for the good diagnosis and the patch! I will deal with it. Two things however (both *not* needing a new patch; I'll deal with it after dinner ;-) 1) I'd like to commit that ASAP, but do want to add a *minimal* Reprex for regression test, rather than the above, notably as I see that the only place our code calls makepredictcall() is in model.frame.default(), so I thought this should not really be related to glm at all, and I was right. 2) Reading the help page ?makepredictcall --- yes, a good idea, even for experts, believe me, ((notably when it is from a base package, not produced from roxy.. ;-\)) --- reveals what I vaguely remembered: The function was introduced to fix a bug first encountered in lm(. ~ poly(.)) and indeed: -> That help page actually contains an indirect closer to minimal reprex. -> I will also patch the makepredictcall.poly() method. [ as I said: after dinner .. ;-)] -- Martin