Prof Brian D Ripley
1998-Aug-22 17:06 UTC
Handling of offsets in glm is really inconsistent.
[Copied to R-devel for information] This applies to all versions of R I have: 0.62.2, 0.62.3, 0.63. Great care seems needed with glms with offsets, as many things seem wrong. Consider the following:> data(freeny) > freeny.glm <- glm(y ~ offset(lag.quarterly.revenue) + price.index +income.level + market.potential, data=freeny, subset=1:30)> predict(freeny.glm)Qtr1 Qtr2 Qtr3 Qtr4 1962: NA 0.01040457 0.01073223 0.01233351 1963: 0.01211730 0.02744293 0.03259214 0.03225403 1964: 0.03235081 0.03272723 0.03361984 0.03114351 1965: 0.03128098 0.03371820 0.02869783 0.02670215 1966: 0.02346093 0.02172193 0.02269731 0.01927851 1967: 0.02516107 0.02496217 0.02679227 0.03099460 1968: 0.03225807 0.03418791 0.03857815 0.03324247 1969: 0.02575733 0.02702418 0.02988582 NA> predict(freeny.glm, type="response")Qtr1 Qtr2 Qtr3 Qtr4 1962: NA 8.806765 8.803092 8.803704 1963: 8.826977 8.840453 8.940102 8.968984 1964: 8.993961 8.993167 9.042300 9.061634 1965: 9.100341 9.092428 9.135678 9.153552 1966: 9.194421 9.208372 9.260927 9.284149 1967: 9.309521 9.338742 9.377042 9.389345 1968: 9.429928 9.455688 9.480808 9.520452 1969: 9.549497 9.566824 9.611116 NA so one might think that prediction of a glm with an offset should include the offset on the response scale but not link scale. However,> predict(freeny.glm, newdata=freeny)1962.25 1962.5 1962.75 1963 1963.25 0.01040457 0.01073223 0.01233351 0.01211730 0.02744293 1963.5 1963.75 1964 1964.25 1964.5 0.03259214 0.03225403 0.03235081 0.03272723 0.03361984 1964.75 1965 1965.25 1965.5 1965.75 0.03114351 0.03128098 0.03371820 0.02869783 0.02670215 1966 1966.25 1966.5 1966.75 1967 0.02346093 0.02172193 0.02269731 0.01927851 0.02516107 1967.25 1967.5 1967.75 1968 1968.25 0.02496217 0.02679227 0.03099460 0.03225807 0.03418791 1968.5 1968.75 1969 1969.25 1969.5 0.03857815 0.03324247 0.02575733 0.02702418 0.02988582 1969.75 1970 1970.25 1970.5 1970.75 0.02686281 0.03816228 0.03910487 0.04325638 0.03599068 1971 1971.25 1971.5 1971.75 0.03357946 0.03890549 0.04196893 0.03952385> predict(freeny.glm, newdata=freeny, type="response")1962.25 1962.5 1962.75 1963 1963.25 0.01040457 0.01073223 0.01233351 0.01211730 0.02744293 1963.5 1963.75 1964 1964.25 1964.5 0.03259214 0.03225403 0.03235081 0.03272723 0.03361984 1964.75 1965 1965.25 1965.5 1965.75 0.03114351 0.03128098 0.03371820 0.02869783 0.02670215 1966 1966.25 1966.5 1966.75 1967 0.02346093 0.02172193 0.02269731 0.01927851 0.02516107 1967.25 1967.5 1967.75 1968 1968.25 0.02496217 0.02679227 0.03099460 0.03225807 0.03418791 1968.5 1968.75 1969 1969.25 1969.5 0.03857815 0.03324247 0.02575733 0.02702418 0.02988582 1969.75 1970 1970.25 1970.5 1970.75 0.02686281 0.03816228 0.03910487 0.04325638 0.03599068 1971 1971.25 1971.5 1971.75 0.03357946 0.03890549 0.04196893 0.03952385 and prediction on either scale with newdata ignores the offset. Now, S is also inconsistent (prior to S-PLUS 4.5 rel2), but at least it does usually include the offset (except for prediction on link scale with no newdata, where the offset was omitted until recently). [The different print layout is due to the original response being a time series of class "ts"; the predict method cannot know that.] The discrepancy is in predict.lm (not predict.glm) which ignores offsets in R but includes them in S. Correcting it is made difficult by the way delete.response also deletes offsets in R:> terms(freeny.glm)y ~ offset(lag.quarterly.revenue) + price.index + income.level + market.potential attr(,"variables") list(y, offset(lag.quarterly.revenue), price.index, income.level, market.potential) attr(,"factors") price.index income.level market.potential y 0 0 0 offset(lag.quarterly.revenue) 0 0 0 price.index 1 0 0 income.level 0 1 0 market.potential 0 0 1 attr(,"term.labels") [1] "price.index" "income.level" "market.potential" attr(,"order") [1] 1 1 1 attr(,"intercept") [1] 1 attr(,"response") [1] 1 attr(,"offset") [1] 2> delete.response(terms(freeny.glm))~price.index + income.level + market.potential attr(,"variables") list(price.index, income.level, market.potential) attr(,"factors") price.index income.level market.potential price.index 1 0 0 income.level 0 1 0 market.potential 0 0 1 attr(,"term.labels") [1] "price.index" "income.level" "market.potential" attr(,"order") [1] 1 1 1 attr(,"intercept") [1] 1 attr(,"response") [1] 0 and that is definitely not what is documented to happen. I do not begin to understand why delete.response is written the way it is: it looks like it needs a complete re-design. So - delete.response needs to only delete responses. - predict.lm needs to handle offsets (or predict.glm, but it is much cleaner in predict.lm). - glm.fit should return linear.predictors = eta + offset. There is another small problem:> predict(freeny.glm, se=T)Error: Object "price.index" not found as predict.lm does not recognize that newdata was missing in the caller (how lazy is lazy evaluation?) The simplest way out is to have if (missing(newdata) || is.null(newdata)) X <- model.matrix(object) the alternative is to test missing(newdata) in predict.glm. While this is being done, I wonder why R does not implement offsets for lm()? It `looks like an easy exercise'. -- Brian D. Ripley, ripley@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272860 (secr) Oxford OX1 3TG, UK Fax: +44 1865 272595 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._