Dear all, For an introductory course on glm?s I would like to create an example to show the difference between glm and transformation of the response. For this, I tried to create a dataset where the variance increases with the mean (as is the case in many ecological datasets): poissondata=data.frame( response=rpois(40,1:40), explanatory=1:40) attach(poissondata) However, I have run into a problem because it looks like the lm model (with sqrt-transformation) fits the data best: ## model1=lm(response~explanatory,poissondata) model2=lm(sqrt(response+0.5)~explanatory,poissondata) model3=lm(log(response+1)~explanatory,poissondata) model4=glm(response~explanatory,poissondata,family=poisson) model5=glm(response~explanatory,poissondata,family=quasipoisson) model6=glm.nb(response~explanatory,poissondata) model7=glm(response~explanatory,quasi(variance="mu",link="identity")) plot(explanatory,response,pch=16) lines(explanatory,predict(model1,explanatory=explanatory)) lines(explanatory,(predict(model2,explanatory=explanatory))^2-0.5,lty=2) lines(explanatory,exp(predict(model3,explanatory=explanatory))-1,lty=3) lines(explanatory,exp(predict(model5,explanatory=explanatory)),lty=1,col="red") lines(explanatory,predict(model6,explanatory=explanatory,type="response"),lty=1,col="blue") lines(explanatory,predict(model7,explanatory=explanatory,type="response"),lty=1,col="green") ## The only model that performs equally well is model7. How would you deal with this kind of analysis? What would be your recommendation to the students, given the fact that most of the standard glm models obviously don?t seem to produce good fits here? Many thanks and best wishes Christoph (using R 2.8.0 on Windows XP) -- Dr. rer.nat. Christoph Scherber University of Goettingen DNPW, Agroecology Waldweg 26 D-37073 Goettingen Germany phone +49 (0)551 39 8807 fax +49 (0)551 39 8806 Homepage http://www.gwdg.de/~cscherb1
I'm not sure on what kind of dataset would be most appropriate, but following code I used to create a dataset with a linear response and an increasing variance (the megaphone type, common in ecological datasets if I'm right) : beta0 <- 10 beta1 <- 1 x <- c(1:40) y <- beta0 + x*beta1 +rnorm(40,0,1)*seq(0.1,10,length.out=40) Data=data.frame(x,y) I won't win the price for most elegant programming with this, but it surely works fine for my simulations. Kind regards Joris On Tue, Nov 25, 2008 at 3:52 PM, Christoph Scherber <Christoph.Scherber at agr.uni-goettingen.de> wrote:> Dear all, > > For an introductory course on glm?s I would like to create an example to > show the difference between glm and transformation of the response. For > this, I tried to create a dataset where the variance increases with the mean > (as is the case in many ecological datasets): > > poissondata=data.frame( > response=rpois(40,1:40), > explanatory=1:40) > > attach(poissondata) > > However, I have run into a problem because it looks like the lm model (with > sqrt-transformation) fits the data best: > > ## > > model1=lm(response~explanatory,poissondata) > model2=lm(sqrt(response+0.5)~explanatory,poissondata) > model3=lm(log(response+1)~explanatory,poissondata) > model4=glm(response~explanatory,poissondata,family=poisson) > model5=glm(response~explanatory,poissondata,family=quasipoisson) > model6=glm.nb(response~explanatory,poissondata) > model7=glm(response~explanatory,quasi(variance="mu",link="identity")) > > > plot(explanatory,response,pch=16) > lines(explanatory,predict(model1,explanatory=explanatory)) > lines(explanatory,(predict(model2,explanatory=explanatory))^2-0.5,lty=2) > lines(explanatory,exp(predict(model3,explanatory=explanatory))-1,lty=3) > lines(explanatory,exp(predict(model5,explanatory=explanatory)),lty=1,col="red") > lines(explanatory,predict(model6,explanatory=explanatory,type="response"),lty=1,col="blue") > lines(explanatory,predict(model7,explanatory=explanatory,type="response"),lty=1,col="green") > > ## > > The only model that performs equally well is model7. > > How would you deal with this kind of analysis? What would be your > recommendation to the students, given the fact that most of the standard glm > models obviously don?t seem to produce good fits here? > > Many thanks and best wishes > Christoph > > (using R 2.8.0 on Windows XP) > > > > > > -- > Dr. rer.nat. Christoph Scherber > University of Goettingen > DNPW, Agroecology > Waldweg 26 > D-37073 Goettingen > Germany > > phone +49 (0)551 39 8807 > fax +49 (0)551 39 8806 > > Homepage http://www.gwdg.de/~cscherb1 > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Christoph Scherber wrote:> Dear all, > > For an introductory course on glm?s I would like to create an example to > show the difference between glm and transformation of the response. For > this, I tried to create a dataset where the variance increases with the > mean (as is the case in many ecological datasets): > > poissondata=data.frame( > response=rpois(40,1:40), > explanatory=1:40) > > attach(poissondata) > > However, I have run into a problem because it looks like the lm model > (with sqrt-transformation) fits the data best: > > ## > > model1=lm(response~explanatory,poissondata) > model2=lm(sqrt(response+0.5)~explanatory,poissondata) > model3=lm(log(response+1)~explanatory,poissondata) > model4=glm(response~explanatory,poissondata,family=poisson) > model5=glm(response~explanatory,poissondata,family=quasipoisson) > model6=glm.nb(response~explanatory,poissondata) > model7=glm(response~explanatory,quasi(variance="mu",link="identity")) > > > plot(explanatory,response,pch=16) > lines(explanatory,predict(model1,explanatory=explanatory)) > lines(explanatory,(predict(model2,explanatory=explanatory))^2-0.5,lty=2) > lines(explanatory,exp(predict(model3,explanatory=explanatory))-1,lty=3) > lines(explanatory,exp(predict(model5,explanatory=explanatory)),lty=1,col="red") > > lines(explanatory,predict(model6,explanatory=explanatory,type="response"),lty=1,col="blue") > > lines(explanatory,predict(model7,explanatory=explanatory,type="response"),lty=1,col="green") > > > ## > > The only model that performs equally well is model7. > > How would you deal with this kind of analysis? What would be your > recommendation to the students, given the fact that most of the standard > glm models obviously don?t seem to produce good fits here? > > Many thanks and best wishes > Christoph > > (using R 2.8.0 on Windows XP) >Any good reason that you're not transforming both sides when transforming? and that you're not looking at model8 <- glm(response~log(explanatory),poissondata,family=poisson) (etc.) ?? BTW, your predict call seems to be missing data.frame() around "explanatory=explanatory". The predict() methods do not have an argument called "explanatory", so this is just ignored (a buglet if you ask me). -- O__ ---- Peter Dalgaard ?ster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
I reread the question, and I don't actually see the problem too much. If you plot the transformed dataset from most of your models, the problem is quite obvious : if you transform your predictor to the logscale, the result of a a linear regression on those outcomes will be naturally an exponential function. Exact the same thing happens if you use the poisson family in a glm. Now, there is no reason whatsoever that any exponential model would fit the data, in contrary. The way your data is constructed, is essentially linear. Indeed, in the model, the E(Y) for predictors 1:40 is 1:40, and you would expect a model with intercept 0 and slope 1, whatever your variance does. Actually, the first model fits the data far better than the second one, contrary to your beliefs. On the second one, there is a clear trend in the residuals, indicating an obvious deviation from the true model. Why model 7 fits equally well, is because it is essentially exactly the same as the classic linear model. The slight deviation comes from the fact that lm uses least squares estimates, and glm uses maximum likelihood. The problem you create with heteroscedasticity is not one of fitting, but one of confidence intervals for your estimated parameters. So my suggestion to any student faced with this question would be to take a look at the estimates for the variance, and the derived confidence intervals, and try to find a more robust way of determining the confidence intervals on the estimated parameters. Robust regression might be an outcome here. Kind regards Joris On Tue, Nov 25, 2008 at 3:52 PM, Christoph Scherber <Christoph.Scherber at agr.uni-goettingen.de> wrote:> Dear all, > > For an introductory course on glm?s I would like to create an example to > show the difference between glm and transformation of the response. For > this, I tried to create a dataset where the variance increases with the mean > (as is the case in many ecological datasets): > > poissondata=data.frame( > response=rpois(40,1:40), > explanatory=1:40) > > attach(poissondata) > > However, I have run into a problem because it looks like the lm model (with > sqrt-transformation) fits the data best: > > ## > > model1=lm(response~explanatory,poissondata) > model2=lm(sqrt(response+0.5)~explanatory,poissondata) > model3=lm(log(response+1)~explanatory,poissondata) > model4=glm(response~explanatory,poissondata,family=poisson) > model5=glm(response~explanatory,poissondata,family=quasipoisson) > model6=glm.nb(response~explanatory,poissondata) > model7=glm(response~explanatory,quasi(variance="mu",link="identity")) > > > plot(explanatory,response,pch=16) > lines(explanatory,predict(model1,explanatory=explanatory)) > lines(explanatory,(predict(model2,explanatory=explanatory))^2-0.5,lty=2) > lines(explanatory,exp(predict(model3,explanatory=explanatory))-1,lty=3, col="green") > lines(explanatory,exp(predict(model5,explanatory=explanatory)),lty=1,col="red") > lines(explanatory,predict(model6,explanatory=explanatory,type="response"),lty=1,col="blue") > lines(explanatory,predict(model7,explanatory=explanatory,type="response"),lty=1,col="green") > > ## > > The only model that performs equally well is model7. > > How would you deal with this kind of analysis? What would be your > recommendation to the students, given the fact that most of the standard glm > models obviously don?t seem to produce good fits here? > > Many thanks and best wishes > Christoph > > (using R 2.8.0 on Windows XP) > > > > > > -- > Dr. rer.nat. Christoph Scherber > University of Goettingen > DNPW, Agroecology > Waldweg 26 > D-37073 Goettingen > Germany > > phone +49 (0)551 39 8807 > fax +49 (0)551 39 8806 > > Homepage http://www.gwdg.de/~cscherb1 > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
The default link function for the glm poisson family is a log link, which means that it is fitting the model: log(mu) ~ b0 + b1 * x But the data that you generate is based on a linear link. Therefore your glm analysis does not match with how the data was generated (and therefore should not necessarily be the best fit). Either analyze using glm and a linear link, or generate the data based on a log link (e.g. rpois(40, exp(seq(1,3, length.out=40))) ). Hope this helps, -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare greg.snow at imail.org 801.408.8111> -----Original Message----- > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > project.org] On Behalf Of Christoph Scherber > Sent: Tuesday, November 25, 2008 7:53 AM > To: r-help at stat.math.ethz.ch > Subject: [R] glm or transformation of the response? > > Dear all, > > For an introductory course on glm?s I would like to create an example > to show the difference between > glm and transformation of the response. For this, I tried to create a > dataset where the variance > increases with the mean (as is the case in many ecological datasets): > > poissondata=data.frame( > response=rpois(40,1:40), > explanatory=1:40) > > attach(poissondata) > > However, I have run into a problem because it looks like the lm model > (with sqrt-transformation) > fits the data best: > > ## > > model1=lm(response~explanatory,poissondata) > model2=lm(sqrt(response+0.5)~explanatory,poissondata) > model3=lm(log(response+1)~explanatory,poissondata) > model4=glm(response~explanatory,poissondata,family=poisson) > model5=glm(response~explanatory,poissondata,family=quasipoisson) > model6=glm.nb(response~explanatory,poissondata) > model7=glm(response~explanatory,quasi(variance="mu",link="identity")) > > > plot(explanatory,response,pch=16) > lines(explanatory,predict(model1,explanatory=explanatory)) > lines(explanatory,(predict(model2,explanatory=explanatory))^2- > 0.5,lty=2) > lines(explanatory,exp(predict(model3,explanatory=explanatory))-1,lty=3) > lines(explanatory,exp(predict(model5,explanatory=explanatory)),lty=1,co > l="red") > lines(explanatory,predict(model6,explanatory=explanatory,type="response > "),lty=1,col="blue") > lines(explanatory,predict(model7,explanatory=explanatory,type="response > "),lty=1,col="green") > > ## > > The only model that performs equally well is model7. > > How would you deal with this kind of analysis? What would be your > recommendation to the students, > given the fact that most of the standard glm models obviously don?t > seem to produce good fits here? > > Many thanks and best wishes > Christoph > > (using R 2.8.0 on Windows XP) > > > > > > -- > Dr. rer.nat. Christoph Scherber > University of Goettingen > DNPW, Agroecology > Waldweg 26 > D-37073 Goettingen > Germany > > phone +49 (0)551 39 8807 > fax +49 (0)551 39 8806 > > Homepage http://www.gwdg.de/~cscherb1 > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting- > guide.html > and provide commented, minimal, self-contained, reproducible code.