Hello, I have found a problem with using the glm function with a gamma family. I have a vector of data, assumed to be generated by a gamma distribution. The parameters of this gamma distribution are estimated in two ways (i) using the glm() function, (ii) "by hand", using the optim() function. I find that the -2*likelihood at the maximum found by (i) is substantially larger than that found by (ii), i.e. the glm() function is not finding the maximum. This is some what of a pathological example, as the data set is highly skewed and contains a couple of outliers. I've tested this in S+ and the same problem is there too. Is this cause for concern, or is my data set just a "nasty" one to deal with? I am really impressed with the optim() function. Indeed, it is the reason why I switched to R from Splus. The Splus analogue was very slow, and didn't find the maximum. The data set and code for the two methods of estimation are included below. I don't think I am making a mistake here. Sorry if I have. Thanks Richard> gamma1(data) #uses the glm() function$loglik [1] 875.4274 $par [1] 9.572403e-02 4.345771e+03> gamma2(data) #"by hand" using optim()$loglik [1] 793.3913 $par [1] 0.518145 802.854297 #Data set data_c(51.47, 210.19, 49.55, 61.93, 60.61, 744.57, 338.59, 133.93, 191.57, 111.43, 432.83, 185.23, 155.61, 84.72, 120.2, 15.33, 77.05, 115.77, 25.23, 657.94, 108.39, 61.08, 142.42, 87.86, 272.87, 213.78, 65.23, 102.45, 58.16, 176.58, 76.58, 434.12, 362.35, 102.53, 103.6, 25.23, 97.19, 88.52, 118.55, 151.9, 2.7, 156.41, 21.79, 272.27, 23.16, 32.07, 6325.23, 92.37, 8340.04, 51.08, 55.59, 94.08, 69.98, 554.13, 104.88, 170.15, 945.1, 143.52) #Fits data to a gamma distribution using glm() gamma1_function(data){ n_length(data) m_summary(glm(data~1, family=Gamma(link=identity))) shape_1/as.numeric(m$disp) scale_as.numeric(m$coeff[1]*m$disp) dev.res_-2*log(dgamma(data,shape=shape,scale=scale)) loglik_sum(dev.res) #actually -2 * log like list(loglik=loglik,par=c(shape,scale)) } #Fits data to a gamma distribution "by hand" using optim() gamma2_function(data){ n_length(data) m_summary(glm(data~1, family=Gamma)) shape_1/as.numeric(m$disp) #L = -Log likelihood L_function(x){-(-n*log(gamma(x[1]))+n*x[1]*log(x[1]/x[2])+(x[1]-1)*sum(log(data))-x[1]/x[2] *sum(data))} start_c(shape, mean(data)) parscale_start fit_optim(start,L,method="L-BFGS-B",lower=c(shape/100,0), upper=c(NA,NA),control=list(parscale=parscale)) shape_fit$par[1] mu_fit$par[2] scale_mu/shape dev.res_-2*log(dgamma(data,shape=shape,scale=scale)) loglik_sum(dev.res) #actually -2 * log like list(loglik=loglik,par=c(shape,scale)) } -- Dr. Richard Nixon MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge, CB2 2SR http://www.mrc-bsu.cam.ac.uk/personal/richard Tel: +44 (0)1223 330382, Fax: +44 (0)1223 330388 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

In message <Pine.GSO.4.44.0204221641030.4377-100000 at moeran> you write:> > #Fits data to a gamma distribution using glm() > gamma1_function(data){ > n_length(data) > m_summary(glm(data~1, family=Gamma(link=identity))) > shape_1/as.numeric(m$disp)You're not using the mle of the gamma shape parameter. See the function shape.gamma in library(MASS). -- Brett Presnell Department of Statistics University of Florida http://www.stat.ufl.edu/~presnell/ -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Hello, Thanks to Brett Presnell and Brian Ripley I see the error of my ways. I now conclude one of two things, either: (1) model_glm(data~1, family=Gamma) summary(model)$dispersion is not the same thing as dispersion as defined by McCullagh and Nelder (M+N) (Generalized linear models). Because the shape parameter of a gamma distribution is 1/(M+N dispersion), but 1/summary(model)$dispersion is not the shape parameter. However, or (2) 1/summary(model)$dispersion is supposed to equal M+N dispersion but my pathological data set messes it up. note that the moment estimator of the shape parameter mean(data)^2/var(data) = 0.09572403 is close to 1/summary(model)$dispersion = 0.09622557 Brian Ripley also points out that optim() is now part of the latest MASS library for splus6. Thanks Richard> library(MASS) > gamma1(data) #uses the glm() function$loglik1 [1] 875.0035 $loglik2 [1] 793.3913 # Now agrees with log likelihood below $par [1] 0.09622557 0.51814501 415.99465517 # $par[2]=shape = gamma.shape(m)$alpha is correct # $par[1]=shape = 1/summary(m)$disp is not correct> gamma2(data) #"by hand" using optim()$loglik [1] 793.3913 $par [1] 0.518145 415.994662 data_c(51.47, 210.19, 49.55, 61.93, 60.61, 744.57, 338.59, 133.93, 191.57, 111.43, 432.83, 185.23, 155.61, 84.72, 120.2, 15.33, 77.05, 115.77, 25.23, 657.94, 108.39, 61.08, 142.42, 87.86, 272.87, 213.78, 65.23, 102.45, 58.16, 176.58, 76.58, 434.12, 362.35, 102.53, 103.6, 25.23, 97.19, 88.52, 118.55, 151.9, 2.7, 156.41, 21.79, 272.27, 23.16, 32.07, 6325.23, 92.37, 8340.04, 51.08, 55.59, 94.08, 69.98, 554.13, 104.88, 170.15, 945.1, 143.52) #Fits data to a gamma distribution using glm() gamma1_function(data){ m_glm(data~1, family=Gamma) shape1_1/summary(m)$disp shape2_gamma.shape(m)$alpha mu_mean(data) dev.res1_-2*log(dgamma(data,shape=shape1,scale=mu/shape1)) loglik1_sum(dev.res1) #actually -2 * log like dev.res2_-2*log(dgamma(data,shape=shape2,scale=mu/shape2)) loglik2_sum(dev.res2) #actually -2 * log like list(loglik1=loglik1,loglik2=loglik2,par=c(shape1,shape2,mu)) } #Fits data to a gamma distribution "by hand" using optim() gamma2_function(data){ n_length(data) m_glm(data~1, family=Gamma) shape_gamma.shape(m)$alpha #L = -Log likelihood L_function(x){-(-n*log(gamma(x[1]))+n*x[1]*log(x[1]/x[2])+(x[1]-1)*sum(log(data))-x[1]/x[2] *sum(data))} start_c(shape, mean(data)) parscale_start fit_optim(start,L,method="L-BFGS-B",lower=c(shape/100,0), upper=c(NA,NA),control=list(parscale=parscale)) shape_fit$par[1] mu_fit$par[2] dev.res_-2*log(dgamma(data,shape=shape,scale=mu/shape)) loglik_sum(dev.res) #actually -2 * log like list(loglik=loglik,par=c(shape,mu)) } -- Dr. Richard Nixon MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge, CB2 2SR http://www.mrc-bsu.cam.ac.uk/personal/richard Tel: +44 (0)1223 330382, Fax: +44 (0)1223 330388 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

For those of you who have been following my random ranting about the glm(,family=gamma) function, for completeness here are my final (hopefully!) remarks. Thanks to Brett Presnell for explaining how the glm function estimates its parameters. The dispersion parameter in a gamma distribution is the same thing as dispersion as defined by McCullagh and Nelder (Generalized linear models). It is estimated by a moment estimator. If one requires a maximum likelihood estimator, then one can use the gamma.shape() function from the MASS library. shape=1/dispersion. These two methods will generally yield similar results, but for highly skewed data, like that given in my example, they can differ markedly. Code showing this is included below. Hope this is of some use to someone, and has iterated close enough to how this function actually works. Richard.> gamma(data)$loglik.mom [1] 875.0035 #-2log likelihood using a moment estimator for shape $loglik.mle [1] 793.3913 #-2log likelihood using a mle estimator for shape $par [1] 0.09622557 0.51814501 415.99465517 #par[1] = moment estmator for shape #par[2] = mle estimator for shape data_c(51.47, 210.19, 49.55, 61.93, 60.61, 744.57, 338.59, 133.93, 191.57, 111.43, 432.83, 185.23, 155.61, 84.72, 120.2, 15.33, 77.05, 115.77, 25.23, 657.94, 108.39, 61.08, 142.42, 87.86, 272.87, 213.78, 65.23, 102.45, 58.16, 176.58, 76.58, 434.12, 362.35, 102.53, 103.6, 25.23, 97.19, 88.52, 118.55, 151.9, 2.7, 156.41, 21.79, 272.27, 23.16, 32.07, 6325.23, 92.37, 8340.04, 51.08, 55.59, 94.08, 69.98, 554.13, 104.88, 170.15, 945.1, 143.52) #Fits data to a gamma distribution using glm() gamma_function(data){ require(MASS, quietly=T) model_glm(data~1, family=Gamma) shape.mom_1/summary(model)$dispersion #this is a moment estimator shape.mle_gamma.shape(m)$alpha #this is a mle estimator mu_mean(data) dev.res.mom_-2*log(dgamma(data,shape=shape.mom,scale=mu/shape.mom)) loglik.mom_sum(dev.res.mom) #actually -2 * log like dev.res.mle_-2*log(dgamma(data,shape=shape.mle,scale=mu/shape.mle)) loglik.mle_sum(dev.res.mle) #actually -2 * log like list(loglik.mom=loglik.mom,loglik.mle=loglik.mle,par=c(shape.mom,shape.mle,mu)) } -- Dr. Richard Nixon MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge, CB2 2SR http://www.mrc-bsu.cam.ac.uk/personal/richard Tel: +44 (0)1223 330382, Fax: +44 (0)1223 330388 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._