Hi all,

This is as much a statistical/estimation question as an R-specific
one, but here goes.

I am trying to use logistic regression to predict suitability of
habitats for certain plant species. The response variable is a binary
one that indicates whether a particular species is found at a site on
the landscape. The independent variables represent physical
characteristics of the landscape derived from a GIS. A significant
proportion of the time I get the following warning messages from
glm():

> lr <- glm(known.v1~elevation+aspect+slope+energy15+energy166+aspect+accum+streams.buffered,family=binomial,data=siteframe)
Warning messages:
1: Algorithm did not converge in: (if (is.empty.model(mt)) glm.fit.null else glm.fit)(x = X, y = Y,
2: fitted probabilities numerically 0 or 1 occurred in: (if (is.empty.model(mt)) glm.fit.null else glm.fit)(x = X, y = Y,

Now I can get the algorithm to converge (or at least not produce the
warning) by increasing the number of iterations, but that does not
affect the second warning. A read of Hosmer and Lemeshow (1989) does not
provide much insight, so I thought that I would post the question to
the list.

Any comments? Also, I'd be happy to email a dataset that exhibits this
behavior if anyone is curious enough.

Cheers,
A.

--
Allan Strand, Biology        linum.cofc.edu
College of Charleston        Ph. (843) 953-8085
Charleston, SC 29424         Fax (843) 953-5453
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Prof Brian D Ripley
2000-Jul-11 16:36 UTC
[R] warnings associated with logistic regression
On 11 Jul 2000, Allan Strand wrote:

> Hi all,
>
> This is as much a statistical/estimation question as an R-specific
> one, but here goes.
>
> I am trying to use logistic regression to predict suitability of
> habitats for certain plant species. The response variable is a binary
> one that indicates whether a particular species is found at a site on
> the landscape. The independent variables represent physical
> characteristics of the landscape derived from a GIS. A significant
> proportion of the time I get the following warning messages from
> glm():
>
> > lr <- glm(known.v1~elevation+aspect+slope+energy15+energy166+aspect+accum+streams.buffered,family=binomial,data=siteframe)
> Warning messages:
> 1: Algorithm did not converge in: (if (is.empty.model(mt)) glm.fit.null else glm.fit)(x = X, y = Y,
> 2: fitted probabilities numerically 0 or 1 occurred in: (if (is.empty.model(mt)) glm.fit.null else glm.fit)(x = X, y = Y,
>
> Now I can get the algorithm to converge (or at least not produce the
> warning) by increasing the number of iterations, but that does not
> affect the second warning. A read of Hosmer and Lemeshow (1989) does not
> provide much insight, so I thought that I would post the question to
> the list.
>
> Any comments? Also, I'd be happy to email a dataset that exhibits this
> behavior if anyone is curious enough.

It usually means that your dataset exhibits complete separation, and so
a logistic regression can fit perfectly. All the diagnostics (p-values
etc) are then (very) unreliable. There are also concepts of partial
separation, where only some of the cases are fitted perfectly, but
similar comments apply.

This is shamefully missed in most statistics books, but is well known in
the AI community, which used to seek such fits (as `perceptrons') and
do again (as `support vector machines'). Santer & Duffy is the only
contingency-tables book I know that covers this, as does my (1996)
Pattern Recognition and Neural Networks book.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  stats.ox.ac.uk/~ripley
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
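For readers who want to see complete separation in isolation, here is a minimal sketch on made-up data (the vectors x and y are invented for illustration and have nothing to do with the poster's siteframe). Every site with x below zero is an absence and every site above zero is a presence, so a single cut point classifies all cases perfectly. The exact wording of the warnings depends on the R version, so the code only looks for the "fitted probabilities" phrase.

```r
# Hypothetical toy data: y is 0 for every negative x and 1 for every positive x,
# so the two groups do not overlap on x (complete separation).
x <- c(-3, -2, -1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)

# A simple one-predictor check: the largest x among the 0s lies below
# the smallest x among the 1s.
separated <- max(x[y == 0]) < min(x[y == 1])
print(separated)

# Fit the logistic regression, collecting warnings instead of printing them.
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)                 # warnings raised during fitting
print(coef(summary(fit)))   # large estimate with an enormous standard error
print(range(fitted(fit)))   # fitted probabilities pushed towards 0 and 1
```

With several predictors, as in the original question, separation can occur along a linear combination of variables, so a one-variable check like the one above may not reveal it.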
On 11 Jul 2000, Allan Strand wrote:

> Hi all,
>
> This is as much a statistical/estimation question as an R-specific
> one, but here goes.
>
> I am trying to use logistic regression to predict suitability of
> habitats for certain plant species. The response variable is a binary
> one that indicates whether a particular species is found at a site on
> the landscape. The independent variables represent physical
> characteristics of the landscape derived from a GIS. A significant
> proportion of the time I get the following warning messages from
> glm():
>
> > lr <- glm(known.v1~elevation+aspect+slope+energy15+energy166+aspect+accum+streams.buffered,family=binomial,data=siteframe)
> Warning messages:
> 1: Algorithm did not converge in: (if (is.empty.model(mt)) glm.fit.null else glm.fit)(x = X, y = Y,
> 2: fitted probabilities numerically 0 or 1 occurred in: (if (is.empty.model(mt)) glm.fit.null else glm.fit)(x = X, y = Y,
>
> Now I can get the algorithm to converge (or at least not produce the
> warning) by increasing the number of iterations, but that does not
> affect the second warning.

Well, that's what you'd expect. The warning says that for certain
combinations of predictors the fitted response is equal to 0 or 1. This
also means that the maximum of the likelihood is at infinity for some
coefficients. This potentially causes numerical problems, at least in
that R won't report infinite coefficients. It also causes statistical
problems, because the Wald p-values reported are not useful for very
large coefficients.

Sometimes this happens when you try to fit too many parameters, in which
case you may be able to fix it. It can also happen when the coefficient
in question really is large and happens by chance to give perfect
predictions. A third possibility is that the probability really is zero
(e.g., above the treeline you really don't have any trees), in which case
you don't want a logistic regression model.

-thomas
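The point about the likelihood maximum lying at infinity can be illustrated with a small sketch, again on invented data (x, y and the helper fit_with are hypothetical names, not from the thread). Allowing more iterations does not settle on a finite estimate; the slope simply keeps growing. The Wald p-value for that slope is close to 1 even though x predicts y perfectly. The likelihood-ratio figure is included only for contrast and is an addition not discussed in the thread; under separation it should not be treated as reliable either.

```r
# Hypothetical, completely separated toy data.
x <- c(-3, -2, -1, 1, 2, 3)
y <- c(0, 0, 0, 1, 1, 1)

# Helper (illustrative name): fit with a given iteration cap, silencing warnings.
fit_with <- function(maxit) {
  suppressWarnings(
    glm(y ~ x, family = binomial, control = glm.control(maxit = maxit))
  )
}

short <- fit_with(5)    # stopped early
long  <- fit_with(25)   # allowed to run longer

# The slope estimate grows with the iteration cap instead of stabilising.
slopes <- c(maxit5 = coef(short)[["x"]], maxit25 = coef(long)[["x"]])
print(slopes)

# Wald p-value for the slope: huge standard error, so z is near 0.
wald_p <- coef(summary(long))["x", "Pr(>|z|)"]

# Likelihood-ratio comparison against the intercept-only model, for contrast.
lr_stat <- long$null.deviance - long$deviance
lr_p <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(c(wald_p = wald_p, lr_p = lr_p))
```

This matches the remark that raising the iteration count can silence the convergence warning without changing the underlying problem.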