Mr Timothy James BENHAM
2010-Nov-03 00:03 UTC
[R] bugs and misfeatures in polr(MASS).... fixed!
In polr.R the (several) functions gmin and fmin contain the code> theta <- beta[pc + 1L:q] > gamm <- c(-100, cumsum(c(theta[1L], exp(theta[-1L]))), 100)That's bad. There's no reason to suppose beta[pc+1L] is larger than -100 or that the cumulative sum is smaller than 100. For practical datasets those assumptions are frequently violated, causing the optimization to fail. A work-around is to center the explanatory variables. This helps keep the zetas small. The correct approach is to use the values -Inf and Inf as the first and last cut points. The functions plogis, dnorm, etc all behave correctly when the input is one of these values. The dgumbel function does not, returning NaN for -Inf. Correct this as follows dgumbel <- function (x, loc = 0, scale = 1, log = FALSE) { x <- (x - loc)/scale d <- log(1/scale) - x - exp(-x) d[is.nan(d)] <- -Inf # -tjb if (!log) exp(d) else d } The documentation states> start: initial values for the parameters. This is in the format > 'c(coefficients, zeta)': see the Values section.The relevant code is> s0 <- if(pc > 0) c(start[seq_len(pc+1)], diff(start[-seq_len(pc)])) > else c(start[1L], diff(start))This doesn't take the logs of the differences as required to repose the zetas into the form used in the optimization. The fix is obvious. polr.fit has the same problem which is responsible for summary() frequently failing when the Hessian is not provided. I'm not convinced the t values reported by summary() are reliable. I've noticed that a one dimensional linear transformation the independent variables can cause the reported t values to change by a factor of more than 100, which doesn't seem right. --tjb
The start value generation code in polr is also known to fail quite frequently. For example, against the Iris data as recently posted to this list by blackscorpio ( Sep 6, 2010).> polr(Species~Sepal_Length+Sepal_Width+Petal_Length+Petal_Width,data=iris)Error in polr(Species ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width, : attempt to find suitable starting values failed In addition: Warning messages: 1: In glm.fit(X, y1, wt, family = binomial(), offset = offset) : algorithm did not converge 2: In glm.fit(X, y1, wt, family = binomial(), offset = offset) : fitted probabilities numerically 0 or 1 occurred I suggest that simply setting the coefficients beta to zero and the cutpoints zeta to sensible values will always produce a feasible starting point for non-pathological data. Here is my code that does this: if(missing(start)) { # try something that should always work -tjb u <- as.integer(table(y)) u <- (cumsum(u)/sum(u))[1:q] zetas <- switch(method, "logistic"= qlogis(u), "probit"= qnorm(u), "cauchit"= qcauchy(u), "cloglog"= -log(-log(u)) ) s0 <- c(rep(0,pc),zetas[1],log(diff(zetas))) Using this start code the problem is not manifested.> source('fixed-polr.R') > polr(Species~Sepal_Length+Sepal_Width+Petal_Length+Petal_Width,data=iris)Call: polr(formula = Species ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width, data = iris) Coefficients: Sepal_Length Sepal_Width Petal_Length Petal_Width -2.466724 -6.671515 9.431689 18.270058 Intercepts: setosa|versicolor versicolor|virginica 4.080189 42.639320 Residual Deviance: 11.89857 AIC: 23.89857 My change would also likely fix the problem reported by Kevin Coombes on May 6, 2010. -- View this message in context: http://r.789695.n4.nabble.com/bugs-and-misfeatures-in-polr-MASS-fixed-tp3024677p3030405.html Sent from the R help mailing list archive at Nabble.com.
Note that that the enhancements in my original post solve the unresolved problem of Chaehyung Ahn (22 Mar 2005) whose data I reproduce: y,x,lx 0,3.2e-02,-1.49485 0,3.2e-02,-1.49485 0,1.0e-01,-1.00000 0,1.0e-01,-1.00000 0,3.2e-01,-0.49485 0,3.2e-01,-0.49485 1,1.0e+00,0.00000 0,1.0e+00,0.00000 1,3.2e+00,0.50515 1,3.2e+00,0.50515 0,1.0e+01,1.00000 1,1.0e+01,1.00000 1,3.2e+01,1.50515 2,3.2e+01,1.50515 2,1.0e+02,2.00000 1,1.0e+02,2.00000 2,3.2e+02,2.50515 1,3.2e+02,2.50515 2,1.0e+03,3.00000 2,1.0e+03,3.00000 Using the MASS version we get> ahn$y<-as.factor(ahn$y) > summary(polr(y~lx,data=ahn))Re-fitting to get Hessian Error in optim(s0, fmin, gmin, method = "BFGS", hessian = Hess, ...) : initial value in 'vmmin' is not finite Whereas,> source('fixed-polr.R') > summary(polr(y~lx,data=ahn))Re-fitting to get Hessian Call: polr(formula = y ~ lx, data = ahn) Coefficients: Value Std. Error t value lx 2.421 0.8146 2.971 Intercepts: Value Std. Error t value 0|1 0.5865 0.8118 0.7224 1|2 4.8966 1.7422 2.8106 Residual Deviance: 20.43286 AIC: 26.43286 -- View this message in context: http://r.789695.n4.nabble.com/bugs-and-misfeatures-in-polr-MASS-fixed-tp3024677p3030411.html Sent from the R help mailing list archive at Nabble.com.