thr3ads.net - R devel - [Rd] Bug in glm.fit? (PR#1331) [Feb 2002]

If this information is useful, please help other people find it:
Share via:

berwin@maths.uwa.edu.au

2002-Feb-27 08:16 UTC

[Rd] Bug in glm.fit? (PR#1331)

G'day all,

I had a look at the GLM code of R (1.4.1) and I believe that there are
problems with the function "glm.fit" that may bite in rare
circumstances.  Note, I have no data set with which I ran into
trouble.  This report is solely based on having a look at the code.

Below I append a listing of the glm.fit function as produced by my
system.  I have added line numbers so that I can easily refer to the
code that is, in my opinion, problematic.  I also append a modified
version that, hopefully, solves these problems.  At least on
"example(glm)" both functions produce the same output (if the random
number generator is set to the same seed).

Short summary: I envisage problems with glm.fit if two conditions are
met simultaneously.  Namely, (1) the design matrix is such that the QR
algorithm used in the weighted least squares (WLS) step has to pivot
columns (lines 81-86) and (2) during the iterations the step size has
to be truncated (either lines 98-114 or lines 115-130).

More details:
In line 90, the variables "start" and "coef" are set to the
coefficients of the WLS and "start" is then permuted (line 91) to take
a possible pivoting of the QR algorithm into account.  Note that
"coef" is not permutated.   Also, since the weights in the iterative
weighted least squares algorithm change in each iteration (lines
60-139) it is theoretically possible that on some executions of this
chunk pivoting occurs and on others there is no pivoting.

Hoever, if the new coefficients are unsuitable and either the block at
lines 98-114 or lines 115-130 is entered, then "start" is modified by
shrinking it toward "coefold".  There are two problems:
1) If this happens the first time the loop in lines 60-139 is executed,
   then coefold is initialised to the value of "start" that was passed
down to the routine (line 58).  This might well be NULL (usually is?).
In this case line 105 and line 122 would produce a new "start" which
is numeric(0) and glm.fit would stop execution pretty soon thereafter.
2) If it happens during a later execution of the loop in lines 60-139,
   then "coefold" was set to the value "coef" of the prior
execution
of the loop (line 137).  Depending on whether either of the code
chunks at lines 98-114 or lines 115-130 have been executed on that
previous occastion, "coefold" may or may not be permuted to the same
order as "start" in this moment.  Hence, it could happen that
"start"
is shrunken towards something that doesn't make sense in line 105 or
line 122.

Relating to point 2) above.  Theoretically, during the main loop
(lines 60-139) either of the chunks in lines 98-114 or lines 115-130
could be entered.  In these chunks "coef" is set to the shrunken
"start" at the end (line 111 or line 126).  If, with the ``deviance
for the shrunken "start"'' the convergence criterion in line
131 is
fulfilled, then the main loop would finish.  In other words, at line
140 there is no way to tell whether "coef" is in the order of
"fit$coefficients" or has been permutated to have the same order as
"start".  In the later case, the additional permutation in line 157
would garble the solution.

I hope that the revised version that I append below, works around all
these potential problems without breaking anything.  As I said, I ran
"example(glm)" with the revised version and got the same results.  But
I guess you guys have far more exhaustive tests and may want to do
some further test to ensure that nothing is broken.

The revised version also makes some cosmetic changes that reflect my
preferences :-) :
1) Given the default values of the parameters "weights" and
"offset" I
   was surprised about the tests in lines 17 and 19.  They seemed
superfluous.  Until I noticed that "glm" can call "glm.fit"
with
actual arguments for these parameters that are NULL.  Hence, I suggest
that the default values for these arguments should be NULL.  This
makes it easier to understand the code of "glm.fit" (if one looks only
at "glm.fit" and not at all the code "glm",
"glm.fit" &c
simultaneously). 
2) As mentioned above, "coefold" may be initialised to NULL in line 58
   if the parameter "start" is not specified.  (Which typically would
be
the case?)  But furthermore, in lines 44-53 the test for the parameter
"start" is mangled into the test for the parameter
"etastart".  If
"etastart" is not specified but a bad "start" (not the
correct
length), the bad "start" would be caught.  But if a valid
"etastart"
is specified, then a bad "start" would not be caught since those tests
are not executed.  (Are there users that specify both?).  Anyway, I
have untangled those sanity check.  Also to get a clean initialisation
for "coefold".
3) Given that the column names of fit$qr are (correctly) set in line
   171, I didn't see the purpose of line 155.  That line anyhow sets
the column names wrongly if pivoting has occured.  So I deleted that
line. 

That's all for now.  Thanks for the great package and I hope that this
bug report will be useful.

Cheers,

        Berwin

========================= original glm.fit ========================  1  glm.fit
<-
  2  function (x, y, weights = rep(1, nobs), start = NULL, etastart = NULL, 
  3      mustart = NULL, offset = rep(0, nobs), family = gaussian(), 
  4      control = glm.control(), intercept = TRUE) 
  5  {
  6      x <- as.matrix(x)
  7      xnames <- dimnames(x)[[2]]
  8      ynames <- names(y)
  9      conv <- FALSE
 10      nobs <- NROW(y)
 11      nvars <- NCOL(x)
 12      if (nvars == 0) {
 13          cc <- match.call()
 14          cc[[1]] <- as.name("glm.fit.null")
 15          return(eval(cc, parent.frame()))
 16      }
 17      if (is.null(weights)) 
 18          weights <- rep(1, nobs)
 19      if (is.null(offset)) 
 20          offset <- rep(0, nobs)
 21      variance <- family$variance
 22      dev.resids <- family$dev.resids
 23      aic <- family$aic
 24      linkinv <- family$linkinv
 25      mu.eta <- family$mu.eta
 26      if (!is.function(variance) || !is.function(linkinv)) 
 27          stop("illegal `family' argument")
 28      valideta <- family$valideta
 29      if (is.null(valideta)) 
 30          valideta <- function(eta) TRUE
 31      validmu <- family$validmu
 32      if (is.null(validmu)) 
 33          validmu <- function(mu) TRUE
 34      if (is.null(mustart)) {
 35          eval(family$initialize)
 36      }
 37      else {
 38          mukeep <- mustart
 39          eval(family$initialize)
 40          mustart <- mukeep
 41      }
 42      if (NCOL(y) > 1) 
 43          stop("y must be univariate unless binomial")
 44      eta <- if (!is.null(etastart) && valideta(etastart)) 
 45          etastart
 46      else if (!is.null(start)) 
 47          if (length(start) != nvars) 
 48              stop(paste("Length of start should equal", nvars, 
 49                  "and correspond to initial coefs for",
deparse(xnames)))
 50          else as.vector(if (NCOL(x) == 1) 
 51              x * start
 52          else x %*% start)
 53      else family$linkfun(mustart)
 54      mu <- linkinv(eta)
 55      if (!(validmu(mu) && valideta(eta))) 
 56          stop("Can't find valid starting values: please specify
some")
 57      devold <- sum(dev.resids(y, mu, weights))
 58      coefold <- start
 59      boundary <- FALSE
 60      for (iter in 1:control$maxit) {
 61          good <- weights > 0
 62          varmu <- variance(mu)[good]
 63          if (any(is.na(varmu))) 
 64              stop("NAs in V(mu)")
 65          if (any(varmu == 0)) 
 66              stop("0s in V(mu)")
 67          mu.eta.val <- mu.eta(eta)
 68          if (any(is.na(mu.eta.val[good]))) 
 69              stop("NAs in d(mu)/d(eta)")
 70          good <- (weights > 0) & (mu.eta.val != 0)
 71          if (all(!good)) {
 72              conv <- FALSE
 73              warning(paste("No observations informative at
iteration",
 74                  iter))
 75              break
 76          }
 77          z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
 78          w <- sqrt((weights[good] *
mu.eta.val[good]^2)/variance(mu)[good])
 79          ngoodobs <- as.integer(nobs - sum(!good))
 80          ncols <- as.integer(1)
 81          fit <- .Fortran("dqrls", qr = x[good, ] * w, n =
as.integer(ngoodobs),
 82              p = nvars, y = w * z, ny = ncols, tol = min(1e-07, 
 83                  control$epsilon/1000), coefficients = numeric(nvars), 
 84              residuals = numeric(ngoodobs), effects = numeric(ngoodobs), 
 85              rank = integer(1), pivot = 1:nvars, qraux = double(nvars), 
 86              work = double(2 * nvars), PACKAGE = "base")
 87          if (nobs < fit$rank) 
 88              stop(paste("X matrix has rank", fit$rank, "but
only",
 89                  nobs, "observations"))
 90          start <- coef <- fit$coefficients
 91          start[fit$pivot] <- coef
 92          eta <- drop(x %*% start)
 93          mu <- linkinv(eta <- eta + offset)
 94          dev <- sum(dev.resids(y, mu, weights))
 95          if (control$trace) 
 96              cat("Deviance =", dev, "Iterations -",
iter, "\n")
 97          boundary <- FALSE
 98          if (is.na(dev) || any(is.na(coef))) {
 99              warning("Step size truncated due to divergence")
100              ii <- 1
101              while ((is.na(dev) || any(is.na(start)))) {
102                  if (ii > control$maxit) 
103                    stop("inner loop 1; can't correct step
size")
104                  ii <- ii + 1
105                  start <- (start + coefold)/2
106                  eta <- drop(x %*% start)
107                  mu <- linkinv(eta <- eta + offset)
108                  dev <- sum(dev.resids(y, mu, weights))
109              }
110              boundary <- TRUE
111              coef <- start
112              if (control$trace) 
113                  cat("New Deviance =", dev, "\n")
114          }
115          if (!(valideta(eta) && validmu(mu))) {
116              warning("Step size truncated: out of bounds.")
117              ii <- 1
118              while (!(valideta(eta) && validmu(mu))) {
119                  if (ii > control$maxit) 
120                    stop("inner loop 2; can't correct step
size")
121                  ii <- ii + 1
122                  start <- (start + coefold)/2
123                  mu <- linkinv(eta <- eta + offset)
124              }
125              boundary <- TRUE
126              coef <- start
127              dev <- sum(dev.resids(y, mu, weights))
128              if (control$trace) 
129                  cat("New Deviance =", dev, "\n")
130          }
131          if (abs(dev - devold)/(0.1 + abs(dev)) < control$epsilon) {
132              conv <- TRUE
133              break
134          }
135          else {
136              devold <- dev
137              coefold <- coef
138          }
139      }
140      if (!conv) 
141          warning("Algorithm did not converge")
142      if (boundary) 
143          warning("Algorithm stopped at boundary value")
144      eps <- 10 * .Machine$double.eps
145      if (family$family == "binomial") {
146          if (any(mu > 1 - eps) || any(mu < eps)) 
147              warning("fitted probabilities numerically 0 or 1
occurred")
148      }
149      if (family$family == "poisson") {
150          if (any(mu < eps)) 
151              warning("fitted rates numerically 0 occurred")
152      }
153      if (fit$rank != nvars) {
154          coef[seq(fit$rank + 1, nvars)] <- NA
155          dimnames(fit$qr) <- list(NULL, xnames)
156      }
157      coef[fit$pivot] <- coef
158      xxnames <- xnames[fit$pivot]
159      residuals <- rep(NA, nobs)
160      residuals[good] <- z - (eta - offset)[good]
161      fit$qr <- as.matrix(fit$qr)
162      nr <- min(sum(good), nvars)
163      if (nr < nvars) {
164          Rmat <- diag(nvars)
165          Rmat[1:nr, 1:nvars] <- fit$qr[1:nr, 1:nvars]
166      }
167      else Rmat <- fit$qr[1:nvars, 1:nvars]
168      Rmat <- as.matrix(Rmat)
169      Rmat[row(Rmat) > col(Rmat)] <- 0
170      names(coef) <- xnames
171      colnames(fit$qr) <- xxnames
172      dimnames(Rmat) <- list(xxnames, xxnames)
173      names(residuals) <- ynames
174      names(mu) <- ynames
175      names(eta) <- ynames
176      wt <- rep(0, nobs)
177      wt[good] <- w^2
178      names(wt) <- ynames
179      names(weights) <- ynames
180      names(y) <- ynames
181      names(fit$effects) <- c(xxnames[seq(fit$rank)], rep("",
sum(good) -
182          fit$rank))
183      wtdmu <- if (intercept) 
184          sum(weights * y)/sum(weights)
185      else linkinv(offset)
186      nulldev <- sum(dev.resids(y, wtdmu, weights))
187      n.ok <- nobs - sum(weights == 0)
188      nulldf <- n.ok - as.integer(intercept)
189      resdf <- n.ok - fit$rank
190      aic.model <- aic(y, n, mu, weights, dev) + 2 * fit$rank
191      list(coefficients = coef, residuals = residuals, fitted.values = mu, 
192          effects = fit$effects, R = Rmat, rank = fit$rank, qr =
fit[c("qr",
193              "rank", "qraux", "pivot",
"tol")], family = family,
194          linear.predictors = eta, deviance = dev, aic = aic.model, 
195          null.deviance = nulldev, iter = iter, weights = wt, prior.weights =
weights,
196          df.residual = resdf, df.null = nulldf, y = y, converged = conv, 
197          boundary = boundary)
198  }
========================= modified glm.fit ========================glm.fit <-
function (x, y, weights = NULL, start = NULL, etastart = NULL, 
    mustart = NULL, offset = NULL, family = gaussian(), 
    control = glm.control(), intercept = TRUE) 
{
    x <- as.matrix(x)
    xnames <- dimnames(x)[[2]]
    ynames <- names(y)
    conv <- FALSE
    nobs <- NROW(y)
    nvars <- NCOL(x)
    if (nvars == 0) {
        cc <- match.call()
        cc[[1]] <- as.name("glm.fit.null")
        return(eval(cc, parent.frame()))
    }
    if (is.null(weights)) 
        weights <- rep(1, nobs)
    if (is.null(offset)) 
        offset <- rep(0, nobs)
    variance <- family$variance
    dev.resids <- family$dev.resids
    aic <- family$aic
    linkinv <- family$linkinv
    mu.eta <- family$mu.eta
    if (!is.function(variance) || !is.function(linkinv)) 
        stop("illegal `family' argument")
    valideta <- family$valideta
    if (is.null(valideta)) 
        valideta <- function(eta) TRUE
    validmu <- family$validmu
    if (is.null(validmu)) 
        validmu <- function(mu) TRUE
    if (is.null(mustart)) {
        eval(family$initialize)
    }
    else {
        mukeep <- mustart
        eval(family$initialize)
        mustart <- mukeep
    }
    if (NCOL(y) > 1) 
        stop("y must be univariate unless binomial")
    if (!is.null(start)){
      if (length(start) != nvars) 
        stop(paste("Length of start should equal", nvars, 
                   "and correspond to initial coefs for",
deparse(xnames)))
      coef <- coefold <- start  # initialise coef here in order to
                                # be able to subset it below 
    }
    else
      coef <- coefold <- rep(0,nvars)
    eta <- if (!is.null(etastart) && valideta(etastart)) 
      etastart
    else if (!is.null(start)) 
      as.vector(if (NCOL(x) == 1) 
                x * start
      else x %*% start)
    else family$linkfun(mustart)
    mu <- linkinv(eta)
    if (!(validmu(mu) && valideta(eta))) 
        stop("Can't find valid starting values: please specify
some")
    devold <- sum(dev.resids(y, mu, weights))
    boundary <- FALSE
    for (iter in 1:control$maxit) {
        good <- weights > 0
        varmu <- variance(mu)[good]
        if (any(is.na(varmu))) 
            stop("NAs in V(mu)")
        if (any(varmu == 0)) 
            stop("0s in V(mu)")
        mu.eta.val <- mu.eta(eta)
        if (any(is.na(mu.eta.val[good]))) 
            stop("NAs in d(mu)/d(eta)")
        good <- (weights > 0) & (mu.eta.val != 0)
        if (all(!good)) {
            conv <- FALSE
            warning(paste("No observations informative at iteration", 
                iter))
            break
        }
        z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
        w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
        ngoodobs <- as.integer(nobs - sum(!good))
        ncols <- as.integer(1)
        fit <- .Fortran("dqrls", qr = x[good, ] * w, n =
as.integer(ngoodobs),
            p = nvars, y = w * z, ny = ncols, tol = min(1e-07, 
                control$epsilon/1000), coefficients = numeric(nvars), 
            residuals = numeric(ngoodobs), effects = numeric(ngoodobs), 
            rank = integer(1), pivot = 1:nvars, qraux = double(nvars), 
            work = double(2 * nvars), PACKAGE = "base")
        if (nobs < fit$rank) 
            stop(paste("X matrix has rank", fit$rank, "but
only",
                nobs, "observations"))
        coef[fit$pivot] <- fit$coefficients
        eta <- drop(x %*% coef)
        mu <- linkinv(eta <- eta + offset)
        dev <- sum(dev.resids(y, mu, weights))
        if (control$trace) 
            cat("Deviance =", dev, "Iterations -", iter,
"\n")
        boundary <- FALSE
        if (is.na(dev) || any(is.na(coef))) {
            warning("Step size truncated due to divergence")
            ii <- 1
            while ((is.na(dev) || any(is.na(coef)))) {
                if (ii > control$maxit) 
                  stop("inner loop 1; can't correct step size")
                ii <- ii + 1
                coef <- (coef + coefold)/2
                eta <- drop(x %*% coef)
                mu <- linkinv(eta <- eta + offset)
                dev <- sum(dev.resids(y, mu, weights))
            }
            boundary <- TRUE
            if (control$trace) 
                cat("New Deviance =", dev, "\n")
        }
        if (!(valideta(eta) && validmu(mu))) {
            warning("Step size truncated: out of bounds.")
            ii <- 1
            while (!(valideta(eta) && validmu(mu))) {
                if (ii > control$maxit) 
                  stop("inner loop 2; can't correct step size")
                ii <- ii + 1
                coef <- (coef + coefold)/2
                mu <- linkinv(eta <- eta + offset)
            }
            boundary <- TRUE
            dev <- sum(dev.resids(y, mu, weights))
            if (control$trace) 
                cat("New Deviance =", dev, "\n")
        }
        if (abs(dev - devold)/(0.1 + abs(dev)) < control$epsilon) {
            conv <- TRUE
            break
        }
        else {
            devold <- dev
            coefold <- coef
        }
    }
    if (!conv) 
        warning("Algorithm did not converge")
    if (boundary) 
        warning("Algorithm stopped at boundary value")
    eps <- 10 * .Machine$double.eps
    if (family$family == "binomial") {
        if (any(mu > 1 - eps) || any(mu < eps)) 
            warning("fitted probabilities numerically 0 or 1
occurred")
    }
    if (family$family == "poisson") {
        if (any(mu < eps)) 
            warning("fitted rates numerically 0 occurred")
    }
    if (fit$rank != nvars) {
        coef[fit$pivot][seq(fit$rank + 1, nvars)] <- NA
    }
    xxnames <- xnames[fit$pivot]
    residuals <- rep(NA, nobs)
    residuals[good] <- z - (eta - offset)[good]
    fit$qr <- as.matrix(fit$qr)
    nr <- min(sum(good), nvars)
    if (nr < nvars) {
        Rmat <- diag(nvars)
        Rmat[1:nr, 1:nvars] <- fit$qr[1:nr, 1:nvars]
    }
    else Rmat <- fit$qr[1:nvars, 1:nvars]
    Rmat <- as.matrix(Rmat)
    Rmat[row(Rmat) > col(Rmat)] <- 0
    names(coef) <- xnames
    colnames(fit$qr) <- xxnames
    dimnames(Rmat) <- list(xxnames, xxnames)
    names(residuals) <- ynames
    names(mu) <- ynames
    names(eta) <- ynames
    wt <- rep(0, nobs)
    wt[good] <- w^2
    names(wt) <- ynames
    names(weights) <- ynames
    names(y) <- ynames
    names(fit$effects) <- c(xxnames[seq(fit$rank)], rep("",
sum(good) -
        fit$rank))
    wtdmu <- if (intercept) 
        sum(weights * y)/sum(weights)
    else linkinv(offset)
    nulldev <- sum(dev.resids(y, wtdmu, weights))
    n.ok <- nobs - sum(weights == 0)
    nulldf <- n.ok - as.integer(intercept)
    resdf <- n.ok - fit$rank
    aic.model <- aic(y, n, mu, weights, dev) + 2 * fit$rank
    list(coefficients = coef, residuals = residuals, fitted.values = mu, 
        effects = fit$effects, R = Rmat, rank = fit$rank, qr =
fit[c("qr",
            "rank", "qraux", "pivot",
"tol")], family = family,
        linear.predictors = eta, deviance = dev, aic = aic.model, 
        null.deviance = nulldev, iter = iter, weights = wt, prior.weights =
weights,
        df.residual = resdf, df.null = nulldf, y = y, converged = conv, 
        boundary = boundary)
}

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To:
r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

ripley@stats.ox.ac.uk

2002-Apr-05 15:13 UTC

head link

[Rd] Bug in glm.fit? (PR#1331)

Berwin,

Thank you for this.

Unfortunately, although the analysis is mainly correct, but the solution
has almost as many problems as the current state.  Suppose pivoting
occurs. Then for the internal code, coef needs to contain zeroes at the
omitted cols. But you have

coef <- coefold <- start  # initialise coef here in order to
                          # be able to subset it below

so it will contain whatever start had for those cols.  Further, the test
        if (is.na(dev) || any(is.na(coef))) {
is followed by

                coef <- (coef + coefold)/2
                eta <- drop(x %*% coef)

which will ensure that if any NAs occur in coef, they will be propogated
through eta.

A related point is that you assume that coefold <- rep(0,nvars) is
a valid starting value to shrink towards, but that need not be the case.
I think one can only shrink towards the previous value after the first
iteration.  (Only if start is specified is the validity checked:
an initial etastart or mustart is normally not in the space of
possible values under the model.)

There seems to me no point to test if `start' is valid if it is going to
be unused (as it will be if a valid eta is specified), so I believe the
original test logic is the correct one, not the unravelled one.

There are clearly problems in the glm.fit code, but in the absence of a
bug that has actually bitten someone, it seems only worth changing if we
have a 100% watertight solution.  As I said to Brett Presnell, I believe
the time would be better spent re-writing glm ab initio.

Also, the code you submitted had all the comments removed, and so was much
harder to read than the code in glm.R, as well as being subtly
reformatted.

Brian


On Wed, 27 Feb 2002 berwin@maths.uwa.edu.au wrote:
> G'day all,
>
> I had a look at the GLM code of R (1.4.1) and I believe that there are
> problems with the function "glm.fit" that may bite in rare
> circumstances.  Note, I have no data set with which I ran into
> trouble.  This report is solely based on having a look at the code.
>
> Below I append a listing of the glm.fit function as produced by my
> system.  I have added line numbers so that I can easily refer to the
> code that is, in my opinion, problematic.  I also append a modified
> version that, hopefully, solves these problems.  At least on
> "example(glm)" both functions produce the same output (if the
random
> number generator is set to the same seed).
>
> Short summary: I envisage problems with glm.fit if two conditions are
> met simultaneously.  Namely, (1) the design matrix is such that the QR
> algorithm used in the weighted least squares (WLS) step has to pivot
> columns (lines 81-86) and (2) during the iterations the step size has
> to be truncated (either lines 98-114 or lines 115-130).
>
> More details:
> In line 90, the variables "start" and "coef" are set to
the
> coefficients of the WLS and "start" is then permuted (line 91) to
take
> a possible pivoting of the QR algorithm into account.  Note that
> "coef" is not permutated.   Also, since the weights in the
iterative
> weighted least squares algorithm change in each iteration (lines
> 60-139) it is theoretically possible that on some executions of this
> chunk pivoting occurs and on others there is no pivoting.
>
> Hoever, if the new coefficients are unsuitable and either the block at
> lines 98-114 or lines 115-130 is entered, then "start" is
modified by
> shrinking it toward "coefold".  There are two problems:
> 1) If this happens the first time the loop in lines 60-139 is executed,
>    then coefold is initialised to the value of "start" that was
passed
> down to the routine (line 58).  This might well be NULL (usually is?).
> In this case line 105 and line 122 would produce a new "start"
which
> is numeric(0) and glm.fit would stop execution pretty soon thereafter.
> 2) If it happens during a later execution of the loop in lines 60-139,
>    then "coefold" was set to the value "coef" of the
prior execution
> of the loop (line 137).  Depending on whether either of the code
> chunks at lines 98-114 or lines 115-130 have been executed on that
> previous occastion, "coefold" may or may not be permuted to the
same
> order as "start" in this moment.  Hence, it could happen that
"start"
> is shrunken towards something that doesn't make sense in line 105 or
> line 122.
>
> Relating to point 2) above.  Theoretically, during the main loop
> (lines 60-139) either of the chunks in lines 98-114 or lines 115-130
> could be entered.  In these chunks "coef" is set to the shrunken
> "start" at the end (line 111 or line 126).  If, with the
``deviance
> for the shrunken "start"'' the convergence criterion in
line 131 is
> fulfilled, then the main loop would finish.  In other words, at line
> 140 there is no way to tell whether "coef" is in the order of
> "fit$coefficients" or has been permutated to have the same order
as
> "start".  In the later case, the additional permutation in line
157
> would garble the solution.
>
> I hope that the revised version that I append below, works around all
> these potential problems without breaking anything.  As I said, I ran
> "example(glm)" with the revised version and got the same results.
But
> I guess you guys have far more exhaustive tests and may want to do
> some further test to ensure that nothing is broken.
>
> The revised version also makes some cosmetic changes that reflect my
> preferences :-) :
> 1) Given the default values of the parameters "weights" and
"offset" I
>    was surprised about the tests in lines 17 and 19.  They seemed
> superfluous.  Until I noticed that "glm" can call
"glm.fit" with
> actual arguments for these parameters that are NULL.  Hence, I suggest
> that the default values for these arguments should be NULL.  This
> makes it easier to understand the code of "glm.fit" (if one looks
only
> at "glm.fit" and not at all the code "glm",
"glm.fit" &c
> simultaneously).
> 2) As mentioned above, "coefold" may be initialised to NULL in
line 58
>    if the parameter "start" is not specified.  (Which typically
would be
> the case?)  But furthermore, in lines 44-53 the test for the parameter
> "start" is mangled into the test for the parameter
"etastart".  If
> "etastart" is not specified but a bad "start" (not the
correct
> length), the bad "start" would be caught.  But if a valid
"etastart"
> is specified, then a bad "start" would not be caught since those
tests
> are not executed.  (Are there users that specify both?).  Anyway, I
> have untangled those sanity check.  Also to get a clean initialisation
> for "coefold".
> 3) Given that the column names of fit$qr are (correctly) set in line
>    171, I didn't see the purpose of line 155.  That line anyhow sets
> the column names wrongly if pivoting has occured.  So I deleted that
> line.
>
> That's all for now.  Thanks for the great package and I hope that this
> bug report will be useful.
>
> Cheers,
>
>         Berwin
>
> ========================= original glm.fit ========================>   1
glm.fit <-
>   2  function (x, y, weights = rep(1, nobs), start = NULL, etastart = NULL,
>   3      mustart = NULL, offset = rep(0, nobs), family = gaussian(),
>   4      control = glm.control(), intercept = TRUE)
>   5  {
>   6      x <- as.matrix(x)
>   7      xnames <- dimnames(x)[[2]]
>   8      ynames <- names(y)
>   9      conv <- FALSE
>  10      nobs <- NROW(y)
>  11      nvars <- NCOL(x)
>  12      if (nvars == 0) {
>  13          cc <- match.call()
>  14          cc[[1]] <- as.name("glm.fit.null")
>  15          return(eval(cc, parent.frame()))
>  16      }
>  17      if (is.null(weights))
>  18          weights <- rep(1, nobs)
>  19      if (is.null(offset))
>  20          offset <- rep(0, nobs)
>  21      variance <- family$variance
>  22      dev.resids <- family$dev.resids
>  23      aic <- family$aic
>  24      linkinv <- family$linkinv
>  25      mu.eta <- family$mu.eta
>  26      if (!is.function(variance) || !is.function(linkinv))
>  27          stop("illegal `family' argument")
>  28      valideta <- family$valideta
>  29      if (is.null(valideta))
>  30          valideta <- function(eta) TRUE
>  31      validmu <- family$validmu
>  32      if (is.null(validmu))
>  33          validmu <- function(mu) TRUE
>  34      if (is.null(mustart)) {
>  35          eval(family$initialize)
>  36      }
>  37      else {
>  38          mukeep <- mustart
>  39          eval(family$initialize)
>  40          mustart <- mukeep
>  41      }
>  42      if (NCOL(y) > 1)
>  43          stop("y must be univariate unless binomial")
>  44      eta <- if (!is.null(etastart) && valideta(etastart))
>  45          etastart
>  46      else if (!is.null(start))
>  47          if (length(start) != nvars)
>  48              stop(paste("Length of start should equal",
nvars,
>  49                  "and correspond to initial coefs for",
deparse(xnames)))
>  50          else as.vector(if (NCOL(x) == 1)
>  51              x * start
>  52          else x %*% start)
>  53      else family$linkfun(mustart)
>  54      mu <- linkinv(eta)
>  55      if (!(validmu(mu) && valideta(eta)))
>  56          stop("Can't find valid starting values: please
specify some")
>  57      devold <- sum(dev.resids(y, mu, weights))
>  58      coefold <- start
>  59      boundary <- FALSE
>  60      for (iter in 1:control$maxit) {
>  61          good <- weights > 0
>  62          varmu <- variance(mu)[good]
>  63          if (any(is.na(varmu)))
>  64              stop("NAs in V(mu)")
>  65          if (any(varmu == 0))
>  66              stop("0s in V(mu)")
>  67          mu.eta.val <- mu.eta(eta)
>  68          if (any(is.na(mu.eta.val[good])))
>  69              stop("NAs in d(mu)/d(eta)")
>  70          good <- (weights > 0) & (mu.eta.val != 0)
>  71          if (all(!good)) {
>  72              conv <- FALSE
>  73              warning(paste("No observations informative at
iteration",
>  74                  iter))
>  75              break
>  76          }
>  77          z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
>  78          w <- sqrt((weights[good] *
mu.eta.val[good]^2)/variance(mu)[good])
>  79          ngoodobs <- as.integer(nobs - sum(!good))
>  80          ncols <- as.integer(1)
>  81          fit <- .Fortran("dqrls", qr = x[good, ] * w, n =
as.integer(ngoodobs),
>  82              p = nvars, y = w * z, ny = ncols, tol = min(1e-07,
>  83                  control$epsilon/1000), coefficients = numeric(nvars),
>  84              residuals = numeric(ngoodobs), effects =
numeric(ngoodobs),
>  85              rank = integer(1), pivot = 1:nvars, qraux = double(nvars),
>  86              work = double(2 * nvars), PACKAGE = "base")
>  87          if (nobs < fit$rank)
>  88              stop(paste("X matrix has rank", fit$rank,
"but only",
>  89                  nobs, "observations"))
>  90          start <- coef <- fit$coefficients
>  91          start[fit$pivot] <- coef
>  92          eta <- drop(x %*% start)
>  93          mu <- linkinv(eta <- eta + offset)
>  94          dev <- sum(dev.resids(y, mu, weights))
>  95          if (control$trace)
>  96              cat("Deviance =", dev, "Iterations -",
iter, "\n")
>  97          boundary <- FALSE
>  98          if (is.na(dev) || any(is.na(coef))) {
>  99              warning("Step size truncated due to divergence")
> 100              ii <- 1
> 101              while ((is.na(dev) || any(is.na(start)))) {
> 102                  if (ii > control$maxit)
> 103                    stop("inner loop 1; can't correct step
size")
> 104                  ii <- ii + 1
> 105                  start <- (start + coefold)/2
> 106                  eta <- drop(x %*% start)
> 107                  mu <- linkinv(eta <- eta + offset)
> 108                  dev <- sum(dev.resids(y, mu, weights))
> 109              }
> 110              boundary <- TRUE
> 111              coef <- start
> 112              if (control$trace)
> 113                  cat("New Deviance =", dev, "\n")
> 114          }
> 115          if (!(valideta(eta) && validmu(mu))) {
> 116              warning("Step size truncated: out of bounds.")
> 117              ii <- 1
> 118              while (!(valideta(eta) && validmu(mu))) {
> 119                  if (ii > control$maxit)
> 120                    stop("inner loop 2; can't correct step
size")
> 121                  ii <- ii + 1
> 122                  start <- (start + coefold)/2
> 123                  mu <- linkinv(eta <- eta + offset)
> 124              }
> 125              boundary <- TRUE
> 126              coef <- start
> 127              dev <- sum(dev.resids(y, mu, weights))
> 128              if (control$trace)
> 129                  cat("New Deviance =", dev, "\n")
> 130          }
> 131          if (abs(dev - devold)/(0.1 + abs(dev)) < control$epsilon) {
> 132              conv <- TRUE
> 133              break
> 134          }
> 135          else {
> 136              devold <- dev
> 137              coefold <- coef
> 138          }
> 139      }
> 140      if (!conv)
> 141          warning("Algorithm did not converge")
> 142      if (boundary)
> 143          warning("Algorithm stopped at boundary value")
> 144      eps <- 10 * .Machine$double.eps
> 145      if (family$family == "binomial") {
> 146          if (any(mu > 1 - eps) || any(mu < eps))
> 147              warning("fitted probabilities numerically 0 or 1
occurred")
> 148      }
> 149      if (family$family == "poisson") {
> 150          if (any(mu < eps))
> 151              warning("fitted rates numerically 0 occurred")
> 152      }
> 153      if (fit$rank != nvars) {
> 154          coef[seq(fit$rank + 1, nvars)] <- NA
> 155          dimnames(fit$qr) <- list(NULL, xnames)
> 156      }
> 157      coef[fit$pivot] <- coef
> 158      xxnames <- xnames[fit$pivot]
> 159      residuals <- rep(NA, nobs)
> 160      residuals[good] <- z - (eta - offset)[good]
> 161      fit$qr <- as.matrix(fit$qr)
> 162      nr <- min(sum(good), nvars)
> 163      if (nr < nvars) {
> 164          Rmat <- diag(nvars)
> 165          Rmat[1:nr, 1:nvars] <- fit$qr[1:nr, 1:nvars]
> 166      }
> 167      else Rmat <- fit$qr[1:nvars, 1:nvars]
> 168      Rmat <- as.matrix(Rmat)
> 169      Rmat[row(Rmat) > col(Rmat)] <- 0
> 170      names(coef) <- xnames
> 171      colnames(fit$qr) <- xxnames
> 172      dimnames(Rmat) <- list(xxnames, xxnames)
> 173      names(residuals) <- ynames
> 174      names(mu) <- ynames
> 175      names(eta) <- ynames
> 176      wt <- rep(0, nobs)
> 177      wt[good] <- w^2
> 178      names(wt) <- ynames
> 179      names(weights) <- ynames
> 180      names(y) <- ynames
> 181      names(fit$effects) <- c(xxnames[seq(fit$rank)],
rep("", sum(good) -
> 182          fit$rank))
> 183      wtdmu <- if (intercept)
> 184          sum(weights * y)/sum(weights)
> 185      else linkinv(offset)
> 186      nulldev <- sum(dev.resids(y, wtdmu, weights))
> 187      n.ok <- nobs - sum(weights == 0)
> 188      nulldf <- n.ok - as.integer(intercept)
> 189      resdf <- n.ok - fit$rank
> 190      aic.model <- aic(y, n, mu, weights, dev) + 2 * fit$rank
> 191      list(coefficients = coef, residuals = residuals, fitted.values =
mu,
> 192          effects = fit$effects, R = Rmat, rank = fit$rank, qr =
fit[c("qr",
> 193              "rank", "qraux", "pivot",
"tol")], family = family,
> 194          linear.predictors = eta, deviance = dev, aic = aic.model,
> 195          null.deviance = nulldev, iter = iter, weights = wt,
prior.weights = weights,
> 196          df.residual = resdf, df.null = nulldf, y = y, converged =
conv,
> 197          boundary = boundary)
> 198  }
> ========================= modified glm.fit ========================>
glm.fit <-
> function (x, y, weights = NULL, start = NULL, etastart = NULL,
>     mustart = NULL, offset = NULL, family = gaussian(),
>     control = glm.control(), intercept = TRUE)
> {
>     x <- as.matrix(x)
>     xnames <- dimnames(x)[[2]]
>     ynames <- names(y)
>     conv <- FALSE
>     nobs <- NROW(y)
>     nvars <- NCOL(x)
>     if (nvars == 0) {
>         cc <- match.call()
>         cc[[1]] <- as.name("glm.fit.null")
>         return(eval(cc, parent.frame()))
>     }
>     if (is.null(weights))
>         weights <- rep(1, nobs)
>     if (is.null(offset))
>         offset <- rep(0, nobs)
>     variance <- family$variance
>     dev.resids <- family$dev.resids
>     aic <- family$aic
>     linkinv <- family$linkinv
>     mu.eta <- family$mu.eta
>     if (!is.function(variance) || !is.function(linkinv))
>         stop("illegal `family' argument")
>     valideta <- family$valideta
>     if (is.null(valideta))
>         valideta <- function(eta) TRUE
>     validmu <- family$validmu
>     if (is.null(validmu))
>         validmu <- function(mu) TRUE
>     if (is.null(mustart)) {
>         eval(family$initialize)
>     }
>     else {
>         mukeep <- mustart
>         eval(family$initialize)
>         mustart <- mukeep
>     }
>     if (NCOL(y) > 1)
>         stop("y must be univariate unless binomial")
>     if (!is.null(start)){
>       if (length(start) != nvars)
>         stop(paste("Length of start should equal", nvars,
>                    "and correspond to initial coefs for",
deparse(xnames)))
>       coef <- coefold <- start  # initialise coef here in order to
>                                 # be able to subset it below
>     }
>     else
>       coef <- coefold <- rep(0,nvars)
>     eta <- if (!is.null(etastart) && valideta(etastart))
>       etastart
>     else if (!is.null(start))
>       as.vector(if (NCOL(x) == 1)
>                 x * start
>       else x %*% start)
>     else family$linkfun(mustart)
>     mu <- linkinv(eta)
>     if (!(validmu(mu) && valideta(eta)))
>         stop("Can't find valid starting values: please specify
some")
>     devold <- sum(dev.resids(y, mu, weights))
>     boundary <- FALSE
>     for (iter in 1:control$maxit) {
>         good <- weights > 0
>         varmu <- variance(mu)[good]
>         if (any(is.na(varmu)))
>             stop("NAs in V(mu)")
>         if (any(varmu == 0))
>             stop("0s in V(mu)")
>         mu.eta.val <- mu.eta(eta)
>         if (any(is.na(mu.eta.val[good])))
>             stop("NAs in d(mu)/d(eta)")
>         good <- (weights > 0) & (mu.eta.val != 0)
>         if (all(!good)) {
>             conv <- FALSE
>             warning(paste("No observations informative at
iteration",
>                 iter))
>             break
>         }
>         z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
>         w <- sqrt((weights[good] *
mu.eta.val[good]^2)/variance(mu)[good])
>         ngoodobs <- as.integer(nobs - sum(!good))
>         ncols <- as.integer(1)
>         fit <- .Fortran("dqrls", qr = x[good, ] * w, n =
as.integer(ngoodobs),
>             p = nvars, y = w * z, ny = ncols, tol = min(1e-07,
>                 control$epsilon/1000), coefficients = numeric(nvars),
>             residuals = numeric(ngoodobs), effects = numeric(ngoodobs),
>             rank = integer(1), pivot = 1:nvars, qraux = double(nvars),
>             work = double(2 * nvars), PACKAGE = "base")
>         if (nobs < fit$rank)
>             stop(paste("X matrix has rank", fit$rank, "but
only",
>                 nobs, "observations"))
>         coef[fit$pivot] <- fit$coefficients
>         eta <- drop(x %*% coef)
>         mu <- linkinv(eta <- eta + offset)
>         dev <- sum(dev.resids(y, mu, weights))
>         if (control$trace)
>             cat("Deviance =", dev, "Iterations -",
iter, "\n")
>         boundary <- FALSE
>         if (is.na(dev) || any(is.na(coef))) {
>             warning("Step size truncated due to divergence")
>             ii <- 1
>             while ((is.na(dev) || any(is.na(coef)))) {
>                 if (ii > control$maxit)
>                   stop("inner loop 1; can't correct step
size")
>                 ii <- ii + 1
>                 coef <- (coef + coefold)/2
>                 eta <- drop(x %*% coef)
>                 mu <- linkinv(eta <- eta + offset)
>                 dev <- sum(dev.resids(y, mu, weights))
>             }
>             boundary <- TRUE
>             if (control$trace)
>                 cat("New Deviance =", dev, "\n")
>         }
>         if (!(valideta(eta) && validmu(mu))) {
>             warning("Step size truncated: out of bounds.")
>             ii <- 1
>             while (!(valideta(eta) && validmu(mu))) {
>                 if (ii > control$maxit)
>                   stop("inner loop 2; can't correct step
size")
>                 ii <- ii + 1
>                 coef <- (coef + coefold)/2
>                 mu <- linkinv(eta <- eta + offset)
>             }
>             boundary <- TRUE
>             dev <- sum(dev.resids(y, mu, weights))
>             if (control$trace)
>                 cat("New Deviance =", dev, "\n")
>         }
>         if (abs(dev - devold)/(0.1 + abs(dev)) < control$epsilon) {
>             conv <- TRUE
>             break
>         }
>         else {
>             devold <- dev
>             coefold <- coef
>         }
>     }
>     if (!conv)
>         warning("Algorithm did not converge")
>     if (boundary)
>         warning("Algorithm stopped at boundary value")
>     eps <- 10 * .Machine$double.eps
>     if (family$family == "binomial") {
>         if (any(mu > 1 - eps) || any(mu < eps))
>             warning("fitted probabilities numerically 0 or 1
occurred")
>     }
>     if (family$family == "poisson") {
>         if (any(mu < eps))
>             warning("fitted rates numerically 0 occurred")
>     }
>     if (fit$rank != nvars) {
>         coef[fit$pivot][seq(fit$rank + 1, nvars)] <- NA
>     }
>     xxnames <- xnames[fit$pivot]
>     residuals <- rep(NA, nobs)
>     residuals[good] <- z - (eta - offset)[good]
>     fit$qr <- as.matrix(fit$qr)
>     nr <- min(sum(good), nvars)
>     if (nr < nvars) {
>         Rmat <- diag(nvars)
>         Rmat[1:nr, 1:nvars] <- fit$qr[1:nr, 1:nvars]
>     }
>     else Rmat <- fit$qr[1:nvars, 1:nvars]
>     Rmat <- as.matrix(Rmat)
>     Rmat[row(Rmat) > col(Rmat)] <- 0
>     names(coef) <- xnames
>     colnames(fit$qr) <- xxnames
>     dimnames(Rmat) <- list(xxnames, xxnames)
>     names(residuals) <- ynames
>     names(mu) <- ynames
>     names(eta) <- ynames
>     wt <- rep(0, nobs)
>     wt[good] <- w^2
>     names(wt) <- ynames
>     names(weights) <- ynames
>     names(y) <- ynames
>     names(fit$effects) <- c(xxnames[seq(fit$rank)], rep("",
sum(good) -
>         fit$rank))
>     wtdmu <- if (intercept)
>         sum(weights * y)/sum(weights)
>     else linkinv(offset)
>     nulldev <- sum(dev.resids(y, wtdmu, weights))
>     n.ok <- nobs - sum(weights == 0)
>     nulldf <- n.ok - as.integer(intercept)
>     resdf <- n.ok - fit$rank
>     aic.model <- aic(y, n, mu, weights, dev) + 2 * fit$rank
>     list(coefficients = coef, residuals = residuals, fitted.values = mu,
>         effects = fit$effects, R = Rmat, rank = fit$rank, qr =
fit[c("qr",
>             "rank", "qraux", "pivot",
"tol")], family = family,
>         linear.predictors = eta, deviance = dev, aic = aic.model,
>         null.deviance = nulldev, iter = iter, weights = wt, prior.weights =
weights,
>         df.residual = resdf, df.null = nulldf, y = y, converged = conv,
>         boundary = boundary)
> }
>
>
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-devel mailing list -- Read
http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To:
r-devel-request@stat.math.ethz.ch
>
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
>
-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To:
r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Seemingly Similar Threads

Search for more possibly parallel threads

R devel - Feb 2002 - Bug in glm.fit? (PR#1331)

[Rd] Bug in glm.fit? (PR#1331)

[Rd] Bug in glm.fit? (PR#1331)

Seemingly Similar Threads