I propose the following enhancements and changes to plot.lm(), the most important of which is the addition of a Residuals vs Leverage plot. (1) A residual versus leverage plot has been added, available by specifying which = 5, and not included as one of the default plots. Contours of Cook's distance are included, by default at values of 0.5 and 1.0. The labeled points, if any, are those with the largest Cook's distances. The parameter cook.levels can be changed as required, to control what contours appear. (2) Remove the word "plot" from the captions for which=2, 3, 4. It is redundant. (3) Now that the pos argument to text() is vectorized, use that in preference to an offset. (4) For which!=4 or 5, by default use pos=4 on the left half of the panel, and pos=2 on the right half of the panel. This prevents labels from appearing outside the plot area, where they can overlap other graphical features. The parameter label.pos allows users to change this default. The modified code that I propose is below. This, a modified .Rd file, and files from diff used with the April 20 development version, are in my directory http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/ I believe the Residual-Leverage plot is given in Krause & Olsen, whether with Cook's distance contours I do not recall. I do not have access to a copy of this book. Martin Maechler drew my attention to it in 2003, as superior to the Cook's distance plot. I have finally got around to coding it up! John Maindonald. "plot.lm" <- function (x, which = 1:4, caption = c("Residuals vs Fitted", "Normal Q-Q", "Scale-Location", "Cook's distance", "Residuals vs Leverage"), panel = points, sub.caption = deparse(x$call), main = "", ask = prod(par("mfcol")) < length(which) && dev.interactive(), ..., id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75, cook.levels=c(0.5, 1.0), label.pos=c(4,2)) { if (!inherits(x, "lm")) stop("Use only with 'lm' objects") if (!is.numeric(which) || any(which < 1) || any(which > 5)) stop("`which' must be in 1:5") isGlm <- inherits(x, "glm") show <- rep(FALSE, 5) show[which] <- TRUE r <- residuals(x) yh <- predict(x) w <- weights(x) if (!is.null(w)) { wind <- w != 0 r <- r[wind] yh <- yh[wind] w <- w[wind] labels.id <- labels.id[wind] } n <- length(r) if (any(show[2:5])) { s <- if (inherits(x, "rlm")) x$s else sqrt(deviance(x)/df.residual(x)) hii <- lm.influence(x, do.coef = FALSE)$hat if (any(show[4:5])) { cook <- if (isGlm)cooks.distance(x) else cooks.distance(x, sd = s, res = r) } } if (any(show[c(2:3,5)])) { ylab23 <- if (isGlm) "Std. deviance resid." else "Standardized residuals" r.w <- if (is.null(w)) r else sqrt(w) * r rs <- r.w/(s * sqrt(1 - hii)) } if (any(show[c(1, 3)])) l.fit <- if (isGlm) "Predicted values" else "Fitted values" if (is.null(id.n)) id.n <- 0 else { id.n <- as.integer(id.n) if (id.n < 0 || id.n > n) stop("`id.n' must be in {1,..,", n, "}") } if (id.n > 0) { if (is.null(labels.id)) labels.id <- paste(1:n) iid <- 1:id.n show.r <- sort.list(abs(r), decreasing = TRUE)[iid] if (any(show[2:3])) show.rs <- sort.list(abs(rs), decreasing = TRUE)[iid] text.id <- function(x, y, ind, adj.x = FALSE){ midx <- mean(range(x)) labpos <- if (!adj.x) 3 else label.pos[1+as.numeric(x>midx)] text(x, y, labels.id[ind], cex = cex.id, xpd = TRUE, pos=labpos, offset=0.25) } } one.fig <- prod(par("mfcol")) == 1 if (ask) { op <- par(ask = TRUE) on.exit(par(op)) } if (show[1]) { ylim <- range(r, na.rm = TRUE) if (id.n > 0) ylim <- ylim + c(-1, 1) * 0.08 * diff(ylim) plot(yh, r, xlab = l.fit, ylab = "Residuals", main = main, ylim = ylim, type = "n", ...) panel(yh, r, ...) if (one.fig) title(sub = sub.caption, ...) mtext(caption[1], 3, 0.25) if (id.n > 0) { y.id <- r[show.r] y.id[y.id < 0] <- y.id[y.id < 0] - strheight(" ")/3 text.id(yh[show.r], y.id, show.r, adj.x = TRUE) } abline(h = 0, lty = 3, col = "gray") } if (show[2]) { ylim <- range(rs, na.rm = TRUE) ylim[2] <- ylim[2] + diff(ylim) * 0.075 qq <- qqnorm(rs, main = main, ylab = ylab23, ylim = ylim, ...) if (one.fig) title(sub = sub.caption, ...) mtext(caption[2], 3, 0.25) if (id.n > 0) text.id(qq$x[show.rs], qq$y[show.rs], show.rs, adj.x = TRUE) } if (show[3]) { sqrtabsr <- sqrt(abs(rs)) ylim <- c(0, max(sqrtabsr, na.rm = TRUE)) yl <- as.expression(substitute(sqrt(abs(YL)), list(YL = as.name(ylab23)))) yhn0 <- if (is.null(w)) yh else yh[w != 0] plot(yhn0, sqrtabsr, xlab = l.fit, ylab = yl, main = main, ylim = ylim, type = "n", ...) panel(yhn0, sqrtabsr, ...) if (one.fig) title(sub = sub.caption, ...) mtext(caption[3], 3, 0.25) if (id.n > 0) text.id(yhn0[show.rs], sqrtabsr[show.rs], show.rs, adj.x = TRUE) } if (show[4]) { if (id.n > 0) { show.r <- order(-cook)[iid] ymx <- cook[show.r[1]] * 1.075 } else ymx <- max(cook) plot(cook, type = "h", ylim = c(0, ymx), main = main, xlab = "Obs. number", ylab = "Cook's distance", ...) if (one.fig) title(sub = sub.caption, ...) mtext(caption[4], 3, 0.25) if (id.n > 0) text.id(show.r, cook[show.r], show.r, adj.x=FALSE) } if (show[5]) { ylim <- range(rs, na.rm = TRUE) hatval <- hatvalues(x) if (id.n > 0) { ylim <- ylim + c(-1, 1) * 0.08 * diff(ylim) show.r <- order(-cook)[iid] } plot(hatval, rs, ylim = ylim, main = main, xlab = "Leverages", ylab = ylab23, type="n", ...) panel(hatval, rs, ...) if (one.fig) title(sub = sub.caption, ...) p <- length(coef(x)) for(crit in cook.levels){ curve(sqrt(crit*p*(1-x)/x), lty=2, add=T) curve(-sqrt(crit*p*(1-x)/x), lty=2, add=T) } xmax <- par()$usr[2] ymult <- sqrt(p*(1-xmax)/xmax) aty <- c(-sqrt(rev(cook.levels))*ymult, sqrt(cook.levels)*ymult) axis(4, at=aty, labels=paste(c(rev(cook.levels), cook.levels)), mgp=c(.25,.25,0), las=2, tck=0, cex.axis=cex.id) mtext(caption[5], 3, 0.25) if (id.n > 0) { y.id <- rs[show.r] y.id[y.id < 0] <- y.id[y.id < 0] - strheight(" ")/3 text(hatval[show.r], y.id, paste(show.r), pos=2, cex=cex.id, offset=0.25) } } if (!one.fig && par("oma")[3] >= 1) mtext(sub.caption, outer = TRUE, cex = 1.25) invisible() } John Maindonald email: john.maindonald@anu.edu.au phone : +61 2 (6125)3473 fax : +61 2(6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.
On 23 Apr 2005, at 12:30, John Maindonald wrote:> I propose the following enhancements and changes to plot.lm(), > the most important of which is the addition of a Residuals vs > Leverage plot. > > (1) A residual versus leverage plot has been added, available > by specifying which = 5, and not included as one of the default > plots. Contours of Cook's distance are included, by default at > values of 0.5 and 1.0. The labeled points, if any, are those with > the largest Cook's distances. The parameter cook.levels can be > changed as required, to control what contours appear. > > (2) Remove the word "plot" from the captions for which=2, 3, 4. > It is redundant. > > (3) Now that the pos argument to text() is vectorized, use that > in preference to an offset. > > (4) For which!=4 or 5, by default use pos=4 on the left half > of the panel, and pos=2 on the right half of the panel. > This prevents labels from appearing outside the plot area, > where they can overlap other graphical features. > The parameter label.pos allows users to change this default. > > The modified code that I propose is below. This, a modified .Rd > file, and files from diff used with the April 20 development version, > are in my directory > > http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/ > > I believe the Residual-Leverage plot is given in Krause & Olsen, > whether with Cook's distance contours I do not recall. I do not > have access to a copy of this book. Martin Maechler drew my > attention to it in 2003, as superior to the Cook's distance plot.Agreed. Alternatively Cook's distance versus leverage/(1-leverage), as on p74 of this book: Statistical Theory and Modelling, In honour of Sir David Cox, FRS. Eds D V Hinkley, N Reid and E J Snell. Chapman and Hall, 1991. In that graph the contours of residual^2 are straight lines through the origin. A small disadvantage is that the sign of the residual is lost. David> I have finally got around to coding it up! > > John Maindonald....
The web page http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/ now includes files: plot.lm.RData: Image for file for plot6.lm, a version of plot.lm in which David Firth's Cook's distance vs leverage/(1-leverage) plot is plot 6. The tick labels are in units of leverage, and the contour labels are in units of absolute values of the standardized residual. plot6.lm.Rd file: A matching help file Comments will be welcome. Another issue, discussed recently on r-help, is that when the model formula is long, the default sub.caption=deparse(x$call) is broken into multiple text elements and overwrites. The only clean and simple way that I can see to handle is to set a default that tests whether the formula is broken into multiple text elements, and if it is then omit it. Users can then use their own imaginative skills, and such suggestions as have been made on r-help, to construct whatever form of labeling best suits their case, their imaginative skills and their coding skills. John Maindonald. On 25 Apr 2005, at 8:00 PM, David Firth wrote:> From: David Firth <d.firth@warwick.ac.uk> > Date: 24 April 2005 10:23:51 PM > To: John Maindonald <john.maindonald@anu.edu.au> > Cc: r-devel@stat.math.ethz.ch > Subject: Re: [Rd] Enhanced version of plot.lm() > > > On 24 Apr 2005, at 05:37, John Maindonald wrote: > >> I'd not like to lose the signs of the residuals. Also, as >> plots 1-3 focus on residuals, there is less of a mental >> leap in moving to residuals vs leverage; residuals vs >> leverage/(1-leverage) would also be in the same spirit. > > Yes, I know what you mean. Mental leaps are a matter of > taste...pitfalls, etc, come to mind. > >> >> Maybe, one way or another, both plots (residuals vs >> a function of leverage, and the plot from Hinkley et al) >> should go in. The easiest way to do this is to add a >> further which=6. I will do this if the consensus is that >> this is the right way to go. In any case, I'll add the >> Hinkley et al reference (author of the contribution that >> includes p.74?) to the draft help page. > > Sorry, I should have given the full reference, which (in BibTeX format > from CIS) is > > @inproceedings{Firt:gene:1991, > author = {Firth, D.}, > title = {Generalized Linear Models}, > year = {1991}, > booktitle = {Statistical Theory and Modelling. In Honour of Sir > David Cox, FRS}, > editor = {Hinkley, D. V. and Reid, N. and Snell, E. J.}, > publisher = {Chapman \& Hall Ltd}, > pages = {55--82}, > keywords = {Analysis of deviance; Likelihood} > } > > David >John Maindonald email: john.maindonald@anu.edu.au phone : +61 2 (6125)3473 fax : +61 2(6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200. [[alternative text/enriched version deleted]]
The web page http://wwwmaths.anu.edu.au/~johnm/r/plot-lm/ now includes files: plot.lm.RData: Image for file for plot6.lm, a version of plot.lm in which David Firth's Cook's distance vs leverage/(1-leverage) plot is plot 6. The tick labels are in units of leverage, and the contour labels are in units of absolute values of the standardized residual. plot6.lm.Rd file: A matching help file Comments will be welcome. Another issue, discussed recently on r-help, is that when the model formula is long, the default sub.caption=deparse(x$call) is broken into multiple text elements and overwrites. The only clean and simple way that I can see to handle is to set a default that tests whether the formula is broken into multiple text elements, and if it is then omit it. Users can then use their own imaginative skills, and such suggestions as have been made on r-help, to construct whatever form of labeling best suits their case, their imaginative skills and their coding skills. John Maindonald. On 25 Apr 2005, at 8:00 PM, David Firth wrote:> From: David Firth <d.firth@warwick.ac.uk> > Date: 24 April 2005 10:23:51 PM > To: John Maindonald <john.maindonald@anu.edu.au> > Cc: r-devel@stat.math.ethz.ch > Subject: Re: [Rd] Enhanced version of plot.lm() > > > On 24 Apr 2005, at 05:37, John Maindonald wrote: > >> I'd not like to lose the signs of the residuals. Also, as >> plots 1-3 focus on residuals, there is less of a mental >> leap in moving to residuals vs leverage; residuals vs >> leverage/(1-leverage) would also be in the same spirit. > > Yes, I know what you mean. Mental leaps are a matter of > taste...pitfalls, etc, come to mind. > >> >> Maybe, one way or another, both plots (residuals vs >> a function of leverage, and the plot from Hinkley et al) >> should go in. The easiest way to do this is to add a >> further which=6. I will do this if the consensus is that >> this is the right way to go. In any case, I'll add the >> Hinkley et al reference (author of the contribution that >> includes p.74?) to the draft help page. > > Sorry, I should have given the full reference, which (in BibTeX format > from CIS) is > > @inproceedings{Firt:gene:1991, > author = {Firth, D.}, > title = {Generalized Linear Models}, > year = {1991}, > booktitle = {Statistical Theory and Modelling. In Honour of Sir > David Cox, FRS}, > editor = {Hinkley, D. V. and Reid, N. and Snell, E. J.}, > publisher = {Chapman \& Hall Ltd}, > pages = {55--82}, > keywords = {Analysis of deviance; Likelihood} > } > > David >John Maindonald email: john.maindonald@anu.edu.au phone : +61 2 (6125)3473 fax : +61 2(6125)5549 Centre for Bioinformation Science, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.