William Dunlap
2010-Jul-27 18:48 UTC
[Rd] problem with zero-weighted observations in predict.lm?
In modelling functions some people like to use a weight of 0 to drop an
observation instead of using a subset value of FALSE, e.g.
weights=c(0,1,1,...) instead of subset=c(FALSE,TRUE,TRUE,...) to drop the
first observation.

lm() and summary.lm() appear to treat these in the same way, decrementing
the number of degrees of freedom for each dropped observation. However,
predict.lm() does not treat them the same: it doesn't seem to diminish the
df to account for the 0-weighted observations. E.g., the last printout from
the following script is as follows, where predw is the prediction from the
fit that used 0-weights and preds is from using FALSE's in the subset
argument. Is this difference proper?

          predw                               preds
 $fit                                $fit
        fit      lwr      upr              fit      lwr      upr
 1 1.544302 1.389254 1.699350       1 1.544302 1.194879 1.893724
 2 1.935504 1.719482 2.151526       2 1.935504 1.448667 2.422341

 $se.fit                            $se.fit
          1          2                      1         2
 0.06723657 0.09367810              0.1097969 0.1529757

 $df                                $df
 [1] 8                              [1] 3

 $residual.scale                    $residual.scale
 [1] 0.1031362                      [1] 0.1684207

### start of script ###
data <- data.frame(x=1:10, y=log(1:10))
fitw <- lm(data=data, y~x, weights=as.numeric(x<=5))
fits <- lm(data=data, y~x, subset=x<=5)
fitw$df.residual == 3 && fits$df.residual == 3 # TRUE
identical(coef(fitw), coef(fits)) # TRUE

sumw <- summary(fitw)
sums <- summary(fits)
identical(sumw$df, sums$df) # TRUE

predw <- predict(fitw, interval="confidence",
                 se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
preds <- predict(fits, interval="confidence",
                 se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
all.equal(predw, preds) # lots of differences

sideBySide <- function (a, b, argNames)
{
    # print objects a and b side by side
    oldWidth <- options(width = getOption("width")/2 - 4)
    on.exit(options(oldWidth))
    if (missing(argNames)) {
        argNames <- c(deparse(substitute(a))[1],
                      deparse(substitute(b))[1])
    }
    pa <- capture.output(print(a))
    pb <- capture.output(print(b))
    nlines <- max(length(pa), length(pb))
    length(pa) <- nlines
    length(pb) <- nlines
    pb[is.na(pb)] <- ""
    pa[is.na(pa)] <- ""
    retval <- cbind(pa, pb, deparse.level = 0)
    dimnames(retval) <- list(rep("", nrow(retval)), argNames)
    noquote(retval)
}

# lwr, upr, se.fit, df, residual.scale differ
sideBySide(predw, preds)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
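[Not from the original thread: a workaround sketch under the assumption that se.fit scales linearly with the reported residual.scale, which holds by construction in predict.lm. It rebuilds the interval from se.fit, rescaling to summary()'s sigma (summary.lm handles zero weights correctly, as the script above shows) and taking the t quantile on the fit's own df.residual rather than the df predict.lm reports.]

```r
## Workaround sketch: correct the interval from a 0-weighted fit by hand.
data <- data.frame(x = 1:10, y = log(1:10))
fitw <- lm(y ~ x, data = data, weights = as.numeric(data$x <= 5))

pw <- predict(fitw, newdata = data.frame(x = c(4.5, 5.5)), se.fit = TRUE)
s  <- summary(fitw)$sigma                # correct residual scale (df = 3)
se <- pw$se.fit * s / pw$residual.scale  # rescale se.fit to that sigma
tq <- qt(0.975, df = fitw$df.residual)   # t quantile on 3 df, not 8
cbind(fit = pw$fit, lwr = pw$fit - tq * se, upr = pw$fit + tq * se)
```

The result agrees with the intervals from the subset-based fit. (On an R version where predict.lm has been fixed, the rescaling factor is 1 and the code is a no-op.)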
Peter Dalgaard
2010-Jul-27 20:27 UTC
[Rd] problem with zero-weighted observations in predict.lm?
William Dunlap wrote:
> In modelling functions some people like to use a weight of 0 to drop an
> observation instead of using a subset value of FALSE. [...]
> lm() and summary.lm() appear to treat these in the same way,
> decrementing the number of degrees of freedom for each dropped
> observation. However, predict.lm() does not treat them the same. [...]
> Is this difference proper?

Nice catch. The issue is that the subset fit and the zero-weighted fit are
not completely the same. Notice that the residuals vector has different
length in the two analyses. With a simplified setup:

> length(lm(y~1, weights=w)$residuals)
[1] 10
> length(lm(y~1, subset=-1)$residuals)
[1] 9
> w
 [1] 0 1 1 1 1 1 1 1 1 1

This in turn is what confuses predict.lm, because it gets n and the
residual df from length(object$residuals). summary.lm() uses NROW(Qr$qr),
and I suppose that predict.lm should follow suit.

[quoted script and output trimmed]

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk   Priv: PDalgd at gmail.com
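[Not from the thread: a small sketch of the direction the reply suggests. It is an assumption about the eventual patch, not the actual fix. The point is that lm.wfit drops zero-weight rows before the QR decomposition but pads the residuals vector back to full length, so n should come from the QR, as in summary.lm(), rather than from length(object$residuals).]

```r
## Residual df derived the way summary.lm() does it (NROW of the QR),
## versus the way predict.lm() was doing it (length of residuals).
df_from_qr <- function(object) {
  n <- NROW(object$qr$qr)   # rows that actually entered the fit
  n - object$rank           # residual degrees of freedom
}

d <- data.frame(x = 1:10, y = log(1:10))
fitw <- lm(y ~ x, data = d, weights = as.numeric(d$x <= 5))
df_from_qr(fitw)                     # 3, agreeing with fitw$df.residual
length(fitw$residuals) - fitw$rank   # 8, the inflated df seen above
```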