William Dunlap
2010-Jul-27 18:48 UTC
[Rd] problem with zero-weighted observations in predict.lm?
In modelling functions some people like to use
a weight of 0 to drop an observation instead of
using a subset value of FALSE. E.g.,
weights=c(0,1,1,...)
instead of
subset=c(FALSE, TRUE, TRUE, ...)
to drop the first observation.
lm() and summary.lm() appear to treat these in the
same way, decrementing the number of degrees of
freedom for each dropped observation. However,
predict.lm() does not treat them the same. It
doesn't seem to diminish the df to account for the
0-weighted observations. E.g., the last printout
from the following script is as follows, where
predw is the prediction from the fit that used
0-weights and preds is from using FALSE's in the
subset argument. Is this difference proper?
predw preds
$fit $fit
fit lwr upr fit lwr upr
1 1.544302 1.389254 1.699350 1 1.544302 1.194879 1.893724
2 1.935504 1.719482 2.151526 2 1.935504 1.448667 2.422341
$se.fit $se.fit
1 2 1 2
0.06723657 0.09367810 0.1097969 0.1529757
$df $df
[1] 8 [1] 3
$residual.scale $residual.scale
[1] 0.1031362 [1] 0.1684207
### start of script ###
data <- data.frame(x=1:10, y=log(1:10))
fitw <- lm(data=data, y~x, weights=as.numeric(x<=5))
fits <- lm(data=data, y~x, subset=x<=5)
fitw$df.residual == 3 && fits$df.residual == 3 # TRUE
identical(coef(fitw), coef(fits)) # TRUE
sumw <- summary(fitw)
sums <- summary(fits)
identical(sumw$df, sums$df) # TRUE
predw <- predict(fitw, interval="confidence",
se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
preds <- predict(fits, interval="confidence",
se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
all.equal(predw, preds) # lots of differences
sideBySide <- function (a, b, argNames)
{
# print objects a and b side by side
oldWidth <- options(width = getOption("width")/2 - 4)
on.exit(options(oldWidth))
if (missing(argNames)) {
argNames <- c(deparse(substitute(a))[1],
deparse(substitute(b))[1])
}
pa <- capture.output(print(a))
pb <- capture.output(print(b))
nlines <- max(length(pa), length(pb))
length(pa) <- nlines
length(pb) <- nlines
pb[is.na(pb)] <- ""
pa[is.na(pa)] <- ""
retval <- cbind(pa, pb, deparse.level = 0)
dimnames(retval) <- list(rep("",nrow(retval)), argNames)
noquote(retval)
}
# lwr, upr, se.fit, df, residual.scale differ
sideBySide(predw, preds)
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
Peter Dalgaard
2010-Jul-27 20:27 UTC
[Rd] problem with zero-weighted observations in predict.lm?
William Dunlap wrote:> In modelling functions some people like to use > a weight of 0 to drop an observation instead of > using a subset value of FALSE. E.g., > weights=c(0,1,1,...) > instead of > subset=c(FALSE, TRUE, TRUE, ...) > to drop the first observation. > > lm() and summary.lm() appear to treat these in the > same way, decrementing the number of degrees of > freedom for each dropped observation. However, > predict.lm() does not treat them the same. It > doesn't seem to diminish the df to account for the > 0-weighted observations. E.g., the last printout > from the following script is as follows, where > predw is the prediction from the fit that used > 0-weights and preds is from using FALSE's in the > subset argument. Is this difference proper?Nice catch. The issue is that the subset fit and the zero-weighted fit are not completely the same. Notice that the residuals vector has different length in the two analyses. With a simplified setup:> length(lm(y~1,weights=w)$residuals)[1] 10> length(lm(y~1,subset=-1)$residuals)[1] 9> w[1] 0 1 1 1 1 1 1 1 1 1 This in turn is what confuses predict.lm because it gets n and residual df from length(object$residuals). summary.lm() uses NROW(Qr$qr), and I suppose that predict.lm should follow suit.> > predw preds > $fit $fit > fit lwr upr fit lwr upr > 1 1.544302 1.389254 1.699350 1 1.544302 1.194879 1.893724 > 2 1.935504 1.719482 2.151526 2 1.935504 1.448667 2.422341 > > $se.fit $se.fit > 1 2 1 2 > 0.06723657 0.09367810 0.1097969 0.1529757 > > $df $df > [1] 8 [1] 3 > > $residual.scale $residual.scale > [1] 0.1031362 [1] 0.1684207 > > ### start of script ### > data <- data.frame(x=1:10, y=log(1:10)) > fitw <- lm(data=data, y~x, weights=as.numeric(x<=5)) > fits <- lm(data=data, y~x, subset=x<=5) > fitw$df.residual == 3 && fits$df.residual == 3 # TRUE > identical(coef(fitw), coef(fits)) # TRUE > > sumw <- summary(fitw) > sums <- summary(fits) > identical(sumw$df, sums$df) # TRUE > > predw <- predict(fitw, interval="confidence", > se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5))) > preds <- predict(fits, interval="confidence", > se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5))) > all.equal(predw, preds) # lots of differences > > sideBySide <- function (a, b, argNames) > { > # print objects a and b side by side > oldWidth <- options(width = getOption("width")/2 - 4) > on.exit(options(oldWidth)) > if (missing(argNames)) { > argNames <- c(deparse(substitute(a))[1], > deparse(substitute(b))[1]) > } > pa <- capture.output(print(a)) > pb <- capture.output(print(b)) > nlines <- max(length(pa), length(pb)) > length(pa) <- nlines > length(pb) <- nlines > pb[is.na(pb)] <- "" > pa[is.na(pa)] <- "" > retval <- cbind(pa, pb, deparse.level = 0) > dimnames(retval) <- list(rep("",nrow(retval)), argNames) > noquote(retval) > } > > # lwr, upr, se.fit, df, residual.scale differ > sideBySide(predw, preds) > > Bill Dunlap > Spotfire, TIBCO Software > wdunlap tibco.com > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel-- Peter Dalgaard Center for Statistics, Copenhagen Business School Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com