William Dunlap
2010-Jul-27 18:48 UTC
[Rd] problem with zero-weighted observations in predict.lm?
In modelling functions some people like to use a weight of 0 to drop an
observation instead of using a subset value of FALSE, e.g.
weights=c(0,1,1,...) instead of subset=c(FALSE,TRUE,TRUE,...) to drop the
first observation.

lm() and summary.lm() appear to treat these in the same way, decrementing
the number of degrees of freedom for each dropped observation. However,
predict.lm() does not treat them the same: it doesn't seem to diminish the
df to account for the 0-weighted observations. E.g., the last printout from
the following script is as follows, where predw is the prediction from the
fit that used 0-weights and preds is from using FALSE's in the subset
argument. Is this difference proper?

          predw                               preds
 $fit                                $fit
        fit      lwr      upr              fit      lwr      upr
 1 1.544302 1.389254 1.699350       1 1.544302 1.194879 1.893724
 2 1.935504 1.719482 2.151526       2 1.935504 1.448667 2.422341

 $se.fit                            $se.fit
          1          2                      1         2
 0.06723657 0.09367810              0.1097969 0.1529757

 $df                                $df
 [1] 8                              [1] 3

 $residual.scale                    $residual.scale
 [1] 0.1031362                      [1] 0.1684207

### start of script ###
data <- data.frame(x=1:10, y=log(1:10))
fitw <- lm(data=data, y~x, weights=as.numeric(x<=5))
fits <- lm(data=data, y~x, subset=x<=5)
fitw$df.residual == 3 && fits$df.residual == 3 # TRUE
identical(coef(fitw), coef(fits)) # TRUE

sumw <- summary(fitw)
sums <- summary(fits)
identical(sumw$df, sums$df) # TRUE

predw <- predict(fitw, interval="confidence",
                 se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
preds <- predict(fits, interval="confidence",
                 se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5)))
all.equal(predw, preds) # lots of differences

sideBySide <- function (a, b, argNames)
{
    # print objects a and b side by side
    oldWidth <- options(width = getOption("width")/2 - 4)
    on.exit(options(oldWidth))
    if (missing(argNames)) {
        argNames <- c(deparse(substitute(a))[1],
                      deparse(substitute(b))[1])
    }
    pa <- capture.output(print(a))
    pb <- capture.output(print(b))
    nlines <- max(length(pa), length(pb))
    length(pa) <- nlines
    length(pb) <- nlines
    pb[is.na(pb)] <- ""
    pa[is.na(pa)] <- ""
    retval <- cbind(pa, pb, deparse.level = 0)
    dimnames(retval) <- list(rep("", nrow(retval)), argNames)
    noquote(retval)
}

# lwr, upr, se.fit, df, residual.scale differ
sideBySide(predw, preds)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
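[Not from the original thread: a workaround sketch under the assumption that se.fit scales linearly with the reported residual.scale, which holds by construction in predict.lm. It rebuilds the interval from se.fit, rescaling to summary()'s sigma (summary.lm handles zero weights correctly, as the script above shows) and taking the t quantile on the fit's own df.residual rather than the df predict.lm reports.]

```r
## Workaround sketch: correct the interval from a 0-weighted fit by hand.
data <- data.frame(x = 1:10, y = log(1:10))
fitw <- lm(y ~ x, data = data, weights = as.numeric(data$x <= 5))

pw <- predict(fitw, newdata = data.frame(x = c(4.5, 5.5)), se.fit = TRUE)
s  <- summary(fitw)$sigma                # correct residual scale (df = 3)
se <- pw$se.fit * s / pw$residual.scale  # rescale se.fit to that sigma
tq <- qt(0.975, df = fitw$df.residual)   # t quantile on 3 df, not 8
cbind(fit = pw$fit, lwr = pw$fit - tq * se, upr = pw$fit + tq * se)
```

The result agrees with the intervals from the subset-based fit. (On an R version where predict.lm has been fixed, the rescaling factor is 1 and the code is a no-op.)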
Peter Dalgaard
2010-Jul-27 20:27 UTC
[Rd] problem with zero-weighted observations in predict.lm?
William Dunlap wrote:
> In modelling functions some people like to use a weight of 0 to drop an
> observation instead of using a subset value of FALSE. [...]
> lm() and summary.lm() appear to treat these in the same way,
> decrementing the number of degrees of freedom for each dropped
> observation. However, predict.lm() does not treat them the same. [...]
> Is this difference proper?

Nice catch. The issue is that the subset fit and the zero-weighted fit are
not completely the same. Notice that the residuals vector has different
length in the two analyses. With a simplified setup:

> length(lm(y~1, weights=w)$residuals)
[1] 10
> length(lm(y~1, subset=-1)$residuals)
[1] 9
> w
 [1] 0 1 1 1 1 1 1 1 1 1

This in turn is what confuses predict.lm, because it gets n and the
residual df from length(object$residuals). summary.lm() uses NROW(Qr$qr),
and I suppose that predict.lm should follow suit.

[quoted script and output trimmed]

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk   Priv: PDalgd at gmail.com
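[Not from the thread: a small sketch of the direction the reply suggests. It is an assumption about the eventual patch, not the actual fix. The point is that lm.wfit drops zero-weight rows before the QR decomposition but pads the residuals vector back to full length, so n should come from the QR, as in summary.lm(), rather than from length(object$residuals).]

```r
## Residual df derived the way summary.lm() does it (NROW of the QR),
## versus the way predict.lm() was doing it (length of residuals).
df_from_qr <- function(object) {
  n <- NROW(object$qr$qr)   # rows that actually entered the fit
  n - object$rank           # residual degrees of freedom
}

d <- data.frame(x = 1:10, y = log(1:10))
fitw <- lm(y ~ x, data = d, weights = as.numeric(d$x <= 5))
df_from_qr(fitw)                     # 3, agreeing with fitw$df.residual
length(fitw$residuals) - fitw$rank   # 8, the inflated df seen above
```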