Kurt Rinehart
2013-Jan-30 16:50 UTC
[R] How does predict() calculate prediction intervals?
For a given linear regression, I wish to find the 2-tailed t-dist probability that Y-hat <= newly observed values. I generate prediction intervals in predict() for plotting, but when I calculate my t-dist probabilities, they don't agree. I have researched the issues with variance of individual predictions and been advised to use the variance formula below (in the code). I presume my variance function differs from that used in predict(). Can someone advise me as to why I cannot reproduce the results of predict()? As a test, I calculated the prediction intervals about Y-hat by hand and compared them to predict(): X <- log10(c(780,54,520,703,1356,1900,2500,741,1500,600)) Y <- log10(c(0.037,0.036,0.026,0.0315,0.028,0.0099,0.00955,0.0405,0.019,0.0245)) dat <- data.frame(cbind(X, Y)) mod <- lm(Y ~ X, data = dat) rm(X,Y) ## model predictions, y.hat, at the new X values pred <- data.frame(predict.lm(mod,newdata = obs, interval = "prediction", level = 0.95)) # 3 subsequent observations obs <- log10(data.frame(cbind(c(40, 200, 450), c(0.3, 0.06, .034)))) names(obs) <- c("X", "Y") ## Calculating t-dist, 2-tailed, 95% prediction intervals for new observations mse <- anova(mod)[2,3] new.x <- obs$X - mean(dat$X) sum.x2 <- sum((dat$X - mean(dat$X))^2) y.hat <- pred$fit var.y.hat <- mse*(1+new.x^2/sum.x2) upr <- y.hat + qt(c(0.975), df = 8) * sqrt(var.y.hat) lwr <- y.hat + qt(c(0.025), df = 8) * sqrt(var.y.hat) hand <- data.frame(cbind(y.hat, lwr, upr)) #The limits are not the same pred hand -- Thank you, Kurt [[alternative HTML version deleted]]
Kurt: You missed including the term 1/n in your var.y.hat calculation. Do that and pred and hand are the same. Brian Brian S. Cade, PhD U. S. Geological Survey Fort Collins Science Center 2150 Centre Ave., Bldg. C Fort Collins, CO 80526-8818 email: brian_cade@usgs.gov tel: 970 226-9326 On Wed, Jan 30, 2013 at 9:50 AM, Kurt Rinehart <kurt.rinehart@uvm.edu>wrote:> For a given linear regression, I wish to find the 2-tailed t-dist > probability that Y-hat <= newly observed values. I generate prediction > intervals in predict() for plotting, but when I calculate my t-dist > probabilities, they don't agree. I have researched the issues with variance > of individual predictions and been advised to use the variance formula > below (in the code). > > I presume my variance function differs from that used in predict(). Can > someone advise me as to why I cannot reproduce the results of predict()? > > As a test, I calculated the prediction intervals about Y-hat by hand and > compared them to predict(): > > X <- log10(c(780,54,520,703,1356,1900,2500,741,1500,600)) > Y <- > log10(c(0.037,0.036,0.026,0.0315,0.028,0.0099,0.00955,0.0405,0.019,0.0245)) > dat <- data.frame(cbind(X, Y)) > mod <- lm(Y ~ X, data = dat) > rm(X,Y) > > ## model predictions, y.hat, at the new X values > pred <- data.frame(predict.lm(mod,newdata = obs, interval = "prediction", > level = 0.95)) > > # 3 subsequent observations > obs <- log10(data.frame(cbind(c(40, 200, 450), c(0.3, 0.06, .034)))) > names(obs) <- c("X", "Y") > > ## Calculating t-dist, 2-tailed, 95% prediction intervals for new > observations > mse <- anova(mod)[2,3] > new.x <- obs$X - mean(dat$X) > sum.x2 <- sum((dat$X - mean(dat$X))^2) > y.hat <- pred$fit > var.y.hat <- mse*(1+new.x^2/sum.x2) > > upr <- y.hat + qt(c(0.975), df = 8) * sqrt(var.y.hat) > lwr <- y.hat + qt(c(0.025), df = 8) * sqrt(var.y.hat) > hand <- data.frame(cbind(y.hat, lwr, upr)) > > #The limits are not the same > pred > hand > > -- > Thank you, > Kurt > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Just look at the code of predict.lm(). It is reasonably perspicuous. In particular look at res.var. cheers, Rolf Turner On 01/31/2013 05:50 AM, Kurt Rinehart wrote:> For a given linear regression, I wish to find the 2-tailed t-dist > probability that Y-hat <= newly observed values. I generate prediction > intervals in predict() for plotting, but when I calculate my t-dist > probabilities, they don't agree. I have researched the issues with variance > of individual predictions and been advised to use the variance formula > below (in the code). > > I presume my variance function differs from that used in predict(). Can > someone advise me as to why I cannot reproduce the results of predict()? > > As a test, I calculated the prediction intervals about Y-hat by hand and > compared them to predict(): > > X <- log10(c(780,54,520,703,1356,1900,2500,741,1500,600)) > Y <- > log10(c(0.037,0.036,0.026,0.0315,0.028,0.0099,0.00955,0.0405,0.019,0.0245)) > dat <- data.frame(cbind(X, Y)) > mod <- lm(Y ~ X, data = dat) > rm(X,Y) > > ## model predictions, y.hat, at the new X values > pred <- data.frame(predict.lm(mod,newdata = obs, interval = "prediction", > level = 0.95)) > > # 3 subsequent observations > obs <- log10(data.frame(cbind(c(40, 200, 450), c(0.3, 0.06, .034)))) > names(obs) <- c("X", "Y") > > ## Calculating t-dist, 2-tailed, 95% prediction intervals for new > observations > mse <- anova(mod)[2,3] > new.x <- obs$X - mean(dat$X) > sum.x2 <- sum((dat$X - mean(dat$X))^2) > y.hat <- pred$fit > var.y.hat <- mse*(1+new.x^2/sum.x2) > > upr <- y.hat + qt(c(0.975), df = 8) * sqrt(var.y.hat) > lwr <- y.hat + qt(c(0.025), df = 8) * sqrt(var.y.hat) > hand <- data.frame(cbind(y.hat, lwr, upr)) > > #The limits are not the same > pred > hand > > -- > Thank you, > Kurt > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > >