ripley at stats.ox.ac.uk
2006-May-24 05:35 UTC
[Rd] (PR#8877) predict.lm does not have a weights argument for
I am more than 'a little disappointed' that you expect a detailed explanation of the problems with your 'bug' report, especially as you did not provide any explanation yourself as to your reasoning (nor did you provide any credentials nor references).

Note that

1) Your report did not make clear that this was only relevant to prediction intervals, which are not commonly used.

2) Only in some rather special circumstances do weights enter into prediction intervals, and definitely not necessarily the weights used for fitting. Indeed, it seems that to label the variances that do enter as inverse weights would be rather misleading.

3) In a later message you referenced Brown's book, which is dealing with a different model.

The model fitted by lm is

    y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2)

(Row vector x, column vector \beta.) If the observations are from the model, OLS is appropriate, but weighting is used in several scenarios, including:

(a) case weights: w_i = 3 means `I have three observations like (y, x)'

(b) inverse-variance weights, most often an indication that w_i = 1/3 means that y_i is actually the average of 3 observations at x_i.

(c) multiple imputation, where a case with missing values in x is split into say 5 parts, with case weights less than and summing to one.

(d) Heteroscedasticity, where the model is rather

    y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))

And there may well be other scenarios, but those are the most common (in decreasing order) in my experience.

Now, consider prediction intervals. It would be perverse to consider these to be for other than a single future observation at x. In scenarios (a) to (c), R's current behaviour is what is commonly accepted to be correct (and you provide no arguments otherwise). If a future observation has missing values, predict.lm would only be a starting point for multiple imputation.
Even if 'newdata' is not supplied, prediction intervals must apply to new observations, not the existing ones (or the formula used is wrong: perhaps to avoid your confusion they should not be allowed in that case).

Only in case (d), which is a different model, is it appropriate to supply error variances (not weights) for prediction intervals. This is why I marked it for the wishlist. Equally, one might want to specify \sigma^2 for all future observations as being different from the model fitting, as the training data may include other components of variance in their error variances.

On Sat, 20 May 2006, jranke at uni-bremen.de wrote:

> Dear R developers,
>
> I am a little disappointed that my bug report only made it to the
> wishlist, with the argument:
>
> Well, it does not say it has.
> Only relevant to prediction intervals.
>
> predict.lm does calculate prediction intervals for linear models from
> weighted regression, so they should be correct, right?
>
> As far as I can see they are bound to be wrong in almost all cases, if
> no weights for newdata can be given. So the point is that predict.lm
> needs such an argument in order to give correct prediction intervals for
> models from weighted linear regression.
>
> Also, it strikes me that in the absence of a "newdata" argument, the
> weights from the "lm" object need to be taken into account for
> constructing prediction intervals.

Where are the references and arguments?

> My updated proposal fixing both points as well as the help file can be found at:
>
> http://www.uft.uni-bremen.de/chemie/ranke/r-patches/lm.predict.patch

Not found.

> and I wrote up a small demonstration of the problem and my proposed solution:
>
> http://www.uft.uni-bremen.de/chemie/ranke/r-patches/lm.predict.pdf

That example is not a valid use of WLS, as you have the weights depending on the data you are fitting.

> Kind regards,
>
> Johannes Ranke

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
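
The distinction drawn in the message above, between the weights used for fitting and the error variance assumed for a single future observation, can be made concrete with a small sketch. It is written in Python rather than R, it is not the code of predict.lm, and the names `wls_fit`, `prediction_se` and `new_var` as well as the data are hypothetical. It assumes a straight-line model fitted by weighted least squares, and treats the future observation's error variance as a separate input that defaults to the residual variance estimate from the fit.

```python
import math

def wls_fit(x, y, w):
    """Weighted least squares for y = a + b*x, minimising sum(w * resid^2)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    rss = sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    s2 = rss / (len(x) - 2)  # residual variance estimate on n - 2 df
    return {"a": a, "b": b, "s2": s2, "sw": sw, "xbar": xbar, "sxx": sxx}

def prediction_se(fit, x0, new_var=None):
    """Standard error for a single future observation at x0.

    new_var is the assumed error variance of that future observation
    (a hypothetical argument); if omitted, the residual variance
    estimate from the fit is used.
    """
    if new_var is None:
        new_var = fit["s2"]
    # variance of the estimated mean a + b*x0
    var_mean = fit["s2"] * (1.0 / fit["sw"] + (x0 - fit["xbar"]) ** 2 / fit["sxx"])
    return math.sqrt(var_mean + new_var)

# made-up data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.3, 5.8]
w = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]

fit = wls_fit(x, y, w)
se_default = prediction_se(fit, 3.5)                          # future variance = s2
se_supplied = prediction_se(fit, 3.5, new_var=fit["s2"] / 4)  # smaller future variance
print(se_default, se_supplied)
```

A prediction interval would be the fitted value plus or minus a t quantile on n - 2 degrees of freedom times this standard error; the quantile is left out because the Python standard library does not provide one. The fitting weights affect only the estimated mean and its variance, while the width contributed by the new observation comes entirely from `new_var`.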
Peter Dalgaard
2006-May-24 08:14 UTC
[Rd] (PR#8877) predict.lm does not have a weights argument for
ripley at stats.ox.ac.uk writes:

> (a) case weights: w_i = 3 means `I have three observations like (y, x)'
>
> (b) inverse-variance weights, most often an indication that w_i = 1/3
> means that y_i is actually the average of 3 observations at x_i.
>
> (c) multiple imputation, where a case with missing values in x is split
> into say 5 parts, with case weights less than and summing to one.
>
> (d) Heteroscedasticity, where the model is rather
>
> y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))
>
> And there may well be other scenarios, but those are the most common (in
> decreasing order) in my experience.

I'd have (d) higher on the list, but never mind. There's also

(e) Inverse probability weights: Knowing that part of the population is undersampled and wanting results that are compatible with what you would have gotten in a balanced sample. Prototypically: You sample X, taking only a third of those with X > c; find population mean of X, (or univariate regression on some other variable, which is only recorded in the subsample).

(R-bugs stripped from recipients since this doesn't really have anything to do with the purported bug.)

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
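
Scenario (e) can be illustrated with a toy calculation. This is a sketch under stated assumptions, in Python rather than R: the population (the integers 1 to 30), the cutoff `c = 12`, and the use of a systematic every-third selection in place of random sampling are all made up so that the result is deterministic. Units with X > c are kept with probability one third, so each kept unit receives weight 3.

```python
# Hypothetical population and cutoff, for illustration only.
population = list(range(1, 31))
c = 12

# Keep every unit with X <= c, but only every third unit with X > c
# (systematic selection stands in for random sampling here).
below = [v for v in population if v <= c]
above = [v for v in population if v > c]
sample = below + above[::3]

# Inverse probability weights: 1 / (1/3) = 3 for the undersampled part.
weights = [1.0 if v <= c else 3.0 for v in sample]

true_mean = sum(population) / len(population)
naive_mean = sum(sample) / len(sample)
ipw_mean = sum(wi * v for wi, v in zip(weights, sample)) / sum(weights)

print(true_mean, naive_mean, ipw_mean)
```

The unweighted sample mean is pulled towards the oversampled low values, while the weighted mean lands much closer to the population mean, and the weights sum to the population size.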