Dear all,

I'm struggling with weighted least squares, where something that I had assumed to be true appears not to be the case. Take the following data set as an example:

df <- data.frame(x = runif(100, 0, 100))
df$y <- df$x + 1 + rnorm(100, sd=15)

I had expected that:

summary(lm(y ~ x, data=df, weights=rep(2, 100)))
summary(lm(y ~ x, data=rbind(df,df)))

would be equivalent, but they are not. I suspect the difference is how the degrees of freedom is calculated - I had expected it to be sum(weights), but it seems to be sum(weights > 0). This seems unintuitive to me:

summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))

What am I missing? And what is the usual way to do a linear regression when you have aggregated data?

Thanks,

Hadley
See below.

hadley wickham wrote:
> Dear all,
>
> I'm struggling with weighted least squares, where something that I had
> assumed to be true appears not to be the case. Take the following
> data set as an example:
>
> df <- data.frame(x = runif(100, 0, 100))
> df$y <- df$x + 1 + rnorm(100, sd=15)
>
> I had expected that:
>
> summary(lm(y ~ x, data=df, weights=rep(2, 100)))
> summary(lm(y ~ x, data=rbind(df,df)))

You assign weights to different points according to some external quality or reliability measure, not the number of times the data point was measured. Look at the estimates and standard errors of the two models below:

coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) )
            Estimate  Std. Error   t value      Pr(>|t|)
(Intercept) 1.940765  3.30348066   0.587491  5.582252e-01
x           0.982610  0.05893262  16.673448  2.264258e-30

coefficients( summary(f.u <- lm(y ~ x, data=rbind(df,df))) )
            Estimate  Std. Error    t value      Pr(>|t|)
(Intercept) 1.940765  2.32408609  0.8350659  4.046871e-01
x           0.982610  0.04146066 23.6998165  1.012067e-59

You can see that they have the same coefficient estimates, but the second one has smaller variances. The repeated values artificially deflate the variance and thus inflate the precision. This is why you cannot treat replicate data as independent observations.

> would be equivalent, but they are not. I suspect the difference is
> how the degrees of freedom is calculated - I had expected it to be
> sum(weights), but seems to be sum(weights > 0). This seems
> unintuitive to me:
>
> summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
> summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
>
> What am I missing? And what is the usual way to do a linear
> regression when you have aggregated data?

It would be best to use the individual data points instead of aggregated data, as that allows you to estimate the within-group variation as well. If you had individual data points, you could try something like the following. Please check the code, as I am no expert in the area of repeated measures.

x  <- runif(100, 0, 100)
y1 <- x + rnorm(100, mean=1, sd=15)
y2 <- y1 + rnorm(100, sd=5)
df <- data.frame(y=c(y1, y2), x=c(x, x),
                 subject=factor(rep(paste("p", 1:100, sep=""), 2)))

library(nlme)
summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )

Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related material for more information.

Hope this helps.

> Thanks,
>
> Hadley

Regards, Adai
Dear Hadley,

I think that the problem is that the term "weights" has different meanings, which, although they are related, are not quite the same.

The weights used by lm() are (inverse-)"variance weights," reflecting the variances of the errors, with observations that have low-variance errors therefore being accorded greater weight in the resulting WLS regression.

What you have are sometimes called "case weights," and I'm unaware of a general way of handling them in R, although you could regenerate the unaggregated data. As you discovered, you get the same coefficients with case weights as with variance weights, but different standard errors.

Finally, there are "sampling weights," which are inversely proportional to the probability of selection; these are accommodated by the survey package.

To complicate matters, this terminology isn't entirely standard.

I hope this helps,
 John

--------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------

> What am I missing? And what is the usual way to do a linear
> regression when you have aggregated data?
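As a small illustration of the "regenerate the unaggregated data" route mentioned above, the sketch below expands an aggregated table back into individual cases before calling lm(). The data frame 'agg' and its 'freq' column are invented names, used only for illustration:

# Hypothetical aggregated data: one row per distinct (x, y) pair,
# with 'freq' giving how many times that case was observed.
agg <- data.frame(x = c(10, 20, 30),
                  y = c(12, 19, 33),
                  freq = c(5, 3, 2))

# Expand each row 'freq' times to recover the individual cases,
# then fit an ordinary unweighted regression.
expanded <- agg[rep(seq_len(nrow(agg)), times = agg$freq), c("x", "y")]
summary(lm(y ~ x, data = expanded))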
Doubling the length of the data doubles the apparent number of observations. You would expect the standard error to reduce by sqrt(2) (which it just about does, though I'm not clear on why it's not exact here).

Weights are not as simple as they look. You have given all your data the same weight, so the answer is independent of the weights (!). Try again with weights=rep(4, 100) etc. Equal weights simply cancel out in the lm process.

In fact, some linear regression algorithms rescale all weights to sum to 1; in others, weights are scaled to average 1; done 'naturally', the weights simply appear in two places which cancel out in the final covariance matrix calculation (e.g. in the weighted 'residual sd' and in the hessian for the chi-squared function, if I remember correctly).

Bottom line - equal weights make no difference in lm, so choose what you like. 1 is a good number, though.

Steve E
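One quick way to check that constant weights cancel out in lm() is to compare the fits below. This is only a sketch, reusing the simulated data from the original post (the seed value is arbitrary); all three calls should report identical coefficients and standard errors:

# Constant weights cancel out in lm(): identical output expected.
set.seed(1)   # arbitrary seed, for reproducibility only
df <- data.frame(x = runif(100, 0, 100))
df$y <- df$x + 1 + rnorm(100, sd = 15)

coef(summary(lm(y ~ x, data = df)))
coef(summary(lm(y ~ x, data = df, weights = rep(2, 100))))
coef(summary(lm(y ~ x, data = df, weights = rep(4, 100))))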
Hadley,

You asked

> .. what is the usual way to do a linear
> regression when you have aggregated data?

Least squares generally uses inverse variance weighting. For aggregated data fitted as mean values, you just need the variances for the _means_.

So if you have individual means x_i and sd's s_i that arise from aggregated data with n_i observations in group i, the natural weighting is by inverse squared standard error of the mean. The appropriate weight for x_i would then be n_i/(s_i^2). In R, that's n/(s^2), as n and s would be vectors with the same length as x.

If all the groups had the same variance, or nearly so, s is a scalar; if they have the same number of observations, n is a scalar. Of course, if they have the same variance and the same number of observations, they all have the same weight and you needn't weight them at all: see previous posting!

Steve E
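A minimal sketch of that weighting, assuming invented group-level summaries (the data frame 'groups' and its columns are hypothetical: 'x' is the predictor per group, 'ybar' the group mean response, 's' the within-group sd and 'n' the group size):

# Invented group-level data.
groups <- data.frame(
  x    = c(10, 20, 30, 40),
  ybar = c(11.2, 21.5, 30.8, 41.1),
  s    = c(2.1, 1.8, 2.5, 2.0),
  n    = c(8, 12, 5, 10)
)

# Weight each mean by the inverse of its squared standard error:
# 1 / (s^2 / n) = n / s^2.
summary(lm(ybar ~ x, data = groups, weights = n / s^2))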
>>> Adaikalavan Ramasamy <ramasamy at cancer.org.uk> 09/05/2007 01:37:31 >>>
> ..the variance of means of each row in table above is ZERO because
> the individual elements that comprise each row are identical.
> ... Then is it valid then to use lm( y ~ x, weights=freq ) ?

ermmm... probably not, because if that happened I'd strongly suspect we'd substantially violated some assumptions.

We are given a number of groups of identical observations. But we are seeking a solution to a problem that posits an underlying variance. If it's not visible within the groups, where is it? Has it disappeared in numerical precision, or is something else going on?

If we did this regression, we would see identical residuals for all members of a group. That would imply that the variance arises entirely from between-group effects and not at all from within-group effects. To me, that would in turn imply that the number of observations in the group is irrelevant; we should be using unweighted regression on the group 'means' in this situation if we're using least squares at all.

If we genuinely have independent observations and by some coincidence they have the same value within available precision, we might be justified in saying "we can't see the variance within groups, but we can estimate it from the residual variance". That would be equivalent to assuming constant variance, and my n/(s^2) reduces to n except for a scaling factor. Using n alone would then be consistent with one's assumptions, I think.

On the kind of data I get, though (mostly chemical measurement with continuous scales), I'd have considerable difficulty justifying that assumption. And if I didn't have that kind of data (or a reasonable approximation thereto), I'd be wondering whether I should be using linear regression at all.

S
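To see the contrast being discussed, here is a small sketch on invented group-level data (the data frame 'grp' and its 'freq' column are hypothetical: one row per group of identical observations, 'freq' the group size). The two fits will generally give different coefficients and standard errors; the second, unweighted fit corresponds to the regression on group 'means' suggested above:

# Invented data: one row per group of identical observations.
grp <- data.frame(
  x    = c(5, 15, 25, 35, 45),
  y    = c(6.2, 14.1, 27.3, 33.9, 46.5),
  freq = c(3, 10, 2, 7, 4)
)

# Frequency-weighted fit: treats every replicate as informative.
coef(summary(lm(y ~ x, data = grp, weights = freq)))

# Unweighted fit on the group values: appropriate if the variation
# is entirely between groups, as argued above.
coef(summary(lm(y ~ x, data = grp)))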