Dear all,

I'm struggling with weighted least squares, where something that I had assumed to be true appears not to be the case. Take the following data set as an example:

df <- data.frame(x = runif(100, 0, 100))
df$y <- df$x + 1 + rnorm(100, sd=15)

I had expected that:

summary(lm(y ~ x, data=df, weights=rep(2, 100)))
summary(lm(y ~ x, data=rbind(df,df)))

would be equivalent, but they are not. I suspect the difference is how the degrees of freedom is calculated - I had expected it to be sum(weights), but it seems to be sum(weights > 0). This seems unintuitive to me:

summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))

What am I missing? And what is the usual way to do a linear regression when you have aggregated data?

Thanks,

Hadley
See below.

hadley wickham wrote:
> Dear all,
>
> I'm struggling with weighted least squares, where something that I had
> assumed to be true appears not to be the case. Take the following
> data set as an example:
>
> df <- data.frame(x = runif(100, 0, 100))
> df$y <- df$x + 1 + rnorm(100, sd=15)
>
> I had expected that:
>
> summary(lm(y ~ x, data=df, weights=rep(2, 100)))
> summary(lm(y ~ x, data=rbind(df,df)))

You assign weights to different points according to some external quality or reliability measure, not the number of times the data point was measured. Look at the estimates and standard errors of the two models below:

coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) )
            Estimate  Std. Error   t value      Pr(>|t|)
(Intercept) 1.940765  3.30348066   0.587491  5.582252e-01
x           0.982610  0.05893262  16.673448  2.264258e-30

coefficients( summary(f.u <- lm(y ~ x, data=rbind(df,df))) )
            Estimate  Std. Error    t value      Pr(>|t|)
(Intercept) 1.940765  2.32408609  0.8350659  4.046871e-01
x           0.982610  0.04146066 23.6998165  1.012067e-59

You can see that they have the same coefficient estimates, but the second one has smaller variances. The repeated values artificially deflate the variance and thus inflate the precision. This is why you cannot treat replicate data as independent observations.

> would be equivalent, but they are not. I suspect the difference is
> how the degrees of freedom is calculated - I had expected it to be
> sum(weights), but seems to be sum(weights > 0). This seems
> unintuitive to me:
>
> summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
> summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
>
> What am I missing? And what is the usual way to do a linear
> regression when you have aggregated data?

It would be best to use the individual data points instead of aggregated data, as that allows you to estimate the within-group variation as well. If you had individual data points, you could try something like the following. Please check the code, as I am no expert in the area of repeated measures.

x  <- runif(100, 0, 100)
y1 <- x + rnorm(100, mean=1, sd=15)
y2 <- y1 + rnorm(100, sd=5)
df <- data.frame(y=c(y1, y2), x=c(x, x),
                 subject=factor(rep(paste("p", 1:100, sep=""), 2)))

library(nlme)
summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )

Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related material for more information.

Hope this helps.

> Thanks,
>
> Hadley

Regards, Adai
Dear Hadley,

I think that the problem is that the term "weights" has different meanings, which, although they are related, are not quite the same.

The weights used by lm() are (inverse-)"variance weights," reflecting the variances of the errors, with observations that have low-variance errors therefore being accorded greater weight in the resulting WLS regression.

What you have are sometimes called "case weights," and I'm unaware of a general way of handling them in R, although you could regenerate the unaggregated data. As you discovered, you get the same coefficients with case weights as with variance weights, but different standard errors.

Finally, there are "sampling weights," which are inversely proportional to the probability of selection; these are accommodated by the survey package.

To complicate matters, this terminology isn't entirely standard.

I hope this helps,
 John

--------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------

> What am I missing? And what is the usual way to do a linear
> regression when you have aggregated data?
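As a small illustration of the "regenerate the unaggregated data" route mentioned above, the sketch below expands an aggregated table back into individual cases before calling lm(). The data frame 'agg' and its 'freq' column are invented names, used only for illustration:

# Hypothetical aggregated data: one row per distinct (x, y) pair,
# with 'freq' giving how many times that case was observed.
agg <- data.frame(x = c(10, 20, 30),
                  y = c(12, 19, 33),
                  freq = c(5, 3, 2))

# Expand each row 'freq' times to recover the individual cases,
# then fit an ordinary unweighted regression.
expanded <- agg[rep(seq_len(nrow(agg)), times = agg$freq), c("x", "y")]
summary(lm(y ~ x, data = expanded))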
Doubling the length of the data doubles the apparent number of observations. You would expect the standard error to reduce by sqrt(2) (which it just about does, though I'm not clear on why it's not exact here).

Weights are not as simple as they look. You have given all your data the same weight, so the answer is independent of the weights (!). Try again with weights=rep(4, 100) etc. Equal weights simply cancel out in the lm process.

In fact, some linear regression algorithms rescale all weights to sum to 1; in others, weights are scaled to average 1; done 'naturally', the weights simply appear in two places which cancel out in the final covariance matrix calculation (e.g. in the weighted 'residual sd' and in the hessian for the chi-squared function, if I remember correctly).

Bottom line - equal weights make no difference in lm, so choose what you like. 1 is a good number, though.

Steve E
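One quick way to check that constant weights cancel out in lm() is to compare the fits below. This is only a sketch, reusing the simulated data from the original post (the seed value is arbitrary); all three calls should report identical coefficients and standard errors:

# Constant weights cancel out in lm(): identical output expected.
set.seed(1)   # arbitrary seed, for reproducibility only
df <- data.frame(x = runif(100, 0, 100))
df$y <- df$x + 1 + rnorm(100, sd = 15)

coef(summary(lm(y ~ x, data = df)))
coef(summary(lm(y ~ x, data = df, weights = rep(2, 100))))
coef(summary(lm(y ~ x, data = df, weights = rep(4, 100))))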
Hadley,

You asked

> .. what is the usual way to do a linear
> regression when you have aggregated data?

Least squares generally uses inverse variance weighting. For aggregated data fitted as mean values, you just need the variances for the _means_.

So if you have individual means x_i and sd's s_i that arise from aggregated data with n_i observations in group i, the natural weighting is by inverse squared standard error of the mean. The appropriate weight for x_i would then be n_i/(s_i^2). In R, that's n/(s^2), as n and s would be vectors with the same length as x.

If all the groups had the same variance, or nearly so, s is a scalar; if they have the same number of observations, n is a scalar. Of course, if they have the same variance and the same number of observations, they all have the same weight and you needn't weight them at all: see previous posting!

Steve E
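A minimal sketch of that weighting, assuming invented group-level summaries (the data frame 'groups' and its columns are hypothetical: 'x' is the predictor per group, 'ybar' the group mean response, 's' the within-group sd and 'n' the group size):

# Invented group-level data.
groups <- data.frame(
  x    = c(10, 20, 30, 40),
  ybar = c(11.2, 21.5, 30.8, 41.1),
  s    = c(2.1, 1.8, 2.5, 2.0),
  n    = c(8, 12, 5, 10)
)

# Weight each mean by the inverse of its squared standard error:
# 1 / (s^2 / n) = n / s^2.
summary(lm(ybar ~ x, data = groups, weights = n / s^2))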
>>> Adaikalavan Ramasamy <ramasamy at cancer.org.uk> 09/05/2007 01:37:31 >>>
> ..the variance of means of each row in table above is ZERO because
> the individual elements that comprise each row are identical.
> ... Then is it valid then to use lm( y ~ x, weights=freq ) ?

ermmm... probably not, because if that happened I'd strongly suspect we'd substantially violated some assumptions.

We are given a number of groups of identical observations. But we are seeking a solution to a problem that posits an underlying variance. If it's not visible within the groups, where is it? Has it disappeared in numerical precision, or is something else going on?

If we did this regression, we would see identical residuals for all members of a group. That would imply that the variance arises entirely from between-group effects and not at all from within-group effects. To me, that would in turn imply that the number of observations in the group is irrelevant; we should be using unweighted regression on the group 'means' in this situation if we're using least squares at all.

If we genuinely have independent observations and by some coincidence they have the same value within available precision, we might be justified in saying "we can't see the variance within groups, but we can estimate it from the residual variance". That would be equivalent to assuming constant variance, and my n/(s^2) reduces to n except for a scaling factor. Using n alone would then be consistent with one's assumptions, I think.

On the kind of data I get, though (mostly chemical measurement with continuous scales), I'd have considerable difficulty justifying that assumption. And if I didn't have that kind of data (or a reasonable approximation thereto), I'd be wondering whether I should be using linear regression at all.

S
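To see the contrast being discussed, here is a small sketch on invented group-level data (the data frame 'grp' and its 'freq' column are hypothetical: one row per group of identical observations, 'freq' the group size). The two fits will generally give different coefficients and standard errors; the second, unweighted fit corresponds to the regression on group 'means' suggested above:

# Invented data: one row per group of identical observations.
grp <- data.frame(
  x    = c(5, 15, 25, 35, 45),
  y    = c(6.2, 14.1, 27.3, 33.9, 46.5),
  freq = c(3, 10, 2, 7, 4)
)

# Frequency-weighted fit: treats every replicate as informative.
coef(summary(lm(y ~ x, data = grp, weights = freq)))

# Unweighted fit on the group values: appropriate if the variation
# is entirely between groups, as argued above.
coef(summary(lm(y ~ x, data = grp)))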