Hi, everybody,

3 questions about R-square:
---------(1)----------- Does R2 always increase as variables are added?
---------(2)----------- Can R2 be greater than 1?
---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared calculated? It is different from (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))

I will illustrate these problems with the following code:

---------(1)----------- R2 doesn't always increase as variables are added

> x=matrix(rnorm(20),ncol=2)
> y=rnorm(10)
>
> lm=lm(y~1)
> y.hat=rep(1*lm$coefficients,length(y))
> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
[1] 2.646815e-33
>
> lm=lm(y~x-1)
> y.hat=x%*%lm$coefficients
> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
[1] 0.4443356
>
> ################ This is the biggest model, but its R2 is not the biggest, why?
> lm=lm(y~x)
> y.hat=cbind(rep(1,length(y)),x)%*%lm$coefficients
> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
[1] 0.2704789

---------(2)----------- R2 can be greater than 1

> x=rnorm(10)
> y=runif(10)
> lm=lm(y~x-1)
> y.hat=x*lm$coefficients
> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
[1] 3.513865

---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared calculated? It is different from (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))

> x=matrix(rnorm(20),ncol=2)
> xx=cbind(rep(1,10),x)
> y=x%*%c(1,2)+rnorm(10)
> ### r2 calculated by lm(y~x)
> lm=lm(y~x)
> summary(lm)$r.squared
[1] 0.9231062
> ### r2 calculated by lm(y~xx-1)
> lm=lm(y~xx-1)
> summary(lm)$r.squared
[1] 0.9365253
> ### r2 calculated by me
> y.hat=xx%*%lm$coefficients
> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
[1] 0.9231062

Thanks a lot for any cue :)

--
Junjie Li, klijunjie@gmail.com
Undergraduate in DEP of Tsinghua University
On 17/05/2007 7:02 AM, Junjie Li wrote:
> Hi, everybody,
>
> 3 questions about R-square:
> ---------(1)----------- Does R2 always increase as variables are added?
> ---------(2)----------- Can R2 be greater than 1?
> ---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared
> calculated? It is different from
> (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))

You are mixing models that have intercepts with models that don't. When you do that, it doesn't make sense to compare the variability about the mean. The model y = constant + error is not a special case of the model y = slope*x + error.

Duncan Murdoch
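Duncan's point can be checked numerically. A minimal sketch (in Python rather than R, purely so it is self-contained; the data values are made up for illustration): for a fit without an intercept, the "centered" ratio SSR/SST can exceed 1, while the uncentered ratio that summary.lm actually uses for no-intercept models stays between 0 and 1.

```python
# Made-up data: y is nearly constant with a mean far from 0, which is the
# situation where the centered ratio misbehaves for a no-intercept fit.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.9, 0.7, 1.1, 0.8, 1.0]

# OLS slope for the no-intercept model y = b*x: b = sum(x*y)/sum(x*x)
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
y_hat = [b * xi for xi in x]

y_bar = sum(y) / len(y)
SST_centered = sum((yi - y_bar) ** 2 for yi in y)
SSR_centered = sum((fi - y_bar) ** 2 for fi in y_hat)

# Centered ratio SSR/SST: not bounded by 1 when the model has no intercept
r2_centered = SSR_centered / SST_centered

# Uncentered version (what summary.lm computes when there is no intercept):
mss = sum(fi ** 2 for fi in y_hat)
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
r2_uncentered = mss / (mss + rss)

print(r2_centered)    # greater than 1 for this data
print(r2_uncentered)  # always between 0 and 1
```

This reproduces the puzzle in question (2): the value 3.513865 in the original post is the centered ratio applied to a no-intercept fit, which is exactly the mismatch Duncan describes.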
I know that "-1" removes the intercept term. But my question is why the intercept term CANNOT be treated as an ordinary variable, i.e. by placing a column of 1s in the predictor matrix.

If I still want to compare a model with an intercept against one without an intercept in terms of adjusted R2, I now think the strategy is to always use the other definition of R-square (or adjusted R-square), in which r.square=sum((y.hat)^2)/sum((y)^2).

Am I on the right track?

Thanks

Li Junjie

2007/5/19, Paul Lynch <plynchnlm@gmail.com>:
>
> In case you weren't aware, the meaning of the "-1" in y ~ x - 1 is to
> remove the intercept term that would otherwise be implied.
>     --Paul

--
Junjie Li, klijunjie@gmail.com
Undergraduate in DEP of Tsinghua University
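The no-intercept definition proposed above can be sanity-checked numerically. A minimal sketch in Python (self-contained; the data are made up): for an OLS fit without an intercept, the residuals are orthogonal to the fitted values, so sum(yhat^2)/sum(y^2) and 1 - SSE/sum(y^2) agree exactly, which is why both forms appear later in this thread.

```python
# Made-up data for a one-variable no-intercept regression.
x = [1.0, -2.0, 3.0, 0.5, -1.5]
y = [1.2, -1.9, 2.6, 0.9, -1.0]

# OLS slope for y = b*x (no intercept): b = sum(x*y)/sum(x*x)
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
y_hat = [b * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ssy = sum(yi ** 2 for yi in y)  # uncentered total sum of squares

# Two forms of the no-intercept R^2; equal because residuals are
# orthogonal to the fitted values for an OLS fit.
r2_a = sum(fi ** 2 for fi in y_hat) / ssy
r2_b = 1 - sse / ssy

assert abs(r2_a - r2_b) < 1e-9
```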
Hi, Mark

What I want to do, exactly, is compare a model with an intercept against one without an intercept in terms of adjusted R2, since maximizing adjusted R-square is a variable-selection strategy. I know there are other ways to do variable selection, but I really have to try this one.

Thanks.

2007/5/21, Leeds, Mark (IED) <Mark.Leeds@morganstanley.com>:
>
> Hi: You can put in the -1 and then create your own vector of 1's, which
> will be a "variable", but I'm not sure I understand what you want, and I
> don't think others do either, because I didn't see other responses. I
> don't mean to be offensive or rude, but can you explain what you want to
> do more clearly? If you do that, I'm sure you will get more responses.

--
Junjie Li, klijunjie@gmail.com
Undergraduate in DEP of Tsinghua University
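Mark's suggestion (drop the intercept with -1 and supply an explicit column of ones) really does reproduce the same fit as the built-in intercept; only the R^2 convention that summary.lm picks changes, as question (3) in the original post showed. A minimal sketch in Python (self-contained, made-up data) confirming that the two parameterizations of a straight-line fit give identical coefficients:

```python
# Made-up data for a simple straight-line fit y = a + b*x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.3, 2.1, 2.2, 3.9, 4.1]
n = len(x)
Sx, Sy = sum(x), sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

# (1) Textbook centered formulas for the model with a built-in intercept.
x_bar, y_bar = Sx / n, Sy / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
a1 = y_bar - b1 * x_bar

# (2) "Intercept as a variable": design matrix [1, x], solved via the
#     2x2 normal equations (Cramer's rule).
det = n * Sxx - Sx * Sx
a2 = (Sy * Sxx - Sx * Sxy) / det
b2 = (n * Sxy - Sx * Sy) / det

# Same line either way; what differs in R is only the reported r.squared.
assert abs(a1 - a2) < 1e-9 and abs(b1 - b2) < 1e-9
```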
The answer to your question three is that the calculation of r-squared in summary.lm depends on whether or not an intercept is included in the model. (Another part of the reason for your puzzlement is, I think, that you are computing R-squared as SSR/SST, which is only valid when the model has an intercept.)

The code is in summary.lm; here are the relevant excerpts (assuming your model does not have weights):

    r <- z$residuals
    f <- z$fitted
    w <- z$weights
    if (is.null(w)) {
        mss <- if (attr(z$terms, "intercept"))
            sum((f - mean(f))^2)
        else sum(f^2)
        rss <- sum(r^2)
    }
    ...
    ans$r.squared <- mss/(mss + rss)

If you want to compare models with and without an intercept based on R^2, then I suspect it's most appropriate to use the version of R^2 that does not use a mean.

It's also worthwhile thinking about what you are actually doing. I find the most intuitive definition of R^2 (http://en.wikipedia.org/wiki/R_squared) is

    R2 = 1 - SSE / SST

where SSE = sum_i (yhat_i - y_i)^2 (the sum of squared errors in predictions for your model) and SST = sum_i (y_i - mean(y))^2 (the sum of squared errors in predictions for an intercept-only model).

This means that the standard definition of R2 effectively compares your model against an intercept-only model. As the error in predictions goes down, R2 goes up, and the model that uses mean(y) as its prediction (i.e., the intercept-only model) provides a scale for these errors.

If you think or know that the true mean of y is zero, then it may be appropriate to compare against a zero model rather than an intercept-only model (in SST). And if the sample mean of y is quite different from zero, and you compare a no-intercept model against an intercept-only model, then you're going to get results that are not easily interpreted.

Note that a common way of expressing and computing R^2 is as SSR/SST (which you used), where SSR = sum_i (yhat_i - mean(y))^2.
However, this is only valid when the model has an intercept (i.e., SSR/SST = 1 - SSE/SST ONLY when the model has an intercept).

Here are some examples, based on yours:

> set.seed(1)
> data <- data.frame(x1=rnorm(10), x2=rnorm(10), y=rnorm(10), I=1)
>
> lm1 <- lm(y~1, data=data)
> summary(lm1)$r.squared
[1] 0
> y.hat <- fitted(lm1)
> sum((y.hat-mean(data$y))^2)/sum((data$y-mean(data$y))^2)
[1] 5.717795e-33
>
> # model with no intercept
> lm2 <- lm(y~x1+x2-1, data=data)
> summary(lm2)$r.squared
[1] 0.6332317
> y.hat <- fitted(lm2)
> # no-intercept version of R^2 (2 ways to compute)
> 1-sum((y.hat-data$y)^2)/sum((data$y)^2)
[1] 0.6332317
> sum((y.hat)^2)/sum((data$y)^2)
[1] 0.6332317
> # standard (assuming model has intercept) computations for R^2:
> SSE <- sum((y.hat - data$y)^2)
> SST <- sum((data$y - mean(data$y))^2)
> SSR <- sum((y.hat - mean(data$y))^2)
> 1 - SSE/SST
[1] 0.6252577
> # Note that SSR/SST != 1 - SSE/SST (because the model doesn't have an intercept)
> SSR/SST
[1] 0.6616612
>
> # model with intercept included in data
> lm3 <- lm(y~x1+x2+I-1, data=data)
> summary(lm3)$r.squared
[1] 0.6503186
> y.hat <- fitted(lm3)
> # no-intercept version of R^2 (2 ways to compute)
> 1-sum((y.hat-data$y)^2)/sum((data$y)^2)
[1] 0.6503186
> sum((y.hat)^2)/sum((data$y)^2)
[1] 0.6503186
> # standard (assuming model has intercept) computations for R^2:
> SSE <- sum((y.hat - data$y)^2)
> SST <- sum((data$y - mean(data$y))^2)
> SSR <- sum((y.hat - mean(data$y))^2)
> 1 - SSE/SST
[1] 0.6427161
> SSR/SST
[1] 0.6427161

Hope this helps,

Tony Plate

Disclaimer: I too do not have any degrees in statistics, but I'm 95% sure the above is mostly correct :-) If there are any major mistakes, I'm sure someone will point them out.
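To close the loop on question (1): within a class of models that all contain an intercept, R^2 = 1 - SSE/SST cannot decrease when a variable is added, because OLS minimizes SSE over a strictly larger model space. The apparent counterexample in the original post came from mixing intercept and no-intercept models. A small self-contained sketch (in Python, with made-up data) of the within-class comparison:

```python
# Made-up data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.6, 5.2, 5.8]

n = len(y)
y_bar = sum(y) / n
x_bar = sum(x) / n
sst = sum((yi - y_bar) ** 2 for yi in y)

# Intercept-only model: yhat = mean(y), so SSE = SST and R^2 = 0.
r2_null = 1 - sst / sst

# Simple regression WITH intercept: b = Sxy/Sxx, a = ybar - b*xbar.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r2_full = 1 - sse / sst

# Adding x cannot lower R^2 when both models include the intercept.
assert r2_full >= r2_null
```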