thr3ads.net - R help - [R] Help needed in interpreting linear models [Jan 2012]


2012-Jan-13 08:39 UTC

[R] Help needed in interpreting linear models

Dear members of the R-help list,

I sent the email below to the R-SIG-ME list to ask for help in
interpreting some R output from fitted linear models.

Unfortunately, I haven't received any answers yet. As I am not sure whether
my email reached that mailing list, I am asking for help here:



Dear members of the R-SIG-ME list,


I am new to linear models and am struggling to interpret some of the R
output, but I hope to get some advice here.

I created the following dummy data set:

scores <- c(2,6,10,12,14,20)

weight <- c(60,70,80,75,80,85)

height <- c(180,180,190,180,180,180)

The scores of a game/match should depend on the weight of the player
but not on the height.

For me the output of the following two linear models makes sense:
> (lm1 <- summary(lm(scores ~ weight)))
Call:
lm(formula = scores ~ weight)

Residuals:
       1        2        3        4        5        6 
 1.08333 -1.41667 -3.91667  1.33333  0.08333  2.83333 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -38.0833    10.0394  -3.793  0.01921 * 
weight        0.6500     0.1331   4.885  0.00813 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.661 on 4 degrees of freedom
Multiple R-squared: 0.8564,	Adjusted R-squared: 0.8205 
F-statistic: 23.86 on 1 and 4 DF,  p-value: 0.008134 
> 
> (lm2 <- summary(lm(scores ~ height)))
Call:
lm(formula = scores ~ height)

Residuals:
         1          2          3          4          5          6 
-8.800e+00 -4.800e+00  1.377e-14  1.200e+00  3.200e+00  9.200e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.2000   139.6175   0.180    0.866
height       -0.0800     0.7684  -0.104    0.922

Residual standard error: 7.014 on 4 degrees of freedom
Multiple R-squared: 0.002703,	Adjusted R-squared: -0.2466 
F-statistic: 0.01084 on 1 and 4 DF,  p-value: 0.9221 

The p-value of the first model is 0.008134, which makes sense, as scores and
weight are highly correlated, and therefore the scores "can be explained"
very well by the explanatory variable/factor weight. Hence, the R-squared
value is close to 1. For the second model it also makes sense that the
p-value is almost 1 (p = 0.9221), as there is hardly any correlation between
scores and height.
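
As a quick numerical check of the reasoning above, here is a minimal sketch
that reuses the dummy data from this post (the names r_weight, r_height,
r2_weight and r2_height are only illustrative). It shows that in a model with
a single predictor the multiple R-squared is simply the squared Pearson
correlation between the response and that predictor.

```r
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

# Pearson correlations between the response and each predictor
r_weight <- cor(scores, weight)
r_height <- cor(scores, height)

# R-squared as reported by summary() for the two single-predictor models
r2_weight <- summary(lm(scores ~ weight))$r.squared
r2_height <- summary(lm(scores ~ height))$r.squared

# With one predictor, R-squared equals the squared correlation
print(c(cor_squared = r_weight^2, r_squared = r2_weight))
print(c(cor_squared = r_height^2, r_squared = r2_height))
```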

What is not clear to me is the output of my third linear model, which
includes both weight and height.
> (lm3 <- summary(lm(scores ~ weight + height)))
Call:
lm(formula = scores ~ weight + height)

Residuals:
         1          2          3          4          5          6 
 1.189e+00 -1.946e+00 -2.165e-15  4.865e-01 -1.081e+00  1.351e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 49.45946   33.50261   1.476  0.23635   
weight       0.71351    0.08716   8.186  0.00381 **
height      -0.50811    0.19096  -2.661  0.07628 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.677 on 3 degrees of freedom
Multiple R-squared: 0.9573,	Adjusted R-squared: 0.9288 
F-statistic:  33.6 on 2 and 3 DF,  p-value: 0.008833 

It makes sense that the R-squared value is higher when both explanatory
variables/factors are in the linear model: the more variables are added, the
more variance is explained, and therefore the better the fit of the model.
However, I do NOT understand why the p-value of height
(Pr(>|t|) = 0.07628) is now almost significant. I also do NOT understand why
the overall p-value of 0.008833 is less significant than the one from model
lm1, which was 0.008134.
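
The overall p-value is the upper tail of an F distribution whose shape
depends on both degrees of freedom, so a larger F statistic does not by
itself guarantee a smaller p-value. A small sketch, using only the rounded F
statistics and degrees of freedom printed in the two summaries above (the
names p_lm1 and p_lm3 are illustrative):

```r
# F statistics and degrees of freedom copied from the summaries above
p_lm1 <- pf(23.86, df1 = 1, df2 = 4, lower.tail = FALSE)  # scores ~ weight
p_lm3 <- pf(33.6,  df1 = 2, df2 = 3, lower.tail = FALSE)  # scores ~ weight + height

# lm3 has the larger F but fewer residual degrees of freedom
print(c(lm1 = p_lm1, lm3 = p_lm3))
```

Because the inputs are rounded, the results agree with the printed p-values
only up to rounding.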

The low p-value of weight (p = 0.00381) makes sense, as this factor
"explains" the scores very well.



After fitting the three models (lm1, lm2 and lm3), I wanted to compare model
lm1 with lm3 using the anova function to check whether the factor height
significantly improves the model. In other words, I wanted to check whether
adding height to the model helps to explain the scores of the players.

The output of the anova looks as follows (the models are refitted here
without summary(), and lm2 now holds the model with weight and height):
> lm1 <- lm(scores ~ weight)
> 
> lm2 <- lm(scores ~ weight + height)
> 
> anova(lm1,lm2)
Analysis of Variance Table

Model 1: scores ~ weight
Model 2: scores ~ weight + height
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
1      4 28.3333                              
2      3  8.4324  1    19.901 7.0801 0.07628 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In my opinion the p-value should be almost 1 and not close to significance
(0.07), because, as we have seen from the earlier model scores ~ height,
height does not "explain" the scores at all. I thought that a significant
p-value here would mean that the factor height adds significant value to the
model.
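
The anova comparison asks whether height helps once weight is already in the
model, which is a different question from the one answered by
lm(scores ~ height) on its own. When the two nested models differ by a single
term, the F test from anova and the t test of that term in the larger model
are the same test (F equals t squared). A minimal sketch with the data from
this post (fit_small, fit_big and the other names are illustrative):

```r
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

fit_small <- lm(scores ~ weight)
fit_big   <- lm(scores ~ weight + height)

# t statistic and p-value of height in the larger model
coefs    <- summary(fit_big)$coefficients
t_height <- coefs["height", "t value"]
p_height <- coefs["height", "Pr(>|t|)"]

# F statistic and p-value from comparing the two nested models
cmp   <- anova(fit_small, fit_big)
f_cmp <- cmp$F[2]
p_cmp <- cmp$`Pr(>F)`[2]

print(c(t_squared = t_height^2, F = f_cmp))
print(c(p_from_t = p_height, p_from_F = p_cmp))
```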


I would be very grateful if anyone could help me in interpreting the R
output.

Best regards

 






--
View this message in context:
http://r.789695.n4.nabble.com/Help-needed-in-interpreting-linear-models-tp4291670p4291670.html
Sent from the R help mailing list archive at Nabble.com.

Petr PIKAL

2012-Jan-13 09:35 UTC


[R] Help needed in interpreting linear models

Hi

This looks to me quite like homework, which the policy of this list is
not to answer.
I am far from being an expert in statistics, so I only express my opinion.
It seems to me that your height variable behaves like a two-level factor,
and the value 190 points to a rather suspicious value in weight if I look at
the plot

plot(scores, weight)
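
To make the remark about the suspicious point concrete, here is a small
sketch using the data from the original post (fit_weight, odd_one and res are
illustrative names). It locates the single player with height 190 and checks
how far that player lies from the simple model with weight only.

```r
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

fit_weight <- lm(scores ~ weight)

# Index of the only observation whose height differs from 180
odd_one <- which(height != 180)

# Residuals of the weight-only model; compare the one at odd_one with the rest
res <- residuals(fit_weight)
print(round(res, 3))
print(odd_one)
print(which.max(abs(res)))
```

Colouring that observation in the plot, for example with
col = ifelse(height == 190, "red", "black"), makes it easy to spot.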

Regards
Petr

> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Petr PIKAL

2012-Jan-13 14:06 UTC


[R] Help needed in interpreting linear models

I forgot to post it to r help.
Petr

Hi
> 
> Hi Petr,
> 
> thanks for your answer.
> 
> First of all, it's not homework. I am a student and need to analyse cancer
> data using linear models.
> I have been looking into this topic for a week now and am still struggling
> to interpret some of the R output, which is why
> I was asking for help here.
> 
> I don't quite understand your answer because the 180/190 values belong to
> height and not to weight. What do you want to
> show with plot(scores,weight)? What I can see from the plot is that there is
> a correlation between the two variables and
> therefore weight "explains" scores.
Yes, but as far as I remember (I do not keep mails, so I cannot see the data
you posted now; Nabble is not available to me), I said that the height value
190 (which was unique; all others were 180, if I remember correctly) points
to a scores/weight pair which lies slightly off the simple linear model

lm(scores~weight)

so it is a kind of outlier from that model. Therefore adding the variable
(height) to the model improves it, and that is why the height variable in
the second model is slightly significant, as you found from anova.
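
One way to see this, sketched here with the data from the original post (all
object names are illustrative): because only one observation has a height
other than 180, the height term can fit that observation exactly, and the
model with weight and height ends up with the same weight slope and the same
residual sum of squares as a weight-only model fitted without that
observation.

```r
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

# Model with both predictors on all six observations
fit_both <- lm(scores ~ weight + height)

# Weight-only model on the five observations with height 180
keep     <- height == 180
fit_drop <- lm(scores[keep] ~ weight[keep])

slope_both <- unname(coef(fit_both)["weight"])
slope_drop <- unname(coef(fit_drop)[2])
rss_both   <- sum(residuals(fit_both)^2)
rss_drop   <- sum(residuals(fit_drop)^2)
rss_weight <- sum(residuals(lm(scores ~ weight))^2)

print(c(slope_both = slope_both, slope_drop = slope_drop))
print(c(rss_weight = rss_weight, rss_both = rss_both, rss_drop = rss_drop))
```

The drop in residual sum of squares from about 28.33 to about 8.43, as shown
in the anova table, is therefore entirely due to that one observation.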

You can also inspect your models by

plot(predict(fit), y.variable)  # fit: fitted lm object, y.variable: observed response
abline(0, 1)                    # line where predicted equals observed
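
For the data in this thread, the generic call above could be filled in as
follows. This is only a sketch: fit and fitted_scores are illustrative names,
and the model with both predictors is chosen as an example.

```r
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

fit <- lm(scores ~ weight + height)
fitted_scores <- predict(fit)

# Predicted against observed values, with the line of perfect agreement
plot(fitted_scores, scores,
     xlab = "Predicted scores", ylab = "Observed scores")
abline(0, 1)

# The same comparison as a table
print(data.frame(observed = scores, predicted = round(fitted_scores, 2)))
```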

The better the model, the closer the points lie to the 0,1 line. Of
course you can use some more formal evaluation (residuals, hatvalues...),
and you can find appropriate literature e.g. on the CRAN web site. These two
are my favourites; however, there are plenty of other sources.

"Using R for Data Analysis and Graphics - Introduction, Examples and
Commentary" by John Maindonald (PDF, data sets and scripts are available
at JM's homepage).
"Practical Regression and Anova using R" by Julian Faraway (PDF, data sets
and scripts are available at the book homepage).
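
As an example of the more formal checks mentioned above, here is a minimal
sketch that tabulates residuals and hat values (leverages) for the weight-only
model and for the model with both predictors, using the data from the original
post. The names fit1, fit3 and diag_tab are illustrative.

```r
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

fit1 <- lm(scores ~ weight)
fit3 <- lm(scores ~ weight + height)

# Residuals and leverages side by side for both models
diag_tab <- data.frame(
  resid_lm1 = residuals(fit1),
  hat_lm1   = hatvalues(fit1),
  resid_lm3 = residuals(fit3),
  hat_lm3   = hatvalues(fit3)
)
print(round(diag_tab, 3))
```

In the larger model the third observation, the only one with height 190, has
a leverage of 1 and a residual of essentially zero, which means the model
reproduces that point exactly.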

Regards
Petr
> 
> Regards
> 
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Help-needed-in-interpreting-linear-models-tp4291670p4291894.html
> Sent from the R help mailing list archive at Nabble.com.
> 
