Dear members of the R-help list,

I sent the email below to the R-SIG-ME list to ask for help in interpreting some R output of fitted linear models. Unfortunately, I haven't received any answers yet. As I am not sure whether my email reached the mailing list, I am asking for help here:

Dear members of the R-SIG-ME list,

I am new to linear models and am struggling to interpret some of the R output, but I hope to get some advice here.

I created the following dummy data set:

scores <- c(2,6,10,12,14,20)
weight <- c(60,70,80,75,80,85)
height <- c(180,180,190,180,180,180)

The scores of a game/match should depend on the weight of the player but not on the height.

To me, the output of the following two linear models makes sense:

> (lm1 <- summary(lm(scores ~ weight)))

Call:
lm(formula = scores ~ weight)

Residuals:
       1        2        3        4        5        6
 1.08333 -1.41667 -3.91667  1.33333  0.08333  2.83333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.0833    10.0394  -3.793  0.01921 *
weight        0.6500     0.1331   4.885  0.00813 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.661 on 4 degrees of freedom
Multiple R-squared:  0.8564,    Adjusted R-squared:  0.8205
F-statistic: 23.86 on 1 and 4 DF,  p-value: 0.008134

> (lm2 <- summary(lm(scores ~ height)))

Call:
lm(formula = scores ~ height)

Residuals:
         1          2          3          4          5          6
-8.800e+00 -4.800e+00  1.377e-14  1.200e+00  3.200e+00  9.200e+00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.2000   139.6175   0.180    0.866
height       -0.0800     0.7684  -0.104    0.922

Residual standard error: 7.014 on 4 degrees of freedom
Multiple R-squared:  0.002703,  Adjusted R-squared:  -0.2466
F-statistic: 0.01084 on 1 and 4 DF,  p-value: 0.9221

The p-value of the first output is 0.008134, which makes sense as scores and weight are highly correlated and therefore the scores "can be explained" very well by the explanatory variable/factor weight. Hence, the R-squared value is close to 1.

For the second example it also makes sense that the p-value is almost 1 (p=0.9221), as there is hardly any correlation between scores and height.

What is not clear to me is shown in my 3rd linear model, which includes both weight and height.

> (lm3 <- summary(lm(scores ~ weight + height)))

Call:
lm(formula = scores ~ weight + height)

Residuals:
         1          2          3          4          5          6
 1.189e+00 -1.946e+00 -2.165e-15  4.865e-01 -1.081e+00  1.351e+00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.45946   33.50261   1.476  0.23635
weight       0.71351    0.08716   8.186  0.00381 **
height      -0.50811    0.19096  -2.661  0.07628 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.677 on 3 degrees of freedom
Multiple R-squared:  0.9573,    Adjusted R-squared:  0.9288
F-statistic:  33.6 on 2 and 3 DF,  p-value: 0.008833

It makes sense that the R-squared value is higher when both explanatory variables/factors are added to the linear model: the more variables are added, the more variance is explained, and therefore the fit of the model will be better. However, I do NOT understand why the p-value of height (Pr(>|t|) = 0.07628) is now almost significant. I also do NOT understand why the overall p-value of 0.008833 is less significant than the one from model lm1, which was 0.008134.

The low p-value of weight (p=0.00381) makes sense, as this factor "explains" the scores very well.

After fitting the 3 models (lm1, lm2 and lm3) I wanted to compare model lm1 with lm3 using the anova function, to check whether the factor height significantly improves the model. In other words, I wanted to check whether adding height to the model helps to explain the scores of the players.

The output of the anova looks as follows:

> lm1 <- lm(scores ~ weight)
> lm2 <- lm(scores ~ weight + height)
> anova(lm1,lm2)
Analysis of Variance Table

Model 1: scores ~ weight
Model 2: scores ~ weight + height
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)
1      4 28.3333
2      3  8.4324  1    19.901 7.0801 0.07628 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In my opinion the p-value should be almost 1 and not close to significance (0.07), since we have seen from model lm2 that height does not "explain" the scores at all. I thought that a significant p-value here means that the factor height adds significant value to the model.

I would be very grateful if anyone could help me to interpret the R output.

Best regards

--
View this message in context: http://r.789695.n4.nabble.com/Help-needed-in-interpreting-linear-models-tp4291670p4291670.html
Sent from the R help mailing list archive at Nabble.com.
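
The contrast between the height-only model and the two-predictor model in the post above can be probed with a short sketch. It compares the plain correlation between scores and height with the correlation that remains once the part explained by weight has been removed from both. This is only an illustration using the posted vectors; the names r_marginal, r_partial, e_scores, e_height, b_adjusted and b_full are introduced here and do not appear in the original message.

```r
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

# Marginal association: height on its own versus scores.
r_marginal <- cor(scores, height)

# Adjusted association: strip out what weight explains from both
# variables, then correlate what is left over.
e_scores <- resid(lm(scores ~ weight))
e_height <- resid(lm(height ~ weight))
r_partial <- cor(e_scores, e_height)

# The slope of residuals on residuals is the same number as the
# height coefficient in the two-predictor model.
b_adjusted <- unname(coef(lm(e_scores ~ e_height))["e_height"])
b_full <- unname(coef(lm(scores ~ weight + height))["height"])

print(c(marginal = r_marginal, partial = r_partial))
print(c(adjusted = b_adjusted, full = b_full))
```

The t test for height in the two-predictor model concerns the adjusted association, not the marginal one, so a predictor that looks useless on its own can still matter once another predictor is in the model.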
Hi,

This looks rather like homework, which it is the policy of this list not to answer. I am far from being an expert in statistics, so take this only as my opinion.

It seems to me that your height variable behaves like a two-level factor, and the single 190 value points to a rather suspicious value in weight, as I can see in the plot

plot(scores, weight)

Regards
Petr

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
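
The remark about the single 190 value can be checked numerically. The following is a minimal sketch, assuming the three vectors from the original post; the names fit_w, fit_wh, res_w, odd_one and worst are introduced here for illustration only.

```r
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

fit_w <- lm(scores ~ weight)
res_w <- resid(fit_w)

# Which observation has height 190, and which one is fitted worst
# by the weight-only model?
odd_one <- which(height == 190)
worst <- unname(which.max(abs(res_w)))
print(c(odd_one = odd_one, worst = worst))
print(round(res_w, 5))

# Only one observation has height 190, so in the larger model height
# acts as an indicator for that observation and its residual
# becomes (numerically) zero.
fit_wh <- lm(scores ~ weight + height)
print(resid(fit_wh)[odd_one])
```

If the two indices agree, the observation with height 190 is also the one furthest from the weight-only line, which is consistent with height picking up that misfit rather than any general effect of height.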
I forgot to post this to R-help.

Petr

Hi

> Hi Petr,
>
> thanks for your answer.
>
> First of all, it's not homework. I am a student and need to analyse cancer
> data using linear models.
> I have been looking into this topic for a week now and am still struggling
> to interpret some of the R output, which is why I was asking for help here.
>
> I don't quite understand your answer, because the 180/190 values belong to
> height and not to weight. What do you want to show with plot(scores,weight)?
> What I can see from the plot is that there is a correlation between the two
> variables and therefore weight "explains" scores.

Yes, but as far as I remember (I do not keep mails, so I cannot see the data you posted now; Nabble is not available to me) I said that the height value 190 (which was unique; all others were 180, if I remember correctly) points to a scores/weight pair that lies slightly off the simple linear model lm(scores~weight), so it is a kind of outlier from that model. Adding the variable height to the model therefore improves it, and that is why the height variable in the second model is slightly significant, as you found from anova.

You can also inspect your models with

plot(predict(fit), y.variable)
abline(0,1)

The better the model, the closer the points lie to the 0,1 line.

Of course you can use a more formal evaluation (residuals, hatvalues...), and you can find appropriate literature e.g. on the CRAN web site. These two are my favourites, although there are plenty of other sources.

"Using R for Data Analysis and Graphics - Introduction, Examples and Commentary" by John Maindonald (PDF, data sets and scripts are available at JM's homepage).

"Practical Regression and Anova using R" by Julian Faraway (PDF, data sets and scripts are available at the book homepage).

Regards
Petr

> Regards
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Help-needed-in-interpreting-linear-models-tp4291670p4291894.html
> Sent from the R help mailing list archive at Nabble.com.
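
The link between the height row of the two-predictor summary and the anova comparison discussed in this thread can be sketched as follows. For two nested models that differ by a single term, the F statistic of the comparison is the square of that term's t statistic, so both give the same p-value. The names fit_w, fit_wh, t_height, f_height, p_t and p_f are assumptions made for this illustration; the data are the vectors from the original post.

```r
scores <- c(2, 6, 10, 12, 14, 20)
weight <- c(60, 70, 80, 75, 80, 85)
height <- c(180, 180, 190, 180, 180, 180)

fit_w <- lm(scores ~ weight)
fit_wh <- lm(scores ~ weight + height)

# t statistic and p-value for height in the larger model.
tab <- coef(summary(fit_wh))
t_height <- tab["height", "t value"]
p_t <- tab["height", "Pr(>|t|)"]

# F test comparing the nested models.
cmp <- anova(fit_w, fit_wh)
f_height <- cmp[2, "F"]
p_f <- cmp[2, "Pr(>F)"]

print(c(t_squared = t_height^2, F = f_height))
print(c(p_from_t = p_t, p_from_F = p_f))

# Predicted versus observed for the larger model; the plot is only
# drawn in an interactive session.
if (interactive()) {
  plot(predict(fit_wh), scores)
  abline(0, 1)
}
```

Both tests ask whether height helps once weight is already in the model. That is a different question from whether height alone predicts scores, which is what the height-only model tests.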