Hi all, I have an issue with the lm() function regarding the listing of the coefficients. My data are below, showing a list of hours (HR) relating to the time spent resting (R) by an individual animal. Simply i want to run a lm() to run in an anova() to see if there is a significant difference in resting between hours. HR R 1 2 0.6666667 2 2 0.4666667 3 2 0.8000000 4 2 0.6333333 5 2 0.7333333 6 2 0.8000000 7 2 0.8666667 8 2 0.7857143 9 2 0.7826087 10 2 0.6666667 11 2 0.9166667 12 2 0.6666667 13 3 0.5294118 14 3 0.8541667 15 3 0.4583333 16 3 0.5882353 17 3 0.9347826 18 3 0.7878788 19 3 0.7857143 20 3 0.6944444 21 3 0.8333333 22 3 0.7450980 23 3 0.9230769 24 3 0.7222222 25 4 0.6571429 26 4 0.7241379 27 4 0.7391304 28 4 0.6571429 29 4 0.8000000 30 4 0.9130435 31 4 0.7187500 32 4 0.8437500 33 4 0.9230769 34 4 0.8571429 35 4 0.8695652 36 4 0.8888889 37 5 0.3333333 38 5 0.5365854 39 5 0.6774194 40 5 0.7142857 41 5 0.6904762 42 5 0.5483871 43 5 0.5952381 44 5 0.4166667 45 5 0.5666667 46 5 0.5952381 47 5 0.7894737 48 5 0.7500000 49 6 0.6268657 50 6 0.7187500 51 6 0.5500000 52 6 0.7164179 53 6 0.7656250 54 6 0.5869565 55 6 0.7164179 56 6 0.7031250 57 6 0.7230769 58 6 0.7462687 59 6 0.9200000 60 6 0.8536585 61 7 0.6379310 62 7 0.5357143 63 7 0.5227273 64 7 0.8000000 65 7 0.6724138 66 7 0.7083333 67 7 0.7241379 68 7 0.6938776 69 7 0.6545455 70 7 0.7931034 71 7 0.7560976 72 7 0.8684211 73 8 0.6727273 74 8 0.6000000 75 8 0.8333333 76 8 0.8181818 77 8 0.7818182 78 8 0.7647059 79 8 0.5818182 80 8 0.5918367 81 8 0.7450980 82 8 0.7818182 83 8 0.8048780 84 8 0.8684211 The script i'm using and output is as follows:> anova(rdayml <- lm(R ~ HR, data=rdata2, na.action=na.exclude))Analysis of Variance Table Response: R Df Sum Sq Mean Sq F value Pr(>F) HR 6 0.25992 0.04332 3.1762 0.00774 ** Residuals 77 1.05021 0.01364 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1> > summary(rdayml <- lm(R ~ HR,data=rdata2))Call: lm(formula = R ~ HR, data = rdata2) Residuals: Min 1Q Median 3Q Max -0.279725 -0.065416 0.005593 0.077486 0.201070 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.732082 0.033713 21.715 <2e-16 *** HR3 0.005976 0.047678 0.125 0.9006 HR4 0.067232 0.047678 1.410 0.1625 HR5 -0.130935 0.047678 -2.746 0.0075 ** HR6 -0.013152 0.047678 -0.276 0.7834 HR7 -0.034807 0.047678 -0.730 0.4676 HR8 0.004971 0.047678 0.104 0.9172 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 0.1168 on 77 degrees of freedom Multiple R-squared: 0.1984, Adjusted R-squared: 0.1359 F-statistic: 3.176 on 6 and 77 DF, p-value: 0.00774 What i really don't understand is why the lm summary lists the hour numbers in the coefficient of the lm, as apposed to just reading HR? On top of that if R does display the data like this then i don't understand why it omits hour 2? If i can get this to work correctly can I use the p value to determine which of the hours is significantly different to the others - so in this example hour 5 is significantly different? Or is it just a case of using the p value from the anova to determine that there is a significant difference between hours (in this case) and use a plot to determine which hour(s) are likely to be the cause? Any help or advice would be most useful! Best wishes, Ross -- View this message in context: http://www.nabble.com/lm-coefficients-output-confusing-tp24958398p24958398.html Sent from the R help mailing list archive at Nabble.com.
On Thu, 13 Aug 2009, Ross Culloch wrote:> > Hi all, > > I have an issue with the lm() function regarding the listing of the > coefficients. My data are below, showing a list of hours (HR) relating to > the time spent resting (R) by an individual animal. Simply i want to run a > lm() to run in an anova() to see if there is a significant difference in > resting between hours.The problem is not with lm(), but with your data. The listing below does not precisely represent your data; when I copy it to the clipboard and then use dat <- read.table("clipboard") lm(R~HR,dat) I get a different result. OTOH, when I use summary(lm(R~as.factor(HR),rdata2)) I recap your results (up to labelling). I suggest you try str( rdata2 ) to get more insight into your data. I suspect that one or more of the values in HR was quoted in your data file or that you used colClasses to determine the class of data in each column and turned HR into a factor. HTH, Chuck> > HR R > 1 2 0.6666667 > 2 2 0.4666667 > 3 2 0.8000000 > 4 2 0.6333333 > 5 2 0.7333333 > 6 2 0.8000000 > 7 2 0.8666667 > 8 2 0.7857143 > 9 2 0.7826087 > 10 2 0.6666667 > 11 2 0.9166667 > 12 2 0.6666667 > 13 3 0.5294118 > 14 3 0.8541667 > 15 3 0.4583333 > 16 3 0.5882353 > 17 3 0.9347826 > 18 3 0.7878788 > 19 3 0.7857143 > 20 3 0.6944444 > 21 3 0.8333333 > 22 3 0.7450980 > 23 3 0.9230769 > 24 3 0.7222222 > 25 4 0.6571429 > 26 4 0.7241379 > 27 4 0.7391304 > 28 4 0.6571429 > 29 4 0.8000000 > 30 4 0.9130435 > 31 4 0.7187500 > 32 4 0.8437500 > 33 4 0.9230769 > 34 4 0.8571429 > 35 4 0.8695652 > 36 4 0.8888889 > 37 5 0.3333333 > 38 5 0.5365854 > 39 5 0.6774194 > 40 5 0.7142857 > 41 5 0.6904762 > 42 5 0.5483871 > 43 5 0.5952381 > 44 5 0.4166667 > 45 5 0.5666667 > 46 5 0.5952381 > 47 5 0.7894737 > 48 5 0.7500000 > 49 6 0.6268657 > 50 6 0.7187500 > 51 6 0.5500000 > 52 6 0.7164179 > 53 6 0.7656250 > 54 6 0.5869565 > 55 6 0.7164179 > 56 6 0.7031250 > 57 6 0.7230769 > 58 6 0.7462687 > 59 6 0.9200000 > 60 6 0.8536585 > 61 7 0.6379310 > 62 7 0.5357143 > 63 7 0.5227273 > 64 7 0.8000000 > 65 7 0.6724138 > 66 7 0.7083333 > 67 7 0.7241379 > 68 7 0.6938776 > 69 7 0.6545455 > 70 7 0.7931034 > 71 7 0.7560976 > 72 7 0.8684211 > 73 8 0.6727273 > 74 8 0.6000000 > 75 8 0.8333333 > 76 8 0.8181818 > 77 8 0.7818182 > 78 8 0.7647059 > 79 8 0.5818182 > 80 8 0.5918367 > 81 8 0.7450980 > 82 8 0.7818182 > 83 8 0.8048780 > 84 8 0.8684211 > > > The script i'm using and output is as follows: > > > anova(rdayml <- lm(R ~ HR, data=rdata2, na.action=na.exclude)) > Analysis of Variance Table > > Response: R > Df Sum Sq Mean Sq F value Pr(>F) > HR 6 0.25992 0.04332 3.1762 0.00774 ** > Residuals 77 1.05021 0.01364 > --- > Signif. codes: 0 ???***??? 0.001 ???**??? 0.01 ???*??? 0.05 ???.??? 0.1 ??? ??? 1 > > > > summary(rdayml <- lm(R ~ HR,data=rdata2)) > > Call: > lm(formula = R ~ HR, data = rdata2) > > Residuals: > Min 1Q Median 3Q Max > -0.279725 -0.065416 0.005593 0.077486 0.201070 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 0.732082 0.033713 21.715 <2e-16 *** > HR3 0.005976 0.047678 0.125 0.9006 > HR4 0.067232 0.047678 1.410 0.1625 > HR5 -0.130935 0.047678 -2.746 0.0075 ** > HR6 -0.013152 0.047678 -0.276 0.7834 > HR7 -0.034807 0.047678 -0.730 0.4676 > HR8 0.004971 0.047678 0.104 0.9172 > --- > Signif. codes: 0 ???***??? 0.001 ???**??? 0.01 ???*??? 0.05 ???.??? 0.1 ??? ??? 1 > > Residual standard error: 0.1168 on 77 degrees of freedom > Multiple R-squared: 0.1984, Adjusted R-squared: 0.1359 > F-statistic: 3.176 on 6 and 77 DF, p-value: 0.00774 > > > What i really don't understand is why the lm summary lists the hour numbers > in the coefficient of the lm, as apposed to just reading HR? On top of that > if R does display the data like this then i don't understand why it omits > hour 2? If i can get this to work correctly can I use the p value to > determine which of the hours is significantly different to the others - so > in this example hour 5 is significantly different? Or is it just a case of > using the p value from the anova to determine that there is a significant > difference between hours (in this case) and use a plot to determine which > hour(s) are likely to be the cause? > > Any help or advice would be most useful! > > Best wishes, > > Ross > > > -- > View this message in context: http://www.nabble.com/lm-coefficients-output-confusing-tp24958398p24958398.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:cberry at tajo.ucsd.edu UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
On 13-Aug-09 20:45:31, Ross Culloch wrote:> > Hi all, > > I have an issue with the lm() function regarding the listing of > the coefficients. My data are below, showing a list of hours (HR) > relating to the time spent resting (R) by an individual animal. > Simply i want to run a lm() to run in an anova() to see if there > is a significant difference in resting between hours. > > HR R > 1 2 0.6666667 > 2 2 0.4666667 > 3 2 0.8000000 > 4 2 0.6333333 > 5 2 0.7333333 > 6 2 0.8000000 > 7 2 0.8666667 > 8 2 0.7857143 > 9 2 0.7826087 > 10 2 0.6666667 > 11 2 0.9166667 > 12 2 0.6666667 > 13 3 0.5294118 > 14 3 0.8541667 > 15 3 0.4583333 > 16 3 0.5882353 > 17 3 0.9347826 > 18 3 0.7878788 > 19 3 0.7857143 > 20 3 0.6944444 > 21 3 0.8333333 > 22 3 0.7450980 > 23 3 0.9230769 > 24 3 0.7222222 > 25 4 0.6571429 > 26 4 0.7241379 > 27 4 0.7391304 > 28 4 0.6571429 > 29 4 0.8000000 > 30 4 0.9130435 > 31 4 0.7187500 > 32 4 0.8437500 > 33 4 0.9230769 > 34 4 0.8571429 > 35 4 0.8695652 > 36 4 0.8888889 > 37 5 0.3333333 > 38 5 0.5365854 > 39 5 0.6774194 > 40 5 0.7142857 > 41 5 0.6904762 > 42 5 0.5483871 > 43 5 0.5952381 > 44 5 0.4166667 > 45 5 0.5666667 > 46 5 0.5952381 > 47 5 0.7894737 > 48 5 0.7500000 > 49 6 0.6268657 > 50 6 0.7187500 > 51 6 0.5500000 > 52 6 0.7164179 > 53 6 0.7656250 > 54 6 0.5869565 > 55 6 0.7164179 > 56 6 0.7031250 > 57 6 0.7230769 > 58 6 0.7462687 > 59 6 0.9200000 > 60 6 0.8536585 > 61 7 0.6379310 > 62 7 0.5357143 > 63 7 0.5227273 > 64 7 0.8000000 > 65 7 0.6724138 > 66 7 0.7083333 > 67 7 0.7241379 > 68 7 0.6938776 > 69 7 0.6545455 > 70 7 0.7931034 > 71 7 0.7560976 > 72 7 0.8684211 > 73 8 0.6727273 > 74 8 0.6000000 > 75 8 0.8333333 > 76 8 0.8181818 > 77 8 0.7818182 > 78 8 0.7647059 > 79 8 0.5818182 > 80 8 0.5918367 > 81 8 0.7450980 > 82 8 0.7818182 > 83 8 0.8048780 > 84 8 0.8684211 > > > The script i'm using and output is as follows: > >> anova(rdayml <- lm(R ~ HR, data=rdata2, na.action=na.exclude)) > Analysis of Variance Table > > Response: R > Df Sum Sq Mean Sq F value Pr(>F) > HR 6 0.25992 0.04332 3.1762 0.00774 ** > Residuals 77 1.05021 0.01364 > --- > Signif. codes: 0 ???***??? 0.001 ???**??? 0.01 ???*??? 0.05 ???.??? > 0.1 ??? ??? 1 >> >> summary(rdayml <- lm(R ~ HR,data=rdata2)) > > Call: > lm(formula = R ~ HR, data = rdata2) > > Residuals: > Min > 1Q Median 3Q Max > -0.279725 -0.065416 0.005593 0.077486 0.201070 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 0.732082 0.033713 21.715 <2e-16 *** > HR3 0.005976 0.047678 0.125 0.9006 > HR4 0.067232 0.047678 1.410 0.1625 > HR5 -0.130935 0.047678 -2.746 0.0075 ** > HR6 -0.013152 0.047678 -0.276 0.7834 > HR7 -0.034807 0.047678 -0.730 0.4676 > HR8 0.004971 0.047678 0.104 0.9172 > ---> Signif. codes: 0 ???***??? 0.001 ???**??? .01 ???*??? 0.05???.???> 0.1 ??? ??? 1 > > Residual standard error: 0.1168 on 77 degrees of freedom > Multiple R-squared: 0.1984, Adjusted R-squared: 0.1359 > F-statistic: 3.176 on 6 and 77 DF, p-value: 0.00774 > > > What i really don't understand is why the lm summary lists the hour > numbers in the coefficient of the lm, as apposed to just reading HR?What has apparently happened here is that HR has been read in as a factor, nor a numerical covariate (which is probably what you wanted anyway, since you seem to want to compare hours). Hence HR is a factor with 7 levels: "2","3","4","5","6","7","8" (I have wirtten them between "" to show that they will not be interpreted numerically). When you get a coefficient from lm(), e.g. for HR6, this is the coefficient for level "6" of HR, and that is the standard way of writing a factor with its level in the output of ln().> On top of that if R does display the data like this then i don't > understand why it omits hour 2?This is because your have the Intercept in the model, so along with the 7 levels of HR you now have 8 variables present for a model with 7 coefficients. So one has to go (being redundant). Conventionally this is the first level of the factor (though you can arrange it differently if you want to). Hence HR2 is, in effect, absorbed into the Intercept. In this case, the Intercept is equal to the mean of the values for Hour 2. If you were to add the coefficint for Intercept to the coefficient for any level of HR, e.g. Intercept HR5 (0.732082) + (-0.130935) = 0.601147 you should find that it is the same as the mean of the values in Hour 5: (0.3333333+0.5365854+0.6774194+0.7142857+0.6904762+0.5483871+ 0.5952381+0.4166667+0.5666667+0.5952381+0.7894737+0.7500000)/12 = 0.6011475 as indeed it is! The coefficient of a given level of HR is here the difference between the mean for that hour and the mean for Hour 2. This depends on the system of contrasts ... by default, lm() uses "treatment contrasts" for factors, which express the difference between each "treatment" and the "base level" treatment. Alternativve systems of cintrasts give different representations.> If i can get this to work correctly > can I use the p value to determine which of the hours is significantly > different to the others - so in this example hour 5 is significantly > different? Or is it just a case of using the p value from theanova> to determine that there is a significant difference between hours > (in this case) and use a plot to determine which hour(s) are likely > to be the cause?Don't trust the individual P-values in the first instance, since with several levels somce are bound to have more extreme P-values than others. However, your anova() test is an overall test for whether one or more of the means jointly depart significantly from the Null Hypothesis that they are all equal. Here, your anova() has given a significant result, so the hour means are not all equal. *Now* you can look at the individual P-values to see where this may be coming from. In the list of coefficients, there is only one candidate for "being different from Intercept", and this is Hour 5. Indeed the P-value for HR5 and the P-value for the anova() are nearly equal (the latter being slightly larger, allowing for the fact that it is 1 out of 7). Hoping this helps! Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 13-Aug-09 Time: 22:53:22 ------------------------------ XFMail ------------------------------
I have answered a similar question just hours ago. Your question either indicates an unfamiliarity with R or, more generally, an unfamiliarity with regression analysis, or both. Your anova indicates that the overall model is significant. That is, your model is an improvement over the null model with just an intercept. Thus, the anova output tells you that there are significant differences between the HR groups. This is also indicated by the F-test and p-value for the F-test in the second model output (the summary.lm). The question why R does not show HR2 or not just only HR indicates the lack of understanding of R or regression. The fact that there are seven coefficient estimates for HR3 through HR8 indicates that the HR variable is coded as a factor. Consequently, R estimates the model using dummy variables for each level of HR, except for the baseline, which is coded zero (this is basically an analysis of variance, not of covariance). By standard, the baseline is the smallest value (or lowest in alphabetical order) of a factor-coded variable, which is HR=2 in your case. Since it is the baseline, there is no estimate for a coefficient for HR2. All other estimated coefficients compare the effect of HR=3 through HR8 relative to the group in which HR=2. If you want to include HR as a numeric variable in the regression (analysis of covariance), you have to code it as.numeric. Note, however, that your results for the factor-coded HR variable indicate that the trend (if any) is not linear. For a comparison of all groups against one another from an analysis of variance, I think there other methods, like the Bonferroni-Dunn test (as a post-hoc test). Best, Daniel ------------------------- cuncta stricte discussurus ------------------------- -----Urspr?ngliche Nachricht----- Von: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] Im Auftrag von Ross Culloch Gesendet: Thursday, August 13, 2009 4:46 PM An: r-help at r-project.org Betreff: [R] lm coefficients output confusing Hi all, I have an issue with the lm() function regarding the listing of the coefficients. My data are below, showing a list of hours (HR) relating to the time spent resting (R) by an individual animal. Simply i want to run a lm() to run in an anova() to see if there is a significant difference in resting between hours. HR R 1 2 0.6666667 2 2 0.4666667 3 2 0.8000000 4 2 0.6333333 5 2 0.7333333 6 2 0.8000000 7 2 0.8666667 8 2 0.7857143 9 2 0.7826087 10 2 0.6666667 11 2 0.9166667 12 2 0.6666667 13 3 0.5294118 14 3 0.8541667 15 3 0.4583333 16 3 0.5882353 17 3 0.9347826 18 3 0.7878788 19 3 0.7857143 20 3 0.6944444 21 3 0.8333333 22 3 0.7450980 23 3 0.9230769 24 3 0.7222222 25 4 0.6571429 26 4 0.7241379 27 4 0.7391304 28 4 0.6571429 29 4 0.8000000 30 4 0.9130435 31 4 0.7187500 32 4 0.8437500 33 4 0.9230769 34 4 0.8571429 35 4 0.8695652 36 4 0.8888889 37 5 0.3333333 38 5 0.5365854 39 5 0.6774194 40 5 0.7142857 41 5 0.6904762 42 5 0.5483871 43 5 0.5952381 44 5 0.4166667 45 5 0.5666667 46 5 0.5952381 47 5 0.7894737 48 5 0.7500000 49 6 0.6268657 50 6 0.7187500 51 6 0.5500000 52 6 0.7164179 53 6 0.7656250 54 6 0.5869565 55 6 0.7164179 56 6 0.7031250 57 6 0.7230769 58 6 0.7462687 59 6 0.9200000 60 6 0.8536585 61 7 0.6379310 62 7 0.5357143 63 7 0.5227273 64 7 0.8000000 65 7 0.6724138 66 7 0.7083333 67 7 0.7241379 68 7 0.6938776 69 7 0.6545455 70 7 0.7931034 71 7 0.7560976 72 7 0.8684211 73 8 0.6727273 74 8 0.6000000 75 8 0.8333333 76 8 0.8181818 77 8 0.7818182 78 8 0.7647059 79 8 0.5818182 80 8 0.5918367 81 8 0.7450980 82 8 0.7818182 83 8 0.8048780 84 8 0.8684211 The script i'm using and output is as follows:> anova(rdayml <- lm(R ~ HR, data=rdata2, na.action=na.exclude))Analysis of Variance Table Response: R Df Sum Sq Mean Sq F value Pr(>F) HR 6 0.25992 0.04332 3.1762 0.00774 ** Residuals 77 1.05021 0.01364 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1> > summary(rdayml <- lm(R ~ HR,data=rdata2))Call: lm(formula = R ~ HR, data = rdata2) Residuals: Min 1Q Median 3Q Max -0.279725 -0.065416 0.005593 0.077486 0.201070 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.732082 0.033713 21.715 <2e-16 *** HR3 0.005976 0.047678 0.125 0.9006 HR4 0.067232 0.047678 1.410 0.1625 HR5 -0.130935 0.047678 -2.746 0.0075 ** HR6 -0.013152 0.047678 -0.276 0.7834 HR7 -0.034807 0.047678 -0.730 0.4676 HR8 0.004971 0.047678 0.104 0.9172 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 0.1168 on 77 degrees of freedom Multiple R-squared: 0.1984, Adjusted R-squared: 0.1359 F-statistic: 3.176 on 6 and 77 DF, p-value: 0.00774 What i really don't understand is why the lm summary lists the hour numbers in the coefficient of the lm, as apposed to just reading HR? On top of that if R does display the data like this then i don't understand why it omits hour 2? If i can get this to work correctly can I use the p value to determine which of the hours is significantly different to the others - so in this example hour 5 is significantly different? Or is it just a case of using the p value from the anova to determine that there is a significant difference between hours (in this case) and use a plot to determine which hour(s) are likely to be the cause? Any help or advice would be most useful! Best wishes, Ross -- View this message in context: http://www.nabble.com/lm-coefficients-output-confusing-tp24958398p24958398.h tml Sent from the R help mailing list archive at Nabble.com. ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi Ted, Firstly, so sorry for the late reply - thank you very much for your detailed and patient answer. It was a big help for me and gave me a big step forward in understanding the lm function! I really appreciate the time you took (and everyone on here takes!), Best wishes, Ross -- View this message in context: http://www.nabble.com/lm-coefficients-output-confusing-tp24958398p25158027.html Sent from the R help mailing list archive at Nabble.com.