thr3ads.net - R help - [R] summary vs anova [Dec 2011]

Brent Pedersen

2011-Dec-19 14:09 UTC

[R] summary vs anova

Hi, I'm sure this is simple, but I haven't been able to find this in TFM.
Say I have some data in R like this (pasted here:
http://pastebin.com/raw.php?i=sjS9Zkup):

  > head(df)
    gender age smokes disease    Y
  1 female  65   ever control 0.18
  2 female  77  never control 0.12
  3   male  40         state1 0.11
  4 female  67   ever control 0.20
  5   male  63   ever  state1 0.16
  6 female  26  never  state1 0.13

where unique(disease) == c("control", "state1", "state2")
and unique(smokes) == c("ever", "never", "", "current")

I then fit a linear model like:

    > model = lm(Y ~ smokes + disease + age + gender, data=df)

And I want to understand the difference between:

    > print(summary(model))
    Call:
    lm(formula = Y ~ smokes + disease + age + gender, data = df)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.22311 -0.08108 -0.03483  0.05604  0.46507

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)    0.1206825  0.0521368   2.315   0.0211 *
    smokescurrent  0.0150641  0.0444466   0.339   0.7348
    smokesever     0.0498764  0.0326254   1.529   0.1271
    smokesnever    0.0394109  0.0349142   1.129   0.2597
    diseasestate1  0.0018739  0.0176817   0.106   0.9157
    diseasestate2 -0.0009858  0.0178651  -0.055   0.9560
    age            0.0002841  0.0006290   0.452   0.6518
    gendermale     0.1164889  0.0128748   9.048   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.1257 on 397 degrees of freedom
    Multiple R-squared: 0.1933, Adjusted R-squared: 0.1791
    F-statistic: 13.59 on 7 and 397 DF,  p-value: 8.975e-16


and:

  > anova(model)
  Analysis of Variance Table

  Response: Y
             Df Sum Sq Mean Sq F value  Pr(>F)
  smokes      3 0.1536 0.05120  3.2397 0.02215 *
  disease     2 0.0129 0.00647  0.4096 0.66420
  age         1 0.0431 0.04310  2.7270 0.09946 .
  gender      1 1.2937 1.29373 81.8634 < 2e-16 ***
  Residuals 397 6.2740 0.01580
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I understand (hopefully correctly) that anova() tests by adding each covariate
to the model in the order it is specified in the formula.

More specific questions are:

1) How do the p-values for smokes* in summary(model) relate to the
   Pr(>F) for smokes in anova()?
2) What do the p-values for each of those smokes* mean exactly?
3) The summary above shows the values for diseasestate1 and diseasestate2;
   how can I get the p-value for diseasecontrol (or, e.g., genderfemale)?

thanks.

David Winsemius

2011-Dec-19 15:00 UTC

[R] summary vs anova

On Dec 19, 2011, at 9:09 AM, Brent Pedersen wrote:
> Hi, I'm sure this is simple, but I haven't been able to find this in
> TFM,
> say I have some data in R like this (pasted here:
> http://pastebin.com/raw.php?i=sjS9Zkup):
One of the reasons this is not in TFM is that the answers to these
questions should be available in any textbook for a first course on
regression.
> [snip]
>
> More specific questions are:
All of which are general statistics questions, which you are asked to
post in forums or lists that expect such questions ... and not to r-help.
>
> 1) How do the p-values for smokes* in summary(model) relate to the
>   Pr(>F) for smokes in anova
> 2) what do the p-values for each of those smokes* mean exactly?
> 3) the summary above shows the values for diseasestate1 and diseasestate2
>   how can I get the p-value for diseasecontrol? (or, e.g. genderfemale)
>
>
> ^^^^^^^^^^^^^^^^^^^^^^^
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html

David Winsemius, MD
West Hartford, CT

peter dalgaard

2011-Dec-19 15:18 UTC

[R] summary vs anova

On Dec 19, 2011, at 15:09 , Brent Pedersen wrote:
> Hi, I'm sure this is simple, but I haven't been able to find this in TFM,
It's not _that_ simple. You likely need TFtextbook rather than TFM. Most
(but not all) will go into at least some detail of coding categorical variables
using dummy variables.
> [snip]
> 
> I understand (hopefully correctly) that anova() tests by adding each covariate
> to the model in order it is specified in the formula.
> 
Yes. Note, however, that categorical variables cause more than one dummy
covariate to be added.
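
A minimal sketch of the sequential behaviour, assuming simulated data because the poster's pastebin data is not reproduced here (the data frame, seed and effect sizes are invented for illustration). The Sum Sq that anova() reports for disease is the drop in residual sum of squares when disease is added to a model that already contains smokes, and smokes carries 3 Df because a four-level factor contributes three dummy columns:

```r
# Simulated stand-in for the poster's data (all values are invented).
set.seed(1)
n <- 200
df <- data.frame(
  gender  = factor(sample(c("female", "male"), n, replace = TRUE)),
  age     = round(runif(n, 20, 80)),
  smokes  = factor(sample(c("", "current", "ever", "never"), n, replace = TRUE)),
  disease = factor(sample(c("control", "state1", "state2"), n, replace = TRUE))
)
df$Y <- 0.12 + 0.10 * (df$gender == "male") + rnorm(n, sd = 0.12)

model <- lm(Y ~ smokes + disease + age + gender, data = df)
tab <- anova(model)
print(tab)

# Nested fits that mirror the first two steps of the sequence.
m_smokes         <- lm(Y ~ smokes, data = df)
m_smokes_disease <- lm(Y ~ smokes + disease, data = df)

# deviance() on an lm fit returns the residual sum of squares.
ss_seq    <- tab["disease", "Sum Sq"]
ss_nested <- deviance(m_smokes) - deviance(m_smokes_disease)
print(c(sequential = ss_seq, nested = ss_nested))
```

Note that the F value in each row still uses the residual mean square of the full model as its denominator, not that of the intermediate model.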
> More specific questions are:
> 
> 1) How do the p-values for smokes* in summary(model) relate to the
>   Pr(>F) for smokes in anova
If the last Pr(>F) corresponds to a single-df term, then F=t^2 for that term
(only), and the p value is the same. If the last Pr(>F)  is for a k-df term,
it corresponds to simultaneously testing that the corresponding k regression
coefficients are _all_ zero;  the joint p value can not in general be calculated
from tests on individual coefficients. However, they at least test related
hypotheses.
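
In the poster's output, for instance, gender is the last term, with t = 9.048 in summary() and F = 81.8634 in anova(), and 9.048^2 is about 81.87. The sketch below checks both statements on simulated data (the data frame, seed and effect sizes are invented, since the original data is not reproduced here). For the k-df case it moves smokes to the end of the formula and compares the result with an explicit comparison of the models with and without smokes:

```r
# Simulated stand-in for the poster's data (all values are invented).
set.seed(1)
n <- 200
df <- data.frame(
  gender  = factor(sample(c("female", "male"), n, replace = TRUE)),
  age     = round(runif(n, 20, 80)),
  smokes  = factor(sample(c("", "current", "ever", "never"), n, replace = TRUE)),
  disease = factor(sample(c("control", "state1", "state2"), n, replace = TRUE))
)
df$Y <- 0.12 + 0.10 * (df$gender == "male") + rnorm(n, sd = 0.12)

model <- lm(Y ~ smokes + disease + age + gender, data = df)

# Last term with a single df: F equals t^2 and the p-values agree.
t_gender <- coef(summary(model))["gendermale", "t value"]
F_gender <- anova(model)["gender", "F value"]
p_t      <- coef(summary(model))["gendermale", "Pr(>|t|)"]
p_F      <- anova(model)["gender", "Pr(>F)"]
print(c(t_squared = t_gender^2, F = F_gender))

# Last term with k df: the anova() row is the joint test that all
# smokes* coefficients are zero, i.e. full model vs. model without smokes.
full    <- lm(Y ~ disease + age + gender + smokes, data = df)
reduced <- lm(Y ~ disease + age + gender, data = df)
F_seq   <- anova(full)["smokes", "F value"]
F_joint <- anova(reduced, full)$F[2]
print(c(sequential = F_seq, joint = F_joint))
```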

p values higher up the list in anova() test for hypotheses in models obtained
after removal of subsequent factors, so are not in general comparable to the t
tests in summary().
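
As an illustration of that order dependence, the following sketch fits the same terms in two different orders on simulated data (data frame, seed and effect sizes are invented for the example). The coefficients, and hence the summary() table, do not depend on the order of the terms, while the rows of anova() generally do; only the total of the sequential sums of squares is the same:

```r
# Simulated stand-in for the poster's data (all values are invented).
set.seed(1)
n <- 200
df <- data.frame(
  gender  = factor(sample(c("female", "male"), n, replace = TRUE)),
  age     = round(runif(n, 20, 80)),
  smokes  = factor(sample(c("", "current", "ever", "never"), n, replace = TRUE)),
  disease = factor(sample(c("control", "state1", "state2"), n, replace = TRUE))
)
df$Y <- 0.12 + 0.10 * (df$gender == "male") + rnorm(n, sd = 0.12)

m_first <- lm(Y ~ gender + smokes + disease + age, data = df)  # gender first
m_last  <- lm(Y ~ smokes + disease + age + gender, data = df)  # gender last

# Sequential tables: compare the smokes and gender rows between the two.
print(anova(m_first))
print(anova(m_last))

# Same fitted model either way: coefficients agree once names are matched.
same_coefs <- all.equal(coef(m_first)[names(coef(m_last))], coef(m_last))

# The sequential sums of squares add up to the same total in both orders.
total_first <- sum(anova(m_first)[["Sum Sq"]])
total_last  <- sum(anova(m_last)[["Sum Sq"]])
print(c(total_first = total_first, total_last = total_last))
```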

If you use drop1(...., test="F") instead of anova(), then you avoid
the sequential aspect and all 1-df tests correspond to t-tests in the summary
table.
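
A short sketch of the drop1() alternative, again on simulated data (data frame, seed and effect sizes are invented). Each row tests the removal of that term from the full model, so the single-df rows (age and gender) reproduce the squared t values from summary(), regardless of where the term sits in the formula:

```r
# Simulated stand-in for the poster's data (all values are invented).
set.seed(1)
n <- 200
df <- data.frame(
  gender  = factor(sample(c("female", "male"), n, replace = TRUE)),
  age     = round(runif(n, 20, 80)),
  smokes  = factor(sample(c("", "current", "ever", "never"), n, replace = TRUE)),
  disease = factor(sample(c("control", "state1", "state2"), n, replace = TRUE))
)
df$Y <- 0.12 + 0.10 * (df$gender == "male") + rnorm(n, sd = 0.12)

model <- lm(Y ~ smokes + disease + age + gender, data = df)

# One row per term, each tested against the full model.
d <- drop1(model, test = "F")
print(d)

# Single-df terms: F from drop1() equals t^2 from summary().
t_age    <- coef(summary(model))["age", "t value"]
t_gender <- coef(summary(model))["gendermale", "t value"]
print(c(F_age = d["age", "F value"], t_age_squared = t_age^2))
print(c(F_gender = d["gender", "F value"], t_gender_squared = t_gender^2))
```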
> 2) what do the p-values for each of those smokes* mean exactly?
In the default parametrization, they correspond to comparisons between the
stated level and the reference (first) level of the factor. In different
contrast parametrizations, the interpretation will differ; the only complete
advice is that you need to understand the relation between the factor levels and
the rows of the design matrix.
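
To make the link between factor levels and design-matrix rows concrete, here is a small sketch with a made-up six-row data set (the y values are invented). Under the default treatment contrasts the coefficient for state1 is the difference between the state1 mean and the mean of the reference level control; under sum-to-zero contrasts the columns, and therefore the meaning of the coefficients, change:

```r
# Tiny invented example: one three-level factor and a response.
disease <- factor(c("control", "state1", "state2",
                    "control", "state1", "state2"))
y <- c(0.18, 0.11, 0.13, 0.20, 0.16, 0.12)

# Default (treatment) contrasts: the first level, "control", is the
# reference and gets no column of its own.
X <- model.matrix(~ disease)
print(levels(disease))
print(X)

# In a one-factor model the dummy coefficient is a difference in means.
fit <- lm(y ~ disease)
diff_means <- mean(y[disease == "state1"]) - mean(y[disease == "control"])
print(c(coef = unname(coef(fit)["diseasestate1"]), diff_means = diff_means))

# Sum-to-zero contrasts: different columns, different interpretation.
X_sum <- model.matrix(~ disease, contrasts.arg = list(disease = "contr.sum"))
print(X_sum)
```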
> 3) the summary above shows the values for diseasestate1 and diseasestate2
>   how can I get the p-value for diseasecontrol? (or, e.g. genderfemale)
You can't. It would correspond to a comparison of that level with itself.
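
What can be done is to change which level serves as the reference, for example with relevel(). A sketch on simulated data (data frame, seed and effect sizes are invented): with state1 as the reference, a diseasecontrol row appears, but it is the same control-versus-state1 comparison as before with the sign of the estimate flipped and the same p-value, not a new test of the control level on its own:

```r
# Simulated stand-in for the poster's data (all values are invented).
set.seed(1)
n <- 200
df <- data.frame(
  gender  = factor(sample(c("female", "male"), n, replace = TRUE)),
  age     = round(runif(n, 20, 80)),
  smokes  = factor(sample(c("", "current", "ever", "never"), n, replace = TRUE)),
  disease = factor(sample(c("control", "state1", "state2"), n, replace = TRUE))
)
df$Y <- 0.12 + 0.10 * (df$gender == "male") + rnorm(n, sd = 0.12)

model <- lm(Y ~ smokes + disease + age + gender, data = df)

# Refit with "state1" as the reference level of disease.
df_ref <- df
df_ref$disease <- relevel(df_ref$disease, ref = "state1")
model_ref <- lm(Y ~ smokes + disease + age + gender, data = df_ref)

orig <- coef(summary(model))["diseasestate1", ]      # state1 vs. control
flip <- coef(summary(model_ref))["diseasecontrol", ] # control vs. state1
print(rbind(state1_vs_control = orig, control_vs_state1 = flip))
```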
> 
> thanks.
> 
-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
