Hi, I'm sure this is simple, but I haven't been able to find this in TFM,
say I have some data in R like this (pasted here:
http://pastebin.com/raw.php?i=sjS9Zkup):

> head(df)
  gender age smokes disease    Y
1 female  65   ever control 0.18
2 female  77  never control 0.12
3   male  40         state1 0.11
4 female  67   ever control 0.20
5   male  63   ever  state1 0.16
6 female  26  never  state1 0.13

where unique(disease) == c("control", "state1", "state2")
and unique(smokes) == c("ever", "never", "", "current")

I then fit a linear model like:

> model = lm(Y ~ smokes + disease + age + gender, data=df)

And I want to understand the difference between:

> print(summary(model))

Call:
lm(formula = Y ~ smokes + disease + age + gender, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-0.22311 -0.08108 -0.03483  0.05604  0.46507

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.1206825  0.0521368   2.315   0.0211 *
smokescurrent  0.0150641  0.0444466   0.339   0.7348
smokesever     0.0498764  0.0326254   1.529   0.1271
smokesnever    0.0394109  0.0349142   1.129   0.2597
diseasestate1  0.0018739  0.0176817   0.106   0.9157
diseasestate2 -0.0009858  0.0178651  -0.055   0.9560
age            0.0002841  0.0006290   0.452   0.6518
gendermale     0.1164889  0.0128748   9.048   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1257 on 397 degrees of freedom
Multiple R-squared: 0.1933, Adjusted R-squared: 0.1791
F-statistic: 13.59 on 7 and 397 DF,  p-value: 8.975e-16

and:

> anova(model)
Analysis of Variance Table

Response: Y
           Df Sum Sq Mean Sq F value  Pr(>F)
smokes      3 0.1536 0.05120  3.2397 0.02215 *
disease     2 0.0129 0.00647  0.4096 0.66420
age         1 0.0431 0.04310  2.7270 0.09946 .
gender      1 1.2937 1.29373 81.8634 < 2e-16 ***
Residuals 397 6.2740 0.01580
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I understand (hopefully correctly) that anova() tests by adding each covariate
to the model in order it is specified in the formula.

More specific questions are:

1) How do the p-values for smokes* in summary(model) relate to the
   Pr(>F) for smokes in anova
2) what do the p-values for each of those smokes* mean exactly?
3) the summary above shows the values for diseasestate1 and diseasestate2
   how can I get the p-value for diseasecontrol? (or, e.g. genderfemale)

thanks.
On Dec 19, 2011, at 9:09 AM, Brent Pedersen wrote:

> Hi, I'm sure this is simple, but I haven't been able to find this in
> TFM,
> say I have some data in R like this (pasted here:
> http://pastebin.com/raw.php?i=sjS9Zkup):

One of the reasons this is not in TFM is that these are questions that should be answered in any textbook for a first course on regression.

> [snip]
>
> I understand (hopefully correctly) that anova() tests by adding each
> covariate
> to the model in order it is specified in the formula.
>
> More specific questions are:

All of which are general statistics questions, which you are asked to post in forums or lists that expect such questions ... and not to r-help.

> 1) How do the p-values for smokes* in summary(model) relate to the
> Pr(>F) for smokes in anova
> 2) what do the p-values for each of those smokes* mean exactly?
> 3) the summary above shows the values for diseasestate1 and
> diseasestate2
> how can I get the p-value for diseasecontrol? (or, e.g.
> genderfemale)
>
> ^^^^^^^^^^^^^^^^^^^^^^^
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html

-------------------
David Winsemius, MD
West Hartford, CT
On Dec 19, 2011, at 15:09 , Brent Pedersen wrote:

> Hi, I'm sure this is simple, but I haven't been able to find this in TFM,

It's not _that_ simple. You likely need TFtextbook rather than TFM. Most (but not all) will go into at least some detail of coding categorical variables using dummy variables.

> [snip]
>
> I understand (hopefully correctly) that anova() tests by adding each covariate
> to the model in order it is specified in the formula.
>

Yes. Note, however, that categorical variables cause more than one dummy covariate to be added.

> More specific questions are:
>
> 1) How do the p-values for smokes* in summary(model) relate to the
> Pr(>F) for smokes in anova

If the last Pr(>F) corresponds to a single-df term, then F=t^2 for that term (only), and the p value is the same. If the last Pr(>F) is for a k-df term, it corresponds to simultaneously testing that the corresponding k regression coefficients are _all_ zero; the joint p value cannot in general be calculated from tests on individual coefficients. However, they at least test related hypotheses.

p values higher up the list in anova() test hypotheses in models obtained after removing the subsequent terms, so they are not in general comparable to the t tests in summary().

If you use drop1(...., test="F") instead of anova(), then you avoid the sequential aspect and all 1-df tests correspond to t-tests in the summary table.

> 2) what do the p-values for each of those smokes* mean exactly?

In the default parametrization, they correspond to comparisons between the stated level and the reference (first) level of the factor. In different contrast parametrizations, the interpretation will differ; the only complete advice is that you need to understand the relation between the factor levels and the rows of the design matrix.

> 3) the summary above shows the values for diseasestate1 and diseasestate2
> how can I get the p-value for diseasecontrol? (or, e.g. genderfemale)

You can't. It would correspond to a comparison of that level with itself.

>
> thanks.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
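
The points in the reply above can be checked on a small example. What follows is only a sketch under stated assumptions: the data frame is simulated (the pastebin data is not reproduced here, so none of the numbers match the output in the question), and the names d1, tt, a, disease2 and model2 are made up for illustration. The functions used (lm, drop1, anova, update, relevel) are standard ones from base R.

```r
# Simulated stand-in for the poster's data; every value here is made up.
set.seed(1)
n <- 200
df <- data.frame(
  smokes  = factor(sample(c("", "current", "ever", "never"), n, replace = TRUE)),
  disease = factor(sample(c("control", "state1", "state2"), n, replace = TRUE)),
  age     = round(runif(n, 20, 80)),
  gender  = factor(sample(c("female", "male"), n, replace = TRUE))
)
df$Y <- 0.12 + 0.1 * (df$gender == "male") + rnorm(n, sd = 0.12)

model <- lm(Y ~ smokes + disease + age + gender, data = df)
tt <- coef(summary(model))        # coefficient table with the t tests

# Non-sequential F tests: each term is dropped from the full model in turn.
d1 <- drop1(model, test = "F")

# For a 1-df term, F equals t^2 and the p values agree.
all.equal(d1["age", "F value"], tt["age", "t value"]^2)
all.equal(d1["age", "Pr(>F)"],  tt["age", "Pr(>|t|)"])

# In the sequential anova() table, only the last term has this property.
a <- anova(model)
all.equal(a["gender", "F value"], tt["gendermale", "t value"]^2)

# A k-df term (smokes has 3 df) is tested jointly: compare the full model
# with the model that omits the term.
anova(update(model, . ~ . - smokes), model)

# Changing the reference level changes which comparisons summary() reports;
# "control" now gets a row, and "state1" becomes the level without one.
df$disease2 <- relevel(df$disease, ref = "state1")
model2 <- update(model, . ~ . - disease + disease2)
coef(summary(model2))
```

With relevel(), a row for the control level appears only because it is now being compared with state1; whichever level is the reference never gets a row of its own, which is the point of the answer to question 3.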