Hi All,

I have data with 400 individuals and the following information:

Grade: pass or fail, coded as 1 for pass and 0 for fail
Sex: male or female (coded as 1 for male and 2 for female)
Age
Teaching.method: can be 1, 2, 3

I want to fit a logistic regression where the outcome is Grade (1 = pass, 0 = fail) and the rest of the variables are the regressors.

My question is that I am not sure when to use "factor" for a variable. In my example, Grade, sex, and teaching method are categorical variables coded as stated above. Age is a continuous variable.

I have tried the model both ways. In the first model I put "factor" in front of the categorical variables, but in this case I do not know how to interpret the output. Can someone shed some light on the difference between model1 and model2 and how to interpret them? Below is my output.

Thanks for your help

Call:
glm(formula = factor(Grade) ~ factor(sex) + age + factor(teaching.method),
    family = binomial, data = data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8649  -1.1926   0.7494   1.0091   1.6659

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)               -2.77217    0.82182  -3.373 0.000743 ***
factor(sex)2              -0.34751    0.22960  -1.514 0.130140
age                        0.04544    0.01074   4.230 2.34e-05 ***
factor(teaching.method)2  -0.07125    0.30123  -0.237 0.813023
factor(teaching.method)3   0.50058    0.33087   1.513 0.130303
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 465.18  on 344  degrees of freedom
Residual deviance: 438.91  on 340  degrees of freedom
AIC: 448.91

Number of Fisher Scoring iterations: 4

> model2<-glm(Grade~ sex + age +teaching.method, family=binomial,data=ndata)
> summary(model2)

Call:
glm(formula = Grade ~ sex + age + teaching.method, family = binomial,
    data = ndata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7959  -1.2122   0.7547   1.0043   1.5791

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      -2.83988    0.94749  -2.997  0.00272 **
sex              -0.33361    0.22867  -1.459  0.14458
age               0.04432    0.01065   4.160 3.18e-05 ***
teaching.method   0.28017    0.16181   1.731  0.08336 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 465.18  on 344  degrees of freedom
Residual deviance: 440.85  on 341  degrees of freedom
AIC: 448.85

Number of Fisher Scoring iterations: 4

--
View this message in context: http://www.nabble.com/what-is-the-difference-between-the-two-logistic-models--tp24943440p24943440.html
Sent from the R help mailing list archive at Nabble.com.
Daniel Malter
2009-Aug-13 06:07 UTC
[R] what is the difference between the two logistic models?
As I wrote in my previous email, you need to pick up a methods book that gives an introduction to regression analysis.

Using factors in R means using dummy variable coding. The coefficients estimated in your model using factors indicate the effect of teaching.method = 2 in comparison to teaching.method = 1, and the effect of teaching.method = 3 in comparison to teaching.method = 1. Level 1 is the reference level.

Using the linear term, as you do in your second model, is definitely wrong for teaching.method, unless the teaching methods differ only in the hours taught. The model says that as teaching.method increases by 1, the linear predictor in the logistic model (the log-odds of passing) increases by 0.28. This is obviously bogus if the values assigned to the levels of teaching.method carry no information about quantitative differences between the teaching methods.

To give a very vivid example: say a table is green, red, or brown and you assign the values 1, 2, and 3 to the colors. What do the numbers tell you? Nothing! The differences between green, red, and brown tables are qualitative. Therefore, the numeric differences in the coding of the color variable are non-informative. You cannot use such variables as linear terms in a regression model.

In your previous post, it seemed that teaching method is perfectly collinear with teaching hours. If that is the case, you may want to consider coding your factor with orthogonal polynomial contrasts. But do so only if a.) there is no qualitative difference between teaching methods and the only difference is the quantitative difference in the hours taught, and b.) you are actually able to interpret your model. However, I gather that your understanding of regression is still quite limited. Therefore, your initial goal should be to build models that you can understand and interpret.
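To make the difference concrete, here is a minimal sketch on simulated data. The data frame d, the seed, and the "true" effects are all made up for illustration and have nothing to do with your actual data. It fits both versions of the model, shows the 0/1 dummy columns that R builds for a factor, and compares the two fits.

```r
# Illustrative only: simulated data, not the poster's data set.
set.seed(1)
n <- 400
d <- data.frame(
  sex = sample(1:2, n, replace = TRUE),
  age = round(runif(n, 18, 80)),
  teaching.method = sample(1:3, n, replace = TRUE)
)

# Assumed effects: method 2 equals method 1, method 3 raises the log-odds.
eta <- -2.5 + 0.04 * d$age + c(0, 0, 0.5)[d$teaching.method]
d$Grade <- rbinom(n, 1, plogis(eta))

# Model 1: categorical predictors enter as factors (dummy coding).
m1 <- glm(Grade ~ factor(sex) + age + factor(teaching.method),
          family = binomial, data = d)

# Model 2: the numeric codes enter as linear terms.
m2 <- glm(Grade ~ sex + age + teaching.method, family = binomial, data = d)

# The design matrix shows the dummy columns R creates for each factor;
# level 1 is the reference level and is absorbed into the intercept.
head(model.matrix(m1))

# Exponentiated coefficients are odds ratios relative to the reference level.
exp(coef(m1))

# Model 2 is nested in model 1: it forces the step from method 1 to 2 to
# equal the step from method 2 to 3. A likelihood-ratio test compares them.
anova(m2, m1, test = "Chisq")
```

Applied to the output you posted, exp(0.50058) is about 1.65, so the estimated odds of passing under method 3 are about 1.65 times those under method 1 at the same sex and age, although that coefficient is not statistically significant. The residual deviances differ by 440.85 - 438.91 = 1.94 on 1 degree of freedom, assuming both models were fit to the same observations. Note that a two-level variable such as sex gives the same fit either way; the coding question only matters for teaching.method.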
Daniel

-------------------------
cuncta stricte discussurus
-------------------------