Hello,

I'd like to create a model with a factor-type response variable. This is an example:

> mydata <- data.frame(factor_var = as.factor(c(rep('one', 100), rep('two', 100), rep('three', 100))), real_var = c(rnorm(150), rnorm(150) + 5))
> summary(mydata)
 factor_var     real_var
 one  :100   Min.   :-2.742877
 three:100   1st Qu.:-0.009493
 two  :100   Median : 2.361669
             Mean   : 2.490411
             3rd Qu.: 4.822394
             Max.   : 6.924588
> mymodel = glm(factor_var ~ real_var, family = 'binomial', data = mydata)
> summary(mymodel)

Call:
glm(formula = factor_var ~ real_var, family = "binomial", data = mydata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7442  -0.6774   0.1849   0.3133   2.1187

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.6798     0.1882  -3.613 0.000303 ***
real_var      0.8971     0.1066   8.417  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 381.91  on 299  degrees of freedom
Residual deviance: 213.31  on 298  degrees of freedom
AIC: 217.31

Number of Fisher Scoring iterations: 6

---------------------------------------------------------------------

For models with a real-valued response variable, it's easy to work out the equation for the response variable in the model. But here, how do I interpret the model?

--
God made the world in six days, and was arrested on the seventh.
On 04-May-05 Maciej Bliziński wrote:
> Hello,
>
> I'd like to create a model with a factor-type response variable.
> This is an example:
>
>> mydata <- data.frame(factor_var = as.factor(c(rep('one', 100),
>> rep('two', 100), rep('three', 100))), real_var = c(rnorm(150),
>> rnorm(150) + 5))
>> mymodel = glm(factor_var ~ real_var, family = 'binomial', data =
>> mydata)
>> summary(mymodel)
> [...]

Have you noticed that you get identical results with

  set.seed(214354)
  mydata <- data.frame(factor.var = as.factor(c(rep('one', 100),
              rep('two', 100), rep('three', 100))),
              real.var = c(rnorm(150), rnorm(150) + 5))
  mymodel <- glm(factor.var ~ real.var, family='binomial', data=mydata)
  summary(mymodel)

and

  set.seed(214354)
  mydata <- data.frame(factor.var = as.factor(c(rep('one', 100),
              rep('two', 200))),
              real.var = c(rnorm(150), rnorm(150) + 5))
  mymodel <- glm(factor.var ~ real.var, family='binomial', data=mydata)
  summary(mymodel)

(I've left out the "summary(mydata)" since these do naturally differ, and I've replaced "factor_var" with "factor.var" and "real_var" with "real.var" because of potential complications with "_"; also "mymodel =" to "mymodel <-").

So I think the interpretation of the results from your first model is that, because of the "family='binomial'", glm is treating "factor.var='one'" as binomial response "0", say, and "factor.var='two'" or "factor.var='three'" as binomial response "1".

You're trying to fit a multinomial response, but you've specified a binomial family to 'glm'. 'glm' does not have a multinomial response family.

You could try 'multinom' from package 'nnet', which fits loglinear models to factor responses with more than 2 levels. E.g.

  library(nnet)
  mymodel <- multinom(factor.var ~ real.var,data=mydata)
  ### weights:  9 (4 variable)
  ## initial  value 329.583687
  ## iter  10 value 209.780666
  ## final  value 209.779951
  ## converged
  summary(mymodel)
  ## Re-fitting to get Hessian
  ## Call:
  ## multinom(formula = factor.var ~ real.var, data = mydata)
  ## Coefficients:
  ##       (Intercept)  real.var
  ## three  -3.4262565 1.3838231
  ## two    -0.6754253 0.7116955
  ##
  ## Std. Errors:
  ##       (Intercept)  real.var
  ## three   0.5028541 0.1480138
  ## two     0.1846827 0.1068821
  ##
  ## Residual Deviance: 419.5599
  ## AIC: 427.5599
  ##
  ## Correlation of Coefficients:
  ##                 three:(Intercept) three:real.var two:(Intercept)
  ## three:real.var  -0.7286258
  ## two:(Intercept)  0.1986995        -0.1261034
  ## two:real.var    -0.1411377         0.7012481     -0.3285741

This output does suggest a fairly clear interpretation!

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-May-05                                       Time: 09:18:03
------------------------------ XFMail ------------------------------
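
One way to read the 'multinom' coefficients is as log-odds of each level against the baseline level "one". The sketch below is only an illustration, not part of the original reply: it refits the model on the seeded data from this message and rebuilds the class probabilities by hand. The evaluation points in `x0` are hypothetical values chosen for the example, and `trace = FALSE` is used only to silence the iteration log.

```r
library(nnet)  # recommended package shipped with R

set.seed(214354)
mydata <- data.frame(
  factor.var = as.factor(c(rep("one", 100), rep("two", 100), rep("three", 100))),
  real.var   = c(rnorm(150), rnorm(150) + 5)
)

fit <- multinom(factor.var ~ real.var, data = mydata, trace = FALSE)

# Each row of coef(fit) holds the intercept and slope of the log-odds of
# that level against the baseline level "one" (the first factor level).
b <- coef(fit)

# Hypothetical values of real.var at which to evaluate the model.
x0 <- c(0, 2.5, 5)

# Linear predictors: 0 for the baseline, intercept + slope * x otherwise.
eta <- cbind(one = 0,
             sapply(rownames(b),
                    function(k) b[k, "(Intercept)"] + b[k, "real.var"] * x0))

# Normalising exp(eta) across levels gives the class probabilities.
p_manual <- exp(eta) / rowSums(exp(eta))

# The same probabilities as returned by predict().
p_pred <- predict(fit, newdata = data.frame(real.var = x0), type = "probs")

print(round(p_manual, 4))
print(round(p_pred[, colnames(p_manual)], 4))
```

Both printed tables should agree, which shows that each coefficient row describes how the odds of "three" (or "two") relative to "one" change with real.var.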
On Wed, 4 May 2005, Maciej Bliziński wrote:

> I'd like to create a model with a factor-type response variable. This is
> an example:

What you have done here is to fit a logistic regression. The interpretation of that is covered in many good books: for example there are plots of the predicted values in MASS4.

I do wonder if that is what you intended, though. You have fitted a model of 'two or three' vs 'one'. You may have intended a multinomial logistic model: again MASS4 has details of such models.

>> mydata <- data.frame(factor_var = as.factor(c(rep('one', 100), rep('two', 100), rep('three', 100))), real_var = c(rnorm(150), rnorm(150) + 5))
>> mymodel = glm(factor_var ~ real_var, family = 'binomial', data = mydata)
>> summary(mymodel)
> [...]
>
> For models with real-type response variable it's easy to figure out,
> what's the equation for the response variable (in the model). But here
> - how do I interpret the model?

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
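
The "'two or three' vs 'one'" reading can be checked directly. The following is a minimal sketch, assuming the seeded data used earlier in the thread; the column `not_one` and the evaluation point `x0` are hypothetical names introduced only for this illustration. It compares the factor-response fit with an explicit 0/1 fit and then turns the coefficients into a fitted probability.

```r
set.seed(214354)
mydata <- data.frame(
  factor_var = as.factor(c(rep("one", 100), rep("two", 100), rep("three", 100))),
  real_var   = c(rnorm(150), rnorm(150) + 5)
)

# Fit as in the original post: a three-level factor with a binomial family.
# glm() treats the first level ("one") as failure and all others as success.
fit_factor <- glm(factor_var ~ real_var, family = binomial, data = mydata)

# Explicit 0/1 response: 0 for "one", 1 for "two" or "three".
mydata$not_one <- as.integer(mydata$factor_var != "one")
fit_binary <- glm(not_one ~ real_var, family = binomial, data = mydata)

# The two fits should give the same coefficients.
print(all.equal(coef(fit_factor), coef(fit_binary)))

# The model equation is logit(P(not "one")) = intercept + slope * real_var,
# so the inverse logit gives the fitted probability at a chosen value.
x0 <- 1
b  <- coef(fit_binary)
p_manual <- plogis(b[["(Intercept)"]] + b[["real_var"]] * x0)
p_pred   <- predict(fit_binary, newdata = data.frame(real_var = x0),
                    type = "response")
print(c(manual = p_manual, predict = unname(p_pred)))
```

On this reading, the real_var coefficient is the change in the log-odds of "two or three" against "one" per unit increase in real_var.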
>>> Maciej Bliziński <m.blizinski at wsisiz.edu.pl> 5/4/2005 6:02:14 AM >>>

<<<
I'm trying to analyze a survey. Most of the variables are of factor type, with values for example {"no_at_all", "a_little", "mostly", "a_lot"}.
>>>

In that case, you probably want to look at ordinal logistic regression. This is covered in numerous texts; one good one which uses R is Harrell's Regression Modeling Strategies (an excellent book in other regards, as well). Another book which might be useful (although not R specific) is Long's Regression Models for Categorical and Limited Dependent Variables.

<<<
I thought about mapping those answers to numbers, but I didn't know what numbers should I assign them to: {1, 2, 3, 4} (linear) or maybe {1, 2, 4, 8} (exponential)? So I rather tried to analyze the original factor survey data. Multinomial factor response wasn't covered in the lectures in my school so I'm trying to use my intuition and trial/error technique (please forgive me :-) ).
>>>

Using your intuition and trial and error seems to me to be a way to guarantee lots of trials and lots of errors, but not necessarily to guarantee success. You might want to consult a statistician before proceeding; you certainly want to consult a text.

HTH

Peter

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
www.peterflom.com
New York, NY 10010
(212) 845-4485 (voice)
(917) 438-0894 (fax)
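
For readers wondering what an ordinal logistic fit looks like in R, here is a rough sketch using 'polr' from package 'MASS' (one of several options; Harrell's own functions are another). Everything in it is an assumption made for illustration: the data are simulated, and the predictor `x`, the latent scale and the cutpoints are hypothetical. The point is that the response only needs to be an ordered factor, so no numeric scores such as {1, 2, 3, 4} or {1, 2, 4, 8} have to be chosen.

```r
library(MASS)  # provides polr(); a recommended package shipped with R

set.seed(1)
n <- 400
x <- rnorm(n)                    # hypothetical predictor
latent <- 1.2 * x + rlogis(n)    # simulated latent scale behind the answers

# Cut the latent scale into the four ordered answer categories.
answer <- cut(latent, breaks = c(-Inf, -1, 0.5, 2, Inf),
              labels = c("no_at_all", "a_little", "mostly", "a_lot"),
              ordered_result = TRUE)
survey <- data.frame(answer = answer, x = x)

# Proportional-odds logistic regression: only the order of the levels is used.
fit <- polr(answer ~ x, data = survey, Hess = TRUE)

print(coef(fit))   # one slope on the cumulative log-odds scale
print(fit$zeta)    # estimated cutpoints between adjacent categories

# Fitted probabilities of each answer at a few values of x.
probs <- predict(fit, newdata = data.frame(x = c(-1, 0, 1)), type = "probs")
print(round(probs, 3))
```

The cutpoints in `fit$zeta` are estimated from the data, which is the model's way of answering the spacing question.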