Hi, does anyone know which function in R handles multiple regression with categorical predictor data? I understand that 'lm' is used for continuous predictor data.

thanks,
karena
--
View this message in context: http://r.789695.n4.nabble.com/regression-function-for-categorical-predictor-data-tp2532045p2532045.html
Sent from the R help mailing list archive at Nabble.com.
(Ted Harding)
2010-Sep-08  22:33 UTC
[R] regression function for categorical predictor data
On 08-Sep-10 21:11:27, karena wrote:
> Hi, do you guys know what function in R handles the multiple regression
> on categorical predictor data. i.e, 'lm' is used to handle continuous
> predictor data.
>
> thanks,
> karena

Karena,
lm() also handles categorical data, provided these are presented
as factors. For example:

  set.seed(12345)
  X <- 0.05*(-20:20)  # Continuous predictor
  F <- as.factor(c(rep("A",21),rep("B",20)))
  ## 21 obs at level "A", 20 at level "B"
  Y <- 0.5*X + c(0.25*rnorm(21),0.25*rnorm(20)+2.0)
  ## Y increases linearly with X (coeff = 0.5)
  ## Y at Level "B" is 2.0 higher than at Level "A"
  ## "Error" term has SD = 0.25
  plot(X,Y)
  summary(lm(Y ~ X + F))
  # Call: lm(formula = Y ~ X + F)
  # Residuals:
  #      Min       1Q   Median       3Q      Max
  # -0.56511 -0.15807 -0.00034  0.16484  0.44048
  # Coefficients:
  #             Estimate Std. Error t value Pr(>|t|)
  # (Intercept)  0.09561    0.08869   1.078    0.288
  # X            0.63621    0.13671   4.654 3.89e-05 ***
  # FB           1.93821    0.16181  11.978 1.80e-14 ***
  # ---
  # Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  # Residual standard error: 0.2589 on 38 degrees of freedom
  # Multiple R-squared: 0.965, Adjusted R-squared: 0.9631
  # F-statistic: 523.4 on 2 and 38 DF, p-value: < 2.2e-16

The reported estimate for FB gives the change in the level of Y
resulting from a change from "A" to "B" in F (here 1.93821, close
to the simulated difference of 2.0).

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Sep-10                                       Time: 23:33:34
------------------------------ XFMail ------------------------------
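To illustrate how lm() treats a factor, here is a minimal sketch with a three-level predictor. The data are simulated and the names grp, x, y and shift are hypothetical, chosen only for this example. It relies on R's default treatment contrasts, under which each non-reference level gets its own indicator column.

```r
# Illustrative data only; 'grp', 'x', 'y' and 'shift' are made-up names.
set.seed(1)
grp   <- factor(rep(c("A", "B", "C"), each = 10))   # three-level categorical predictor
x     <- runif(30)                                   # continuous predictor
shift <- unname(c(A = 0, B = 2, C = -1)[as.character(grp)])
y     <- 1 + 0.5 * x + shift + rnorm(30, sd = 0.25)

# With default treatment contrasts, the factor is expanded into one
# indicator (dummy) column per non-reference level.
mm <- model.matrix(~ x + grp)
colnames(mm)   # "(Intercept)" "x" "grpB" "grpC"

fit <- lm(y ~ x + grp)
coef(fit)      # grpB and grpC are shifts relative to the reference level "A"

# Changing the reference level changes which contrasts are reported,
# but not the fitted values.
fit2 <- lm(y ~ x + relevel(grp, ref = "B"))
all.equal(unname(fitted(fit)), unname(fitted(fit2)))
```

The coefficient names (grpB, grpC) are the factor name pasted to the level name, which is also where "FB" in the output above comes from.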
glm() is another choice. With glm(), your response variable can be a
discrete random variable; however, you need to specify its distribution
through the argument
  family = "distribution name"
Using Ted's simulated data with glm(), you get the same result as that
produced by lm():

> summary(glm(Y ~ X + F, family="gaussian"))

Call:
glm(formula = Y ~ X + F, family = "gaussian")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.53796  -0.16201  -0.08087   0.15080   0.47363

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03723    0.08457   0.440 0.662267
X            0.51009    0.13036   3.913 0.000365 ***
FB           1.82578    0.15429  11.833  2.6e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.06096497)

    Null deviance: 59.7558  on 40  degrees of freedom
Residual deviance:  2.3167  on 38  degrees of freedom
AIC: 6.5418

Number of Fisher Scoring iterations: 2
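For a response that really is discrete, the family argument is what changes. The following is only a sketch on simulated count data, assuming a Poisson model with the default log link. The names grp, x, counts and pfit are hypothetical and do not come from the thread.

```r
# Illustrative simulated data; all object names here are made up.
set.seed(2)
grp <- factor(rep(c("A", "B"), each = 50))   # categorical predictor
x   <- runif(100, -1, 1)                     # continuous predictor

# Counts whose log-mean is linear in x, with an upward shift at level "B".
counts <- rpois(100, lambda = exp(0.5 + 0.8 * x + 1.0 * (grp == "B")))

# Same formula interface as lm(); only the family differs.
pfit <- glm(counts ~ x + grp, family = poisson)
summary(pfit)

# On the log scale, grpB is an additive shift; exponentiating gives the
# multiplicative change in the expected count going from "A" to "B".
exp(coef(pfit)["grpB"])
```

For a binary response, family = binomial would be used in the same way.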
Sorry, the result is not the same as Ted's, since our datasets are
different. I also ran lm() on the dataset that I used for glm(). The
results are exactly the same:

> summary(lm(Y ~ X + F))

Call:
lm(formula = Y ~ X + F)

Residuals:
     Min       1Q   Median       3Q      Max
-0.53796 -0.16201 -0.08087  0.15080  0.47363

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03723    0.08457   0.440 0.662267
X            0.51009    0.13036   3.913 0.000365 ***
FB           1.82578    0.15429  11.833  2.6e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2469 on 38 degrees of freedom
Multiple R-squared: 0.9612, Adjusted R-squared: 0.9592
F-statistic: 471.1 on 2 and 38 DF, p-value: < 2.2e-16

==============
The dataset is given below:

> cbind(Y,X,F)
                Y     X F
 [1,] -0.28473266 -1.00 1
 [2,] -0.59041310 -0.95 1
 [3,] -0.50431754 -0.90 1
 [4,] -0.60095969 -0.85 1
 [5,] -0.45849905 -0.80 1
 [6,] -0.48287208 -0.75 1
 [7,] -0.49598666 -0.70 1
 [8,] -0.08746758 -0.65 1
 [9,] -0.18665177 -0.60 1
[10,] -0.01007210 -0.55 1
[11,] -0.45765308 -0.50 1
[12,] -0.27318684 -0.45 1
[13,]  0.07638855 -0.40 1
[14,]  0.27043727 -0.35 1
[15,]  0.26926216 -0.30 1
[16,] -0.43047783 -0.25 1
[17,]  0.40884468 -0.20 1
[18,] -0.14638563 -0.15 1
[19,] -0.31374179 -0.10 1
[20,] -0.15028159 -0.05 1
[21,] -0.12540519  0.00 1
[22,]  1.58015611  0.05 2
[23,]  1.68200774  0.10 2
[24,]  2.02821901  0.15 2
[25,]  2.02359285  0.20 2
[26,]  2.14133171  0.25 2
[27,]  2.06931685  0.30 2
[28,]  2.05561726  0.35 2
[29,]  2.35720999  0.40 2
[30,]  1.96134404  0.45 2
[31,]  2.26144356  0.50 2
[32,]  2.24454620  0.55 2
[33,]  2.55707426  0.60 2
[34,]  2.18732022  0.65 2
[35,]  1.90950697  0.70 2
[36,]  2.10371010  0.75 2
[37,]  2.18266009  0.80 2
[38,]  2.18490441  0.85 2
[39,]  2.45248295  0.90 2
[40,]  2.79851838  0.95 2
[41,]  1.83514341  1.00 2
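The agreement between lm() and glm() with a gaussian family can also be checked programmatically rather than by comparing printed summaries. The sketch below regenerates data with Ted's code; the names fit_lm and fit_glm are hypothetical. The simulated values on a given R installation may not match either set of numbers posted in this thread, so no expected estimates are quoted.

```r
# Data generated as in Ted's example.
set.seed(12345)
X <- 0.05 * (-20:20)
F <- as.factor(c(rep("A", 21), rep("B", 20)))
Y <- 0.5 * X + c(0.25 * rnorm(21), 0.25 * rnorm(20) + 2.0)

fit_lm  <- lm(Y ~ X + F)
fit_glm <- glm(Y ~ X + F, family = gaussian)

# Coefficient estimates agree up to numerical tolerance.
all.equal(coef(fit_lm), coef(fit_glm))

# The gaussian dispersion parameter is the squared residual standard error.
all.equal(summary(fit_lm)$sigma^2, summary(fit_glm)$dispersion)

# cbind() builds a numeric matrix, so the factor shows up as its integer
# codes (1, 2); a data frame keeps the level labels "A" and "B".
head(cbind(Y, X, F))
head(data.frame(Y, X, F))
```

The second check matches the posted output: the dispersion 0.06096497 is the square of the residual standard error 0.2469.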