Julian Righ Sampedro
2018-Nov-17 20:01 UTC
[R] Multiplication of regression coefficient by factor type variable
Dear all,

In a context of regression, I have three regressors, two of which are categorical variables (sex and education) and have class 'factor'.

y  = data$income
x1 = as.factor(data$sex)  # 2 levels
x2 = data$age             # continuous
x3 = as.factor(data$ed)   # 8 levels

For example, the first entries of x3 are

head(x3)
[1] 5 3 5 5 4 2
Levels: 1 2 3 4 5 6 7 8

When we call the model, the output looks like this

model1 = lm(y ~ x1 + x2 + x3, data = data)
summary(model1)

Residuals:
   Min     1Q Median     3Q    Max
-31220  -6300   -594   4429 190731

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1440.66    3809.99   0.378 0.705417
x1           -4960.88     772.96  -6.418 2.13e-10 ***
x2             181.45      25.03   7.249 8.41e-13 ***
x32           2174.95    3453.22   0.630 0.528948
x33           7497.68    3428.94   2.187 0.029004 *
x34           8278.97    3576.30   2.315 0.020817 *
x35          13686.88    3454.93   3.962 7.97e-05 ***
x36          15902.92    4408.49   3.607 0.000325 ***
x37          28773.13    3696.77   7.783 1.76e-14 ***
x38          31455.55    5448.11   5.774 1.03e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12060 on 1001 degrees of freedom
Multiple R-squared:  0.2486,    Adjusted R-squared:  0.2418
F-statistic: 36.79 on 9 and 1001 DF,  p-value: < 2.2e-16

Now suppose I want to compute the residuals. To do so I first need to compute the prediction from the model. (I use it in a cross-validation context, so this is a partial display of the code.)

yhat1 = model1$coef[1] + model1$coef[2]*x1[i] + model1$coef[3]*x2[i] + model1$coef[4]*x3[i]

But I get the following warnings

Warning messages:
1: In Ops.factor(model1$coef[2], x1[i]) : '*' not meaningful for factors
2: In Ops.factor(model1$coef[4], x3[i]) : '*' not meaningful for factors

1st question: Is there a way to multiply the coefficient by the 'factor' without having to transform my 'factor' into a 'numeric' type variable?

2nd question: Since x3 is associated with 7 parameters (one for x32, one for x33, ..., one for x38), how do I multiply the 'correct' parameter coefficient with my 'factor' x3?

I have been considering an 'if-then' solution, but to no avail. I have also considered splitting my x3 variable into 8 binary variables, without success. What may be the best approach? Thank you for your help.

Since I understand this may not be specific enough, I add here the complete code:

# for n-fold cross validation
# fit models on leave-one-out samples
x1 = as.factor(data$sex)
x2 = data$age
x3 = as.factor(data$ed)
yn = data$income
n  = length(yn)
e1 = e2 = numeric(n)

for (i in 1:n) {
  # the ith observation is excluded
  y   = yn[-i]
  x_1 = x1[-i]
  x_2 = x2[-i]
  x_3 = x3[-i]
  x_4 = as.factor(cf4)[-i]
  # fit the first model without the ith observation
  J1 = lm(y ~ x_1 + x_2 + x_3)
  yhat1 = J1$coef[1] + J1$coef[2]*x1[i] + J1$coef[3]*x2[i] + J1$coef[4]*x3[i]
  # construct the ith part of the loss function for model 1
  e1[i] = yn[i] - yhat1
  # fit the second model without the ith observation
  J2 = lm(y ~ x_1 + x_2 + x_3 + x_4)
  yhat2 = J2$coef[1] + J2$coef[2]*x1[i] + J2$coef[3]*x2[i] + J2$coef[4]*x3[i] + J2$coef[5]*cf4[i]
  e2[i] = yn[i] - yhat2
}
sqrt(c(mean(e1^2), mean(e2^2)))  # RMSE

cf4 is a variable corresponding to groups (obtained from a mixture model). What we want to demonstrate is that the prediction error after cross-validation is lower when we include this latent grouping variable. It works wonders when the categorical variables are treated as 'numeric' variables, though the OLS estimates are obviously very different.

Thank you in advance for your views on the problem.

Best Regards,

julian
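For reference, a minimal sketch of the "split x3 into binary variables" idea, assuming data, x1, x2, x3 and model1 exist exactly as above: model.matrix() rebuilds the same 0/1 dummy columns that lm() used internally, so multiplying a row of that matrix by coef(model1) applies each coefficient to the matching factor level, which the hand-written sum above cannot do with a raw factor.

# Sketch only; assumes data, x1, x2, x3 and model1 exist as above,
# with x1 and x3 entered as factors when model1 was fitted.
# model.matrix() expands the factors into dummy (0/1) columns in the same
# order as coef(model1): intercept, x1 dummy, x2, then 7 dummies for x3.
X <- model.matrix(~ x1 + x2 + x3)

# fitted values for all observations: each row of X times the coefficient vector
yhat_all <- drop(X %*% coef(model1))

# prediction for a single row index i (e.g. the held-out observation in the loop above)
yhat_i <- drop(X[i, , drop = FALSE] %*% coef(model1))

predict(model1) computes exactly this product, which is why the reply below points to ?predict.lm.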
Bert Gunter
2018-Nov-17 21:58 UTC
[R] Multiplication of regression coefficient by factor type variable
You shouldn't have to do any of what you are doing. See ?predict.lm and note the "newdata" argument.

Also, you should spend some time studying a linear model text, as your question appears to indicate some basic confusion (e.g. about "contrasts") about how they work.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Sat, Nov 17, 2018 at 1:24 PM Julian Righ Sampedro <julian.righ.sampedro at gmail.com> wrote:
> Dear all,
>
> In a context of regression, I have three regressors, two of which are
> categorical variables (sex and education) and have class 'factor'.
> [...]
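A minimal sketch of that suggestion applied to the leave-one-out loop from the original post, assuming x1, x2, x3, yn and cf4 exist as described there: passing the held-out row as newdata lets predict() handle the factor-to-dummy expansion, so no coefficient is ever multiplied by a factor.

# Sketch only; assumes x1, x2, x3, yn and cf4 as in the original post, and
# that every factor level still appears after dropping the ith row
# (otherwise predict() will complain about new factor levels).
d  <- data.frame(income = yn, x1 = x1, x2 = x2, x3 = x3, x4 = as.factor(cf4))
n  <- nrow(d)
e1 <- e2 <- numeric(n)

for (i in 1:n) {
  # fit both models without the ith observation
  J1 <- lm(income ~ x1 + x2 + x3,      data = d[-i, ])
  J2 <- lm(income ~ x1 + x2 + x3 + x4, data = d[-i, ])
  # predict() builds the dummy coding for the held-out row itself
  e1[i] <- d$income[i] - predict(J1, newdata = d[i, ])
  e2[i] <- d$income[i] - predict(J2, newdata = d[i, ])
}
sqrt(c(mean(e1^2), mean(e2^2)))  # leave-one-out RMSE for the two models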