Aldi Kraja
2012-Aug-07  19:28 UTC
[R] lm with a single X and step with several Xi-s, beta coef. quite different:
Hi, (R version 2.15.0)

I am running a program with 1 response (Y, standardized earlier) and 44 independent variables (Xi) from the same data set, a2.

When I run the 'lm' function on a single Xi at a time, the beta coefficient for, let's say, X1 is -0.09682 (se=0.03256).
But when I run the same Y with all 44 Xi-s through the 'step' function (because I left the direction parameter empty, I assume a backward multiple regression is performed), 12 Xi-s remain in the final model. X1 is still present, but its beta coefficient becomes -0.43402 (se=0.06847).

I did not expect such a drastic change (about 4.5 times larger in magnitude) in the beta coefficient, from "lm" with X1 alone (bx1=-0.09682) to "step" with the final 12 Xi-s including X1 (bx1=-0.43402).
I understand that the step function produces partial regression coefficients, with all other Xi-s held constant, but is there any good reason why X1 in a multiple regression can become so much more significant (from lm px1=0.00296 ** to step px1=2.55e-10 ***)?

Some of the 44 Xi-s are correlated with each other, but I am hoping that stepwise regression will drop some of the correlated ones.
The Xi-s are variables coded numerically as 0, 1, 2 so that a linear regression can be applied to them.
For example, the frequency of X1 is:

[1] x1
Levels: x1
   0    1    2
3459  985   96

output of lm(Y ~ X1):
=================
> obj1<-lm(y ~ x1, data=a2)
> summary(obj1)

Call:
lm(formula = y ~ x1, data = a2)

Residuals:
    Min      1Q  Median      3Q     Max
-3.3418 -0.7240 -0.0462  0.6577  4.2929

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03635    0.01781   2.042  0.04124 *
x1          -0.09682    0.03256  -2.973  0.00296 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.024 on 4255 degrees of freedom
Multiple R-squared: 0.002074, Adjusted R-squared: 0.001839
F-statistic: 8.842 on 1 and 4255 DF, p-value: 0.002961

output from the step function on 44 Xi-s:
===================================
a2 <-na.omit(ac16g761[,3:(44+2+1)])
lm.a2<-lm(y ~ ., data=a2)
lm.final <-step(lm.a2,trace=F)
summary(lm.final)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 +
    x9 + x10 + x11 + x12, data = a2)

Residuals:
    Min      1Q  Median      3Q     Max
-3.2955 -0.7210 -0.0611  0.6623  4.1064

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.01065    0.02637   0.404 0.686412
x1          -0.43402    0.06847  -6.339 2.55e-10 ***
x2          -0.17109    0.11370  -1.505 0.132464
x3           0.23552    0.11552   2.039 0.041533 *
x4          -0.19898    0.10133  -1.964 0.049625 *
x5           0.06653    0.03796   1.752 0.079769 .
x6           0.18319    0.08592   2.132 0.033070 *
x7          -0.17443    0.05095  -3.424 0.000624 ***
x8           0.24013    0.06516   3.685 0.000232 ***
x9           0.19202    0.08009   2.398 0.016543 *
x10         -0.17257    0.05576  -3.095 0.001983 **
x11         -0.23537    0.05704  -4.126 3.75e-05 ***
x12          0.25992    0.06260   4.152 3.35e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.02 on 4244 degrees of freedom
Multiple R-squared: 0.01353, Adjusted R-squared: 0.01074
F-statistic: 4.851 on 12 and 4244 DF, p-value: 5.466e-08

Thank you in advance,

Aldi

P.S. Sorry that I cannot distribute these data for a test.
Ista Zahn
2012-Aug-08  13:49 UTC
[R] lm with a single X and step with several Xi-s, beta coef. quite different:
Hi,

Sounds like suppression -- see e.g. http://www.jstor.org/stable/2988294?seq=1 for a discussion.

Since this is not an R question but a statistical one, it may be more appropriate to post this question to a statistics forum such as http://stats.stackexchange.com/

Best,
Ista

On Tue, Aug 7, 2012 at 3:28 PM, Aldi Kraja <aldi at wustl.edu> wrote:
> I understand that step function is producing partial reg coeff, when all
> other Xi-s are held constant, but is there any good reason why X1 in a
> multivariate reg. can become so significant (from lm px1=0.00296 ** to step
> px1=2.55e-10 ***)?
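
For readers who want to see the suppression effect mentioned in the reply, here is a minimal simulated sketch. It is not the poster's data (which could not be shared). The variables x1, x2, y, the sample size, and the coefficients -0.40 and 0.35 are all made up for illustration. The assumption is simply that two predictors are strongly positively correlated and act on y in opposite directions, so that the single-predictor slope of x1 is pulled toward zero and grows in magnitude once x2 is adjusted for.

```r
# Simulated illustration of suppression; all data and names here are made up.
set.seed(1)
n  <- 4000
z  <- rnorm(n)                    # shared component that makes x1 and x2 correlated
x1 <- z + rnorm(n, sd = 0.5)
x2 <- z + rnorm(n, sd = 0.5)      # cor(x1, x2) is about 0.8 by construction
y  <- -0.40 * x1 + 0.35 * x2 + rnorm(n)

fit_single <- lm(y ~ x1)          # x1 alone: its slope also absorbs part of x2's effect
fit_joint  <- lm(y ~ x1 + x2)     # x1 adjusted for x2

# In expectation the single-predictor slope is -0.40 + 0.35 * 0.8 = -0.12,
# while the adjusted slope is -0.40.
b_single <- unname(coef(fit_single)["x1"])
b_joint  <- unname(coef(fit_joint)["x1"])
print(c(single = b_single, joint = b_joint))
print(cor(x1, x2))
```

On real data, a first check along the same lines would be to look at the correlations between X1 and the other retained predictors, for example with cor() on the relevant columns of a2, assuming those columns are stored as numeric rather than as factors.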