I have seen this type of model in some of my company's projects.

To simplify: Y is regressed on X1 and X2, but the regression is done in two steps. First, Y is regressed on X1 with an intercept; then the residuals from the first step are regressed on X2, without a constant. The reason for doing this is that some observations have X1 data but no X2 data, so I guess the person wants to use as much information as possible to estimate the coefficient for X1, and then use the subset of residuals that have X2 data to capture what is left to be explained by X2.

My concern is: should we consider the correlation between X1 and X2? If the residuals from the first step are used, the X1 effect has already been removed. What, then, does it mean to regress those residuals on X2, which is itself partly correlated with X1? Should X2 be adjusted for X1 too (regress X2 on X1 and use the residuals)?

What if both X1 and X2 are dummy variables? Dummy variables can have a meaningful correlation too, right?

Thanks a lot!

--
View this message in context: http://www.nabble.com/Can-I-do-regression-by-steps--tp18338562p18338562.html
Sent from the R help mailing list archive at Nabble.com.
Be very careful!
When regression is performed in steps, you will often not get the same results as you would from a single multivariable regression. The explanation is not simple, but a simplified version is this: when you run the first regression,
y=f(x1)
all of the variance that x1 can account for is absorbed by x1, including any variance it shares with x2, leaving little variance to be accounted for by the second regression,
residuals=f(x2)
In contrast, when you perform a multivariable regression,
y=f(x1,x2)
the total variance is apportioned between x1 and x2.
John

John David Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
University of Maryland School of Medicine Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
(Phone) 410-605-7119
(Fax) 410-605-7913 (Please call phone number above prior to faxing)

>>> rlearner309 <unixunix99 at gmail.com> 7/8/2008 8:53 AM >>>
> I saw this type of models in some of my company projects.
> [snip]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
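The variance argument in John's reply can be made concrete with a short derivation. This is only a sketch under assumptions that are not stated in the thread: a true linear model with an error term uncorrelated with both predictors, population (large-sample) moments, and both steps run on the same population, which ignores the missing-X2 rows. The symbols (beta, sigma, rho, mu) are introduced here for illustration.

```latex
% Assumed model: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon,
% with \varepsilon uncorrelated with x_1 and x_2.
% Notation: \sigma_j^2 = Var(x_j), \sigma_{12} = Cov(x_1, x_2),
% \rho = \sigma_{12} / (\sigma_1 \sigma_2), \mu_j = E[x_j].
\begin{align*}
\text{Step 1 slope:}\quad
  \tilde\beta_1 &= \frac{\operatorname{Cov}(x_1, y)}{\sigma_1^2}
                 = \beta_1 + \beta_2 \frac{\sigma_{12}}{\sigma_1^2} \\
\text{Step 1 residual:}\quad
  r &= \beta_2 \left[ (x_2 - \mu_2) - \frac{\sigma_{12}}{\sigma_1^2} (x_1 - \mu_1) \right] + \varepsilon \\
\operatorname{Cov}(r, x_2)
  &= \beta_2 \left( \sigma_2^2 - \frac{\sigma_{12}^2}{\sigma_1^2} \right)
   = \beta_2 \, \sigma_2^2 \, (1 - \rho^2) \\
\text{Step 2 slope (no intercept):}\quad
  \tilde\beta_2 &= \frac{E[r \, x_2]}{E[x_2^2]}
                 = \beta_2 \, (1 - \rho^2) \, \frac{\sigma_2^2}{\sigma_2^2 + \mu_2^2}
\end{align*}
```

Under these assumptions the step-1 coefficient picks up part of the X2 effect whenever the predictors are correlated, and the step-2 coefficient is shrunk toward zero by the factor (1 - rho^2). Dropping the constant in step 2 shrinks it further unless X2 has mean zero.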
Thanks for the reply.
I am aware of the difference, but can I do regression by steps at all? I am not comfortable with it.

John Sorkin wrote:
> Be very careful!
> When regression is performed by steps, you often will not get the same
> results as you would get from a single multivariable regression.
> [snip]

--
View this message in context: http://www.nabble.com/Can-I-do-regression-by-steps--tp18338562p18350475.html
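On the question raised in the original post, whether X2 should also be adjusted by X1: a small numerical check may help. The sketch below is not from the thread. It uses invented data and hypothetical helper names, plain Python so that it runs anywhere, and it fits every regression on the same complete rows. It compares the X2 coefficient from the joint fit with (a) the step-1 residuals regressed on raw X2 and (b) the step-1 residuals regressed on X2 after X2 has itself been regressed on X1.

```python
# Illustrative sketch only: the data and helper names are invented.

def mean(v):
    return sum(v) / len(v)

def residuals(x, y):
    """Residuals from regressing y on x with an intercept."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]

def slope_through_origin(x, y):
    """Slope from regressing y on x with no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def joint_ols(x1, x2, y):
    """Return (b1, b2) from regressing y on x1 and x2 with an intercept."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 9]  # strongly correlated with x1
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.15, -0.05, -0.05]
y = [1 + 2 * a + 3 * b + e for a, b, e in zip(x1, x2, noise)]

b1_joint, b2_joint = joint_ols(x1, x2, y)

ry = residuals(x1, y)    # step 1: y with x1 (and the intercept) removed
rx2 = residuals(x1, x2)  # x2 with x1 (and the intercept) removed

b2_naive = slope_through_origin(x2, ry)      # residuals on raw x2
b2_adjusted = slope_through_origin(rx2, ry)  # residuals on adjusted x2

print("joint fit b2:      ", b2_joint)
print("two-step, raw x2:  ", b2_naive)
print("two-step, adjusted:", b2_adjusted)
```

In this sketch the adjusted version reproduces the joint-fit coefficient, while the raw-X2 version is far too small. The agreement relies on every step using the same rows. If step 1 is fitted on extra observations that lack X2, as in the situation described in the original post, the match is no longer exact.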
Someone else can blast me if this is not correct, but I think the two-step procedure gives the same answer as the regular regression only if X1 and X2 are perfectly uncorrelated. If they are correlated at all, then what John pointed out messes the procedure up.

I was asked that question in an interview a long time ago, and the question I always wondered about but never asked was why someone would want to do that. So, here goes: why do you want to do that?

On Tue, Jul 8, 2008 at 6:25 PM, rlearner309 wrote:
> Thanks for the reply.
> I am aware of the difference, but can I do regression by steps at all? I
> am not comfortable with it.
> [snip]
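The claim in the reply above, that the two-step procedure matches the ordinary regression only when X1 and X2 are uncorrelated, can be checked on toy data. The following is a minimal sketch with invented numbers and hypothetical helper names, written in plain Python. It also assumes X2 has mean zero, because the second step in the original post omits the constant; with a nonzero mean the no-intercept step would differ even for uncorrelated predictors.

```python
# Illustrative sketch only: the data and helper names are invented.

def mean(v):
    return sum(v) / len(v)

def two_step(x1, x2, y):
    """Step 1: y on x1 with intercept. Step 2: residuals on x2, no intercept."""
    m1, my = mean(x1), mean(y)
    b1 = (sum((a - m1) * (c - my) for a, c in zip(x1, y))
          / sum((a - m1) ** 2 for a in x1))
    resid = [(c - my) - b1 * (a - m1) for a, c in zip(x1, y)]
    b2 = sum(r * b for r, b in zip(resid, x2)) / sum(b * b for b in x2)
    return b1, b2

def joint_ols(x1, x2, y):
    """Return (b1, b2) from regressing y on x1 and x2 with an intercept."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def make_y(x1, x2, noise):
    """Toy response with assumed true coefficients 2 (x1) and 3 (x2)."""
    return [1 + 2 * a + 3 * b + e for a, b, e in zip(x1, x2, noise)]

x1 = [-3, -1, 1, 3, -3, -1, 1, 3]
x2_uncorr = [-1, -1, -1, -1, 1, 1, 1, 1]  # mean zero, sum(x1 * x2) == 0
x2_corr = [-1, -1, -1, 1, -1, 1, 1, 1]    # mean zero, correlated with x1
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.15, -0.05, -0.05]

results = {}
for label, x2 in (("uncorrelated", x2_uncorr), ("correlated", x2_corr)):
    y = make_y(x1, x2, noise)
    results[label] = (two_step(x1, x2, y), joint_ols(x1, x2, y))
    print(label, "two-step:", results[label][0], "joint:", results[label][1])
```

With the uncorrelated pair, both coefficients agree between the two procedures. With the correlated pair, the step-1 coefficient for X1 absorbs part of the X2 effect and the step-2 coefficient for X2 comes out too small, which is the behaviour John described.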