Valerio Leone Sciabolazza
2022-Oct-06 11:23 UTC
[R] Fixed effect model: different estimation approaches with R return different results
Good morning, I am trying to use R to estimate a fixed effects model (i.e., a panel regression model controlling for unobserved time-invariant heterogeneities across agents) using different estimation approaches (e.g. replicating xtreg from Stata, see e.g. https://www.stata.com/support/faqs/statistics/intercept-in-fixed-effects-model/). I have already asked this question on different stacks exchange forums and contacted package creators who dealt with this issue before, but I wasn't able to obtain an answer to my doubts. I hope to have better luck on this list. Let me introduce the problem, and note that I am using an unbalanced panel. The easiest way to estimate my fixed effect model is using the function lm. Example: # load packages library(dplyr) # set seed for replication purposes set.seed(123) # create toy dataset x <- rnorm(4000) x2 <- rnorm(length(x)) id <- factor(sample(500,length(x),replace=TRUE)) firm <- data.frame(id = id) %>% group_by(id) %>% mutate(firm = 1:n()) %>% pull(firm) id.eff <- rlnorm(nlevels(id)) firm.eff <- rexp(length(unique(firm))) y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x)) db = data.frame(y = y, x = x, id = id, firm = firm) rm <- db %>% group_by(id) %>% summarise(firm = max(firm)) %>% filter(firm == 1) %>% pull(id) db = db[-which(db$id %in% rm), ] # Run regression test <- lm(y ~ x + id, data = db) Another approach is demeaning the variables included into the model specification. In this way, one can exclude the fixed effects from the model. Of course, point estimates will be correct, while standard errors will be not (because we are not accounting for the degrees of freedom used in the demeaning). # demean data dbm <- as_tibble(db) %>% group_by(id) %>% mutate(y = y - mean(y), x = x - mean(x)) %>% ungroup() # run regression test2 <- lm(y ~ x, data = dbm) # compare results summary(test)$coefficients[2,1]> 0.9753364summary(test2)$coefficients[2,1]> 0.9753364Another way to do this is to demean the variables and add their grand average (I believe that this is what xtreg from Stata does) # create data n = length(unique(db$id)) dbh <- dbm %>% mutate(yh = y + (sum(db$y)/n), xh = x + (sum(db$x)/n)) # run regression test3 <- lm(yh ~ xh, dbh) # compare results summary(test)$coefficients[2,1]> 0.9753364summary(test2)$coefficients[2,1]> 0.9753364summary(test3)$coefficients[2,1]> 0.9753364As one can see, the three approaches report the same point estimates (again, standard errors will be different instead). When I include an additional set of fixed effects in the model specification, the three approaches no longer return the same point estimate. However, differences seem to be negligible and they could be due to rounding. db$firm <- as.factor(db$firm) dbm$firm <- as.factor(dbm$firm) dbh$firm <- as.factor(dbh$firm) testB <- lm(y ~ x + id + firm, data = db) testB2 <- lm(y ~ x + firm, data = dbm) testB3 <- lm(yh ~ xh + firm, data = dbh) summary(testB)$coefficients[2,1]> 0.9834414summary(testB2)$coefficients[2,1]> 0.9833334summary(testB3)$coefficients[2,1]> 0.9833334A similar behavior occurs if I use a dummy variable rather than a continous one. For the only purpose of the example, I show this by transforming my target variable x from a continuous to a dummy variable. # create data x3 <- ifelse(db$x > 0, 1, 0) db <- db %>% mutate(x3 = x3) dbm <- dbm %>% mutate(x3 = x3) %>% group_by(id) %>% mutate(x3 = x3 - mean(x3)) %>% ungroup() dbh <- dbh %>% mutate(x3 = dbm$x3) %>% mutate(x3 = x3 + (sum(db$x3)/n)) %>% ungroup() # Run regressions testC <- lm(y ~ x3 + id + firm, data = db) testC2 <- lm(y ~ x3 + firm, data = dbm) testC3 <- lm(yh ~ x3 + firm, data = dbh) summary(testC)$coefficients[2, 1]> 1.579883summary(testC2)$coefficients[2, 1]> 1.579159summary(testC3)$coefficients[2, 1]> 1.579159Now, I want to estimate both the impact of x when this is higher than 0 (i.e., x3) and when this is lower or equal to zero (call it x4). Again, observe that x3 might as well be a real dummy variable, not a transformation of a continuous variable. In order to do that, I exclude the intercept from my model. Specifically, I do the following: # create data x4 <- ifelse(db$x <= 0, 1, 0) db <- db %>% mutate(x4 = x4) dbm <- dbm %>% mutate(x4 = x4) %>% group_by(id) %>% mutate(x4 = x4 - mean(x4)) %>% ungroup() dbh <- dbh %>% mutate(x4 = dbm$x4) %>% mutate(x4 = x4 + (sum(db$x4)/n)) %>% ungroup() testD <- lm(y ~ x3 + x4 + id + firm - 1, data = db) testD2 <- lm(y ~ x3 + x4 + firm - 1, data = dbm) testD3 <- lm(yh ~ x3 + x4 + firm - 1, data = dbh) summary(testD)$coefficients[1:2, ]> 1.2372816 -0.3426011summary(testD2)> Call:lm(formula = y ~ x3 + x4 + firm - 1, data = dbm) Residuals: Min 1Q Median 3Q Max -3.8794 -0.7497 0.0010 0.7442 3.8486 Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) x3 1.57916 0.03779 41.788 < 2e-16 *** x4 NA NA NA NA ... redacted summary(testD3)$coefficients[1:2]> 3.254654 1.675495As you can see, the second approach is not able to estimate the impact of x4 on y. At the same time, the first and the third approach return very different point estimates. Is anyone able to explain me why I cannot obtain the same point estimates for this last exercise? Is there anything wrong in the way I include the second set of fixed effects? Is there anything wrong in the way I include the variables x3 and x4? Or this is simply a problem due to some internal functions in R? Any hint would be much appreciated. Best, Valerio
Bert Gunter
2022-Oct-06 14:31 UTC
[R] Fixed effect model: different estimation approaches with R return different results
You could get lucky here, but strictly speaking, this list is about R programming and statistical issues are typically off topic Someone might respond privately, though. Cheers, Bert On Thu, Oct 6, 2022 at 4:24 AM Valerio Leone Sciabolazza < sciabolazza at gmail.com> wrote:> Good morning, > I am trying to use R to estimate a fixed effects model (i.e., a panel > regression model controlling for unobserved time-invariant > heterogeneities across agents) using different estimation approaches > (e.g. replicating xtreg from Stata, see e.g. > > https://www.stata.com/support/faqs/statistics/intercept-in-fixed-effects-model/ > ). > I have already asked this question on different stacks exchange forums > and contacted package creators who dealt with this issue before, but I > wasn't able to obtain an answer to my doubts. > I hope to have better luck on this list. > > Let me introduce the problem, and note that I am using an unbalanced panel. > > The easiest way to estimate my fixed effect model is using the function lm. > > Example: > > # load packages > library(dplyr) > # set seed for replication purposes > set.seed(123) > # create toy dataset > x <- rnorm(4000) > x2 <- rnorm(length(x)) > id <- factor(sample(500,length(x),replace=TRUE)) > firm <- data.frame(id = id) %>% > group_by(id) %>% > mutate(firm = 1:n()) %>% > pull(firm) > id.eff <- rlnorm(nlevels(id)) > firm.eff <- rexp(length(unique(firm))) > y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x)) > db = data.frame(y = y, x = x, id = id, firm = firm) > rm <- db %>% group_by(id) %>% summarise(firm = max(firm)) %>% > filter(firm == 1) %>% pull(id) > db = db[-which(db$id %in% rm), ] > # Run regression > test <- lm(y ~ x + id, data = db) > > Another approach is demeaning the variables included into the model > specification. > In this way, one can exclude the fixed effects from the model. Of > course, point estimates will be correct, while standard errors will be > not (because we are not accounting for the degrees of freedom used in > the demeaning). > > # demean data > dbm <- as_tibble(db) %>% > group_by(id) %>% > mutate(y = y - mean(y), > x = x - mean(x)) %>% > ungroup() > # run regression > test2 <- lm(y ~ x, data = dbm) > # compare results > summary(test)$coefficients[2,1] > > 0.9753364 > summary(test2)$coefficients[2,1] > > 0.9753364 > > Another way to do this is to demean the variables and add their grand > average (I believe that this is what xtreg from Stata does) > > # create data > n = length(unique(db$id)) > dbh <- dbm %>% > mutate(yh = y + (sum(db$y)/n), > xh = x + (sum(db$x)/n)) > # run regression > test3 <- lm(yh ~ xh, dbh) > # compare results > summary(test)$coefficients[2,1] > > 0.9753364 > summary(test2)$coefficients[2,1] > > 0.9753364 > summary(test3)$coefficients[2,1] > > 0.9753364 > > As one can see, the three approaches report the same point estimates > (again, standard errors will be different instead). > > When I include an additional set of fixed effects in the model > specification, the three approaches no longer return the same point > estimate. However, differences seem to be negligible and they could be > due to rounding. > > db$firm <- as.factor(db$firm) > dbm$firm <- as.factor(dbm$firm) > dbh$firm <- as.factor(dbh$firm) > testB <- lm(y ~ x + id + firm, data = db) > testB2 <- lm(y ~ x + firm, data = dbm) > testB3 <- lm(yh ~ xh + firm, data = dbh) > summary(testB)$coefficients[2,1] > > 0.9834414 > summary(testB2)$coefficients[2,1] > > 0.9833334 > summary(testB3)$coefficients[2,1] > > 0.9833334 > > A similar behavior occurs if I use a dummy variable rather than a > continous one. For the only purpose of the example, I show this by > transforming my target variable x from a continuous to a dummy > variable. > > # create data > x3 <- ifelse(db$x > 0, 1, 0) > db <- db %>% mutate(x3 = x3) > dbm <- dbm %>% > mutate(x3 = x3) %>% > group_by(id) %>% > mutate(x3 = x3 - mean(x3)) %>% > ungroup() > dbh <- dbh %>% mutate(x3 = dbm$x3) %>% > mutate(x3 = x3 + (sum(db$x3)/n)) %>% > ungroup() > # Run regressions > testC <- lm(y ~ x3 + id + firm, data = db) > testC2 <- lm(y ~ x3 + firm, data = dbm) > testC3 <- lm(yh ~ x3 + firm, data = dbh) > summary(testC)$coefficients[2, 1] > > 1.579883 > summary(testC2)$coefficients[2, 1] > > 1.579159 > summary(testC3)$coefficients[2, 1] > > 1.579159 > > Now, I want to estimate both the impact of x when this is higher than > 0 (i.e., x3) and when this is lower or equal to zero (call it x4). > Again, observe that x3 might as well be a real dummy variable, not a > transformation of a continuous variable. > > In order to do that, I exclude the intercept from my model. > Specifically, I do the following: > > # create data > x4 <- ifelse(db$x <= 0, 1, 0) > db <- db %>% mutate(x4 = x4) > dbm <- dbm %>% > mutate(x4 = x4) %>% > group_by(id) %>% > mutate(x4 = x4 - mean(x4)) %>% > ungroup() > dbh <- dbh %>% mutate(x4 = dbm$x4) %>% > mutate(x4 = x4 + (sum(db$x4)/n)) %>% > ungroup() > testD <- lm(y ~ x3 + x4 + id + firm - 1, data = db) > testD2 <- lm(y ~ x3 + x4 + firm - 1, data = dbm) > testD3 <- lm(yh ~ x3 + x4 + firm - 1, data = dbh) > summary(testD)$coefficients[1:2, ] > > 1.2372816 -0.3426011 > summary(testD2) > > Call: > lm(formula = y ~ x3 + x4 + firm - 1, data = dbm) > > Residuals: > Min 1Q Median 3Q Max > -3.8794 -0.7497 0.0010 0.7442 3.8486 > > Coefficients: (1 not defined because of singularities) > Estimate Std. Error t value Pr(>|t|) > x3 1.57916 0.03779 41.788 < 2e-16 *** > x4 NA NA NA NA > ... redacted > summary(testD3)$coefficients[1:2] > > 3.254654 1.675495 > > As you can see, the second approach is not able to estimate the impact > of x4 on y. At the same time, the first and the third approach return > very different point estimates. > > Is anyone able to explain me why I cannot obtain the same point > estimates for this last exercise? > > Is there anything wrong in the way I include the second set of fixed > effects? > Is there anything wrong in the way I include the variables x3 and x4? > Or this is simply a problem due to some internal functions in R? > > Any hint would be much appreciated. > > Best, > Valerio > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]