Ben Bolker
2015-Mar-28 17:28 UTC
[R] Error in lm() with very small (close to zero) regressor
peter dalgaard <pdalgd <at> gmail.com> writes:

>> On 28 Mar 2015, at 00:32 , RiGui <raluca.gui <at> business.uzh.ch> wrote:
>>
>> Hello everybody,
>>
>> I have encountered the following problem with lm():
>>
>> When running lm() with a regressor close to zero - of the order e-10,
>> the value of the estimate is of huge absolute value, of the order of
>> millions.
>>
>> However, if I write the formula of the OLS estimator in matrix notation,
>> pseudoinverse(t(X)*X) * t(X) * y, the results are correct, meaning the
>> estimate has value 0.

How do you know this answer is "correct"?

>> here is the code:
>>
>> y <- rnorm(n_obs, 10,2.89)
>> x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
>> x2 <- rnorm(n_obs, 10,3.21)
>> X <- cbind(x1,x2)
>>
>> bFE <- lm(y ~ x1 + x2)
>> bFE
>>
>> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
>> bOLS
>>
>> Note: I am applying a deviation-from-the-mean projection to the data;
>> that is why I have some regressors with such small values.
>>
>> Thank you for any help!
>>
>> Raluca

Is there a reason you can't scale your regressors?

> Example not reproducible:

I agree that the OP's question was not reproducible, but it's not too
hard to make it reproducible. I bothered to use library("sos");
findFn("pseudoinverse") to find pseudoinverse() in corpcor.

It is true that we get estimates with very large magnitudes:

set.seed(101)
n_obs <- 1000
y <- rnorm(n_obs, 10,2.89)
x1 <- rnorm(n_obs, mean=1.235657e-14,sd=4.5e-17)
x2 <- rnorm(n_obs, 10,3.21)
X <- cbind(x1,x2)

bFE <- lm(y ~ x1 + x2)
bFE
coef(summary(bFE))

                 Estimate   Std. Error     t value  Pr(>|t|)
(Intercept)  1.155959e+01 2.312956e+01  0.49977541 0.6173435
x1          -1.658420e+14 1.872598e+15 -0.08856254 0.9294474
x2           3.797342e-02 2.813000e-02  1.34992593 0.1773461

library("corpcor")
bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
bOLS

             [,1]
[1,] 9.807664e-16
[2,] 8.880273e-01

And if we scale the predictor:

bFE2 <- lm(y ~ I(1e14*x1) + x2)
coef(summary(bFE2))

                 Estimate Std. Error     t value  Pr(>|t|)
(Intercept)   11.55958731   23.12956  0.49977541 0.6173435
I(1e+14 * x1) -1.65842037   18.72598 -0.08856254 0.9294474
x2             0.03797342    0.02813  1.34992593 0.1773461

bOLS stays constant.

To be honest, I haven't thought about this enough to see which answer is
actually correct, although I suspect the problem is in bOLS, since the
numerical methods behind lm (unlike the brute-force pseudoinverse method
given here) have been carefully considered for numerical stability.
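The rescaling check works because least squares is scale-equivariant: multiplying a column by c divides its coefficient by c and leaves the fitted values unchanged. A minimal pure-Python sketch of this property (hypothetical one-regressor data on the same scale as x1, no intercept):

```python
# One-regressor least squares without intercept: beta = sum(x*y) / sum(x*x)
def ols_1d(x, y):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / sxx

# Hypothetical data, roughly on the scale of the thread's x1
x = [1.2e-14, 1.3e-14, 1.1e-14]
y = [10.0, 12.0, 9.0]

b_raw = ols_1d(x, y)
b_scaled = ols_1d([1e14 * v for v in x], y)

# Scaling the column by 1e14 divides the slope by 1e14; the two
# coefficients describe the same fitted values.
print(b_raw, 1e14 * b_scaled)
```

So a coefficient of order 1e+14 on a column of order 1e-14 is not evidence of a wrong answer by itself; it is what equivariance predicts.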
Thank you for your replies!

I am terribly sorry for the code not being reproducible; it is the first
time I am posting here. I ran the code several times before I posted,
but... I forgot about the library used.

To answer your questions:

> How do you know this answer is "correct"?

What I am doing is actually a "fixed effect" estimation. I apply a
projection matrix to the data, both dependent and independent variables,
a projection which renders the regressors that do not vary basically
equal to zero - the x1 from the post.

Once I apply the projection, I need to run OLS to get the estimates, so
x1 should be zero. Therefore, the results with the scaled regressor are
not correct.

Besides, I do not see why the bOLS is wrong, since it is the formula of
the OLS estimator from any Econometrics book.

Here again the corrected code:

install.packages("corpcor")
library(corpcor)

n_obs <- 1000
y <- rnorm(n_obs, 10,2.89)
x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
x2 <- rnorm(n_obs, 10,3.21)
X <- cbind(x1,x2)

bFE <- lm(y ~ x1 + x2)
bFE

bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
bOLS

Best,

Raluca Gui

--
View this message in context: http://r.789695.n4.nabble.com/Error-in-lm-with-very-small-close-to-zero-regressor-tp4705185p4705212.html
Sent from the R help mailing list archive at Nabble.com.
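The crux of the disagreement is the claim that the projected regressor "should be zero". Demeaning a nearly constant column leaves values that are tiny but still varying, and OLS will fit a slope to whatever variation is left. A pure-Python sketch of the within (demeaning) projection, with hypothetical values mimicking x1:

```python
# Within (demeaning) projection applied to a nearly constant regressor.
# Hypothetical values around 1.2356e-14, mimicking the thread's x1.
x1 = [1.2356e-14, 1.2357e-14, 1.2355e-14, 1.2358e-14]
m = sum(x1) / len(x1)
w = [v - m for v in x1]

# The demeaned column is tiny (order 1e-18) but not identically zero,
# so a regression on it is still well-defined - and yields a huge slope.
print(max(abs(v) for v in w))
```

If the intent is to drop such regressors, that has to be done explicitly (e.g. by a tolerance check on the column's scale), not by hoping the fitting routine treats "tiny" as "zero".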
peter dalgaard
2015-Mar-29 10:42 UTC
[R] Error in lm() with very small (close to zero) regressor
> On 28 Mar 2015, at 18:28 , Ben Bolker <bbolker at gmail.com> wrote:
>
> peter dalgaard <pdalgd <at> gmail.com> writes:
>
> [snip: quoted text of the original post and replies]
>
> I agree that the OP's question was not reproducible, but it's
> not too hard to make it reproducible. I bothered to use
> library("sos"); findFn("pseudoinverse") to find pseudoinverse()
> in corpcor:

Well, it shouldn't be my work, nor yours... And I thought it particularly
egregious to treat a function from an unspecified package as gospel.

> [snip: reproducible example and output]
>
> To be honest, I haven't thought about this enough to see
> which answer is actually correct, although I suspect the
> problem is in bOLS, since the numerical methods (unlike
> the brute-force pseudoinverse method given here) behind
> lm have been carefully considered for numerical stability.

In particular, the pseudoinverse() function has a tol= argument which
allows it to zap small singular values.

> pseudoinverse(crossprod(X)) %*% crossprod(X,y)
             [,1]
[1,] 9.807664e-16
[2,] 8.880273e-01
> pseudoinverse(crossprod(X), tol=1e-40) %*% crossprod(X,y)
              [,1]
[1,]  1.286421e+15
[2,] -5.327384e-01

Also, notice that there is no intercept in the above, so it would be more
reasonable to compare to

> bFE <- lm(y ~ x1 + x2 - 1)
> summary(bFE)

Call:
lm(formula = y ~ x1 + x2 - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-9.1712 -1.9439 -0.0321  1.7637  9.4540

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
x1 7.700e+14  2.435e+13   31.62   <2e-16 ***
x2 3.766e-02  2.811e-02    1.34    0.181
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.771 on 998 degrees of freedom
Multiple R-squared:  0.9275,  Adjusted R-squared:  0.9273
F-statistic: 6382 on 2 and 998 DF,  p-value: < 2.2e-16

Or, maybe better, to use cbind(1,X) in the pseudoinverse() method. It
doesn't help, though.

What really surprises me is this:

> X <- cbind(x1,x2)
> crossprod(X)
             x1           x2
x1 1.526698e-25 1.265085e-10
x2 1.265085e-10 1.145462e+05
> solve(crossprod(X))
Error in solve.default(crossprod(X)) :
  system is computationally singular: reciprocal condition number = 1.13052e-31
> solve(crossprod(X), tol=1e-40)
              x1            x2
x1  7.722232e+25 -8.528686e+10
x2 -8.528686e+10  1.029237e-04
> pseudoinverse(crossprod(X), tol=1e-40)
              [,1]          [,2]
[1,]  6.222152e+25 -6.217178e+10
[2,] -6.871948e+10  7.739466e-05

How does pseudoinverse() go so badly wrong? Notice that, apart from the
scaling, this really isn't very ill-conditioned, but a lot of accuracy
seems to be lost in its SVD step. Notice that

> X <- cbind(1e14*x1, x2)
> solve(crossprod(X))
                            x2
   0.0077222325 -0.0008528686
x2 -0.0008528686  0.0001029237
> pseudoinverse(crossprod(X))
              [,1]          [,2]
[1,]  0.0077222325 -0.0008528686
[2,] -0.0008528686  0.0001029237

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
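The "accuracy lost in the SVD step" can be made concrete. For a Gram matrix like crossprod(X) here, the 2x2 determinant suffers only mild cancellation, so direct elimination keeps almost all its digits; but the smallest eigenvalue sits so far below eps * ||X'X|| that a double-precision SVD cannot distinguish it from zero, which is consistent with pseudoinverse() truncating it. A pure-Python sketch with hypothetical data at the same scales (exact rational arithmetic via Fraction serves as ground truth):

```python
from fractions import Fraction

# Two columns on wildly different scales, as in the thread (hypothetical values)
x1 = [1.2e-14, 1.3e-14, 1.1e-14]
x2 = [10.0, 12.0, 9.0]

def gram(u, v):
    return sum(a * b for a, b in zip(u, v))

s11, s12, s22 = gram(x1, x1), gram(x1, x2), gram(x2, x2)

# Determinant of X'X in double precision: the cancellation here loses
# only a few digits, so solve()-style elimination still works.
det_float = s11 * s22 - s12 * s12

# The same determinant in exact rational arithmetic (floats convert exactly)
fx1 = [Fraction(a) for a in x1]
fx2 = [Fraction(a) for a in x2]
e11, e12, e22 = gram(fx1, fx1), gram(fx1, fx2), gram(fx2, fx2)
det_exact = e11 * e22 - e12 * e12

# Relative error of the double-precision determinant: tiny.
rel_err = abs(det_float - float(det_exact)) / float(det_exact)

# The smallest eigenvalue is about det/s22, while a double-precision SVD
# only resolves singular values down to about eps * s22. The gap of many
# orders of magnitude is why the SVD step, not the algebra, gives up.
print(rel_err, det_float / s22, 2.2e-16 * s22)
```

In other words: elimination on the 2x2 normal equations is still accurate here, but any SVD computed in doubles sees the small singular value as noise, so the pseudoinverse's tol-based truncation kicks in.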
peter dalgaard
2015-Mar-29 18:31 UTC
[R] Error in lm() with very small (close to zero) regressor
> On 28 Mar 2015, at 18:52 , RiGui <raluca.gui at business.uzh.ch> wrote:
>
> Thank you for your replies!
>
> I am terribly sorry for the code not being reproducible; it is the
> first time I am posting here. I ran the code several times before I
> posted, but... I forgot about the library used.
>
> To answer your questions:
>
> How do you know this answer is "correct"?
>
> What I am doing is actually a "fixed effect" estimation. I apply a
> projection matrix to the data, both dependent and independent
> variables, a projection which renders the regressors that do not vary
> basically equal to zero - the x1 from the post.
>
> Once I apply the projection, I need to run OLS to get the estimates,
> so x1 should be zero.

Please rethink: If a regressor is very small, the regression coefficient
will be very large; if it is small and random, OLS estimators will be
highly variable. R has no way of knowing that a regressor with small
values isn't what the user intended (e.g. it could be picoMolar
concentrations stated in Molar units); if you want a mechanism that
eliminates near-zero regressors, you need to do it explicitly.

> Therefore, the results with the scaled regressor are not correct.
>
> Besides, I do not see why the bOLS is wrong, since it is the formula
> of the OLS estimator from any Econometrics book.

Textbooks often gloss over details like numerical stability (and in
general, textbooks often use slightly oversimplified methods in order
not to confuse students unnecessarily). Better books will give the
(X'X)^-1 X'Y formula with a warning not to use it as is, but to use e.g.
the X = QR decomposition [which gives (R'Q'QR)^-1 R'Q'Y = (R'R)^-1 R'Q'Y
= R^-1 Q'Y].

> Here again the corrected code:
>
> install.packages("corpcor")
> library(corpcor)
>
> n_obs <- 1000
> y <- rnorm(n_obs, 10,2.89)
> x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
> x2 <- rnorm(n_obs, 10,3.21)
> X <- cbind(x1,x2)
>
> bFE <- lm(y ~ x1 + x2)
> bFE
>
> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
> bOLS

Notice again that these are not comparable, in that bFE has an intercept
term and bOLS hasn't. You need to compare with y ~ x1 + x2 - 1 and
y ~ x2 - 1.

> Best,
>
> Raluca Gui

[snip: Nabble and R-help list footers]

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
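The R^-1 Q'Y identity can be turned into a toy solver. The sketch below is pure Python with hypothetical data, and uses classical Gram-Schmidt for clarity (production code such as lm's underlying routines uses more robust QR variants); the point is that X'X is never formed:

```python
import math

def lstsq_qr(x1, x2, y):
    """Least squares for y ~ x1 + x2 (no intercept) via Gram-Schmidt QR.

    Computes X = QR and solves R b = Q'y by back-substitution,
    instead of forming and inverting X'X.
    """
    # First column of Q: normalize x1
    r11 = math.sqrt(sum(a * a for a in x1))
    q1 = [a / r11 for a in x1]
    # Orthogonalize x2 against q1
    r12 = sum(a * b for a, b in zip(q1, x2))
    v = [b - r12 * a for a, b in zip(q1, x2)]
    r22 = math.sqrt(sum(a * a for a in v))
    q2 = [a / r22 for a in v]
    # Back-substitute R b = Q'y
    c1 = sum(a * b for a, b in zip(q1, y))
    c2 = sum(a * b for a, b in zip(q2, y))
    b2 = c2 / r22
    b1 = (c1 - r12 * b2) / r11
    return b1, b2

# Consistent toy system: y = 2*x1 + 3*x2 exactly, so the solver should
# recover (2, 3) up to rounding.
x1 = [1.0, 2.0, 3.0]
x2 = [1.0, 1.0, 2.0]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]
b1, b2 = lstsq_qr(x1, x2, y)
print(b1, b2)
```

Because the triangular solve works with R directly, the method's sensitivity scales with the condition number of X rather than that of X'X, which is the squared quantity that sinks the normal-equations formula.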
Ben Bolker
2015-Mar-29 18:34 UTC
[R] Error in lm() with very small (close to zero) regressor
RiGui <raluca.gui <at> business.uzh.ch> writes:

[snip]

> I am terribly sorry for the code not being reproducible; it is the
> first time I am posting here. I ran the code several times before I
> posted, but... I forgot about the library used.

Thanks for updating.

> To answer your questions:
>
>> How do you know this answer is "correct"?
>
> What I am doing is actually a "fixed effect" estimation. I apply a
> projection matrix to the data, both dependent and independent
> variables, a projection which renders the regressors that do not vary
> basically equal to zero - the x1 from the post.
>
> Once I apply the projection, I need to run OLS to get the estimates,
> so x1 should be zero.

Yes, but not *exactly* zero.

> Therefore, the results with the scaled regressor are not correct.
>
> Besides, I do not see why the bOLS is wrong, since it is the formula
> of the OLS estimator from any Econometrics book.

Because it's numerically unstable. Unfortunately, you can't always
translate formulas directly from books into code and expect them to be
reliable. Based on Peter's comments, I believe that, as expected, lm()
is actually getting closer to the 'correct' answer.

> Here again the corrected code:
>
> library(corpcor)
>
> n_obs <- 1000
> y <- rnorm(n_obs, 10,2.89)
> x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
> x2 <- rnorm(n_obs, 10,3.21)
> X <- cbind(x1,x2)

As Peter points out, one example uses an intercept and the other
doesn't: you should either use X <- cbind(1,x1,x2) or lm(y~x1+x2-1)
for compatibility.

> bFE <- lm(y ~ x1 + x2)
> bFE
>
> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
> bOLS

[snip: Gmane doesn't like it if I quote "too much"]
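The intercept-compatibility point is easy to see in one dimension: with and without an intercept are different models, and the slopes differ whenever the data have an offset. A pure-Python illustration with hypothetical numbers:

```python
# Slope of y on x, fitted without and with an intercept term.
def slope_no_intercept(x, y):
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def slope_with_intercept(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # exactly 1 + 2*x

s_with = slope_with_intercept(x, y)
s_without = slope_no_intercept(x, y)

# With the intercept the slope is the true 2; without it, the slope
# absorbs part of the offset, so the two fits are not comparable.
print(s_with, s_without)
```

This is exactly why bFE (fitted with an intercept) and bOLS (built from X = cbind(x1,x2), no ones column) answer different questions, independently of any numerical issues.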