Ben Bolker
2015-Mar-28 17:28 UTC
[R] Error in lm() with very small (close to zero) regressor
peter dalgaard <pdalgd <at> gmail.com> writes:

>> On 28 Mar 2015, at 00:32 , RiGui <raluca.gui <at> business.uzh.ch> wrote:
>>
>> Hello everybody,
>>
>> I have encountered the following problem with lm():
>>
>> When running lm() with a regressor close to zero - of the order e-10,
>> the value of the estimate is of huge absolute value, of the order of
>> millions.
>>
>> However, if I write the formula of the OLS estimator in matrix notation,
>> pseudoinverse(t(X)*X) * t(X) * y, the results are correct, meaning the
>> estimate has value 0.

How do you know this answer is "correct"?

>> here is the code:
>>
>> y <- rnorm(n_obs, 10,2.89)
>> x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
>> x2 <- rnorm(n_obs, 10,3.21)
>> X <- cbind(x1,x2)
>>
>> bFE <- lm(y ~ x1 + x2)
>> bFE
>>
>> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
>> bOLS
>>
>> Note: I am applying a deviation-from-the-mean projection to the data;
>> that is why I have some regressors with such small values.
>>
>> Thank you for any help!
>>
>> Raluca

Is there a reason you can't scale your regressors?

> Example not reproducible:

I agree that the OP's question was not reproducible, but it's not too
hard to make it reproducible. I bothered to use library("sos");
findFn("pseudoinverse") to find pseudoinverse() in corpcor.

It is true that we get estimates with very large magnitudes:

set.seed(101)
n_obs <- 1000
y <- rnorm(n_obs, 10,2.89)
x1 <- rnorm(n_obs, mean=1.235657e-14,sd=4.5e-17)
x2 <- rnorm(n_obs, 10,3.21)
X <- cbind(x1,x2)

bFE <- lm(y ~ x1 + x2)
bFE
coef(summary(bFE))

                 Estimate   Std. Error     t value  Pr(>|t|)
(Intercept)  1.155959e+01 2.312956e+01  0.49977541 0.6173435
x1          -1.658420e+14 1.872598e+15 -0.08856254 0.9294474
x2           3.797342e-02 2.813000e-02  1.34992593 0.1773461

library("corpcor")
bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
bOLS

             [,1]
[1,] 9.807664e-16
[2,] 8.880273e-01

And if we scale the predictor:

bFE2 <- lm(y ~ I(1e14*x1) + x2)
coef(summary(bFE2))

                 Estimate Std. Error     t value  Pr(>|t|)
(Intercept)   11.55958731   23.12956  0.49977541 0.6173435
I(1e+14 * x1) -1.65842037   18.72598 -0.08856254 0.9294474
x2             0.03797342    0.02813  1.34992593 0.1773461

bOLS stays constant.

To be honest, I haven't thought about this enough to see which answer is
actually correct, although I suspect the problem is in bOLS, since the
numerical methods behind lm (unlike the brute-force pseudoinverse method
given here) have been carefully considered for numerical stability.
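The rescaling check works because least squares is scale-equivariant: multiplying a column by c divides its coefficient by c and leaves the fitted values unchanged. A minimal pure-Python sketch of this property (hypothetical one-regressor data on the same scale as x1, no intercept):

```python
# One-regressor least squares without intercept: beta = sum(x*y) / sum(x*x)
def ols_1d(x, y):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / sxx

# Hypothetical data, roughly on the scale of the thread's x1
x = [1.2e-14, 1.3e-14, 1.1e-14]
y = [10.0, 12.0, 9.0]

b_raw = ols_1d(x, y)
b_scaled = ols_1d([1e14 * v for v in x], y)

# Scaling the column by 1e14 divides the slope by 1e14; the two
# coefficients describe the same fitted values.
print(b_raw, 1e14 * b_scaled)
```

So a coefficient of order 1e+14 on a column of order 1e-14 is not evidence of a wrong answer by itself; it is what equivariance predicts.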
Thank you for your replies!

I am terribly sorry for the code not being reproducible; it is the first
time I am posting here. I ran the code several times before I posted,
but... I forgot about the library used.

To answer your questions:

> How do you know this answer is "correct"?

What I am doing is actually a "fixed effect" estimation. I apply a
projection matrix to the data, both dependent and independent variables,
a projection which renders the regressors that do not vary basically
equal to zero - the x1 from the post.

Once I apply the projection, I need to run OLS to get the estimates, so
x1 should be zero. Therefore, the results with the scaled regressor are
not correct.

Besides, I do not see why the bOLS is wrong, since it is the formula of
the OLS estimator from any Econometrics book.

Here again the corrected code:

install.packages("corpcor")
library(corpcor)

n_obs <- 1000
y <- rnorm(n_obs, 10,2.89)
x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
x2 <- rnorm(n_obs, 10,3.21)
X <- cbind(x1,x2)

bFE <- lm(y ~ x1 + x2)
bFE

bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
bOLS

Best,

Raluca Gui

--
View this message in context: http://r.789695.n4.nabble.com/Error-in-lm-with-very-small-close-to-zero-regressor-tp4705185p4705212.html
Sent from the R help mailing list archive at Nabble.com.
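The crux of the disagreement is the claim that the projected regressor "should be zero". Demeaning a nearly constant column leaves values that are tiny but still varying, and OLS will fit a slope to whatever variation is left. A pure-Python sketch of the within (demeaning) projection, with hypothetical values mimicking x1:

```python
# Within (demeaning) projection applied to a nearly constant regressor.
# Hypothetical values around 1.2356e-14, mimicking the thread's x1.
x1 = [1.2356e-14, 1.2357e-14, 1.2355e-14, 1.2358e-14]
m = sum(x1) / len(x1)
w = [v - m for v in x1]

# The demeaned column is tiny (order 1e-18) but not identically zero,
# so a regression on it is still well-defined - and yields a huge slope.
print(max(abs(v) for v in w))
```

If the intent is to drop such regressors, that has to be done explicitly (e.g. by a tolerance check on the column's scale), not by hoping the fitting routine treats "tiny" as "zero".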
peter dalgaard
2015-Mar-29 10:42 UTC
[R] Error in lm() with very small (close to zero) regressor
> On 28 Mar 2015, at 18:28 , Ben Bolker <bbolker at gmail.com> wrote:
>
> peter dalgaard <pdalgd <at> gmail.com> writes:
>
> [snip: quoted text of the original post and replies]
>
> I agree that the OP's question was not reproducible, but it's
> not too hard to make it reproducible. I bothered to use
> library("sos"); findFn("pseudoinverse") to find pseudoinverse()
> in corpcor:

Well, it shouldn't be my work, nor yours... And I thought it particularly
egregious to treat a function from an unspecified package as gospel.

> [snip: reproducible example and output]
>
> To be honest, I haven't thought about this enough to see
> which answer is actually correct, although I suspect the
> problem is in bOLS, since the numerical methods (unlike
> the brute-force pseudoinverse method given here) behind
> lm have been carefully considered for numerical stability.

In particular, the pseudoinverse() function has a tol= argument which
allows it to zap small singular values.

> pseudoinverse(crossprod(X)) %*% crossprod(X,y)
             [,1]
[1,] 9.807664e-16
[2,] 8.880273e-01
> pseudoinverse(crossprod(X), tol=1e-40) %*% crossprod(X,y)
              [,1]
[1,]  1.286421e+15
[2,] -5.327384e-01

Also, notice that there is no intercept in the above, so it would be more
reasonable to compare to

> bFE <- lm(y ~ x1 + x2 - 1)
> summary(bFE)

Call:
lm(formula = y ~ x1 + x2 - 1)

Residuals:
    Min      1Q  Median      3Q     Max
-9.1712 -1.9439 -0.0321  1.7637  9.4540

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
x1 7.700e+14  2.435e+13   31.62   <2e-16 ***
x2 3.766e-02  2.811e-02    1.34    0.181
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.771 on 998 degrees of freedom
Multiple R-squared:  0.9275,  Adjusted R-squared:  0.9273
F-statistic: 6382 on 2 and 998 DF,  p-value: < 2.2e-16

Or, maybe better, to use cbind(1,X) in the pseudoinverse() method. It
doesn't help, though.

What really surprises me is this:

> X <- cbind(x1,x2)
> crossprod(X)
             x1           x2
x1 1.526698e-25 1.265085e-10
x2 1.265085e-10 1.145462e+05
> solve(crossprod(X))
Error in solve.default(crossprod(X)) :
  system is computationally singular: reciprocal condition number = 1.13052e-31
> solve(crossprod(X), tol=1e-40)
              x1            x2
x1  7.722232e+25 -8.528686e+10
x2 -8.528686e+10  1.029237e-04
> pseudoinverse(crossprod(X), tol=1e-40)
              [,1]          [,2]
[1,]  6.222152e+25 -6.217178e+10
[2,] -6.871948e+10  7.739466e-05

How does pseudoinverse() go so badly wrong? Notice that, apart from the
scaling, this really isn't very ill-conditioned, but a lot of accuracy
seems to be lost in its SVD step. Notice that

> X <- cbind(1e14*x1, x2)
> solve(crossprod(X))
                            x2
   0.0077222325 -0.0008528686
x2 -0.0008528686  0.0001029237
> pseudoinverse(crossprod(X))
              [,1]          [,2]
[1,]  0.0077222325 -0.0008528686
[2,] -0.0008528686  0.0001029237

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
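The "accuracy lost in the SVD step" can be made concrete. For a Gram matrix like crossprod(X) here, the 2x2 determinant suffers only mild cancellation, so direct elimination keeps almost all its digits; but the smallest eigenvalue sits so far below eps * ||X'X|| that a double-precision SVD cannot distinguish it from zero, which is consistent with pseudoinverse() truncating it. A pure-Python sketch with hypothetical data at the same scales (exact rational arithmetic via Fraction serves as ground truth):

```python
from fractions import Fraction

# Two columns on wildly different scales, as in the thread (hypothetical values)
x1 = [1.2e-14, 1.3e-14, 1.1e-14]
x2 = [10.0, 12.0, 9.0]

def gram(u, v):
    return sum(a * b for a, b in zip(u, v))

s11, s12, s22 = gram(x1, x1), gram(x1, x2), gram(x2, x2)

# Determinant of X'X in double precision: the cancellation here loses
# only a few digits, so solve()-style elimination still works.
det_float = s11 * s22 - s12 * s12

# The same determinant in exact rational arithmetic (floats convert exactly)
fx1 = [Fraction(a) for a in x1]
fx2 = [Fraction(a) for a in x2]
e11, e12, e22 = gram(fx1, fx1), gram(fx1, fx2), gram(fx2, fx2)
det_exact = e11 * e22 - e12 * e12

# Relative error of the double-precision determinant: tiny.
rel_err = abs(det_float - float(det_exact)) / float(det_exact)

# The smallest eigenvalue is about det/s22, while a double-precision SVD
# only resolves singular values down to about eps * s22. The gap of many
# orders of magnitude is why the SVD step, not the algebra, gives up.
print(rel_err, det_float / s22, 2.2e-16 * s22)
```

In other words: elimination on the 2x2 normal equations is still accurate here, but any SVD computed in doubles sees the small singular value as noise, so the pseudoinverse's tol-based truncation kicks in.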
peter dalgaard
2015-Mar-29 18:31 UTC
[R] Error in lm() with very small (close to zero) regressor
> On 28 Mar 2015, at 18:52 , RiGui <raluca.gui at business.uzh.ch> wrote:
>
> Thank you for your replies!
>
> I am terribly sorry for the code not being reproducible; it is the
> first time I am posting here. I ran the code several times before I
> posted, but... I forgot about the library used.
>
> To answer your questions:
>
> How do you know this answer is "correct"?
>
> What I am doing is actually a "fixed effect" estimation. I apply a
> projection matrix to the data, both dependent and independent
> variables, a projection which renders the regressors that do not vary
> basically equal to zero - the x1 from the post.
>
> Once I apply the projection, I need to run OLS to get the estimates,
> so x1 should be zero.

Please rethink: If a regressor is very small, the regression coefficient
will be very large; if it is small and random, OLS estimators will be
highly variable. R has no way of knowing that a regressor with small
values isn't what the user intended (e.g. it could be picoMolar
concentrations stated in Molar units); if you want a mechanism that
eliminates near-zero regressors, you need to do it explicitly.

> Therefore, the results with the scaled regressor are not correct.
>
> Besides, I do not see why the bOLS is wrong, since it is the formula
> of the OLS estimator from any Econometrics book.

Textbooks often gloss over details like numerical stability (and in
general, textbooks often use slightly oversimplified methods in order
not to confuse students unnecessarily). Better books will give the
(X'X)^-1 X'Y formula with a warning not to use it as is, but to use e.g.
the X = QR decomposition [which gives (R'Q'QR)^-1 R'Q'Y = (R'R)^-1 R'Q'Y
= R^-1 Q'Y].

> Here again the corrected code:
>
> install.packages("corpcor")
> library(corpcor)
>
> n_obs <- 1000
> y <- rnorm(n_obs, 10,2.89)
> x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
> x2 <- rnorm(n_obs, 10,3.21)
> X <- cbind(x1,x2)
>
> bFE <- lm(y ~ x1 + x2)
> bFE
>
> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
> bOLS

Notice again that these are not comparable, in that bFE has an intercept
term and bOLS hasn't. You need to compare with y ~ x1 + x2 - 1 and
y ~ x2 - 1.

> Best,
>
> Raluca Gui

[snip: Nabble and R-help list footers]

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
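The R^-1 Q'Y identity can be turned into a toy solver. The sketch below is pure Python with hypothetical data, and uses classical Gram-Schmidt for clarity (production code such as lm's underlying routines uses more robust QR variants); the point is that X'X is never formed:

```python
import math

def lstsq_qr(x1, x2, y):
    """Least squares for y ~ x1 + x2 (no intercept) via Gram-Schmidt QR.

    Computes X = QR and solves R b = Q'y by back-substitution,
    instead of forming and inverting X'X.
    """
    # First column of Q: normalize x1
    r11 = math.sqrt(sum(a * a for a in x1))
    q1 = [a / r11 for a in x1]
    # Orthogonalize x2 against q1
    r12 = sum(a * b for a, b in zip(q1, x2))
    v = [b - r12 * a for a, b in zip(q1, x2)]
    r22 = math.sqrt(sum(a * a for a in v))
    q2 = [a / r22 for a in v]
    # Back-substitute R b = Q'y
    c1 = sum(a * b for a, b in zip(q1, y))
    c2 = sum(a * b for a, b in zip(q2, y))
    b2 = c2 / r22
    b1 = (c1 - r12 * b2) / r11
    return b1, b2

# Consistent toy system: y = 2*x1 + 3*x2 exactly, so the solver should
# recover (2, 3) up to rounding.
x1 = [1.0, 2.0, 3.0]
x2 = [1.0, 1.0, 2.0]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]
b1, b2 = lstsq_qr(x1, x2, y)
print(b1, b2)
```

Because the triangular solve works with R directly, the method's sensitivity scales with the condition number of X rather than that of X'X, which is the squared quantity that sinks the normal-equations formula.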
Ben Bolker
2015-Mar-29 18:34 UTC
[R] Error in lm() with very small (close to zero) regressor
RiGui <raluca.gui <at> business.uzh.ch> writes:

[snip]

> I am terribly sorry for the code not being reproducible; it is the
> first time I am posting here. I ran the code several times before I
> posted, but... I forgot about the library used.

Thanks for updating.

> To answer your questions:
>
>> How do you know this answer is "correct"?
>
> What I am doing is actually a "fixed effect" estimation. I apply a
> projection matrix to the data, both dependent and independent
> variables, a projection which renders the regressors that do not vary
> basically equal to zero - the x1 from the post.
>
> Once I apply the projection, I need to run OLS to get the estimates,
> so x1 should be zero.

Yes, but not *exactly* zero.

> Therefore, the results with the scaled regressor are not correct.
>
> Besides, I do not see why the bOLS is wrong, since it is the formula
> of the OLS estimator from any Econometrics book.

Because it's numerically unstable. Unfortunately, you can't always
translate formulas directly from books into code and expect them to be
reliable. Based on Peter's comments, I believe that, as expected, lm()
is actually getting closer to the 'correct' answer.

> Here again the corrected code:
>
> library(corpcor)
>
> n_obs <- 1000
> y <- rnorm(n_obs, 10,2.89)
> x1 <- rnorm(n_obs, 0.00000000000001235657,0.000000000000000045)
> x2 <- rnorm(n_obs, 10,3.21)
> X <- cbind(x1,x2)

As Peter points out, one example uses an intercept and the other
doesn't: you should either use X <- cbind(1,x1,x2) or lm(y~x1+x2-1)
for compatibility.

> bFE <- lm(y ~ x1 + x2)
> bFE
>
> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
> bOLS

[snip: Gmane doesn't like it if I quote "too much"]
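The intercept-compatibility point is easy to see in one dimension: with and without an intercept are different models, and the slopes differ whenever the data have an offset. A pure-Python illustration with hypothetical numbers:

```python
# Slope of y on x, fitted without and with an intercept term.
def slope_no_intercept(x, y):
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def slope_with_intercept(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # exactly 1 + 2*x

s_with = slope_with_intercept(x, y)
s_without = slope_no_intercept(x, y)

# With the intercept the slope is the true 2; without it, the slope
# absorbs part of the offset, so the two fits are not comparable.
print(s_with, s_without)
```

This is exactly why bFE (fitted with an intercept) and bOLS (built from X = cbind(x1,x2), no ones column) answer different questions, independently of any numerical issues.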