John Maindonald
2015-Mar-30 00:42 UTC
[R] Error in lm() with very small (close to zero) regressor
There are two issues:

1) Damage is done to accuracy in the calculation of t(X) %*% X. Where
a column's mean to SD ratio is large, maybe even of the order of 100
or so, centre that column.

2) pseudoinverse() goes awry when columns of X that are of the order
of large negative (or, I expect, positive) powers of ten are supplied
to it. I'd guess that this has to do with the way that a check for
singularity is implemented.

Ben's example:

set.seed(101)
n_obs <- 1000
y  <- rnorm(n_obs, 10, 2.89)
x1 <- rnorm(n_obs, mean=1.235657e-14, sd=4.5e-17)
x2 <- rnorm(n_obs, 10, 3.21)
X  <- cbind(rep(1, n_obs), x1, x2)

coef(lm(y ~ x1 + x2))
##   (Intercept)            x1            x2
##  1.155959e+01 -1.658420e+14  3.797342e-02

library(corpcor)
t(cbind(coef(lm(y ~ x1 + x2)),
        as.vector(pseudoinverse(t(X) %*% X) %*% t(X) %*% y)))
##      (Intercept)            x1         x2
## [1,]   11.559587 -1.658420e+14 0.03797342
## [2,]    9.511348  1.151908e-13 0.03788714

## Examine mean to SD ratios
round(c(mean(x1)/sd(x1), mean(x2)/sd(x2)), 2)
## [1] 263.65   3.28
## Notice that the coefficients for x2, where the mean/sd ratio
## is smaller, roughly agree.

## Damage was done to accuracy in the calculation of t(X) %*% X
## (but there is more to it than that, as we will see).
## Try centring the columns:
xc1 <- scale(x1, center=TRUE, scale=FALSE)
xc2 <- scale(x2, center=TRUE, scale=FALSE)
Xc <- cbind(rep(1, n_obs), xc1, xc2)  # NB: the original post bound x1, x2 here
as.vector(pseudoinverse(t(Xc) %*% Xc) %*% t(Xc) %*% y)
## [1] 9.511348e+00 1.151908e-13 3.788714e-02
## (output as posted, i.e. for the uncentred matrix cbind(rep(1,n_obs), x1, x2))

## Note now, however, that one should be able to dispense with
## the column of 1s, with no change to the coefficients of x1 & x2
Xc0 <- cbind(xc1, xc2)
as.vector(pseudoinverse(t(Xc0) %*% Xc0) %*% t(Xc0) %*% y)
## [1] 1.971167e-20 3.788714e-02

## Now try a more sensible scaling for xc1
Xcs  <- cbind(rep(1, n_obs), xc1*1e14, xc2)
Xcs0 <- cbind(xc1*1e14, xc2)
t(cbind(coef(lm(y ~ I(1e14*xc1) + xc2)),
        as.vector(pseudoinverse(t(Xcs) %*% Xcs) %*% t(Xcs) %*% y)))
##      (Intercept) I(1e+14 * xc1)        xc2
## [1,]    9.899249       -1.65842 0.03797342
## [2,]    9.899249       -1.65842 0.03797342
## Eureka!

## cf also
as.vector(pseudoinverse(t(Xcs0) %*% Xcs0) %*% t(Xcs0) %*% y)
## [1] -1.65842037  0.03797342

John Maindonald             email: john.maindonald at anu.edu.au
Centre for Mathematics & Its Applications,
Australian National University, Canberra ACT 0200.
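PS: A rough check on the singularity guess in 2). If I read the
corpcor help pages correctly, pseudoinverse() works from a singular
value decomposition and treats singular values below roughly
max(dim(m)) * max(d) * .Machine$double.eps as exactly zero. A minimal
sketch of that check, in base R, re-creating the simulated data from
Ben's example:

set.seed(101)
n_obs <- 1000
y  <- rnorm(n_obs, 10, 2.89)
x1 <- rnorm(n_obs, mean=1.235657e-14, sd=4.5e-17)
x2 <- rnorm(n_obs, 10, 3.21)
X  <- cbind(rep(1, n_obs), x1, x2)

XtX <- crossprod(X)      # same as t(X) %*% X, computed in one step
d   <- svd(XtX)$d        # singular values of the cross-product matrix
tol <- max(dim(XtX)) * max(d) * .Machine$double.eps
d / max(d)   # the component carrying x1 is tens of orders of magnitude below 1
d > tol      # expect FALSE for that component: it is truncated to zero,
             # so the x1 coefficient that pseudoinverse() returns tells us nothing

Rescaling x1 by 1e14 lifts that singular value back above the
tolerance, which is why the "Eureka" fits above agree with lm().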
On 29/03/2015, at 23:00, r-help-request at r-project.org wrote:

From: Ben Bolker <bbolker at gmail.com>
Subject: Re: [R] Error in lm() with very small (close to zero) regressor
Date: 29 March 2015 6:28:22 NZDT
To: r-help at stat.math.ethz.ch

peter dalgaard <pdalgd at gmail.com> writes:

>> On 28 Mar 2015, at 00:32 , RiGui <raluca.gui at business.uzh.ch> wrote:
>>
>> Hello everybody,
>>
>> I have encountered the following problem with lm(): when running
>> lm() with a regressor close to zero - of the order 1e-10 - the value
>> of the estimate is of huge absolute value, of the order of millions.
>> However, if I write the formula of the OLS estimator in matrix
>> notation, pseudoinverse(t(X) %*% X) %*% t(X) %*% y, the results are
>> correct, meaning the estimate has value 0.
>
> How do you know this answer is "correct"?
>
>> here is the code:
>>
>> y  <- rnorm(n_obs, 10, 2.89)
>> x1 <- rnorm(n_obs, 0.00000000000001235657, 0.000000000000000045)
>> x2 <- rnorm(n_obs, 10, 3.21)
>> X  <- cbind(x1, x2)
>
> Eh? You want X <- cbind(rep(1, n_obs), x1, x2)
>
>> bFE <- lm(y ~ x1 + x2)
>> bFE
>> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
>> bOLS
>>
>> Note: I am applying a deviation from the mean projection to the
>> data, that is why I have some regressors with such small values.
>>
>> Thank you for any help!
>>
>> Raluca
>
> Is there a reason you can't scale your regressors?
> Example not reproducible:

I agree that the OP's question was not reproducible, but it's not too
hard to make it reproducible. I bothered to use library("sos");
findFn("pseudoinverse") to find pseudoinverse() in corpcor. It is true
that we get estimates with very large magnitudes, but their standard
errors are correspondingly huge:

set.seed(101)
n_obs <- 1000
y  <- rnorm(n_obs, 10, 2.89)
x1 <- rnorm(n_obs, mean=1.235657e-14, sd=4.5e-17)
x2 <- rnorm(n_obs, 10, 3.21)
X  <- cbind(x1, x2)
bFE <- lm(y ~ x1 + x2)
bFE
coef(summary(bFE))
                 Estimate   Std. Error     t value  Pr(>|t|)
(Intercept)  1.155959e+01 2.312956e+01  0.49977541 0.6173435
x1          -1.658420e+14 1.872598e+15 -0.08856254 0.9294474
x2           3.797342e-02 2.813000e-02  1.34992593 0.1773461

library("corpcor")
bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
bOLS
             [,1]
[1,] 9.807664e-16
[2,] 8.880273e-01

And if we scale the predictor:

bFE2 <- lm(y ~ I(1e14*x1) + x2)
coef(summary(bFE2))
                 Estimate Std. Error     t value  Pr(>|t|)
(Intercept)   11.55958731   23.12956  0.49977541 0.6173435
I(1e+14 * x1) -1.65842037   18.72598 -0.08856254 0.9294474
x2             0.03797342    0.02813  1.34992593 0.1773461

bOLS stays constant. To be honest, I haven't thought about this enough
to see which answer is actually correct, although I suspect the
problem is in bOLS: the numerical methods behind lm (unlike the
brute-force pseudoinverse method given here) have been carefully
considered for numerical stability.
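One way to back up that suspicion: lm() never forms t(X) %*% X at all;
it works from a QR decomposition of X, whereas forming the
cross-product roughly squares the condition number of the problem. A
minimal sketch in base R (qr(), qr.coef() and kappa() are all
standard), re-using the simulated data from above:

set.seed(101)
n_obs <- 1000
y  <- rnorm(n_obs, 10, 2.89)
x1 <- rnorm(n_obs, mean=1.235657e-14, sd=4.5e-17)
x2 <- rnorm(n_obs, 10, 3.21)
X  <- cbind("(Intercept)" = rep(1, n_obs), x1 = x1, x2 = x2)

qr.coef(qr(X), y)    # should match coef(lm(y ~ x1 + x2)), huge x1 estimate and all
kappa(X)             # condition number of X itself ...
kappa(crossprod(X))  # ... versus t(X) %*% X: roughly the square, hence the damage

So the huge x1 coefficient from lm() is the numerically defensible
answer for these data; the near-zero bOLS value is an artifact of the
pseudoinverse truncating the direction that carries x1.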