thr3ads.net - R devel - [Rd] [R] Increasing number of observations worsen the regression model [May 2019]


Raffa

2019-May-25 12:38 UTC

[R] Increasing number of observations worsen the regression model

I have the following code:

```

rm(list=ls())
N = 30000
xvar <- runif(N, -10, 10)
e <- rnorm(N, mean=0, sd=1)
yvar <- 1 + 2*xvar + e
plot(xvar,yvar)
lmMod <- lm(yvar~xvar)
print(summary(lmMod))
domain <- seq(min(xvar), max(xvar))  # define a vector of x values to feed into model
lines(domain, predict(lmMod, newdata = data.frame(xvar=domain)))  # add regression line, using `predict` to generate y-values

```

I expected the coefficients to be close to [1, 2]. Instead, R keeps
giving me seemingly random estimates that are not statistically
significant and do not fit the model, even though I have tens of
thousands of observations. For example:

```

Call:
lm(formula = yvar ~ xvar)

Residuals:
    Min      1Q  Median      3Q     Max
-21.384  -8.908   1.016  10.972  23.663

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0007145  0.0670316   0.011    0.991
xvar        0.0168271  0.0116420   1.445    0.148

Residual standard error: 11.61 on 29998 degrees of freedom
Multiple R-squared:  7.038e-05,    Adjusted R-squared: 3.705e-05
F-statistic: 2.112 on 1 and 29998 DF,  p-value: 0.1462

```


The strange thing is that the code works perfectly for N=200 or N=2000.
The problem only appears for larger N (for example, N=20000). I also
asked on CrossValidated
<https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model>
but the code works for them. Any help?

I am running R 3.6.0 on Kubuntu 19.04

Best regards

Raffaele



R. Mark Sharp

2019-May-26 15:09 UTC


Raffa, 

I ran this on a MacOS machine and got what you expected. I added a call to
sessionInfo() for your information.
> rm(list=ls())
> N = 30000
> xvar <- runif(N, -10, 10)
> e <- rnorm(N, mean=0, sd=1)
> yvar <- 1 + 2*xvar + e
> plot(xvar,yvar)
> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))
Call:
lm(formula = yvar ~ xvar)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2407 -0.6738 -0.0031  0.6822  4.0619 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.0059022  0.0057370   175.3   <2e-16 ***
xvar        2.0005811  0.0009918  2017.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9937 on 29998 degrees of freedom
Multiple R-squared:  0.9927,	Adjusted R-squared:  0.9927 
F-statistic: 4.069e+06 on 1 and 29998 DF,  p-value: < 2.2e-16
> domain <- seq(min(xvar), max(xvar))    # define a vector of x values to feed into model
> lines(domain, predict(lmMod, newdata = data.frame(xvar=domain)))    # add regression line, using `predict` to generate y-values
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.0




R. Mark Sharp, Ph.D.











Ivan Krylov

2019-May-27 08:47 UTC


On Sat, 25 May 2019 14:38:07 +0200
Raffa <raffamaiden at gmail.com> wrote:
> I have tried to ask for example in CrossValidated
> <https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model>
> but the code works for them. Any help?

In the comments you note that the problem went away after you replaced
Intel MKL with OpenBLAS. This is important.
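
To confirm which BLAS and LAPACK libraries a given R session is actually linked against, something like the following may help. This is only a sketch: it assumes R 3.4.0 or later, where sessionInfo() records the library paths, and the paths printed will differ from system to system.

```r
# Report the BLAS/LAPACK libraries used by the running R session.
si <- sessionInfo()
cat("BLAS:          ", si$BLAS, "\n")       # path of the BLAS shared library
cat("LAPACK:        ", si$LAPACK, "\n")     # path of the LAPACK shared library
cat("LAPACK version:", La_version(), "\n")  # version string reported by LAPACK
```

If the BLAS path points at an MKL library, the session is using Intel MKL rather than R's reference BLAS or OpenBLAS.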

The code that fits linear models in R is somewhat complex[*]; if
you want to get to the bottom of the problem, you may have to take
parts of it and feed them differently-sized linear regression problems
until you narrow it down to a specific set of calls to BLAS or LAPACK
functions which Intel MKL provides.
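
A rough sketch of that narrowing-down, under some assumptions: compare_routes() is a hypothetical helper (not part of R), and the comments about which routes exercise which libraries reflect my understanding of base R rather than anything verified against MKL. It fits the same simulated problem through several numerical routes and compares each against a closed-form estimate that uses only elementwise arithmetic and sum().

```r
# Hypothetical helper: fit y = a + b*x through several numerical routes
# so that a route giving a wrong answer can be singled out.
compare_routes <- function(N, seed = 1) {
  set.seed(seed)
  x <- runif(N, -10, 10)
  y <- 1 + 2 * x + rnorm(N)
  X <- cbind(1, x)  # design matrix with an intercept column

  # Reference: closed-form simple regression, elementwise arithmetic only.
  b_ref <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
  a_ref <- mean(y) - b_ref * mean(x)

  rbind(
    reference  = c(a_ref, b_ref),
    lm         = unname(coef(lm(y ~ x))),
    lm_fit     = unname(coef(lm.fit(X, y))),
    qr_linpack = unname(qr.coef(qr(X), y)),                 # default LINPACK-style QR
    qr_lapack  = unname(qr.coef(qr(X, LAPACK = TRUE), y)),  # LAPACK QR
    normal_eq  = as.vector(solve(crossprod(X), crossprod(X, y)))  # matrix products + solve
  )
}

# Columns are (intercept, slope); every row should be close to (1, 2).
for (N in c(200, 2000, 20000, 30000)) {
  cat("N =", N, "\n")
  print(compare_routes(N))
}
```

Rows that disagree with the reference row at large N would point at the routines worth reporting.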

One option would be to ask at Intel MKL forums[**].

-- 
Best regards,
Ivan

[*]
https://madrury.github.io/jekyll/update/statistics/2016/07/20/lm-in-R.html

[**] https://software.intel.com/en-us/forums/intel-math-kernel-library/

peter dalgaard

2019-May-27 09:31 UTC


[Rd] [R] Increasing number of observations worsen the regression model

Yes, it is important that it only happens with certain BLAS implementations, so
this is probably not really an R issue.
However, there has been some concern over the C/Fortran interfaces lately, so if
you could narrow it down to a specific BLAS routine, it could prove useful for
the developers.

One fairly easy thing to do would be to find the breakdown point. I speculate
that it could be at 16384 (=2^14) and that some sort of endianness or integer
width declaration is the cause. (It would in turn suggest that MKL is using
16-bit integers somehow, which doesn't really seem credible, but you never
know.)
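
One way to locate the breakdown point is a bisection over N. The sketch below makes assumptions that are not from the thread: fit_ok() and find_breakdown() are hypothetical helpers, the tolerance of 0.1 on the slope is arbitrary, and the search presumes the fit is good for every N below some threshold and bad from the threshold upwards.

```r
# Hypothetical check: TRUE if lm() recovers a slope close to the true value 2.
fit_ok <- function(N, tol = 0.1) {
  set.seed(42)
  x <- runif(N, -10, 10)
  y <- 1 + 2 * x + rnorm(N)
  abs(unname(coef(lm(y ~ x))[2]) - 2) < tol
}

# Bisection for the smallest N at which ok(N) is FALSE, assuming ok(lo) is TRUE.
# Returns NA if ok(hi) is still TRUE, i.e. no breakdown in [lo, hi].
find_breakdown <- function(ok, lo, hi) {
  stopifnot(lo < hi, ok(lo))
  if (ok(hi)) return(NA_integer_)
  while (hi - lo > 1) {
    mid <- (lo + hi) %/% 2
    if (ok(mid)) lo <- mid else hi <- mid  # keep ok(lo) TRUE and ok(hi) FALSE
  }
  hi
}

print(find_breakdown(fit_ok, 2000, 30000))
```

On a correctly working BLAS this should print NA; on the affected setup it would report the first N at which the estimate goes wrong, which can then be compared with the 2^14 guess.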

I'm moving this to the r-devel list. It certainly is not for r-help.

-pd
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
