Raffa
2019-May-25  12:38 UTC
[R] Increasing number of observations worsen the regression model
I have the following code: ``` rm(list=ls()) N = 30000 xvar <- runif(N, -10, 10) e <- rnorm(N, mean=0, sd=1) yvar <- 1 + 2*xvar + e plot(xvar,yvar) lmMod <- lm(yvar~xvar) print(summary(lmMod)) domain <- seq(min(xvar), max(xvar))??? # define a vector of x values to feed into model lines(domain, predict(lmMod, newdata = data.frame(xvar=domain)))??? # add regression line, using `predict` to generate y-values ``` I expected the coefficients to be something similar to [1,2]. Instead R keeps throwing at me random numbers that are not statistically significant and don't fit the model, and I have 20k observations. For example ``` Call: lm(formula = yvar ~ xvar) Residuals: ??? Min????? 1Q? Median????? 3Q???? Max -21.384? -8.908?? 1.016? 10.972? 23.663 Coefficients: ???????????? Estimate Std. Error t value Pr(>|t|) (Intercept) 0.0007145? 0.0670316?? 0.011??? 0.991 xvar??????? 0.0168271? 0.0116420?? 1.445??? 0.148 Residual standard error: 11.61 on 29998 degrees of freedom Multiple R-squared:? 7.038e-05,??? Adjusted R-squared: 3.705e-05 F-statistic: 2.112 on 1 and 29998 DF,? p-value: 0.1462 ``` The strange thing is that the code works perfectly for N=200 or N=2000. It's only for larger N that this thing happen U(for example, N=20000). I have tried to ask for example in CrossValidated <https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model> but the code works for them. Any help? I am runnign R 3.6.0 on Kubuntu 19.04 Best regards Raffaele [[alternative HTML version deleted]]
R. Mark Sharp
2019-May-26  15:09 UTC
[R] Increasing number of observations worsen the regression model
Raffa, I ran this on a MacOS machine and got what you expected. I added a call to sessionInfo() for your information.> rm(list=ls()) > N = 30000 > xvar <- runif(N, -10, 10) > e <- rnorm(N, mean=0, sd=1) > yvar <- 1 + 2*xvar + e > plot(xvar,yvar) > lmMod <- lm(yvar~xvar) > print(summary(lmMod))Call: lm(formula = yvar ~ xvar) Residuals: Min 1Q Median 3Q Max -4.2407 -0.6738 -0.0031 0.6822 4.0619 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.0059022 0.0057370 175.3 <2e-16 *** xvar 2.0005811 0.0009918 2017.2 <2e-16 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 0.9937 on 29998 degrees of freedom Multiple R-squared: 0.9927, Adjusted R-squared: 0.9927 F-statistic: 4.069e+06 on 1 and 29998 DF, p-value: < 2.2e-16> domain <- seq(min(xvar), max(xvar)) # define a vector of x values to feed into model > lines(domain, predict(lmMod, newdata = data.frame(xvar=domain))) # add regression line, using `predict` to generate y-values > sessionInfo()R version 3.6.0 (2019-04-26) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Mojave 10.14.4 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] compiler_3.6.0 R. Mark Sharp, Ph.D. Data Scientist and Biomedical Statistical Consultant 7526 Meadow Green St. San Antonio, TX 78251 mobile: 210-218-2868 rmsharp at me.com> On May 25, 2019, at 7:38 AM, Raffa <raffamaiden at gmail.com> wrote: > > I have the following code: > > ``` > > rm(list=ls()) > N = 30000 > xvar <- runif(N, -10, 10) > e <- rnorm(N, mean=0, sd=1) > yvar <- 1 + 2*xvar + e > plot(xvar,yvar) > lmMod <- lm(yvar~xvar) > print(summary(lmMod)) > domain <- seq(min(xvar), max(xvar)) # define a vector of x values to > feed into model > lines(domain, predict(lmMod, newdata = data.frame(xvar=domain))) # > add regression line, using `predict` to generate y-values > > ``` > > I expected the coefficients to be something similar to [1,2]. Instead R > keeps throwing at me random numbers that are not statistically > significant and don't fit the model, and I have 20k observations. For > example > > ``` > > Call: > lm(formula = yvar ~ xvar) > > Residuals: > Min 1Q Median 3Q Max > -21.384 -8.908 1.016 10.972 23.663 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 0.0007145 0.0670316 0.011 0.991 > xvar 0.0168271 0.0116420 1.445 0.148 > > Residual standard error: 11.61 on 29998 degrees of freedom > Multiple R-squared: 7.038e-05, Adjusted R-squared: 3.705e-05 > F-statistic: 2.112 on 1 and 29998 DF, p-value: 0.1462 > > ``` > > > The strange thing is that the code works perfectly for N=200 or N=2000. > It's only for larger N that this thing happen U(for example, N=20000). I > have tried to ask for example in CrossValidated > <https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model> > but the code works for them. Any help? > > I am runnign R 3.6.0 on Kubuntu 19.04 > > Best regards > > Raffaele > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Ivan Krylov
2019-May-27  08:47 UTC
[R] Increasing number of observations worsen the regression model
On Sat, 25 May 2019 14:38:07 +0200 Raffa <raffamaiden at gmail.com> wrote:> I have tried to ask for example in CrossValidated > <https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model> > but the code works for them. Any help?In the comments you note that the problem went away after you replaced Intel MKL with OpenBLAS. This is important. The code that fits linear models in R is somewhat complex[*]; if you want to get to the bottom of the problem, you may have to take parts of it and feed them differently-sized linear regression problems until you narrow it down to a specific set of calls to BLAS or LAPACK functions which Intel MKL provides. One option would be to ask at Intel MKL forums[**]. -- Best regards, Ivan [*] https://madrury.github.io/jekyll/update/statistics/2016/07/20/lm-in-R.html [**] https://software.intel.com/en-us/forums/intel-math-kernel-library/
peter dalgaard
2019-May-27  09:31 UTC
[Rd] [R] Increasing number of observations worsen the regression model
Yes, it is important that it only happens with certan BLAS, so probably not really an R issue. However, there has been some concern over the C/Fortran interfaces lately, so if you could narrow it down to a specific BLAS routine, it could prove useful for the developers. One fairly easy thing to do would be to find the breakdown point. I speculate that it could be at 16384 (=2^14) and that some sort of endianness or integer width declaration is the cause. (It would in turn suggest that MKL is using 16-bit integers somehow, which doesn't really seem credible, but you never know.) I'm moving this to the r-devel list. It certainly is not for r-help. -pd> On 27 May 2019, at 10:47 , Ivan Krylov <krylov.r00t at gmail.com> wrote: > > On Sat, 25 May 2019 14:38:07 +0200 > Raffa <raffamaiden at gmail.com> wrote: > >> I have tried to ask for example in CrossValidated >> <https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model> >> but the code works for them. Any help? > > In the comments you note that the problem went away after you replaced > Intel MKL with OpenBLAS. This is important. > > The code that fits linear models in R is somewhat complex[*]; if > you want to get to the bottom of the problem, you may have to take > parts of it and feed them differently-sized linear regression problems > until you narrow it down to a specific set of calls to BLAS or LAPACK > functions which Intel MKL provides. > > One option would be to ask at Intel MKL forums[**]. > > -- > Best regards, > Ivan > > [*] > https://madrury.github.io/jekyll/update/statistics/2016/07/20/lm-in-R.html > > [**] https://software.intel.com/en-us/forums/intel-math-kernel-library/ > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Office: A 4.23 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
Possibly Parallel Threads
- by (tapply) and for loop differences
- using eval to handle column names in function calling scatterplot graph function
- question about formula for lm
- speed up process
- Writing a helper function that takes in the dataframe and variable names and then does a subset and plot