Fox, John
2019-May-26 14:06 UTC
[R] Increasing number of observations worsen the regression model
Dear Raffaele,

Using your code, with one modification -- setting the seed for R's random
number generator to make the result reproducible -- I get:

> set.seed(12345)
. . .
> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))

Call:
lm(formula = yvar ~ xvar)

Residuals:
    Min      1Q  Median      3Q     Max
-4.0293 -0.6732  0.0021  0.6749  4.2883

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0057713  0.0057529   174.8   <2e-16 ***
xvar        2.0000889  0.0009998  2000.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9964 on 29998 degrees of freedom
Multiple R-squared:  0.9926,    Adjusted R-squared:  0.9926
F-statistic: 4.002e+06 on 1 and 29998 DF,  p-value: < 2.2e-16

which is more or less what one would expect.

My guess: you've saved your R workspace from a previous session, and it is
then loaded at the start of your R session; something in the saved workspace
is affecting the result, although frankly I can't think what that might be.

I hope this helps,
 John

-----------------------------------------------------------------
John Fox
Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: https://socialsciences.mcmaster.ca/jfox/

> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Raffa
> Sent: Saturday, May 25, 2019 8:38 AM
> To: r-help at r-project.org
> Subject: [R] Increasing number of observations worsen the regression model
>
> I have the following code:
>
> ```
> rm(list=ls())
> N <- 30000
> xvar <- runif(N, -10, 10)
> e <- rnorm(N, mean=0, sd=1)
> yvar <- 1 + 2*xvar + e
> plot(xvar, yvar)
> lmMod <- lm(yvar ~ xvar)
> print(summary(lmMod))
> domain <- seq(min(xvar), max(xvar))    # define a vector of x values to feed into the model
> lines(domain, predict(lmMod, newdata = data.frame(xvar = domain)))    # add regression line, using `predict` to generate y-values
> ```
>
> I expected the coefficients to be close to [1, 2]. Instead R keeps
> returning coefficients that are not statistically significant and don't
> fit the model, even with 30,000 observations. For example:
>
> ```
> Call:
> lm(formula = yvar ~ xvar)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -21.384  -8.908   1.016  10.972  23.663
>
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)
> (Intercept) 0.0007145  0.0670316   0.011    0.991
> xvar        0.0168271  0.0116420   1.445    0.148
>
> Residual standard error: 11.61 on 29998 degrees of freedom
> Multiple R-squared:  7.038e-05,    Adjusted R-squared:  3.705e-05
> F-statistic: 2.112 on 1 and 29998 DF,  p-value: 0.1462
> ```
>
> The strange thing is that the code works perfectly for N = 200 or
> N = 2000; it is only for larger N (for example, N = 20000) that this
> happens. I have also asked on CrossValidated
> <https://stats.stackexchange.com/questions/410050/increasing-number-of-observations-worsen-the-regression-model>
> but the code works for them. Any help?
>
> I am running R 3.6.0 on Kubuntu 19.04
>
> Best regards
>
> Raffaele
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
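A quick way to test the saved-workspace guess above is to start R with
workspace restoration disabled and confirm the session is empty before
rerunning the script; a minimal sketch using standard R startup options:

```
# Launch R without restoring a saved workspace (.RData), e.g. with:
#   R --no-restore    # skip restoring .RData
#   R --vanilla       # also skip all profile/startup files
# Then, in the fresh session, confirm nothing was carried over:
ls()                  # should print character(0) in a clean session
# If the original script still misbehaves here, a saved workspace
# is not the cause.
```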
Raffa
2019-May-26 14:09 UTC
[R] Increasing number of observations worsen the regression model
I have solved the problem: it was caused by Intel MKL. Uninstalling Intel
MKL and using OpenBLAS instead fixed the problem completely.

Note that, with Intel MKL installed, I was able to reproduce the problem
by computing the regression coefficients directly from the usual
closed-form formula; those too came back completely wrong. My guess is
that some Intel MKL optimization that is activated for "large" matrices
produces completely wrong results (I don't know which operation exactly).

Thanks,

Best,

Raffaele Mancuso
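Raffaele does not show the exact check he ran, but the "usual formula"
computation he describes can be sketched in a few lines of R (the names
`X` and `beta_hat` are illustrative; `xvar` and `yvar` come from the
original script). Both `lm()` and this computation ultimately rely on the
BLAS that R is linked against, so with a healthy BLAS the estimates should
be close to c(1, 2):

```
# Closed-form least-squares estimate via the normal equations:
#   beta_hat = (X'X)^{-1} X'y
# solve(A, b) solves the linear system instead of inverting X'X explicitly.
X <- cbind(1, xvar)                                  # design matrix: intercept + xvar
beta_hat <- solve(crossprod(X), crossprod(X, yvar))  # solves (X'X) b = X'y
print(beta_hat)                                      # expect roughly c(1, 2)

# On recent R versions, the "BLAS:" and "LAPACK:" lines printed by
# sessionInfo() show which libraries R is linked against
# (e.g. MKL vs. OpenBLAS):
sessionInfo()
```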
On 26/05/19 16:06, Fox, John wrote:
> [quoted message snipped; see John's reply and the original post above]