Alberto Monteiro
2008-Aug-18  18:31 UTC
[R] A doubt about "lm" and the meaning of its summary
I have a conceptual problem (sort of). Maybe there's a simple solution, 
maybe not.
First, let me explain the test case that goes ok.
Let x be a (fixed) n-dimensional vector. I simulate a lot of linear models, 
y = m x + c + error, then I do a lot of regressions. As expected, the 
estimated values of m (let's call them m.bar) are distributed according to a
Student's t distribution.
More precisely (test case, it takes a few minutes to run):
#
# start with fixed numbers
#
  m <- sample(c(-1, -0.1, 0.1, 1), 1)
  c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
  sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
  n <- sample(c(4,5,10,1000), 1)
  x <- 1:n # it's possible to use other x
  NN <- sample(c(3000, 4000, 5000), 1)
  m.bar <- m.std.error <- 0  # these vectors will hold the simulation output
#
# Now let's repeat NN times:
# simulate y
# regress y ~ x
# store m.bar and its error
#
  for (i in 1:NN) {
    y <- m * x + c + sigma * rnorm(n)
    reg <- lm(y ~ x)
    m.bar[i] <- summary(reg)$coefficients['x', 'Estimate']
    m.std.error[i] <- summary(reg)$coefficients['x', 'Std. Error']
  }
#
# Finally, let's analyse it
# The distribution of (m.bar - m) / m.std.error is
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
  print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))
Now, this works as expected: the p-value returned by ks.test is (roughly) 
uniformly distributed on the 0-1 interval across repeated runs.
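For reference (not part of the original post), a minimal sketch of how that 
p-value can be pulled out of the ks.test result:
#
# ks.test returns an "htest" object; its p.value component is the
# number referred to above
#
  p.val <- ks.test((m.bar - m) / m.std.error, rt(NN, n - 2))$p.value
  print(p.val)  # under the null, roughly Uniform(0, 1) across repeated runs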
Next, I tried a slightly different model. This is a reversion model, 
where the simulated random variable lp evolves according to the equation:
lp[t + 1] <- (1 + m) * lp[t] + c + error
I tried to use the same method as above, defining 
x = lp[1:n], y = lp[2:(n+1)] - lp[1:n], with equation
y <- m * x + c + error
And now it breaks. Test case (again, it takes a few minutes to run):
#
# start with fixed numbers
#
  m <- runif(1, -1, 0)  # m must be something in the (-1,0) interval
  c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
  sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
  n <- sample(c(4,5,10,1000), 1)
  NN <- sample(c(3000, 4000, 5000), 1)
  m.bar <- m.std.error <- 0  # these vectors will hold the simulation output
#
# Now let's repeat NN times:
# simulate y
# regress y ~ x
# store m.bar and its error
#
  for (i in 1:NN) {
    lp <- 0
    lp[1] <- 0
    for (j in 1:n) {
      lp[j+1] = (1 + m) * lp[j] + c + sigma * rnorm(1)
    }
    x <- lp[1:n]
    y <- lp[-1] - x
    reg <- lm(y ~ x)
    m.bar[i] <- summary(reg)$coefficients['x', 'Estimate']
    m.std.error[i] <- summary(reg)$coefficients['x', 'Std. Error']
  }
#
# Finally, let's analyse it
# The distribution of (m.bar - m) / m.std.error should be
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
  print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))
... and now it's not.
What's wrong with this? I suspect that the regression y ~ x only gives 
meaningful values when x is deterministic; in this case x is stochastic. Is 
there any correct way to estimate this model and get meaningful outputs?
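For reference (not part of the original thread), a minimal sketch with assumed 
parameter values, illustrating the point that x is stochastic here: the 
regressor x = lp[1:n] is built from the same shocks that drive y, unlike the 
fixed x of the first test case.
#
# simulate one path of lp with illustrative (assumed) parameter values
#
  set.seed(1)
  n <- 1000; m <- -0.5; c <- 0.1; sigma <- 0.5
  eps <- sigma * rnorm(n)
  lp <- numeric(n + 1)
  for (j in 1:n) lp[j + 1] <- (1 + m) * lp[j] + c + eps[j]
  x <- lp[1:n]
#
# x at time t+1 contains the shock from time t, so the regressor is
# correlated with past errors and is not a fixed design
#
  print(cor(x[-1], eps[-n]))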
Alberto Monteiro
Moshe Olshansky
2008-Aug-19  01:41 UTC
[R] A doubt about "lm" and the meaning of its summary
Hi Alberto,
In your second case the linear model y = a*x + b + error does not hold.
Moshe Olshansky
2008-Aug-19  02:19 UTC
[R] A doubt about "lm" and the meaning of its summary
Hi Alberto,
Please disregard my previous note - I probably had a black-out!!!