Alberto Monteiro
2008-Aug-18  18:31 UTC
[R] A doubt about "lm" and the meaning of its summary
I have a conceptual problem (sort of). Maybe there's a simple solution, 
maybe not.
First, let me explain the test case that goes ok.
Let x be a (fixed) n-dimensional vector. I simulate a lot of linear models, 
y = m x + c + error, then I do a lot of regressions. As expected, the 
estimated values of m (let's call them m.bar) are distributed according to a
Student's t distribution.
More precisely (test case, it takes a few minutes to run):
#
# start with fixed numbers
#
  m <- sample(c(-1, -0.1, 0.1, 1), 1)
  c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
  sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
  n <- sample(c(4,5,10,1000), 1)
  x <- 1:n # it's possible to use other x
  NN <- sample(c(3000, 4000, 5000), 1)
  m.bar <- m.std.error <- 0  # these vectors will hold the simulation output
#
# Now let's repeat NN times:
# simulate y
# regress y ~ x
# store m.bar and its error
#
  for (i in 1:NN) {
    y <- m * x + c + sigma * rnorm(n)
    reg <- lm(y ~ x)
    m.bar[i] <- summary(reg)$coefficients['x', 'Estimate']
    m.std.error[i] <- summary(reg)$coefficients['x', 'Std. Error']
  }
#
# Finally, let's analyse it
# The distribution of (m.bar - m) / m.std.error is
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
  print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))
Now, this works as expected: the p-value returned by ks.test is (roughly) 
uniformly distributed on the 0-1 interval across repeated runs.
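For reference (not part of the original post), a minimal sketch of how that 
p-value can be pulled out of the ks.test result:
#
# ks.test returns an "htest" object; its p.value component is the
# number referred to above
#
  p.val <- ks.test((m.bar - m) / m.std.error, rt(NN, n - 2))$p.value
  print(p.val)  # under the null, roughly Uniform(0, 1) across repeated runs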
Next, I tried a slightly different model. This is a reversion model, 
where the simulated random variable lp evolves according to the equation:
lp[t + 1] <- (1 + m) * lp[t] + c + error
I tried to use the same method as above, defining 
x = lp[1:n], y = lp[2:(n+1)] - lp[1:n], with equation
y <- m * x + c + error
And now it breaks. Test case (again, it takes a few minutes to run):
#
# start with fixed numbers
#
  m <- runif(1, -1, 0)  # m must be something in the (-1,0) interval
  c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
  sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
  n <- sample(c(4,5,10,1000), 1)
  NN <- sample(c(3000, 4000, 5000), 1)
  m.bar <- m.std.error <- 0  # these vectors will hold the simulation output
#
# Now let's repeat NN times:
# simulate y
# regress y ~ x
# store m.bar and its error
#
  for (i in 1:NN) {
    lp <- 0
    lp[1] <- 0
    for (j in 1:n) {
      lp[j+1] = (1 + m) * lp[j] + c + sigma * rnorm(1)
    }
    x <- lp[1:n]
    y <- lp[-1] - x
    reg <- lm(y ~ x)
    m.bar[i] <- summary(reg)$coefficients['x', 'Estimate']
    m.std.error[i] <- summary(reg)$coefficients['x', 'Std. Error']
  }
#
# Finally, let's analyse it
# The distribution of (m.bar - m) / m.std.error should be
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
  print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))
... and now it's not.
What's wrong with this? I suspect that the regression y ~ x only gives 
meaningful values when x is deterministic; in this case x is stochastic. Is 
there any correct way to estimate this model and get meaningful outputs?
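For reference (not part of the original thread), a minimal sketch with assumed 
parameter values, illustrating the point that x is stochastic here: the 
regressor x = lp[1:n] is built from the same shocks that drive y, unlike the 
fixed x of the first test case.
#
# simulate one path of lp with illustrative (assumed) parameter values
#
  set.seed(1)
  n <- 1000; m <- -0.5; c <- 0.1; sigma <- 0.5
  eps <- sigma * rnorm(n)
  lp <- numeric(n + 1)
  for (j in 1:n) lp[j + 1] <- (1 + m) * lp[j] + c + eps[j]
  x <- lp[1:n]
#
# x at time t+1 contains the shock from time t, so the regressor is
# correlated with past errors and is not a fixed design
#
  print(cor(x[-1], eps[-n]))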
Alberto Monteiro
Moshe Olshansky
2008-Aug-19  01:41 UTC
[R] A doubt about "lm" and the meaning of its summary
Hi Alberto,
In your second case the linear model y = a*x + b + error does not hold.
Moshe Olshansky
2008-Aug-19  02:19 UTC
[R] A doubt about "lm" and the meaning of its summary
Hi Alberto,
Please disregard my previous note - I probably had a black-out!!!