Alberto Monteiro
2008-Aug-18 18:31 UTC
[R] A doubt about "lm" and the meaning of its summary
I have a conceptual problem (sort of). Maybe there's a simple solution, maybe not.

First, let me explain the test case that goes ok.

Let x be a (fixed) n-dimensional vector. I simulate a lot of linear models, y = m x + c + error, then I do a lot of regressions. As expected, the estimated values of m (let's call them m.bar) are distributed according to a Student's t distribution.

More precisely (test case, it takes a few minutes to run):

#
# start with fixed numbers
#
m <- sample(c(-1, -0.1, 0.1, 1), 1)
c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
n <- sample(c(4, 5, 10, 1000), 1)
x <- 1:n # it's possible to use other x
NN <- sample(c(3000, 4000, 5000), 1)
m.bar <- m.std.error <- numeric(NN) # these vectors will hold the simulation output
#
# Now let's repeat NN times:
#   simulate y
#   regress y ~ x
#   store m.bar and its error
#
for (i in 1:NN) {
  y <- m * x + c + sigma * rnorm(n)
  reg <- lm(y ~ x)
  m.bar[i] <- summary(reg)$coefficients['x', 'Estimate']
  m.std.error[i] <- summary(reg)$coefficients['x', 'Std. Error']
}
#
# Finally, let's analyse it.
# The distribution of (m.bar - m) / m.std.error is
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))

Now, this goes ok, with ks.test returning a p-value that is uniformly distributed in the 0-1 interval.

Next, I did a slightly different model. This is a reversion model, where the simulated random variable lp follows the equation:

lp[t + 1] <- (1 + m) * lp[t] + c + error

I tried to use the same method as above, defining x = lp[1:n], y = lp[2:(n+1)] - lp[1:n], with equation

y <- m * x + c + error

And now it breaks. Test case (it takes some minutes to run):

#
# start with fixed numbers
#
m <- runif(1, -1, 0) # m must be something in the (-1, 0) interval
c <- sample(c(-1, -0.1, 0, 0.1, 1), 1)
sigma <- sample(c(0.1, 0.2, 0.5, 1), 1)
n <- sample(c(4, 5, 10, 1000), 1)
NN <- sample(c(3000, 4000, 5000), 1)
m.bar <- m.std.error <- numeric(NN) # these vectors will hold the simulation output
#
# Now let's repeat NN times:
#   simulate lp, derive y
#   regress y ~ x
#   store m.bar and its error
#
for (i in 1:NN) {
  lp <- numeric(n + 1)
  lp[1] <- 0
  for (j in 1:n) {
    lp[j + 1] <- (1 + m) * lp[j] + c + sigma * rnorm(1)
  }
  x <- lp[1:n]
  y <- lp[-1] - x
  reg <- lm(y ~ x)
  m.bar[i] <- summary(reg)$coefficients['x', 'Estimate']
  m.std.error[i] <- summary(reg)$coefficients['x', 'Std. Error']
}
#
# Finally, let's analyse it.
# The distribution of (m.bar - m) / m.std.error should be
# a Student's t with n - 2 degrees of freedom.
# Is it? Let's test!
#
print(ks.test((m.bar - m) / m.std.error, rt(NN, n - 2)))

... and now it's not.

What's wrong with this? I suspect that the model y ~ x only gives meaningful values when x is deterministic; in this case x is stochastic. Is there any correct way to estimate this model so that it gives meaningful outputs?

Alberto Monteiro
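[For illustration, a small sketch of the same mean-reversion experiment, not from the original thread; the fixed values of m, c, sigma, n and NN below are arbitrary placeholders standing in for the sample()/runif() draws above. It writes the AR(1) recursion with stats::filter() and uses the one-sample form of ks.test() against the t CDF ("pt"), which avoids the extra Monte Carlo noise of comparing against a second simulated sample rt(NN, n - 2).]

set.seed(1)                       # arbitrary seed, for reproducibility
m <- -0.5                         # assumed fixed values, in place of runif()/sample()
c <- 0.1
sigma <- 0.5
n <- 1000
NN <- 3000
m.bar <- m.std.error <- numeric(NN)
for (i in 1:NN) {
  # same recursion lp[j + 1] = (1 + m) * lp[j] + c + e[j], written with filter()
  e <- c + sigma * rnorm(n)
  lp <- c(0, as.numeric(stats::filter(e, 1 + m, method = "recursive")))
  x <- lp[1:n]
  y <- lp[-1] - x
  cf <- summary(lm(y ~ x))$coefficients
  m.bar[i] <- cf["x", "Estimate"]
  m.std.error[i] <- cf["x", "Std. Error"]
}
# one-sample Kolmogorov-Smirnov test against the theoretical t(n - 2) CDF
print(ks.test((m.bar - m) / m.std.error, "pt", df = n - 2))

[Swapping in one of the small n values from the post should reproduce the discrepancy being asked about.]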
Moshe Olshansky
2008-Aug-19 01:41 UTC
[R] A doubt about "lm" and the meaning of its summary
Hi Alberto,

In your second case the linear model y = a*x + b + error does not hold.
Moshe Olshansky
2008-Aug-19 02:19 UTC
[R] A doubt about "lm" and the meaning of its summary
Hi Alberto,

Please disregard my previous note - I probably had a black-out!!!