Alexandra
2012-Jan-27 13:04 UTC
[R] Why does the order of terms in a formula translate into different models/ model matrices?
Dear all, I have encountered some strange things when creating lm objects in R: model depends on the order of the terms specified in a formula. Let us consider the following simple example:> dat <- expand.grid(A = factor(c("a1", "a2")),+ B = factor(paste("b", 1:4, sep="")), + rep = factor(1:2))> set.seed(12345) > dat$x <- 1:16 * .5 + runif(16) > dat$Y <- rnorm(16)> datA B rep x Y 1 a1 b1 1 1.220904 -0.2841597 2 a2 b1 1 1.875773 -0.9193220 3 a1 b2 1 2.260982 -0.1162478 4 a2 b2 1 2.886125 1.8173120 5 a1 b3 1 2.956481 0.3706279 6 a2 b3 1 3.166372 0.5202165 7 a1 b4 1 3.825095 -0.7505320 8 a2 b4 1 4.509224 0.8168998 9 a1 b1 2 5.227705 -0.8863575 10 a2 b1 2 5.989737 -0.3315776 11 a1 b2 2 5.534535 1.1207127 12 a2 b2 2 6.152373 0.2987237 13 a1 b3 2 7.235685 0.7796219 14 a2 b3 2 7.001137 1.4557851 15 a1 b4 2 7.891203 -0.6443284 16 a2 b4 2 8.462495 -1.5531374> logLik(m0 <- lm(Y ~ A:B + x:A, dat))'log Lik.' -13.22186 (df=11)> logLik(m1 <- lm(Y ~ x:A + A:B, dat))'log Lik.' -13.66822 (df=10) To me it is a bit strange that m0 and m1 models appear to have different loglikelihood (only the order of the terms in a formula was changed!) My guess is that the problem lies in model.matrix: X1 <- model.matrix(~ x:A + A:B, dat) X2 <- model.matrix(~ A:B + x:A, dat) ## number of columns: ncol(X1) ## 9> ncol(X2) ## 11## rank of design matrices: qr(X1)$rank ## 9 qr(X2)$rank ## 10 Will be very grateful if someone could help me here! Thanks, Alexandra -- View this message in context: http://r.789695.n4.nabble.com/Why-does-the-order-of-terms-in-a-formula-translate-into-different-models-model-matrices-tp4333408p4333408.html Sent from the R help mailing list archive at Nabble.com.
Ben Bolker
2012-Jan-27 16:29 UTC
[R] Why does the order of terms in a formula translate into different models/ model matrices?
Alexandra <alku <at> imm.dtu.dk> writes:> > Dear all, > > I have encountered some strange things when creating lm objects in R: model > depends on the order of the terms specified in a formula. > Let us consider the following simple example: > > > dat <- expand.grid(A = factor(c("a1", "a2")), > + B = factor(paste("b", 1:4, sep="")), > + rep = factor(1:2)) > > set.seed(12345) > > dat$x <- 1:16 * .5 + runif(16) > > dat$Y <- rnorm(16) > > > dat[snip]> > > logLik(m0 <- lm(Y ~ A:B + x:A, dat)) > 'log Lik.' -13.22186 (df=11) > > > logLik(m1 <- lm(Y ~ x:A + A:B, dat)) > 'log Lik.' -13.66822 (df=10) > > To me it is a bit strange that m0 and m1 models appear to have different > loglikelihood (only the order of the terms in a formula was changed!) > > My guess is that the problem lies in model.matrix: > > X1 <- model.matrix(~ x:A + A:B, dat) > X2 <- model.matrix(~ A:B + x:A, dat) > > ## number of columns: > ncol(X1) ## 9 > > > ncol(X2) ## 11 > > ## rank of design matrices: > qr(X1)$rank ## 9 > > qr(X2)$rank ## 10I agree that this seems weird, and I haven't been able to figure it out yet. My best (not very well-informed) guess is that there is something going on with automatic dropping of terms that appear to be aliased?? and that this test is (perhaps unintentionally) order- dependent. I don't see anything obvious in the R code (model.matrix.default) that would be responsible, and I haven't had time (and probably won't) to go through the underlying C code (do_modelmatrix in src/main/model.c) to see what's going on. For what it's worth, the ultimate reference for how this stuff is *supposed* to work is (I think) the "White Book" (Statistical Models in S, Chambers and Hastie) -- I don't alas have a copy. I don't see any further references to model.matrix or 'model matrix' in the R manuals, other than a fairly short description in section 11 of the "Intro to R". In other words, I'm stumped too and I hope someone else can provide a more informed answer. Ben Bolker
cberry at tajo.ucsd.edu
2012-Jan-27 21:39 UTC
[R] Why does the order of terms in a formula translate into different models/ model matrices?
Alexandra <alku at imm.dtu.dk> writes:> Dear all, > > I have encountered some strange things when creating lm objects in R: model > depends on the order of the terms specified in a formula. > Let us consider the following simple example: > >> dat <- expand.grid(A = factor(c("a1", "a2")), > + B = factor(paste("b", 1:4, sep="")), > + rep = factor(1:2)) >> set.seed(12345) >> dat$x <- 1:16 * .5 + runif(16) >> dat$Y <- rnorm(16) > >> dat > A B rep x Y > 1 a1 b1 1 1.220904 -0.2841597 > 2 a2 b1 1 1.875773 -0.9193220 > 3 a1 b2 1 2.260982 -0.1162478 > 4 a2 b2 1 2.886125 1.8173120 > 5 a1 b3 1 2.956481 0.3706279 > 6 a2 b3 1 3.166372 0.5202165 > 7 a1 b4 1 3.825095 -0.7505320 > 8 a2 b4 1 4.509224 0.8168998 > 9 a1 b1 2 5.227705 -0.8863575 > 10 a2 b1 2 5.989737 -0.3315776 > 11 a1 b2 2 5.534535 1.1207127 > 12 a2 b2 2 6.152373 0.2987237 > 13 a1 b3 2 7.235685 0.7796219 > 14 a2 b3 2 7.001137 1.4557851 > 15 a1 b4 2 7.891203 -0.6443284 > 16 a2 b4 2 8.462495 -1.5531374 > > >> logLik(m0 <- lm(Y ~ A:B + x:A, dat)) > 'log Lik.' -13.22186 (df=11) > >> logLik(m1 <- lm(Y ~ x:A + A:B, dat)) > 'log Lik.' -13.66822 (df=10) > > To me it is a bit strange that m0 and m1 models appear to have different > loglikelihood (only the order of the terms in a formula was changed!) > > My guess is that the problem lies in model.matrix: > > X1 <- model.matrix(~ x:A + A:B, dat) > X2 <- model.matrix(~ A:B + x:A, dat)Close, but not quite. The problem lies in terms() Here are the attr(terms(...),"factors") matrices: > attributes(terms(Y ~ x:A + A:B,data=dat))$factors x:A A:B Y 0 0 x 2 0 A 2 2 B 0 1> attributes(terms(Y ~ A:B + x:A ,data=dat))$factorsA:B A:x Y 0 0 A 2 2 B 2 0 x 0 1 As you see, the encoding of x and B are treated differently under the two orderings. See ?terms.object for what those codes mean. Same deal for these seemingly equivalent formulae:> attributes(terms(Y ~ (x + A + B)^2-A,data=dat))$factorsx B x:A x:B A:B Y 0 0 0 0 0 x 1 0 2 1 0 A 0 0 1 0 1 B 0 1 0 1 1> attributes(terms(Y ~ (A + B + x)^2-A,data=dat))$factorsB x A:B A:x B:x Y 0 0 0 0 0 A 0 0 1 1 0 B 1 0 2 0 1 x 0 1 0 1 1>AFAICS, this is a bug. HTH, Chuck> > > ## number of columns: > ncol(X1) ## 9 > >> ncol(X2) ## 11 > > ## rank of design matrices: > qr(X1)$rank ## 9 > > qr(X2)$rank ## 10 > > > Will be very grateful if someone could help me here! > > Thanks, > > Alexandra > > > > > > -- > View this message in context: http://r.789695.n4.nabble.com/Why-does-the-order-of-terms-in-a-formula-translate-into-different-models-model-matrices-tp4333408p4333408.html > Sent from the R help mailing list archive at Nabble.com. >-- Charles C. Berry Dept of Family/Preventive Medicine cberry at ucsd edu UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901