thr3ads.net - R help - [R] Why does the order of terms in a formula translate into different models/ model matrices? [Jan 2012]

If this information is useful, please help other people find it:
Share via:

Alexandra

2012-Jan-27 13:04 UTC

[R] Why does the order of terms in a formula translate into different models/ model matrices?

Dear all,

I have encountered some strange things when creating lm objects in R:  model
depends on the order of the terms specified in a formula.
Let us consider the following simple example:
> dat <- expand.grid(A = factor(c("a1", "a2")),+                    B = factor(paste("b", 1:4, sep="")),
+                    rep = factor(1:2))> set.seed(12345)
> dat$x <- 1:16 * .5 + runif(16)
> dat$Y <- rnorm(16)
> dat    A  B rep        x          Y
1  a1 b1   1 1.220904 -0.2841597
2  a2 b1   1 1.875773 -0.9193220
3  a1 b2   1 2.260982 -0.1162478
4  a2 b2   1 2.886125  1.8173120
5  a1 b3   1 2.956481  0.3706279
6  a2 b3   1 3.166372  0.5202165
7  a1 b4   1 3.825095 -0.7505320
8  a2 b4   1 4.509224  0.8168998
9  a1 b1   2 5.227705 -0.8863575
10 a2 b1   2 5.989737 -0.3315776
11 a1 b2   2 5.534535  1.1207127
12 a2 b2   2 6.152373  0.2987237
13 a1 b3   2 7.235685  0.7796219
14 a2 b3   2 7.001137  1.4557851
15 a1 b4   2 7.891203 -0.6443284
16 a2 b4   2 8.462495 -1.5531374

> logLik(m0 <- lm(Y ~ A:B + x:A, dat))'log Lik.' -13.22186 (df=11)
> logLik(m1 <- lm(Y ~ x:A + A:B, dat))'log Lik.' -13.66822 (df=10)

To me it is a bit strange that m0 and m1 models appear to have different
loglikelihood (only the order of the terms in a formula was changed!) 

My guess is that the problem lies in model.matrix:

X1 <- model.matrix(~ x:A + A:B, dat)
X2 <- model.matrix(~ A:B + x:A, dat)


## number of columns:
ncol(X1) ## 9
> ncol(X2) ## 11
## rank of design matrices:
qr(X1)$rank ## 9

qr(X2)$rank ## 10

 
Will be very grateful if someone could help me here!

Thanks,

Alexandra





--
View this message in context:
http://r.789695.n4.nabble.com/Why-does-the-order-of-terms-in-a-formula-translate-into-different-models-model-matrices-tp4333408p4333408.html
Sent from the R help mailing list archive at Nabble.com.

Ben Bolker

2012-Jan-27 16:29 UTC

head link

[R] Why does the order of terms in a formula translate into different models/ model matrices?

Alexandra <alku <at> imm.dtu.dk> writes:
> 
> Dear all,
> 
> I have encountered some strange things when creating lm objects in R: 
model
> depends on the order of the terms specified in a formula.
> Let us consider the following simple example:
> 
> > dat <- expand.grid(A = factor(c("a1", "a2")),
> +                    B = factor(paste("b", 1:4,
sep="")),
> +                    rep = factor(1:2))
> > set.seed(12345)
> > dat$x <- 1:16 * .5 + runif(16)
> > dat$Y <- rnorm(16)
> 
> > dat
  [snip]
> 
> > logLik(m0 <- lm(Y ~ A:B + x:A, dat))
> 'log Lik.' -13.22186 (df=11)
> 
> > logLik(m1 <- lm(Y ~ x:A + A:B, dat))
> 'log Lik.' -13.66822 (df=10)
> 
> To me it is a bit strange that m0 and m1 models appear to have different
> loglikelihood (only the order of the terms in a formula was changed!) 
> 
> My guess is that the problem lies in model.matrix:
> 
> X1 <- model.matrix(~ x:A + A:B, dat)
> X2 <- model.matrix(~ A:B + x:A, dat)
> 
> ## number of columns:
> ncol(X1) ## 9
> 
> > ncol(X2) ## 11
> 
> ## rank of design matrices:
> qr(X1)$rank ## 9
> 
> qr(X2)$rank ## 10
   I agree that this seems weird, and I haven't been able to figure
it out yet.  My best (not very well-informed) guess is that there is
something going on with automatic dropping of terms that appear to
be aliased?? and that this test is (perhaps unintentionally) order-
dependent. I don't see anything obvious in the R code 
(model.matrix.default) that would be responsible, and I haven't
had time (and probably won't) to go through the underlying C code
(do_modelmatrix in src/main/model.c) to see what's going on.

  For what it's worth, the ultimate reference for how this stuff
is *supposed* to work is (I think) the "White Book" (Statistical
Models in S, Chambers and Hastie) -- I don't alas have a copy.
I don't see any further references to model.matrix or 'model matrix'
in the R manuals, other than a fairly short description in section
11 of the "Intro to R".

  In other words, I'm stumped too and I hope someone else can
provide a more informed answer.

  Ben Bolker

cberry at tajo.ucsd.edu

2012-Jan-27 21:39 UTC

head link

[R] Why does the order of terms in a formula translate into different models/ model matrices?

Alexandra <alku at imm.dtu.dk> writes:
> Dear all,
>
> I have encountered some strange things when creating lm objects in R: 
model
> depends on the order of the terms specified in a formula.
> Let us consider the following simple example:
>
>> dat <- expand.grid(A = factor(c("a1", "a2")),
> +                    B = factor(paste("b", 1:4,
sep="")),
> +                    rep = factor(1:2))
>> set.seed(12345)
>> dat$x <- 1:16 * .5 + runif(16)
>> dat$Y <- rnorm(16)
>
>> dat
>     A  B rep        x          Y
> 1  a1 b1   1 1.220904 -0.2841597
> 2  a2 b1   1 1.875773 -0.9193220
> 3  a1 b2   1 2.260982 -0.1162478
> 4  a2 b2   1 2.886125  1.8173120
> 5  a1 b3   1 2.956481  0.3706279
> 6  a2 b3   1 3.166372  0.5202165
> 7  a1 b4   1 3.825095 -0.7505320
> 8  a2 b4   1 4.509224  0.8168998
> 9  a1 b1   2 5.227705 -0.8863575
> 10 a2 b1   2 5.989737 -0.3315776
> 11 a1 b2   2 5.534535  1.1207127
> 12 a2 b2   2 6.152373  0.2987237
> 13 a1 b3   2 7.235685  0.7796219
> 14 a2 b3   2 7.001137  1.4557851
> 15 a1 b4   2 7.891203 -0.6443284
> 16 a2 b4   2 8.462495 -1.5531374
>
>
>> logLik(m0 <- lm(Y ~ A:B + x:A, dat))
> 'log Lik.' -13.22186 (df=11)
>
>> logLik(m1 <- lm(Y ~ x:A + A:B, dat))
> 'log Lik.' -13.66822 (df=10)
>
> To me it is a bit strange that m0 and m1 models appear to have different
> loglikelihood (only the order of the terms in a formula was changed!) 
>
> My guess is that the problem lies in model.matrix:
>
> X1 <- model.matrix(~ x:A + A:B, dat)
> X2 <- model.matrix(~ A:B + x:A, dat)

Close, but not quite. The problem lies in terms()

Here are the attr(terms(...),"factors") matrices:

 > attributes(terms(Y ~ x:A + A:B,data=dat))$factors
  x:A A:B
Y   0   0
x   2   0
A   2   2
B   0   1>   attributes(terms(Y ~ A:B + x:A ,data=dat))$factors  A:B A:x
Y   0   0
A   2   2
B   2   0
x   0   1

As you see, the encoding of x and B are treated differently under the
two orderings.

See ?terms.object for what those codes mean.

Same deal for these seemingly equivalent formulae:

>   attributes(terms(Y ~ (x + A + B)^2-A,data=dat))$factors  x B x:A x:B A:B
Y 0 0   0   0   0
x 1 0   2   1   0
A 0 0   1   0   1
B 0 1   0   1   1>   attributes(terms(Y ~ (A + B + x)^2-A,data=dat))$factors  B x A:B A:x B:x
Y 0 0   0   0   0
A 0 0   1   1   0
B 1 0   2   0   1
x 0 1   0   1   1> 

AFAICS, this is a bug.

HTH,

Chuck
>
>
> ## number of columns:
> ncol(X1) ## 9
>
>> ncol(X2) ## 11
>
> ## rank of design matrices:
> qr(X1)$rank ## 9
>
> qr(X2)$rank ## 10
>
>  
> Will be very grateful if someone could help me here!
>
> Thanks,
>
> Alexandra
>
>
>
>
>
> --
> View this message in context:
http://r.789695.n4.nabble.com/Why-does-the-order-of-terms-in-a-formula-translate-into-different-models-model-matrices-tp4333408p4333408.html
> Sent from the R help mailing list archive at Nabble.com.
>
-- 
Charles C. Berry                            Dept of Family/Preventive Medicine
cberry at ucsd edu			    UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

R help - Jan 2012 - Why does the order of terms in a formula translate into different models/ model matrices?

[R] Why does the order of terms in a formula translate into different models/ model matrices?

[R] Why does the order of terms in a formula translate into different models/ model matrices?

[R] Why does the order of terms in a formula translate into different models/ model matrices?

Apparently Analagous Threads