Frank Harrell
2015-Jun-15 02:22 UTC
[R] Different behavior of model.matrix between R 3.2 and R3.1.1
Terry - your example didn't demonstrate the problem because the variable
that interacted with strata (zed) was not a factor variable.
But I had stated the problem incorrectly. It's not that there are too
many strata terms; there are too many non-strata terms when the variable
interacting with the stratification factor is a factor variable. Here
is a simple example, where I have attached no packages other than the
basic startup packages.
strat <- function(x) x
d <- expand.grid(a=c('a1','a2'),
b=c('b1','b2'))
d$y <- c(1,3,2,4)
f <- y ~ a * strat(b)
m <- model.frame(f, data=d)
Terms <- terms(f, specials='strat', data=d)
specials <- attr(Terms, 'specials')
temp <- survival:::untangle.specials(Terms, 'strat', 1)
Terms <- Terms[- temp$terms]
model.matrix(Terms, m)
(Intercept) aa2 aa1:strat(b)b2 aa2:strat(b)b2
1 1 0 0 0
2 1 1 0 0
3 1 0 1 0
4 1 1 0 1
. . .
The column corresponding to a='a1' b='b2' should not be there
(aa1:strat(b)b2).
This does seem to be a change in R. Any help appreciated.
Note that after subsetting out strat terms using Terms[ - temp$terms],
Terms attributes factor and term.labels are:
attr(,"factors")
a a:strat(b)
y 0 0
a 1 2
strat(b) 0 1
attr(,"term.labels")
[1] "a" "a:strat(b)"
Frank
On 06/11/2015 08:44 AM, Therneau, Terry M., Ph.D. wrote:> Frank,
> I'm not sure what is going on. The following test function works
for
> me in both 3.1.1 and 3.2, i.e, the second model matrix has fewer
> columns. As I indicated to you earlier, the coxph code removes the
> strata() columns after creating X because I found it easier to correctly
> create the assign attribute.
>
> Can you create a worked example?
>
> require(survival)
> testfun <- function(formula, data) {
> tform <- terms(formula, specials="strata")
> mf <- model.frame(tform, data)
>
> terms2 <- terms(mf)
> strat <- untangle.specials(terms2, "strata")
> if (length(strat$terms)) terms2 <- terms2[-strat$terms]
> X <- model.matrix(terms2, mf)
> X
> }
>
> tdata <- data.frame(y= 1:10, zed = 1:10, grp >
factor(c(1,1,1,2,2,2,1,1,3,3)))
>
> testfun(y ~ zed*grp, tdata)
>
> testfun(y ~ strata(grp)*zed, tdata)
>
>
> Terry T.
>
> ----- original message --
>
> For building design matrices for Cox proportional hazards models in the
> cph function in the rms package I have always used this construct:
>
> Terms <- terms(formula, specials=c("strat",
"cluster", "strata"),
> data=data)
> specials <- attr(Terms, 'specials')
> stra <- specials$strat
> Terms.ns <- Terms
> if(length(stra)) {
> temp <- untangle.specials(Terms.ns, "strat", 1)
> Terms.ns <- Terms.ns[- temp$terms] #uses [.terms function
> }
> X <- model.matrix(Terms.ns, X)[, -1, drop=FALSE]
>
> The Terms.ns logic removes stratification factor "main effects"
so that
> if a stratification factor interacts with a non-stratification factor,
> only the interaction terms are included, not the strat. factor main
> effects. [In a Cox PH model stratification goes into the nonparametric
> survival curve part of the model].
>
> Lately this logic quit working; model.matrix keeps the unneeded main
> effects in the design matrix. Does anyone know what changed in R that
> could have caused this, and possibly a workaround?
>
>
> -------
>
>
--
------------------------------------------------------------------------
Frank E Harrell Jr Professor and Chairman School of Medicine
Department of *Biostatistics* *Vanderbilt University*
Therneau, Terry M., Ph.D.
2015-Jun-15 14:05 UTC
[R] Different behavior of model.matrix between R 3.2 and R3.1.1
Frank,
I don't think there is any way to "fix" your problem except the
way that I did it.
library(survival)
tdata <- data.frame(y=c(1,3,3,5, 5,7, 7,9, 9,13),
x1=factor(letters[c(1,1,1,1,1,2,2,2,2,2)]),
x2= c(1,2,1,2,1,2,1,2,1,2))
fit1 <- lm( y ~ x1 * strata(x2) - strata(x2), tdata)
coef(fit1)
(Intercept) x1b x1a:strata(x2)x2=2 x1b:strata(x2)x2=2
3.000000 5.000000 1.000000 1.666667
Your code is calling model.matrix with the same model frame and terms structure
as the lm
call above (I checked). In your case you know that the underlying model has 2
intercepts
(strata), one for the group with x2=1 and another for the group with x2=2, but
how is the
model.matrix routine supposed to guess that? It can't, so model.matrix
returns the proper
result for the lm call. As seen above the result is not singular, while for the
Cox model
it is singular due to the extra intercept.
This is simply an extension of leaving the "intercept" term in the
model and then removing
that column from the returned X matrix, which is necessary to have the correct
coding for
ordinary factor variables, something we've both done since day 1. In order
for
model.matrix to do the right thing with interactions, it has to know how many
intercepts
there actually are.
I've come to the conclusion that the entire thrust of 'contrasts' in
S was wrong headed,
i.e., the "remove redundant columns from the X matrix ahead of time"
logic. It is simply
not possible for the model.matrix routine to guess correctly for all y and x
combinations,
something that been acknowledged in R by changing the default for
"singular.ok" to TRUE.
Dealing with this after the fact via a good contrast function (a la SAS --
heresy!) would
have been a much better design choice. But as long as I'm in R the coxph
routine tries to
be a good citizen.
Terry T.
Frank Harrell
2015-Jun-16 23:10 UTC
[Rd] Different behavior of model.matrix between R 3.2 and R3.1.1
Terry Therneau has been very helpful on r-help but we can't figure out
what change in R in the past months made extra columns appear in
model.matrix when the terms object is subsetted to remove stratification
factors in a Cox model. Terry has changed his logic in the survival
package to avoid this issue but he requires generating a larger design
matrix then dropping columns.
A simple example is below.
strat <- function(x) x
d <- expand.grid(a=c('a1','a2'),
b=c('b1','b2'))
d$y <- c(1,3,2,4)
f <- y ~ a * strat(b)
m <- model.frame(f, data=d)
Terms <- drop.terms(terms(f, data=d), 2)
model.matrix(Terms, m)
(Intercept) aa2 aa1:strat(b)b2 aa2:strat(b)b2
1 1 0 0 0
2 1 1 0 0
3 1 0 1 0
4 1 1 0 1
. . .
The column corresponding to a='a1' b='b2' should not be there
(aa1:strat(b)b2).
This does seem to be a change in R. Any help appreciated.
Terms attributes factor and term.labels are:
attr(,"factors")
a a:strat(b)
y 0 0
a 1 2
strat(b) 0 1
attr(,"term.labels")
[1] "a" "a:strat(b)"
Frank