Hi all,

In preparation for teaching a class next week, I've been reviewing R's standard modelling algebra. I've used it for a long time and have a pretty good intuitive feel for how it works, but would like to understand more of the technical details. The best (online) reference I've found so far is the section in "An Introduction to R" (http://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models). Does anyone have any other suggestions?

I have a few questions about the definitions given in "An Introduction to R":

* "M_1 : M_2 - The tensor product of M_1 and M_2. If both terms are factors, then the 'subclasses' factor."

  From my reading, the usual interpretation of a tensor product when x and y are vectors is the outer product. I don't see how that would work here - how does a matrix work as a predictor in a linear model? In what sense is the tensor product of x with itself equal to x? What is the subclasses factor? Is it interaction(M_1, M_2, sep = "")?

* "M_1 %in% M_2 - Similar to M_1:M_2, but with a different coding."

  How is the coding different? Where is %in% documented within R? I'm pretty sure it's a different action to ?"%in%", and it's not mentioned in ?formula.

I have also read:

G. N. Wilkinson and C. E. Rogers. Symbolic description of factorial models for analysis of variance. Journal of the Royal Statistical Society. Series C (Applied Statistics), 22:392-399, 1973.

Can anyone comment on any important differences to R's modelling algebra? What does %in% correspond to in Wilkinson and Rogers' framework?

Thanks!

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
>>> Hadley Wickham <hadley at rice.edu> 02/07/2010 14:59:53 >>>
> Where is %in% documented within R? I'm pretty sure it's a different
> action to ?"%in%", and it's not mentioned in ?formula

?formula in R 2.9.2 says in para 2:

"The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b."
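One way to look at this expansion is to ask terms() how it parses each formula. This is only a rough sketch: the names a and b are placeholders, no data are involved, and how the %in% term is labelled in the printed output is deliberately not stated here. The `/` form is included for comparison because it is the usual nesting shorthand.

```r
# Placeholder names a and b; terms() only parses the formula, so no data are needed.
labels_in     <- attr(terms(~ a + b %in% a), "term.labels")
labels_colon  <- attr(terms(~ a + a:b), "term.labels")
labels_nested <- attr(terms(~ a / b), "term.labels")

print(labels_in)      # two terms: a, and b nested within a
print(labels_colon)   # "a" "a:b"
print(labels_nested)  # "a" "a:b"
```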
Hadley,

The S modeling language was designed with Wilkinson and Rogers in mind. The notation was changed from their paper to retain consistency with the parsing rules for ordinary algebra in S.

I think of ":" as an indicator of an indexing system into the dummy variables. It is not an indicator of degrees of freedom.

For simplicity in notation, let A be a factor with a levels and B be a factor with b levels. Then A:B implies a set of dummy variables with at most ab columns indexed by an A level and a B level. The degrees of freedom associated with A:B depend on the linear dependencies of the associated dummy variables with the dummy variables of other terms in the model. The excess columns can be suppressed when the dummy variables are generated or they can be pivoted out during the analysis.

When we have the special case A:A, there is only one factor mentioned, so the indexing scheme is based on just the one factor. You could generate the full set of a^2 columns, and then you would discover that they are all linearly dependent on the first a.

The columns can be labeled either

  a1b1 a1b2 a1b3 a2b1 a2b2 a2b3

or

  a1b1 a2b1 a1b2 a2b2 a1b3 a2b3

If there is crossing, we would report a single sum of squares and degrees of freedom for the interaction. If there is nesting, say a/b, then it might make sense to group the dummy variables, say (a1b1 a1b2 a1b3) and (a2b1 a2b2 a2b3), and report simple effects sums of squares and degrees of freedom for each of the groups.

The structure of the individual columns depends on the set of contrasts used for the A and B factors.

Rich
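The indexing idea can be tried out on a small made-up design. The sketch below assumes a hypothetical 2-level factor A and 3-level factor B with one observation per cell; the data frame d and the matrix names are invented for illustration.

```r
# Hypothetical data: every combination of a 2-level and a 3-level factor.
d <- expand.grid(A = factor(c("a1", "a2")),
                 B = factor(c("b1", "b2", "b3")))

# With no intercept and no main effects, A:B gives one indicator column per cell.
X_cells <- model.matrix(~ 0 + A:B, data = d)
ncol(X_cells)      # a * b = 6 columns
colnames(X_cells)  # shows which of the two labelling orders R uses

# A:A mentions only one factor, so it reduces to the term A.
attr(terms(~ A:A), "term.labels")  # "A"

# Putting the cell indicators next to an intercept and A gives more columns
# than the rank; the QR decomposition pivots the excess columns out.
X_over <- cbind(model.matrix(~ A, data = d), X_cells)
ncol(X_over)       # 8 columns
qr(X_over)$rank    # 6
```

The gap between the column count and the rank is the point made above: ":" indexes the dummy variables, while the degrees of freedom depend on what else is in the model.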
On Fri, 2 Jul 2010, Hadley Wickham wrote:

> Hi all,
>
> In preparation for teaching a class next week, I've been reviewing R's
> standard modelling algebra. I've used it for a long time and have a
> pretty good intuitive feel for how it works, but would like to
> understand more of the technical details. The best (online) reference
> I've found so far is the section in "An Introduction to R"
> (http://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models).
> Does anyone have any other suggestions?
>
> I have a few questions about the definitions given in "An Introduction to R":
>
> * "M_1 : M_2 - The tensor product of M_1 and M_2. If both terms are
> factors, then the 'subclasses' factor."
>
> From my reading, the usual interpretation of a tensor product when
> x and y are vectors is the outer product. I don't see how that would
> work here - how does a matrix work as a predictor in a linear model?

Think of it for a single observation. x and y specify terms that could be scalars or could be row vectors (e.g. ns(x), poly(y,3)), and the terms in x:y are the products of each term from x with each term from y. It is like taking the Kronecker product and then reshaping it back into a row vector.

> In what sense is the tensor product of x with itself equal to x?

This is the messy bit. The 'product' operator is not the arithmetic product, because x:x is not the same as x:z even if z=x. The product of a set of single-column terms is formed by eliminating any terms from the set that are syntactically duplicates and then taking the arithmetic product of the remaining terms. This is the Right Thing for producing design matrices, but is a bit of a mess to describe.

So x:z:log(z) contains no duplicates and produces x*z*log(z). x:z contains no duplicates and produces x*z (even if z=x), but x:z:x produces x*z and x:x produces x.

> What is the subclasses factor? Is it interaction(M_1, M_2, sep = "")?

Yes.
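The duplicate-elimination rule and the row-wise products can be checked with terms() and model.matrix(). This is a minimal sketch on made-up data: the data frames d and d2 and the variables u and v are invented for illustration, and poly() stands in for any term that expands to several columns.

```r
# Syntactic duplicates are dropped before the product is formed.
attr(terms(~ x:x), "term.labels")         # "x"
attr(terms(~ x:z:x), "term.labels")       # "x:z"
attr(terms(~ x:z:log(z)), "term.labels")  # "x:z:log(z)"

# Made-up data in which z has the same values as x but a different name.
d <- data.frame(x = c(1, 2, 3, 4))
d$z <- d$x

# x:z is not reduced, so its column is the arithmetic product x * z.
X <- model.matrix(~ x:z, data = d)
xz <- unname(X[, "x:z"])
xz                                        # 1 4 9 16

# Multi-column terms: each column of one term is multiplied by each column
# of the other, row by row, giving 2 * 3 product columns.
d2 <- data.frame(u = 1:10, v = (1:10)^2)
X2 <- model.matrix(~ 0 + poly(u, 2):poly(v, 3), data = d2)
ncol(X2)                                  # 6
```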
You might find the Wilkinson & Rogers paper more helpful:

@Article{Wilkinson.Rogers.73,
  author  = "G. N. Wilkinson and C. E. Rogers",
  title   = "Symbolic description of factorial models for analysis of variance",
  journal = "Applied Statistics",
  volume  = "22",
  pages   = "392--399",
  year    = "1973",
  comment = "Reference from MASS",
}

The notation is slightly different; R uses ':' for their '.' and '^' for their '**'. I think the algebra is the same.

    -thomas

Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle
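As a small illustration of the R side of this notation, terms() shows how '^' expands into ':' terms. The names a, b and c below are placeholders, and no data are needed because the formula is only parsed.

```r
# Placeholder names; '^' crosses the bracketed terms up to the given order.
labels_sq <- attr(terms(~ (a + b + c)^2), "term.labels")
labels_sq  # "a" "b" "c" "a:b" "a:c" "b:c"
```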
> Hadley Wickham <hadley <at> rice.edu>
> > Where is %in% documented within R? I'm pretty sure it's a different
> > action to ?"%in%", and it's not mentioned in ?formula

You can find the documentation for operators and keywords such as <-, %in%, and if by putting them in quotes:

?"%in%"
?"<-"
?"if"

Regards,
Adrian