Robert Gentleman
2000-Dec-18 19:17 UTC
[Rd] Some potential changes (enhancements) to formulas and models
Here is part 1 of my long saga towards a more flexible modeling paradigm. Comments and hints are especially welcome.

The simple version: Starting with a formula and data, R goes through 3 main steps to get the data into a form suitable for fitting.

1) application of terms
2) application of model.frame (subset and na.action occur in 2).
3) application of model.matrix

To be concrete, think of the following two formulas:

  F1: y~a*log(b)
  F2: y~a*(1+ exp(b*t))

My goal is to introduce a meaningful way of specifying which symbols are parameters and which are data. For now I'm just going to talk about the terms function, and within that just about the factors component that is returned. I've had a look at the man pages and the White book (ch 2).

The factors component is supposed to be a matrix with the variable names along the rows and the terms (whatever they are) as columns.

If in F1 both a and b are variables, then the terms are

  terms: a, log(b), a:log(b)

and the variable names are

  vars: y, a, log(b)

If a or b is a parameter (say it's a), then

  terms: log(b)
  vars: y, log(b)

I think it would be easier if log(b) was in the terms but b was in the vars.

With F2, things aren't so simple. Currently we get:

  y ~ a + exp(b * t) + a:exp(b * t)
  attr(,"variables")
  list(y, a, exp(b * t))
  attr(,"factors")
             a exp(b * t) a:exp(b * t)
  y          0          0            0
  a          1          0            1
  exp(b * t) 0          1            1

Which is ok, given that we haven't said anything about the variables, but it certainly won't help us to build a model.

Now suppose I want to identify a and b as parameters; then I want to get:

  y ~ a * ( 1 + exp( b * t ) )
  terms: t
  vars: y, t

If just a is a parameter, then I think that we should get

  terms: t, b, t:b   or   terms: y, exp(b*t)
  vars: y, t, b

Open questions:

1) When do the special formula operators work and when do they take their usual interpretation?
2) I think that model.frame should produce a dataframe with columns corresponding to the variables in the model.
   - model.matrix is then responsible for using the model frame and the terms to produce a model matrix.
   In the F1 example, under this scheme the model frame would contain y, a and log(b). The model matrix would have a and log(b) in it.
   If b appears on its own and inside a function call, does that correspond to two variables or one?
3) In some sense we need to define what a term is and what a variable is. We need to do that in a way that is meaningful for both linear and nonlinear (and possibly graphical) models.
4) Is the notion of model.matrix useful outside of linear models? If so, is it the place where we code up the contrasts?

Robert
--
+---------------------------------------------------------------------------+
| Robert Gentleman                 phone : (617) 632-5250                   |
| Associate Professor              fax:    (617) 632-2444                   |
| Department of Biostatistics      office: not yet                          |
| Harvard School of Public Health  email: rgentlem@jimmy.dfci.harvard.edu   |
+---------------------------------------------------------------------------+
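
For reference, the three steps described in the post (terms, model.frame, model.matrix) can be reproduced for F1 with a short sketch. This is only an illustration of the existing behaviour where both a and b are treated as variables, not of the proposed parameter handling. The data frame d and its values are hypothetical and are not part of the original message. The last line uses all.vars, which returns the bare symbols, so b rather than log(b) shows up; that is close to the "b in the vars" idea raised above.

```r
# Hypothetical toy data; the values are made up for illustration only.
d <- data.frame(y = c(1.2, 2.3, 3.1, 4.8),
                a = c(0.5, 1.0, 1.5, 2.0),
                b = c(1, 2, 3, 4))

f1 <- y ~ a * log(b)

# Step 1: terms() expands the formula and records the factors matrix.
tt <- terms(f1)
term_labels <- attr(tt, "term.labels")  # columns of the factors matrix
fac <- attr(tt, "factors")
var_names <- rownames(fac)              # rows of the factors matrix

# Step 2: model.frame() evaluates the variables
# (subset and na.action would be applied at this step).
mf <- model.frame(tt, data = d)
mf_cols <- names(mf)

# Step 3: model.matrix() builds the design matrix from terms and frame.
mm <- model.matrix(tt, mf)
mm_cols <- colnames(mm)

# Bare symbols in the formula, ignoring the enclosing function calls.
syms <- all.vars(f1)

print(fac)
print(mf_cols)
print(mm_cols)
print(syms)
```

Here the model frame holds y, a and log(b), while all.vars reports y, a and b, which shows the two candidate notions of "variable" side by side.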