Greg Ridgeway
2014-Oct-15 14:11 UTC
[Rd] Advice on package design for handling of dots in a formula
I am working on a new package, one in which the user needs to specify the role that different variables play in the analysis. Where I'm stumped is the best way to have users specify those roles.

Approach #1: Separate formula for each special component

First I thought to have users specify each formula separately, like:

    new.function(formula=y~X1+X2+X3,
                 weights=~w,
                 observationID=~ID,
                 strata=~site,
                 data=mydata)

This seems to be a common approach in other packages. However, one of my testers noted that if he put formula=y~. then w, ID, and site showed up in the model where they weren't supposed to be. I could add some code to try to prevent that (string matching and editing the terms object, perhaps?), but that seemed a little clumsy to me.

Approach #2: Create specials to label special variables

So I turned to the user interface design in coxph where the user can specify strata and cluster in a single formula. So my approach would look something like:

    new.function(formula=y~weights(w)+strata(site)+observationID(ID)+X1+X2+X3,
                 data=mydata)

My aim would be that the user could use a dot instead of X1+X2+X3 and the dot would not expand to include w, site, and ID. However, at least as implemented in coxph(), this approach does not handle the dot in the formula any better than the first approach.

    Call:
    coxph(formula = Surv(time, status) ~ strata(sex) + ., data = test1)

         coef exp(coef) se(coef)     z    p
    x   0.802      2.23    0.822 0.976 0.33
    sex    NA        NA    0.000    NA   NA

Surely the user wants the dot to mean all the other variables but not the ones that are already in the model, like sex. I could also develop some code (again perhaps clumsily) to search after the fact for variables (like sex) that shouldn't be in there.

Approach #3: Require the user to first describe a separate study design object

Lastly I looked at the design for the survey package. This package first requires the user to create an object that describes the key components of the dataset. So I would have the user do something like this:

    mystudy <- study.design(weights=~w,
                            observationID=~ID,
                            strata=~site,
                            data=mydata)
    myresults <- doanalysis(formula=y~X1+X2+X3, design=mystudy)

But it seems that the survey package is also not designed to handle the dot.

    data(api)
    dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
    svyglm(api00~., design=dstrat)
    Error in svyglm.survey.design(api00 ~ ., design = dstrat) :
      all variables must be in design= argument

Does anyone have advice on how best to handle this?

1. Tell my tester "Tough, you can't use dots in a formula in my package", essentially what the survey package seems to do. Encourage the use of survey::make.formula()?
2. Fix Approach #1 to search for duplicates in the weights, observation ID, and strata parameters. Any elegant ways to do that?
3. Fix Approach #2, the coxph style, to try to remove redundant covariates. Not sure if there's a graceful way not involving string matching.
4. Any existing elegant approaches to interpreting the dot? Or should I just do string matching to delete duplicate variables from the terms object?

Thanks,
Greg

Greg Ridgeway
Associate Professor
University of Pennsylvania
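To make the problem in Approach #1 concrete, here is a minimal sketch of how the dot expands. The data frame and its column names (w, ID, site, X1..X3) are made up to mirror the post; nothing here comes from the actual package.

```r
# Hypothetical data using the role variables named in the post.
mydata <- data.frame(y = rnorm(6), X1 = rnorm(6), X2 = rnorm(6), X3 = rnorm(6),
                     w = runif(6), ID = 1:6, site = gl(2, 3))

# terms() expands '.' to every column of 'data' that is not on the
# left-hand side of the formula.
dot_labels <- attr(terms(y ~ ., data = mydata), "term.labels")
dot_labels

# Variables named in the separate one-sided role formulas.
role_vars <- unlist(lapply(list(~w, ~ID, ~site), all.vars))

# Role variables that leaked into the model through the dot.
leaked <- intersect(dot_labels, role_vars)
leaked
```

The last line lists exactly the variables the tester saw turning up as covariates, which is the set any fix would have to remove.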
S Ellison
2014-Oct-15 15:50 UTC
[Rd] Advice on package design for handling of dots in a formula
> This seems to be a common approach in other packages. However, one of my
> testers noted that if he put formula=y~. then w, ID, and site showed up in the
> model where they weren't supposed to be.

This is the documented behaviour for '.' in a formula - it means 'everything else in the data object'.

Without changing your current code, though, your user could have said something like

    y~.-w-ID-site

if they wanted to specify 'everything _except_ the subtracted terms', so it's not as bad as having no shortcuts at all.

If you want to do the work for them, one (probably crude) way of doing it could use drop.terms() in combination with some work with the term labels:

    # A function that drops the terms in two later arguments from the terms
    # in the first and returns the resulting trimmed terms object.
    f <- function(form, dropthis, dropthattoo, data) {
        trms <- terms(form, data=data)    # needs data to expand '.'
        everything <- attr(trms, "term.labels")
        # could probably do without 'data' for the two drop formulas
        drops <- c(attr(terms(dropthis, data=data), "term.labels"),
                   attr(terms(dropthattoo, data=data), "term.labels"))
        excludes <- which(everything %in% drops)
        # nothing matched: return as is, since a zero-length negative
        # index would select no terms at all
        if (length(excludes) == 0) return(trms)
        trms[-excludes]
    }

    d <- data.frame(a=1:10, b=10:1, g=gl(5,2), g2=gl(2,5), y=rnorm(10))
    f(y~., ~g, ~b, data=d)

    # This returns a terms object, but there's a formula in that if you want it....
    formula(f(y~., ~g, ~b, data=d))

You'll need to be careful about evaluating that though; don't forget to give any relevant model or model matrix functions the environment (data frame) to go with it or you'll get nonsense.

S
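Both suggestions above can be tried out in a few lines. The sketch below first checks the user-side subtraction shortcut, then shows a package-side variant that works on variable names and returns a formula. `expand_dot` and its `roles` argument are hypothetical names introduced for illustration, not part of any package. It assumes a two-sided formula and only removes terms whose label is exactly a role variable's name, so a term such as log(w) or an interaction would pass through untouched.

```r
d <- data.frame(a = 1:10, b = 10:1, g = gl(5, 2), g2 = gl(2, 5), y = rnorm(10))

# User-side shortcut: subtract the unwanted terms from the expanded dot.
shortcut_labels <- attr(terms(y ~ . - g - b, data = d), "term.labels")
shortcut_labels

# Package-side sketch (hypothetical helper): expand the dot, drop the role
# variables by name, and rebuild a formula with the original response.
expand_dot <- function(form, roles, data) {
  labels <- attr(terms(form, data = data), "term.labels")
  role_vars <- unlist(lapply(roles, all.vars))
  keep <- setdiff(labels, role_vars)
  if (length(keep) == 0) keep <- "1"           # intercept-only fallback
  out <- reformulate(keep, response = form[[2]])
  environment(out) <- environment(form)        # keep the caller's environment
  out
}

expand_dot(y ~ ., roles = list(~g, ~b), data = d)
```

Returning a formula rather than a terms object means the result can be passed on to model.frame() or a fitting function together with the same data frame.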
Charles Berry
2014-Oct-15 15:55 UTC
[Rd] Advice on package design for handling of dots in a formula
Greg Ridgeway <gregridgeway <at> gmail.com> writes:

>
> I am working on a new package, one in which the user needs to specify the
> role that different variables play in the analysis. Where I'm stumped is
> the best way to have users specify those roles.

[delete discussion of dot in formula and specials]

>
> Does anyone have advice on how best to handle this?
> 1. Tell my tester "Tough, you can't use dots in a formula in my
> package", essentially what the survey package seems to do. Encourage the
> use of survey::make.formula()?
> 2. Fix Approach #1 to search for duplicates in the weights, observation
> ID, and strata parameters. Any elegant ways to do that?
> 3. Fix Approach #2, the coxph style, to try to remove redundant
> covariates.
> Not sure if there's a graceful way not involving string matching
> 4. Any existing elegant approaches to interpreting the dot? Or should I
> just do string matching to delete duplicate variables from the terms
> object.
>

See ?terms.formula and note the `allowDotAsName' arg.

> trms <- terms(y~speshul(x)+.,allowDotAsName=TRUE,specials="speshul")
> attr(trms,"term.labels")
[1] "speshul(x)" "."

See ?all.vars

> all.vars(trms)
[1] "y" "x" "."
> setdiff( all.vars(trms) , "." )
[1] "y" "x"
>

HTH,

Chuck
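Putting the two hints together, one possible way to get the coxph-style interface to treat the dot as "all the other variables" is sketched below. `expand_dot_specials` is a hypothetical helper, and the special names weights, strata and observationID are taken from the original post's Approach #2. The sketch assumes a two-sided formula in which the dot appears as a plain additive term; forms such as `. - x` or `.^2` are not handled.

```r
# Hypothetical helper: replace '.' with the columns of 'data' that are not
# already named anywhere else in the formula (response, ordinary terms, or
# inside a special such as strata()).
expand_dot_specials <- function(form, data,
                                specials = c("weights", "strata", "observationID")) {
  # Without 'data', allowDotAsName = TRUE keeps '.' as a literal term.
  trms <- terms(form, specials = specials, allowDotAsName = TRUE)
  labels <- attr(trms, "term.labels")
  if (!"." %in% labels) return(form)           # no dot, nothing to expand
  used <- setdiff(all.vars(trms), ".")         # variables already spoken for
  rest <- setdiff(names(data), used)           # what the dot should mean
  out <- reformulate(c(setdiff(labels, "."), rest), response = form[[2]])
  environment(out) <- environment(form)
  out
}

# Made-up data mirroring the variable names in the original post.
mydata <- data.frame(y = rnorm(6), X1 = rnorm(6), X2 = rnorm(6),
                     w = runif(6), ID = 1:6, site = gl(2, 3))
expanded <- expand_dot_specials(
  y ~ strata(site) + weights(w) + observationID(ID) + ., mydata)
expanded
```

Because the variables are collected with all.vars() rather than by matching term labels as strings, a variable used inside a special is excluded from the dot without any string editing of the terms object.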