thr3ads.net - R devel - [Rd] Advice on package design for handling of dots in a formula [Oct 2014]

Greg Ridgeway

2014-Oct-15 14:11 UTC

[Rd] Advice on package design for handling of dots in a formula

I am working on a new package, one in which the user needs to specify the
role that different variables play in the analysis. Where I'm stumped is the
best way to have users specify those roles.

Approach #1: Separate formula for each special component

First I thought to have users specify each formula separately, like:

new.function(formula=y~X1+X2+X3,
             weights=~w,
             observationID=~ID,
             strata=~site,
             data=mydata)

This seems to be a common approach in other packages. However, one of my
testers noted that if he put formula=y~. then w, ID, and site showed up in
the model where they weren't supposed to be. I could add some code to try to
prevent that (string matching and editing the terms object, perhaps?), but
that seemed a little clumsy to me.
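
One possible way around string matching in Approach #1, offered only as a sketch: collect the role variables with all.vars() and let terms() expand the dot against a copy of the data that omits those columns. The helper name expand_dot and the toy data frame are hypothetical and not part of any existing package.

```r
# Hypothetical helper: expand '.' using only the columns without a special role.
expand_dot <- function(formula, data, ...) {
  # Each element of ... is assumed to be a one-sided formula such as ~w or ~ID
  role_vars <- unlist(lapply(list(...), all.vars))
  keep <- setdiff(names(data), role_vars)
  # terms() expands '.' to the columns of the data frame it is given
  terms(formula, data = data[keep])
}

# Toy data, made up for illustration
mydata <- data.frame(y = rnorm(6), X1 = rnorm(6), X2 = rnorm(6),
                     w = runif(6), ID = 1:6, site = gl(2, 3))

tt <- expand_dot(y ~ ., mydata,
                 weights = ~w, observationID = ~ID, strata = ~site)
attr(tt, "term.labels")
```

The returned terms object lists only the columns that were left after the role variables were set aside, so the model-building code downstream never sees w, ID, or site as covariates.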

Approach #2: Create specials to label special variables

So I turned to the user interface design in coxph where the user can specify
strata and cluster in a single formula. So my approach would look something
like:

new.function(formula=y~weights(w)+strata(site)+observationID(ID)+X1+X2+X3,
             data=mydata)

My aim would be that the user could use a dot instead of X1+X2+X3 and the
dot would not expand to include w, site, and ID. However, at least as
implemented in coxph(), this approach does not handle the dot in the formula
any better than the first approach.

Call:
coxph(formula = Surv(time, status) ~ strata(sex) + ., data = test1)

     coef exp(coef) se(coef)     z    p
x   0.802      2.23    0.822 0.976 0.33
sex    NA        NA    0.000    NA   NA

Surely the user wants the dot to mean all the other variables but not the
ones that are already in the model, like sex. I could also develop some code
(again perhaps clumsily) to search after the fact for variables (like sex)
that shouldn't be in there.
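
A sketch of how that after-the-fact search might look without string matching, assuming the specials are declared through the specials argument of terms(): the "specials" attribute gives the positions of the special calls among the model variables, and all.vars() pulls out the variable names used inside them. The helper name drop_special_dups and the toy data are hypothetical; strata() is never evaluated here, so no package needs to be loaded.

```r
# Hypothetical helper: drop dot-expanded terms that duplicate a variable
# already used inside a special such as strata().
drop_special_dups <- function(formula, data, specials) {
  tt <- terms(formula, specials = specials, data = data)
  vars <- as.list(attr(tt, "variables"))[-1]   # strip the leading `list`
  idx <- unlist(attr(tt, "specials"))          # positions of special calls in vars
  used <- unique(unlist(lapply(vars[idx], all.vars)))
  dups <- which(attr(tt, "term.labels") %in% used)
  if (length(dups) == 0) return(tt)            # nothing to remove
  drop.terms(tt, dups, keep.response = TRUE)
}

# Toy data, made up for illustration
d <- data.frame(y = rnorm(6), w = runif(6), site = gl(2, 3), x1 = rnorm(6))

res <- drop_special_dups(y ~ strata(site) + ., d, specials = "strata")
attr(res, "term.labels")
```

This only handles plain main-effect duplicates; interactions that involve a role variable would need extra care.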

Approach #3: Require the user to first describe a separate study design
object

Lastly I looked at the design for the survey package. This package first
requires the user to create an object that describes the key components of
the dataset. So I would have the user do something like this:

mystudy <- study.design(weights=~w,
                        observationID=~ID,
                        strata=~site,
                        data=mydata)
myresults <- doanalysis(formula=y~X1+X2+X3, design=mystudy)

But it seems that the survey package is also not designed to handle the dot.

library(survey)
data(api)
dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
svyglm(api00~., design=dstrat)
Error in svyglm.survey.design(api00 ~ ., design = dstrat) : 
  all variables must be in design= argument

Does anyone have advice on how best to handle this? 
1. Tell my tester "Tough, you can't use dots in a formula in my
package", which is essentially what the survey package seems to do. Encourage
the use of survey::make.formula()?
2. Fix Approach #1 to search for duplicates in the weights, observation ID,
and strata parameters. Any elegant ways to do that?
3. Fix Approach #2, the coxph style, to try to remove redundant covariates.
Not sure if there's a graceful way not involving string matching.
4. Any existing elegant approaches to interpreting the dot? Or should I just
do string matching to delete duplicate variables from the terms object?

Thanks,
Greg

Greg Ridgeway
Associate Professor
University of Pennsylvania

S Ellison

2014-Oct-15 15:50 UTC

> This seems to be a common approach in other packages. However, one of my
> testers noted that if he put formula=y~. then w, ID, and site showed up in
> the model where they weren't supposed to be.

This is the documented behaviour for '.' in a formula - it means
'everything else in the data object'.

Without changing your current code, though, your user could have said something
like
y~.-w-ID-site

if they wanted to specify 'everything _except_ the subtracted terms', so
it's not as bad as having no shortcuts at all.
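
If the package wanted to write that subtraction on the user's behalf, one possible sketch is to paste the role variable names onto the formula before calling terms(). The names role_vars, fm, and the toy data frame are assumptions made up for illustration.

```r
# Hypothetical role variables, as they might be supplied by the caller
role_vars <- c("w", "ID", "site")

# Build y ~ . - w - ID - site programmatically
fm <- as.formula(paste("y ~ . -", paste(role_vars, collapse = " - ")))

# Toy data, made up for illustration
d2 <- data.frame(y = rnorm(6), X1 = rnorm(6), X2 = rnorm(6),
                 w = runif(6), ID = 1:6, site = gl(2, 3))

# The dot is expanded against the data, then the subtracted terms are removed
labs <- attr(terms(fm, data = d2), "term.labels")
labs
```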

If you want to do the work for them, one (probably crude) way of doing it could
use drop.terms() in combination with some work with the term labels:

# A function that drops the terms in the two later arguments from the terms in
# the first and returns the resulting trimmed terms object.
f <- function(form, dropthis, dropthattoo, data) {
	tt <- terms(form, data=data) # needs data to expand '.'
	everything <- attr(tt, "term.labels")
	# could probably do without 'data' in the next two calls
	drops <- c(attr(terms(dropthis, data=data), "term.labels"),
			attr(terms(dropthattoo, data=data), "term.labels"))
	excludes <- which(everything %in% drops)
	# with nothing to drop, tt[-integer(0)] would select no terms at all
	if (length(excludes) == 0) return(tt)
	tt[-excludes]
}

d <- data.frame(a=1:10, b=10:1, g=gl(5,2), g2=gl(2,5), y=rnorm(10))

f(y~., ~g, ~b, data=d)
	# This returns a terms object, but there's a formula in that if you want it...

formula(f(y~., ~g, ~b, data=d))

 You'll need to be careful about evaluating that though; don't forget to
give any relevant model or model matrix functions the environment (data frame)
to go with it or you'll get nonsense.


S



Charles Berry

2014-Oct-15 15:55 UTC

Greg Ridgeway <gregridgeway <at> gmail.com> writes:
>
> I am working on a new package, one in which the user needs to specify the
> role that different variables play in the analysis. Where I'm stumped is
> the best way to have users specify those roles.

[delete discussion of dot in formula and specials]

> Does anyone have advice on how best to handle this?
> 1. Tell my tester "Tough, you can't use dots in a formula in my
> package", which is essentially what the survey package seems to do.
> Encourage the use of survey::make.formula()?
> 2. Fix Approach #1 to search for duplicates in the weights, observation
> ID, and strata parameters. Any elegant ways to do that?
> 3. Fix Approach #2, the coxph style, to try to remove redundant
> covariates.
> Not sure if there's a graceful way not involving string matching
> 4. Any existing elegant approaches to interpreting the dot? Or should I
> just do string matching to delete duplicate variables from the terms
> object.
>
See ?terms.formula and note the `allowDotAsName' arg.

> trms <- terms(y~speshul(x)+.,allowDotAsName=TRUE,specials="speshul")
> attr(trms,"term.labels")
[1] "speshul(x)" "."

See ?all.vars

> all.vars(trms)
[1] "y" "x" "."
> setdiff( all.vars(trms) , "." )
[1] "y" "x"

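Putting those two pieces together, a package could expand the dot by hand so that it covers only the columns not already named in the formula. The following is a rough sketch under assumptions: the helper name expand_dot_manually and the toy data are hypothetical, the formula is assumed to be two-sided, and the dot is assumed to appear only as a main-effect term.

```r
# Hypothetical helper: expand '.' by hand, skipping variables already in the formula.
expand_dot_manually <- function(formula, data, specials = NULL) {
  # Keep '.' as a literal name instead of expanding it against the data
  trms <- terms(formula, allowDotAsName = TRUE, specials = specials)
  labels <- attr(trms, "term.labels")
  if (!"." %in% labels) return(trms)
  already <- setdiff(all.vars(trms), ".")   # response and variables inside specials
  rest <- setdiff(names(data), already)     # what the dot should stand for
  new_formula <- reformulate(c(setdiff(labels, "."), rest),
                             response = trms[[2]])
  terms(new_formula, specials = specials)
}

# Toy data, made up for illustration
d3 <- data.frame(y = rnorm(4), x = rnorm(4), z1 = rnorm(4), z2 = rnorm(4))

res3 <- expand_dot_manually(y ~ speshul(x) + ., d3, specials = "speshul")
attr(res3, "term.labels")
```

Here x stays inside speshul() and is not repeated as a plain covariate, which is the behaviour the original question asks for.
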
HTH,

Chuck
