thr3ads.net - R devel - [Rd] problem using model.frame() [Aug 2005]

If this information is useful, please help other people find it:
Share via:

Gavin Simpson

2005-Aug-16 15:15 UTC

[Rd] problem using model.frame()

Hi I'm having a problem with model.frame, encapsulated in this example:

y1 <- matrix(c(3,1,0,1,0,1,1,0,0,0,1,0,0,0,1,1,0,1,1,1),
             nrow = 5, byrow = TRUE)
y1 <- as.data.frame(y1)
rownames(y1) <- paste("site", 1:5, sep = "")
colnames(y1) <- paste("spp", 1:4, sep = "")
y1

model.frame(~ y1)
Error in model.frame(formula, rownames, variables, varnames, extras, extranames,
:
	invalid variable type

temp <- as.matrix(y1)
model.frame(~ temp)
  temp.spp1 temp.spp2 temp.spp3 temp.spp4
1         3         1         0         1
2         0         1         1         0
3         0         0         1         0
4         0         0         1         1
5         0         1         1         1

Ideally the above wouldn't have names like temp.var1, temp.var2, but one
could deal with that later.

I have tracked down the source of the error message to line 1330 in
model.c - here I'm stumped as I don't know any C, but it looks as if the
code is looping over the variables in the formula and checking of they
are the right "type". So a matrix of variables gets through, but a
data.frame doesn't.

It would be good if model.frame could cope with data.frames in formulae,
but seeing as I am incapable of providing a patch, is there a way around
this problem?

Below is the head of the function I am currently using, including the
function for parsing the formula - borrowed and hacked from
ordiParseFormula() in package vegan.

I can work out the class of the rhs of the forumla. Is there a way to
create a suitable environment for the data argument of parseFormula()
such that it contains the rhs dataframe coerced to a matrix, which then
should get through model.frame.default without error? How would I go
about manipulating/creating such an environment? Any other ideas?

Thanks in advance

Gav

coca.formula <- function(formula, method = c("predictive",
"symmetric"),
                         reg.method = c("simpls", "eigen"),
weights = NULL,
                         n.axes = NULL, symmetric = FALSE, data)
  {
    parseFormula <- function (formula, data)
      {
        browser()
        Terms <- terms(formula, "Condition", data = data)
        flapart <- fla <- formula <- formula(Terms, width.cutoff = 500)
        specdata <- formula[[2]]
        X <- eval(specdata, data, parent.frame())
        X <- as.matrix(X)
        formula[[2]] <- NULL
        if (formula[[2]] == "1" || formula[[2]] == "0")
          Y <- NULL
        else {
          mf <- model.frame(formula, data, na.action = na.fail)
          Y <- model.matrix(formula, mf)
          if (any(colnames(Y) == "(Intercept)")) {
            xint <- which(colnames(Y) == "(Intercept)")
            Y <- Y[, -xint, drop = FALSE]
          }
        }
        list(X = X, Y = Y)
      }
    if (missing(data))
      data <- parent.frame()
    #browser()
    dat <- parseFormula(formula, data)

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [T] +44 (0)20 7679 5522
ENSIS Research Fellow             [F] +44 (0)20 7679 7565
ENSIS Ltd. & ECRC                 [E] gavin.simpsonATNOSPAMucl.ac.uk
UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
London.  WC1H 0AP.
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

Gabor Grothendieck

2005-Aug-16 15:25 UTC

head link

[Rd] problem using model.frame()

It can handle data frames like this:

	model.frame(y1)
or
	model.frame(~., y1)


On 8/16/05, Gavin Simpson <gavin.simpson at ucl.ac.uk>
wrote:> Hi I'm having a problem with model.frame, encapsulated in this example:
> 
> y1 <- matrix(c(3,1,0,1,0,1,1,0,0,0,1,0,0,0,1,1,0,1,1,1),
>             nrow = 5, byrow = TRUE)
> y1 <- as.data.frame(y1)
> rownames(y1) <- paste("site", 1:5, sep = "")
> colnames(y1) <- paste("spp", 1:4, sep = "")
> y1
> 
> model.frame(~ y1)
> Error in model.frame(formula, rownames, variables, varnames, extras,
extranames,  :
>        invalid variable type
> 
> temp <- as.matrix(y1)
> model.frame(~ temp)
>  temp.spp1 temp.spp2 temp.spp3 temp.spp4
> 1         3         1         0         1
> 2         0         1         1         0
> 3         0         0         1         0
> 4         0         0         1         1
> 5         0         1         1         1
> 
> Ideally the above wouldn't have names like temp.var1, temp.var2, but
one
> could deal with that later.
> 
> I have tracked down the source of the error message to line 1330 in
> model.c - here I'm stumped as I don't know any C, but it looks as
if the
> code is looping over the variables in the formula and checking of they
> are the right "type". So a matrix of variables gets through, but
a
> data.frame doesn't.
> 
> It would be good if model.frame could cope with data.frames in formulae,
> but seeing as I am incapable of providing a patch, is there a way around
> this problem?
> 
> Below is the head of the function I am currently using, including the
> function for parsing the formula - borrowed and hacked from
> ordiParseFormula() in package vegan.
> 
> I can work out the class of the rhs of the forumla. Is there a way to
> create a suitable environment for the data argument of parseFormula()
> such that it contains the rhs dataframe coerced to a matrix, which then
> should get through model.frame.default without error? How would I go
> about manipulating/creating such an environment? Any other ideas?
> 
> Thanks in advance
> 
> Gav
> 
> coca.formula <- function(formula, method = c("predictive",
"symmetric"),
>                         reg.method = c("simpls",
"eigen"), weights = NULL,
>                         n.axes = NULL, symmetric = FALSE, data)
>  {
>    parseFormula <- function (formula, data)
>      {
>        browser()
>        Terms <- terms(formula, "Condition", data = data)
>        flapart <- fla <- formula <- formula(Terms, width.cutoff =
500)
>        specdata <- formula[[2]]
>        X <- eval(specdata, data, parent.frame())
>        X <- as.matrix(X)
>        formula[[2]] <- NULL
>        if (formula[[2]] == "1" || formula[[2]] == "0")
>          Y <- NULL
>        else {
>          mf <- model.frame(formula, data, na.action = na.fail)
>          Y <- model.matrix(formula, mf)
>          if (any(colnames(Y) == "(Intercept)")) {
>            xint <- which(colnames(Y) == "(Intercept)")
>            Y <- Y[, -xint, drop = FALSE]
>          }
>        }
>        list(X = X, Y = Y)
>      }
>    if (missing(data))
>      data <- parent.frame()
>    #browser()
>    dat <- parseFormula(formula, data)
> 
> --
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> Gavin Simpson                     [T] +44 (0)20 7679 5522
> ENSIS Research Fellow             [F] +44 (0)20 7679 7565
> ENSIS Ltd. & ECRC                 [E] gavin.simpsonATNOSPAMucl.ac.uk
> UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
> 26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
> London.  WC1H 0AP.
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Gavin Simpson

2005-Aug-18 13:22 UTC

head link

[Rd] problem using model.frame()

On Thu, 2005-08-18 at 09:00 -0400, Gabor Grothendieck
wrote:> I think this one is a hard call.  Designing software is a
> series of tradeoffs. Its nice to maintain consistency with
> the R base, but in case of extensions (rather than changing
> behavior) as in this case, the argument against the change
> carries less weight.
> 
> The main problems with extensions are (1) that one has to
> remember which functions/packages have which extensions if
> one is to use them and (2) they can interfere with other
> future extensions.
> 
> On the other hand, if one is using a particular package a
> lot then convenience features like this may be attractive.
> Also, packages are where authors have the freedom to try out 
> new ideas and new functionality without being constrained.
> 
> Perhaps, if the extension in question is added there could be 
> a warning in the help file that this is a convenience feature 
> of this particular package and is not generally available 
> throughout R.
Thanks again Gabor for another useful contribution to this debate. Also
thanks to Martin, Gabor and Jari for their comments, ideas, suggestions
and viewpoints.

I still like y1 ~ y2 (both data frames), but during my bike ride to work
this morning I considered both sides of the argument and my position has
moved towards the R way of doing things - far be it for little old me to
go against years of S-formula tradition. So I'll revert the code back to
accepting y1 ~ ., data = y2 and leave it to throw an error for the rhs
being a data frame case.

Once again, thank you for helping me work through this dilemma.

All the best,

Gav
> On 8/18/05, Gavin Simpson <gavin.simpson at ucl.ac.uk> wrote:
> > On Thu, 2005-08-18 at 07:57 +0300, Jari Oksanen wrote:
> > > On 18 Aug 2005, at 1:49, Gavin Simpson wrote:
> > >
> > > > On Wed, 2005-08-17 at 20:24 +0200, Martin Maechler wrote:
> > > >>>>>>> "GS" == Gavin Simpson
<gavin.simpson at ucl.ac.uk>
> > > >>>>>>>     on Tue, 16 Aug 2005 18:44:23
+0100 writes:
> > > >>
> > > >>     GS> On Tue, 2005-08-16 at 12:35 -0400, Gabor
Grothendieck
> > > >>     GS> wrote:
> > > >>>> On 8/16/05, Gavin Simpson <gavin.simpson at
ucl.ac.uk>
> > > >>>> wrote: > On Tue, 2005-08-16 at 11:25 -0400,
Gabor
> > > >>>> Grothendieck wrote: > > It can handle data
frames like
> > > >>>> this:
> > > >>>>>>
> > > >>>>>> model.frame(y1) > > or > >
model.frame(~., y1)
> > > >>>>>
> > > >>>>> Thanks Gabor,
> > > >>>>>
> > > >>>>> Yes, I know that works, but I want the
function
> > > >>>> coca.formula to accept a > formula like this
y2 ~ y1,
> > > >>>> with both y1 and y2 being data frames. It is
> > > >>>>
> > > >>>> The expressions I gave work generally (i.e. lm,
glm,
> > > >>>> ...), not just in model.matrix, so would it be
ok if the
> > > >>>> user just does this?
> > > >>>>
> > > >>>> yourfunction(y2 ~., y1)
> > > >>
> > > >>     GS> Thanks again Gabor for your comments,
> > > >>
> > > >>     GS> I'd prefer the y1 ~ y2 as data frames -
as this is the
> > > >>     GS> most natural way of doing things. I'd
like to have (y2
> > > >>     GS> ~., y1) as well, and (y2 ~ spp1 + spp2 +
spp3, y1) also
> > > >>     GS> work - silently without any trouble.
> > > >>
> > > >> I'm sorry, Gavin, I tend to disagree quite a bit.
> > > >>
> > > >> The formula notation has quite a history in the S
language, and
> > > >> AFAIK never was the idea to use data.frames as formula
> > > >> components, but rather as "environments" in
which formula
> > > >> components are looked up --- exactly as Gabor has
explained.
> > > >
> > > > Hi Martin, thanks for your comments,
> > > >
> > > > But then one could have a matrix of variables on the rhs of
the formula
> > > > and it would work - whether this is a documented feature or
un-intended
> > > > side-effect of matrices being stored as vectors with dims, I
don't
> > > > know.
> > > >
> > > > And whilst the formula may have a long history, a number of
packages
> > > > have extended the interface to implement a specific feature,
which
> > > > don't
> > > > work with standard functions like lm, glm and friends. I
don't see how
> > > > what I wanted to achieve is greatly different to that or
using a
> > > > matrix.
> > > >
> > > >> To break with such a deeply rooted principle,
> > > >> you should have very very good reasons, because
you're breaking
> > > >> the concepts on which all other uses of formulae are
based.
> > > >> And this would potentially lead to much confusion of
your users,
> > > >> at least in the way they should learn to think about
what
> > > >> formulae mean.
> > > >
> > > > In the end I managed to treat y1 ~ y2 (both data frames) as
a special
> > > > case, which allows the existing formula notation to work as
well, so I
> > > > can use y1 ~ y2, y1 ~ ., data = y2, or y1 ~ var + var2, data
= y2. This
> > > > is what I wanted all along, to extend my interface (not do
anything to
> > > > R's formulae), but to also work in the traditional
sense.
> > > >
> > > > The model I am writing code for really is modelling the
relationship
> > > > between two matrices of data. In one version of the method,
there is
> > > > real equivalence between both sides of the formula so it
would seem odd
> > > > to treat the two sides of the formula differently. At least
to me ;-)
> > >
> > > It seems that I may be responsible for one of these extensions
(lhs as
> > > a data.frame in cca and rda in vegan package). There the response
(lhs)
> > > is multivariate or a multispecies community, and you must take
that as
> > > a whole without manipulation (and if you tried using VGAM you see
there
> > > really is painful to define lhs with, say, 127 elements).
> > 
> > Hi Jari,
> > 
> > Thanks for reminding me about this - I'd forgotten about not
normally
> > being able to have a data.frame on the lhs of the formula either -
I'm
> > surprised no-one pulled me up on that one before, either ;-)
> > 
> > I guess what I'm proposing is really pushing the formula
representation
> > too far for some people. I'm coming round to the y1 ~ ., data = y2
way
> > of doing things - still prefer y1 ~ y2 though ;-)
> > 
> > Also, both y1 and y2 are community matrices (i.e. both have many, many
> > species, aka variables for the non-community ecologists reading this).
> > I'm not sure that it makes sense to treat the two sides
differently. In
> > the predictive co-correspondence mode (the default), multivariate pls
is
> > used to regress one matrix on another, with the number of pls
components
> > being chosen by cross-validation or a permutation test.
> > 
> > > However, in
> > > general you shouldn't use models where you use all the
'explanatory'
> > > variables (rhs) that yo happen to have by accident. So much bad
science
> > > has been created with that approach even in your field, Gav.
> > 
> > Well, I agree with you there...
> > 
> > > The whole
> > > idea of formula is the ability to choose from candidate
variables. That
> > > is: to build a model. Therefore you have one-sided formulae in
prcomp()
> > > and princomp(): you can say prcomp(~ x1 + log(x2) +x4, data) or
> > > prcomp(~ . - x3, data). I think you should try to keep it so. Do
> > > instead like Gabor suggested: you could have a function
coca.default or
> > > coca.matrix with interface:
> > >
> > > coca.matrix(matx, maty, matz) -- or you can name this as
coca.default.
> > >
> > > and coca.formula which essentially parses your formula and
returns a
> > > list of matrices you need:
> > >
> > > coca.formula <- function(formula, data)
> > > {
> > >       matricesout <- parsemyformula(formula, data)
> > >      coca(matricesout$matx, matricesout$maty, matricesoutz)
> > > }
> > > Then you need the generic: coca <- function(...)
UseMethod("coca") and
> > > it's done (but fails in R CMD check unless you add
"..." in all
> > > specific functions...). The real work is always done in
coca.matrix (or
> > > coca.default), and the others just chew your data into suitable
form
> > > for your workhorse.
> > >
> > > If then somebody thinks that they need all possible variables as
> > > 'explanatory' variables (or perhaps constraints in your
case), they
> > > just call the function as
> > >
> > > coca(matx, maty, matz)
> > 
> > My functions are already generic with coca.default and coca.formula.
The
> > issue with matrices/data.frames was only a problem in the formula
> > interface.
> > 
> > > And if you have coca.data.frame they don't need
'quacking' with extra
> > > steps:
> > >
> > > coca.data.frame <- function(dfx, dfy dfz) coca(as.matrix(dfx),
> > > as.matrix(dfy), as.matrix(dfz)).
> > >
> > > This you call as coca(dfx, dfy, dfz) and there you go.
> > >
> > > The essential feature in formula is the ability to define the
model.
> > > Don't give it away.
> > 
> > I think the point I'm trying to make is that I don't think
what I'm
> > trying to do is any different than doing lm(y ~ x, data), (where y, x
> > are vectors) - it is just that my x and y happen to be multivariate. I
> > think it is easier to think of each community as a single entity in
this
> > regard - the relationship *is* between community 1 and community 2,
not
> > parts of community 2, or some parsimonious model of community 2 - but
> > that might just be semantics - unlike your cca/rda functions which
> > really are a (weighted) multivariate multiple regression. Happy to be
> > convinced otherwise though.
> > 
> > Also, it is worth re-iterating that I haven't broken the
traditional way
> > of working with formulae with my function - you can still do y1 ~ .,
> > data = y2, or y1 ~ spp1 + spp2 + spp3, data = y2, for maximum
> > flexibility. All I wanted (and worked out how) to do was treat the rhs
> > in a special way if it were a data frame, just like Jari treats a
> > data.frame on the lhs of formulae in package vegan as a special case.
> > 
> > Thanks everyone for your ideas and comments - lots of food for
thought.
> > I wavering between both camps on this - still time to be convinced and
> > change it before I finish the package.
> > 
> > All the best,
> > 
> > G
> > 
> > >
> > > cheers, jazza
> > > --
> > > Jari Oksanen, Oulu, Finland
> > >
> > --
> >
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> > Gavin Simpson                     [T] +44 (0)20 7679 5522
> > ENSIS Research Fellow             [F] +44 (0)20 7679 7565
> > ENSIS Ltd. & ECRC                 [E]
gavin.simpsonATNOSPAMucl.ac.uk
> > UCL Department of Geography       [W]
http://www.ucl.ac.uk/~ucfagls/cv/
> > 26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
> > London.  WC1H 0AP.
> >
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> > 
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [T] +44 (0)20 7679 5522
ENSIS Research Fellow             [F] +44 (0)20 7679 7565
ENSIS Ltd. & ECRC                 [E] gavin.simpsonATNOSPAMucl.ac.uk
UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
London.  WC1H 0AP.
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

Maybe Matching Threads

Search for more possibly parallel threads

R devel - Aug 2005 - problem using model.frame()

[Rd] problem using model.frame()

[Rd] problem using model.frame()

[Rd] problem using model.frame()

Maybe Matching Threads