Joris Meys
2014-Oct-17 18:04 UTC
[Rd] Most efficient way to check the length of a variable mentioned in a formula.
Dear R gurus,
I need to know the length of a variable (let's call that X) that is
mentioned in a formula. So obviously I look for the environment from which
the formula is called and then I have two options:
- using eval(parse(text='length(X)'), envir=environment(formula))
- using length(get('X', envir=environment(formula)))
A bit of benchmarking showed that the first option is about 20 times
slower, to the extent that over 10,000 repetitions the second option saves
me more than half a second. So speed is not really an issue here.
Personally I'd go for option 2, as that one is easier to read and does the
job nicely, but with these functions I'm always a bit afraid that I'm
overlooking important details or side effects (possibly memory issues
when working with larger data).
Does anybody have an idea what the dangers of these methods are, and which
one is the most robust?
Thank you
Joris
--
Joris Meys
Statistical consultant
Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics
tel : +32 9 264 59 87
Joris.Meys at Ugent.be
Gabriel Becker
2014-Oct-17 18:23 UTC
[Rd] Most efficient way to check the length of a variable mentioned in a formula.
Joris,
For me,
    length(environment(form)[["x"]])
was about twice as fast as
    length(get("x", environment(form)))
in the year-old version of R (3.0.2) that I have on the virtual machine I'm
currently using.
As you found, the eval method was much slower (though my factor was much
larger than 20).
> system.time({thing <- replicate(10000, length(environment(form)[["x"]]))})
   user  system elapsed
  0.018   0.000   0.018
> system.time({thing <- replicate(10000, length(get("x", environment(form))))})
   user  system elapsed
  0.031   0.000   0.033
> system.time({thing <- replicate(10000, eval(parse(text = "length(x)"),
+     envir = environment(form)))})
   user  system elapsed
  4.528   0.003   4.656
I can't say right now whether this pattern will hold in the more
modern versions of R I typically use.
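The post does not show how `form` and `x` were created, so the following is only a minimal sketch under assumed names (`make_form`, `env`, and `z` are hypothetical). It runs the three lookups compared above on the same formula and also highlights one behavioural difference worth keeping in mind: `[[` on an environment looks only in that environment, while `get()` searches enclosing environments by default (`inherits = TRUE`).

```r
# Hypothetical setup: the thread never shows how 'form' and 'x' were defined.
make_form <- function() {
  x <- rnorm(100)   # the variable lives in this function's frame
  y ~ x             # the formula captures that frame as its environment
}
form <- make_form()
env <- environment(form)

# The three lookups compared in the thread, applied to the same variable.
len_subset <- length(env[["x"]])
len_get    <- length(get("x", envir = env))
len_eval   <- eval(parse(text = "length(x)"), envir = env)

# A variable that exists only in an enclosing environment of 'env'.
z <- 1:5
len_subset_z <- length(env[["z"]])             # [[ does not search parents: NULL
len_get_z    <- length(get("z", envir = env))  # get() inherits by default
```

So the fastest variant is only a drop-in replacement when the variable is known to sit directly in the formula's environment.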
~G
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
--
Gabriel Becker
Graduate Student
Statistics Department
University of California, Davis
William Dunlap
2014-Oct-17 18:57 UTC
[Rd] Most efficient way to check the length of a variable mentioned in a formula.
I would use eval(), but I think that most formula-using functions do
it more like the following.
getRHSLength <- function(formula, data = parent.frame())
{
    rhsExpr <- formula[[length(formula)]]
    rhsValue <- eval(rhsExpr, envir = data, enclos = environment(formula))
    length(rhsValue)
}
* Use eval() instead of get() so you will find variables that are in
  ancestral environments of envir (if envir is an environment), not just
  in envir itself.
* Evaluate only the stuff in the formula using the non-standard
  evaluation frame, and call length() in the current frame. Otherwise, if
  envir inherits directly from emptyenv(), the 'length' function will not
  be found.
* Use envir=data so it looks first in the data argument for variables.
* The enclos argument is used if envir is not an environment, and is used
  to find variables that are not in envir.
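The second and fourth points can be illustrated with a small sketch. The environment `bare` and the result names are hypothetical, introduced only for this illustration; the calls themselves are plain base R.

```r
# An environment whose parent is emptyenv(): base functions are not visible from it.
bare <- new.env(parent = emptyenv())
assign("X", 1:3, envir = bare)

# Evaluating the whole call inside 'bare' fails, because 'length' cannot be found.
res_inside <- tryCatch(
  eval(quote(length(X)), envir = bare),
  error = function(e) conditionMessage(e)
)

# Evaluating only the variable there, then calling length() here, works.
res_outside <- length(eval(quote(X), envir = bare))

# When envir is a list rather than an environment, names missing from the
# list are looked up in enclos.
res_enclos <- length(eval(quote(X), envir = list(), enclos = bare))
```

Here `res_inside` holds the error message rather than a length, while the other two lookups succeed.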
Here are some examples:
> X <- 1:10
> getRHSLength(~X)
[1] 10
> getRHSLength(~X, data=data.frame(X=1:2))
[1] 2
> getRHSLength((function(){X <- 1:4; ~X})(), data=data.frame())
[1] 4
> getRHSLength((function(){X <- 1:4; ~X})(), data=data.frame(X=1:2))
[1] 2
> getRHSLength((function(){X <- 1:4; ~X})(), data=list2env(data.frame()))
[1] 10
> getRHSLength((function(){X <- 1:4; ~X})(), data=emptyenv())
Error in eval(expr, envir, enclos) : object 'X' not found
I think you will see the same lookups if you try analogous things with lm().
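As a rough check of that remark, here is a minimal sketch with lm(). The data values and the names `make_formula`, `fit_env`, and `fit_data` are made up for illustration. Note one difference from getRHSLength(): when no data argument is given, lm() falls back to the formula's environment rather than to the caller's frame.

```r
# Hypothetical data, chosen only so the two fits use different numbers of rows.
make_formula <- function() {
  x <- c(1, 2, 3, 4)
  y <- c(2.1, 3.9, 6.2, 7.8)
  y ~ x
}
f <- make_formula()

# No data argument: the variables come from the formula's environment.
fit_env <- lm(f)

# With a data argument: columns in 'data' take precedence over that environment.
fit_data <- lm(f, data = data.frame(x = 1:3, y = c(1.2, 1.9, 3.1)))

n_env  <- nobs(fit_env)    # number of observations used in each fit
n_data <- nobs(fit_data)
```

The first fit uses the four values defined inside the function, the second the three rows of the data frame.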
Bill Dunlap
TIBCO Software
wdunlap tibco.com