thr3ads.net - R devel - [Rd] we need an exists/get hybrid [Dec 2014]

Peter Haverty

2014-Dec-03 21:30 UTC

[Rd] we need an exists/get hybrid

Thanks Winston!  I'm amazed that "[[" beats calling the .Internal
directly.  I guess the difference between .Primitive vs. .Internal is
pretty significant for things on this time scale.

NULL meaning NULL and NULL meaning undefined would lead to the same path
for much of my code.  I'll be swapping out many exists and get calls later
today.  Thanks!
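
To make the swap concrete, here is a minimal sketch of the two patterns. The helper names lookup_twice and lookup_once are made up for illustration and are not part of base R; the point is only that `[[` on an environment returns NULL for a missing name, so one lookup can replace the exists/get pair whenever NULL and "not there" are handled the same way.

```r
e <- new.env()
assign("a", 1, envir = e)
assign("n", NULL, envir = e)   # bound, but the value is NULL

# Two lookups: exists() followed by get() (hypothetical helper)
lookup_twice <- function(name, env) {
  if (exists(name, envir = env, inherits = FALSE)) {
    get(name, envir = env, inherits = FALSE)
  } else {
    NULL
  }
}

# One lookup: `[[` returns NULL when the name is not bound (hypothetical helper)
lookup_once <- function(name, env) env[[name]]

lookup_once("a", e)            # 1
is.null(lookup_once("n", e))   # TRUE: bound to NULL
is.null(lookup_once("c", e))   # TRUE: not bound at all
```

The last two calls give the same answer, which is exactly the ambiguity between a NULL value and an undefined name.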

I do still think it would be very useful to have some way to discriminate
the two NULL cases.  I'm reminded of how perl does the same thing.  It's
been a while, but it was something like

# This is still two lookups, but it has the "defined" concept.
if (defined($x{'c'})) { print $x{'c'}; }

or maybe even

if (defined( my $foo = $x{'c'} ) ) { print $foo; }
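
An R analogue of that "defined" idea could be sketched with a sentinel object. This is only a sketch under assumptions: fetch, defined and .unbound are hypothetical names, not base R functions, and the code relies on get0(), which base R provides from version 3.2.0 onward. It shows the semantics, not the speed, since the R-level wrappers add their own call overhead.

```r
# Hypothetical helpers; `fetch`, `defined` and `.unbound` are illustrative
# names, not base R functions.
.unbound <- new.env()   # unique sentinel standing in for R_UnboundValue

fetch <- function(name, env) {
  # get0() (R >= 3.2.0) returns `ifnotfound` instead of raising an error
  get0(name, envir = env, inherits = FALSE, ifnotfound = .unbound)
}

# identical() compares environments by reference, so only the sentinel matches
defined <- function(x) !identical(x, .unbound)

e <- new.env()
assign("a", 1, envir = e)
assign("n", NULL, envir = e)   # bound, but the value is NULL

if (defined(x <- fetch("n", e))) print(x)   # prints NULL
defined(fetch("c", e))                      # FALSE
```

Unlike the is.null() test, this keeps a name bound to NULL distinct from a name that is not bound at all.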


Thanks again for the timings!


Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

On Wed, Dec 3, 2014 at 12:48 PM, Winston Chang <winstonchang1 at gmail.com> wrote:
> I've looked at related speed issues in the past, and have a couple
> related points to add. (I've put the info below at
> http://rpubs.com/wch/46428.)
>
> There's a significant amount of overhead just from calling the R
> function get(). This is true even when you skip the pos argument and
> provide envir. For example, if you call get(), it takes much more time
> than .Internal(get()), which is what get() does.
>
> If you already know that the object exists in an environment, it's
> faster to use e$x, and slightly faster still to use e[["x"]]:
>
> library(microbenchmark)
>
> e <- new.env()
> e$a <- 1
>
> # Accessing objects in environments
> microbenchmark(
>   get("a", e, inherits = FALSE),
>   get("a", envir = e, inherits = FALSE),
>   .Internal(get("a", e, "any", FALSE)),
>   e$a,
>   e[["a"]],
>   .Primitive("[[")(e, "a"),
>
>   unit = "us"
> )
> #>   median                                  name
> #> 1 1.0300         get("a", e, inherits = FALSE)
> #> 2 0.9425 get("a", envir = e, inherits = FALSE)
> #> 3 0.3080  .Internal(get("a", e, "any", FALSE))
> #> 4 0.2305                                   e$a
> #> 5 0.1740                              e[["a"]]
> #> 6 0.2905              .Primitive("[[")(e, "a")
>
>
> A similar thing happens with exists(): the R function wrapper adds
> significant overhead on top of .Internal(exists()). It's also faster
> to use $ and [[, then test for NULL, but of course this won't
> distinguish between objects that don't exist, and those that do exist
> but have a NULL value:
>
> # Test for existence of `a` (which exists), and `c` (which doesn't)
> microbenchmark(
>   exists('a', e, inherits = FALSE),
>   exists('a', envir = e, inherits = FALSE),
>   .Internal(exists('a', e, 'any', FALSE)),
>   'a' %in% ls(e, all.names = TRUE),
>   is.null(e[['a']]),
>   is.null(e$a),
>
>   exists('c', e, inherits = FALSE),
>   exists('c', envir = e, inherits = FALSE),
>   .Internal(exists('c', e, 'any', FALSE)),
>   'c' %in% ls(e, all.names = TRUE),
>   is.null(e[['c']]),
>   is.null(e$c),
>
>   unit = "us"
> )
> #>    median                                     name
> #> 1  1.2015         exists("a", e, inherits = FALSE)
> #> 2  1.0545 exists("a", envir = e, inherits = FALSE)
> #> 3  0.3615  .Internal(exists("a", e, "any", FALSE))
> #> 4  7.6345         "a" %in% ls(e, all.names = TRUE)
> #> 5  0.3055                        is.null(e[["a"]])
> #> 6  0.3270                             is.null(e$a)
> #> 7  1.1890         exists("c", e, inherits = FALSE)
> #> 8  1.0370 exists("c", envir = e, inherits = FALSE)
> #> 9  0.3465  .Internal(exists("c", e, "any", FALSE))
> #> 10 7.5475         "c" %in% ls(e, all.names = TRUE)
> #> 11 0.2675                        is.null(e[["c"]])
> #> 12 0.3010                             is.null(e$c)
>
>
> -Winston
>
> On Tue, Dec 2, 2014 at 8:46 PM, Peter Haverty <haverty.peter at gene.com> wrote:
> > Hi All,
> >
> > I've been looking into speeding up the loading of packages that use a lot
> > of S4.  After profiling I noticed the "exists" function accounts for a
> > surprising fraction of the time.  I have some thoughts about speeding up
> > exists (below). More to the point of this post, Martin Mächler noted that
> > 'exists' and 'get' are often used in conjunction.  Both functions are
> > different usages of the do_get C function, so it's a pity to run that
> > twice.
> >
> > "get" gives an error when a symbol is not found, so you can't just do a
> > 'get'.  With R's C library, one might do
> >
> > SEXP x = findVarInFrame3(env, symbol, TRUE);
> > if (x != R_UnboundValue) {
> >     // do stuff with x
> > }
> >
> > It would be very convenient to have something like this at the R level. We
> > don't want to do any tryCatch stuff or to add args to get (That would kill
> > any speed advantage. The overhead for handling redundant args accounts for
> > 30% of the time used by "exists").  Michael Lawrence and I worked out that
> > we need a function that returns either the desired object, or something
> > that represents R_UnboundValue. We also need a very cheap way to check if
> > something equals this new R_UnboundValue. This might look like
> >
> > if (defined(x <- fetch(symbol, env))) {
> >   do_stuff_with_x(x)
> > }
> >
> > A few more thoughts about "exists":
> >
> > Moving the bit of R in the exists function to C saves 10% of the time.
> > Dropping the redundant pos and frame args entirely saves 30% of the time
> > used by this function. I suggest that the arguments of both get and
> > exists should be simplified to (x, envir, mode, inherits). The existing C
> > code handles numeric, character, and environment input for where. The arg
> > frame is rarely used (0/128 exists calls in the methods package). Users
> > that need to can call sys.frame themselves. get already lacks a frame
> > argument and the manpage for exists notes that envir is only there for
> > backwards compatibility. Let's deprecate the extra args in exists and get
> > and perhaps move the extra argument handling to C in the interim.
> > Similarly, the "assign" function does nothing with the "immediate"
> > argument.
> >
> > I'd be interested to hear if there is any support for a "fetch"-like
> > function (and/or deprecating some unused arguments).
> >
> > All the best,
> > Pete
> >
> >
> >
> > Pete
> >
> > ____________________
> > Peter M. Haverty, Ph.D.
> > Genentech, Inc.
> > phaverty at gene.com
> >
> >         [[alternative HTML version deleted]]
> >
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>

Lorenz, David

2014-Dec-04 14:24 UTC

All,
  So that suggests that .GlobalEnv[["X"]] is more efficient than
get("X", pos=1L). What about .GlobalEnv[["X"]] <- value, compared to
assign("X", value)?
Dave


Sven E. Templer

2014-Dec-04 19:04 UTC

David, 'assign' is slower than '<-':

##   median                                          expr
## 1 0.1440                                  X <- letters
## 2 0.4420         .Internal(assign("X", letters, e, F))
## 3 1.1820                           e[["X"]] <- letters
## 4 1.2570                                e$X <- letters
## 5 1.8380 assign("X", letters, envir = e, inherits = F)
## 6 1.9415         assign("X", letters, e, inherits = F)

(medians in microseconds over 500 runs; see http://rpubs.com/setempler/46568)
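
For reference, the environment-targeted forms in that table all end up creating the same kind of binding in e. A minimal sketch follows, with the names X1 to X3 chosen only for illustration. Timing is left out on purpose, since it depends on the microbenchmark package and on the machine.

```r
e <- new.env()

assign("X1", letters, envir = e, inherits = FALSE)  # ordinary function call
e[["X2"]] <- letters                                # `[[<-`
e$X3 <- letters                                     # `$<-`

# All three names are now bound in e, each to the same value
sort(ls(e))                                      # "X1" "X2" "X3"
identical(e$X1, e$X2) && identical(e$X2, e$X3)   # TRUE
```

Plain `X <- letters` differs in that it binds X in the calling environment rather than in e.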

---

Two questions:

Is 'X <- letters' the fastest because it does not need to switch the
environment from 'benchmark' to 'e'?
Why is the call to '.Internal' faster than '[[<-' here, when in Winston's
'get'/'[[' benchmark '[[' was faster than '.Internal'?

thanks,
s

