Hi All,

I've been looking into speeding up the loading of packages that use a lot of S4. After profiling I noticed the "exists" function accounts for a surprising fraction of the time. I have some thoughts about speeding up exists (below). More to the point of this post, Martin Mächler noted that 'exists' and 'get' are often used in conjunction. Both functions are different usages of the do_get C function, so it's a pity to run that twice.

"get" gives an error when a symbol is not found, so you can't just do a 'get'. With R's C library, one might do

    SEXP x = findVarInFrame3(env, symbol, TRUE);
    if (x != R_UnboundValue) {
        // do stuff with x
    }

It would be very convenient to have something like this at the R level. We don't want to do any tryCatch stuff or to add args to get. (That would kill any speed advantage. The overhead for handling redundant args accounts for 30% of the time used by "exists".) Michael Lawrence and I worked out that we need a function that returns either the desired object, or something that represents R_UnboundValue. We also need a very cheap way to check if something equals this new R_UnboundValue. This might look like

    if (defined(x <- fetch(symbol, env))) {
        do_stuff_with_x(x)
    }

A few more thoughts about "exists":

Moving the bit of R in the exists function to C saves 10% of the time. Dropping the redundant pos and frame args entirely saves 30% of the time used by this function. I suggest that the arguments of both get and exists should be simplified to (x, envir, mode, inherits). The existing C code handles numeric, character, and environment input for where. The arg frame is rarely used (0/128 exists calls in the methods package). Users that need to can call sys.frame themselves. get already lacks a frame argument and the man page for exists notes that envir is only there for backwards compatibility. Let's deprecate the extra args in exists and get and perhaps move the extra argument handling to C in the interim. Similarly, the "assign" function does nothing with the "immediate" argument.

I'd be interested to hear if there is any support for a "fetch"-like function (and/or deprecating some unused arguments).

All the best,
Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com
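A rough R-level sketch of the fetch/defined idea proposed above, for illustration only. The names fetch, defined, and .unbound are hypothetical, and the sketch assumes R 3.2.0 or later, where get0() with an ifnotfound argument is available. It shows the intended semantics (one lookup plus a cheap identity test) and makes no claim about how a real implementation would perform.

```r
# Hypothetical sentinel standing in for R_UnboundValue at the R level.
# A fresh environment is identical() only to itself.
.unbound <- new.env(parent = emptyenv())

# Hypothetical fetch(): a single lookup that returns the sentinel
# instead of raising an error when the name is not bound.
# Assumes get0() is available (R >= 3.2.0).
fetch <- function(name, env) {
  get0(name, envir = env, inherits = FALSE, ifnotfound = .unbound)
}

# Hypothetical defined(): identity check against the sentinel.
defined <- function(x) !identical(x, .unbound)

e <- new.env()
assign("a", 1, envir = e)
assign("b", NULL, envir = e)   # bound, but the value is NULL

if (defined(x <- fetch("a", e))) print(x)

defined(fetch("b", e))   # TRUE: bound to NULL still counts as defined
defined(fetch("c", e))   # FALSE: never assigned
```

Using a sentinel object rather than NULL is what lets a binding whose value is NULL be told apart from a missing binding.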
I've looked at related speed issues in the past, and have a couple related points to add. (I've put the info below at http://rpubs.com/wch/46428.)

There's a significant amount of overhead just from calling the R function get(). This is true even when you skip the pos argument and provide envir. For example, if you call get(), it takes much more time than .Internal(get()), which is what get() does.

If you already know that the object exists in an environment, it's faster to use e$x, and slightly faster still to use e[["x"]]:

    library(microbenchmark)

    e <- new.env()
    e$a <- 1

    # Accessing objects in environments
    microbenchmark(
      get("a", e, inherits = FALSE),
      get("a", envir = e, inherits = FALSE),
      .Internal(get("a", e, "any", FALSE)),
      e$a,
      e[["a"]],
      .Primitive("[[")(e, "a"),

      unit = "us"
    )
    #>   median name
    #> 1 1.0300 get("a", e, inherits = FALSE)
    #> 2 0.9425 get("a", envir = e, inherits = FALSE)
    #> 3 0.3080 .Internal(get("a", e, "any", FALSE))
    #> 4 0.2305 e$a
    #> 5 0.1740 e[["a"]]
    #> 6 0.2905 .Primitive("[[")(e, "a")

A similar thing happens with exists(): the R function wrapper adds significant overhead on top of .Internal(exists()). It's also faster to use $ and [[, then test for NULL, but of course this won't distinguish between objects that don't exist, and those that do exist but have a NULL value:

    # Test for existence of `a` (which exists), and `c` (which doesn't)
    microbenchmark(
      exists('a', e, inherits = FALSE),
      exists('a', envir = e, inherits = FALSE),
      .Internal(exists('a', e, 'any', FALSE)),
      'a' %in% ls(e, all.names = TRUE),
      is.null(e[['a']]),
      is.null(e$a),

      exists('c', e, inherits = FALSE),
      exists('c', envir = e, inherits = FALSE),
      .Internal(exists('c', e, 'any', FALSE)),
      'c' %in% ls(e, all.names = TRUE),
      is.null(e[['c']]),
      is.null(e$c),

      unit = "us"
    )
    #>    median name
    #> 1  1.2015 exists("a", e, inherits = FALSE)
    #> 2  1.0545 exists("a", envir = e, inherits = FALSE)
    #> 3  0.3615 .Internal(exists("a", e, "any", FALSE))
    #> 4  7.6345 "a" %in% ls(e, all.names = TRUE)
    #> 5  0.3055 is.null(e[["a"]])
    #> 6  0.3270 is.null(e$a)
    #> 7  1.1890 exists("c", e, inherits = FALSE)
    #> 8  1.0370 exists("c", envir = e, inherits = FALSE)
    #> 9  0.3465 .Internal(exists("c", e, "any", FALSE))
    #> 10 7.5475 "c" %in% ls(e, all.names = TRUE)
    #> 11 0.2675 is.null(e[["c"]])
    #> 12 0.3010 is.null(e$c)

-Winston

On Tue, Dec 2, 2014 at 8:46 PM, Peter Haverty <haverty.peter at gene.com> wrote:
> [...]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
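The caveat about NULL values can be seen in a small example. This is a minimal sketch using only base R; the environment e and the names a, b, and c are made up for illustration.

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", NULL, envir = e)   # "b" exists, but its value is NULL
# "c" is never assigned

nms <- c("a", "b", "c")

# The NULL test treats "b" (bound to NULL) and "c" (unbound) alike
null_test <- vapply(nms, function(nm) is.null(e[[nm]]), logical(1))

# exists() tells them apart
exists_test <- vapply(nms,
                      function(nm) exists(nm, envir = e, inherits = FALSE),
                      logical(1))

print(rbind(is_null = null_test, exists = exists_test))
```

The is_null row comes out the same for b and c, while the exists row differs, so the faster NULL test is only a safe substitute when NULL is never stored as a value.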
Thanks Winston! I'm amazed that "[[" beats calling the .Internal directly. I guess the difference between .Primitive vs. .Internal is pretty significant for things on this time scale.

NULL meaning NULL and NULL meaning undefined would lead to the same path for much of my code. I'll be swapping out many exists and get calls later today. Thanks!

I do still think it would be very useful to have some way to discriminate the two NULL cases. I'm reminded of how perl does the same thing. It's been a while, but it was something like

    # This is still two lookups, but it has the "defined" concept.
    if (defined($x{'c'})) { print $x{'c'}; }

or maybe even

    if (defined(my $foo = $x{'c'})) { print $foo; }

Thanks again for the timings!

Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

On Wed, Dec 3, 2014 at 12:48 PM, Winston Chang <winstonchang1 at gmail.com> wrote:
> [...]
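The swap described above, replacing an exists() plus get() pair with a single `[[` lookup, might look roughly like the following. This is a sketch only: lookup_twice and lookup_once are hypothetical helper names, and it applies only where "bound to NULL" and "not bound" may take the same code path.

```r
e <- new.env()
assign("a", 1, envir = e)

# Two lookups: exists() followed by get()
lookup_twice <- function(name, env) {
  if (exists(name, envir = env, inherits = FALSE)) {
    get(name, envir = env, inherits = FALSE)
  } else {
    NULL
  }
}

# One lookup: `[[` on an environment returns NULL for an unbound name
# and does not search enclosing environments
lookup_once <- function(name, env) env[[name]]

identical(lookup_twice("a", e), lookup_once("a", e))   # TRUE
identical(lookup_twice("c", e), lookup_once("c", e))   # TRUE
```

Because `[[` looks only in the given environment, it corresponds to the inherits = FALSE case; code that relies on inherits = TRUE would need a different replacement.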