Hi,

S4 method dispatch can be very slow. Would it be reasonable to cache the most
recent dispatch, anticipating the next invocation will be on the same type? This
would be very helpful in loops.

fun0 <- function(x)
    sapply(x, paste, collapse="+")
fun1 <- function(x) {
    paste <- selectMethod(paste, class(x[[1]]))
    sapply(x, paste, collapse="+")
}
lst <- split(rep(LETTERS, 100), rep(1:1300, 2))

library(microbenchmark)
microbenchmark(fun0(lst), times=10)
## Unit: milliseconds
##       expr      min       lq   median      uq      max neval
##  fun0(lst) 4.153287 4.180659 4.513539 5.19261 5.280481    10

setGeneric("paste")
microbenchmark(fun0(lst), fun1(lst), times=10)
## > microbenchmark(fun0(lst), fun1(lst), times=10)
## Unit: milliseconds
##       expr       min       lq    median        uq       max neval
##  fun0(lst) 21.093180 21.27616 21.453174 21.833686 24.758791    10
##  fun1(lst)  4.517808  4.53067  4.582641  4.682235  5.121856    10

Dispatch seems to be especially slow when packages are involved, e.g., with the
Bioconductor IRanges package
(http://bioconductor.org/packages/release/bioc/html/IRanges.html)

removeGeneric("paste")
library(IRanges)
showMethods(paste)
## Function: paste (package BiocGenerics)
## ...="ANY"
## ...="Rle"

selectMethod(paste, "ANY")
## Method Definition (Class "derivedDefaultMethod"):
##
## function (..., sep = " ", collapse = NULL)
## .Internal(paste(list(...), sep, collapse))
## <environment: namespace:base>
##
## Signatures:
##         ...
## target  "ANY"
## defined "ANY"

microbenchmark(fun0(lst), fun1(lst), times=10)
## Unit: milliseconds
##       expr        min         lq     median         uq        max neval
##  fun0(lst) 233.539585 234.592491 236.311209 237.268506 243.181123    10
##  fun1(lst)   4.564914   4.592996   4.642898   4.729009   5.492706    10

sessionInfo()
## R version 3.0.0 Patched (2013-04-04 r62492)
## Platform: x86_64-unknown-linux-gnu (64-bit)
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=C                 LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
## [1] IRanges_1.19.15      BiocGenerics_0.7.2   microbenchmark_1.3-0
##
## loaded via a namespace (and not attached):
## [1] stats4_3.0.0

Thanks,
Valerie

Hi Val,

[off list... I don't want to compromise your chances to start a constructive
discussion ;-)]

Thanks for reporting this. Just wanted to mention that the reason I think the
situation is worse when you use the paste() generic defined in BiocGenerics
than when you make paste() a generic with setGeneric("paste") is the signature
of the generic. With the latter, dispatch is on the 'sep' and 'collapse' args
only (which is surprising, but that's another story), while with the former
it's on '...':

  > setGeneric("paste")
  [1] "paste"
  > paste
  standardGeneric for "paste" defined from package "base"

  function (..., sep = " ", collapse = NULL)
  standardGeneric("paste")
  <environment: 0x157a028>
  Methods may be defined for arguments: sep, collapse
  Use  showMethods("paste")  for currently available ones.

  ## Note that showMethods() is broken (it contradicts the above
  ## that indicates dispatch is on 'sep' and 'collapse').
  > showMethods("paste")
  Function: paste (package base)
  ...="ANY"

  > microbenchmark(fun0(lst), fun1(lst), times=10)
  Unit: milliseconds
        expr       min        lq    median        uq       max neval
   fun0(lst) 27.374228 27.508580 28.144858 28.895889 33.528221    10
   fun1(lst)  5.474173  5.739289  5.803471  6.050482  6.825982    10

  > removeGeneric("paste")
  [1] TRUE
  > setGeneric("paste", signature="...")  # this is how it's defined in BiocGenerics
  Creating a new generic function for ‘paste’ in the global environment
  [1] "paste"
  > microbenchmark(fun0(lst), fun1(lst), times=10)
  Unit: milliseconds
        expr        min         lq     median         uq        max neval
   fun0(lst) 149.828201 153.192866 155.845508 157.916067 176.313906    10
   fun1(lst)   4.924387   5.088094   5.114532   5.200432   5.332386    10

Dispatch on '...' seems to have a ridiculously high cost!

H.

On 07/01/2013 10:04 PM, Valerie Obenchain wrote:
> Hi,
>
> S4 method dispatch can be very slow. Would it be reasonable to cache the
> most recent dispatch, anticipating the next invocation will be on the same
> type? This would be very helpful in loops.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

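One way to check which arguments a generic dispatches on, without relying on what showMethods() prints, is to read the generic's signature slot. The sketch below is only an illustration: joinDefault and joinDots are hypothetical generics (not part of base R or BiocGenerics) that copy the formals of paste() and mimic the two ways paste() was made generic above.

```r
library(methods)

# Hypothetical generics with the same formals as paste(); the names are
# illustrative only and do not exist in any package.

# Default signature: setGeneric() picks the dispatch arguments itself.
setGeneric("joinDefault",
           function(..., sep = " ", collapse = NULL)
               standardGeneric("joinDefault"))

# Explicit signature: dispatch on '...' only, as in the
# setGeneric("paste", signature="...") call shown above.
setGeneric("joinDots",
           function(..., sep = " ", collapse = NULL)
               standardGeneric("joinDots"),
           signature = "...")

# The 'signature' slot lists the arguments eligible for dispatch.
sig_default <- getGeneric("joinDefault")@signature
sig_dots    <- getGeneric("joinDots")@signature
print(sig_default)
print(sig_dots)
```

The first generic should report 'sep' and 'collapse', matching the "Methods may be defined for arguments" line in the transcript above, and the second should report '...' alone.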
It's hard to see how repeated dispatch on the same classes can be that slow,
_if_ the function being called each time is itself doing some substantial work.

The first call (in a session) with a particular signature searches for
inherited methods and stores the method found in a table. The following calls
with that signature should do a single lookup in a hash table. Caching the last
signature is unlikely to be dramatically faster, but we can experiment and see.

What is substantially different is calling a generic function vs calling a
primitive or internal. If the local paste you constructed is the default,
base::paste, that is a .Internal. Not going through the R generic function
several thousand times would make a difference.

It's a fundamental point about R that function calls do enough work that they
add significant time to a "trivial" computation, such as a primitive call.

There are various efforts going on these days to provide more efficient
alternatives. They're all helpful; my personal favorite when the game is worth
it is to consider doing key computations in a seriously faster language, like
C++ via Rcpp.

John

On 7/1/13 10:04 PM, Valerie Obenchain wrote:
> Hi,
>
> S4 method dispatch can be very slow. Would it be reasonable to cache the
> most recent dispatch, anticipating the next invocation will be on the same
> type? This would be very helpful in loops.
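The point about per-call overhead can be tried out with a small experiment that needs no extra packages. This is a minimal sketch under stated assumptions: collapsePlus is a hypothetical generic invented for the illustration, the method body is deliberately cheap so that call overhead dominates, and the timings will differ by machine and R version.

```r
library(methods)

# Hypothetical generic wrapping a cheap operation; 'collapsePlus' is an
# illustrative name, not an existing function.
setGeneric("collapsePlus", function(x) standardGeneric("collapsePlus"))
setMethod("collapsePlus", "character",
          function(x) paste(x, collapse = "+"))

# Same test data as in the original report: 1300 short character vectors.
lst <- split(rep(LETTERS, 100), rep(1:1300, 2))

# Goes through the generic (and hence dispatch) for every element.
via_generic <- function(l) sapply(l, collapsePlus)

# Looks the method up once, then calls the method function directly.
via_method <- function(l) {
    f <- selectMethod("collapsePlus", class(l[[1]]))
    sapply(l, f)
}

# Both routes must give the same answer.
stopifnot(identical(via_generic(lst), via_method(lst)))

# Rough timings; no particular ratio is claimed here.
print(system.time(for (i in 1:20) via_generic(lst)))
print(system.time(for (i in 1:20) via_method(lst)))
```

Because the method does so little work, any difference between the two timings mostly reflects the cost of going through the generic on each call rather than the cost of paste() itself.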