Gabe Newell
2017-Dec-11 17:45 UTC
[Rd] document environment passing in parallel::parLapply
The runtime of parallel::parLapply depends on variables unrelated to the parLapply call. However, this is not clearly documented. Therefore I would like to suggest expanding the relevant documentation to explain this behaviour. Consider this example:

    parallel_demo <- function(random_values_count) {
      some_data <- runif(random_values_count)
      dummy_function <- function(x) {
        x
      }
      cluster <- parallel::makeCluster(3)
      start <- proc.time()
      parallel::parLapply(cluster, 1:3, dummy_function)
      runtime <- proc.time() - start
      parallel::stopCluster(cluster)
      runtime
    }

    parallel_demo(10)
    parallel_demo(100 * 1000 * 1000)

On my machine, this results in a measured runtime of 0.01 seconds being returned for the first call to parallel_demo, but in a runtime of 7.04 seconds being returned for the second call.

I could not find clear documentation in either ?parallel::parLapply or vignette("parallel", package = "parallel") - or any other obvious place - on what is the reason for the demonstrated difference in runtime.

Based on the observations described above (and on lots of additional tests), my _assumption_ is that parallel::parLapply passes the whole environment of its "fun" argument to all cluster nodes, which of course takes some time. Thus the more data there is in this environment, the longer this takes, even though the environment data might not be needed to execute the function "fun". For environments with lots of data in them, this can considerably slow down the computation at hand.

At the same time, this behaviour of passing all data in the environment of "fun" to the cluster nodes is not clearly documented. The only - rather vague - hint that I found about this is in the "extended examples" section (specifically on page 13, in section 10.4) of vignette("parallel", package = "parallel"). Furthermore, this behaviour is not something that would very easily be expected by every R user, in my opinion.
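The assumption above can be checked without starting a cluster. The following is only a minimal sketch: serialized_size and make_function are hypothetical helpers written for this illustration, and it assumes that the number of bytes produced by serialize() is a reasonable proxy for what has to be sent to each node. It compares the serialized size of the same function with and without the enclosing environment that holds some_data:

```r
# Hypothetical helper (not part of any package): number of bytes needed to
# serialize an object, used here as a proxy for what is sent to each node.
serialized_size <- function(object) {
  length(serialize(object, connection = NULL))
}

# Hypothetical helper mirroring parallel_demo, but returning the function
# instead of running it on a cluster.
make_function <- function(random_values_count, detach_environment) {
  some_data <- runif(random_values_count)
  dummy_function <- function(x) {
    x
  }
  if (detach_environment) {
    # Assumption: dummy_function needs nothing from the enclosing frame,
    # so its environment can be replaced by the global environment, which
    # is serialized as a reference rather than by content.
    environment(dummy_function) <- globalenv()
  }
  dummy_function
}

size_with_data <- serialized_size(make_function(1000 * 1000, FALSE))
size_without_data <- serialized_size(make_function(1000 * 1000, TRUE))

print(c(with_data = size_with_data, without_data = size_without_data))
```

If the assumption holds, the first size should be dominated by the roughly 8 MB of doubles in some_data, while the second should stay small. Resetting the environment like this is only safe when "fun" really does not use any variable from the frame in which it was defined.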
Therefore I want to suggest expanding the documentation of parallel::parLapply so that it explicitly states that the environment of "fun" has to be passed to all cluster nodes, which may take some time.

I spent a considerable amount of time figuring out why my parallelization code didn't really speed up my calculations, and I would like to save others from going through this hassle again. :-)

For the sake of completeness, here is my session info:

    > version
                   _
    platform       x86_64-w64-mingw32
    arch           x86_64
    os             mingw32
    system         x86_64, mingw32
    status
    major          3
    minor          4.3
    year           2017
    month          11
    day            30
    svn rev        73796
    language       R
    version.string R version 3.4.3 (2017-11-30)
    nickname       Kite-Eating Tree

    > sessionInfo()
    R version 3.4.3 (2017-11-30)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows >= 8 x64 (build 9200)

    Matrix products: default

    locale:
    [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
    [4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    loaded via a namespace (and not attached):
    [1] compiler_3.4.3 parallel_3.4.3 tools_3.4.3    yaml_2.1.14

Martin