[cross-posted on R-devel and Bioc-devel, since the functions from the parallel package discussed here are mirrored in the BiocGenerics package]

Hi,

I am currently running a lengthy simulation study (no details necessary) on a large multi-core system. The simulated data sets are stored in a long list and they are unevenly sized (hence, the computation times vary greatly between data sets), so I naively thought that parLapplyLB() would be the right function to parallelize my computations. At the moment, only two thirds of my computations are finished, yet only 16 of my 72 workers seem to be active, whereas the remaining 56 seem to be idle.

I tried to find out in more detail what parLapplyLB() actually does, and I was astonished to see that it splits the input list X into as many parts as there are nodes in the cluster and then calls clusterApplyLB() on these chunks. parLapply(), by the way, does exactly the same, except that it then calls clusterApply(). I may be missing a point, but since there are as many chunks as there are nodes in the cluster, this means to me that no load balancing actually takes place, right? My list of data sets is ordered by size, so the small data sets are grouped together and the large ones are grouped together, with the obvious effect that the workers that received the chunks of small data sets finished long ago, while the workers holding the largest data sets are still running and will take quite some more time to complete. First question: is this behavior of parLapplyLB() really intended?

clusterApplyLB(), as it seems to me, is on the other side of the spectrum: it distributes every single data item dynamically and therefore presumably creates quite some communication overhead. Why not add an argument to parLapplyLB() that controls the size or number of chunks the input list X is split into? (A minimal sketch of such a variant is given further below.)

Additionally (or alternatively), the communication overhead can be reduced by sending all additional arguments to the workers once and for all (a trick I have seen on page 20 of "Parallel R" by Q. E. McCallum). Here is my version of it (the only difference from McCallum's code is that I moved LB.init() and LB.worker() to the inside of parLapplyLB2()):

    parLapplyLB2 <- function(cl, x, fun, ...) {
        LB.init <- function(fun, ...) {
            ## store fun and its extra arguments once per worker
            assign(".LB.fun", fun, pos = globalenv())
            assign(".LB.args", list(...), pos = globalenv())
            NULL
        }
        ## worker-side wrapper: apply the stored function with the stored arguments
        LB.worker <- function(x)
            do.call(".LB.fun", c(list(x), .LB.args))
        ## send fun and the additional arguments to every worker once
        clusterCall(cl, LB.init, fun, ...)
        r <- clusterApplyLB(cl, x, LB.worker)
        ## clean up the workers' global environments
        clusterEvalQ(cl, rm(".LB.fun", ".LB.args", pos = globalenv()))
        r
    }

The issue described above motivated me to learn more about how load balancing is done with R's parallel package. I do not want to bother you with details, but I can provide my code and results upon request. My result is that clusterApply(), clusterApplyLB(), and parLapplyLB2() from above (which is essentially a wrapper around clusterApplyLB()) perform equally badly if the tasks are ordered ascendingly according to the time they take. In any case, however, parLapply() and parLapplyLB() are even worse (because of the grouping effect highlighted above). If nothing is known about how long the tasks take (or if the tasks are in random order), clusterApplyLB() and parLapplyLB2() give quite good results, but still not a linear speedup.
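To make the suggestion of a chunk argument a bit more concrete, here is a minimal, untested sketch of what such a variant could look like. The name parLapplyLBChunked and the default of four chunks per worker are placeholders only, and the sketch assumes that clusterApplyLB() returns its results in the order of its input list, so that concatenating contiguous chunks restores the original order of X:

    ## sketch only; requires the parallel package (clusterApplyLB)
    parLapplyLBChunked <- function(cl, X, fun, ..., nchunks = 4 * length(cl)) {
        nchunks <- max(1L, min(as.integer(nchunks), length(X)))
        ## assign the elements of X to contiguous chunks of roughly equal size
        grp <- rep(seq_len(nchunks), each = ceiling(length(X) / nchunks),
                   length.out = length(X))
        chunks <- split(X, grp)
        ## each chunk is one task for the load balancer, so a worker that has
        ## finished its chunk dynamically picks up the next unprocessed one
        res <- clusterApplyLB(cl, chunks, lapply, fun, ...)
        ## chunks are contiguous slices of X and the results come back in
        ## input order, so concatenating them restores the original order
        do.call(c, res)
    }

With nchunks = length(cl) this gives back roughly the current behavior of parLapplyLB(), with nchunks = length(X) essentially that of clusterApplyLB(), and values in between would let the user trade communication overhead against the granularity of the load balancing.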
To come back to the timing experiments: if the tasks are ordered descendingly in terms of the time they take, clusterApplyLB() and parLapplyLB2() give the best results, because all workers start with the tasks that take longest, and the shorter tasks can then be distributed over the workers as they become idle. I suppose this is not surprising and probably common knowledge. I just wanted to make the point that empirical evaluations also confirm that parLapply() and parLapplyLB() perform equally badly when the lengths of the tasks vary. In the case of parLapply(), this is to be expected. In the case of parLapplyLB(), this is probably not what users would expect, given the misleading "LB" in the name of the function.

Regards,
Ulrich