Parallel processing usually involves quite a lot of overhead, which is
expensive when the computation itself is quick. This is a clear case of a
function (mean) that is too simple to benefit from parallelization. In
addition, your example contains an error that makes the effect look even
stronger: the loops only run over i = 1:3, so you are averaging only the
first three elements of the list.
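As a rough illustration of the overhead (a minimal sketch on my side, not
part of your code; the cluster size of 2 and the toy data are arbitrary):

library(parallel)
x <- replicate(1000, rnorm(100), simplify = FALSE)
system.time(lapply(x, mean))         # serial: essentially no overhead
cl <- makeCluster(2)
system.time(parLapply(cl, x, mean))  # parallel: communication dominates
stopCluster(cl)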
I have modified the example below to call a more complicated function
than mean. With that change, the parallelized example is faster
(although not by much). To see the difference, replace the lapply lines
with "lapply(dat, mean)". Below the foreach example, you can also see
the same computation with clusterApply, which seems to be much more
efficient for this problem.
# Simulate a large list of numeric vectors
N <- 200000
myList <- vector('list', N)
for (i in 1:N) {
  myList[[i]] <- rnorm(100)
}
library(foreach)
library(doParallel)

ncores <- 7
registerDoParallel(cores = ncores)

# Assign each list element to one of ncores groups ("X1" ... "X7"), so
# that each foreach task processes one large chunk instead of a single
# element
names(myList) <- make.names(rep(1:ncores, length.out = N))
nms <- 1:ncores

# Serial version
system.time(result <- foreach(i = 1:ncores) %do% {
  dat <- myList[which(names(myList) == make.names(nms[i]))]
  lapply(dat, FUN = function(x) log(sd(x)) + sd(x) + var(x))
})
# Parallel version: one chunk per worker
system.time(result2 <- foreach(i = 1:ncores) %dopar% {
  dat <- myList[which(names(myList) == make.names(nms[i]))]
  lapply(dat, FUN = function(x) log(sd(x)) + sd(x) + var(x))
})
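As a quick sanity check (my addition, not in your original code), you can
verify that the serial and parallel versions return the same thing:

all.equal(result, result2)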
foreach is not always the best choice for parallel processing. You could
also have a look at clusterApply from the parallel package:
library(parallel)
f1 <- function(x) mean(x)
f2 <- function(x) log(sd(x)) + sd(x) + var(x)
cl <- makeCluster(ncores)
clusterExport(cl, list("f1", "f2"))
# Split the list into one chunk per worker, reusing the names set above
dats <- split(myList, names(myList))
# Simple function: the serial lapply wins
system.time(res <- clusterApply(cl, dats, fun = function(x) lapply(x, f1)))
system.time(res <- lapply(dats, FUN = function(x) lapply(x, f1)))
# More expensive function: the parallel version wins
system.time(res <- clusterApply(cl, dats, fun = function(x) lapply(x, f2)))
system.time(res <- lapply(dats, FUN = function(x) lapply(x, f2)))
stopCluster(cl)
lapply is still faster for the example with mean, but much slower for
the more complicated function.
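If you are on a Unix-alike, parallel::mclapply may also be worth a try; it
forks the workers, so the data do not have to be exported to them. A sketch
under that assumption (I have not benchmarked this variant here):

library(parallel)
system.time(res <- mclapply(dats, function(x) lapply(x, f2), mc.cores = ncores))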
Best,
Jon
On 12/4/2016 3:11 AM, Doran, Harold wrote:
> As a follow up to this, I have been able to generate a toy example of
> reproducible code that generates the same problem. Below is just a sample
> to represent the issue, but my data and subsequent functions acting on the
> data are much more involved.
>
> I no longer have the error, but the loop running in parallel is extremely
> slow relative to its serialized counterpart.
>
> I have narrowed down the problem to the fact that I am searching through a
> very large list, grabbing the data from that list by indexing to subset and
> then doing stuff to it. Both "work", but the parallel version is very, very
> slow. I believe I am sending data files to each core and the number of
> searches happening is prohibitive.
>
> I am very much stuck in the design-based way of how I would do this
> particular problem on a single core and am not sure if there is a better
> design-based approach for solving this problem in the parallel version.
>
> Any advice on better ways to work with the %dopar% version here?
>
> N <- 200000
> myList <- vector('list', N)
> names(myList) <- 1:N
> for(i in 1:N){
> myList[[i]] <- rnorm(100)
> }
> nms <- 1:N
> library(foreach)
> library(doParallel)
> registerDoParallel(cores=7)
>
> result <- foreach(i = 1:3) %do% {
> dat <- myList[[which(names(myList) == nms[i])]]
> mean(dat)
> }
>
> result <- foreach(i = 1:3) %dopar% {
> dat <- myList[[which(names(myList) == nms[i])]]
> mean(dat)
> }
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Doran, Harold
> Sent: Saturday, December 03, 2016 4:26 PM
> To: r-help at r-project.org
> Subject: [R] error serialize (foreach)
>
> I have a portion of a foreach loop that I cannot run in parallel but that
> works fine when serialized. Below is a representation of the problem; in
> this instance I cannot provide reproducible data to generate the same
> error, as the actual data I am working with are confidential.
>
> Within each foreach loop are a series of custom functions acting on my
> data. When using %do% I get the expected result, but replacing it with
> %dopar% generates the error.
>
> I have searched the archives and also stackexchange and see this is an
> issue that arises. I have tried a couple of the recommendations, like
> using an outfile in makeCluster, but I am not having success.
>
> Oddly (or perhaps not oddly), other portions of my program run in
> parallel and do not generate this same error.
>
> library(foreach)
> library(doParallel)
> registerDoParallel(cores=3)
>
> # This portion runs and produces expected result result <- foreach(i =
1:N) %do% {
> tmp1 <- function1(...)
> tmp2 <- function2(...)
> tmp2
> }
>
> # This portion generates error in serialize
> result <- foreach(i = 1:N) %dopar% {
>   tmp1 <- function1(...)
>   tmp2 <- function2(...)
>   tmp2
> }
>
> error in serialize(data, node$con) : error writing to connection
--
Jon Olav Skøien
Joint Research Centre - European Commission
Institute for Space, Security & Migration
Disaster Risk Management Unit
Via E. Fermi 2749, TP 122, I-21027 Ispra (VA), ITALY
jon.skoien at jrc.ec.europa.eu
Tel: +39 0332 789205
Disclaimer: Views expressed in this email are those of the individual
and do not necessarily represent official views of the European Commission.