jacob at forestlidar.org
2016-Mar-23 16:54 UTC
[R] bug (?) with lapply / clusterMap / clusterApply etc
Very informative! Thank you.

Quoting Martin Morgan <martin.morgan at roswellpark.org>:

> On 03/22/2016 01:46 PM, jacob at forestlidar.org wrote:
>>
>> Hello I have encountered a bug(?) with the parallel package. When run
>> from within a function, the parLapply function appears to be copying the
>> entire parent environment (environment of interior of function) into all
>> child nodes in the cluster, one node at a time - which is very very slow
>> - and the copied contents are not even accessible within the child nodes
>> even though they are apparent in the memory footprint. This happens when
>> parLapply is run from within a function. I may be misusing the terms
>> "parent" and "node" here...
>>
>> The below code demonstrates the issue. The same parallel command is used
>> twice within the function, once before creating a large object, and once
>> afterwards. Both commands should take a nearly identical amount of time.
>> Initially the parallel code takes less than 1/100th of a second, but in
>> the second iteration requires hundreds of times longer...
>>
>> Example Code:
>>
>> #create a cluster of nodes
>> if(!"clus1" %in% ls()) clus1=makeCluster(10)
>>
>> #function used to demonstrate bug
>> rows_fn1=function(x,clus){
>>
>> #first set of parallel code
>>
>> print(system.time(parLapply(clus,1:5,function(z){y=rnorm(5000);return(mean(y))})))
>>
>>
>> #create large vector
>> x=rnorm(10^7)
>>
>> #second set
>>
>> print(system.time(parLapply(clus,1:5,function(z){y=rnorm(5000);return(mean(y))})))
>>
>>
>> }
>>
>> #demonstrate bug - watch task manager and see windows slowly copy the vector to each node in the cluster
>> rows_fn1(1:5000,clus1)
>>
>> Although the child nodes bloat proportionally to the size of x in the
>> parent environment, x is not available in the child nodes.
>> The code
>
> With this
>
>     library(parallel)
>     cl <- makeCluster(2)
>     f <- function() {
>         x <- 10
>         parSapply(cl, 1:5, function(i) x * i)
>     }
>
> we see both that x is available, and why (so that symbols available
> in the environment in which FUN is defined are available, just like
> serial evaluation) the variable is copied
>
>     > f()
>     [1] 10 20 30 40 50
>
> Defining the function in the global environment, rather than in the
> body of a function, avoids copying implicit state,
>
>     cl <- makeCluster(2)
>     FUN <- function(i) x * i
>     f <- function() {
>         x <- 10
>         parSapply(cl, 1:5, FUN)
>     }
>
> but requires that all arguments are defined / passed
>
>     > f()
>     Error in checkForRemoteErrors(val) (from #3) :
>       2 nodes produced errors; first error: object 'x' not found
>
> updating the function definition and use
>
>     FUN <- function(i, x) x * i
>     f <- function() {
>         x <- 10
>         parSapply(cl, 1:5, FUN, x)
>     }
>
>     > f()
>     [1] 10 20 30 40 50
>
> The foreach package tries to be smart and export only symbols used
> (but can be tricked)
>
>     library(foreach)
>     library(doSNOW)
>     registerDoSNOW(cl)
>     g <- function() {
>         x <- 10
>         foreach(i=1:2) %dopar% { get("x") }
>     }
>
>     > g() # fails because 'x' is not referenced directly so not exported
>     Error in { (from #3) : task 1 failed - "object 'x' not found"
>
> versus
>
>     g <- function() {
>         x <- 10
>         foreach(i=1:2) %dopar% { get("x"); x }
>     }
>
> and
>
>     > g() # works because 'x' referenced and exported
>     [[1]]
>     [1] 10
>
>     [[2]]
>     [1] 10
>
>
> Martin
>
>> above can be tweaked to add more variables (x1,x2,x3 ...) and the child
>> nodes will bloat to the same degree.
>>
>> I am working on Windows Server 2012, I am using 64bit R version 3.2.1. I
>> upgraded to 3.2.4revised and observed the same bug.
>>
>> I have googled for this issue and have not encountered any other
>> individuals having a similar problem.
>>
>> I have attempted to reboot my machine without effect (aside from the
>> obvious).
>>
>> Any suggestions would be greatly appreciated!
>>
>> With regards,
>>
>> Jacob L Strunk
>> Forest Biometrician (PhD), Statistician (MSc)
>> and Data Munger
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
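
A rough way to see the copying Martin describes without starting a cluster: the worker function has to be serialized before it is sent to each node, and a function created inside another function carries that function's environment with it. The sketch below only measures the serialized size. The helper names (payload_bytes, make_local, make_global_env) are made up for illustration and are not part of the parallel package, and resetting the environment is shown only to make the contrast visible, not as a recommended fix.

```r
# Hypothetical helper: number of bytes serialize() produces for a function.
payload_bytes <- function(fun) length(serialize(fun, NULL))

# Worker function created inside another function. Its enclosing
# environment holds 'big', so 'big' is serialized along with it.
make_local <- function(n) {
    big <- rnorm(n)
    function(z) mean(rnorm(5000))
}

# Same worker function, but with its environment reset to the global
# environment, which is serialized as a reference rather than by value.
make_global_env <- function(n) {
    big <- rnorm(n)
    fun <- function(z) mean(rnorm(5000))
    environment(fun) <- globalenv()
    fun
}

small <- payload_bytes(make_local(10))        # tiny enclosing frame
large <- payload_bytes(make_local(1e6))       # frame holds 1e6 doubles
reset <- payload_bytes(make_global_env(1e6))  # frame no longer attached

print(c(small = small, large = large, reset = reset))
```

The 'large' payload grows with the size of 'big' even though the worker function never uses it, which matches the slowdown in the second parLapply call of the original example. Passing the data a worker actually needs as explicit arguments, as in Martin's FUN(i, x), keeps the payload limited to what is used.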