jacob at forestlidar.org
2016-Mar-22 17:46 UTC
[R] bug (?) with lapply / clusterMap / clusterApply etc
Hello,

I have encountered what looks like a bug in the parallel package. When parLapply is called from within a function, it appears to copy the entire enclosing environment (the environment of the function body) to every child node in the cluster, one node at a time, which is very slow; the copied contents are not even accessible on the child nodes, even though they show up in the memory footprint. (I may be misusing the terms "parent" and "node" here.)

The code below demonstrates the issue. The same parallel call is made twice within the function, once before creating a large object and once afterwards. Both calls should take a nearly identical amount of time; instead, the first takes less than 1/100th of a second and the second takes hundreds of times longer.

Example Code:

library(parallel)

#create a cluster of nodes
if(!"clus1" %in% ls()) clus1 = makeCluster(10)

#function used to demonstrate the problem
rows_fn1 = function(x, clus){
  #first parallel call
  print(system.time(parLapply(clus, 1:5, function(z){ y = rnorm(5000); return(mean(y)) })))
  #create a large vector
  x = rnorm(10^7)
  #second parallel call
  print(system.time(parLapply(clus, 1:5, function(z){ y = rnorm(5000); return(mean(y)) })))
}

#demonstrate the problem - watch the task manager and see Windows slowly copy the vector to each node in the cluster
rows_fn1(1:5000, clus1)

Although the child nodes bloat in proportion to the size of x in the enclosing environment, x is not available on the child nodes. The code above can be tweaked to add more variables (x1, x2, x3, ...) and the child nodes bloat to the same degree.

I am working on Windows Server 2012 with 64-bit R 3.2.1; I upgraded to 3.2.4 revised and observed the same behavior. I have googled for this issue and have not found anyone reporting a similar problem, and rebooting the machine had no effect (aside from the obvious).

Any suggestions would be greatly appreciated!

With regards,

Jacob L Strunk
Forest Biometrician (PhD), Statistician (MSc) and Data Munger
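
P.S. In case it is relevant: below is a minimal sketch of a possible workaround, assuming the slowdown comes from the anonymous worker function's enclosing environment (which contains x) being serialized to each node. The idea is to reset the worker function's environment to globalenv() so the enclosing environment is not shipped; the name rows_fn2 is just for illustration.

library(parallel)

rows_fn2 = function(x, clus){
  #worker function defined locally, but with its environment reset so that
  #the large local objects in rows_fn2 are not serialized along with it
  f = function(z){ y = rnorm(5000); return(mean(y)) }
  environment(f) = globalenv()
  print(system.time(parLapply(clus, 1:5, f)))
  x = rnorm(10^7)
  print(system.time(parLapply(clus, 1:5, f)))
}

rows_fn2(1:5000, clus1)

If the copying is indeed caused by the closure's enclosing environment, both calls should then take roughly the same time. The trade-off is that f can no longer see anything in rows_fn2's environment, so anything it needs would have to be passed through the ... arguments of parLapply or exported with clusterExport.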