Travers Ching
2019-Apr-12 19:31 UTC
[Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()
Just throwing my two cents in: I think removing/deprecating fork would
be a bad idea for two reasons:

1) There are no performant alternatives.
2) Removing fork would break existing workflows.

Even if replaced with something using the same interface (e.g., a
function that automatically detects variables to export, as in the
amazing `future` package), the lack of copy-on-write functionality
would cause scripts everywhere to break.

A simple example illustrating these two points:

`x <- 5e8; mclapply(1:24, sum, x, 8)`

Using fork, `mclapply` takes 5 seconds. Using "psock", `clusterApply`
does not complete.

Travers

On Fri, Apr 12, 2019 at 2:32 AM Iñaki Ucar <iucar at fedoraproject.org> wrote:
>
> On Thu, 11 Apr 2019 at 22:07, Henrik Bengtsson
> <henrik.bengtsson at gmail.com> wrote:
> >
> > ISSUE:
> > Using *forks* for parallel processing in R is not always safe.
> > [...]
> > Comments?
>
> Using fork() is never safe. The reference provided by Kevin [1] is
> pretty compelling (I kindly encourage anyone who ever forked a
> process to read it). Therefore, I'd go beyond Henrik's suggestion,
> and I'd advocate for deprecating fork clusters and eventually
> removing them from parallel.
>
> [1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf
>
> --
> Iñaki Úcar
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
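A minimal, runnable sketch of the comparison described above (a hedged
illustration, not the original benchmark: the vector size and worker
counts are assumptions, and a large vector is used instead of the
scalar `x <- 5e8` so that the cost of copying is actually visible):

    library(parallel)

    x <- runif(5e8)  # ~4 GB of doubles; shared copy-on-write under fork

    # Fork-based (Unix only): children inherit `x` without copying it.
    system.time(
      res_fork <- mclapply(1:24, function(i) sum(x) + i, mc.cores = 8)
    )

    # PSOCK-based: `x` must be serialized to every worker first.
    cl <- makeCluster(8)
    clusterExport(cl, "x")  # this copy is where the time and memory go
    system.time(
      res_psock <- parLapply(cl, 1:24, function(i) sum(x) + i)
    )
    stopCluster(cl)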
Iñaki Ucar
2019-Apr-12 22:45 UTC
[Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()
On Fri, 12 Apr 2019 at 21:32, Travers Ching <traversc at gmail.com> wrote:
>
> Just throwing my two cents in:
>
> I think removing/deprecating fork would be a bad idea for two reasons:
>
> 1) There are no performant alternatives.

"Performant"... in terms of what? If the cost of copying the data
predominates over the computation time, maybe you didn't need
parallelization in the first place.

> 2) Removing fork would break existing workflows.

I don't see why mclapply could not be rewritten using PSOCK clusters.
And as a side effect, this would enable those workflows on Windows,
which doesn't support fork.

> Even if replaced with something using the same interface (e.g., a
> function that automatically detects variables to export, as in the
> amazing `future` package), the lack of copy-on-write functionality
> would cause scripts everywhere to break.

To implement copy-on-write, Linux overcommits virtual memory, and this
is what causes scripts to break unexpectedly: everything works fine,
until you change a small, unimportant bit and... boom, out of memory.
And in general, running forks in any GUI would cause things everywhere
to break.

> A simple example illustrating these two points:
> `x <- 5e8; mclapply(1:24, sum, x, 8)`
>
> Using fork, `mclapply` takes 5 seconds. Using "psock", `clusterApply`
> does not complete.

I'm not sure how you set that up, but it does complete. Or do you mean
that you ran out of memory? Then try replacing "x" with, e.g., "x+1"
in your mclapply example and see what happens (hint: save your work
first).

--
Iñaki Úcar
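As a hedged sketch of what such a PSOCK-based rewrite could look like
(the function `psock_lapply` and its arguments are hypothetical, not
an actual proposal from this thread):

    library(parallel)

    psock_lapply <- function(X, FUN, ..., cores = 2L) {
      cl <- makeCluster(cores)  # PSOCK workers; also works on Windows
      on.exit(stopCluster(cl))
      # Unlike fork, workers inherit nothing: any globals used by FUN
      # would have to be detected and exported, e.g. via clusterExport().
      parLapply(cl, X, FUN, ...)
    }

    psock_lapply(1:24, function(i) i^2, cores = 4L)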
Travers Ching
2019-Apr-13 01:03 UTC
[Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()
Hi Iñaki,

> "Performant"... in terms of what? If the cost of copying the data
> predominates over the computation time, maybe you didn't need
> parallelization in the first place.

Performant in terms of speed. There's no copying of `x` in that
example under `mclapply`, so it is significantly faster than the
alternatives. It is a very simple and contrived example, but there are
lots of applications that depend on processing large data and benefit
from multithreading. For example: reading in large sequencing data
with `Rsamtools` and checking the sequences for a set of motifs.

> I don't see why mclapply could not be rewritten using PSOCK clusters.

Because it would be much slower.

> To implement copy-on-write, Linux overcommits virtual memory, and this
> is what causes scripts to break unexpectedly: everything works fine,
> until you change a small, unimportant bit and... boom, out of memory.
> And in general, running forks in any GUI would cause things everywhere
> to break.

> I'm not sure how you set that up, but it does complete. Or do you mean
> that you ran out of memory? Then try replacing "x" with, e.g., "x+1"
> in your mclapply example and see what happens (hint: save your work
> first).

Yes, I meant that it ran out of memory on my desktop. I understand the
limits, and it is not perfect because of the GUI issue you mention,
but I don't see a better alternative in terms of speed.

Regards,
Travers
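A hedged sketch of the "x+1" experiment discussed above (vector size
and core count are assumptions): reading `x` in the children keeps the
copy-on-write pages shared, while computing `x + 1` forces each child
to materialize its own result.

    library(parallel)

    x <- runif(5e8)  # ~4 GB

    # Read-only access: pages stay shared, memory use stays flat.
    r1 <- mclapply(1:8, function(i) sum(x), mc.cores = 8)

    # Each child allocates a private ~4 GB vector to hold `x + 1`, so
    # total demand is roughly (cores + 1) * 4 GB; this is the "boom,
    # out of memory" case on a typical desktop. Save your work before
    # uncommenting:
    # r2 <- mclapply(1:8, function(i) sum(x + 1), mc.cores = 8)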