Andrew Piskorski
2008-Sep-04  18:34 UTC
[Rd] Erlang-style message-passing in R: Rmpi, Snow, NetWorkSpaces, etc.
I see about 7 different R packages for multi-process parallel
programming. Which do you think is the best, most complete, and most
robust to pick for general purpose Erlang-style message-passing
programming in R, and why?

First here's my use case, and then my analysis so far. I often have
code whose basic organization looks something like this:

1. Fetch step: For each date, gather up or pre-process a bunch of
   data. Return a big list of data, one item on the list for each
   date.

2. Compute step: For each date on the big list of data, do a bunch of
   computations.

Of course, when the number of dates is large, it's pretty annoying to
wait for all the fetches to complete before starting the compute step.
(Especially when the compute step then hits a bug on the very first
date.) So in practice, I end up breaking things apart to fetch and
then compute one date at a time, etc.

However, instead of completely serializing everything the way I do
now, it would be nice to have 2 concurrent threads of control
(processes, threads, coroutines, or whatever) which talk to each
other. Then the compute thread can just periodically say to the fetch
thread, "Give me the next date's worth of data, please." And usually
the fetch thread will already have that data fetched and ready to go.

Also, sometimes my "compute step" is slow and has lots of readily
parallelizable work, so it would be even better if I could optionally
run things across multiple physical machines in a cluster.

How to do it? R is single-threaded and not thread safe, so threads
are out. Coroutines are also probably out. The obvious approach is to
use multiple R processes which talk to each other via some message
passing library.

Fortunately, R has a plethora of such packages. My question is, which
is the best choice for this sort of use? From reading their API docs,
here are my brief thoughts on each so far:

- papply: Not suitable, no bi-directional communication.
  Slave processes return values when the papply() call completes,
  that's it.

- biopara: Not suitable, simple one-way master/slave communication
  only, just like papply.

- snow: Not directly suitable, the supported communication is intended
  to be very simple. But since it runs on top of Rmpi, perhaps its
  utility code would be useful in conjunction with Rmpi?

- taskPR: Sounds equivalent to snow. Also uses MPI underneath.

- Rmpi: Probably. Should definitely work for my needs, only question
  is if it's the best choice. Is it stable, complete, robust, etc.?

- rpvm: Maybe. Should be equivalent to Rmpi, but MPI is much more
  popular on clusters than PVM these days.

- NetWorkSpaces: Maybe. This looks like a rather mature and
  well-supported multi-language TupleSpace implementation, so it could
  certainly be made to work. Passing all my large R data objects back
  and forth solely as strings seems very unappealing, but the docs
  hint that it includes direct (or at least transparent) support for
  binary R objects. I need to start up and run an explicit
  NetWorkSpaces Python/Twisted server. Also, TupleSpace programming
  sounds somewhat more limiting than Erlang-style message passing
  (although I definitely do not know that for sure!). On the other
  hand, the TupleSpace APIs sound a lot simpler than MPI.

Since I've never done MPI programming before, I'm also curious about
some of the practical semantics of Rmpi. E.g., is it possible to send
a message to a busy R process that says, "Stop what you're doing right
now!" and have it obeyed immediately? Probably not, as I think that
would require either multiple threads or an active event loop
somewhere in either R or the MPI stack.

Finally, here are links and some notes on each of the above 7 packages
(converted from HTML with 'lynx -dump'):

* [1]Rmpi ([2]CRAN, [3]tutorial), [4]rpvm ([5]CRAN).

* [6]SNOW ([7]CRAN) - Simple Network of Workstations for R, high
  level interface for parallel R on clusters, uses sockets, MPI, or
  PVM underneath.
  Reportedly intended for "embarrassingly parallel", not closely
  coupled, problems.

* [8]papply ([9]CRAN)

* The [10]Parallel-R project provides both [11]RScaLAPACK ([12]CRAN)
  and [13]taskPR ([14]old), using MPI.

* [15]biopara - One-way master/slave communication, much like papply
  or taskPR. Uses R sockets, no MPI or PVM underneath.

* [16]NetWorkSpaces for R ([17]article, [18]FAQ) from [19]SCAI is a
  [20]dual licenced (GPL and commercial) Linda/tuplespace
  implementation. Also, some aspects sound similar to the [21]data
  flow variables in [22]Van Roy's [23]CTM and [24]Mozart/Oz.

References

   1. http://www.stats.uwo.ca/faculty/yu/Rmpi/
   2. http://cran.us.r-project.org/src/contrib/Descriptions/Rmpi.html
   3. http://ace.acadiau.ca/math/ACMMaC/Rmpi/
   4. http://www.analytics.washington.edu/statcomp/projects/rhpc/rpvm/
   5. http://cran.us.r-project.org/src/contrib/Descriptions/rpvm.html
   6. http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html
   7. http://cran.us.r-project.org/src/contrib/Descriptions/snow.html
   8. http://ace.acadiau.ca/math/ACMMaC/software/papply/
   9. http://cran.us.r-project.org/src/contrib/Descriptions/papply.html
  10. http://www.aspect-sdm.org/Parallel-R/
  11. http://www.aspect-sdm.org/Parallel-R/RScaLAPACK/RScaLAPACK.html
  12. http://cran.us.r-project.org/src/contrib/Descriptions/RScaLAPACK.html
  13. http://cran.us.r-project.org/web/packages/taskPR/
  14. http://www.aspect-sdm.org/Parallel-R/task-pR/task-pR.html
  15. http://cran.us.r-project.org/src/contrib/Descriptions/biopara.html
  16. http://sourceforge.net/projects/nws-r/
  17. http://www.ddj.com/web-development/200001971
  18. http://nws-r.sourceforge.net/NetWorkSpacesFAQ.html
  19. http://www.lindaspaces.com/about/
  20. http://www.lindaspaces.com/products/os_licensing.html
  21. http://en.wikipedia.org/wiki/Oz_(programming_language)#Dataflow_variables_and_declarative_concurrency
  22. http://www.info.ucl.ac.be/~pvr/cvvanroy.html
  23. http://www.amazon.com/gp/product/0262220695/
  24. http://www.mozart-oz.org/

-- 
Andrew Piskorski <atp at piskorski.com>
http://www.piskorski.com/
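
The fetch/compute arrangement described in the message above can be
sketched as two workers joined by a bounded mailbox. This is only a
minimal illustration, written in Python rather than R, and it uses
threads purely to stay self-contained; in R each side would presumably
be a separate process talking through one of the packages listed.
fetch(), compute(), fetcher(), run_pipeline() and the prefetch
parameter are hypothetical names, not part of any package mentioned in
this thread.

```python
import queue
import threading

SENTINEL = None  # message that marks the end of the stream of dates


def fetch(date):
    """Hypothetical fetch step: stand-in for gathering one date's data."""
    return {"date": date, "values": [1, 2, 3]}


def compute(item):
    """Hypothetical compute step: stand-in for the per-date computation."""
    return item["date"], sum(item["values"])


def fetcher(dates, outbox):
    """Fetch each date in order and post the result as a message."""
    for date in dates:
        outbox.put(fetch(date))  # blocks while the mailbox is full
    outbox.put(SENTINEL)


def run_pipeline(dates, prefetch=2):
    """Compute date by date while the fetcher works ahead.

    At most `prefetch` fetched items wait in the mailbox at any time.
    """
    mailbox = queue.Queue(maxsize=prefetch)
    worker = threading.Thread(target=fetcher, args=(dates, mailbox), daemon=True)
    worker.start()
    results = []
    while True:
        item = mailbox.get()  # "give me the next date's worth of data"
        if item is SENTINEL:
            break
        results.append(compute(item))
    worker.join()
    return results
```

Because the compute loop starts as soon as the first item arrives, a
bug in the compute step shows up on the first date instead of after
every fetch has finished. The bounded mailbox keeps the fetcher from
running arbitrarily far ahead and holding all the data in memory.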
David Bauer
2008-Sep-04  20:06 UTC
[Rd] Erlang-style message-passing in R: Rmpi, Snow, NetWorkSpaces, etc.
> - taskPR: Sounds equivalent to snow. Also uses MPI underneath.

Actually, it is very different from snow. taskPR was an attempt to
get 'free' parallelism out of already existing programs by using
simple data dependencies to figure out which individual statements in
a program can be run in parallel. The name comes from the description
of the program as exploiting task-level parallelism. Compare this to
snow, which uses data-level parallelism (performing the same operation
on many pieces of data at once). Additionally, MPI is optional, and
only used for the initial setup of processes.

(If anybody actually uses or has successfully used this package, I
would love to hear about it, btw. While the package *does* work,
there are probably few cases where it is worth it.)

David Bauer
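
The distinction between data-level and task-level parallelism drawn
above can be illustrated with a small sketch. It is written in Python
rather than R and does not reflect how snow or taskPR are implemented.
In particular, taskPR is described as working out the dependencies
between statements itself, whereas here they are spelled out by hand.
square(), data_parallel() and task_parallel() are hypothetical names
chosen for the illustration, and threads are used only to keep it
self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """The single operation applied to every piece of data."""
    return x * x


def data_parallel(values):
    """Data-level: the same operation applied to many pieces of data."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(square, values))


def task_parallel(values):
    """Task-level: different statements with no mutual dependency run together."""
    with ThreadPoolExecutor() as pool:
        total = pool.submit(sum, values)    # statement A
        largest = pool.submit(max, values)  # statement B, independent of A
        # This statement needs the results of both A and B, so it waits.
        return total.result() / largest.result()
```

In the first function the parallelism comes from the size of the
input; in the second it comes from the shape of the program, which is
why it can only help when independent statements are individually
expensive.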