David Kane <a296180 at mica.fmr.com>
2002-May-08 12:45 UTC
[R] Suggestions for poor man's parallel processing
Almost all of the heavy crunching I do in R is like:

> for(i in long.list){
+   do.something(i)
+ }
> collect.results()

Since all the invocations of do.something are independent of one another, there is no reason that I can't run them in parallel. Since my machine has four processors, a natural way to do this is to divide up long.list into 4 pieces and then start 4 jobs, each of which would process 1/4 of the items. I could then wait for the four jobs to finish (waiting for tag files and the like), collect the results, and go on my happy way. I might do this all within R (using system calls to fork off other R processes?) or by using Perl as a wrapper.

But surely there are others that have faced and solved this problem already! I do not *think* that I want to go into the details of RPVM since my needs are so limited. Does anyone have any advice for me? Various postings to R-help have hinted at ideas, but I couldn't find anything definitive. I will summarize for the list.

To the extent that it matters:

> R.version
         _
platform sparc-sun-solaris2.6
arch     sparc
os       solaris2.6
system   sparc, solaris2.6
status
major    1
minor    5.0
year     2002
month    04
day      29
language R

Regards,

Dave Kane

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
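For anyone reading this thread with a current R: the "divide long.list into 4 pieces and start 4 jobs" idea can be sketched with the parallel package, which has shipped with R since version 2.14.0 and so was not available under the R 1.5.0 shown in the post. This is only a minimal sketch under stated assumptions: long.list and do.something() below are hypothetical stand-ins for the poster's objects, and mclapply() forks child processes, so it assumes a Unix-alike such as the Solaris machine mentioned.

```r
library(parallel)  # ships with R >= 2.14.0; forking needs a Unix-alike

# Hypothetical stand-ins for the poster's long.list and do.something().
long.list <- as.list(1:20)
do.something <- function(i) i^2

n.jobs <- 4

# Divide long.list into n.jobs contiguous pieces of near-equal size.
piece <- ceiling(seq_along(long.list) * n.jobs / length(long.list))
pieces <- split(long.list, piece)

# Fork one child R process per piece; mclapply() waits for all of them.
chunk.results <- mclapply(pieces,
                          function(p) lapply(p, do.something),
                          mc.cores = n.jobs)

# Collect the results: the pieces are contiguous, so concatenating
# them in order restores the original ordering of long.list.
results <- do.call(c, unname(chunk.results))
```

Calling mclapply(long.list, do.something, mc.cores = 4) directly does a similar split internally; the explicit version above just makes the four pieces visible.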
By far the easiest approach is, as you say, just to hand-code parameter ranges into R scripts and run them on different machines. That said, I've recently found using a relational database to store parameter combinations and associated results really convenient. The basic idea is

<one time>
    insert all parameter combinations into database
    mark each row 'not started'

<on each client>
    connect to database
    while (1) {
        break if invalid connection
        begin transaction
        lock table
        query for parameter combination marked 'not started'
        break if no row returned
        mark returned row as 'in progress'
        insert time stamp and client host name <optional>
        end transaction <unlocks table for other clients>
        compute the result
        insert result into database
        insert ending time stamp <optional>
        mark row 'completed'
    }
    close connection
    exit

I then fire up the client script on each machine (one per processor) and let it run until it's done. You get automatic load balancing because faster machines process more parameter combinations. You can also query the database to get intermediate results and see how long before the entire parameter space is processed.

I recently used this approach to do ~320 days of computing in about 30 days. I added 40 client jobs on a nearby cluster halfway through the run when it became clear my 6 local CPUs were going to take a while. It was really convenient that I could add clients without disrupting anything. (As written, however, you cannot kill client jobs without leaving unfinished rows marked 'in progress', but that can easily be fixed.)

T.
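The database work-queue loop described in this reply can be sketched in R roughly as follows. Everything specific here is an assumption for illustration, not the poster's actual setup: it uses the DBI and RSQLite packages (which must be installed) with a temporary SQLite file instead of a shared database server, a made-up table called runs with parameters a and b, and a toy compute function. SQLite's BEGIN IMMEDIATE, which takes the write lock for the whole database, stands in for the "lock table" step.

```r
library(DBI)  # assumes the DBI and RSQLite packages are installed

db.file <- tempfile(fileext = ".sqlite")  # stand-in for a shared database

# <one time> insert all parameter combinations, each marked 'not started'
con <- dbConnect(RSQLite::SQLite(), db.file)
runs <- expand.grid(a = 1:3, b = c(0.1, 0.5))
runs$id <- seq_len(nrow(runs))
runs$status <- "not started"
runs$host <- NA_character_
runs$result <- NA_real_
dbWriteTable(con, "runs", runs)
dbDisconnect(con)

# <on each client> claim one row at a time until none are left
run.client <- function(db.file, compute) {
  con <- dbConnect(RSQLite::SQLite(), db.file)
  on.exit(dbDisconnect(con))
  host <- Sys.info()[["nodename"]]
  repeat {
    dbExecute(con, "BEGIN IMMEDIATE")  # lock out other clients
    row <- dbGetQuery(con,
      "SELECT id, a, b FROM runs WHERE status = 'not started' LIMIT 1")
    if (nrow(row) == 0) {
      dbExecute(con, "COMMIT")
      break
    }
    dbExecute(con,
      "UPDATE runs SET status = 'in progress', host = ? WHERE id = ?",
      params = list(host, row$id))
    dbExecute(con, "COMMIT")           # unlock for other clients

    value <- compute(row$a, row$b)     # the expensive part, outside the lock
    dbExecute(con,
      "UPDATE runs SET status = 'completed', result = ? WHERE id = ?",
      params = list(value, row$id))
  }
}

# Toy compute function; a real client script would do the heavy crunching.
run.client(db.file, function(a, b) a * b)
```

Starting the same client function in several R processes gives the load balancing described above, since each process claims a new row only when it finishes the previous one. A real deployment would also need retry handling for lock contention and some way to reset rows left 'in progress' by killed clients.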
Yes, RPVM is annoying to start up. However, there are higher-level functions coming (parallel apply) that might help justify the effort. Better yet, they are wrappers that hopefully will sit on top of RPVM or RMPI (via the LAM MPI implementation) and incorporate a parallel pRNG for assurance of minimum quality (RSPRNG, a wrapper to the SPRNG library).

Coming soon (no set release date planned yet) to a CRAN archive near you (the package is called SNOW, by Luke Tierney).

-- 
A.J. Rossini                            Rsrch. Asst. Prof. of Biostatistics
U. of Washington Biostatistics          rossini at u.washington.edu
FHCRC/SCHARP/HIV Vaccine Trials Net     rossini at scharp.org
-------------- http://software.biostat.washington.edu/ ----------------
FHCRC: M-W: 206-667-7025 (fax=4812)|Voicemail is pretty sketchy/use Email
UW:    Th:  206-543-1044 (fax=3286)|Change last 4 digits of phone to FAX
(my friday location is usually completely unpredictable.)
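The point about pairing a parallel apply with a parallel random number generator can be illustrated in a current R, with one substitution stated plainly: the sketch below uses the L'Ecuyer-CMRG streams from the base parallel package, not SPRNG or the RSPRNG wrapper mentioned in this reply. The simulate() function and the seed value are hypothetical, and mclapply() assumes a Unix-alike.

```r
library(parallel)

# Switch to the generator that supports independent streams.
# Note that this changes the RNG kind for the whole session.
RNGkind("L'Ecuyer-CMRG")
set.seed(2002)

# Derive one independent stream seed per task.
n.tasks <- 4
seeds <- vector("list", n.tasks)
s <- .Random.seed
for (k in seq_len(n.tasks)) {
  seeds[[k]] <- s
  s <- nextRNGStream(s)
}

# Hypothetical task: install the task's stream, then simulate.
simulate <- function(seed) {
  assign(".Random.seed", seed, envir = globalenv())
  mean(rnorm(1000))
}

# Parallel apply over the seeds; mc.set.seed = FALSE because each
# task installs its own stream explicitly.
draws <- mclapply(seeds, simulate, mc.cores = 2, mc.set.seed = FALSE)
```

Because each task carries its own stream seed, the results do not depend on which process runs which task, and rerunning with the same seeds reproduces them.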
I've been working on a simple interface for this sort of thing, modeled loosely on the Python CoW (Cluster of Workstations) package. A rough draft writeup with a link to the preliminary package is at http://www.stat.umn.edu/~luke/R/cluster/cluster.html. The idea is to provide a very simple front end for handling things like farming out simulations to a bunch of machines (or a bunch of processors on one machine) and collecting the results. The communications back ends that are supported are sockets or pvm via Michael Li and Tony Rossini's rpvm; mpi via Hao Yu's Rmpi should eventually be possible as well. Michael and Tony's rsprng is also supported.

It's very rough, but I won't get to cleaning it up for a week or two at least, so if anyone wants to play with it in the meantime, go ahead.

luke

-- 
Luke Tierney
University of Minnesota                      Phone:  612-625-7843
School of Statistics                         Fax:    612-624-8868
313 Ford Hall, 224 Church St. S.E.           email:  luke at stat.umn.edu
Minneapolis, MN 55455 USA                    WWW:    http://www.stat.umn.edu
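A front end of the kind described here, farming simulations out to workers over sockets and collecting the results, can be tried in a current R through the base parallel package, which incorporates snow-style cluster functions. The sketch below is not the preliminary package linked above and makes no claim about its 2002 interface; one.sim() and the seed are hypothetical, and it assumes two worker processes can be started on the local machine.

```r
library(parallel)

# Hypothetical simulation task: the mean of n uniform draws.
one.sim <- function(n) mean(runif(n))

cl <- makeCluster(2)                   # two socket workers on this machine
clusterSetRNGStream(cl, iseed = 2002)  # independent RNG stream per worker

# Run the same call on every worker, then farm out six simulations.
pids <- clusterCall(cl, Sys.getpid)
sims <- clusterApply(cl, rep(1000, 6), one.sim)

stopCluster(cl)

# Collect the results into a plain numeric vector.
sims <- unlist(sims)
```

Passing a vector of host names to makeCluster() instead of a worker count is the usual way to spread the same code over several machines, which requires those machines to be reachable and to have R installed.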