Martin Morgan
2012-Oct-26 04:35 UTC
[Rd] parallel::pvec FUN types differ when v is a list; code simplifications?
In pvec(list(1, 2), FUN, mc.cores=2) FUN sees integer() arguments whereas pvec(list(1, 2, 3), FUN, mc.cores=2) FUN sees list() arguments; the latter seems consistent with pvec's description. This came up in a complicated Bioconductor thread about generics and parallel evaluation https://stat.ethz.ch/pipermail/bioc-devel/2012-October/003745.html One relevant point is that a light-weight re-write of parallel/R/unix/pvec.R (below) appears to avoid the need for an object v to satisfy is.vector as.list [ and instead only requires [ This can be important when as.list() forces an inefficient representation of an object that can nonetheless be subset with [. It also seems like the code below will be more space efficient, since v is not split into a second object in the parent but only subset in the children? Index: pvec.R ==================================================================--- pvec.R (revision 61012) +++ pvec.R (working copy) @@ -21,8 +21,6 @@ pvec <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE, mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE) { - if (!is.vector(v)) stop("'v' must be a vector") - env <- parent.frame() cores <- as.integer(mc.cores) if(cores < 1L) stop("'mc.cores' must be >= 1") @@ -31,16 +29,7 @@ if(mc.set.seed) mc.reset.stream() n <- length(v) - l <- if (n <= cores) as.list(v) else { - ## compute the scheduling, making it as fair as possible - il <- as.integer(n / cores) - xc <- n - il * cores - sl <- rep(il, cores) - if (xc) sl[1:xc] <- il + 1L - si <- cumsum(c(1L, sl)) - se <- si + c(sl, 0L) - 1L - lapply(seq_len(cores), function(ix) v[si[ix]:se[ix]]) - } + si <- splitIndices(n, cores) jobs <- NULL cleanup <- function() { ## kill children if cleanup is requested @@ -59,8 +48,8 @@ on.exit(cleanup()) FUN <- match.fun(FUN) ## may have more cores than tasks .... - jobs <- lapply(seq_len(min(n, cores)), - function(i) mcparallel(FUN(l[[i]], ...), name = i, + jobs <- lapply(si, + function(i) mcparallel(FUN(v[i], ...), mc.set.seed = mc.set.seed, silent = mc.silent)) res <- mccollect(jobs) On 10/24/2012 05:07 PM, Cook, Malcolm wrote:> On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.michael at gene.com> wrote: > >> I agree that it would fruitful to have parLapply in BiocGenerics. It looks >> to be a flexible abstraction and its presence in the parallel package >> makes >> it ubiquitous. If it hasn't been done already, mclapply (and mcmapply) >> would be good candidates, as well. The fork-based parallelism is >> substantively different in terms of the API from the more general >> parallelism of parLapply. >> >> Someone was working on some more robust and convenient wrappers around >> mclapply. Did that ever see the light of day? > > > If you are referring to > http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660 > > in which I had offered some small changes to parallel::pvec > > https://gist.github.com/3757873/ > > and after which Martin had provided me with a route I have not (yet?) > followed toward submitting a patch to R for consideration by R-devel / > Simon Urbanek in > > http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-oth > er-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c > > > > >>>> On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou < >>>> mailinglist.honeypot at gmail.com**> wrote: >>>> >>>> In response to a question from yesterday, I pointed someone to the >>>>> ShortRead `srapply` function and I wondered to myself why it had to >>>>> necessarily by "burried" in the ShortRead package (aside from it >>>>> having a `sr` prefix). >>>>> >>>> >>> I don't know that srapply necessarily 'got it right'... > > > One thing I like about srapply is its support for a reduce argument. > >>>>> I had thought it might be a good idea to move that (or something like >>>>> that) to BiocGenerics (unless implementations aren't allowed there) >>>>> but also realized that it would add more dependencies where someone >>>>> might not necessarily need them. > > >>>>> >>>>> But, almost surely, a large majority of the people will be happy to do >>>>> some form of ||-ization, so in my mind it's not such an onerous thing >>>>> to add -- on the other hand, this large majority is probably enriched >>>>> for people who are doing NGS analysis, in which case, keeping it in >>>>> ShortRead can make some sense. > > I remain confused about the need for putting any of this into BiocGenerics > at all. It seems to me that properly construed parallization primitives > ought to 'just work' with any object which supports indexing and length. > > I would appreciate hearing arguments to the contrary. > > Florian, in a similar vein, could we not seek to change > parallel::makeCluster to be extensible to, say, support SGE cluster? THis > seems like the 'right thing to do'. ??? > > > Regardless, I think we have raised some considerations that might inform > improvements to parallel, including points about error handling, reducing > results, block-level parallization over List/Vector (in addition to > vector), etc. > > I think perhaps having a google doc that we can collectively edit to > contain the requirements we are trying to achieve might move us forward > effectively. Would this help? Or perhaps a page under > http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ??? > > >>>>> Taking one step back, I recall some chatter last week (or two) about >>>>> some better ||-ization "primitives" -- something about a pvec doo-dad, >>>>> and there being ideas to wrap different types of ||-ization behind an >>>>> easy to use interface (I think this was the convo), and then I took a >>>>> further step back and often wonder why we just don't bite the bullet >>>>> and take advantage of the `foreach` infrastructure that is already out >>>>> there -- in which case, I could imagne a "doSGE" package that might >>>>> handle the particulars of what Florain is referring to. You could then >>>>> configure it externally via some `registerDoSGE(some.config.**object)` >>>>> and just have the package code happily run it through `foreach(...) >>>>> %dopar%` and be done w/ it. >>>>> >>>>> >>>>> IMHO it is relevant. I have not looked for other abstractions, and >>>>> this >>>> one seems >>>> to work. Florian's objectives might be a good test case for adequacy. >>>> >>> >>> The registerDoDah does seem to be a useful abstraction. > > Is this not more-or-less the intention of parallel::setDefaultCluster? > > --Malcolm > > > >>> >>> I think there's a lot of work to do for some sort of coordinated >>> parallelization that putting parLapply into BiocGenerics might >>> encourage; >>> not good things will happen when everyone in a call stack tries to >>> parallelize independently. But I'm in favor of parLapply in >>> BiocGenerics at >>> least for the moment. >>> >>> Martin >>> >>> >>> >>>> >>>> ... at least, I thought this is what was being talked about here (and >>>>> popped up a week or two ago) -- sorry if I completely missed the mark >>>>> ... >>>>> >>>>> -steve >>>>> >>>>> >>>>> On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian >>>>> <florian.hahne at novartis.com> wrote: >>>>> >>>>>> Hi Martin, >>>>>> I could define the generics in my own package, but that would mean >>>>>> that >>>>>> those will only be available there, or in the global environment >>>>>> assuming >>>>>> that I also export them, or in all additional packages that >>>>>> explicitly >>>>>> import them from my name space. Now there already are a whole bunch >>>>>> of >>>>>> packages around that all allow for parallelization via a cluster >>>>>> object. >>>>>> Obviously those all import the parLapply function from the parallel >>>>>> package. That means that I can't simply supply my own modified >>>>>> cluster >>>>>> object, because the code that calls parLapply will not know about the >>>>>> generic in my package, even if it is attached. Ideally parLapply >>>>>> would >>>>>> be >>>>>> a generic function already in the parallel package. Not sure who >>>>>> needs >>>>>> to >>>>>> be convinced in order for this to happen, but my gut feeling was >>>>>> that it >>>>>> could be easier to have the generic in BiocGenerics. >>>>>> Maybe I am missing something obvious here, but imo there is no way to >>>>>> overwrite parLapply globally for my own class unless the generic is >>>>>> imported by everyone who wants to make use of the special method. >>>>>> Florian >>>>>> -- >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 10/23/12 2:20 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote: >>>>>> >>>>>> On 10/17/2012 05:45 AM, Hahne, Florian wrote: >>>>>>> >>>>>>>> Hi all, >>>>>>>> I was wondering whether it would be possible to have proper >>>>>>>> generics >>>>>>>> >>>>>>> for >>>>> >>>>>> some of the functions in the parallel package, e.g. parLapply and >>>>>>>> clusterCall. The reason I am asking is because I want to build an >>>>>>>> S4 >>>>>>>> class >>>>>>>> that essentially looks like an S3 cluster object but knows how to >>>>>>>> deal >>>>>>>> with the SGE. That way I can abstract away all the overhead >>>>>>>> regarding >>>>>>>> job >>>>>>>> submission, job status and reducing the results in the parLapply >>>>>>>> method >>>>>>>> of >>>>>>>> that class, and would be able to supply this new cluster object to >>>>>>>> all >>>>>>>> of >>>>>>>> my existing functions that can be processed in parallel using a >>>>>>>> cluster >>>>>>>> object as input. I have played around with the BatchJobs package >>>>>>>> as an >>>>>>>> abstraction layer to SGE and that work nicely. As a test case I >>>>>>>> have >>>>>>>> created the necessary generics myself in order to supply my own >>>>>>>> SGEcluster >>>>>>>> object to a function that normally deals with the "regular" >>>>>>>> parallel >>>>>>>> package S3 cluster objects and everything just worked out of the >>>>>>>> box, >>>>>>>> but >>>>>>>> obviously this fails once I am in a name space and my generic is >>>>>>>> not >>>>>>>> found >>>>>>>> anymore. Of course what we would really want is some proper >>>>>>>> abstraction >>>>>>>> of >>>>>>>> parallelization in R, but for now this seem to be at least a cheap >>>>>>>> compromise. Any thoughts on this? > >>>>>>>> >>>>>>> >>>>>>> Hi Florian -- we talked about this locally, but I guess we didn't >>>>>>> actually send >>>>>>> any email! >>>>>>> >>>>>>> Is there an obstacle to promoting these to generics in your own >>>>>>> package? >>>>>>> The >>>>>>> usual motivation for inclusion in BiocGenerics has been to avoid >>>>>>> conflicts >>>>>>> between packages, but I'm not sure whether this is the case (yet)? >>>>>>> This >>>>>>> would >>>>>>> also add a dependency fairly deep in the hierarchy. >>>>>>> >>>>>>> What do you think? >>>>>>> >>>>>>> Martin >>>>>>> >>>>>>> Florian >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Computational Biology / Fred Hutchinson Cancer Research Center >>>>>>> 1100 Fairview Ave. N. >>>>>>> PO Box 19024 Seattle, WA 98109 >>>>>>> >>>>>>> Location: Arnold Building M1 B861 >>>>>>> Phone: (206) 667-2793 >>>>>>> >>>>>> >>>>>> ______________________________**_________________ >>>>>> Bioc-devel at r-project.org mailing list >>>>>> >>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.c >>>>>> h/mailman/listinfo/bioc-devel> >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Steve Lianoglou >>>>> Graduate Student: Computational Systems Biology >>>>> | Memorial Sloan-Kettering Cancer Center >>>>> | Weill Medical College of Cornell University >>>>> Contact Info: >>>>> http://cbio.mskcc.org/~lianos/**contact<http://cbio.mskcc.org/%7Elianos >>>>> /contact> >>>>> >>>>> ______________________________**_________________ >>>>> Bioc-devel at r-project.org mailing list >>>>> >>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch >>>>> /mailman/listinfo/bioc-devel> >>>>> >>>>> >>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________**_________________ >>>> Bioc-devel at r-project.org mailing list >>>> >>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/ >>>> mailman/listinfo/bioc-devel> >>>> >>>> >>> >>> -- >>> Computational Biology / Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. >>> PO Box 19024 Seattle, WA 98109 >>> >>> Location: Arnold Building M1 B861 >>> Phone: (206) 667-2793 >>> >>> ______________________________**_________________ >>> Bioc-devel at r-project.org mailing list >>> >>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/m >>> ailman/listinfo/bioc-devel> >>> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793