Bryce Lelbach via llvm-dev
2017-May-11 05:14 UTC
[llvm-dev] PSA: Parallel STL algorithms available in LLVM
On May 10, 2017 9:14 PM, "Hal Finkel" <hfinkel at anl.gov> wrote:

> On 05/10/2017 10:36 PM, Zachary Turner via llvm-dev wrote:
>> It's hard to say. By definition it appears undefined (in the sense that
>> the TS literally does not define it), but on the other hand it is a TS
>> and this issue would (hopefully) come up and be specified before it made
>> it to standardization.
>
> You mean the parallelism TS that was voted into C++17? ;) Bryce, did they
> end up defining this?

TL;DR: ... $*!#

I'll take a full look at this tomorrow.

> -Hal
>
>> Supporting recursive parallel calls certainly seems like desirable
>> behavior, so from my point of view it would be nice to make sure it
>> works. Not sure if others feel differently.
>>
>> On Wed, May 10, 2017 at 8:08 PM Scott Smith <scott.smith at purestorage.com>
>> wrote:
>>> The spec doesn't seem to say anything about recursive calls to parallel
>>> functions. In my mind that means the functions must support it, since
>>> it doesn't explicitly say it does not need to support it. Do you think
>>> that's accurate?
>>>
>>> If so, I'll rely on that behavior in LLDB, and extend the implementation
>>> in LLVM accordingly.
>>>
>>> On Wed, May 10, 2017 at 5:37 PM, Zachary Turner via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>> Hi all,
>>>>
>>>> This is just a PSA that as of r302752, 3 parallel algorithms (for_each,
>>>> for_each_n, and sort) are available in llvm/Support/Parallel.h.
>>>>
>>>> Effort was made to match the C++ Parallelism TS N4507
>>>> [http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf] as
>>>> closely as possible, but some aspects were intentionally omitted.
>>>>
>>>> No support is added for the executor proposal N4406
>>>> [http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4406.pdf],
>>>> but I plan to try to work on this in the future, with no specified
>>>> timeline.
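For concreteness, the "nested parallel algorithm" pattern under discussion
looks roughly like the sketch below, written against the C++17 <execution>
interface that the TS turned into (llvm/Support/Parallel.h spells its API
differently, and the Module/indexAll names are purely hypothetical):

// Outer parallel loop whose element access function itself invokes a
// parallel algorithm. Whether implementations must support this is the
// open question in this thread.
#include <algorithm>
#include <execution>
#include <vector>

struct Module {
  std::vector<int> Symbols;
};

void indexAll(std::vector<Module> &Modules) {
  std::for_each(std::execution::par, Modules.begin(), Modules.end(),
                [](Module &M) {
                  // Nested call: a parallel sort from inside the parallel
                  // for_each's element access function.
                  std::sort(std::execution::par, M.Symbols.begin(),
                            M.Symbols.end());
                });
}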
Bryce Lelbach via llvm-dev
2017-May-12 07:52 UTC
[llvm-dev] PSA: Parallel STL algorithms available in LLVM
Alright, I thought about this today and now I have a less terse response.

High-level summary on nested parallelism:

> On Wed, May 10, 2017 at 8:08 PM Scott Smith <scott.smith at purestorage.com> wrote:
>>
>> The spec doesn't seem to say anything about recursive calls to parallel
>> functions. In my mind that means the functions must support it, since it
>> doesn't explicitly say it does not need to support it.

* ^ That sounds correct. Conforming C++17 and Parallelism TS
  implementations should support nested parallel algorithms; I am fairly
  confident about this.
* My implementation (HPX) supports nested parallel algorithms implicitly
  (due to the design of our system - we have an M:N user-space tasking
  system) - at least for our CPU-centric executors. So I've never had to
  think about this before.
* I am concerned that nested parallel algorithms will prove to be a big
  implementation burden for GPU and accelerator architectures.
* I don't think having some executors support this and some not support
  this makes sense. Executors represent the /where/ of execution: on what
  resources (be they software abstractions, like thread pools, or a
  concrete hardware resource) work should be executed. Execution policies
  describe the /how/ of execution; e.g. what parallelism is allowable and
  what the contract is between element access functions (aka users) and
  the parallel algorithms runtime. I could imagine an execution policy
  that specifies UB for nested parallelism.
* I don't think we've discussed this on the committee in the 1.5 years
  I've been involved with the standardization of the parallel algorithms.
  That time period covers the TS process and the IS process - I don't
  remember any NB comments about this on either. I checked the minutes
  briefly, as well. Perhaps I am very wrong and it was discussed earlier
  in the process before I arrived, or I am remembering incorrectly; but it
  is entirely possible this issue was never considered directly.

I took a look at your initial work on llvm/Support/Parallel.h - it looks
like a good start!

However, I want to gently urge caution and then take this thread on a bit
of a tangent, because I suspect this work will start leading towards
whatever goes into the libc++ C++17 parallel algorithms implementation.
Even if the intention is to have a separate, stand-alone implementation in
LLVM Support, I think that implementation should share some interfaces and
design philosophy with whatever goes into libc++. I think there is a very
important step that needs to be taken before substantial work is done on
libc++'s parallel algorithms.

Let me explain by analogy: A few years ago, my research group was tasked
with delivering an OpenMP compatibility story for our software stack. We
decided the best approach was for us to implement the interface of the
Intel OpenMP runtime library, because this would allow us to "hijack" that
interface, which is targeted by multiple compilers (Clang, ICPC, and one
DoE research compiler). Much to my surprise, this effort actually worked
pretty well, with very few of the nightmarish stories that you might
expect when gluing multiple large projects together.

Why was this effort successful?

* The Intel OpenMP interface was fairly stable.
* It was well documented (relative to other internal interface layers).
* It was designed to support this "hijacking" model.
* The interface was not overly restrictive; there were not an excessive
  number of places where we had to force square pegs through round holes.

I feel strongly that the parallel algorithms implementation should take a
similar approach, where the parallel runtime layer is encapsulated, well
defined, and separate from the "frontend" (aka the algorithms themselves).
One option, of course, would be to use executors for this. However,
executors will likely not enable the "hijacking" model without code
changes and recompilation. Even with executors, we will always have the
"implicit default executor" that notionally exists today in C++17 and the
Parallelism TS. I'd really like end-users and people who are shipping LLVM
toolchains to be able to switch the parallel runtime layer by simply
linking against a different library that implements some agreed-upon
interface.

I've spoken with both the libc++ and libstdc++ devs at past committee
meetings about this; I think they have had similar thoughts.

The challenge is to come up with something that meets everyone's
requirements:

* Everyone will want clear encapsulation between the parallel runtime
  engine and the algorithms implementation - including a well-defined and
  fairly stable (both API and ABI stable) interface between the two.
* The default implementation will need to work for hundreds of thousands
  of users; it has to be robust and MUST avoid inflicting any unnecessary
  runtime costs (e.g. no one should pay for what they aren't using).
* High-performance users will want a very different type of
  implementation - they'll want a more aggressive runtime layer that will
  assume large amounts of parallel work will be executed and will try to
  amortize, not minimize, costs.
* Some users will need implementations with very strict real-time or
  latency constraints.
* Some users will want to plug in special implementations for
  heterogeneous systems.

At the end of the day, crazy people like me will make our runtimes work
with whatever interface is provided. However, I think it would be very
beneficial to try and come up with a design before major implementation
work is done, and to get feedback and guidance from interested parties.

The potential benefit of this approach: one day in the future, we might
have multiple different backend implementations of the parallel algorithms
that work with multiple different compiler frameworks, all of which
interoperate through this agreed-upon C++17 parallel algorithms runtime
layer. I think we'll thank ourselves later for sketching out a concrete,
overall design first.

So... the question is: are others interested in this approach and willing
to help? As one of the culprits of parallel algorithms in C++17, I've
suspected that I'd end up contributing some sweat and blood to libc++'s
design and implementation, but it is not a one-person project.

TL;DR - We should encapsulate the parallelism engine from the algorithms
themselves in libc++ and stick a flexible and robust interface layer
(distinct from executors and completely implicit) between them. I
tentatively volunteer to collaborate with Marshall/Eric on this, but would
need other interested parties to help.

--
Bryce Adelstein Lelbach aka wash
Lawrence Berkeley National Laboratory
ISO C++ Committee Member
CppCon and C++Now Program Chair
Compiler ICE Hunter
--
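To make the runtime-layer seam described above a bit more concrete, here
is a very rough sketch of the kind of backend interface the algorithm
frontend could be written against, with the implementation supplied by
whichever library gets linked in. Every name below is hypothetical;
nothing like this has been agreed on or proposed in code:

// Hypothetical "parallel runtime layer" seam. The algorithms only talk to
// this interface; the symbol is resolved at link time by whichever backend
// library is linked (a default thread pool, HPX, an OpenMP-based runtime,
// ...).
#include <cstddef>
#include <functional>
#include <iterator>

namespace parallel_backend {

// The one entry point the algorithms need for bulk work: run Body(0),
// Body(1), ..., Body(N-1), possibly concurrently, and return only once
// every invocation has finished.
void bulk_execute(std::size_t N, std::function<void(std::size_t)> Body);

} // namespace parallel_backend

// Frontend side: an algorithm written purely against the seam.
template <typename Iter, typename Func>
void frontend_for_each(Iter Begin, Iter End, Func F) {
  const auto N = static_cast<std::size_t>(std::distance(Begin, End));
  parallel_backend::bulk_execute(N, [&](std::size_t I) {
    F(*std::next(Begin, static_cast<std::ptrdiff_t>(I)));
  });
}

// Backend side: the simplest conforming backend just runs serially; a real
// one would hand the work to a thread pool or a tasking runtime.
namespace parallel_backend {
void bulk_execute(std::size_t N, std::function<void(std::size_t)> Body) {
  for (std::size_t I = 0; I != N; ++I)
    Body(I);
}
} // namespace parallel_backend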
C Bergström via llvm-dev
2017-May-12 08:05 UTC
[llvm-dev] PSA: Parallel STL algorithms available in LLVM
On Fri, May 12, 2017 at 3:52 PM, Bryce Lelbach via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

> Alright, I thought about this today and now I have a less terse response.
>
> [...]
>
> TL;DR - We should encapsulate the parallelism engine from the algorithms
> themselves in libc++ and stick a flexible and robust interface layer
> (distinct from executors and completely implicit) between them. I
> tentatively volunteer to collaborate with Marshall/Eric on this, but
> would need other interested parties to help.

I'd be happy to quietly follow the discussion and chime in on the GPU or
accelerator side of things if needed.
Scott Smith via llvm-dev
2017-May-12 16:00 UTC
[llvm-dev] PSA: Parallel STL algorithms available in LLVM
On Fri, May 12, 2017 at 12:52 AM, Bryce Lelbach <balelbach at lbl.gov> wrote:

> * I am concerned that nested parallel algorithms will prove to be a big
>   implementation burden for GPU and accelerator architectures.

Can't they fall back on serial execution? I thought the executor is a
hint, not a requirement (certainly the standard doesn't say it has to
execute on multiple threads, just that it may).

> However, I want to gently urge caution and then take this thread on a
> bit of a tangent, because I suspect this work will start leading towards
> whatever goes into the libc++ C++17 parallel algorithms implementation.
> Even if the intention is to have a separate, stand-alone implementation
> in LLVM Support, I think that implementation should share some
> interfaces and design philosophy with whatever goes into libc++. I think
> there is a very important step that needs to be taken before substantial
> work is done on libc++'s parallel algorithms.

I fear what you're describing is another 1.5-year-long standards
committee-like process, involving multiple stakeholders, discussions, etc.

The background of how this came to be in LLVM is roughly:

1. I wanted to parallelize more work in LLDB, which has its own
   non-nestable task execution model. It involved creating individual
   tasks, rather than describing the iteration requested, so I put
   together my own parallel::for_each-like implementation just for LLDB.
2. It was suggested that rather than have each LLVM subproject implement
   its own framework, it should instead be put into LLVM proper. Zachary
   volunteered to do that, taking code from LLD and pulling it into LLVM.
3. It was then suggested that the interface should look more like the
   C++17 standard, presumably to make it easier to transition to the
   standard library and drop LLVM's own implementation once the time was
   right.
4. Back to LLDB: I want to make more code use for_each, but that code may
   itself call for_each, so in order to avoid creating Yet Another
   parallelism model, I want to make sure LLVM's for_each supports nested
   calling.

As it is, we have a real use case today for this behavior, but not the
resources/time to invest in defining how a shared library interface should
look, separate from the C++17 parallelism interface, just so that libc++
may (or may not) pick it up somewhere down the line. IMO it makes more
sense to continue with the separate implementation of "kinda mostly C++17
parallelism" with minimal changes/improvements as necessary inside LLVM,
and then switch to the libc++/libstdc++/etc standard implementation later,
once those interfaces are implemented and pervasive across all the
architectures that LLVM needs to work on. Otherwise, we hold up progress
in LLVM/LLDB today.
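As for the "fall back on serial execution" point above: one illustrative
way an implementation could degrade nested calls to serial execution is a
thread-local flag, roughly as sketched below. This is purely a sketch of
the idea, not how llvm/Support/Parallel.h is implemented, and all names
are made up:

// Nested calls detected via a thread_local flag fall back to plain
// std::for_each on the calling thread; only the outermost call fans out.
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <thread>
#include <vector>

namespace sketch {

thread_local bool InParallelRegion = false;

template <typename Iter, typename Func>
void parallel_for_each(Iter Begin, Iter End, Func F) {
  if (InParallelRegion) {
    // Nested call: run serially on this thread.
    std::for_each(Begin, End, F);
    return;
  }

  const std::size_t NumWorkers =
      std::max(1u, std::thread::hardware_concurrency());
  const auto Count = static_cast<std::size_t>(std::distance(Begin, End));
  const std::size_t Chunk = (Count + NumWorkers - 1) / NumWorkers;

  std::vector<std::thread> Workers;
  for (std::size_t W = 0; W < NumWorkers && W * Chunk < Count; ++W) {
    Iter ChunkBegin =
        std::next(Begin, static_cast<std::ptrdiff_t>(W * Chunk));
    Iter ChunkEnd = std::next(
        ChunkBegin,
        static_cast<std::ptrdiff_t>(std::min(Chunk, Count - W * Chunk)));
    Workers.emplace_back([=] {
      InParallelRegion = true; // nested calls made by F will serialize
      std::for_each(ChunkBegin, ChunkEnd, F);
      InParallelRegion = false;
    });
  }
  for (auto &T : Workers)
    T.join();
}

} // namespace sketch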