Bryce Lelbach via llvm-dev
2017-May-11 05:14 UTC
[llvm-dev] PSA: Parallel STL algorithms available in LLVM
On May 10, 2017 9:14 PM, "Hal Finkel" <hfinkel at anl.gov> wrote:

> On 05/10/2017 10:36 PM, Zachary Turner via llvm-dev wrote:
>> It's hard to say. By definition it appears undefined (in the sense that
>> the TS literally does not define it), but on the other hand it is a TS
>> and this issue would (hopefully) come up and be specified before it made
>> it to standardization.
>
> You mean the parallelism TS that was voted into C++17? ;) Bryce, did they
> end up defining this?

TL;DR: ... $*!#

I'll take a full look at this tomorrow.

> -Hal
>
>> Supporting recursive parallel calls certainly seems like desirable
>> behavior, so from my point of view it would be nice to make sure it
>> works. Not sure if others feel differently.
>>
>> On Wed, May 10, 2017 at 8:08 PM Scott Smith <scott.smith at purestorage.com>
>> wrote:
>>> The spec doesn't seem to say anything about recursive calls to parallel
>>> functions. In my mind that means the functions must support it, since
>>> it doesn't explicitly say it does not need to support it. Do you think
>>> that's accurate?
>>>
>>> If so, I'll rely on that behavior in LLDB, and extend the implementation
>>> in LLVM accordingly.
>>>
>>> On Wed, May 10, 2017 at 5:37 PM, Zachary Turner via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>> Hi all,
>>>>
>>>> This is just a PSA that as of r302752, 3 parallel algorithms (for_each,
>>>> for_each_n, and sort) are available in llvm/Support/Parallel.h.
>>>>
>>>> Effort was made to match the C++ Parallelism TS N4507
>>>> [http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf] as
>>>> closely as possible, but some aspects were intentionally omitted.
>>>>
>>>> No support is added for the executor proposal N4406
>>>> [http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4406.pdf],
>>>> but I plan to try to work on this in the future, with no specified
>>>> timeline.
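For concreteness, the "nested parallel algorithm" pattern under discussion
looks roughly like the sketch below, written against the C++17 <execution>
interface that the TS turned into (llvm/Support/Parallel.h spells its API
differently, and the Module/indexAll names are purely hypothetical):

// Outer parallel loop whose element access function itself invokes a
// parallel algorithm. Whether implementations must support this is the
// open question in this thread.
#include <algorithm>
#include <execution>
#include <vector>

struct Module {
  std::vector<int> Symbols;
};

void indexAll(std::vector<Module> &Modules) {
  std::for_each(std::execution::par, Modules.begin(), Modules.end(),
                [](Module &M) {
                  // Nested call: a parallel sort from inside the parallel
                  // for_each's element access function.
                  std::sort(std::execution::par, M.Symbols.begin(),
                            M.Symbols.end());
                });
}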
Bryce Lelbach via llvm-dev
2017-May-12 07:52 UTC
[llvm-dev] PSA: Parallel STL algorithms available in LLVM
Alright, I thought about this today and now I have a less terse response.

High-level summary on nested parallelism:

> On Wed, May 10, 2017 at 8:08 PM Scott Smith <scott.smith at purestorage.com> wrote:
>>
>> The spec doesn't seem to say anything about recursive calls to parallel
>> functions. In my mind that means the functions must support it, since it
>> doesn't explicitly say it does not need to support it.

* ^ That sounds correct. Conforming C++17 and Parallelism TS
  implementations should support nested parallel algorithms; I am fairly
  confident about this.
* My implementation (HPX) supports nested parallel algorithms implicitly
  (due to the design of our system - we have an M:N user-space tasking
  system) - at least for our CPU-centric executors. So I've never had to
  think about this before.
* I am concerned that nested parallel algorithms will prove to be a big
  implementation burden for GPU and accelerator architectures.
* I don't think having some executors support this and some not support
  this makes sense. Executors represent the /where/ of execution: on what
  resources (be they software abstractions, like thread pools, or a
  concrete hardware resource) work should be executed. Execution policies
  describe the /how/ of execution; e.g. what parallelism is allowable and
  what the contract is between element access functions (aka users) and
  the parallel algorithms runtime. I could imagine an execution policy
  that specifies UB for nested parallelism.
* I don't think we've discussed this on the committee in the 1.5 years
  I've been involved with the standardization of the parallel algorithms.
  That time period covers the TS process and the IS process - I don't
  remember any NB comments about this on either. I checked the minutes
  briefly, as well. Perhaps I am very wrong and it was discussed earlier
  in the process before I arrived, or I am remembering incorrectly; but it
  is entirely possible this issue was never considered directly.

I took a look at your initial work on llvm/Support/Parallel.h - it looks
like a good start!

However, I want to gently urge caution and then take this thread on a bit
of a tangent, because I suspect this work will start leading towards
whatever goes into the libc++ C++17 parallel algorithms implementation.
Even if the intention is to have a separate, stand-alone implementation in
LLVM Support, I think that implementation should share some interfaces and
design philosophy with whatever goes into libc++. I think there is a very
important step that needs to be taken before substantial work is done on
libc++'s parallel algorithms.

Let me explain by analogy: A few years ago, my research group was tasked
with delivering an OpenMP compatibility story for our software stack. We
decided the best approach was for us to implement the interface of the
Intel OpenMP runtime library, because this would allow us to "hijack" that
interface, which is targeted by multiple compilers (Clang, ICPC, and one
DoE research compiler). Much to my surprise, this effort actually worked
pretty well, with very few of the nightmarish stories that you might
expect when gluing multiple large projects together.

Why was this effort successful?

* The Intel OpenMP interface was fairly stable.
* It was well documented (relative to other internal interface layers).
* It was designed to support this "hijacking" model.
* The interface was not overly restrictive; there were not an excessive
  number of places where we had to force square pegs through round holes.

I feel strongly that the parallel algorithms implementation should take a
similar approach, where the parallel runtime layer is encapsulated, well
defined, and separate from the "frontend" (aka the algorithms themselves).
One option, of course, would be to use executors for this. However,
executors will likely not enable the "hijacking" model without code
changes and recompilation. Even with executors, we will always have the
"implicit default executor" that notionally exists today in C++17 and the
Parallelism TS. I'd really like end-users and people who are shipping LLVM
toolchains to be able to switch the parallel runtime layer by simply
linking against a different library that implements some agreed-upon
interface.

I've spoken with both the libc++ and libstdc++ devs at past committee
meetings about this; I think they have had similar thoughts.

The challenge is to come up with something that meets everyone's
requirements:

* Everyone will want clear encapsulation between the parallel runtime
  engine and the algorithms implementation - including a well-defined and
  fairly stable (both API and ABI stable) interface between the two.
* The default implementation will need to work for hundreds of thousands
  of users; it has to be robust and MUST avoid inflicting any unnecessary
  runtime costs (e.g. no one should pay for what they aren't using).
* High-performance users will want a very different type of
  implementation - they'll want a more aggressive runtime layer that will
  assume large amounts of parallel work will be executed and will try to
  amortize, not minimize, costs.
* Some users will need implementations with very strict real-time or
  latency constraints.
* Some users will want to plug in special implementations for
  heterogeneous systems.

At the end of the day, crazy people like me will make our runtimes work
with whatever interface is provided. However, I think it would be very
beneficial to try and come up with a design before major implementation
work is done, and to get feedback and guidance from interested parties.

The potential benefit of this approach: one day in the future, we might
have multiple different backend implementations of the parallel algorithms
that work with multiple different compiler frameworks, all of which
interoperate through this agreed-upon C++17 parallel algorithms runtime
layer. I think we'll thank ourselves later for sketching out a concrete,
overall design first.

So... the question is: are others interested in this approach and willing
to help? As one of the culprits of parallel algorithms in C++17, I've
suspected that I'd end up contributing some sweat and blood to libc++'s
design and implementation, but it is not a one-person project.

TL;DR - We should encapsulate the parallelism engine from the algorithms
themselves in libc++ and stick a flexible and robust interface layer
(distinct from executors and completely implicit) between them. I
tentatively volunteer to collaborate with Marshall/Eric on this, but would
need other interested parties to help.

--
Bryce Adelstein Lelbach aka wash
Lawrence Berkeley National Laboratory
ISO C++ Committee Member
CppCon and C++Now Program Chair
Compiler ICE Hunter
--
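To make the runtime-layer seam described above a bit more concrete, here
is a very rough sketch of the kind of backend interface the algorithm
frontend could be written against, with the implementation supplied by
whichever library gets linked in. Every name below is hypothetical;
nothing like this has been agreed on or proposed in code:

// Hypothetical "parallel runtime layer" seam. The algorithms only talk to
// this interface; the symbol is resolved at link time by whichever backend
// library is linked (a default thread pool, HPX, an OpenMP-based runtime,
// ...).
#include <cstddef>
#include <functional>
#include <iterator>

namespace parallel_backend {

// The one entry point the algorithms need for bulk work: run Body(0),
// Body(1), ..., Body(N-1), possibly concurrently, and return only once
// every invocation has finished.
void bulk_execute(std::size_t N, std::function<void(std::size_t)> Body);

} // namespace parallel_backend

// Frontend side: an algorithm written purely against the seam.
template <typename Iter, typename Func>
void frontend_for_each(Iter Begin, Iter End, Func F) {
  const auto N = static_cast<std::size_t>(std::distance(Begin, End));
  parallel_backend::bulk_execute(N, [&](std::size_t I) {
    F(*std::next(Begin, static_cast<std::ptrdiff_t>(I)));
  });
}

// Backend side: the simplest conforming backend just runs serially; a real
// one would hand the work to a thread pool or a tasking runtime.
namespace parallel_backend {
void bulk_execute(std::size_t N, std::function<void(std::size_t)> Body) {
  for (std::size_t I = 0; I != N; ++I)
    Body(I);
}
} // namespace parallel_backend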
C Bergström via llvm-dev
2017-May-12 08:05 UTC
[llvm-dev] PSA: Parallel STL algorithms available in LLVM
On Fri, May 12, 2017 at 3:52 PM, Bryce Lelbach via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

> Alright, I thought about this today and now I have a less terse response.
>
> [...]
>
> TL;DR - We should encapsulate the parallelism engine from the algorithms
> themselves in libc++ and stick a flexible and robust interface layer
> (distinct from executors and completely implicit) between them. I
> tentatively volunteer to collaborate with Marshall/Eric on this, but
> would need other interested parties to help.

I'd be happy to quietly follow the discussion and chime in on the GPU or
accelerator side of things if needed.
Scott Smith via llvm-dev
2017-May-12 16:00 UTC
[llvm-dev] PSA: Parallel STL algorithms available in LLVM
On Fri, May 12, 2017 at 12:52 AM, Bryce Lelbach <balelbach at lbl.gov> wrote:

> * I am concerned that nested parallel algorithms will prove to be a big
>   implementation burden for GPU and accelerator architectures.

Can't they fall back on serial execution? I thought the executor is a
hint, not a requirement (certainly the standard doesn't say it has to
execute on multiple threads, just that it may).

> However, I want to gently urge caution and then take this thread on a
> bit of a tangent, because I suspect this work will start leading towards
> whatever goes into the libc++ C++17 parallel algorithms implementation.
> Even if the intention is to have a separate, stand-alone implementation
> in LLVM Support, I think that implementation should share some
> interfaces and design philosophy with whatever goes into libc++. I think
> there is a very important step that needs to be taken before substantial
> work is done on libc++'s parallel algorithms.

I fear what you're describing is another 1.5-year-long standards
committee-like process, involving multiple stakeholders, discussions, etc.

The background of how this came to be in LLVM is roughly:

1. I wanted to parallelize more work in LLDB, which has its own
   non-nestable task execution model. It involved creating individual
   tasks, rather than describing the iteration requested, so I put
   together my own parallel::for_each-like implementation just for LLDB.
2. It was suggested that rather than have each LLVM subproject implement
   its own framework, it should instead be put into LLVM proper. Zachary
   volunteered to do that, taking code from LLD and pulling it into LLVM.
3. It was then suggested that the interface should look more like the
   C++17 standard, presumably to make it easier to transition to the
   standard library and drop LLVM's own implementation once the time was
   right.
4. Back to LLDB: I want to make more code use for_each, but that code may
   itself call for_each, so in order to avoid creating Yet Another
   parallelism model, I want to make sure LLVM's for_each supports nested
   calling.

As it is, we have a real use case today for this behavior, but not the
resources/time to invest in defining how a shared library interface should
look, separate from the C++17 parallelism interface, just so that libc++
may (or may not) pick it up somewhere down the line. IMO it makes more
sense to continue with the separate implementation of "kinda mostly C++17
parallelism" with minimal changes/improvements as necessary inside LLVM,
and then switch to the libc++/libstdc++/etc standard implementation later,
once those interfaces are implemented and pervasive across all the
architectures that LLVM needs to work on. Otherwise, we hold up progress
in LLVM/LLDB today.
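As for the "fall back on serial execution" point above: one illustrative
way an implementation could degrade nested calls to serial execution is a
thread-local flag, roughly as sketched below. This is purely a sketch of
the idea, not how llvm/Support/Parallel.h is implemented, and all names
are made up:

// Nested calls detected via a thread_local flag fall back to plain
// std::for_each on the calling thread; only the outermost call fans out.
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <thread>
#include <vector>

namespace sketch {

thread_local bool InParallelRegion = false;

template <typename Iter, typename Func>
void parallel_for_each(Iter Begin, Iter End, Func F) {
  if (InParallelRegion) {
    // Nested call: run serially on this thread.
    std::for_each(Begin, End, F);
    return;
  }

  const std::size_t NumWorkers =
      std::max(1u, std::thread::hardware_concurrency());
  const auto Count = static_cast<std::size_t>(std::distance(Begin, End));
  const std::size_t Chunk = (Count + NumWorkers - 1) / NumWorkers;

  std::vector<std::thread> Workers;
  for (std::size_t W = 0; W < NumWorkers && W * Chunk < Count; ++W) {
    Iter ChunkBegin =
        std::next(Begin, static_cast<std::ptrdiff_t>(W * Chunk));
    Iter ChunkEnd = std::next(
        ChunkBegin,
        static_cast<std::ptrdiff_t>(std::min(Chunk, Count - W * Chunk)));
    Workers.emplace_back([=] {
      InParallelRegion = true; // nested calls made by F will serialize
      std::for_each(ChunkBegin, ChunkEnd, F);
      InParallelRegion = false;
    });
  }
  for (auto &T : Workers)
    T.join();
}

} // namespace sketch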