Nicolai Hähnle via llvm-dev
2018-Dec-19 19:31 UTC
[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)
Hi all,

LLVM needs a solution to the long-standing problem that the IR is unable to express certain semantics expected by high-level programming languages that target GPUs.

Solving this issue is necessary both for upstream use of LLVM as a compiler backend for GPUs and for correctly supporting LLVM IR <-> SPIR-V roundtrip translation. It may also be useful for compilers targeting SPMD-on-CPU a la ispc -- after all, some GPU hardware really just looks like a CPU with extra-wide SIMD.

After thinking and talking about the problem on and off for more than two years now, I'm convinced that it can only be solved by adding dedicated semantics to LLVM IR, which take the shape of:

- a new function (and call) attribute similar to `convergent`,
- explicitly adding language to the LangRef that talks about groups of threads and their communication with each other via special functions,
- including how these groups of threads are impacted (split and merged) by branches etc., and
- new target-independent intrinsic(s) which manipulate subgroups of threads, mostly by marking locations in the program where threads reconverge.

Details to be determined, of course.

In this mail, I would mostly like to make initial contact with the larger community. First, to explain the problem to CPU-centric folks and maybe help get over the initial shock. Second, to reach out to other people thinking about GPUs: Has anybody else given this issue much thought? How can we proceed to get this solved?

The Problem
===========
Programming languages for GPUs have "cross-lane" or "subgroup" operations which allow fine-grained exchange of data between threads being executed together in a "wave" or "warp".

The poster child is ballot, which takes a boolean argument and returns a bitmask of the value of the argument across the "subgroup"/"wave"/"warp", but more complex operations exist as well, e.g. for reducing a value across all active lanes of a wave or for computing a prefix scan.
The key issue is that *the set of threads participating in the exchange of data is implicitly defined by control flow*.

Two examples to demonstrate the resulting problem and the limitations of the existing LLVM IR semantics. The first one:

  bool value = ...;

  if (condition) {
    bitmask0 = ballot(value);
    foo(bitmask0);
  } else {
    bitmask1 = ballot(value);
    bar(bitmask1);
  }

The semantics of high-level languages demand that `bitmask0` only contains set bits for threads (lanes) for which `condition` is true, and analogously for `bitmask1`. However, there is no reasonable way in LLVM IR to prevent the ballot call from being hoisted above the if-statement, which changes the behavior.

(Existing frontends for the AMDGPU target currently implement a gross hack where `value` is passed through a call to a unique chunk of no-op inline assembly that is marked as having unspecified side effects...)

The second example:

  uint64_t bitmask;
  for (;;) {
    ...
    if (...) {
      bool value = ...;
      bitmask = ballot(value);
      break;
    }
    ...
  }

The semantics of high-level languages expect that `bitmask` only contains set bits for threads (lanes) which break from the loop in the same iteration. However, the basic block containing the ballot call in the natural lowering to LLVM IR is not part of the loop at all. The information that it was intended to be run as part of the loop is currently lost forever.

The Design Space of Solutions
=============================
The relevant high-level languages are structured programming languages, where the behavior of subgroups falls out quite naturally. Needless to say, we cannot rely on structured control flow in LLVM IR.

HSAIL defines subgroups as forking at branches and joining at the immediate post-dominator. It also attempts to define restrictions on program transformations in terms of immediate dominators and post-dominators. I am not certain that this definition is sound in all cases, and it is too restrictive in places.
Its main disadvantage is that describing restrictions on transformations purely in terms of dominators and post-dominators causes non-local effects. For example, jump threading can change the dominator/post-dominator tree, but verifying whether the corresponding transform is legal in the face of subgroup operations requires inspecting distant parts of the code that jump threading would not have to look at for a CPU target.

So I reject HSAIL-like approaches based on the fact that they would require invasive changes in generic middle-end passes.

There is a type of approach that most people who come into contact with this problem eventually at least think about, which suggests replacing the implicit dependence on control flow by an explicit one. That is, augment subgroup intrinsics with an additional argument that represents the subgroup of threads which participate in the exchange of data described by this intrinsic, which results in code that looks similar to the cooperative groups in new versions of CUDA.

This kind of approach can be a valid solution to the problem of preserving the correct semantics, although it imposes annoying restrictions on function call ABIs.

The major problem with this kind of approach is that it does not actually restrict the transforms made by middle-end passes, and so the final IR before code generation might end up containing code patterns that cannot be natively expressed on a target that implements SPMD with lock-step execution. The burden on the backend to reliably catch and implement the rare cases when this happens is excessive. Even for targets without lock-step execution, these transforms may have unwanted performance impacts. So I reject this type of proposal as well.
The literature from practitioners on SPMD/SIMT control flow (a lot of it targeting a hardware audience rather than a compiler audience) does not concern itself with this problem to my knowledge, but there is a commonly recurring theme of reconverging or rejoining threads at explicit instructions and/or post-dominators.

This suggests a viable path towards a solution to me.

The SPIR-V spec has a notion of explicitly structured control flow with merge basic blocks. It also defines "dynamic instances" of instructions that are distinguished by control flow path, which provides a decent option for modeling the set of threads which participate in subgroup operations.

The SPIR-V spec itself is IMO quite lacking, in that the formalism is very incomplete, the concrete structured control flow constructs are very complex, and the details of how they are expressed lead to the same non-local effects you get with the HSAIL approach. Nevertheless, I think there's another viable path towards a solution hidden there.

Finally and for completeness, it should be noted that if we were to forbid irreducible control flow in LLVM IR entirely, that would open up more freedom in the design since we could properly define some things in terms of loop structure.

Final Thoughts
==============
Formalizing these semantics could also help put divergence analysis on a more solid foundation.

Mostly I'm interested in general feedback at this point. Is there an important part of the design space that I missed? What do people think about explicitly talking about thread groups and e.g. dynamic instances as in SPIR-V as part of the LLVM LangRef? Are people generally happy with the notion? If not, how can we get to a place where people are happy with it?

Thanks,
Nicolai
--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
Justin Lebar via llvm-dev
2018-Dec-19 19:45 UTC
[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)
Hi from one of the people who works/worked on the NVPTX backend.

> The key issue is that *the set of threads participating in the exchange
> of data is implicitly defined by control flow*.

I agree that this is the root of the problem.

The way nvidia GPUs "solve" this is that in Volta and newer, it's up to the user to track explicitly which set of threads participate in warp-synchronous functions. You cannot safely ballot(). Problem "solved". :)

We already have the notion of "convergent" functions like syncthreads(), to which we cannot add control-flow dependencies. That is, it's legal to hoist syncthreads out of an "if", but it's not legal to sink it into an "if". It's not clear to me why we can't have "anticonvergent" (terrible name) functions which cannot have control-flow dependencies removed from them? ballot() would be both convergent and anticonvergent.

Would that solve your problem?

> However, the basic block containing the ballot call in the natural
> lowering to LLVM IR is not part of the loop at all. The information that
> it was intended to be run as part of the loop is currently lost forever.

Sounds like the natural lowering of this example is not respecting anticonvergence and just needs to be made more unnatural?

I also think it's worthwhile to consider the reasons behind nvidia's move away from functions like ballot() towards explicit tracking of which threads are active. +Olivier Giroux <ogiroux at gmail.com> told me a while ago that he was working on a paper which showed that the old way of doing things is even more fraught/broken/difficult-to-specify than I thought; I'm not sure if anything ended up being published.

On Wed, Dec 19, 2018 at 11:32 AM Nicolai Hähnle via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Hi all,
>
> LLVM needs a solution to the long-standing problem that the IR is unable
> to express certain semantics expected by high-level programming
> languages that target GPUs.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Olivier Giroux via llvm-dev
2018-Dec-20 05:41 UTC
[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)
Hi Justin, Nicolai, LLVM-dev,

The new *_sync() collective operations, and the newer & higher-level Cooperative Groups API, both addressed difficulties that users faced and removed obstacles that prevented architectural innovation with SIMT going forward. It was tough work, and deprecation is a tough sell, but we have a much better programming system because of it now.

While I won't write long-form here, you can refer to the IEEE Micro article on Volta (Volume 38, Issue 2, Mar./Apr. 2018) where I wrote something about the reasons why this change happened:

- Avoiding both deadlock and silent data corruption with user-authored synchronization.
- Simplifying the legality rules for collective operations. (What Justin was alluding to, I think.)
- Moving beyond simple Dom/PDom reconvergence methods, to extract more convergence.

And I have a fourth reason that isn't in IEEE Micro, because it's only interesting to this audience: being free from having to audit every front-end and transformation, by requiring only single-threaded semantics be preserved. This might be less relevant to users, except for the fact that it improved reliability.

Now I can't get into a deep discussion of each reason, and for this I apologize in advance.

Sincerely, and wishing you happy holidays,
Olivier

> On Dec 19, 2018, at 11:45 AM, Justin Lebar <jlebar at google.com> wrote:
>
> Hi from one of the people who works/worked on the NVPTX backend.
via llvm-dev
2018-Dec-20 10:56 UTC
[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)
December 20, 2018 1:15 AM, "Justin Lebar via llvm-dev" <llvm-dev at lists.llvm.org> wrote:

> We already have the notion of "convergent" functions like syncthreads(), to which we cannot add
> control-flow dependencies. That is, it's legal to hoist syncthreads out of an "if", but it's not
> legal to sink it into an "if". It's not clear to me why we can't have "anticonvergent" (terrible
> name) functions which cannot have control-flow dependencies removed from them? ballot() would be
> both convergent and anticonvergent.
>
> Would that solve your problem?

One could fold both these constraints into the single existing "convergent" attribute, and say that the control-flow dependencies of a convergent function call cannot be trifled with.

But there is more: is it okay to sink a syncthreads() call into an "if", if the "if" (heh!) is known to be uniform, i.e., all threads are guaranteed to take the same side of the branch? This may not be important for syncthreads, but it may be useful for calls like ballot() that produce or use values. That would mean that the convergent attribute should really constrain /divergent/ control-flow dependencies. It would be nice to have a single-threaded way to say all of this.

>> However, the basic block containing the ballot call in the natural lowering to LLVM IR is not
>> part of the loop at all. The information that it was intended to be run as part of the loop is
>> currently lost forever.
>
> Sounds like the natural lowering of this example is not respecting anticonvergence and just needs
> to be made more unnatural?

Right. We probably can't change the way control flow is represented in LLVM. But we could have a way to mark the extra blocks at the exit of the loop as being special. Optimizations will need to be aware of this when moving convergent functions. This is similar to using header/merge blocks in SPIR-V to demarcate the "extended" body of a loop.
The tricky part is that the loop condition will now need to be a control-flow dependency for any convergent calls in those extra blocks!

> I also think it's worthwhile to consider the reasons behind nvidia's move away from functions like
> ballot() towards explicit tracking of which threads are active.

Ack for Olivier's email about *_sync and co-operative groups. Having a new programming model is quite useful. But what we are looking for is a generic way to support existing programming models in LLVM while limiting how invasive the changes will need to be.

Sameer.
Nicolai Hähnle via llvm-dev
2018-Dec-20 13:01 UTC
[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)
Hi Justin,

On 19.12.18 20:45, Justin Lebar wrote:
> Hi from one of the people who works/worked on the NVPTX backend.
>
> > The key issue is that *the set of threads participating in the
> > exchange of data is implicitly defined by control flow*.
>
> I agree that this is the root of the problem.
>
> The way nvidia GPUs "solve" this is that in Volta and newer, it's up to
> the user to track explicitly which set of threads participate in
> warp-synchronous functions. You cannot safely ballot(). Problem
> "solved". :)

Yeah. Though there are costs associated with this approach, no matter how you implement it exactly, which is something I'd like to avoid.

The costs come from handling the case where the user tracks the set of threads "incorrectly", i.e. the set of threads conflicts with what would be naturally implied by control flow. "Incorrectly" in quotation marks because one could imagine this happening on purpose. But the more common case is for the programmer to intend their set of threads to align with control flow, and then we shouldn't have to pay the additional cost.

Now, if part of a clean solution was to somehow represent the set of threads explicitly in LLVM IR as values, I wouldn't be fundamentally opposed to that. The important point is that doing so isn't sufficient, because you also want to express the tie-in back to control flow somehow.

> We already have the notion of "convergent" functions like syncthreads(),
> to which we cannot add control-flow dependencies. That is, it's legal
> to hoist syncthreads out of an "if", but it's not legal to sink it into
> an "if". It's not clear to me why we can't have "anticonvergent"
> (terrible name) functions which cannot have control-flow dependencies
> removed from them? ballot() would be both convergent and anticonvergent.
>
> Would that solve your problem?

Right, thanks for reminding me of this line of thought.
I think that already came up in a failed (and in hindsight misguided) review of mine on Phabricator a long time ago that you participated in.

There's a mixture of reasons why I have given up on this kind of approach.

The first is that "control-flow dependencies" aren't a particularly standard concept for unstructured control flow graphs in the first place, so forcing CPU folks to think about them is a bit problematic. Related to this is that all the definition attempts I'm aware of suffer from spooky action at a distance. An example of what I mean:

   flag = true;
   if (cond) {
     ...
     flag = ...;
   }
   if (flag) {
     ballot();
   }

There's a natural jump threading here that you would want to do for a CPU. Is that jump threading legal for a GPU? The jump threading *should* be legal if the ballot wasn't there -- after all, it's legal and desired on CPUs. However, the jump threading pass does not naturally look at the basic block that contains the ballot, so it cannot make that judgment without an unnatural compile-time impact. Hence "spooky action at a distance". And there are definitely implementations for which the jump threading here is illegal (classic PDOM reconvergence, for one).

FWIW, the AMDGPU backend would currently just undo the jump threading entirely if the program is as simple as this, although I'm not comfortable with how reliably that works in more complex control flow (e.g., if there are loop nests involved as well). Since correctness requires reconverging anyway, it's preferable that jump threading is not legal here as long as the ballot is present. But then, if `cond` is uniform, the jump threading can be both legal and desirable -- this goes in the direction of what Sameer wrote in his email.

An `anticonvergent` attribute is simply insufficient to cover all this, but explicitly placing strategic "reconverging points" (in the form of some intrinsic) can solve all these problems.

At a higher level, all of the `convergent` / `anticonvergent` approaches are ultimately cop-outs.
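To make the jump-threading example above concrete, here is a small lockstep simulation (a hypothetical sketch; `ballot` is modeled as a plain Python function returning a mask over the lanes present at a program point). Jump threading duplicates the `if (flag)` block into both arms of `if (cond)`, which partitions the lanes executing the ballot into two separate groups and changes its result:

```python
def ballot(lanes):
    """Bitmask with bit i set for every lane i reaching this point together."""
    mask = 0
    for i in lanes:
        mask |= 1 << i
    return mask

lanes = range(4)
cond = [True, False, True, False]   # divergent branch condition
flag = [True, True, False, True]    # value of `flag` after the first `if`

# Original CFG: all lanes with `flag` set reach the single ballot together.
original = ballot([i for i in lanes if flag[i]])

# After jump threading, the `if (flag)` block is duplicated into both arms
# of `if (cond)`, so the same lanes now ballot in two separate groups:
threaded_taken = ballot([i for i in lanes if cond[i] and flag[i]])
threaded_skipped = ballot([i for i in lanes if not cond[i] and flag[i]])

print(bin(original))          # 0b1011: lanes 0, 1 and 3 together
print(bin(threaded_taken))    # 0b1:    lane 0 alone
print(bin(threaded_skipped))  # 0b1010: lanes 1 and 3
```

Note that if `cond` were uniform, one of the two threaded groups would be empty and the other would equal the original, which is why the transform becomes legal again in the uniform case.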
There are reasons why they developed the way they did, but I think it's time to face the reality of what's happening, which is that there is communication / synchronization between threads. If we just modeled that honestly, a lot of the confusion should disappear.

> > However, the basic block containing the ballot call in the natural
> > lowering to LLVM IR is not part of the loop at all. The information that
> > it was intended to be run as part of the loop is currently lost forever.
>
> Sounds like the natural lowering of this example is not respecting
> anticonvergence and just needs to be made more unnatural?

Right, though two points:

1. The optional structured control flow elements in SPIR-V point a way towards annotating an otherwise natural lowering with intrinsics in a way that models the required semantics.

2. If some other way was used that involved an unnatural lowering, we must be careful to guarantee that control flow transforms are unable to transform it back to the natural lowering.

> I also think it's worthwhile to consider the reasons behind nvidia's
> move away from functions like ballot() towards explicit tracking of
> which threads are active. +Olivier Giroux <ogiroux at gmail.com> told me
> a while ago that he was working on a paper which showed that the old way
> of doing things is even more fraught/broken/difficult-to-specify than I
> thought; I'm not sure if anything ended up being published.

I'm curious about that, too. I understand how the Volta architecture helps to avoid deadlocks with independent locking of threads, and it seems to me that co-operative groups mostly just fall out as a relatively easily implementable feature on top of that.

IMHO they're mostly a move in the wrong direction though, at least from a programmer ergonomics point of view. They make it easier to shoot yourself in the foot for some of the typical use cases of cross-lane operations.
There are difficulties with specifying the semantics of the implicit groups, sure, but I'm rather convinced they can be solved, so I'm curious to read some well-stated arguments to the contrary :)

Cheers,
Nicolai

> On Wed, Dec 19, 2018 at 11:32 AM Nicolai Hähnle via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> [... full quote of the original RFC trimmed ...]
--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
Connor Abbott via llvm-dev
2018-Dec-20 17:03 UTC
[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)
On Wed, Dec 19, 2018, 2:45 PM Justin Lebar via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Hi from one of the people who works/worked on the NVPTX backend.
>
> > The key issue is that *the set of threads participating in the exchange
> > of data is implicitly defined by control flow*.
>
> I agree that this is the root of the problem.
>
> The way nvidia GPUs "solve" this is that in Volta and newer, it's up to
> the user to track explicitly which set of threads participate in
> warp-synchronous functions. You cannot safely ballot(). Problem
> "solved". :)

Well, that certainly works for Nvidia, but from what I understand, you can't efficiently do this without special hardware support to properly synchronize the threads when you get to the ballot or whatever, so the rest of us are unfortunately out of luck :)

> We already have the notion of "convergent" functions like syncthreads(),
> to which we cannot add control-flow dependencies. That is, it's legal to
> hoist syncthreads out of an "if", but it's not legal to sink it into an
> "if". It's not clear to me why we can't have "anticonvergent" (terrible
> name) functions which cannot have control-flow dependencies removed from
> them? ballot() would be both convergent and anticonvergent.
>
> Would that solve your problem?

I think it's important to note that we already have such an attribute, although with the opposite sense - it's impossible to remove control flow dependencies from a call unless you mark it as "speculatable". However, this doesn't prevent

   if (...) {
   } else {
   }
   foo = ballot();

from being turned into

   if (...) {
     foo1 = ballot();
   } else {
     foo2 = ballot();
   }
   foo = phi(foo1, foo2)

and vice versa. We have a "noduplicate" attribute which prevents transforming the first into the second, but not the other way around. Of course we could keep going this way and add a "nocombine" attribute to complement noduplicate. But even then, there are still problematic transforms.
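As a quick check of why this pair of transforms matters, the first-to-second rewrite above can be simulated in a few lines (a toy lockstep model; `ballot` is a plain Python stand-in, not a real intrinsic). The merged ballot after the if/else sees all lanes, while the duplicated ballots each see only their arm's lanes:

```python
def ballot(lanes):
    """Bitmask with bit i set for every lane i reaching this point together."""
    mask = 0
    for i in lanes:
        mask |= 1 << i
    return mask

lanes = range(4)
cond = [True, True, False, True]  # divergent branch condition

# First form: ballot() after the if/else, where control flow has
# re-converged, so every lane participates.
merged = ballot(list(lanes))

# Second form: ballot() duplicated into both arms, so each arm's
# lanes ballot separately.
foo1 = ballot([i for i in lanes if cond[i]])
foo2 = ballot([i for i in lanes if not cond[i]])

print(bin(merged))  # 0b1111: all four lanes
print(bin(foo1))    # 0b1011: lanes 0, 1, 3
print(bin(foo2))    # 0b100:  lane 2
```

So neither direction of the transform is value-preserving under these semantics, which is why neither noduplicate nor a hypothetical nocombine alone closes the gap.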
For example, take this program, which is simplified from a real game that doesn't work with the AMDGPU backend:

   while (cond1 /* uniform */) {
     ballot();
     ...
     if (cond2 /* non-uniform */)
       continue;
     ...
   }

In SPIR-V, when using structured control flow, the semantics of this are pretty clearly defined. In particular, there's a continue block after the body of the loop where control flow re-converges, and the only back edge is from the continue block, so the ballot is in uniform control flow. But LLVM will get rid of the continue block since it's empty, and re-analyze the loop as two nested loops, splitting the loop header in two, producing a CFG which corresponds to this:

   while (cond1 /* uniform */) {
     do {
       ballot();
       ...
     } while (cond2 /* non-uniform */);
     ...
   }

Now, in an implementation where control flow re-converges at the immediate post-dominator, this won't do the right thing anymore. In order to handle it correctly, you'd effectively need to always flatten nested loops, which will probably be really bad for performance if the programmer actually wanted the second thing. It also makes it impossible, when translating a high-level language to LLVM, to get the "natural" behavior which game developers actually expect.

This is exactly the sort of "spooky action at a distance" which makes me think that everything we've done so far is really insufficient, and we need to add an explicit notion of control-flow divergence and reconvergence to the IR. We need a way to say that control flow re-converges at the continue block, so that LLVM won't eliminate it, and we can vectorize it correctly without penalizing cases where it's better for control flow not to re-converge.
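The difference between the two loop structures can be sketched with a toy lockstep model (hypothetical and much simplified; it shows a single outer iteration, and `ballot` is a plain Python function, not how any real backend works). Under the structured reading, lanes re-converge at the continue block, so every lane still in the loop ballots together; under the nested-loop reading with post-dominator reconvergence, the lanes taking `continue` re-run the ballot by themselves:

```python
def ballot(lanes):
    """Bitmask with bit i set for every lane i reaching this point together."""
    mask = 0
    for i in lanes:
        mask |= 1 << i
    return mask

lanes = [0, 1, 2, 3]
cond2 = {0: True, 1: False, 2: False, 3: True}  # divergent `continue`

# Structured (SPIR-V) reading: lanes re-converge at the continue block,
# so in each outer iteration every active lane ballots together.
# One outer iteration shown:
structured_ballots = [ballot(lanes)]

# Nested-loop reading with reconvergence at the immediate post-dominator:
# lanes taking `continue` (cond2 true) re-enter the inner do-while and
# execute the ballot again by themselves before everyone re-converges.
nested_ballots = [ballot(lanes)]
active = [i for i in lanes if cond2[i]]
while active:
    nested_ballots.append(ballot(active))
    active = []  # assume cond2 is false on the second inner trip

print([bin(m) for m in structured_ballots])  # ['0b1111']
print([bin(m) for m in nested_ballots])      # ['0b1111', '0b1001']
```

The extra partial-mask ballot in the nested reading is exactly the behavior the game does not expect, even though the two CFGs are equivalent for a single-threaded target.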
> Sounds like the natural lowering of this example is not respecting
> anticonvergence and just needs to be made more unnatural?
>
> [... remainder of Justin's mail and the full quote of the original RFC trimmed ...]