thr3ads.net - llvm dev - [llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs) [Dec 2018]

Connor Abbott via llvm-dev

2018-Dec-20 17:03 UTC

[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)

On Wed, Dec 19, 2018, 2:45 PM Justin Lebar via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Hi from one of the people who works/worked on the NVPTX backend.
>
> > The key issue is that *the set of threads participating in the exchange
> > of data is implicitly defined by control flow*.
>
> I agree that this is the root of the problem.
>
> The way nvidia GPUs "solve" this is that in Volta and newer, it's up to
> the user to track explicitly which set of threads participate in
> warp-synchronous functions.  You cannot safely ballot().  Problem
> "solved".  :)
>
Well, that certainly works for Nvidia, but from what I understand, you
can't efficiently do this without special hardware support to properly
synchronize the threads when you get to the ballot or whatever, so the rest
of us are unfortunately out of luck :)

> We already have the notion of "convergent" functions like syncthreads(),
> to which we cannot add control-flow dependencies.  That is, it's legal to
> hoist syncthreads out of an "if", but it's not legal to sink it into an
> "if".  It's not clear to me why we can't have "anticonvergent" (terrible
> name) functions which cannot have control-flow dependencies removed from
> them?  ballot() would be both convergent and anticonvergent.
>
> Would that solve your problem?
>
I think it's important to note that we already have such an attribute,
although with the opposite sense - it's impossible to remove control flow
dependencies from a call unless you mark it as "speculatable". However,
this doesn't prevent

if (...) {
} else {
}
foo = ballot();

from being turned into

if (...) {
    foo1 = ballot();
} else {
    foo2 = ballot();
}
foo = phi(foo1, foo2)

and vice versa. We have a "noduplicate" attribute which prevents
transforming the first into the second, but not the other way around. Of
course we could keep going this way and add a "nocombine" attribute to
complement noduplicate. But even then, there are still problematic
transforms. For example, take this program, which is simplified from a real
game that doesn't work with the AMDGPU backend:

while (cond1 /* uniform */) {
    ballot();
    ...
    if (cond2 /* non-uniform */) continue;
    ...
}

In SPIR-V, when using structured control flow, the semantics of this are
pretty clearly defined. In particular, there's a continue block after the
body of the loop where control flow re-converges, and the only back edge is
from the continue block, so the ballot is in uniform control flow. But LLVM
will get rid of the continue block since it's empty, and re-analyze the
loop as two nested loops, splitting the loop header in two, producing a CFG
which corresponds to this:

while (cond1 /* uniform */) {
    do {
        ballot();
        ...
    } while (cond2 /* non-uniform */);
    ...
}

Now, in an implementation where control flow re-converges at the immediate
post-dominator, this won't do the right thing anymore. In order to handle
it correctly, you'd effectively need to always flatten nested loops, which
will probably be really bad for performance if the programmer actually
wanted the second thing. It also makes it impossible when translating a
high-level language to LLVM to get the "natural" behavior which game
developers actually expect. This is exactly the sort of "spooky action at a
distance" which makes me think that everything we've done so far is really
insufficient, and we need to add an explicit notion of control-flow
divergence and reconvergence to the IR. We need a way to say that control
flow re-converges at the continue block, so that LLVM won't eliminate it,
and we can vectorize it correctly without penalizing cases where it's
better for control flow not to re-converge.
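
To make the difference between the two loop shapes concrete, here is a minimal Python sketch of a toy lane model. It is an illustration under stated assumptions, not a description of how any real backend or GPU schedules lanes: `cond1` is modeled as a fixed trip count that is the same for every lane, `cond2` is a hypothetical per-lane, per-trip table of booleans, and the names `structured` and `nested` are made up for this example. Each function records the mask of lanes that take part in every `ballot()`.

```python
def structured(num_lanes, trips, cond2):
    """Reconverge at the continue block: every trip starts with all lanes.

    cond2 is accepted only for symmetry with nested(); lanes that take the
    `continue` skip the tail of the body but rejoin before the back edge,
    so the set of lanes reaching ballot() never changes.
    """
    all_lanes = (1 << num_lanes) - 1
    return [all_lanes for _ in range(trips)]


def nested(num_lanes, trips, cond2):
    """Reconverge only at the exit of the inner do-while loop."""
    masks = []
    trip = [0] * num_lanes  # per-lane count of body executions so far
    while True:
        inner = [lane for lane in range(num_lanes) if trip[lane] < trips]
        if not inner:
            return masks
        while inner:
            masks.append(sum(1 << lane for lane in inner))  # ballot()
            stay = []
            for lane in inner:
                took_continue = cond2[lane][trip[lane]]
                trip[lane] += 1
                if took_continue and trip[lane] < trips:
                    stay.append(lane)  # back edge of the inner loop
            inner = stay
        # Lanes that left the inner loop early waited here for the rest.


# Two lanes, two trips; lane 0 takes the `continue` on its first trip.
cond2 = [[True, False], [False, False]]
structured_masks = structured(2, 2, cond2)
nested_masks = nested(2, 2, cond2)
```

In this toy model the structured form runs `ballot()` twice with both lanes, while the nested form runs it three times and two of those calls see only one lane each. When no lane ever takes the `continue`, both functions return the same masks.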

> > However, the basic block containing the ballot call in the natural
> > lowering to LLVM IR is not part of the loop at all. The information that it
> > was intended to be run as part of the loop is currently lost forever.
>
> Sounds like the natural lowering of this example is not respecting
> anticonvergence and just needs to be made more unnatural?
>
> I also think it's worthwhile to consider the reasons behind nvidia's move
> away from functions like ballot() towards explicit tracking of which
> threads are active.  +Olivier Giroux <ogiroux at gmail.com> told me a while
> ago that he was working on a paper which showed that the old way of doing
> things is even more fraught/broken/difficult-to-specify than I thought; I'm
> not sure if anything ended up being published.
>
> On Wed, Dec 19, 2018 at 11:32 AM Nicolai Hähnle via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi all,
>>
>> LLVM needs a solution to the long-standing problem that the IR is unable
>> to express certain semantics expected by high-level programming
>> languages that target GPUs.
>>
>> Solving this issue is necessary both for upstream use of LLVM as a
>> compiler backend for GPUs and for correctly supporting LLVM IR <->
>> SPIR-V roundtrip translation. It may also be useful for compilers
>> targeting SPMD-on-CPU a la ispc -- after all, some GPU hardware really
>> just looks like a CPU with extra-wide SIMD.
>>
>> After thinking and talking about the problem on and off for more than
>> two years now, I'm convinced that it can only be solved by adding
>> dedicated semantics to LLVM IR, which take the shape of:
>>
>> - a new function (and call) attribute similar to `convergent`,
>> - explicitly adding language to the LangRef that talks about groups of
>> threads and their communication with each other via special functions,
>> - including how these groups of threads are impacted (split and merged)
>> by branches etc., and
>> - new target-independent intrinsic(s) which manipulate subgroups of
>> threads, mostly by marking locations in the program where threads
>> reconverge
>>
>> Details to be determined, of course.
>>
>> In this mail, I would mostly like to make initial contact with the
>> larger community. First, to explain the problem to CPU-centric folks and
>> maybe help get over the initial shock. Second, to reach out to other
>> people thinking about GPUs: Has anybody else given this issue much
>> thought? How can we proceed to get this solved?
>>
>>
>> The Problem
>> ===========
>> Programming languages for GPUs have "cross-lane" or "subgroup"
>> operations which allow fine-grained exchange of data between threads
>> being executed together in a "wave" or "warp".
>>
>> The poster-child is ballot, which takes a boolean argument and returns a
>> bitmask of the value of the argument across the
>> "subgroup"/"wave"/"warp", but more complex operations exist as well e.g.
>> for reducing a value across all active lanes of a wave or for computing
>> a prefix scan.
>>
>> The key issue is that *the set of threads participating in the exchange
>> of data is implicitly defined by control flow*.
>>
>> Two examples to demonstrate the resulting problem and the limitation of
>> the existing LLVM IR semantics. The first one:
>>
>>        bool value = ...;
>>
>>        if (condition) {
>>          bitmask0 = ballot(value);
>>          foo(bitmask0);
>>        } else {
>>          bitmask1 = ballot(value);
>>          bar(bitmask1);
>>        }
>>
>> The semantics of high-level languages demand that `bitmask0` only
>> contains set bits for threads (lanes) for which `condition` is true, and
>> analogously for `bitmask1`. However, there is no reasonable way in LLVM
>> IR to prevent the ballot call from being hoisted above the if-statement,
>> which changes the behavior.
>>
>> (Existing frontends for the AMDGPU target currently implement a gross
>> hack where `value` is passed through a call to a unique chunk of no-op
>> inline assembly that is marked as having unspecified side effects...)
>>
>> The second example:
>>
>>        uint64_t bitmask;
>>        for (;;) {
>>          ...
>>          if (...) {
>>            bool value = ...;
>>            bitmask = ballot(value);
>>            break;
>>          }
>>          ...
>>        }
>>
>> The semantics of high-level languages expect that `bitmask` only
>> contains set bits for threads (lanes) which break from the loop in the
>> same iteration. However, the basic block containing the ballot call in
>> the natural lowering to LLVM IR is not part of the loop at all. The
>> information that it was intended to be run as part of the loop is
>> currently lost forever.
>>
>>
>> The Design Space of Solutions
>> =============================
>> The relevant high-level languages are structured programming languages,
>> where the behavior of subgroups falls out quite naturally. Needless to
>> say, we cannot rely on structured control flow in LLVM IR.
>>
>> HSAIL defines subgroups as forking at branches and joining at the
>> immediate post-dominator. It also attempts to define restrictions on
>> program transformations in terms of immediate dominators and
>> post-dominators. I am not certain that this definition is sound in all
>> cases, and it is too restrictive in places.
>>
>> Its main disadvantage is that describing restrictions on transformations
>> purely in terms of dominators and post-dominators causes non-local
>> effects. For example, jump threading can change the
>> dominator/post-dominator tree, but verifying whether the corresponding
>> transform is legal in the face of subgroup operations requires
>> inspecting distant parts of the code that jump threading would not have
>> to look at for a CPU target.
>>
>> So I reject HSAIL-like approaches based on the fact that they would
>> require invasive changes in generic middle-end passes.
>>
>> There is a type of approach that most people who come into contact with
>> this problem eventually at least think about, which suggests replacing
>> the implicit dependence on control flow by an explicit one. That is,
>> augment subgroup intrinsics with an additional argument that represents
>> the subgroup of threads which participate in the exchange of data
>> described by this intrinsic, which results in code that looks similar to
>> the co-operative groups in new versions of CUDA.
>>
>> This kind of approach can be a valid solution to the problem of
>> preserving the correct semantics, although it imposes annoying
>> restrictions on function call ABIs.
>>
>> The major problem with this kind of approach is that it does not
>> actually restrict the transforms made by middle-end passes, and so the
>> final IR before code generation might end up containing code patterns
>> that cannot be natively expressed on a target that implements SPMD with
>> lock-step execution. The burden on the backend to reliably catch and
>> implement the rare cases when this happens is excessive. Even for
>> targets without lock-step execution, these transforms may have unwanted
>> performance impacts. So I reject this type of proposal as well.
>>
>> The literature from practitioners on SPMD/SIMT control flow (a lot of it
>> targeting a hardware audience rather than a compiler audience) does not
>> concern itself with this problem to my knowledge, but there is a
>> commonly recurring theme of reconverging or rejoining threads at
>> explicit instructions and/or post-dominators.
>>
>> This suggests a viable path towards a solution to me.
>>
>> The SPIR-V spec has a notion of explicitly structured control flow with
>> merge basic blocks. It also defines "dynamic instances" of instructions
>> that are distinguished by control flow path, which provides a decent
>> option for modeling the set of threads which participate in subgroup
>> operations.
>>
>> The SPIR-V spec itself is IMO quite lacking, in that the formalism is
>> very incomplete, the concrete structured control flow constructs are
>> very complex, and the details of how they are expressed lead to the same
>> non-local effects you get with the HSAIL approach. Nevertheless, I think
>> there's another viable path towards a solution hidden there.
>>
>> Finally and for completeness, it should be noted that if we were to
>> forbid irreducible control flow in LLVM IR entirely, that would open up
>> more freedom in the design since we could properly define some things in
>> terms of loop structure.
>>
>>
>> Final Thoughts
>> ==============
>> Formalizing these semantics could also help put divergence analysis on a
>> more solid foundation.
>>
>> Mostly I'm interested in general feedback at this point. Is there an
>> important part of the design space that I missed? What do people think
>> about explicitly talking about thread groups and e.g. dynamic instances
>> as in SPIR-V as part of the LLVM LangRef? Are people generally happy
>> with the notion? If not, how can we get to a place where people are
>> happy with it?
>>
>> Thanks,
>> Nicolai
>> --
>> Lerne, wie die Welt wirklich ist,
>> Aber vergiss niemals, wie sie sein sollte.
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>

Nicolai Hähnle via llvm-dev

2018-Dec-29 16:32 UTC

On 20.12.18 18:03, Connor Abbott wrote:
>     We already have the notion of "convergent" functions like
>     syncthreads(), to which we cannot add control-flow dependencies.
>     That is, it's legal to hoist syncthreads out of an "if", but it's
>     not legal to sink it into an "if".  It's not clear to me why we
>     can't have "anticonvergent" (terrible name) functions which cannot
>     have control-flow dependencies removed from them?  ballot() would be
>     both convergent and anticonvergent.
>
>     Would that solve your problem?
>
>
> I think it's important to note that we already have such an attribute,
> although with the opposite sense - it's impossible to remove control
> flow dependencies from a call unless you mark it as "speculatable".

This isn't actually true. If both sides of an if/else have the same 
non-speculative function call, it can still be moved out of control flow.

That's because doing so doesn't change anything at all from a 
single-threaded perspective. Hence why I think we should model the 
communication between threads honestly.
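
A minimal Python sketch of that point, using a toy model in which a "wave" is just a list of lanes. The helper names (`ballot`, `ballot_in_branches`, `ballot_after_branch`) and the lane model are assumptions made for illustration only. It compares the per-lane result of calling ballot in both arms of an if/else with calling it once after the arms have rejoined.

```python
def ballot(active_lanes, value):
    """Bitmask with bit i set for each active lane i whose value is true."""
    mask = 0
    for lane in active_lanes:
        if value[lane]:
            mask |= 1 << lane
    return mask


def ballot_in_branches(condition, value):
    """Each arm of the if/else calls ballot with only the lanes that took it."""
    lanes = range(len(condition))
    foo1 = ballot([l for l in lanes if condition[l]], value)
    foo2 = ballot([l for l in lanes if not condition[l]], value)
    # Per-lane phi: each lane sees the result from the arm it executed.
    return [foo1 if condition[l] else foo2 for l in lanes]


def ballot_after_branch(condition, value):
    """The call sits after the if/else, where all lanes have rejoined."""
    lanes = list(range(len(condition)))
    foo = ballot(lanes, value)
    return [foo for _ in lanes]


# Four lanes: the two placements disagree.
condition = [True, False, True, False]
value = [True, True, True, True]
in_branches = ballot_in_branches(condition, value)
after_branch = ballot_after_branch(condition, value)

# One lane: the two placements are indistinguishable.
single_in = ballot_in_branches([True], [True])
single_after = ballot_after_branch([True], [True])
```

With one lane the two placements agree, which is why a single-threaded view of the program cannot tell them apart. With four lanes the results differ lane by lane.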

> However, this doesn't prevent
> 
> if (...) {
> } else {
> }
> foo = ballot();
> 
> from being turned into
> 
> if (...) {
>      foo1 = ballot();
> } else {
>      foo2 = ballot();
> }
> foo = phi(foo1, foo2)
> 
> and vice versa. We have a "noduplicate" attribute which prevents
> transforming the first into the second, but not the other way around. Of
> course we could keep going this way and add a "nocombine" attribute to
> complement noduplicate. But even then, there are still problematic
> transforms. For example, take this program, which is simplified from a
> real game that doesn't work with the AMDGPU backend:
> 
> while (cond1 /* uniform */) {
>      ballot();
>      ...
>      if (cond2 /* non-uniform */) continue;
>      ...
> }
> 
> In SPIR-V, when using structured control flow, the semantics of this are 
> pretty clearly defined. In particular, there's a continue block after 
> the body of the loop where control flow re-converges, and the only back 
> edge is from the continue block, so the ballot is in uniform control 
> flow. But LLVM will get rid of the continue block since it's empty, and
> re-analyze the loop as two nested loops, splitting the loop header in 
> two, producing a CFG which corresponds to this:
> 
> while (cond1 /* uniform */) {
>      do {
>          ballot();
>          ...
>      } while (cond2 /* non-uniform */);
>      ...
> }
> 
> Now, in an implementation where control flow re-converges at the 
> immediate post-dominator, this won't do the right thing anymore. In 
> order to handle it correctly, you'd effectively need to always flatten 
> nested loops, which will probably be really bad for performance if the 
> programmer actually wanted the second thing. It also makes it impossible 
> when translating a high-level language to LLVM to get the "natural"
> behavior which game developers actually expect. This is exactly the sort
> of "spooky action at a distance" which makes me think that everything
> we've done so far is really insufficient, and we need to add an explicit
> notion of control-flow divergence and reconvergence to the IR. We need a
> way to say that control flow re-converges at the continue block, so that
> LLVM won't eliminate it, and we can vectorize it correctly without
> penalizing cases where it's better for control flow not to re-converge.

Well said!

Cheers,
Nicolai
-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.

Jan Sjodin via llvm-dev

2019-Jan-24 15:06 UTC

I was looking into ballot() and whether it is possible to keep a
single-threaded view of the code, but add some extra conditions that must hold
after the transformation. I had the initial idea that each call to ballot() in a
single-threaded program can be seen as a partial write to a memory
location, and each memory location is unique for every call site, plus
there is some externally observable side effect. We can abstract this away by
tagging the calls, e.g. by using aliases.

For example:

if (...) {
     foo1 = ballot();
} else {
     foo2 = ballot();
}
simply becomes:

if (...) {
     foo1 = ballot_1();
} else {
     foo2 = ballot_2();
}

and

if (...) {
} else {
}
ballot();

becomes

if (...) {
} else {
}
ballot_1();

In the first case it would prevent combining the two calls into one after the if.
In the second example there is generally nothing that says it could not be
transformed into the first example with two calls to ballot_1(), which should
not be allowed.

Another form of duplication that we must allow is loop transforms, like
unrolling or peeling. These might seem similar to the example above, since we
are cloning code along with conditions etc. But they are different, since the
calls are in different loop iterations.

The condition that needs to be met is that:

There must be a single path between all cloned ballot_n() functions.

The reason for this condition is that if we clone the same call, then the copies
must be mutually exclusive, but if they are cloned from a loop, there must be a
path, or we would skip iterations.
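
One possible reading of this condition can be sketched in Python as a reachability check over a small acyclic CFG. Everything here is an assumption made for illustration: the CFG is a plain dict from block name to successor names, `tagged_blocks` maps a tag such as `ballot_1` to the blocks holding its copies, and "a single path" is interpreted as "for every pair of copies, one is reachable from the other". Back edges are not handled.

```python
def reachable(cfg, src, dst):
    """Depth-first search: is dst reachable from src along CFG edges?"""
    stack, seen = [src], set()
    while stack:
        block = stack.pop()
        if block == dst:
            return True
        if block in seen:
            continue
        seen.add(block)
        stack.extend(cfg.get(block, []))
    return False


def clones_on_one_path(cfg, tagged_blocks):
    """True if every pair of blocks holding the same tag is connected."""
    for blocks in tagged_blocks.values():
        for i, a in enumerate(blocks):
            for b in blocks[i + 1:]:
                if not (reachable(cfg, a, b) or reachable(cfg, b, a)):
                    return False
    return True


# Duplicating ballot_1 into both arms of an if/else: no path between copies.
if_else = {"entry": ["then", "else"], "then": ["join"], "else": ["join"]}
duplicated_ok = clones_on_one_path(if_else, {"ballot_1": ["then", "else"]})

# Peeling one iteration: the peeled copy flows into the remaining loop body.
peeled = {"entry": ["peeled"], "peeled": ["body"], "body": ["exit"]}
peeled_ok = clones_on_one_path(peeled, {"ballot_1": ["peeled", "body"]})
```

Under this reading the if/else duplication is rejected and the peeled loop is accepted.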

If we want to be more sophisticated we can add:

If there is no such path, the calls must be separated by uniform branches.

After the transform everything should be re-tagged, since we already checked the
calls and we don't want to check them again. Also, not all transforms need
(or should) have the tagging checked. One example is inlining, where multiple
copies are created, but they are clearly different calls. The tagging can be
done temporarily for a single pass, and then eliminated. This could be a good
tool for debugging as well, since it can detect if a transform is suspect.

The code would of course have to make sense as far as control flow. If we have:

for(;;) {
   if(...) {
      ballot();
      break;
   }
}

This would have to be translated to:

for(;;) {
   if(...) {
      ballot();
      if(UnknownConstant) {  // Not a uniform condition, but later on translated to "true"
         break;
      }
   }
}

However, I think that this is the way the code is generated today anyway. There
would have to be some attribute that indicates that these calls (or
functions) contain ballot or other cross-lane operations so they could be tagged
and checked. The attribute would be used by the passes to know that the
special conditions exist for those calls.

As far as what it means to have a path, it could be complicated. For example:

x = ...
ballot_1();

could be transformed to:

if (x < 4711) {
  ballot_1();
}
if (x >= 4711) {
  ballot_1();
}

So a simple path check would say there is a path, and the transform is legal, but
if we examine the conditions, there is no path, and the transform should not be
legal. It could be made even more obscure of course, but I don't see any
optimizations really doing this kind of thing.
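
The gap between a structural path and a feasible one can be shown with a small Python sketch. It simply runs the transformed code for a sample range of `x` and records which copy executes; the function name `copies_executed` and the sampled range are assumptions for illustration.

```python
def copies_executed(x):
    """Return which cloned ballot_1() call sites run for a given x."""
    executed = []
    if x < 4711:
        executed.append("first copy")
    if x >= 4711:
        executed.append("second copy")
    return executed


# The CFG has a sequence of edges from the first copy to the second, but no
# value of x in the sampled range drives a thread through both.
both_reached = any(len(copies_executed(x)) == 2 for x in range(10000))
```

A check that looks only at CFG edges would accept this transform, while one that also reasons about the branch conditions would not.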

- Jan
