Connor Abbott via llvm-dev
2019-Feb-09 17:06 UTC
[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)
On Sat, Feb 9, 2019 at 4:44 PM Jan Sjodin <jan_sjodin at yahoo.com> wrote:> > The reason I'm looking for solutions that can work without "scanning the > > code" or "spooky action at a distance" is that we should have a solution > > that's easily digestible by folks who are not aware of GPU execution > models. > > > > The fallback model for such folks should be: if your pass doesn't touch > > specially-marked calls (whether that marker is "convergent" or something > > else), then you don't have to worry. Which, as a corollary, means that > > transforms can be conservatively correct by not touching such calls. > > > > I don't see how to achieve this goal while forcing passes to do specific > > code scans. > > I'm wondering if using instructions/intrinsics is the right thing, or > is it a property of a basic block that control flow converges? Can the > intrinsics exist anywhere in a basic block? Is there a difference > between instructions in the same BB before and after a convergence > instrinsic? >The way I was thinking, we'd allow the merge instruction anywhere in a basic block. Otherwise, we'd have to allow for cases where there's a trivial edge from a basic block with one sucessor to a block with one predecessor that can't be eliminated. For example, in the single-break case mentioned earlier: while (true) { if (...) { ballot(); // should be executed before re-converging. break; } } ballot(); // should be executed after re-converging With a merge instruction, the two ballots could be in the same basic block, with a merge instruction separating them, versus two basic blocks with the second marked as a merge block. Of course, this does have the disadvantage that the backend would have to duplicate any convergent operations before the merge instruction/intrinsic if the block has multiple predecessors. I think someone will have to actually implement it and see which is better.> > If the IR will be multi-threaded by default (otherwise a flag for a > module would be needed), then I'm thinking that we would need a more > fundamental change to the IR to represent this, especially since > convergence can be specified without the presence of cross-lane > functions. >I think that whatever we do, we definitely shouldn't allow any convergent calls in a function unless the function definition itself is marked convergent. That is, no one has to worry about cross-thread operations in a function, and passes can even delete merge instructions (although not recommended for performance!) unless it's marked convergent. I believe this is already the case now. And since convergence has to be specified somehow as soon as there are any convergent calls, I don't think it's that much more radical to be able to specify convergence without them. We need a fundamental change to the IR to be able to correctly handle ballot() and friends anyways, and once you accept that, then some kind of ability to explicitly specify merge points isn't really all that much more of a change, and makes it a lot easier to adapt optimization passes.> > >>> It's certainly interesting to think about how to maintain correctness > >>> in the face of ballots() with such a pass, but a) it's certainly no > >>> harder with merge intrinsics than merges being implicit and b) I doubt > >>> that's useful for anything you'd want to do with a GPU. > >> > >> Irreducible control flow has to be handled somehow, and linearization > >> is the only transform I know of, that will handle everything. I'm not > >> sure what the execution model says about irreducible control flow. > > > > For what it's worth, the reconverging CFG approach can also handle > > arbitrary irreducible control flow. > > > > Whether (or how) it can do this while being compatible with whatever > > semantics we come up with for cross-lane operations remains to be seen. > > I am wondering about execution order and if that is part of the > specification? > For example: > > if (A) { > L1: > ballot(); > if (B) > goto L2; > else > goto END: > } else { > L2: > ballot(); > if (C) > goto L1; > else > goto END; > } > END: > > Should the L1 ballot be executed first, or the L2, or does it not > matter? I'm not sure what the ballots should return. Would swapping > the branches from A (use !A instead of A) be a legal transform? >The proposal doesn't, and I don't think we should, specify which order the diverging threads should execute in. That's going to be highly architecture-specific, and might even be determined at runtime. That is, threads that have diverged should be treated just like completely separate threads. (There's a subtle difference wrt atomics and forward progress guarantees on some architectures, since execution of divergent threads is usually serialized, though... but close enough). So yes, swapping the branch condition should be allowed. For your example, I think that under Nicolai's proposal only a merge instruction after END would do anything (since none of the other basic blocks post-dominate anything). Then control flow would only be guaranteed to re-converge at the very end, and each unique path through the loop could eventually split into its own thread group. We might want to allow control to converge earlier, to make it easier to map this to structured control flow. Of course, if you're writing GPU-oriented code with divergent *and* irreducible control flow like that, then... well, there's not much hope anyways :)> > - Jan > > > > On Friday, February 1, 2019, 4:57:54 PM EST, Nicolai Hähnle < > nhaehnle at gmail.com> wrote: > > > On 31.01.19 15:59, Jan Sjodin wrote: > >> > Any transform that re-arranges control flow would potentially have to > >> > know about the properties of ballot(), and the rules with respect to > >> > the CFG (and maybe consider the target) to know where to insert the > >> > intrinsics. > > > >> But the same is true for basically any approach to handling this. In > >> fact, adding the merge intrinsics makes this much easier. Beyond the > >> usual problems with hoisting ballots, which passes currently avoid via > >> the current convergent + speculatable attributes, we'd only have to > >> add awareness to passes that they can't duplicate/combine merge > >> intrinsics or move them past convergent intrinsics, which is a local > >> property and hence easily checkable. In example I explained, without > >> some kind of merge intrinsic, tail duplication has to look at the > >> entire loop to know whether the transform is legal. Of course, you > >> could have some kind of "no convergent calls" flag on a function, but > >> that doesn't eliminate the nastyness when there are convergent calls. > > > > We will have to determine if the intrinsics are worth more compared to > > scanning the code. > > The reason I'm looking for solutions that can work without "scanning the > code" or "spooky action at a distance" is that we should have a solution > that's easily digestible by folks who are not aware of GPU execution > models. > > The fallback model for such folks should be: if your pass doesn't touch > specially-marked calls (whether that marker is "convergent" or something > else), then you don't have to worry. Which, as a corollary, means that > transforms can be conservatively correct by not touching such calls. > > I don't see how to achieve this goal while forcing passes to do specific > code scans. > > > >> > I had the impression that the control flow convergence was > >> > in part specified by what the target architecture can handle. > > > >> This is true, although hopefully we can agree on something that > >> everyone can implement. > > > > Yes, hopefully there is an execution model that is works for everyone. > > > >> > One of > >> > the more complicated cases would be linearization where the control > >> > flow is completely rewritten, and is encoded in a variable that says > >> > which basic block is the next one to execute. > > > >> It's certainly interesting to think about how to maintain correctness > >> in the face of ballots() with such a pass, but a) it's certainly no > >> harder with merge intrinsics than merges being implicit and b) I doubt > >> that's useful for anything you'd want to do with a GPU. > > > > Irreducible control flow has to be handled somehow, and linearization > > is the only transform I know of, that will handle everything. I'm not > > sure what the execution model says about irreducible control flow. > > For what it's worth, the reconverging CFG approach can also handle > arbitrary irreducible control flow. > > Whether (or how) it can do this while being compatible with whatever > semantics we come up with for cross-lane operations remains to be seen. > > > >> > Another case is DCE, > >> > where a ballot() could be eliminated, and it would potentially have > to > >> > remove a number of intrinsics to enable later optimizations (unless > it > >> > would affect performance?), so the intrinsics will require some > >> > non-local updates. > > > >> Removing merge intrinsics if it's profitable and allowed is a > >> separate optimization, that can be done relatively quickly in a single > >> pass without impacting the performance of other optimizations. It's > >> requiring expensive non-local checks in optimizations which modify > >> control flow that's a problem. > > > > There are certainly a lot of tradeoffs. My point was simply that the > > intrinsics are not strictly local. > > Right. I have some ideas for how to fix this, but wasn't able to work on > them for the last weeks. > > Cheers, > Nicolai > > > >> > So we might have them without a ballot(), which would seem to make it > >> > more difficult for structurizers or other transforms to maintain the > >> > semantics and insert intrinsics. > > > >> It's not any more difficult code-wise, since the case where there is > >> a ballot needs to be handled anyways. And while it might take longer > >> to process stuff when this hypothetical pass encounters a merge > >> intrinsic (I can't think of a real-world case where it would matter), > >> it should result in better code generated. > > > > I was thinking of linearization or something similar, where an > > instrinsic by itself might be hard to preserve, compared to looking at > > a ballot() and derive where intrinsics should be inserted. > > > >> > > > How are uniform branches handled? Do they affect the convergence > model? > >> > > > > >> > > We may be able to remove convergence points if branches are uniform. > >> > > In Nicolai's proposal, I believe we'd want to remove a merge > intrinsic > >> > > when all the branches that it post-dominates that aren't also > >> > > post-dominated by some other merge intrinsic are uniform. > >> > > >> > I couldn't quite understand the last sentence, but I assume the > >> > conditions would prevent removing convergence points that help > >> > performance. Post-domination might not be adequate if there are loops > >> > btw. > >> > >> It should be -- Nicolai's proposal is that a merge intrinsic merges > >> all the divergences caused by branches post dominated by the > >> intrinsic, so if all the divergent branches are merged by some other > >> intrinsic earlier, then there's no divergence and the merge intrinsic > >> is a no-op. > > > > Makes sense. > > > > - Jan > > On Wednesday, January 30, 2019, 11:41:29 AM EST, Connor Abbott > > <cwabbott0 at gmail.com> wrote: > > > > > > On Wed, Jan 30, 2019 at 4:20 PM Jan Sjodin <jan_sjodin at yahoo.com > > <mailto:jan_sjodin at yahoo.com>> wrote: > > > > > > > > > > > > for (int i = 0; i < 2; i++) { > > > > > foo = ballot(true); // ballot 1 > > > > > > > > > > if (threadID /* ID of the thread within a wavefront/warp */ > % 2 == 0) continue; > > > > > > > > > > bar = ballot(true); // ballot 2 > > > > > } > > > > > > > > > > versus: > > > > > > > > > > int i = 0; > > > > > while (true) { > > > > > do { > > > > > if (i == 2) break; > > > > > foo = ballot(true); // ballot 1 > > > > > i++; > > > > > } while (threadID % 2 == 0); > > > > > > > > > > if (i == 2) break; > > > > > bar = ballot(true); // ballot 2 > > > > > i++; > > > > > } > > > > > > > > I think you can remove the second "i++", otherwise we can > increment "i" twice > > > > if threadID % 2 != 0, but I see the issue. Using the equivalence > classes would > > > > prevent this transform, since we have re-arranged the control > flow in that way, > > > > > > What do you mean? Note that the transforms that lead from the first > > > example to the second are actually desirable if there aren't any > > > ballots or other convergent operations, so banning them entirely is > a > > > no-go. Otherwise, note that the ballot here can be nested > arbitrarily > > > deep, which means that jump forwarding/tail duplication has to be > > > aware of the entire loop unless we have some kind of merge > intrinsic. > > > > Yes, if there weren't any calls to a ballot, then the transform would > > be legal. What I was saying in the beginning was that ballot() would > > have some special rules attached to it. It is of course undesirable to > > have flags to enforce correctness, but that is the point we are at > > right now. > > > > > Also, I should say that while interesting, control equivalence > classes > > > can be part of the *implementation*, but not the *specification*. We > > > need > > > to define the semantics of the IR -- that is, how a program > > > compiled from any given IR is allowed to do when executed (and *not* > > > what it looks like when compiled) -- *first*, and then which > > > transforms are allowed/not allowed will fall out of that. We can't > > > start by listing disallowed transforms, because then when someone > > > comes along and writes a new optimization, it might be technically > > > "allowed" even though it breaks something in practice. > > > > I think you misunderstand me if you think I was listing disallowed > > transforms. My question was if it is possible to have multi-threaded > > semantics in the source language, but in the compiler have a > > single-threaded view, where some properties of the CFG would determine > > what is legal and not for some functions with a special flag. > > > > > > But that's practically the same as listing allowed and disallowed > > transforms -- you're defining what the final IR is allowed to look like, > > not how it's allowed to execute. If at all possible, the former should > > be derived from the latter. > > > > I agree > > it is more desirable to have the the semantics specified in the > > IR. However, I am exploring this from a practical point of view, since > > ballot() is very rare compared to all code that is being compiled for > > all targets. These intrinsics would always have to be considered when > > writing a pass. They seem to me harder to think about, and > > test, for someone who is working on a single-threaded target, compared > > to looking at a flag. If we allow ourselves to look at which > > transforms that might violate these properties, we could would free up > > the rest of the compiler (and developers) to not have to worry about > > these things. Intrinsics would have to be maintained throughout the > > entire compilation process in every pass. > > > > > The only way to > > > conclusively prove that transforms will never break code that's > > > supposed to work (except for bugs in the transforms) is to define > the > > > semantics of the IR, and then to make sure that it refines the > > > semantics of the source language. This is why the LangRef almost > never > > > talks about allowed/disallowed transforms except as examples to > > > explain some given semantics, and if you don't follow that rule, > your > > > patch will probably be rejected. > > > > I'm trying to figure out if we are in the "almost never" territory > > here or not. > > > > > > I don't think so at all. In addition to being much, much harder to > > reason about and prove that the approach is sound, I think it'll be more > > intrusive. > > > > > > > Now, to define a semantics for ballot(), we need to define what it's > > > allowed to return for any given execution of any given program, and > to > > > do that, we need to define which threads must be active together > when > > > it's reached, which in turns means we need to define when control > flow > > > can re-converge somehow. Any proposal for how to handle ballot() > must > > > start here. > > > > > > I'm not sure if using these rules will be easier or harder than > dealing with > > > > intrinsics. One problem is that the rules for single-threaded > code might seem > > > > arbitrary, and it would be hard to reason about them in a larger > context. > > > > > > What happens to new control flow created by transforms, and what > will guide > > > > the insertion of intrinsics in the new code? Code structurization > is one example > > > > were this could happen. > > > > > Hopefully, it's clear from the above how this all falls out. The > > > frontend for e.g. GLSL or SPIR-V would have to insert the merge > > > intrinsics to preserve the semantics of the source language. Any > > > transforms must refine the semantics of the IR, although I can't > think > > > of a scenario where that would involve emitting any new merge > > > intrinsics. Usually, structurized IR's have their own semantics > about > > > when control flow converges, so a code structurizer should respect > the > > > original semantics. AMDGPU has its own code structurizer that runs > > > very late in the pipeline (although there are plans to remove it), > so > > > we might have to change that to make it respect the merge intrinsic > > > intrinsics, and then we'll correctly implement them "for free". > > > > Any transform that re-arranges control flow would potentially have to > > know about the properties of ballot(), and the rules with respect to > > the CFG (and maybe consider the target) to know where to insert the > > intrinsics. > > > > > > But the same is true for basically any approach to handling this. In > > fact, adding the merge intrinsics makes this much easier. Beyond the > > usual problems with hoisting ballots, which passes currently avoid via > > the current convergent + speculatable attributes, we'd only have to add > > awareness to passes that they can't duplicate/combine merge intrinsics > > or move them past convergent intrinsics, which is a local property and > > hence easily checkable. In example I explained, without some kind of > > merge intrinsic, tail duplication has to look at the entire loop to know > > whether the transform is legal. Of course, you could have some kind of > > "no convergent calls" flag on a function, but that doesn't eliminate the > > nastyness when there are convergent calls. > > > > I had the impression that the control flow convergence was > > in part specified by what the target architecture can handle. > > > > > > This is true, although hopefully we can agree on something that everyone > > can implement. > > > > One of > > the more complicated cases would be linearization where the control > > flow is completely rewritten, and is encoded in a variable that says > > which basic block is the next one to execute. > > > > > > It's certainly interesting to think about how to maintain correctness in > > the face of ballots() with such a pass, but a) it's certainly no harder > > with merge intrinsics than merges being implicit and b) I doubt that's > > useful for anything you'd want to do with a GPU. > > > > Another case is DCE, > > where a ballot() could be eliminated, and it would potentially have to > > remove a number of intrinsics to enable later optimizations (unless it > > would affect performance?), so the intrinsics will require some > > non-local updates. > > > > > > Removing merge intrinsics if it's profitable and allowed is a separate > > optimization, that can be done relatively quickly in a single pass > > without impacting the performance of other optimizations. It's requiring > > expensive non-local checks in optimizations which modify control flow > > that's a problem. > > > > > > > > > Would they only be present if ballot and similar functions are > used, or do they > > > > > have to be present everywhere? > > > > > > > > They'd only have to be present when ballot or other convergent > > > > functions are called, since otherwise it doesn't matter when > control > > > flow re-converges. However, we may want to keep them around for > > > performance reasons (usually destroying convergence points is bad > for > > > performance). > > > > So we might have them without a ballot(), which would seem to make it > > more difficult for structurizers or other transforms to maintain the > > semantics and insert intrinsics. > > > > > > It's not any more difficult code-wise, since the case where there is a > > ballot needs to be handled anyways. And while it might take longer to > > process stuff when this hypothetical pass encounters a merge intrinsic > > (I can't think of a real-world case where it would matter), it should > > result in better code generated. > > > > > > > > How are uniform branches handled? Do they affect the convergence > model? > > > > > > > We may be able to remove convergence points if branches are uniform. > > > In Nicolai's proposal, I believe we'd want to remove a merge > intrinsic > > > when all the branches that it post-dominates that aren't also > > > post-dominated by some other merge intrinsic are uniform. > > > > I couldn't quite understand the last sentence, but I assume the > > conditions would prevent removing convergence points that help > > performance. Post-domination might not be adequate if there are loops > > btw. > > > > > > It should be -- Nicolai's proposal is that a merge intrinsic merges all > > the divergences caused by branches post dominated by the intrinsic, so > > if all the divergent branches are merged by some other intrinsic > > earlier, then there's no divergence and the merge intrinsic is a no-op. > > > > > > - Jan > > > > > > > > On Wednesday, January 30, 2019, 6:29:52 AM EST, Connor Abbott > > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>> wrote: > > > > > > On Mon, Jan 28, 2019 at 9:09 PM Jan Sjodin <jan_sjodin at yahoo.com > > <mailto:jan_sjodin at yahoo.com>> wrote: > > > > > > > for (int i = 0; i < 2; i++) { > > > > foo = ballot(true); // ballot 1 > > > > > > > > if (threadID /* ID of the thread within a wavefront/warp */ > > % 2 == 0) continue; > > > > > > > > bar = ballot(true); // ballot 2 > > > > } > > > > > > > > versus: > > > > > > > > int i = 0; > > > > while (true) { > > > > do { > > > > if (i == 2) break; > > > > foo = ballot(true); // ballot 1 > > > > i++; > > > > } while (threadID % 2 == 0); > > > > > > > > if (i == 2) break; > > > > bar = ballot(true); // ballot 2 > > > > i++; > > > > } > > > > > > I think you can remove the second "i++", otherwise we can > > increment "i" twice > > > if threadID % 2 != 0, but I see the issue. Using the equivalence > > classes would > > > prevent this transform, since we have re-arranged the control > > flow in that way, > > > > What do you mean? Note that the transforms that lead from the first > > example to the second are actually desirable if there aren't any > > ballots or other convergent operations, so banning them entirely is a > > no-go. Otherwise, note that the ballot here can be nested arbitrarily > > deep, which means that jump forwarding/tail duplication has to be > > aware of the entire loop unless we have some kind of merge intrinsic. > > > > Also, I should say that while interesting, control equivalence classes > > can be part of the *implementation*, but not the *specification*. We > > need to define the semantics of the IR -- that is, how a program > > compiled from any given IR is allowed to do when executed (and *not* > > what it looks like when compiled) -- *first*, and then which > > transforms are allowed/not allowed will fall out of that. We can't > > start by listing disallowed transforms, because then when someone > > comes along and writes a new optimization, it might be technically > > "allowed" even though it breaks something in practice. The only way to > > conclusively prove that transforms will never break code that's > > supposed to work (except for bugs in the transforms) is to define the > > semantics of the IR, and then to make sure that it refines the > > semantics of the source language. This is why the LangRef almost never > > talks about allowed/disallowed transforms except as examples to > > explain some given semantics, and if you don't follow that rule, your > > patch will probably be rejected. > > > > Now, to define a semantics for ballot(), we need to define what it's > > allowed to return for any given execution of any given program, and to > > do that, we need to define which threads must be active together when > > it's reached, which in turns means we need to define when control flow > > can re-converge somehow. Any proposal for how to handle ballot() must > > start here. > > > > > I'm not sure if using these rules will be easier or harder than > > dealing with > > > intrinsics. One problem is that the rules for single-threaded > > code might seem > > > arbitrary, and it would be hard to reason about them in a larger > > context. > > > > > > > Nicolai's proposal solves this by having the frontend emit a > > merge intrinsic > > > > before the i++ is emitted. This prevents the jump forwarding > > from occurring. > > > > > > I was thinking about getting through the single-thread view and > > the issues with > > > that first, then I will think more about the multi-thread and > > explicit convergence. > > > > > > If we are done with the single-thread stuff for now these are the > > question that I > > > have been thinking about with the multi-threaded view: > > > > > > What happens to new control flow created by transforms, and what > > will guide > > > the insertion of intrinsics in the new code? Code structurization > > is one example > > > were this could happen. > > > > Hopefully, it's clear from the above how this all falls out. The > > frontend for e.g. GLSL or SPIR-V would have to insert the merge > > intrinsics to preserve the semantics of the source language. Any > > transforms must refine the semantics of the IR, although I can't think > > of a scenario where that would involve emitting any new merge > > intrinsics. Usually, structurized IR's have their own semantics about > > when control flow converges, so a code structurizer should respect the > > original semantics. AMDGPU has its own code structurizer that runs > > very late in the pipeline (although there are plans to remove it), so > > we might have to change that to make it respect the merge intrinsic > > intrinsics, and then we'll correctly implement them "for free". > > > > > > > > Would they only be present if ballot and similar functions are > > used, or do they > > > have to be present everywhere? > > > > They'd only have to be present when ballot or other convergent > > functions are called, since otherwise it doesn't matter when control > > flow re-converges. However, we may want to keep them around for > > performance reasons (usually destroying convergence points is bad for > > performance). > > > > > > > > How are uniform branches handled? Do they affect the convergence > > model? > > > > We may be able to remove convergence points if branches are uniform. > > In Nicolai's proposal, I believe we'd want to remove a merge intrinsic > > when all the branches that it post-dominates that aren't also > > post-dominated by some other merge intrinsic are uniform. > > > > > > > > > > > > > > > - Jan > > > > > > > > > On Monday, January 28, 2019, 11:16:36 AM EST, Connor Abbott > > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>> wrote: > > > > > > > > > > > > On Fri, Jan 25, 2019 at 3:05 AM Jan Sjodin <jan_sjodin at yahoo.com > > <mailto:jan_sjodin at yahoo.com>> wrote: > > > > > > > > > for (...) { > > > > > ballot(); > > > > > if (... /* non-uniform */) continue; > > > > > } > > > > > > > > > > into > > > > > > > > > > for (...) { > > > > > do { > > > > > ballot(); > > > > > } while (... /* non-uniform */); > > > > > } > > > > > > > > I'm not sure if I follow this example, could you and explain a > > bit more? > > > > It looks to me that the condition in the "if" must be false (if > the > > > > same condition is used in the while), or we would > > > > call ballot the wrong number of times. > > > > > > Yes, the idea is that the same condition is used in the if and > > the do-while. I think I messed up the example a little... in the > > second snippet, we're supposed to break out of the inner loop if the > > outer loop's exit condition is true. Here's a more concrete example: > > > > > > for (int i = 0; i < 2; i++) { > > > foo = ballot(true); // ballot 1 > > > > > > if (threadID /* ID of the thread within a wavefront/warp */ % > > 2 == 0) continue; > > > > > > bar = ballot(true); // ballot 2 > > > } > > > > > > versus: > > > > > > int i = 0; > > > while (true) { > > > do { > > > if (i == 2) break; > > > foo = ballot(true); // ballot 1 > > > i++; > > > } while (threadID % 2 == 0); > > > > > > if (i == 2) break; > > > bar = ballot(true); // ballot 2 > > > i++; > > > } > > > > > > From a single-threaded perspective, these two snippets are > > identical, even if ballot() writes to arbitrary memory. The first > > can easily get transformed to something like the second when LLVM > > decides to duplicate the final i++ through jump forwarding, and then > > re-interprets the loop as two nested loops and splits the loop > > header in two. This is what currently happens with DOOM when we try > > to enable subgroup operations with it. Let's say there are two > > threads in a wavefront. Then the execution trace mandated by SPIR-V > > for the first looks like: > > > > > > thread 0 | thread 1 > > > ballot 1 = 0b11 | ballot 1 = 0b11 > > > skipped | ballot 2 = 0b10 > > > ballot 1 = 0b11 | ballot 1 = 0b11 > > > skipped | ballot 2 = 0b10 > > > > > > Now, contrast this with the execution trace that programmers > > would expect for the second example: > > > > > > thread 0 | thread 1 > > > ballot 1 = 0b11 | ballot 1 = 0b11 > > > ballot 1 = 0b01 | skipped > > > skipped | ballot 2 = 0b10 > > > skipped | ballot 1 = 0b10 > > > skipped | ballot 2 = 0b10 > > > > > > Nicolai's proposal solves this by having the frontend emit a > > merge intrinsic before the i++ is emitted. This prevents the jump > > forwarding from occurring. > > > > > > > > > > > > > > About the CSE, when would that be legal? I can imagine with > uniform > > > > branches that it could work, but would like to see an example to > > > > fully understand this. > > > > > > > > I agree that it would be more conservative than if we model the > > threading, > > > > but I'm not sure about the cost/benefit. I am mostly curious if > > it is > > > > possible to have a single-thread view or not. Then we would > have to > > > > see if it is adequate. > > > > > > > > - Jan > > > > > > > > On Thursday, January 24, 2019, 10:31:47 AM EST, Connor Abbott > > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>> wrote: > > > > > > > > > > > > I don't see how this would fix the continue vs. nested loop > > problem I > > > > explained earlier. That is, how would this prevent turning: > > > > > > > > for (...) { > > > > ballot(); > > > > if (... /* non-uniform */) continue; > > > > } > > > > > > > > into > > > > > > > > for (...) { > > > > do { > > > > ballot(); > > > > } while (... /* non-uniform */); > > > > } > > > > > > > > and vice versa? Note that there's no duplication going on here, > and > > > > the single-threaded flow of control is exactly the same. > > > > > > > > Another reason this isn't so great is that it prevents e.g. CSE > on > > > > ballots that actually should be combined, since you're > > modelling it as > > > > a write. It seems like the optimizer is going to need some > special > > > > knowledge of convergent things that fake memory constraints > > can't give > > > > us. > > > > > > > > On Thu, Jan 24, 2019 at 4:06 PM Jan Sjodin > > <jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>> wrote: > > > > > > > > > > > > > > > I was looking into ballot() and how if it is possible to keep > > a single-threaded > > > > > view of the code, but add some extra conditions that must > > hold after the > > > > > transformation. I had the initial idea that each call to > > ballot() in a > > > > > single-threaded program can be seen as a partial write to a > > memory > > > > > location, and each location memory location is unique for > > every call site, > > > > > plus there some externally observable side effect. We can > > abstract this > > > > > away by tagging the calls, e.g. by using aliases. > > > > > > > > > > For example: > > > > > > > > > > if (...) { > > > > > foo1 = ballot(); > > > > > } else { > > > > > foo2 = ballot(); > > > > > } > > > > > > > > > > simply becomes: > > > > > > > > > > if (...) { > > > > > foo1 = ballot_1(); > > > > > } else { > > > > > foo2 = ballot_2(); > > > > > } > > > > > > > > > > > > > > > and > > > > > > > > > > if (...) { > > > > > } else { > > > > > } > > > > > ballot(); > > > > > > > > > > becomes > > > > > > > > > > if (...) { > > > > > } else { > > > > > } > > > > > ballot_1(); > > > > > > > > > > In the first case it would prevent combining the two calls > > into one > > > > > after the if. In the second example there is generally > > nothing that > > > > > says it could not be transformed into the first example with > two > > > > > calls to ballot_1(), which should not be allowed. > > > > > > > > > > Another form of duplication that we must allow are loop > > transforms, > > > > > like unrolling or peeling. These might seem similar to the > > example > > > > > above, since we are cloning code and with conditions etc. But > > > > > they are different since they calls are in different loop > > iterations. > > > > > > > > > > The condition that needs to be met is that: > > > > > > > > > > There must be a single path between all cloned ballot_n() > > functions. > > > > > > > > > > The reason for this condition is that if we clone the same > > call, then > > > > > the copies must be mutually exclusive, but if they are cloned > > from > > > > > a loop, there must be a path, or we would skip iterations. > > > > > > > > > > If we want to be more sophisticated we can add: > > > > > > > > > > If there is no such path, the calls must be separated by > > uniform branches. > > > > > > > > > > After the transform everything should be re-tagged, since we > > already > > > > > checked the calls and we don't want to check them again. > > Also, not all > > > > > transforms need (or should) have the tagging checked. One > > example is > > > > > inlining, where multiple copies are created, but they are > > clearly different > > > > > calls. The tagging can be done temporarily for a single pass, > > and then > > > > > eliminated. This could be a good tool for debugging as well, > > since it can > > > > > detect if a transform is suspect. > > > > > > > > > > The code would of course have to make sense as far as control > > flow. If > > > > > we have: > > > > > > > > > > for(;;) { > > > > > if(...) { > > > > > ballot(); > > > > > break; > > > > > } > > > > > } > > > > > > > > > > This would have to be translated to: > > > > > > > > > > for(;;) { > > > > > if(...) { > > > > > ballot(); > > > > > if(UnknownConstant) { // Not a uniform condition, but > > later on translated to "true" > > > > > break; > > > > > } > > > > > } > > > > > > > > > > However, I think that this is the way the code is generated > > today anyway. > > > > > There would have to be some attribute that indicate that > > these calls (or functions) > > > > > contain ballot or other cross-lane operations so they could > > be tagged and > > > > > checked. The attribute would be used by the passes to know > > that the special > > > > > conditions exist for those calls. > > > > > > > > > > As far as what it means to have a path, it could be > complicated. > > > > > For example: > > > > > > > > > > x = ... > > > > > ballot_1(); > > > > > > > > > > could be transformed to: > > > > > > > > > > if (x < 4711) { > > > > > ballot_1(); > > > > > > > > > > if(x >= 4711) { > > > > > ballot_1(); > > > > > } > > > > > > > > > > So a simple path check would say there is a path, and the > > transform is legal, > > > > > but if we examine the conditions, there is no path, and the > > transform should not be legal. > > > > > It could be made even more obscure of course, but I don't see > > any optimizations really > > > > > doing this kind of thing, > > > > > > > > > > - Jan > > > > > > > > > > On Saturday, December 29, 2018, 11:32:25 AM EST, Nicolai > > Hähnle via llvm-dev <llvm-dev at lists.llvm.org > > <mailto:llvm-dev at lists.llvm.org>> wrote: > > > > > > > > > > > > > > > On 20.12.18 18:03, Connor Abbott wrote: > > > > > > We already have the notion of "convergent" functions like > > > > > > syncthreads(), to which we cannot add control-flow > > dependencies. > > > > > > That is, it's legal to hoist syncthreads out of an "if", > > but it's > > > > > > not legal to sink it into an "if". It's not clear to me > > why we > > > > > > can't have "anticonvergent" (terrible name) functions > > which cannot > > > > > > have control-flow dependencies removed from them? > > ballot() would be > > > > > > both convergent and anticonvergent. > > > > > > > > > > > > Would that solve your problem? > > > > > > > > > > > > > > > > > > I think it's important to note that we already have such an > > attribute, > > > > > > although with the opposite sense - it's impossible to > > remove control > > > > > > flow dependencies from a call unless you mark it as > > "speculatable". > > > > > > > > > > This isn't actually true. If both sides of an if/else have > > the same > > > > > non-speculative function call, it can still be moved out of > > control flow. > > > > > > > > > > That's because doing so doesn't change anything at all from a > > > > > single-threaded perspective. Hence why I think we should > > model the > > > > > communication between threads honestly. > > > > > > > > > > > > > > > > However, this doesn't prevent > > > > > > > > > > > > if (...) { > > > > > > } else { > > > > > > } > > > > > > foo = ballot(); > > > > > > > > > > > > from being turned into > > > > > > > > > > > > if (...) { > > > > > > foo1 = ballot(); > > > > > > } else { > > > > > > foo2 = ballot(); > > > > > > } > > > > > > foo = phi(foo1, foo2) > > > > > > > > > > > > and vice versa. We have a "noduplicate" attribute which > > prevents > > > > > > transforming the first into the second, but not the other > > way around. Of > > > > > > course we could keep going this way and add a "nocombine" > > attribute to > > > > > > complement noduplicate. But even then, there are even still > > problematic > > > > > > transforms. For example, take this program, which is > > simplified from a > > > > > > real game that doesn't work with the AMDGPU backend: > > > > > > > > > > > > while (cond1 /* uniform */) { > > > > > > ballot(); > > > > > > ... > > > > > > if (cond2 /* non-uniform */) continue; > > > > > > ... > > > > > > } > > > > > > > > > > > > In SPIR-V, when using structured control flow, the > > semantics of this are > > > > > > pretty clearly defined. In particular, there's a continue > > block after > > > > > > the body of the loop where control flow re-converges, and > > the only back > > > > > > edge is from the continue block, so the ballot is in > > uniform control > > > > > > flow. But LLVM will get rid of the continue block since > > it's empty, and > > > > > > re-analyze the loop as two nested loops, splitting the loop > > header in > > > > > > two, producing a CFG which corresponds to this: > > > > > > > > > > > > while (cond1 /* uniform */) { > > > > > > do { > > > > > > ballot(); > > > > > > ... > > > > > > } while (cond2 /* non-uniform */); > > > > > > ... > > > > > > } > > > > > > > > > > > > Now, in an implementation where control flow re-converges > > at the > > > > > > immediate post-dominator, this won't do the right thing > > anymore. In > > > > > > order to handle it correctly, you'd effectively need to > > always flatten > > > > > > nested loops, which will probably be really bad for > > performance if the > > > > > > programmer actually wanted the second thing. It also makes > > it impossible > > > > > > when translating a high-level language to LLVM to get the > > "natural" > > > > > > behavior which game developers actually expect. This is > > exactly the sort > > > > > > of "spooky action at a distance" which makes me think that > > everything > > > > > > we've done so far is really insufficient, and we need to > > add an explicit > > > > > > notion of control-flow divergence and reconvergence to the > > IR. We need a > > > > > > way to say that control flow re-converges at the continue > > block, so that > > > > > > LLVM won't eliminate it, and we can vectorize it correctly > > without > > > > > > penalizing cases where it's better for control flow not to > > re-converge. > > > > > > > > > > Well said! > > > > > > > > > > Cheers, > > > > > > > > > > Nicolai > > > > > -- > > > > > Lerne, wie die Welt wirklich ist, > > > > > Aber vergiss niemals, wie sie sein sollte. > > > > > _______________________________________________ > > > > > LLVM Developers mailing list > > > > > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > > > -- > Lerne, wie die Welt wirklich ist, > Aber vergiss niemals, wie sie sein sollte. >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190209/36b5849b/attachment-0001.html>
Nicolai Hähnle via llvm-dev
2019-Feb-09 20:01 UTC
[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)
On 09.02.19 18:06, Connor Abbott wrote:> On Sat, Feb 9, 2019 at 4:44 PM Jan Sjodin <jan_sjodin at yahoo.com > <mailto:jan_sjodin at yahoo.com>> wrote: > > > The reason I'm looking for solutions that can work without "scanning the > > code" or "spooky action at a distance" is that we should have a solution > > that's easily digestible by folks who are not aware of GPU execution models. > > > > The fallback model for such folks should be: if your pass doesn't touch > > specially-marked calls (whether that marker is "convergent" or something > > else), then you don't have to worry. Which, as a corollary, means that > > transforms can be conservatively correct by not touching such calls. > > > > I don't see how to achieve this goal while forcing passes to do specific > > code scans. > > I'm wondering if using instructions/intrinsics is the right thing, or > is it a property of a basic block that control flow converges? Can the > intrinsics exist anywhere in a basic block? Is there a difference > between instructions in the same BB before and after a convergence > instrinsic? > > > The way I was thinking, we'd allow the merge instruction anywhere in a > basic block. Otherwise, we'd have to allow for cases where there's a > trivial edge from a basic block with one sucessor to a block with one > predecessor that can't be eliminated. For example, in the single-break > case mentioned earlier: > > while (true) { > if (...) { > ballot(); // should be executed before re-converging. > break; > } > } > ballot(); // should be executed after re-converging > > With a merge instruction, the two ballots could be in the same basic > block, with a merge instruction separating them, versus two basic blocks > with the second marked as a merge block. Of course, this does have the > disadvantage that the backend would have to duplicate any convergent > operations before the merge instruction/intrinsic if the block has > multiple predecessors. I think someone will have to actually implement > it and see which is better.Agreed.> If the IR will be multi-threaded by default (otherwise a flag for a > module would be needed), then I'm thinking that we would need a more > fundamental change to the IR to represent this, especially since > convergence can be specified without the presence of cross-lane > functions. > > > I think that whatever we do, we definitely shouldn't allow any > convergent calls in a function unless the function definition itself is > marked convergent. That is, no one has to worry about cross-thread > operations in a function, and passes can even delete merge instructions > (although not recommended for performance!) unless it's marked > convergent. I believe this is already the case now. And since > convergence has to be specified somehow as soon as there are any > convergent calls, I don't think it's that much more radical to be able > to specify convergence without them. We need a fundamental change to the > IR to be able to correctly handle ballot() and friends anyways, and once > you accept that, then some kind of ability to explicitly specify merge > points isn't really all that much more of a change, and makes it a lot > easier to adapt optimization passes. > > > >>> It's certainly interesting to think about how to maintain correctness > >>> in the face of ballots() with such a pass, but a) it's certainly no > >>> harder with merge intrinsics than merges being implicit and b) I doubt > >>> that's useful for anything you'd want to do with a GPU. > >> > >> Irreducible control flow has to be handled somehow, and linearization > >> is the only transform I know of, that will handle everything. I'm not > >> sure what the execution model says about irreducible control flow. > > > > For what it's worth, the reconverging CFG approach can also handle > > arbitrary irreducible control flow. > > > > Whether (or how) it can do this while being compatible with whatever > > semantics we come up with for cross-lane operations remains to be seen. > > I am wondering about execution order and if that is part of the > specification? > For example: > > if (A) { > L1: > ballot(); > if (B) > goto L2; > else > goto END: > } else { > L2: > ballot(); > if (C) > goto L1; > else > goto END; > } > END: > > Should the L1 ballot be executed first, or the L2, or does it not > matter? I'm not sure what the ballots should return. Would swapping > the branches from A (use !A instead of A) be a legal transform? > > > The proposal doesn't, and I don't think we should, specify which order > the diverging threads should execute in.Agreed. Basically if your high-level source code contains goto's, you should just be on your own - within reason, to ensure composability of code. For example, a reasonable rule would be that ballot & friends are undefined after a goto instruction, until execution leaves the innermost scope which is SESE (cannot be entered or left by goto jumps). Cheers, Nicolai> That's going to be highly > architecture-specific, and might even be determined at runtime. That is, > threads that have diverged should be treated just like completely > separate threads. (There's a subtle difference wrt atomics and forward > progress guarantees on some architectures, since execution of divergent > threads is usually serialized, though... but close enough). So yes, > swapping the branch condition should be allowed. > > For your example, I think that under Nicolai's proposal only a merge > instruction after END would do anything (since none of the other basic > blocks post-dominate anything). Then control flow would only be > guaranteed to re-converge at the very end, and each unique path through > the loop could eventually split into its own thread group. We might want > to allow control to converge earlier, to make it easier to map this to > structured control flow. Of course, if you're writing GPU-oriented code > with divergent *and* irreducible control flow like that, then... well, > there's not much hope anyways :) > > > - Jan > > > > On Friday, February 1, 2019, 4:57:54 PM EST, Nicolai Hähnle > <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>> wrote: > > > On 31.01.19 15:59, Jan Sjodin wrote: > >> > Any transform that re-arranges control flow would potentially > have to > >> > know about the properties of ballot(), and the rules with > respect to > >> > the CFG (and maybe consider the target) to know where to > insert the > >> > intrinsics. > > > >> But the same is true for basically any approach to handling this. In > >> fact, adding the merge intrinsics makes this much easier. Beyond the > >> usual problems with hoisting ballots, which passes currently > avoid via > >> the current convergent + speculatable attributes, we'd only have to > >> add awareness to passes that they can't duplicate/combine merge > >> intrinsics or move them past convergent intrinsics, which is a local > >> property and hence easily checkable. In example I explained, without > >> some kind of merge intrinsic, tail duplication has to look at the > >> entire loop to know whether the transform is legal. Of course, you > >> could have some kind of "no convergent calls" flag on a > function, but > >> that doesn't eliminate the nastyness when there are convergent > calls. > > > > We will have to determine if the intrinsics are worth more > compared to > > scanning the code. > > The reason I'm looking for solutions that can work without "scanning > the > code" or "spooky action at a distance" is that we should have a > solution > that's easily digestible by folks who are not aware of GPU execution > models. > > The fallback model for such folks should be: if your pass doesn't touch > specially-marked calls (whether that marker is "convergent" or > something > else), then you don't have to worry. Which, as a corollary, means that > transforms can be conservatively correct by not touching such calls. > > I don't see how to achieve this goal while forcing passes to do > specific > code scans. > > > >> > I had the impression that the control flow convergence was > >> > in part specified by what the target architecture can handle. > > > >> This is true, although hopefully we can agree on something that > >> everyone can implement. > > > > Yes, hopefully there is an execution model that is works for > everyone. > > > >> > One of > >> > the more complicated cases would be linearization where the > control > >> > flow is completely rewritten, and is encoded in a variable > that says > >> > which basic block is the next one to execute. > > > >> It's certainly interesting to think about how to maintain > correctness > >> in the face of ballots() with such a pass, but a) it's certainly no > >> harder with merge intrinsics than merges being implicit and b) I > doubt > >> that's useful for anything you'd want to do with a GPU. > > > > Irreducible control flow has to be handled somehow, and linearization > > is the only transform I know of, that will handle everything. I'm not > > sure what the execution model says about irreducible control flow. > > For what it's worth, the reconverging CFG approach can also handle > arbitrary irreducible control flow. > > Whether (or how) it can do this while being compatible with whatever > semantics we come up with for cross-lane operations remains to be seen. > > > >> > Another case is DCE, > >> > where a ballot() could be eliminated, and it would > potentially have to > >> > remove a number of intrinsics to enable later optimizations > (unless it > >> > would affect performance?), so the intrinsics will require some > >> > non-local updates. > > > >> Removing merge intrinsics if it's profitable and allowed is a > >> separate optimization, that can be done relatively quickly in a > single > >> pass without impacting the performance of other optimizations. It's > >> requiring expensive non-local checks in optimizations which modify > >> control flow that's a problem. > > > > There are certainly a lot of tradeoffs. My point was simply that the > > intrinsics are not strictly local. > > Right. I have some ideas for how to fix this, but wasn't able to > work on > them for the last weeks. > > Cheers, > Nicolai > > > >> > So we might have them without a ballot(), which would seem to > make it > >> > more difficult for structurizers or other transforms to > maintain the > >> > semantics and insert intrinsics. > > > >> It's not any more difficult code-wise, since the case where there is > >> a ballot needs to be handled anyways. And while it might take longer > >> to process stuff when this hypothetical pass encounters a merge > >> intrinsic (I can't think of a real-world case where it would > matter), > >> it should result in better code generated. > > > > I was thinking of linearization or something similar, where an > > instrinsic by itself might be hard to preserve, compared to > looking at > > a ballot() and derive where intrinsics should be inserted. > > > >> > > > How are uniform branches handled? Do they affect the > convergence model? > >> > > > > >> > > We may be able to remove convergence points if branches are > uniform. > >> > > In Nicolai's proposal, I believe we'd want to remove a merge > intrinsic > >> > > when all the branches that it post-dominates that aren't also > >> > > post-dominated by some other merge intrinsic are uniform. > >> > > >> > I couldn't quite understand the last sentence, but I assume the > >> > conditions would prevent removing convergence points that help > >> > performance. Post-domination might not be adequate if there > are loops > >> > btw. > >> > >> It should be -- Nicolai's proposal is that a merge intrinsic merges > >> all the divergences caused by branches post dominated by the > >> intrinsic, so if all the divergent branches are merged by some other > >> intrinsic earlier, then there's no divergence and the merge > intrinsic > >> is a no-op. > > > > Makes sense. > > > > - Jan > > On Wednesday, January 30, 2019, 11:41:29 AM EST, Connor Abbott > > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>> wrote: > > > > > > On Wed, Jan 30, 2019 at 4:20 PM Jan Sjodin <jan_sjodin at yahoo.com > <mailto:jan_sjodin at yahoo.com> > > <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>> wrote: > > > > > > > > > > > > for (int i = 0; i < 2; i++) { > > > > > foo = ballot(true); // ballot 1 > > > > > > > > > > if (threadID /* ID of the thread within a > wavefront/warp */ % 2 == 0) continue; > > > > > > > > > > bar = ballot(true); // ballot 2 > > > > > } > > > > > > > > > > versus: > > > > > > > > > > int i = 0; > > > > > while (true) { > > > > > do { > > > > > if (i == 2) break; > > > > > foo = ballot(true); // ballot 1 > > > > > i++; > > > > > } while (threadID % 2 == 0); > > > > > > > > > > if (i == 2) break; > > > > > bar = ballot(true); // ballot 2 > > > > > i++; > > > > > } > > > > > > > > I think you can remove the second "i++", otherwise we can > increment "i" twice > > > > if threadID % 2 != 0, but I see the issue. Using the > equivalence classes would > > > > prevent this transform, since we have re-arranged the > control flow in that way, > > > > > > What do you mean? Note that the transforms that lead from > the first > > > example to the second are actually desirable if there aren't any > > > ballots or other convergent operations, so banning them > entirely is a > > > no-go. Otherwise, note that the ballot here can be nested > arbitrarily > > > deep, which means that jump forwarding/tail duplication has > to be > > > aware of the entire loop unless we have some kind of merge > intrinsic. > > > > Yes, if there weren't any calls to a ballot, then the > transform would > > be legal. What I was saying in the beginning was that ballot() > would > > have some special rules attached to it. It is of course > undesirable to > > have flags to enforce correctness, but that is the point we are at > > right now. > > > > > Also, I should say that while interesting, control > equivalence classes > > > can be part of the *implementation*, but not the > *specification*. We > > > need > > > to define the semantics of the IR -- that is, how a program > > > compiled from any given IR is allowed to do when executed > (and *not* > > > what it looks like when compiled) -- *first*, and then which > > > transforms are allowed/not allowed will fall out of that. We > can't > > > start by listing disallowed transforms, because then when > someone > > > comes along and writes a new optimization, it might be > technically > > > "allowed" even though it breaks something in practice. > > > > I think you misunderstand me if you think I was listing disallowed > > transforms. My question was if it is possible to have > multi-threaded > > semantics in the source language, but in the compiler have a > > single-threaded view, where some properties of the CFG would > determine > > what is legal and not for some functions with a special flag. > > > > > > But that's practically the same as listing allowed and disallowed > > transforms -- you're defining what the final IR is allowed to > look like, > > not how it's allowed to execute. If at all possible, the former > should > > be derived from the latter. > > > > I agree > > it is more desirable to have the the semantics specified in the > > IR. However, I am exploring this from a practical point of > view, since > > ballot() is very rare compared to all code that is being > compiled for > > all targets. These intrinsics would always have to be > considered when > > writing a pass. They seem to me harder to think about, and > > test, for someone who is working on a single-threaded target, > compared > > to looking at a flag. If we allow ourselves to look at which > > transforms that might violate these properties, we could would > free up > > the rest of the compiler (and developers) to not have to worry > about > > these things. Intrinsics would have to be maintained > throughout the > > entire compilation process in every pass. > > > > > The only way to > > > conclusively prove that transforms will never break code that's > > > supposed to work (except for bugs in the transforms) is to > define the > > > semantics of the IR, and then to make sure that it refines the > > > semantics of the source language. This is why the LangRef > almost never > > > talks about allowed/disallowed transforms except as examples to > > > explain some given semantics, and if you don't follow that > rule, your > > > patch will probably be rejected. > > > > I'm trying to figure out if we are in the "almost never" territory > > here or not. > > > > > > I don't think so at all. In addition to being much, much harder to > > reason about and prove that the approach is sound, I think it'll > be more > > intrusive. > > > > > > > Now, to define a semantics for ballot(), we need to define > what it's > > > allowed to return for any given execution of any given > program, and to > > > do that, we need to define which threads must be active > together when > > > it's reached, which in turns means we need to define when > control flow > > > can re-converge somehow. Any proposal for how to handle > ballot() must > > > start here. > > > > > > I'm not sure if using these rules will be easier or harder > than dealing with > > > > intrinsics. One problem is that the rules for > single-threaded code might seem > > > > arbitrary, and it would be hard to reason about them in a > larger context. > > > > > > What happens to new control flow created by transforms, > and what will guide > > > > the insertion of intrinsics in the new code? Code > structurization is one example > > > > were this could happen. > > > > > Hopefully, it's clear from the above how this all falls out. The > > > frontend for e.g. GLSL or SPIR-V would have to insert the merge > > > intrinsics to preserve the semantics of the source language. Any > > > transforms must refine the semantics of the IR, although I > can't think > > > of a scenario where that would involve emitting any new merge > > > intrinsics. Usually, structurized IR's have their own > semantics about > > > when control flow converges, so a code structurizer should > respect the > > > original semantics. AMDGPU has its own code structurizer > that runs > > > very late in the pipeline (although there are plans to > remove it), so > > > we might have to change that to make it respect the merge > intrinsic > > > intrinsics, and then we'll correctly implement them "for free". > > > > Any transform that re-arranges control flow would potentially > have to > > know about the properties of ballot(), and the rules with > respect to > > the CFG (and maybe consider the target) to know where to > insert the > > intrinsics. > > > > > > But the same is true for basically any approach to handling this. In > > fact, adding the merge intrinsics makes this much easier. Beyond the > > usual problems with hoisting ballots, which passes currently > avoid via > > the current convergent + speculatable attributes, we'd only have > to add > > awareness to passes that they can't duplicate/combine merge > intrinsics > > or move them past convergent intrinsics, which is a local > property and > > hence easily checkable. In example I explained, without some kind of > > merge intrinsic, tail duplication has to look at the entire loop > to know > > whether the transform is legal. Of course, you could have some > kind of > > "no convergent calls" flag on a function, but that doesn't > eliminate the > > nastyness when there are convergent calls. > > > > I had the impression that the control flow convergence was > > in part specified by what the target architecture can handle. > > > > > > This is true, although hopefully we can agree on something that > everyone > > can implement. > > > > One of > > the more complicated cases would be linearization where the > control > > flow is completely rewritten, and is encoded in a variable > that says > > which basic block is the next one to execute. > > > > > > It's certainly interesting to think about how to maintain > correctness in > > the face of ballots() with such a pass, but a) it's certainly no > harder > > with merge intrinsics than merges being implicit and b) I doubt > that's > > useful for anything you'd want to do with a GPU. > > > > Another case is DCE, > > where a ballot() could be eliminated, and it would potentially > have to > > remove a number of intrinsics to enable later optimizations > (unless it > > would affect performance?), so the intrinsics will require some > > non-local updates. > > > > > > Removing merge intrinsics if it's profitable and allowed is a > separate > > optimization, that can be done relatively quickly in a single pass > > without impacting the performance of other optimizations. It's > requiring > > expensive non-local checks in optimizations which modify control > flow > > that's a problem. > > > > > > > > > Would they only be present if ballot and similar > functions are used, or do they > > > > > have to be present everywhere? > > > > > > > > They'd only have to be present when ballot or other convergent > > > > functions are called, since otherwise it doesn't matter > when control > > > flow re-converges. However, we may want to keep them around for > > > performance reasons (usually destroying convergence points > is bad for > > > performance). > > > > So we might have them without a ballot(), which would seem to > make it > > more difficult for structurizers or other transforms to > maintain the > > semantics and insert intrinsics. > > > > > > It's not any more difficult code-wise, since the case where there > is a > > ballot needs to be handled anyways. And while it might take > longer to > > process stuff when this hypothetical pass encounters a merge > intrinsic > > (I can't think of a real-world case where it would matter), it > should > > result in better code generated. > > > > > > > > How are uniform branches handled? Do they affect the > convergence model? > > > > > > > We may be able to remove convergence points if branches are > uniform. > > > In Nicolai's proposal, I believe we'd want to remove a merge > intrinsic > > > when all the branches that it post-dominates that aren't also > > > post-dominated by some other merge intrinsic are uniform. > > > > I couldn't quite understand the last sentence, but I assume the > > conditions would prevent removing convergence points that help > > performance. Post-domination might not be adequate if there > are loops > > btw. > > > > > > It should be -- Nicolai's proposal is that a merge intrinsic > merges all > > the divergences caused by branches post dominated by the > intrinsic, so > > if all the divergent branches are merged by some other intrinsic > > earlier, then there's no divergence and the merge intrinsic is a > no-op. > > > > > > - Jan > > > > > > > > On Wednesday, January 30, 2019, 6:29:52 AM EST, Connor Abbott > > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com> > <mailto:cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>>> wrote: > > > > > > On Mon, Jan 28, 2019 at 9:09 PM Jan Sjodin > <jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com> > > <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>> > wrote: > > > > > > > for (int i = 0; i < 2; i++) { > > > > foo = ballot(true); // ballot 1 > > > > > > > > if (threadID /* ID of the thread within a > wavefront/warp */ > > % 2 == 0) continue; > > > > > > > > bar = ballot(true); // ballot 2 > > > > } > > > > > > > > versus: > > > > > > > > int i = 0; > > > > while (true) { > > > > do { > > > > if (i == 2) break; > > > > foo = ballot(true); // ballot 1 > > > > i++; > > > > } while (threadID % 2 == 0); > > > > > > > > if (i == 2) break; > > > > bar = ballot(true); // ballot 2 > > > > i++; > > > > } > > > > > > I think you can remove the second "i++", otherwise we can > > increment "i" twice > > > if threadID % 2 != 0, but I see the issue. Using the > equivalence > > classes would > > > prevent this transform, since we have re-arranged the control > > flow in that way, > > > > What do you mean? Note that the transforms that lead from the > first > > example to the second are actually desirable if there aren't any > > ballots or other convergent operations, so banning them > entirely is a > > no-go. Otherwise, note that the ballot here can be nested > arbitrarily > > deep, which means that jump forwarding/tail duplication has to be > > aware of the entire loop unless we have some kind of merge > intrinsic. > > > > Also, I should say that while interesting, control equivalence > classes > > can be part of the *implementation*, but not the > *specification*. We > > need to define the semantics of the IR -- that is, how a program > > compiled from any given IR is allowed to do when executed (and > *not* > > what it looks like when compiled) -- *first*, and then which > > transforms are allowed/not allowed will fall out of that. We can't > > start by listing disallowed transforms, because then when someone > > comes along and writes a new optimization, it might be technically > > "allowed" even though it breaks something in practice. The > only way to > > conclusively prove that transforms will never break code that's > > supposed to work (except for bugs in the transforms) is to > define the > > semantics of the IR, and then to make sure that it refines the > > semantics of the source language. This is why the LangRef > almost never > > talks about allowed/disallowed transforms except as examples to > > explain some given semantics, and if you don't follow that > rule, your > > patch will probably be rejected. > > > > Now, to define a semantics for ballot(), we need to define > what it's > > allowed to return for any given execution of any given > program, and to > > do that, we need to define which threads must be active > together when > > it's reached, which in turns means we need to define when > control flow > > can re-converge somehow. Any proposal for how to handle > ballot() must > > start here. > > > > > I'm not sure if using these rules will be easier or harder > than > > dealing with > > > intrinsics. One problem is that the rules for single-threaded > > code might seem > > > arbitrary, and it would be hard to reason about them in a > larger > > context. > > > > > > > Nicolai's proposal solves this by having the frontend emit a > > merge intrinsic > > > > before the i++ is emitted. This prevents the jump forwarding > > from occurring. > > > > > > I was thinking about getting through the single-thread > view and > > the issues with > > > that first, then I will think more about the multi-thread and > > explicit convergence. > > > > > > If we are done with the single-thread stuff for now these > are the > > question that I > > > have been thinking about with the multi-threaded view: > > > > > > What happens to new control flow created by transforms, > and what > > will guide > > > the insertion of intrinsics in the new code? Code > structurization > > is one example > > > were this could happen. > > > > Hopefully, it's clear from the above how this all falls out. The > > frontend for e.g. GLSL or SPIR-V would have to insert the merge > > intrinsics to preserve the semantics of the source language. Any > > transforms must refine the semantics of the IR, although I > can't think > > of a scenario where that would involve emitting any new merge > > intrinsics. Usually, structurized IR's have their own > semantics about > > when control flow converges, so a code structurizer should > respect the > > original semantics. AMDGPU has its own code structurizer that runs > > very late in the pipeline (although there are plans to remove > it), so > > we might have to change that to make it respect the merge > intrinsic > > intrinsics, and then we'll correctly implement them "for free". > > > > > > > > Would they only be present if ballot and similar functions are > > used, or do they > > > have to be present everywhere? > > > > They'd only have to be present when ballot or other convergent > > functions are called, since otherwise it doesn't matter when > control > > flow re-converges. However, we may want to keep them around for > > performance reasons (usually destroying convergence points is > bad for > > performance). > > > > > > > > How are uniform branches handled? Do they affect the > convergence > > model? > > > > We may be able to remove convergence points if branches are > uniform. > > In Nicolai's proposal, I believe we'd want to remove a merge > intrinsic > > when all the branches that it post-dominates that aren't also > > post-dominated by some other merge intrinsic are uniform. > > > > > > > > > > > > > > > - Jan > > > > > > > > > On Monday, January 28, 2019, 11:16:36 AM EST, Connor Abbott > > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com> > <mailto:cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>>> wrote: > > > > > > > > > > > > On Fri, Jan 25, 2019 at 3:05 AM Jan Sjodin > <jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com> > > <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>> > wrote: > > > > > > > > > for (...) { > > > > > ballot(); > > > > > if (... /* non-uniform */) continue; > > > > > } > > > > > > > > > > into > > > > > > > > > > for (...) { > > > > > do { > > > > > ballot(); > > > > > } while (... /* non-uniform */); > > > > > } > > > > > > > > I'm not sure if I follow this example, could you and > explain a > > bit more? > > > > It looks to me that the condition in the "if" must be > false (if the > > > > same condition is used in the while), or we would > > > > call ballot the wrong number of times. > > > > > > Yes, the idea is that the same condition is used in the if and > > the do-while. I think I messed up the example a little... in the > > second snippet, we're supposed to break out of the inner loop > if the > > outer loop's exit condition is true. Here's a more concrete > example: > > > > > > for (int i = 0; i < 2; i++) { > > > foo = ballot(true); // ballot 1 > > > > > > if (threadID /* ID of the thread within a > wavefront/warp */ % > > 2 == 0) continue; > > > > > > bar = ballot(true); // ballot 2 > > > } > > > > > > versus: > > > > > > int i = 0; > > > while (true) { > > > do { > > > if (i == 2) break; > > > foo = ballot(true); // ballot 1 > > > i++; > > > } while (threadID % 2 == 0); > > > > > > if (i == 2) break; > > > bar = ballot(true); // ballot 2 > > > i++; > > > } > > > > > > From a single-threaded perspective, these two snippets are > > identical, even if ballot() writes to arbitrary memory. The first > > can easily get transformed to something like the second when LLVM > > decides to duplicate the final i++ through jump forwarding, > and then > > re-interprets the loop as two nested loops and splits the loop > > header in two. This is what currently happens with DOOM when > we try > > to enable subgroup operations with it. Let's say there are two > > threads in a wavefront. Then the execution trace mandated by > SPIR-V > > for the first looks like: > > > > > > thread 0 | thread 1 > > > ballot 1 = 0b11 | ballot 1 = 0b11 > > > skipped | ballot 2 = 0b10 > > > ballot 1 = 0b11 | ballot 1 = 0b11 > > > skipped | ballot 2 = 0b10 > > > > > > Now, contrast this with the execution trace that programmers > > would expect for the second example: > > > > > > thread 0 | thread 1 > > > ballot 1 = 0b11 | ballot 1 = 0b11 > > > ballot 1 = 0b01 | skipped > > > skipped | ballot 2 = 0b10 > > > skipped | ballot 1 = 0b10 > > > skipped | ballot 2 = 0b10 > > > > > > Nicolai's proposal solves this by having the frontend emit a > > merge intrinsic before the i++ is emitted. This prevents the jump > > forwarding from occurring. > > > > > > > > > > > > > > About the CSE, when would that be legal? I can imagine > with uniform > > > > branches that it could work, but would like to see an > example to > > > > fully understand this. > > > > > > > > I agree that it would be more conservative than if we > model the > > threading, > > > > but I'm not sure about the cost/benefit. I am mostly > curious if > > it is > > > > possible to have a single-thread view or not. Then we > would have to > > > > see if it is adequate. > > > > > > > > - Jan > > > > > > > > On Thursday, January 24, 2019, 10:31:47 AM EST, Connor > Abbott > > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com> > <mailto:cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>>> wrote: > > > > > > > > > > > > I don't see how this would fix the continue vs. nested loop > > problem I > > > > explained earlier. That is, how would this prevent turning: > > > > > > > > for (...) { > > > > ballot(); > > > > if (... /* non-uniform */) continue; > > > > } > > > > > > > > into > > > > > > > > for (...) { > > > > do { > > > > ballot(); > > > > } while (... /* non-uniform */); > > > > } > > > > > > > > and vice versa? Note that there's no duplication going > on here, and > > > > the single-threaded flow of control is exactly the same. > > > > > > > > Another reason this isn't so great is that it prevents > e.g. CSE on > > > > ballots that actually should be combined, since you're > > modelling it as > > > > a write. It seems like the optimizer is going to need > some special > > > > knowledge of convergent things that fake memory constraints > > can't give > > > > us. > > > > > > > > On Thu, Jan 24, 2019 at 4:06 PM Jan Sjodin > > <jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com> > <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>> wrote: > > > > > > > > > > > > > > > I was looking into ballot() and how if it is possible > to keep > > a single-threaded > > > > > view of the code, but add some extra conditions that must > > hold after the > > > > > transformation. I had the initial idea that each call to > > ballot() in a > > > > > single-threaded program can be seen as a partial write > to a > > memory > > > > > location, and each location memory location is unique for > > every call site, > > > > > plus there some externally observable side effect. We can > > abstract this > > > > > away by tagging the calls, e.g. by using aliases. > > > > > > > > > > For example: > > > > > > > > > > if (...) { > > > > > foo1 = ballot(); > > > > > } else { > > > > > foo2 = ballot(); > > > > > } > > > > > > > > > > simply becomes: > > > > > > > > > > if (...) { > > > > > foo1 = ballot_1(); > > > > > } else { > > > > > foo2 = ballot_2(); > > > > > } > > > > > > > > > > > > > > > and > > > > > > > > > > if (...) { > > > > > } else { > > > > > } > > > > > ballot(); > > > > > > > > > > becomes > > > > > > > > > > if (...) { > > > > > } else { > > > > > } > > > > > ballot_1(); > > > > > > > > > > In the first case it would prevent combining the two calls > > into one > > > > > after the if. In the second example there is generally > > nothing that > > > > > says it could not be transformed into the first > example with two > > > > > calls to ballot_1(), which should not be allowed. > > > > > > > > > > Another form of duplication that we must allow are loop > > transforms, > > > > > like unrolling or peeling. These might seem similar to the > > example > > > > > above, since we are cloning code and with conditions > etc. But > > > > > they are different since they calls are in different loop > > iterations. > > > > > > > > > > The condition that needs to be met is that: > > > > > > > > > > There must be a single path between all cloned ballot_n() > > functions. > > > > > > > > > > The reason for this condition is that if we clone the same > > call, then > > > > > the copies must be mutually exclusive, but if they are > cloned > > from > > > > > a loop, there must be a path, or we would skip iterations. > > > > > > > > > > If we want to be more sophisticated we can add: > > > > > > > > > > If there is no such path, the calls must be separated by > > uniform branches. > > > > > > > > > > After the transform everything should be re-tagged, > since we > > already > > > > > checked the calls and we don't want to check them again. > > Also, not all > > > > > transforms need (or should) have the tagging checked. One > > example is > > > > > inlining, where multiple copies are created, but they are > > clearly different > > > > > calls. The tagging can be done temporarily for a > single pass, > > and then > > > > > eliminated. This could be a good tool for debugging as > well, > > since it can > > > > > detect if a transform is suspect. > > > > > > > > > > The code would of course have to make sense as far as > control > > flow. If > > > > > we have: > > > > > > > > > > for(;;) { > > > > > if(...) { > > > > > ballot(); > > > > > break; > > > > > } > > > > > } > > > > > > > > > > This would have to be translated to: > > > > > > > > > > for(;;) { > > > > > if(...) { > > > > > ballot(); > > > > > if(UnknownConstant) { // Not a uniform > condition, but > > later on translated to "true" > > > > > break; > > > > > } > > > > > } > > > > > > > > > > However, I think that this is the way the code is > generated > > today anyway. > > > > > There would have to be some attribute that indicate that > > these calls (or functions) > > > > > contain ballot or other cross-lane operations so they > could > > be tagged and > > > > > checked. The attribute would be used by the passes to know > > that the special > > > > > conditions exist for those calls. > > > > > > > > > > As far as what it means to have a path, it could be > complicated. > > > > > For example: > > > > > > > > > > x = ... > > > > > ballot_1(); > > > > > > > > > > could be transformed to: > > > > > > > > > > if (x < 4711) { > > > > > ballot_1(); > > > > > > > > > > if(x >= 4711) { > > > > > ballot_1(); > > > > > } > > > > > > > > > > So a simple path check would say there is a path, and the > > transform is legal, > > > > > but if we examine the conditions, there is no path, > and the > > transform should not be legal. > > > > > It could be made even more obscure of course, but I > don't see > > any optimizations really > > > > > doing this kind of thing, > > > > > > > > > > - Jan > > > > > > > > > > On Saturday, December 29, 2018, 11:32:25 AM EST, Nicolai > > Hähnle via llvm-dev <llvm-dev at lists.llvm.org > <mailto:llvm-dev at lists.llvm.org> > > <mailto:llvm-dev at lists.llvm.org > <mailto:llvm-dev at lists.llvm.org>>> wrote: > > > > > > > > > > > > > > > On 20.12.18 18:03, Connor Abbott wrote: > > > > > > We already have the notion of "convergent" > functions like > > > > > > syncthreads(), to which we cannot add control-flow > > dependencies. > > > > > > That is, it's legal to hoist syncthreads out of > an "if", > > but it's > > > > > > not legal to sink it into an "if". It's not > clear to me > > why we > > > > > > can't have "anticonvergent" (terrible name) functions > > which cannot > > > > > > have control-flow dependencies removed from them? > > ballot() would be > > > > > > both convergent and anticonvergent. > > > > > > > > > > > > Would that solve your problem? > > > > > > > > > > > > > > > > > > I think it's important to note that we already have > such an > > attribute, > > > > > > although with the opposite sense - it's impossible to > > remove control > > > > > > flow dependencies from a call unless you mark it as > > "speculatable". > > > > > > > > > > This isn't actually true. If both sides of an if/else have > > the same > > > > > non-speculative function call, it can still be moved > out of > > control flow. > > > > > > > > > > That's because doing so doesn't change anything at all > from a > > > > > single-threaded perspective. Hence why I think we should > > model the > > > > > communication between threads honestly. > > > > > > > > > > > > > > > > However, this doesn't prevent > > > > > > > > > > > > if (...) { > > > > > > } else { > > > > > > } > > > > > > foo = ballot(); > > > > > > > > > > > > from being turned into > > > > > > > > > > > > if (...) { > > > > > > foo1 = ballot(); > > > > > > } else { > > > > > > foo2 = ballot(); > > > > > > } > > > > > > foo = phi(foo1, foo2) > > > > > > > > > > > > and vice versa. We have a "noduplicate" attribute which > > prevents > > > > > > transforming the first into the second, but not the > other > > way around. Of > > > > > > course we could keep going this way and add a > "nocombine" > > attribute to > > > > > > complement noduplicate. But even then, there are > even still > > problematic > > > > > > transforms. For example, take this program, which is > > simplified from a > > > > > > real game that doesn't work with the AMDGPU backend: > > > > > > > > > > > > while (cond1 /* uniform */) { > > > > > > ballot(); > > > > > > ... > > > > > > if (cond2 /* non-uniform */) continue; > > > > > > ... > > > > > > } > > > > > > > > > > > > In SPIR-V, when using structured control flow, the > > semantics of this are > > > > > > pretty clearly defined. In particular, there's a > continue > > block after > > > > > > the body of the loop where control flow > re-converges, and > > the only back > > > > > > edge is from the continue block, so the ballot is in > > uniform control > > > > > > flow. But LLVM will get rid of the continue block since > > it's empty, and > > > > > > re-analyze the loop as two nested loops, splitting > the loop > > header in > > > > > > two, producing a CFG which corresponds to this: > > > > > > > > > > > > while (cond1 /* uniform */) { > > > > > > do { > > > > > > ballot(); > > > > > > ... > > > > > > } while (cond2 /* non-uniform */); > > > > > > ... > > > > > > } > > > > > > > > > > > > Now, in an implementation where control flow > re-converges > > at the > > > > > > immediate post-dominator, this won't do the right thing > > anymore. In > > > > > > order to handle it correctly, you'd effectively need to > > always flatten > > > > > > nested loops, which will probably be really bad for > > performance if the > > > > > > programmer actually wanted the second thing. It also > makes > > it impossible > > > > > > when translating a high-level language to LLVM to > get the > > "natural" > > > > > > behavior which game developers actually expect. This is > > exactly the sort > > > > > > of "spooky action at a distance" which makes me > think that > > everything > > > > > > we've done so far is really insufficient, and we need to > > add an explicit > > > > > > notion of control-flow divergence and reconvergence > to the > > IR. We need a > > > > > > way to say that control flow re-converges at the > continue > > block, so that > > > > > > LLVM won't eliminate it, and we can vectorize it > correctly > > without > > > > > > penalizing cases where it's better for control flow > not to > > re-converge. > > > > > > > > > > Well said! > > > > > > > > > > Cheers, > > > > > > > > > > Nicolai > > > > > -- > > > > > Lerne, wie die Welt wirklich ist, > > > > > Aber vergiss niemals, wie sie sein sollte. > > > > > _______________________________________________ > > > > > LLVM Developers mailing list > > > > > llvm-dev at lists.llvm.org > <mailto:llvm-dev at lists.llvm.org> <mailto:llvm-dev at lists.llvm.org > <mailto:llvm-dev at lists.llvm.org>> > > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > > > -- > Lerne, wie die Welt wirklich ist, > Aber vergiss niemals, wie sie sein sollte. >-- Lerne, wie die Welt wirklich ist, Aber vergiss niemals, wie sie sein sollte.