Nicolai Hähnle via llvm-dev
2020-Aug-18 08:44 UTC
[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
Hi Hal,

On Mon, Aug 17, 2020 at 8:30 PM Hal Finkel <hfinkel at anl.gov> wrote:
> I agree that the new scheme seems better for several reasons. So it
> seems like your recommendation is that we consider the existing
> attribute usage deprecated? Or should we auto-upgrade it to something?

Yes.

Some form of auto-upgrade could be attempted via the
ConvergenceControlHeuristic of https://reviews.llvm.org/D85609, though
that's slightly big for on-the-fly upgrading in the bitcode reader. So
while the groundwork for potentially doing it is there, I don't see a
strong need or desire for it at the moment.

Cheers,
Nicolai

On 8/9/20 10:03 AM, Nicolai Hähnle via llvm-dev wrote:

Hi all,

please see https://reviews.llvm.org/D85603 and its related changes for
our most recent and hopefully final attempt at putting the `convergent`
attribute on a solid theoretical foundation in a way that is useful for
modern GPU compiler use cases. We have a clear line of sight to enabling
a new control flow implementation in the AMDGPU backend which is built
on this foundation. I have started upstreaming some groundwork recently.
We expect this foundation to also be usable by non-GPU whole-program
vectorization environments, if they choose to do so.

What follows is a (not quite) brief overview of what this is all about,
but do see the linked review and its "stack" for a fuller story.


The Problem
-----------
`convergent` was originally introduced in LLVM to label "barrier"
instructions in GPU programming languages. Those put some unusual
constraints on program transforms that weren't particularly well
understood at the time. Since then, GPU languages have introduced
additional instructions such as wave shuffles or subgroup reductions
that made the problem harder and exposed some shortcomings of the
current definition of `convergent`. At the same time, those additional
instructions helped illuminate what the problem really is.

Briefly, in whole-program vectorization environments such as GPUs, we
can expose opportunities for performance optimizations to developers by
allowing data exchange between threads that are grouped together as
lanes in a vector. As long as control flow is uniform, i.e. all threads
of the vector follow the same path through the CFG, it is natural that
all threads assigned to the same vector participate in this data
exchange. When control flow diverges, some threads may be unavailable
for the data exchange, but we still need to allow data exchange among
the threads that _are_ available.

This means that those data exchange operations -- i.e., `convergent`
operations -- communicate among a set of threads that implicitly depends
on the position of the operation in control flow.

The problem boils down to defining how the set of communicating threads
is determined.
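For illustration, such an operation is simply a call to a function that
carries the `convergent` attribute; the @subgroupAdd used throughout
this mail stands in for whatever cross-lane reduction a target provides:

  ; hypothetical cross-lane reduction; `convergent` marks it as
  ; communicating with a set of threads that depends on its position
  ; in control flow
  declare i32 @subgroupAdd(i32) convergent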
This problem inherently has a large target- and environment-specific
component. Even if control flow is fully uniform, there is the question
of which threads are grouped together in the first place. In the same
way that LLVM IR has no built-in understanding of how threads come to be
in the first place (even in the more well-known multi-threaded CPU
programming world where convergent operations don't exist), we don't
concern ourselves at all with these types of questions here. For the
purpose of target-independent LLVM IR semantics, we simply assume that
they are answered _somehow_ and instead focus on the part of the problem
that relates to control flow.

When looking at high-level language source, it is often "intuitively
obvious" which threads should communicate. This is not the case in an
unstructured CFG. Here's an example of a non-trivial loop where a
reduction is used (the result of the reduction is the sum of the input
values of all participating threads):

  A:
      br label %B

  B:
      ...
      %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
      ...
      br i1 %cc1, label %B, label %C

  C:
      ...
      br i1 %cc2, label %B, label %D

  D:
      ; loop exit

Suppose this code is executed by two threads grouped in a (very short)
vector, and the threads execute the following sequences of basic blocks:

  Thread 1: A B B C D
  Thread 2: A B C B C D

There are two different intuitive ways of "aligning" the threads for the
purpose of communication between convergent operations:

  Align based on natural loops:
  Thread 1: A B - B C D
  Thread 2: A B C B C D

  Align based on a nested loop interpretation:
  Thread 1: A B B C - - D
  Thread 2: A B - C B C D

(Explanation of the alignment: In the first alignment, %sum.b is always
the sum of the input values from both threads. In the second alignment,
the first computation of %sum.b is collaborative between the two
threads; the second one in each thread simply returns the input value
for that thread.)

The example only has a single natural loop, but it could result from a
high-level language source that has two nested do-while loops, so the
second interpretation may well be what the programmer intended.

It has often been proposed that the alignment should be defined purely
based on the IR shown above, such as by always aligning based on natural
loops. This doesn't work, even if we ignore the possibility of
irreducible control flow. In the example, we could in principle generate
IR as follows if the "nested loop alignment" is desired:

  A:
      br label %outer.hdr

  outer.hdr:
      br label %B

  B:
      ...
      %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
      ...
      br i1 %cc1, label %B, label %C

  C:
      ...
      br i1 %cc2, label %outer.hdr, label %D

  D:
      ; loop exit

From a single-threaded perspective, it would be correct to short-cut all
paths through outer.hdr, but this results in the original IR and
therefore different behavior for the convergent operation (under such a
proposal). Worse, the part of the IR which makes the short-cutting
incorrect -- i.e., the affected convergent operation -- is potentially
far away from the paths that we're short-cutting. We want to avoid this
"spooky action at a distance" because it puts an undue burden on generic
code transforms.
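(Spelled out, the problematic short-cut is just an edge rewrite that a
generic cleanup such as SimplifyCFG might perform on the empty block:

  C:
      ...
      br i1 %cc2, label %B, label %D   ; was: br i1 %cc2, label %outer.hdr, ...

which yields exactly the first IR again.)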

The Solution
------------
We introduce explicit annotations in the IR using "convergence tokens"
that "anchor" a convergent operation with respect to some other point in
control flow or with respect to the function entry. The details are in
the linked Phabricator review, so I will simply illustrate here how this
solves the example above.

To obtain the original natural loop alignment, we augment the example as
follows:

  A:
      %anchor = call token @llvm.experimental.convergence.anchor()
      br label %B

  B:
      %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
      ...
      %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %loop) ]
      ...
      br i1 %cc1, label %B, label %C

  C:
      ...
      br i1 %cc2, label %B, label %D

  D:
      ; loop exit

To obtain the loop nest alignment, we augment it as follows:

  A:
      %anchor = call token @llvm.experimental.convergence.anchor()
      br label %outer.hdr

  outer.hdr:
      %outer = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
      br label %B

  B:
      %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
      ...
      %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %inner) ]
      ...
      br i1 %cc1, label %B, label %C

  C:
      ...
      br i1 %cc2, label %outer.hdr, label %D

  D:
      ; loop exit

We end up with two nested natural loops as before, but short-cutting
through outer.hdr is easily seen to be forbidden because of the "loop
heart" intrinsic along the path through outer.hdr. (We call it a "heart"
because intuitively that's where loop iterations are counted for the
purpose of convergent operations, i.e. we count "heartbeats" of the
loop.) Having the non-removable outer.hdr block may seem like an
inefficiency, but at least in the AMDGPU backend we end up needing that
basic block anyway to implement the desired behavior. We don't literally
count loop iterations in the AMDGPU backend, but we end up doing
something isomorphic that also requires the block. We suspect that other
implementations will have similar requirements. In any case, backends
are free to optimize the block away using target-specific handling.


Have you tried other solutions?
-------------------------------
Yes, we've tried many things over the years. I've personally been
thinking about and working on this problem on and off for a significant
amount of time over the last four years. I supervised a practical
master's thesis on a related topic during this time. Other people have
thought about the problem a lot as well. Discussions with many people in
the LLVM community, inside AMD, and elsewhere have all helped our
understanding of the problem. Many different ideas were considered and
ultimately rejected in the process.

All attempts to solve the problem _without_ some form of "convergence
token" have suffered from spooky action at a distance. This includes the
current definition of `convergent` in the LangRef.

We tried _very_ hard to find a workable definition of `convergent` that
doesn't talk about groups of threads in the LangRef, but ultimately came
to the conclusion that thinking about convergent operations inherently
involves threads of execution, and it's much easier to just accept that.
As I've already written above, convergent operations really are a form
of (often non- or inaccessible-memory) communication among threads where
the set of communicating threads is affected by the static position of
the operation in control flow.

The proposal contains no lock-step execution, even though that's what
many people think of when they think about GPU programming models. It is
purely expressed in terms of communicating threads, which could in
theory just be threads on a CPU. We believe that this is more in line
with how LLVM works. Besides, we found lock-step execution semantics to
be undesirable from first principles, because they prevent certain
desirable program transforms or at least make them much more difficult
to achieve.

The most recent earlier proposal we brought to the LLVM community is
here: https://reviews.llvm.org/D68994. That one already used a form of
convergence token, but we ultimately rejected it for two main reasons.
First, the convergence control intrinsics described there create more
"noise" in the IR, i.e. there are more intrinsic calls than with what we
have now. Second, the particular formalism of dynamic instances used
there had some undesirable side effects that put an undue burden on
transforms that should be trivial, like dead-code elimination.

The current proposal still talks about dynamic instances, but there is
no longer any automatic propagation of convergence: even if two threads
execute the same dynamic instance of an instruction, whether they
execute the same dynamic instance of the very next instruction in
straight-line code is implementation-defined in general (which means
that program transforms are free to make changes that could affect
this).

We chose to use an operand bundle for the convergence token so that
generic code can identify the relevant token value easily. We considered
adding the convergence token as a regular function call argument, but
this makes it impractical for generic transforms to understand functions
that take both a convergence token and some other token.
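To illustrate the difference (a sketch only; @subgroupAdd2 is a
hypothetical variant, shown purely for contrast):

  ; proposed: the token rides in an operand bundle, so generic code
  ; can find it uniformly on any call
  %sum = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %tok) ]

  ; rejected alternative: the token is an ordinary argument; generic
  ; transforms cannot tell the convergence token apart from any other
  ; token the function might take
  %sum2 = call i32 @subgroupAdd2(i32 %v, token %tok)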
Cheers,
Nicolai

--
Lerne, wie die Welt wirklich ist,
aber vergiss niemals, wie sie sein sollte.
Nicolai Hähnle via llvm-dev
2020-Oct-28 22:13 UTC
[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
Hi all,

As it has been quiet on this topic for quite some time now, I'm pinging
this thread to see if, ideally, we can just go ahead with the current
proposal, or if/how more discussion is required.

A quick pointer to the proposed LangRef changes for the `convergent`
attribute is here: https://reviews.llvm.org/D85603

The presentation from the dev meeting three weeks ago is here:
https://www.youtube.com/watch?v=_Z5DuiVCFAw

Thanks,
Nicolai

On Tue, Aug 18, 2020 at 10:44 AM Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> [snip: full quote of the Aug 18 message and the original RFC, both
> reproduced above]

--
Lerne, wie die Welt wirklich ist,
aber vergiss niemals, wie sie sein sollte.
Owen Anderson via llvm-dev
2020-Oct-28 23:09 UTC
[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
Hi Nicolai,

> On Oct 28, 2020, at 3:13 PM, Nicolai Hähnle via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> The presentation from the dev meeting three weeks ago is here:
> https://www.youtube.com/watch?v=_Z5DuiVCFAw

I have not studied the new proposal yet, but I wanted to ask a question
about the JumpThreading example in your dev meeting talk: does allowing
this transformation break programs?

I'm asking because my reading of the semantic correctness of the
jump-threading transformation is that it *is* allowed by the current
convergent annotation: the execution of the barrier() intrinsic depends
on the value of `condition` in both the before and after cases. In the
former, it is a transitive dependency via `flag`, but still a
dependency.

This seems to differ from your interpretation of the example, so I'm
wondering to what degree clarifying the definition to include transitive
dependencies would eliminate some or all spooky action, and whether
doing so would result in broken programs.

Thanks,
--Owen
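PS: For concreteness, the shape of example I have in mind (a from-memory
reconstruction, not the literal IR from the talk):

  entry:
      br i1 %condition, label %then, label %else

  then:
      br label %merge

  else:
      br label %merge

  merge:
      %flag = phi i1 [ true, %then ], [ false, %else ]
      br i1 %flag, label %bar, label %exit

  bar:
      call void @barrier()   ; convergent
      br label %exit

  exit:
      ret void

Jump threading would send %then straight to %bar and %else straight to
%exit, deleting %merge -- and with it the point where the two groups of
threads rejoin before the barrier.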
Reid Kleckner via llvm-dev
2020-Oct-29 22:33 UTC
[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
I watched the talk on the design, and it all made sense to me. I can't
claim to have a deep knowledge of the requirements of GPU architectures,
but I can say that this is basically the kind of stuff we had in mind
when the token type was designed. What you are saying about modelling
these special GPU operations as accessing inaccessible memory makes
sense to me, but again, I am not an expert.

One of the challenges we've faced when trying to create regions for
unpacked call sequences[1] is that unreachable code elimination can
often "break" a region by deleting the token consumer. It's not clear if
your proposal suffers from this problem, but it's something to keep in
mind: optimizers can often discover that a plain store is storing to a
null pointer, that code can then become unreachable, and there goes the
entire rest of the function, leaving you with a half-open region. I
can't imagine a plausible series of transforms on a reasonable GPU
program would end up in this situation, though.
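A minimal sketch of the failure mode (the @op callee is a placeholder,
not something from the proposal itself):

  entry:
      %tok = call token @llvm.experimental.convergence.anchor()
      store i32 0, i32* null   ; UB, may later be folded to unreachable
      br label %use

  use:
      ; if %use is proven unreachable and deleted, the only consumer
      ; of %tok goes away, leaving the token region half-open
      call void @op() [ "convergencectrl"(token %tok) ]
      ret void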
[1] http://lists.llvm.org/pipermail/llvm-dev/2020-January/138627.html

On Wed, Oct 28, 2020 at 3:14 PM Nicolai Hähnle via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> [snip: full quote of the Oct 28 ping and the original RFC, both
> reproduced above]