Nicolai Hähnle via llvm-dev
2020-Aug-17 17:52 UTC
[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
On Mon, Aug 17, 2020 at 7:14 PM Hal Finkel <hfinkel at anl.gov> wrote:
> On 8/17/20 11:51 AM, Nicolai Hähnle wrote:
>> Hi Hal,
>>
>> On Mon, Aug 17, 2020 at 2:13 AM Hal Finkel <hfinkel at anl.gov> wrote:
>>> Thanks for sending this. What do you think we should do with the
>>> existing convergent attribute?
>>
>> My preference, which is implicitly expressed in the review, is to use
>> `convergent` for both the new and the old thing. They are implicitly
>> distinguished via the "convergencectrl" operand bundle.
>>
>> The main reason for going down that route is that `convergent` is
>> really an odd duck in that it is the only attribute in LLVM that
>> _adds_ constraints on program transforms. All other attributes
>> _remove_ constraints on program transforms by expressing restrictions
>> on what the function does (e.g. readnone, nosync, willreturn, ...).
>> Matt tried to push for changing that, making the "convergent"
>> restrictions the default and adding `noconvergent` to most things, but
>> that effort got stuck. In the meantime, let's avoid adding yet another
>> odd duck like this :)
>>
>> Cheers,
>> Nicolai
>
> Understood. Does the existing attribute have a sound definition, just
> not one that's useful for (all) GPU use cases? Or is the existing
> attribute not really sound, so that all use cases eventually need to be
> replaced?

The existing definition is insufficient for some GPU use cases, but it
also has a problem that I like to call spooky action at a distance: some
program transforms may have to look at code they would normally not look
at in order to determine whether they are correct under the current
definition of `convergent`. The really problematic one that I'm aware of
in practice is jump threading -- its implementation is incorrect with
respect to `convergent` -- but perhaps others are affected as well.
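To sketch the jump threading problem (a simplified illustration, not a
reduced test case from the pass):

  entry:
    br i1 %cc, label %a, label %b

  a:
    br label %merge

  b:
    br label %merge

  merge:
    %sum = call i32 @subgroupAdd(i32 %v) ; convergent
    br i1 %cc, label %x, label %y

Since %cc also determines the branch at the end of %merge, jump
threading will forward %a to %x and %b to %y, duplicating %merge --
including the convergent call -- along both paths. That is correct from
a single-threaded perspective, but it partitions the communicating
threads according to %cc, which can change the result of @subgroupAdd.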
Cheers,
Nicolai

>>> On 8/9/20 10:03 AM, Nicolai Hähnle via llvm-dev wrote:
>>>> Hi all,
>>>>
>>>> please see https://reviews.llvm.org/D85603 and its related changes
>>>> for our most recent and hopefully final attempt at putting the
>>>> `convergent` attribute on a solid theoretical foundation in a way
>>>> that is useful for modern GPU compiler use cases. We have a clear
>>>> line of sight to enabling a new control flow implementation in the
>>>> AMDGPU backend which is built on this foundation. I have started
>>>> upstreaming some groundwork recently. We expect this foundation to
>>>> also be usable by non-GPU whole-program vectorization environments,
>>>> if they choose to do so.
>>>>
>>>> What follows is a (not quite) brief overview of what this is all
>>>> about, but do see the linked review and its "stack" for a fuller
>>>> story.
>>>>
>>>> The Problem
>>>> -----------
>>>> `convergent` was originally introduced in LLVM to label "barrier"
>>>> instructions in GPU programming languages. Those put some unusual
>>>> constraints on program transforms that weren't particularly well
>>>> understood at the time. Since then, GPU languages have introduced
>>>> additional instructions such as wave shuffles or subgroup reductions
>>>> that made the problem harder and exposed shortcomings of the current
>>>> definition of `convergent`. At the same time, those additional
>>>> instructions helped illuminate what the problem really is.
>>>>
>>>> Briefly, in whole-program vectorization environments such as GPUs,
>>>> we can expose opportunities for performance optimizations to
>>>> developers by allowing data exchange between threads that are
>>>> grouped together as lanes in a vector. As long as control flow is
>>>> uniform, i.e. all threads of the vector follow the same path through
>>>> the CFG, it is natural that all threads assigned to the same vector
>>>> participate in this data exchange. When control flow diverges, some
>>>> threads may be unavailable for the data exchange, but we still need
>>>> to allow data exchange among the threads that _are_ available.
>>>>
>>>> This means that those data exchange operations -- i.e., `convergent`
>>>> operations -- communicate among a set of threads that implicitly
>>>> depends on the position of the operation in control flow.
>>>>
>>>> The problem boils down to defining how the set of communicating
>>>> threads is determined.
>>>>
>>>> This problem inherently has a large target- and environment-specific
>>>> component. Even if control flow is fully uniform, there is the
>>>> question of which threads are grouped together in the first place.
>>>> In the same way that LLVM IR has no built-in understanding of how
>>>> threads come to be in the first place (even in the more well-known
>>>> multi-threaded CPU programming world, where convergent operations
>>>> don't exist), we don't concern ourselves at all with these types of
>>>> questions here. For the purpose of target-independent LLVM IR
>>>> semantics, we simply assume that they are answered _somehow_ and
>>>> instead focus on the part of the problem that relates to control
>>>> flow.
>>>>
>>>> When looking at high-level language source, it is often "intuitively
>>>> obvious" which threads should communicate. This is not the case in
>>>> an unstructured CFG. Here's an example of a non-trivial loop where a
>>>> reduction is used (the result of the reduction is the sum of the
>>>> input values of all participating threads):
>>>>
>>>> A:
>>>>   br label %B
>>>>
>>>> B:
>>>>   ...
>>>>   %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
>>>>   ...
>>>>   br i1 %cc1, label %B, label %C
>>>>
>>>> C:
>>>>   ...
>>>>   br i1 %cc2, label %B, label %D
>>>>
>>>> D:
>>>>   ; loop exit
>>>>
>>>> Suppose this code is executed by two threads grouped in a (very
>>>> short) vector, and the threads execute the following sequences of
>>>> basic blocks:
>>>>
>>>>> Thread 1: A B B C D
>>>>> Thread 2: A B C B C D
>>>>
>>>> There are two different intuitive ways of "aligning" the threads for
>>>> the purpose of communication between convergent operations:
>>>>
>>>>> Align based on natural loops:
>>>>> Thread 1: A B - B C D
>>>>> Thread 2: A B C B C D
>>>>
>>>>> Align based on a nested loop interpretation:
>>>>> Thread 1: A B B C - - D
>>>>> Thread 2: A B - C B C D
>>>>
>>>> (Explanation of the alignment: In the first alignment, %sum.b is
>>>> always the sum of the input values from both threads. In the second
>>>> alignment, the first computation of %sum.b is collaborative between
>>>> the two threads; the second one in each thread simply returns the
>>>> input value for that thread.)
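To put numbers to the two alignments above (using the quoted example,
with thread 1 passing %v = 1 and thread 2 passing %v = 2 on every
iteration): under the natural loop alignment, every execution of %sum.b
involves both threads, so %sum.b is always 3. Under the nested loop
alignment, only the first execution of %sum.b is collaborative and
yields 3 in both threads; thread 1's second execution yields 1, and
thread 2's second execution yields 2.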
>>>> The example only has a single natural loop, but it could result
>>>> from a high-level language source that has two nested do-while
>>>> loops, so the second interpretation may well be what the programmer
>>>> intended.
>>>>
>>>> It has often been proposed that the alignment should be defined
>>>> purely based on the IR shown above, such as by always aligning based
>>>> on natural loops. This doesn't work, even if we ignore the
>>>> possibility of irreducible control flow. In the example, we could in
>>>> principle generate IR as follows if the "nested loop alignment" is
>>>> desired:
>>>>
>>>> A:
>>>>   br label %outer.hdr
>>>>
>>>> outer.hdr:
>>>>   br label %B
>>>>
>>>> B:
>>>>   ...
>>>>   %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
>>>>   ...
>>>>   br i1 %cc1, label %B, label %C
>>>>
>>>> C:
>>>>   ...
>>>>   br i1 %cc2, label %outer.hdr, label %D
>>>>
>>>> D:
>>>>   ; loop exit
>>>>
>>>> From a single-threaded perspective, it would be correct to short-cut
>>>> all paths through outer.hdr, but this results in the original IR and
>>>> therefore different behavior for the convergent operation (under
>>>> such a proposal). Worse, the part of the IR which makes the
>>>> short-cutting incorrect -- i.e., the affected convergent operation
>>>> -- is potentially far away from the paths that we're short-cutting.
>>>> We want to avoid this "spooky action at a distance" because it puts
>>>> an undue burden on generic code transforms.
>>>>
>>>> The Solution
>>>> ------------
>>>> We introduce explicit annotations in the IR using "convergence
>>>> tokens" that "anchor" a convergent operation with respect to some
>>>> other point in control flow or with respect to the function entry.
>>>> The details are in the linked Phabricator review, so I will simply
>>>> illustrate here how this solves the example above.
>>>>
>>>> To obtain the original natural loop alignment, we augment the
>>>> example as follows:
>>>>
>>>> A:
>>>>   %anchor = call token @llvm.experimental.convergence.anchor()
>>>>   br label %B
>>>>
>>>> B:
>>>>   %loop = call token @llvm.experimental.convergence.loop()
>>>>       [ "convergencectrl"(token %anchor) ]
>>>>   ...
>>>>   %sum.b = call i32 @subgroupAdd(i32 %v)
>>>>       [ "convergencectrl"(token %loop) ]
>>>>   ...
>>>>   br i1 %cc1, label %B, label %C
>>>>
>>>> C:
>>>>   ...
>>>>   br i1 %cc2, label %B, label %D
>>>>
>>>> D:
>>>>   ; loop exit
>>>>
>>>> To obtain the loop nest alignment, we augment it as follows:
>>>>
>>>> A:
>>>>   %anchor = call token @llvm.experimental.convergence.anchor()
>>>>   br label %outer.hdr
>>>>
>>>> outer.hdr:
>>>>   %outer = call token @llvm.experimental.convergence.loop()
>>>>       [ "convergencectrl"(token %anchor) ]
>>>>   br label %B
>>>>
>>>> B:
>>>>   %inner = call token @llvm.experimental.convergence.loop()
>>>>       [ "convergencectrl"(token %outer) ]
>>>>   ...
>>>>   %sum.b = call i32 @subgroupAdd(i32 %v)
>>>>       [ "convergencectrl"(token %inner) ]
>>>>   ...
>>>>   br i1 %cc1, label %B, label %C
>>>>
>>>> C:
>>>>   ...
>>>>   br i1 %cc2, label %outer.hdr, label %D
>>>>
>>>> D:
>>>>   ; loop exit
>>>>
>>>> We end up with two nested natural loops as before, but short-cutting
>>>> through outer.hdr is easily seen as forbidden because of the "loop
>>>> heart" intrinsic along the path through outer.hdr. (We call it a
>>>> "heart" because intuitively that's where loop iterations are counted
>>>> for the purpose of convergent operations, i.e. we count "heartbeats"
>>>> of the loop.) Having the non-removable outer.hdr block may seem like
>>>> an inefficiency, but at least in the AMDGPU backend we end up
>>>> needing that basic block anyway to implement the desired behavior.
>>>> We don't literally count loop iterations in the AMDGPU backend, but
>>>> we end up doing something isomorphic that also requires the block.
>>>> We suspect that other implementations will have similar
>>>> requirements. In any case, backends are free to optimize the block
>>>> away using target-specific handling.
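Spelled out for the two-thread example quoted earlier (an illustration
of the intended semantics, not text from the review), the executions now
line up as:

  Thread 1: A outer.hdr B B C D
  Thread 2: A outer.hdr B C outer.hdr B C D

Both threads pass through outer.hdr together the first time, so they
execute the same dynamic instance of %outer, and their first executions
of %inner -- and hence of %sum.b -- communicate. When thread 2 branches
back to outer.hdr, it begins a new dynamic instance of %outer that
thread 1 never executes, so the inner-loop iterations under it run solo.
This is exactly the nested loop alignment from above.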
>>>> Have you tried other solutions?
>>>> -------------------------------
>>>> Yes, we've tried many things over the years. I've personally been
>>>> thinking about and working on this problem on and off for a
>>>> significant amount of time over the last four years, and I
>>>> supervised a practical master's thesis on a related topic during
>>>> this time. Other people have thought about the problem a lot as
>>>> well. Discussions with many people in the LLVM community, inside
>>>> AMD, and elsewhere have all helped our understanding of the problem.
>>>> Many different ideas were considered and ultimately rejected in the
>>>> process.
>>>>
>>>> All attempts to solve the problem _without_ some form of
>>>> "convergence token" have suffered from spooky action at a distance.
>>>> This includes the current definition of `convergent` in the LangRef.
>>>>
>>>> We tried _very_ hard to find a workable definition of `convergent`
>>>> that doesn't talk about groups of threads in the LangRef, but
>>>> ultimately came to the conclusion that thinking about convergent
>>>> operations inherently involves threads of execution, and it's much
>>>> easier to just accept that. As I've already written above,
>>>> convergent operations really are a form of (often non-memory or
>>>> inaccessible-memory) communication among threads where the set of
>>>> communicating threads is affected by the static position of the
>>>> operation in control flow.
>>>>
>>>> The proposal contains no lock-step execution, even though that's
>>>> what many people think of when they think about GPU programming
>>>> models. It is purely expressed in terms of communicating threads,
>>>> which could in theory just be threads on a CPU. We believe that this
>>>> is more in line with how LLVM works. Besides, we found lock-step
>>>> execution semantics to be undesirable from first principles, because
>>>> they prevent certain desirable program transforms or at least make
>>>> them much more difficult to achieve.
>>>>
>>>> The most recent earlier proposal we brought to the LLVM community is
>>>> here: https://reviews.llvm.org/D68994. That one already used a form
>>>> of convergence token, but we ultimately rejected it for two main
>>>> reasons. First, the convergence control intrinsics described there
>>>> create more "noise" in the IR, i.e. there are more intrinsic calls
>>>> than with what we have now. Second, the particular formalism of
>>>> dynamic instances used there had some undesirable side effects that
>>>> put an undue burden on transforms that should be trivial, like
>>>> dead-code elimination.
>>>>
>>>> The current proposal still talks about dynamic instances, but there
>>>> is no longer any automatic propagation of convergence: even if two
>>>> threads execute the same dynamic instance of an instruction, whether
>>>> they execute the same dynamic instance of the very next instruction
>>>> in straight-line code is implementation-defined in general (which
>>>> means that program transforms are free to make changes that could
>>>> affect this).
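As a small illustration (a sketch in the style of the examples above,
not taken from the review): in the straight-line code below, nothing
ties the two anchors to each other, so the set of threads communicating
in the second @subgroupAdd may differ from the set communicating in the
first, and transforms may treat the two anchors independently.

  %t1 = call token @llvm.experimental.convergence.anchor()
  %a = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %t1) ]
  %t2 = call token @llvm.experimental.convergence.anchor()
  %b = call i32 @subgroupAdd(i32 %w) [ "convergencectrl"(token %t2) ]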
>>>> We chose to use an operand bundle for the convergence token so that
>>>> generic code can identify the relevant token value easily. We
>>>> considered adding the convergence token as a regular function call
>>>> argument, but that makes it impractical for generic transforms to
>>>> understand functions that take both a convergence token and some
>>>> other token.
>>>>
>>>> Cheers,
>>>> Nicolai

--
Learn how the world really is, but never forget how it ought to be.
Hal Finkel via llvm-dev
2020-Aug-17 18:29 UTC
[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
On 8/17/20 12:52 PM, Nicolai Hähnle wrote:
> On Mon, Aug 17, 2020 at 7:14 PM Hal Finkel <hfinkel at anl.gov> wrote:
>> Understood. Does the existing attribute have a sound definition, just
>> not one that's useful for (all) GPU use cases? Or is the existing
>> attribute not really sound, so that all use cases eventually need to
>> be replaced?
>
> The existing definition is insufficient for some GPU use cases, but it
> also has a problem that I like to call spooky action at a distance:
> some program transforms may have to look at code they would normally
> not look at in order to determine whether they are correct under the
> current definition of `convergent`. The really problematic one that
> I'm aware of in practice is jump threading -- its implementation is
> incorrect with respect to `convergent` -- but perhaps others are
> affected as well.
>
> Cheers,
> Nicolai

I agree that the new scheme seems better for several reasons. So it
seems like your recommendation is that we consider the existing
attribute usage deprecated? Or should we auto-upgrade it to something?

 -Hal
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Nicolai Hähnle via llvm-dev
2020-Aug-18 08:44 UTC
[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
Hi Hal,

On Mon, Aug 17, 2020 at 8:30 PM Hal Finkel <hfinkel at anl.gov> wrote:
> I agree that the new scheme seems better for several reasons. So it
> seems like your recommendation is that we consider the existing
> attribute usage deprecated? Or should we auto-upgrade it to something?

Yes. Some form of auto-upgrade could be attempted via the
ConvergenceControlHeuristic of https://reviews.llvm.org/D85609, though
that's slightly big for on-the-fly upgrading in the bitcode reader. So
while the groundwork for potentially doing it is there, I don't see a
strong need or desire for it at the moment.

Cheers,
Nicolai
--
Learn how the world really is, but never forget how it ought to be.