Hi,

I am considering a language extension to Clang for optimizing GPU programs. This extension would allow the compiler to use different optimization strategies for divergent and non-divergent branches (explained below). We have observed significant performance gains from this proposed extension, so I want to discuss it here and see what the community thinks of the idea. I will focus on the CUDA language and the PTX backend for now, but I believe the addition would benefit OpenCL and its backends too.

Background on branch divergence

CUDA programs have a very different execution model for branching code. A CUDA program is executed by an array of threads broken into groups called warps. A warp typically contains 32 threads, and all the threads in a warp execute instructions in lock-step, i.e., they execute the same instruction at any given time. Therefore, if the code contains a divergent branch (i.e., the threads in a warp do not agree on which side of the branch to take), the warp has to execute every path leaving that branch, each with a different subset of threads enabled, until the paths reconverge at a post-dominating basic block. For example:

    // threadIdx.x returns the index of a thread in the warp
    if (threadIdx.x == 0) {
      foo();
    } else {
      bar();
    }

The warp containing threads 0-31 needs to execute foo() with only thread 0 enabled and then bar() with the other 31 threads enabled. Therefore, the run time of the code above is the run time of foo() plus the run time of bar().

More details about branch divergence can be found in the CUDA C programming guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture

How branch divergence affects compiler optimizations

Due to CUDA's different execution model, some optimizations in LLVM, such as jump threading, can unfortunately be harmful.

The attached figure illustrates jump threading. In the original CFG (on the left), the first condition "if foo == bar" implies the second condition "if foo <= bar". Therefore, jump threading redirects BB1 directly to BB2, so that the transformed code needn't compute the second condition when the first condition is true.

One important complication here is that BB1 does not point directly to the second condition. Instead, the code calls baz() before computing "if foo <= bar". Therefore, jump threading has to duplicate the call to baz() to preserve the semantics that baz() runs regardless of "if foo == bar".

For CPU programs, jump threading likely increases execution speed, because it makes some paths shorter. For CUDA programs, however, jump threading on divergent branches is almost certainly a bad idea. Suppose both conditions in the example above are divergent within a warp. To keep all threads of the warp synchronized, the warp has to execute all the basic blocks in the jump-threaded CFG sequentially. With baz() duplicated, the warp executes more code in the jump-threaded CFG than in the original one. We have observed that jump threading incurs a ~50% slowdown on some benchmarks.

Note that jump threading is not the only optimization that can hurt the performance of CUDA programs in the presence of branch divergence. Loop unswitching on divergent branches can also hurt performance, because it too may duplicate code.

Annotations for branch divergence

Ideally, we want the compiler to figure out automatically which branches are divergent. However, doing that precisely is extremely hard and can be expensive. Therefore, I am proposing a compromise: have programmers provide optimization hints.

The annotation I have in mind has the form "#pragma clang branch non_divergence". Programmers can add this annotation right before a control statement (such as if, for, or while) to indicate that the branch derived from the control statement is not divergent. For example,

    #pragma clang branch non_divergence
    if (a > 0) {
      …
    }

indicates that the condition (a > 0) is uniform across all threads in a warp.

The optimizer can then enable certain optimizations, such as jump threading and loop unswitching, only on non-divergent branches. In the longer term, the optimizer could even adopt some cheap data-flow analysis to conservatively compute whether a branch is non-divergent. For example, if a condition is not derived from blockIdx or threadIdx, it is guaranteed to hold the same value for all threads in a warp.

How the compiler can leverage these annotations

Similar to the annotations for loop optimizations (http://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations), clang can attach metadata to the branch instructions following "#pragma clang branch non_divergence". For example, the source code snippet in the previous section would be translated to:

    %cond = icmp sgt i32 %a, 0
    br i1 %cond, label %then, label %else, !llvm.branch !0
    ...
    !0 = !{!0, !1}
    !1 = !{!"llvm.branch.non_divergent"}

The llvm.branch metadata indicates that %cond computes the same value for all threads in a warp. This metadata can be leveraged by the IR optimizer and the NVPTX backend for better optimization. Besides the opportunities for tuning the IR optimizations mentioned above, I also noticed that the NVPTX backend could emit more efficient PTX instructions (such as bra.uni and ret.uni) for non-divergent branches.

Thanks for reading! Any feedback is welcome.

Jingyue
Hi Jingyue,

Have you considered using dynamic uniformity checks? In my experience you can obtain most of the benefit you describe, without any static information, simply by inserting branch-if-none instructions that jump over the bodies of conditional regions.

This technique is described under "Runtime Branch Uniformity Optimization" in this paper, though I'm pretty confident it had been in use much longer than that:
http://www.eecs.berkeley.edu/~yunsup/papers/predication-micro2014.pdf

-Owen

> On Jan 23, 2015, at 10:29 PM, Jingyue Wu <jingyue at google.com> wrote:
>
> [...]

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Vinod Grover
2015-Jan-24 17:06 UTC
[LLVMdev] [cfe-dev] Proposal: pragma for branch divergence
In our experience, as Owen also suggests, a pragma or a language extension can be avoided by a combination of static and dynamic analysis. We prefer this approach in our compiler ;)

Regards,
Vinod

On Sat, Jan 24, 2015 at 12:09 AM, Owen Anderson <resistor at mac.com> wrote:
> Hi Jingyue,
>
> Have you considered using dynamic uniformity checks? [...]
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev