Jared C. Davis via llvm-dev
2016-Mar-12 13:45 UTC
[llvm-dev] Approaches for handling branches
Hi, LLVM newbie here. I'm working on writing a compiler from a toy language to LLVM assembly, and hoping to get some advice about how to handle certain kinds of branches. Consider an input program that might look like this if written in C: x1 = cond1 ? f(a1) : cond2 ? f(a2) : cond3 ? f(a1) : f(a4); x2 = cond3 ? f(a3) : cond4 ? f(a1) : f(a4); return g(x1, x2); I'm hoping to compile something like the above in a "good" way that avoids unnecessarily calling F. To mock up how this might work, I wrote some LLVM assembly to try to model the above (full code below: "try1"). I ended up with the following for x1, where @f is declared as an external function, and @ite is a simple if-then-else function that seems to nicely get inlined: %fa1 = call i64 @f (i64 %a1) %fa2 = call i64 @f (i64 %a2) %fa4 = call i64 @f (i64 %a4) %x1.cond3 = call i64 @ite (i64 %cond3, i64 %fa1, i64 %fa4) %x1.cond2 = call i64 @ite (i64 %cond2, i64 %fa2, i64 %x1.cond3) %x1 = call i64 @ite (i64 %cond1, i64 %fa1, i64 %x1.cond2) Unfortunately this seems to produce "bad" x86 code, at least when compiled like this: opt demo.ll -O2 -o=demo.ll.opt && llc -o demo.s demo.ll.opt In particular, the resulting demo.s appears to be doing all of the calls of F first, followed by some conditional moves to select the right answer. (This would of course be great if F is very fast, but it would be bad if F is slow.) I've tried adding some promising looking switches to the opt command (like -delinearize and -sink) and using different register allocation schemes in the llc command, but so far I don't seem to be able to get LLVM to reorder the instructions to push the calls of F into the branches. (I'm afraid I don't really know much about what the switches do.) I seem to be able to do much better by (manually) rewriting the assembly so that I explicitly check the conditions first, and only call F on the various arguments in the branches where they are needed (code below: "try2"). So I could write my frontend so that it generates code like try2. But I worry about this approach because the code for try2 ends up repeating computations like the call of f(a1) and f(a4), and it seems like this will lead to code blowup. My "real" programs will have much more deeply nested branches, and the expressions in the branches are going to be bigger than just a single function call. So if I try to put these expressions only in the branches where they are computed, it seems like that will result in code that grows exponentially in the number of branches that lead to a common subexpression. So I have many questions. I wonder: - Are there other command-line options I should look into that might help LLVM be able able to optimize "try1" to push the calls down into the branches where they are used and avoid the unnecessary calls of F? - Is this kind of reordering anything I should expect LLVM to be able to take care of in the first place, or should I expect to need to work on making sure that my frontend produces code that is closer to what I want the machine to execute? - Are there other approaches that would work better than try1/try2 that I haven't considered? - Are there any examples of developing LLVM frontends for languages that deal with terms that are rich in branches and structure sharing? My real goal is to build a good compiler for efficiently executing DAGs of 64-bit integer operations, where some nodes fan out to several operations, but large chunks of the DAG may not be needed depending on the inputs. I'd be very interested in seeing other approaches to this kind of problem. Thanks very much for your time. Cheers, Jared ------------------------------ ;; Source code for try1/try2 ;; ;; I'm using LLVM on Linux Mint: ;; opt --version: LLVM version 3.4 ;; llc --version: LLVM version 3.4 declare i64 @f (i64 %a) nounwind readonly declare i64 @g (i64 %a, i64 %b) nounwind readonly define i64 @ite (i64 %cond, i64 %then, i64 %else) { ;; If/Then/Else, which will be inlined with -inline %cond.zero = icmp eq i64 %cond, 0 br i1 %cond.zero, label %case.zero, label %case.nonzero case.nonzero: ret i64 %then case.zero: ret i64 %else } define i64 @try1 (i64* %in) { ;; Arguments from an input array. %cond1.addr = getelementptr i64* %in, i32 0 %cond2.addr = getelementptr i64* %in, i32 1 %cond3.addr = getelementptr i64* %in, i32 2 %cond4.addr = getelementptr i64* %in, i32 3 %a1.addr = getelementptr i64* %in, i32 4 %a2.addr = getelementptr i64* %in, i32 5 %a3.addr = getelementptr i64* %in, i32 6 %a4.addr = getelementptr i64* %in, i32 7 ;; Load arguments into temporaries %cond1 = load i64* %cond1.addr %cond2 = load i64* %cond2.addr %cond3 = load i64* %cond3.addr %cond4 = load i64* %cond3.addr %a1 = load i64* %a1.addr %a2 = load i64* %a2.addr %a3 = load i64* %a3.addr %a4 = load i64* %a4.addr ;; Compute F(Ai) %fa1 = call i64 @f (i64 %a1) %fa2 = call i64 @f (i64 %a2) %fa3 = call i64 @f (i64 %a3) %fa4 = call i64 @f (i64 %a4) ; Compute x1 = cond1 ? f(a1) ; : cond2 ? f(a2) ; : cond3 ? f(a1) ; : f(a4); %x1.cond3 = call i64 @ite (i64 %cond3, i64 %fa1, i64 %fa4) %x1.cond2 = call i64 @ite (i64 %cond2, i64 %fa2, i64 %x1.cond3) %x1 = call i64 @ite (i64 %cond1, i64 %fa1, i64 %x1.cond2) ; Compute x2 = cond3 ? f(a3) ; : cond4 ? f(a1) ; : f(a4); %x2.cond4 = call i64 @ite (i64 %cond4, i64 %fa1, i64 %fa4) %x2 = call i64 @ite (i64 %cond3, i64 %fa3, i64 %x2.cond4) ; Return g(x1, x2); %ret = call i64 @g(i64 %x1, i64 %x2) ret i64 %ret } define i64 @try2 (i64* %in) { ;; Arguments from an input array. %cond1.addr = getelementptr i64* %in, i32 0 %cond2.addr = getelementptr i64* %in, i32 1 %cond3.addr = getelementptr i64* %in, i32 2 %cond4.addr = getelementptr i64* %in, i32 3 %a1.addr = getelementptr i64* %in, i32 4 %a2.addr = getelementptr i64* %in, i32 5 %a3.addr = getelementptr i64* %in, i32 6 %a4.addr = getelementptr i64* %in, i32 7 ;; Load arguments into temporaries %cond1 = load i64* %cond1.addr %cond2 = load i64* %cond2.addr %cond3 = load i64* %cond3.addr %cond4 = load i64* %cond3.addr %a1 = load i64* %a1.addr %a2 = load i64* %a2.addr %a3 = load i64* %a3.addr %a4 = load i64* %a4.addr ; Compute x1 = cond1 ? f(a1) ; : cond2 ? f(a2) ; : cond3 ? f(a1) ; : f(a4); %cond1.nonzero = icmp ne i64 %cond1, 0 br i1 %cond1.nonzero, label %x1.cond1.true, label %x1.cond1.false x1.cond1.false: %cond2.nonzero = icmp ne i64 %cond2, 0 br i1 %cond2.nonzero, label %x1.cond2.true, label %x1.cond2.false x1.cond2.false: %cond3.nonzero = icmp ne i64 %cond3, 0 br i1 %cond3.nonzero, label %x1.cond3.true, label %x1.otherwise x1.cond1.true: ; cond1 ? f(a1) %x1.cond1.ans = call i64 @f (i64 %a1) br label %x1.finish x1.cond2.true: ; cond2 ? f(a2) %x1.cond2.ans = call i64 @f (i64 %a2) br label %x1.finish x1.cond3.true: ; cond3 ? f(a1) %x1.cond3.ans = call i64 @f (i64 %a1) br label %x1.finish x1.otherwise: ; otherwise f(a4) %x1.otherwise.ans = call i64 @f (i64 %a4) br label %x1.finish x1.finish: %x1 = phi i64 [%x1.cond1.ans, %x1.cond1.true], [%x1.cond2.ans, %x1.cond2.true], [%x1.cond3.ans, %x1.cond3.true], [%x1.otherwise.ans, %x1.otherwise] ; Compute x2 = cond3 ? f(a3) ; : cond4 ? f(a1) ; : f(a4); %cond3.nonzero_for_x2 = icmp ne i64 %cond3, 0 br i1 %cond3.nonzero_for_x2, label %x2.cond3.true, label %x2.cond3.false x2.cond3.false: %cond4.nonzero = icmp ne i64 %cond4, 0 br i1 %cond4.nonzero, label %x2.cond4.true, label %x2.otherwise x2.cond3.true: ; cond3 ? f(a3) %x2.cond3.ans = call i64 @f (i64 %a3) br label %x2.finish x2.cond4.true: ; cond4 ? f(a1) %x2.cond4.ans = call i64 @f (i64 %a1) br label %x2.finish x2.otherwise: %x2.otherwise.ans = call i64 @f (i64 %a4) br label %x2.finish x2.finish: %x2 = phi i64 [%x2.cond3.ans, %x2.cond3.true], [%x2.cond4.ans, %x2.cond4.true], [%x2.otherwise.ans, %x2.otherwise] ; Return g(x1, x2); %ret = call i64 @g(i64 %x1, i64 %x2) ret i64 %ret } -- Jared C. Davis <jared at cs.utexas.edu> 11410 Windermere Meadows Austin, TX 78759 http://www.cs.utexas.edu/users/jared/
David Blaikie via llvm-dev
2016-Mar-12 21:58 UTC
[llvm-dev] Approaches for handling branches
Try writing the same C code, feeding it through Clang and see how it optimizes, that should give you a sense of what's possible/practical. The main issue here, I think, is that LLVM doesn't know/isn't guaranteed that your function calls don't have side effects (printing things, changing the contents of memory, etc) so it can't cause there to be extra (or fewer) invocations of the function than are present in your original program without making the program incorrect/behave differently. If you know your function calls have no side effects and can be called multiple times I think the attribute you want to add to the function declarations is 'readnone', but I'm not entirely sure. If you add that attribute to your functions, you should be able to see them potentially cloned or coalesced as is 'best' for the code. But if you don't have that guarantee from your input program/language spec, then you need to make sure there aren't excessive/duplicate calls in your input you give to LLVM to optimize. On Sat, Mar 12, 2016 at 5:45 AM, Jared C. Davis via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Hi, > > LLVM newbie here. I'm working on writing a compiler from a toy language to > LLVM assembly, and hoping to get some advice about how to handle certain > kinds of branches. > > Consider an input program that might look like this if written in C: > > x1 = cond1 ? f(a1) > : cond2 ? f(a2) > : cond3 ? f(a1) > : f(a4); > > x2 = cond3 ? f(a3) > : cond4 ? f(a1) > : f(a4); > > return g(x1, x2); > > I'm hoping to compile something like the above in a "good" way that avoids > unnecessarily calling F. > > To mock up how this might work, I wrote some LLVM assembly to try to model > the above (full code below: "try1"). I ended up with the following for x1, > where @f is declared as an external function, and @ite is a simple > if-then-else function that seems to nicely get inlined: > > %fa1 = call i64 @f (i64 %a1) > %fa2 = call i64 @f (i64 %a2) > %fa4 = call i64 @f (i64 %a4) > > %x1.cond3 = call i64 @ite (i64 %cond3, i64 %fa1, i64 %fa4) > %x1.cond2 = call i64 @ite (i64 %cond2, i64 %fa2, i64 %x1.cond3) > %x1 = call i64 @ite (i64 %cond1, i64 %fa1, i64 %x1.cond2) > > Unfortunately this seems to produce "bad" x86 code, at least when compiled > like this: > > opt demo.ll -O2 -o=demo.ll.opt && llc -o demo.s demo.ll.opt > > In particular, the resulting demo.s appears to be doing all of the calls > of F > first, followed by some conditional moves to select the right answer. > (This > would of course be great if F is very fast, but it would be bad if F is > slow.) > > I've tried adding some promising looking switches to the opt command (like > -delinearize and -sink) and using different register allocation schemes in > the llc command, but so far I don't seem to be able to get LLVM to reorder > the instructions to push the calls of F into the branches. (I'm afraid I > don't really know much about what the switches do.) > > I seem to be able to do much better by (manually) rewriting the assembly so > that I explicitly check the conditions first, and only call F on the > various > arguments in the branches where they are needed (code below: "try2"). > > So I could write my frontend so that it generates code like try2. But I > worry about this approach because the code for try2 ends up repeating > computations like the call of f(a1) and f(a4), and it seems like this will > lead to code blowup. My "real" programs will have much more deeply nested > branches, and the expressions in the branches are going to be bigger than > just a single function call. So if I try to put these expressions only in > the branches where they are computed, it seems like that will result in > code > that grows exponentially in the number of branches that lead to a common > subexpression. > > So I have many questions. I wonder: > > - Are there other command-line options I should look into that might help > LLVM be able able to optimize "try1" to push the calls down into the > branches where they are used and avoid the unnecessary calls of F? > > - Is this kind of reordering anything I should expect LLVM to be able to > take care of in the first place, or should I expect to need to work on > making sure that my frontend produces code that is closer to what I > want > the machine to execute? > > - Are there other approaches that would work better than try1/try2 that I > haven't considered? > > - Are there any examples of developing LLVM frontends for languages that > deal with terms that are rich in branches and structure sharing? My > real > goal is to build a good compiler for efficiently executing DAGs of > 64-bit > integer operations, where some nodes fan out to several operations, but > large chunks of the DAG may not be needed depending on the inputs. I'd > be very interested in seeing other approaches to this kind of problem. > > Thanks very much for your time. > > Cheers, > > Jared > > ------------------------------ > > ;; Source code for try1/try2 > ;; > ;; I'm using LLVM on Linux Mint: > ;; opt --version: LLVM version 3.4 > ;; llc --version: LLVM version 3.4 > > declare i64 @f (i64 %a) nounwind readonly > declare i64 @g (i64 %a, i64 %b) nounwind readonly > > > define i64 @ite (i64 %cond, i64 %then, i64 %else) > { > ;; If/Then/Else, which will be inlined with -inline > %cond.zero = icmp eq i64 %cond, 0 > br i1 %cond.zero, label %case.zero, label %case.nonzero > case.nonzero: > ret i64 %then > case.zero: > ret i64 %else > } > > define i64 @try1 (i64* %in) > { > ;; Arguments from an input array. > %cond1.addr = getelementptr i64* %in, i32 0 > %cond2.addr = getelementptr i64* %in, i32 1 > %cond3.addr = getelementptr i64* %in, i32 2 > %cond4.addr = getelementptr i64* %in, i32 3 > %a1.addr = getelementptr i64* %in, i32 4 > %a2.addr = getelementptr i64* %in, i32 5 > %a3.addr = getelementptr i64* %in, i32 6 > %a4.addr = getelementptr i64* %in, i32 7 > > ;; Load arguments into temporaries > %cond1 = load i64* %cond1.addr > %cond2 = load i64* %cond2.addr > %cond3 = load i64* %cond3.addr > %cond4 = load i64* %cond3.addr > %a1 = load i64* %a1.addr > %a2 = load i64* %a2.addr > %a3 = load i64* %a3.addr > %a4 = load i64* %a4.addr > > ;; Compute F(Ai) > %fa1 = call i64 @f (i64 %a1) > %fa2 = call i64 @f (i64 %a2) > %fa3 = call i64 @f (i64 %a3) > %fa4 = call i64 @f (i64 %a4) > > ; Compute x1 = cond1 ? f(a1) > ; : cond2 ? f(a2) > ; : cond3 ? f(a1) > ; : f(a4); > > %x1.cond3 = call i64 @ite (i64 %cond3, i64 %fa1, i64 %fa4) > %x1.cond2 = call i64 @ite (i64 %cond2, i64 %fa2, i64 %x1.cond3) > %x1 = call i64 @ite (i64 %cond1, i64 %fa1, i64 %x1.cond2) > > ; Compute x2 = cond3 ? f(a3) > ; : cond4 ? f(a1) > ; : f(a4); > > %x2.cond4 = call i64 @ite (i64 %cond4, i64 %fa1, i64 %fa4) > %x2 = call i64 @ite (i64 %cond3, i64 %fa3, i64 %x2.cond4) > > ; Return g(x1, x2); > %ret = call i64 @g(i64 %x1, i64 %x2) > ret i64 %ret > } > > > > define i64 @try2 (i64* %in) > { > ;; Arguments from an input array. > %cond1.addr = getelementptr i64* %in, i32 0 > %cond2.addr = getelementptr i64* %in, i32 1 > %cond3.addr = getelementptr i64* %in, i32 2 > %cond4.addr = getelementptr i64* %in, i32 3 > %a1.addr = getelementptr i64* %in, i32 4 > %a2.addr = getelementptr i64* %in, i32 5 > %a3.addr = getelementptr i64* %in, i32 6 > %a4.addr = getelementptr i64* %in, i32 7 > > ;; Load arguments into temporaries > %cond1 = load i64* %cond1.addr > %cond2 = load i64* %cond2.addr > %cond3 = load i64* %cond3.addr > %cond4 = load i64* %cond3.addr > %a1 = load i64* %a1.addr > %a2 = load i64* %a2.addr > %a3 = load i64* %a3.addr > %a4 = load i64* %a4.addr > > ; Compute x1 = cond1 ? f(a1) > ; : cond2 ? f(a2) > ; : cond3 ? f(a1) > ; : f(a4); > > %cond1.nonzero = icmp ne i64 %cond1, 0 > br i1 %cond1.nonzero, label %x1.cond1.true, label %x1.cond1.false > > x1.cond1.false: > %cond2.nonzero = icmp ne i64 %cond2, 0 > br i1 %cond2.nonzero, label %x1.cond2.true, label %x1.cond2.false > > x1.cond2.false: > %cond3.nonzero = icmp ne i64 %cond3, 0 > br i1 %cond3.nonzero, label %x1.cond3.true, label %x1.otherwise > > > x1.cond1.true: ; cond1 ? f(a1) > %x1.cond1.ans = call i64 @f (i64 %a1) > br label %x1.finish > > x1.cond2.true: ; cond2 ? f(a2) > %x1.cond2.ans = call i64 @f (i64 %a2) > br label %x1.finish > > x1.cond3.true: ; cond3 ? f(a1) > %x1.cond3.ans = call i64 @f (i64 %a1) > br label %x1.finish > > x1.otherwise: ; otherwise f(a4) > %x1.otherwise.ans = call i64 @f (i64 %a4) > br label %x1.finish > > x1.finish: > %x1 = phi i64 [%x1.cond1.ans, %x1.cond1.true], > [%x1.cond2.ans, %x1.cond2.true], > [%x1.cond3.ans, %x1.cond3.true], > [%x1.otherwise.ans, %x1.otherwise] > > > > ; Compute x2 = cond3 ? f(a3) > ; : cond4 ? f(a1) > ; : f(a4); > > %cond3.nonzero_for_x2 = icmp ne i64 %cond3, 0 > br i1 %cond3.nonzero_for_x2, label %x2.cond3.true, label > %x2.cond3.false > > x2.cond3.false: > %cond4.nonzero = icmp ne i64 %cond4, 0 > br i1 %cond4.nonzero, label %x2.cond4.true, label %x2.otherwise > > x2.cond3.true: ; cond3 ? f(a3) > %x2.cond3.ans = call i64 @f (i64 %a3) > br label %x2.finish > > x2.cond4.true: ; cond4 ? f(a1) > %x2.cond4.ans = call i64 @f (i64 %a1) > br label %x2.finish > > x2.otherwise: > %x2.otherwise.ans = call i64 @f (i64 %a4) > br label %x2.finish > > x2.finish: > %x2 = phi i64 [%x2.cond3.ans, %x2.cond3.true], > [%x2.cond4.ans, %x2.cond4.true], > [%x2.otherwise.ans, %x2.otherwise] > > ; Return g(x1, x2); > %ret = call i64 @g(i64 %x1, i64 %x2) > ret i64 %ret > > } > > > > > -- > Jared C. Davis <jared at cs.utexas.edu> > 11410 Windermere Meadows > Austin, TX 78759 > http://www.cs.utexas.edu/users/jared/ > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160312/3942c904/attachment.html>