Joan Lluch via llvm-dev
2019-Oct-01  21:51 UTC
[llvm-dev] [cfe-dev] CFG simplification question, and preservation of branching in the original code
Hi Sanjay, Thanks for your reply.> So yes, the IR optimizer (instcombine is the specific pass) sometimes turns icmp (and select) sequences into ALU ops. Instcombine is almost entirely *target-independent* and should remain that way. The (sometimes unfortunate) decision to create shifts were made based on popular targets of the time (PowerPC and/or x86), and other targets may have suffered because of that.Yes, that’s what I actually found that I didn’t anticipate.> We've been trying to reverse those canonicalizations in IR over the past few years when the choice is clearly not always optimal, but it's not easy. To avoid perf regressions, you first have to make the backend/codegen aware of the alternate pattern that includes icmp/select and transform that to math/logic (translate instcombine code to DAGCombiner). Then, you have to remove the transform from instcombine and replace it with the reverse transform. This can uncover infinite loops and other problems within instcombine.I understand that. In any case, I am glad that at least this is acknowledged as some kind of flaw of the LLVM system, particularly for the optimal implementation of small processor backends. As these targets generally have cheap branches and do not generally have selects or multi-bit shifts, they hardly benefit from transformations involving shifts or aggressively attempting to replace jumps.> So to finally answer the question: If you can transform the shift into an alternate sequence with a "setcc" DAG node in your target's "ISelLowering" code, that's the easiest way forward. Otherwise, you have to weigh the impact of each target-independent transform on every target.Yes, that’s what I have been doing all the time. My backend contains a lot of code only to reverse undesired LLVM transformations, and yet the resulting final assembly is not as good as it could be, because it is often hard or impossible to identify all sources of improvement. It’s ironical that some older, less sophisticated compilers (and GCC) produce /much/ better code than LLVM for simple architectures. In order to give better support to current and future implementations of small processor targets, I wonder if instead of attempting to implement a general solution, we could implement a set of compiler flags (or code hooks) that targets could use to optionally disable undesired IR optimiser actions, however not affecting anything of the current behaviour as long as these flags are not used. If we had that, nothing would need to be done on the existing targets, particularly the major ones, and yet, new backend developments and small processor (AVR, MSP430) could potentially benefit from that. My opinion is that the availability of such options will not defeat the “almost entirely target-independent” nature of the IR optimiser, as the transformations would still be target-agnostic, except that targets would be able to decide which ones are more convenient for them, or disable the non desirable ones. Does this sound ok or feasible?. John> On 1 Oct 2019, at 17:20, Sanjay Patel <spatel at rotateright.com> wrote: > > First, let's agree on terminology: > 1. We're in LLVM. Clang has little or nothing to do with these questions from the perspective of LLVM developers. > 2. The IR optimizer (also known as the middle-end and invoked via 'opt') is what takes LLVM IR from a front-end (clang is just 1 example) and transforms it to different LLVM IR for easier target-specific optimization. > 3. The back-end (invoked using 'llc') is what takes LLVM IR and turns it into assembly for your target. Codegen is 1 part of the backend. > > So yes, the IR optimizer (instcombine is the specific pass) sometimes turns icmp (and select) sequences into ALU ops. Instcombine is almost entirely *target-independent* and should remain that way. The (sometimes unfortunate) decision to create shifts were made based on popular targets of the time (PowerPC and/or x86), and other targets may have suffered because of that. > > We've been trying to reverse those canonicalizations in IR over the past few years when the choice is clearly not always optimal, but it's not easy. To avoid perf regressions, you first have to make the backend/codegen aware of the alternate pattern that includes icmp/select and transform that to math/logic (translate instcombine code to DAGCombiner). Then, you have to remove the transform from instcombine and replace it with the reverse transform. This can uncover infinite loops and other problems within instcombine. > > It's often not clear which form of IR will lead to better optimizations within the IR optimizer itself. We favor the shortest IR sequence in most cases. But if there's a tie, we have to make a judgement about what is easier to analyze/transform when viewed within a longer sequence of IR. > > So to finally answer the question: If you can transform the shift into an alternate sequence with a "setcc" DAG node in your target's "ISelLowering" code, that's the easiest way forward. Otherwise, you have to weigh the impact of each target-independent transform on every target. > > > On Mon, Sep 30, 2019 at 5:31 PM Craig Topper <craig.topper at gmail.com <mailto:craig.topper at gmail.com>> wrote: > For the MSP430 example, I'm guess its InstCombiner::transformSExtICmp or InstCombiner::transformZExtICmp > > ~Craig > > > On Mon, Sep 30, 2019 at 2:21 PM Support IMAP <support at sweetwilliamsl.com <mailto:support at sweetwilliamsl.com>> wrote: > Hi all, > > Ok, I just found a much simpler example of the same issue. > > Consider the following code > > int cmpge32_0(long a) { > return a>=0; > } > > Compiled for the MSP430 with -O1 or -Os results in the following: > > ; Function Attrs: norecurse nounwind readnone > define dso_local i16 @cmpge32_0(i32 %a) local_unnamed_addr #0 { > entry: > %a.lobit = lshr i32 %a, 31 > %0 = trunc i32 %a.lobit to i16 > %.not = xor i16 %0, 1 > ret i16 %.not > } > > The backend then turns this into the following totally suboptimal code: > > cmpge32_0: > mov r13, r12 > inv r12 > swpb r12 > mov.b r12, r12 > clrc > rrc r12 > rra r12 > rra r12 > rra r12 > rra r12 > rra r12 > rra r12 > ret > .Lfunc_end0: > .size cmpge32_0, .Lfunc_end0-cmpge32_0 > > > The cause of this anomaly is again the presence of the Shift instruction (%a.lobit = lshr i32 %a, 31) at the IR level, which is hard to handle by the backend. > > The same C code compiled with -O0 creates the following IR code excerpt instead of the lshr-trunc code sequence > > %cmp = icmp sge i32 %0, 0 > %conv = zext i1 %cmp to i16 > > This compiles into MUCH better code for the MSP430 architecture (and virtually any 16 bit architecture not supporting multiple shifts). > It would be desirable that LLVM would just leave the comparison as is, also for -O1 and above. > > > So Please, can somebody point me to the LLVM class or function that performs the transformation of the comparison above into the undesired shift, so I can investigate what’s going on, or whether there’s something I can do? > > That would be really appreciated. > > John > > > Hi Roman > >> Not exactly, this is IR after optimizations, this is not what clang produces. >> To see what clang produces you want to pass -O0. >> All optimizations beyond that are done by LLVM. > > > Ok, I understand such naming convention, but it is still something that happens at the IR code generation steps, and therefore the backend has little to do about. > > So, what are actually the hooks that I can implement or investigate to modify the undesired behaviour? > > John > > >> On 30 Sep 2019, at 13:35, Roman Lebedev <lebedev.ri at gmail.com <mailto:lebedev.ri at gmail.com>> wrote: >> >> On Mon, Sep 30, 2019 at 11:52 AM Joan Lluch <joan.lluch at icloud.com <mailto:joan.lluch at icloud.com>> wrote: >>> >>> Hi Roman, >>> >>> Is "test" actually an implementation of a 64-bit-wide multiplication >>> compiler-rt builtin? >>> Then i'd think the main problem is that it is being optimized in the >>> first place, you could end up with endless recursion… >>> >>> >>> No, this is not a compiler-rt builtin. My example is of course incidentally taken from the implementation of a signed multiply, but as said, it has nothing to do with rt-builtins, I'm just using that code to show the issue. This function can’t create a recursion because it’s named ’test’, unlike any rt-buitin. You can replace the multiply in the source code by an addition, if you want to avoid calling rt-functions, but this does not change what I attempt to show. Also It’s not meant to be 64 bit wide, but 32 bit wide, because the targets I’m testing are 16 bit, so ints are 16 bit and longs are 32 bit. This is again the function I am testing: >>> >>> >>> long test (long a, long b) >>> { >>> int neg = 0; >>> long res; >>> >>> if (a < 0) >>> { >>> a = -a; >>> neg = 1; >>> } >>> >>> if (b < 0) >>> { >>> b = -b; >>> neg = !neg; >>> } >>> >>> res = a*b; >>> >>> if (neg) >>> res = -res; >>> >>> return res; >>> } >>> >>> >>> >>> LLVM, not clang. >>> >>> >>> I’m not sure about what you mean by that. The shown LLVM IR code is created by executing "clang” command line, so that’s what I attempt to show. >> Not exactly, this is IR after optimizations, this is not what clang produces. >> To see what clang produces you want to pass -O0. >> All optimizations beyond that are done by LLVM. >> >>> So it’s actually the front-end that does such undesired optimisations sometimes, not only the LLVM back-end. This is in part why I am saying this is not right. See copied again the IR code that gets generated for the C code that I posted before. This IR code, including the presence of expensive shifts ( %a.lobit = lshr i32 %a, 31) is generated when -mllvm -phi-node-folding-threshold=1 is specified in the command line, or when the Target implements getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) to return TCC_Expensive for operator types that are bigger than the default target register size. >>> >>> >>> >>> ; ModuleID = 'main.c' >>> source_filename = "main.c" >>> target datalayout = "e-m:e-p:16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S16" >>> target triple = "msp430" >>> >>> ; Function Attrs: norecurse nounwind optsize readnone >>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 { >>> entry: >>> %cmp = icmp slt i32 %a, 0 >>> %sub = sub nsw i32 0, %a >>> %spec.select = select i1 %cmp, i32 %sub, i32 %a >>> %a.lobit = lshr i32 %a, 31 >>> %0 = trunc i32 %a.lobit to i16 >>> %cmp1 = icmp slt i32 %b, 0 >>> br i1 %cmp1, label %if.then2, label %if.end4 >>> >>> if.then2: ; preds = %entry >>> %sub3 = sub nsw i32 0, %b >>> %1 = xor i16 %0, 1 >>> br label %if.end4 >>> >>> if.end4: ; preds = %if.then2, %entry >>> %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ] >>> %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ] >>> %mul = mul nsw i32 %b.addr.0, %spec.select >>> %tobool5 = icmp eq i16 %neg.1, 0 >>> %sub7 = sub nsw i32 0, %mul >>> %spec.select18 = select i1 %tobool5, i32 %mul, i32 %sub7 >>> ret i32 %spec.select18 >>> } >>> >>> attributes #0 = { norecurse nounwind optsize readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" } >>> >>> !llvm.module.flags = !{!0} >>> !llvm.ident = !{!1} >>> >>> !0 = !{i32 1, !"wchar_size", i32 2} >>> !1 = !{!"clang version 9.0.0 (https://github.com/llvm/llvm-project.git <https://github.com/llvm/llvm-project.git> 6f7deba43dd25fb7b3eca70f9c388ec9174f455a)"} >>> >>> >>> >>> As you can see, Clang produces a 31 bit wide shift right ( %a.lobit = lshr i32 %a, 31) That’s the fourth instruction on the IR code above. So a shift is produced instead of creating a jump to a new block, as it should be the case as per the C source code. >>> >>> Just as a matter of information. This is the implementation of the getOperationCost function that causes ‘clang’ to correctly replace selects by branches (desirable), but to generate shifts to fold expensive selects (undesirable) >>> >>> >>> unsigned CPU74TTIImpl::getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) >>> { >>> // Big types are expensive >>> unsigned OpSize = Ty->getScalarSizeInBits(); >>> if ( OpSize > 16 ) >>> return TTI::TCC_Expensive; >>> >>> return BaseT::getOperationCost(Opcode, Ty, OpTy); >>> } >>> >>> If the getOperationCost above function was not implemented, then clang would generate the usual series of ‘selects’. But this is even worse because selects imply speculative execution of expensive instructions, or duplicate branching created by the backend, which can’t be easily avoided. >>> >>> Ideally, the IR code above should just place an ‘if.then' block for the if (a < 0) statement in the C source code, instead of attempting to replace a select by a shift (!) >>> >>> If you want to play with these two scenarios, (1) IR code generated with branches, and (2) IR code generated with selects. This can easily be reproduced for the MSP430 target by compiling with the following options >>> (1) -mllvm -phi-node-folding-threshold=1 -c -S -Os >>> (2) -mllvm -phi-node-folding-threshold=2 -c -S -Os >>> >>> For 16 bit targets without selects, or expensive selects, the overall code is better with (1) because that prevents the creation of a different jump for every ‘select’ that (2) would cause. However, the presence of the ‘shift’ instruction for (1) spoils it all. >>> >>> Again, ideally, the use of shifts as a replacement of selects should be avoided, and an “if.then" block should be used as per the original C code. >>> >>> I hope this is clear now. >>> >>> John. >>> >>> >>> On 29 Sep 2019, at 15:57, Roman Lebedev <lebedev.ri at gmail.com <mailto:lebedev.ri at gmail.com>> wrote: >>> >>> On Sun, Sep 29, 2019 at 3:35 PM Joan Lluch via llvm-dev >>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: >>> >>> >>> Hi Sanjay, >>> >>> Actually, the CodeGenPrepare::optimizeSelectInst is not doing the best it could do in some circumstances: The case of “OptSize" for targets not supporting Select was already mentioned to be detrimental. >>> >>> For targets that actually have selects, but branches are cheap and generally profitable, particularly for expensive operators, the optimizeSelectInst function does not do good either. The function tries to identify consecutive selects with the same condition in order to avoid duplicate branches, which is ok, but then this effort is discarded in isFormingBranchFromSelectProfitable because the identified condition is used more than once (on the said two consecutive selects, of course), which defeats the whole purpose of checking for them, resulting in poor codegen. >>> >>> Yet another issue is that Clang attempts to replace ‘selects’ in the source code, by supposedly optimised code that is not ok for all targets. One example is this: >>> >>> LLVM, not clang. >>> >>> long test (long a, long b) >>> { >>> int neg = 0; >>> long res; >>> >>> if (a < 0) >>> { >>> a = -a; >>> neg = 1; >>> } >>> >>> if (b < 0) >>> { >>> b = -b; >>> neg = !neg; >>> } >>> >>> res = a*b; //(unsigned long)a / (unsigned long)b; // will call __udivsi3 >>> >>> if (neg) >>> res = -res; >>> >>> return res; >>> } >>> >>> >>> This gets compiled into >>> >>> ; Function Attrs: norecurse nounwind readnone >>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 { >>> entry: >>> %cmp = icmp slt i32 %a, 0 >>> %sub = sub nsw i32 0, %a >>> %a.addr.0 = select i1 %cmp, i32 %sub, i32 %a >>> %a.lobit = lshr i32 %a, 31 >>> %0 = trunc i32 %a.lobit to i16 >>> %cmp1 = icmp slt i32 %b, 0 >>> br i1 %cmp1, label %if.then2, label %if.end4 >>> >>> if.then2: ; preds = %entry >>> %sub3 = sub nsw i32 0, %b >>> %1 = xor i16 %0, 1 >>> br label %if.end4 >>> >>> if.end4: ; preds = %if.then2, %entry >>> %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ] >>> %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ] >>> %mul = mul nsw i32 %b.addr.0, %a.addr.0 >>> %tobool5 = icmp eq i16 %neg.1, 0 >>> %sub7 = sub nsw i32 0, %mul >>> %res.0 = select i1 %tobool5, i32 %mul, i32 %sub7 >>> ret i32 %res.0 >>> } >>> >>> The offending part here is this: %a.lobit = lshr i32 %a, 31 . Instead of just creating a “select” instruction, as the original code suggested with the if (a < 0) { neg = 1;} statements, the front-end produces a lshr which is very expensive for small architectures, and makes it very difficult for the backend to fold it again into an actual select (or branch). In my opinion, the original C code should have produced a “select” and give the backend the opportunity to optimise it if required. I think that the frontend should perform only target independent optimisations. >>> >>> >>> You didn't specify how you compile that code. >>> We could also get: https://godbolt.org/z/B-5lj1 <https://godbolt.org/z/B-5lj1> >>> Which can actually be folded further to just >>> long test(long a, long b) { >>> return a * b; >>> } >>> Is "test" actually an implementation of a 64-bit-wide multiplication >>> compiler-rt builtin? >>> Then i'd think the main problem is that it is being optimized in the >>> first place, you could end up with endless recursion... >>> >>> I posted before my view that LLVM is clearly designed to satisfy big boys such as the x86 and ARM targets. This means that, unfortunately, it makes too many general assumptions about what’s cheap, without providing enough hooks to cancel arbitrary optimisations. As I am implementing backends for 8 or 16 bit targets, I find myself doing a lot of work just to reverse optimisations that should have not been applied in the first place. My example above is an instance of a code mutation performed by the frontend that is not desirable. Existing 8 and 16 bit trunk targets (particularly the MSP430 and the AVR) are also negatively affected by the excessively liberal use of shifts by LLVM. >>> >>> The CodeGenPrepare::optimizeSelectInst function needs some changes to respect targets with no selects, and targets that may want to avoid expensive speculative executions. >>> >>> John >>> >>> Roman >>> >>> On 25 Sep 2019, at 16:00, Sanjay Patel <spatel at rotateright.com <mailto:spatel at rotateright.com>> wrote: >>> >>> Changing the order of the checks in CodeGenPrepare::optimizeSelectInst() sounds good to me. >>> >>> But you may need to go further for optimum performance. For example, we may be canonicalizing math/logic IR patterns into 'select' such as in the recent: >>> https://reviews.llvm.org/D67799 <https://reviews.llvm.org/D67799> >>> >>> So if you want those to become ALU ops again rather than branches, then you need to do the transform later in the backend. That is, you want to let DAGCombiner run its set of transforms on 'select' nodes. >>> >>> On Wed, Sep 25, 2019 at 4:03 AM Joan Lluch via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote: >>> >>> >>> Hi Craig, >>> >>> Thank you for your reply. I have started looking at “CodeGenPrepare” and I assume you reffer to CodeGenPrepare::optimizeSelectInst. I will try to play a bit with that possibly later today. At first glance, it looks to me that for targets that do not support ’select’ at all, the fact that the function exits early for ‘OptSize’ can be detrimental, because this will just leave ALL existing selects in the code anyway. As said, I will try to play with that later, but right now it looks to me that maybe we should check for TLI->isSelectSupported earlier in the function, to get some more opportunities to such targets without explicit ’select’ support? >>> >>> Thanks >>> >>> John >>> >>> >>> On 25 Sep 2019, at 08:59, Craig Topper <craig.topper at gmail.com <mailto:craig.topper at gmail.com>> wrote: >>> >>> There is code in CodeGenPrepare.cpp that can turn selects into branches that tries to account for multiple selects sharing the same condition. It doesn't look like either AVR or MSP430 enable that code though. >>> >>> ~Craig >>> >>> >>> On Tue, Sep 24, 2019 at 11:27 PM Joan Lluch via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote: >>> >>> >>> Hi Roman, >>> >>> Thank you for your reply. I understand your point. I just want to add something to clarify my original post in relation to your reply. >>> >>> There are already implemented 8-bit and 16-bit backends, namely the AVR and the MSP430, which already "aggressively convert selects into branches”, which already benefit (as they are) from setting "phi-node-folding-threshold’ to 1 or zero. This is because otherwise Clang will generate several selects depending on the same “icmp”. These backends are unable to optimise that, and they just create a comparison and a conditional branch for every “select” in the IR code, in spite that the original C code was already written in a much better way. So the resulting effect is the presence of redundant comparisons and branches in the final code, with a detrimental of generated code quality. >>> >>> The above gets improved by setting "phi-node-folding-threshold’ to 1 because some of these extra ‘selects' are no longer there so the backend stops generating redundant code. >>> >>> John. >>> >>> >>> >>> >>> On 21 Sep 2019, at 14:48, Roman Lebedev <lebedev.ri at gmail.com <mailto:lebedev.ri at gmail.com>> wrote: >>> >>> On Sat, Sep 21, 2019 at 3:18 PM Joan Lluch via cfe-dev >>> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote: >>> >>> >>> Hi all, >>> >>> For my custom architecture, I want to relax the CFG simplification pass, and any other passes replacing conditional branches. >>> >>> I found that the replacement of conditional branches by “select" and other instructions is often too aggressive, and this causes inefficient code for my target as in most cases branches would be cheaper. >>> >>> For example, considering the following c code: >>> >>> long test (long a, long b) >>> { >>> int neg = 0; >>> long res; >>> >>> if (a < 0) >>> { >>> a = -a; >>> neg = 1; >>> } >>> >>> res = a*b; >>> >>> if (neg) >>> res = -res; >>> >>> return res; >>> } >>> >>> >>> This code can be simplified in c, but it’s just an example to show the point. >>> >>> The code above gets compiled like this (-Oz flag): >>> >>> ; Function Attrs: minsize norecurse nounwind optsize readnone >>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 { >>> entry: >>> %cmp = icmp slt i32 %a, 0 >>> %sub = sub nsw i32 0, %a >>> %a.addr.0 = select i1 %cmp, i32 %sub, i32 %a >>> %mul = mul nsw i32 %a.addr.0, %b >>> %sub2 = sub nsw i32 0, %mul >>> %res.0 = select i1 %cmp, i32 %sub2, i32 %mul >>> ret i32 %res.0 >>> } >>> >>> >>> All branching was removed and replaced by ‘select’ instructions. For my architecture, it would be desirable to keep the original branches in most cases, because even simple 32 bit operations are too expensive to speculatively execute them, and branches are cheap. >>> >>> Setting 'phi-node-folding-threshold’ to 1 or even 0 (instead of the default 2), definitely improves the situation in many cases, but Clang still creates many instances of ‘select’ instructions, which are detrimental to my target. I am unsure about where are they created, as I believe that the simplifycfg pass does not longer create them. >>> >>> You definitively can't ban llvm passes/clang from creating select's. >>> >>> So the question is: Are there any other hooks in clang, or custom code that I can implement, to relax the creation of ’select’ instructions and make it preserve branches in the original c code? >>> >>> I think this is backwards. >>> Sure, you could maybe disable most of the folds that produce selects. >>> That may be good for final codegen, but will also affect other passes >>> since not everything deals with 2-node PHI as good as wit selects. >>> >>> But, what happens if you still get the select-y IR? >>> Doesn't matter how, could be hand-written. >>> >>> I think you might want to instead aggressively convert selects into >>> branches in backend. >>> >>> Thanks, >>> >>> John >>> >>> Roman >>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> >>> _______________________________________________ >>> cfe-dev mailing list >>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev> >>> >>> >>> _______________________________________________ >>> cfe-dev mailing list >>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev> >>> >>> >>> >>> _______________________________________________ >>> cfe-dev mailing list >>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev> >>> >>> >>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> >>> >>> > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191001/b323e7d1/attachment-0001.html>
Sanjay Patel via llvm-dev
2019-Oct-02  12:34 UTC
[llvm-dev] [cfe-dev] CFG simplification question, and preservation of branching in the original code
On Tue, Oct 1, 2019 at 5:51 PM Joan Lluch <joan.lluch at icloud.com> wrote:> In order to give better support to current and future implementations of > small processor targets, I wonder if instead of attempting to implement a > general solution, we could implement a set of compiler flags (or code > hooks) that targets could use to optionally disable undesired IR optimiser > actions, however not affecting anything of the current behaviour as long as > these flags are not used. If we had that, nothing would need to be done on > the existing targets, particularly the major ones, and yet, new backend > developments and small processor (AVR, MSP430) could potentially benefit > from that. My opinion is that the availability of such options will not > defeat the “almost entirely target-independent” nature of the IR optimiser, > as the transformations would still be target-agnostic, except that targets > would be able to decide which ones are more convenient for them, or disable > the non desirable ones. Does this sound ok or feasible?. >Providing target options/overrides to code that is supposed to be target-independent sounds self-defeating to me. I doubt that proposal would gain much support. Of course, if you're customizing LLVM for your own out-of-trunk backend, you can do anything you'd like if you're willing to maintain the custom patches. I suggest that you file a bug report showing an exact problem caused by 1 of the icmp/select transforms. If it can be done using an in-trunk backend like AVR or MSP430, then we'll have a real example showing the harm and hopefully better ideas about how to solve the bug. On Tue, Oct 1, 2019 at 5:51 PM Joan Lluch <joan.lluch at icloud.com> wrote:> Hi Sanjay, > > Thanks for your reply. > > So yes, the IR optimizer (instcombine is the specific pass) sometimes > turns icmp (and select) sequences into ALU ops. Instcombine is almost > entirely *target-independent* and should remain that way. The (sometimes > unfortunate) decision to create shifts were made based on popular targets > of the time (PowerPC and/or x86), and other targets may have suffered > because of that. > > > Yes, that’s what I actually found that I didn’t anticipate. > > We've been trying to reverse those canonicalizations in IR over the past > few years when the choice is clearly not always optimal, but it's not easy. > To avoid perf regressions, you first have to make the backend/codegen aware > of the alternate pattern that includes icmp/select and transform that to > math/logic (translate instcombine code to DAGCombiner). Then, you have to > remove the transform from instcombine and replace it with the reverse > transform. This can uncover infinite loops and other problems within > instcombine. > > > I understand that. In any case, I am glad that at least this is > acknowledged as some kind of flaw of the LLVM system, particularly for the > optimal implementation of small processor backends. As these targets > generally have cheap branches and do not generally have selects or > multi-bit shifts, they hardly benefit from transformations involving shifts > or aggressively attempting to replace jumps. > > So to finally answer the question: If you can transform the shift into an > alternate sequence with a "setcc" DAG node in your target's "ISelLowering" > code, that's the easiest way forward. Otherwise, you have to weigh the > impact of each target-independent transform on every target. > > > Yes, that’s what I have been doing all the time. My backend contains a lot > of code only to reverse undesired LLVM transformations, and yet the > resulting final assembly is not as good as it could be, because it is often > hard or impossible to identify all sources of improvement. It’s ironical > that some older, less sophisticated compilers (and GCC) produce /much/ > better code than LLVM for simple architectures. > > In order to give better support to current and future implementations of > small processor targets, I wonder if instead of attempting to implement a > general solution, we could implement a set of compiler flags (or code > hooks) that targets could use to optionally disable undesired IR optimiser > actions, however not affecting anything of the current behaviour as long as > these flags are not used. If we had that, nothing would need to be done on > the existing targets, particularly the major ones, and yet, new backend > developments and small processor (AVR, MSP430) could potentially benefit > from that. My opinion is that the availability of such options will not > defeat the “almost entirely target-independent” nature of the IR optimiser, > as the transformations would still be target-agnostic, except that targets > would be able to decide which ones are more convenient for them, or disable > the non desirable ones. Does this sound ok or feasible?. > > John > > > > On 1 Oct 2019, at 17:20, Sanjay Patel <spatel at rotateright.com> wrote: > > First, let's agree on terminology: > 1. We're in LLVM. Clang has little or nothing to do with these questions > from the perspective of LLVM developers. > 2. The IR optimizer (also known as the middle-end and invoked via 'opt') > is what takes LLVM IR from a front-end (clang is just 1 example) and > transforms it to different LLVM IR for easier target-specific optimization. > 3. The back-end (invoked using 'llc') is what takes LLVM IR and turns it > into assembly for your target. Codegen is 1 part of the backend. > > So yes, the IR optimizer (instcombine is the specific pass) sometimes > turns icmp (and select) sequences into ALU ops. Instcombine is almost > entirely *target-independent* and should remain that way. The (sometimes > unfortunate) decision to create shifts were made based on popular targets > of the time (PowerPC and/or x86), and other targets may have suffered > because of that. > > We've been trying to reverse those canonicalizations in IR over the past > few years when the choice is clearly not always optimal, but it's not easy. > To avoid perf regressions, you first have to make the backend/codegen aware > of the alternate pattern that includes icmp/select and transform that to > math/logic (translate instcombine code to DAGCombiner). Then, you have to > remove the transform from instcombine and replace it with the reverse > transform. This can uncover infinite loops and other problems within > instcombine. > > It's often not clear which form of IR will lead to better optimizations > within the IR optimizer itself. We favor the shortest IR sequence in most > cases. But if there's a tie, we have to make a judgement about what is > easier to analyze/transform when viewed within a longer sequence of IR. > > So to finally answer the question: If you can transform the shift into an > alternate sequence with a "setcc" DAG node in your target's "ISelLowering" > code, that's the easiest way forward. Otherwise, you have to weigh the > impact of each target-independent transform on every target. > > > On Mon, Sep 30, 2019 at 5:31 PM Craig Topper <craig.topper at gmail.com> > wrote: > >> For the MSP430 example, I'm guess its InstCombiner::transformSExtICmp >> or InstCombiner::transformZExtICmp >> >> ~Craig >> >> >> On Mon, Sep 30, 2019 at 2:21 PM Support IMAP <support at sweetwilliamsl.com> >> wrote: >> >>> Hi all, >>> >>> Ok, I just found a much simpler example of the same issue. >>> >>> Consider the following code >>> >>> int cmpge32_0(long a) { >>> return a>=0; >>> } >>> >>> Compiled for the MSP430 with -O1 or -Os results in the following: >>> >>> ; Function Attrs: norecurse nounwind readnone >>> define dso_local i16 @cmpge32_0(i32 %a) local_unnamed_addr #0 { >>> entry: >>> %a.lobit = lshr i32 %a, 31 >>> %0 = trunc i32 %a.lobit to i16 >>> %.not = xor i16 %0, 1 >>> ret i16 %.not >>> } >>> >>> The backend then turns this into the following totally suboptimal code: >>> >>> cmpge32_0: >>> mov r13, r12 >>> inv r12 >>> swpb r12 >>> mov.b r12, r12 >>> clrc >>> rrc r12 >>> rra r12 >>> rra r12 >>> rra r12 >>> rra r12 >>> rra r12 >>> rra r12 >>> ret >>> .Lfunc_end0: >>> .size cmpge32_0, .Lfunc_end0-cmpge32_0 >>> >>> >>> The cause of this anomaly is again the presence of the Shift instruction >>> (%a.lobit = lshr i32 %a, 31) at the IR level, which is hard to handle >>> by the backend. >>> >>> The same C code compiled with -O0 creates the following IR code excerpt >>> instead of the lshr-trunc code sequence >>> >>> %cmp = icmp sge i32 %0, 0 >>> %conv = zext i1 %cmp to i16 >>> >>> This compiles into MUCH better code for the MSP430 architecture (and >>> virtually any 16 bit architecture not supporting multiple shifts). >>> It would be desirable that LLVM would just leave the comparison as is, >>> also for -O1 and above. >>> >>> >>> So Please, can somebody point me to the LLVM class or function that >>> performs the transformation of the comparison above into the undesired >>> shift, so I can investigate what’s going on, or whether there’s something I >>> can do? >>> >>> That would be really appreciated. >>> >>> John >>> >>> >>> Hi Roman >>> >>> Not exactly, this is IR after optimizations, this is not what clang >>> produces. >>> To see what clang produces you want to pass -O0. >>> All optimizations beyond that are done by LLVM. >>> >>> >>> >>> Ok, I understand such naming convention, but it is still something that >>> happens at the IR code generation steps, and therefore the backend has >>> little to do about. >>> >>> So, what are actually the hooks that I can implement or investigate to >>> modify the undesired behaviour? >>> >>> John >>> >>> >>> On 30 Sep 2019, at 13:35, Roman Lebedev <lebedev.ri at gmail.com> wrote: >>> >>> On Mon, Sep 30, 2019 at 11:52 AM Joan Lluch <joan.lluch at icloud.com> >>> wrote: >>> >>> >>> Hi Roman, >>> >>> Is "test" actually an implementation of a 64-bit-wide multiplication >>> compiler-rt builtin? >>> Then i'd think the main problem is that it is being optimized in the >>> first place, you could end up with endless recursion… >>> >>> >>> No, this is not a compiler-rt builtin. My example is of course >>> incidentally taken from the implementation of a signed multiply, but as >>> said, it has nothing to do with rt-builtins, I'm just using that code to >>> show the issue. This function can’t create a recursion because it’s named >>> ’test’, unlike any rt-buitin. You can replace the multiply in the source >>> code by an addition, if you want to avoid calling rt-functions, but this >>> does not change what I attempt to show. Also It’s not meant to be 64 bit >>> wide, but 32 bit wide, because the targets I’m testing are 16 bit, so ints >>> are 16 bit and longs are 32 bit. This is again the function I am testing: >>> >>> >>> long test (long a, long b) >>> { >>> int neg = 0; >>> long res; >>> >>> if (a < 0) >>> { >>> a = -a; >>> neg = 1; >>> } >>> >>> if (b < 0) >>> { >>> b = -b; >>> neg = !neg; >>> } >>> >>> res = a*b; >>> >>> if (neg) >>> res = -res; >>> >>> return res; >>> } >>> >>> >>> >>> LLVM, not clang. >>> >>> >>> I’m not sure about what you mean by that. The shown LLVM IR code is >>> created by executing "clang” command line, so that’s what I attempt to show. >>> >>> Not exactly, this is IR after optimizations, this is not what clang >>> produces. >>> To see what clang produces you want to pass -O0. >>> All optimizations beyond that are done by LLVM. >>> >>> So it’s actually the front-end that does such undesired optimisations >>> sometimes, not only the LLVM back-end. This is in part why I am saying this >>> is not right. See copied again the IR code that gets generated for the C >>> code that I posted before. This IR code, including the presence of >>> expensive shifts ( %a.lobit = lshr i32 %a, 31) is generated when -mllvm >>> -phi-node-folding-threshold=1 is specified in the command line, or when the >>> Target implements getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) >>> to return TCC_Expensive for operator types that are bigger than the default >>> target register size. >>> >>> >>> >>> ; ModuleID = 'main.c' >>> source_filename = "main.c" >>> target datalayout >>> "e-m:e-p:16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S16" >>> target triple = "msp430" >>> >>> ; Function Attrs: norecurse nounwind optsize readnone >>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 { >>> entry: >>> %cmp = icmp slt i32 %a, 0 >>> %sub = sub nsw i32 0, %a >>> %spec.select = select i1 %cmp, i32 %sub, i32 %a >>> %a.lobit = lshr i32 %a, 31 >>> %0 = trunc i32 %a.lobit to i16 >>> %cmp1 = icmp slt i32 %b, 0 >>> br i1 %cmp1, label %if.then2, label %if.end4 >>> >>> if.then2: ; preds = %entry >>> %sub3 = sub nsw i32 0, %b >>> %1 = xor i16 %0, 1 >>> br label %if.end4 >>> >>> if.end4: ; preds = %if.then2, >>> %entry >>> %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ] >>> %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ] >>> %mul = mul nsw i32 %b.addr.0, %spec.select >>> %tobool5 = icmp eq i16 %neg.1, 0 >>> %sub7 = sub nsw i32 0, %mul >>> %spec.select18 = select i1 %tobool5, i32 %mul, i32 %sub7 >>> ret i32 %spec.select18 >>> } >>> >>> attributes #0 = { norecurse nounwind optsize readnone >>> "correctly-rounded-divide-sqrt-fp-math"="false" >>> "disable-tail-calls"="false" "less-precise-fpmad"="false" >>> "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" >>> "no-infs-fp-math"="false" "no-jump-tables"="false" >>> "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" >>> "no-trapping-math"="false" "stack-protector-buffer-size"="8" >>> "unsafe-fp-math"="false" "use-soft-float"="false" } >>> >>> !llvm.module.flags = !{!0} >>> !llvm.ident = !{!1} >>> >>> !0 = !{i32 1, !"wchar_size", i32 2} >>> !1 = !{!"clang version 9.0.0 (https://github.com/llvm/llvm-project.git >>> 6f7deba43dd25fb7b3eca70f9c388ec9174f455a)"} >>> >>> >>> >>> As you can see, Clang produces a 31 bit wide shift right ( %a.lobit >>> lshr i32 %a, 31) That’s the fourth instruction on the IR code above. So a >>> shift is produced instead of creating a jump to a new block, as it should >>> be the case as per the C source code. >>> >>> Just as a matter of information. This is the implementation of the >>> getOperationCost function that causes ‘clang’ to correctly replace selects >>> by branches (desirable), but to generate shifts to fold expensive selects >>> (undesirable) >>> >>> >>> unsigned CPU74TTIImpl::getOperationCost(unsigned Opcode, Type *Ty, Type >>> *OpTy) >>> { >>> // Big types are expensive >>> unsigned OpSize = Ty->getScalarSizeInBits(); >>> if ( OpSize > 16 ) >>> return TTI::TCC_Expensive; >>> >>> return BaseT::getOperationCost(Opcode, Ty, OpTy); >>> } >>> >>> If the getOperationCost above function was not implemented, then clang >>> would generate the usual series of ‘selects’. But this is even worse >>> because selects imply speculative execution of expensive instructions, or >>> duplicate branching created by the backend, which can’t be easily avoided. >>> >>> Ideally, the IR code above should just place an ‘if.then' block for the >>> if (a < 0) statement in the C source code, instead of attempting to replace >>> a select by a shift (!) >>> >>> If you want to play with these two scenarios, (1) IR code generated with >>> branches, and (2) IR code generated with selects. This can easily be >>> reproduced for the MSP430 target by compiling with the following options >>> (1) -mllvm -phi-node-folding-threshold=1 -c -S -Os >>> (2) -mllvm -phi-node-folding-threshold=2 -c -S -Os >>> >>> For 16 bit targets without selects, or expensive selects, the overall >>> code is better with (1) because that prevents the creation of a different >>> jump for every ‘select’ that (2) would cause. However, the presence of the >>> ‘shift’ instruction for (1) spoils it all. >>> >>> Again, ideally, the use of shifts as a replacement of selects should be >>> avoided, and an “if.then" block should be used as per the original C code. >>> >>> I hope this is clear now. >>> >>> John. >>> >>> >>> On 29 Sep 2019, at 15:57, Roman Lebedev <lebedev.ri at gmail.com> wrote: >>> >>> On Sun, Sep 29, 2019 at 3:35 PM Joan Lluch via llvm-dev >>> <llvm-dev at lists.llvm.org> wrote: >>> >>> >>> Hi Sanjay, >>> >>> Actually, the CodeGenPrepare::optimizeSelectInst is not doing the best >>> it could do in some circumstances: The case of “OptSize" for targets not >>> supporting Select was already mentioned to be detrimental. >>> >>> For targets that actually have selects, but branches are cheap and >>> generally profitable, particularly for expensive operators, the >>> optimizeSelectInst function does not do good either. The function tries to >>> identify consecutive selects with the same condition in order to avoid >>> duplicate branches, which is ok, but then this effort is discarded in >>> isFormingBranchFromSelectProfitable because the identified condition is >>> used more than once (on the said two consecutive selects, of course), which >>> defeats the whole purpose of checking for them, resulting in poor codegen. >>> >>> Yet another issue is that Clang attempts to replace ‘selects’ in the >>> source code, by supposedly optimised code that is not ok for all targets. >>> One example is this: >>> >>> LLVM, not clang. >>> >>> long test (long a, long b) >>> { >>> int neg = 0; >>> long res; >>> >>> if (a < 0) >>> { >>> a = -a; >>> neg = 1; >>> } >>> >>> if (b < 0) >>> { >>> b = -b; >>> neg = !neg; >>> } >>> >>> res = a*b; //(unsigned long)a / (unsigned long)b; // will call __udivsi3 >>> >>> if (neg) >>> res = -res; >>> >>> return res; >>> } >>> >>> >>> This gets compiled into >>> >>> ; Function Attrs: norecurse nounwind readnone >>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 { >>> entry: >>> %cmp = icmp slt i32 %a, 0 >>> %sub = sub nsw i32 0, %a >>> %a.addr.0 = select i1 %cmp, i32 %sub, i32 %a >>> %a.lobit = lshr i32 %a, 31 >>> %0 = trunc i32 %a.lobit to i16 >>> %cmp1 = icmp slt i32 %b, 0 >>> br i1 %cmp1, label %if.then2, label %if.end4 >>> >>> if.then2: ; preds = %entry >>> %sub3 = sub nsw i32 0, %b >>> %1 = xor i16 %0, 1 >>> br label %if.end4 >>> >>> if.end4: ; preds = %if.then2, >>> %entry >>> %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ] >>> %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ] >>> %mul = mul nsw i32 %b.addr.0, %a.addr.0 >>> %tobool5 = icmp eq i16 %neg.1, 0 >>> %sub7 = sub nsw i32 0, %mul >>> %res.0 = select i1 %tobool5, i32 %mul, i32 %sub7 >>> ret i32 %res.0 >>> } >>> >>> The offending part here is this: %a.lobit = lshr i32 %a, 31 . Instead >>> of just creating a “select” instruction, as the original code suggested >>> with the if (a < 0) { neg = 1;} statements, the front-end produces a lshr >>> which is very expensive for small architectures, and makes it very >>> difficult for the backend to fold it again into an actual select (or >>> branch). In my opinion, the original C code should have produced a “select” >>> and give the backend the opportunity to optimise it if required. I think >>> that the frontend should perform only target independent optimisations. >>> >>> >>> You didn't specify how you compile that code. >>> We could also get: https://godbolt.org/z/B-5lj1 >>> Which can actually be folded further to just >>> long test(long a, long b) { >>> return a * b; >>> } >>> Is "test" actually an implementation of a 64-bit-wide multiplication >>> compiler-rt builtin? >>> Then i'd think the main problem is that it is being optimized in the >>> first place, you could end up with endless recursion... >>> >>> I posted before my view that LLVM is clearly designed to satisfy big >>> boys such as the x86 and ARM targets. This means that, unfortunately, it >>> makes too many general assumptions about what’s cheap, without providing >>> enough hooks to cancel arbitrary optimisations. As I am implementing >>> backends for 8 or 16 bit targets, I find myself doing a lot of work just to >>> reverse optimisations that should have not been applied in the first place. >>> My example above is an instance of a code mutation performed by the >>> frontend that is not desirable. Existing 8 and 16 bit trunk targets >>> (particularly the MSP430 and the AVR) are also negatively affected by the >>> excessively liberal use of shifts by LLVM. >>> >>> The CodeGenPrepare::optimizeSelectInst function needs some changes to >>> respect targets with no selects, and targets that may want to avoid >>> expensive speculative executions. >>> >>> John >>> >>> Roman >>> >>> On 25 Sep 2019, at 16:00, Sanjay Patel <spatel at rotateright.com> wrote: >>> >>> Changing the order of the checks in CodeGenPrepare::optimizeSelectInst() >>> sounds good to me. >>> >>> But you may need to go further for optimum performance. For example, we >>> may be canonicalizing math/logic IR patterns into 'select' such as in the >>> recent: >>> https://reviews.llvm.org/D67799 >>> >>> So if you want those to become ALU ops again rather than branches, then >>> you need to do the transform later in the backend. That is, you want to let >>> DAGCombiner run its set of transforms on 'select' nodes. >>> >>> On Wed, Sep 25, 2019 at 4:03 AM Joan Lluch via cfe-dev < >>> cfe-dev at lists.llvm.org> wrote: >>> >>> >>> Hi Craig, >>> >>> Thank you for your reply. I have started looking at “CodeGenPrepare” and >>> I assume you reffer to CodeGenPrepare::optimizeSelectInst. I will try to >>> play a bit with that possibly later today. At first glance, it looks to me >>> that for targets that do not support ’select’ at all, the fact that the >>> function exits early for ‘OptSize’ can be detrimental, because this will >>> just leave ALL existing selects in the code anyway. As said, I will try to >>> play with that later, but right now it looks to me that maybe we should >>> check for TLI->isSelectSupported earlier in the function, to get some more >>> opportunities to such targets without explicit ’select’ support? >>> >>> Thanks >>> >>> John >>> >>> >>> On 25 Sep 2019, at 08:59, Craig Topper <craig.topper at gmail.com> wrote: >>> >>> There is code in CodeGenPrepare.cpp that can turn selects into branches >>> that tries to account for multiple selects sharing the same condition. It >>> doesn't look like either AVR or MSP430 enable that code though. >>> >>> ~Craig >>> >>> >>> On Tue, Sep 24, 2019 at 11:27 PM Joan Lluch via cfe-dev < >>> cfe-dev at lists.llvm.org> wrote: >>> >>> >>> Hi Roman, >>> >>> Thank you for your reply. I understand your point. I just want to add >>> something to clarify my original post in relation to your reply. >>> >>> There are already implemented 8-bit and 16-bit backends, namely the AVR >>> and the MSP430, which already "aggressively convert selects into branches”, >>> which already benefit (as they are) from setting >>> "phi-node-folding-threshold’ to 1 or zero. This is because otherwise Clang >>> will generate several selects depending on the same “icmp”. These backends >>> are unable to optimise that, and they just create a comparison and a >>> conditional branch for every “select” in the IR code, in spite that the >>> original C code was already written in a much better way. So the resulting >>> effect is the presence of redundant comparisons and branches in the final >>> code, with a detrimental of generated code quality. >>> >>> The above gets improved by setting "phi-node-folding-threshold’ to 1 >>> because some of these extra ‘selects' are no longer there so the backend >>> stops generating redundant code. >>> >>> John. >>> >>> >>> >>> >>> On 21 Sep 2019, at 14:48, Roman Lebedev <lebedev.ri at gmail.com> wrote: >>> >>> On Sat, Sep 21, 2019 at 3:18 PM Joan Lluch via cfe-dev >>> <cfe-dev at lists.llvm.org> wrote: >>> >>> >>> Hi all, >>> >>> For my custom architecture, I want to relax the CFG simplification pass, >>> and any other passes replacing conditional branches. >>> >>> I found that the replacement of conditional branches by “select" and >>> other instructions is often too aggressive, and this causes inefficient >>> code for my target as in most cases branches would be cheaper. >>> >>> For example, considering the following c code: >>> >>> long test (long a, long b) >>> { >>> int neg = 0; >>> long res; >>> >>> if (a < 0) >>> { >>> a = -a; >>> neg = 1; >>> } >>> >>> res = a*b; >>> >>> if (neg) >>> res = -res; >>> >>> return res; >>> } >>> >>> >>> This code can be simplified in c, but it’s just an example to show the >>> point. >>> >>> The code above gets compiled like this (-Oz flag): >>> >>> ; Function Attrs: minsize norecurse nounwind optsize readnone >>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 { >>> entry: >>> %cmp = icmp slt i32 %a, 0 >>> %sub = sub nsw i32 0, %a >>> %a.addr.0 = select i1 %cmp, i32 %sub, i32 %a >>> %mul = mul nsw i32 %a.addr.0, %b >>> %sub2 = sub nsw i32 0, %mul >>> %res.0 = select i1 %cmp, i32 %sub2, i32 %mul >>> ret i32 %res.0 >>> } >>> >>> >>> All branching was removed and replaced by ‘select’ instructions. For my >>> architecture, it would be desirable to keep the original branches in most >>> cases, because even simple 32 bit operations are too expensive to >>> speculatively execute them, and branches are cheap. >>> >>> Setting 'phi-node-folding-threshold’ to 1 or even 0 (instead of the >>> default 2), definitely improves the situation in many cases, but Clang >>> still creates many instances of ‘select’ instructions, which are >>> detrimental to my target. I am unsure about where are they created, as I >>> believe that the simplifycfg pass does not longer create them. >>> >>> You definitively can't ban llvm passes/clang from creating select's. >>> >>> So the question is: Are there any other hooks in clang, or custom code >>> that I can implement, to relax the creation of ’select’ instructions and >>> make it preserve branches in the original c code? >>> >>> I think this is backwards. >>> Sure, you could maybe disable most of the folds that produce selects. >>> That may be good for final codegen, but will also affect other passes >>> since not everything deals with 2-node PHI as good as wit selects. >>> >>> But, what happens if you still get the select-y IR? >>> Doesn't matter how, could be hand-written. >>> >>> I think you might want to instead aggressively convert selects into >>> branches in backend. >>> >>> Thanks, >>> >>> John >>> >>> Roman >>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> _______________________________________________ >>> cfe-dev mailing list >>> cfe-dev at lists.llvm.org >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev >>> >>> >>> _______________________________________________ >>> cfe-dev mailing list >>> cfe-dev at lists.llvm.org >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev >>> >>> >>> >>> _______________________________________________ >>> cfe-dev mailing list >>> cfe-dev at lists.llvm.org >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev >>> >>> >>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> >>> >>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191002/86b9f199/attachment-0001.html>
Finkel, Hal J. via llvm-dev
2019-Oct-02  23:54 UTC
[llvm-dev] [cfe-dev] CFG simplification question, and preservation of branching in the original code
On 10/2/19 7:34 AM, Sanjay Patel via llvm-dev wrote:
On Tue, Oct 1, 2019 at 5:51 PM Joan Lluch <joan.lluch at
icloud.com<mailto:joan.lluch at icloud.com>> wrote:
In order to give better support to current and future implementations of small
processor targets, I wonder if instead of attempting to implement a general
solution, we could implement a set of compiler flags (or code hooks) that
targets could use to optionally disable undesired IR optimiser actions, however
not affecting anything of the current behaviour as long as these flags are not
used. If we had that, nothing would need to be done on the existing targets,
particularly the major ones, and yet, new backend developments and small
processor (AVR, MSP430) could potentially benefit from that. My opinion is that
the availability of such options will not defeat the “almost entirely
target-independent” nature of the IR optimiser, as the transformations would
still be target-agnostic, except that targets would be able to decide which ones
are more convenient for them, or disable the non desirable ones. Does this sound
ok or feasible?.
Providing target options/overrides to code that is supposed to be
target-independent sounds self-defeating to me. I doubt that proposal would gain
much support.
Of course, if you're customizing LLVM for your own out-of-trunk backend, you
can do anything you'd like if you're willing to maintain the custom
patches.
I suggest that you file a bug report showing an exact problem caused by 1 of the
icmp/select transforms.
If it can be done using an in-trunk backend like AVR or MSP430, then we'll
have a real example showing the harm and hopefully better ideas about how to
solve the bug.
+1
I think that it would really benefit us to see some specific examples of the
problems here. There are a couple of questions here that I think are worth
considering:
 1. What is the most-useful canonical form, in terms of enabling other
transformations, including enabling high-quality instruction selection?
We've long recognized a tension inherent in using select as part of our
canonical form. It enables certain optimizations easily (i.e., using only
intra-basic-block reasoning), but also inhibits otherwise-useful
intra-basic-block optimizations. (there's an interesting discussion of this,
e.g., https://bugs.llvm.org/show_bug.cgi?id=34603#c19). We've even stopped
forming selects as aggressively early in the pipeline, although we don't
turn selects into control flow earlier in the pipeline, and they're still
considered canonical. That having been said, we do let TTI affect this in
various ways already, and doing more of that might be okay - but we need to be
careful: you can help the simple cases but harm the complex cases by doing this.
 2. If the icmp/select form is most useful, how can that best be converted into
branches (when the canonical form does not map directly onto the best
instruction sequences)?
This strikes me as a special case of a general problem of mapping data-flow IRs
onto CFGs with branches. That problem has been studied for many years. For
example, see:
Bahmann, Helge, Nico Reissmann, Magnus Jahre, and Jan Christian Meyer.
"Perfect reconstructability of control flow from demand dependence
graphs." ACM Transactions on Architecture and Code Optimization (TACO) 11,
no. 4 (2015): 66. : https://ai.google/research/pubs/pub43246.pdf
And I can certainly imagine LLVM having a good, general infrastructure for
solving this problem - and that benefiting many targets. I don't know with
how much of the select -> ALU ops this kind of approach can deal, however.
 -Hal
On Tue, Oct 1, 2019 at 5:51 PM Joan Lluch <joan.lluch at
icloud.com<mailto:joan.lluch at icloud.com>> wrote:
Hi Sanjay,
Thanks for your reply.
So yes, the IR optimizer (instcombine is the specific pass) sometimes turns icmp
(and select) sequences into ALU ops. Instcombine is almost entirely
*target-independent* and should remain that way. The (sometimes unfortunate)
decision to create shifts were made based on popular targets of the time
(PowerPC and/or x86), and other targets may have suffered because of that.
Yes, that’s what I actually found that I didn’t anticipate.
We've been trying to reverse those canonicalizations in IR over the past few
years when the choice is clearly not always optimal, but it's not easy. To
avoid perf regressions, you first have to make the backend/codegen aware of the
alternate pattern that includes icmp/select and transform that to math/logic
(translate instcombine code to DAGCombiner). Then, you have to remove the
transform from instcombine and replace it with the reverse transform. This can
uncover infinite loops and other problems within instcombine.
I understand that. In any case, I am glad that at least this is acknowledged as
some kind of flaw of the LLVM system, particularly for the optimal
implementation of small processor backends. As these targets generally have
cheap branches and do not generally have selects or multi-bit shifts, they
hardly benefit from transformations involving shifts or aggressively attempting
to replace jumps.
So to finally answer the question: If you can transform the shift into an
alternate sequence with a "setcc" DAG node in your target's
"ISelLowering" code, that's the easiest way forward. Otherwise,
you have to weigh the impact of each target-independent transform on every
target.
Yes, that’s what I have been doing all the time. My backend contains a lot of
code only to reverse undesired LLVM transformations, and yet the resulting final
assembly is not as good as it could be, because it is often hard or impossible
to identify all sources of improvement. It’s ironical that some older, less
sophisticated compilers (and GCC) produce /much/ better code than LLVM for
simple architectures.
In order to give better support to current and future implementations of small
processor targets, I wonder if instead of attempting to implement a general
solution, we could implement a set of compiler flags (or code hooks) that
targets could use to optionally disable undesired IR optimiser actions, however
not affecting anything of the current behaviour as long as these flags are not
used. If we had that, nothing would need to be done on the existing targets,
particularly the major ones, and yet, new backend developments and small
processor (AVR, MSP430) could potentially benefit from that. My opinion is that
the availability of such options will not defeat the “almost entirely
target-independent” nature of the IR optimiser, as the transformations would
still be target-agnostic, except that targets would be able to decide which ones
are more convenient for them, or disable the non desirable ones. Does this sound
ok or feasible?.
John
On 1 Oct 2019, at 17:20, Sanjay Patel <spatel at
rotateright.com<mailto:spatel at rotateright.com>> wrote:
First, let's agree on terminology:
1. We're in LLVM. Clang has little or nothing to do with these questions
from the perspective of LLVM developers.
2. The IR optimizer (also known as the middle-end and invoked via 'opt')
is what takes LLVM IR from a front-end (clang is just 1 example) and transforms
it to different LLVM IR for easier target-specific optimization.
3. The back-end (invoked using 'llc') is what takes LLVM IR and turns it
into assembly for your target. Codegen is 1 part of the backend.
So yes, the IR optimizer (instcombine is the specific pass) sometimes turns icmp
(and select) sequences into ALU ops. Instcombine is almost entirely
*target-independent* and should remain that way. The (sometimes unfortunate)
decision to create shifts were made based on popular targets of the time
(PowerPC and/or x86), and other targets may have suffered because of that.
We've been trying to reverse those canonicalizations in IR over the past few
years when the choice is clearly not always optimal, but it's not easy. To
avoid perf regressions, you first have to make the backend/codegen aware of the
alternate pattern that includes icmp/select and transform that to math/logic
(translate instcombine code to DAGCombiner). Then, you have to remove the
transform from instcombine and replace it with the reverse transform. This can
uncover infinite loops and other problems within instcombine.
It's often not clear which form of IR will lead to better optimizations
within the IR optimizer itself. We favor the shortest IR sequence in most cases.
But if there's a tie, we have to make a judgement about what is easier to
analyze/transform when viewed within a longer sequence of IR.
So to finally answer the question: If you can transform the shift into an
alternate sequence with a "setcc" DAG node in your target's
"ISelLowering" code, that's the easiest way forward. Otherwise,
you have to weigh the impact of each target-independent transform on every
target.
On Mon, Sep 30, 2019 at 5:31 PM Craig Topper <craig.topper at
gmail.com<mailto:craig.topper at gmail.com>> wrote:
For the MSP430 example, I'm guess its InstCombiner::transformSExtICmp or
InstCombiner::transformZExtICmp
~Craig
On Mon, Sep 30, 2019 at 2:21 PM Support IMAP <support at
sweetwilliamsl.com<mailto:support at sweetwilliamsl.com>> wrote:
Hi all,
Ok, I just found a much simpler example of the same issue.
Consider the following code
int cmpge32_0(long a) {
  return a>=0;
}
Compiled for the MSP430 with -O1 or -Os results in the following:
; Function Attrs: norecurse nounwind readnone
define dso_local i16 @cmpge32_0(i32 %a) local_unnamed_addr #0 {
entry:
  %a.lobit = lshr i32 %a, 31
  %0 = trunc i32 %a.lobit to i16
  %.not = xor i16 %0, 1
  ret i16 %.not
}
The backend then turns this into the following totally suboptimal code:
cmpge32_0:
mov r13, r12
inv r12
swpb r12
mov.b r12, r12
clrc
rrc r12
rra r12
rra r12
rra r12
rra r12
rra r12
rra r12
ret
.Lfunc_end0:
.size cmpge32_0, .Lfunc_end0-cmpge32_0
The cause of this anomaly is again the presence of the Shift instruction
(%a.lobit = lshr i32 %a, 31) at the IR level, which is hard to handle by the
backend.
The same C code compiled with -O0 creates the following IR code excerpt instead
of the lshr-trunc code sequence
  %cmp = icmp sge i32 %0, 0
  %conv = zext i1 %cmp to i16
This compiles into MUCH better code for the MSP430 architecture (and virtually
any 16 bit architecture not supporting multiple shifts).
It would be desirable that LLVM would just leave the comparison as is, also for 
-O1 and above.
So Please, can somebody point me to the LLVM class or function that performs the
transformation of the comparison above into the undesired shift, so I can
investigate what’s going on, or whether there’s something I can do?
That would be really appreciated.
John
Hi Roman
Not exactly, this is IR after optimizations, this is not what clang produces.
To see what clang produces you want to pass -O0.
All optimizations beyond that are done by LLVM.
Ok, I understand such naming convention, but it is still something that happens
at the IR code generation steps, and therefore the backend has little to do
about.
So, what are actually the hooks that I can implement or investigate to modify
the undesired behaviour?
John
On 30 Sep 2019, at 13:35, Roman Lebedev <lebedev.ri at
gmail.com<mailto:lebedev.ri at gmail.com>> wrote:
On Mon, Sep 30, 2019 at 11:52 AM Joan Lluch <joan.lluch at
icloud.com<mailto:joan.lluch at icloud.com>> wrote:
Hi Roman,
Is "test" actually an implementation of a 64-bit-wide multiplication
compiler-rt builtin?
Then i'd think the main problem is that it is being optimized in the
first place, you could end up with endless recursion…
No, this is not a compiler-rt builtin. My example is of course incidentally
taken from the implementation of a signed multiply, but as said, it has nothing
to do with rt-builtins, I'm just using that code to show the issue. This
function can’t create a recursion because it’s named ’test’, unlike any
rt-buitin. You can replace the multiply in the source code by an addition, if
you want to avoid calling rt-functions, but this does not change what I attempt
to show. Also It’s not meant to be 64 bit wide, but 32 bit wide, because the
targets I’m testing are 16 bit, so ints are 16 bit and longs are 32 bit. This is
again the function I am testing:
long test (long a, long b)
{
int neg = 0;
long res;
if (a < 0)
{
  a = -a;
  neg = 1;
}
if (b < 0)
{
  b = -b;
  neg = !neg;
}
res = a*b;
if (neg)
  res = -res;
return res;
}
LLVM, not clang.
I’m not sure about what you mean by that. The shown LLVM IR code is created by
executing "clang” command line, so that’s what I attempt to show.
Not exactly, this is IR after optimizations, this is not what clang produces.
To see what clang produces you want to pass -O0.
All optimizations beyond that are done by LLVM.
So it’s actually the front-end that does such undesired optimisations sometimes,
not only the LLVM back-end. This is in part why I am saying this is not right.
See copied again the IR code that gets generated for the C code that I posted
before. This IR code, including the presence of expensive shifts ( %a.lobit =
lshr i32 %a, 31)  is generated when -mllvm -phi-node-folding-threshold=1 is
specified in the command line, or when the Target implements
getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) to return TCC_Expensive
for operator types that are bigger than the default target register size.
; ModuleID = 'main.c'
source_filename = "main.c"
target datalayout =
"e-m:e-p:16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S16"
target triple = "msp430"
; Function Attrs: norecurse nounwind optsize readnone
define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 {
entry:
%cmp = icmp slt i32 %a, 0
%sub = sub nsw i32 0, %a
%spec.select = select i1 %cmp, i32 %sub, i32 %a
%a.lobit = lshr i32 %a, 31
%0 = trunc i32 %a.lobit to i16
%cmp1 = icmp slt i32 %b, 0
br i1 %cmp1, label %if.then2, label %if.end4
if.then2:                                         ; preds = %entry
%sub3 = sub nsw i32 0, %b
%1 = xor i16 %0, 1
br label %if.end4
if.end4:                                          ; preds = %if.then2, %entry
%b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ]
%neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ]
%mul = mul nsw i32 %b.addr.0, %spec.select
%tobool5 = icmp eq i16 %neg.1, 0
%sub7 = sub nsw i32 0, %mul
%spec.select18 = select i1 %tobool5, i32 %mul, i32 %sub7
ret i32 %spec.select18
}
attributes #0 = { norecurse nounwind optsize readnone
"correctly-rounded-divide-sqrt-fp-math"="false"
"disable-tail-calls"="false"
"less-precise-fpmad"="false"
"min-legal-vector-width"="0"
"no-frame-pointer-elim"="false"
"no-infs-fp-math"="false"
"no-jump-tables"="false"
"no-nans-fp-math"="false"
"no-signed-zeros-fp-math"="false"
"no-trapping-math"="false"
"stack-protector-buffer-size"="8"
"unsafe-fp-math"="false"
"use-soft-float"="false" }
!llvm.module.flags = !{!0}
!llvm.ident = !{!1}
!0 = !{i32 1, !"wchar_size", i32 2}
!1 = !{!"clang version 9.0.0 (https://github.com/llvm/llvm-project.git
6f7deba43dd25fb7b3eca70f9c388ec9174f455a)"}
As you can see, Clang produces a 31 bit wide shift right ( %a.lobit = lshr i32
%a, 31) That’s the fourth instruction on the IR code above. So a shift is
produced instead of creating a jump to a new block, as it should be the case as
per the C source code.
Just as a matter of information. This is the implementation of the
getOperationCost function that causes ‘clang’ to correctly replace selects by
branches (desirable), but to generate shifts to fold expensive selects
(undesirable)
unsigned CPU74TTIImpl::getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy)
{
// Big types are expensive
unsigned OpSize = Ty->getScalarSizeInBits();
if ( OpSize > 16 )
  return TTI::TCC_Expensive;
return BaseT::getOperationCost(Opcode, Ty, OpTy);
}
If the getOperationCost above function was not implemented, then clang would
generate the usual series of ‘selects’. But this is even worse because selects
imply speculative execution of expensive instructions, or duplicate branching
created by the backend, which can’t be easily avoided.
Ideally, the IR code above should just place an ‘if.then' block for the if
(a < 0) statement in the C source code, instead of attempting to replace a
select by a shift (!)
If you want to play with these two scenarios, (1) IR code generated with
branches, and (2) IR code generated with selects. This can easily be reproduced
for the MSP430 target by compiling with the following options
(1)  -mllvm -phi-node-folding-threshold=1 -c -S -Os
(2)  -mllvm -phi-node-folding-threshold=2 -c -S -Os
For 16 bit targets without selects, or expensive selects, the overall code is
better with (1) because that prevents the creation of a different jump for every
‘select’ that (2) would cause. However, the presence of the ‘shift’ instruction
for (1) spoils it all.
Again, ideally, the use of shifts as a replacement of selects should be avoided,
and an “if.then" block should be used as per the original C code.
I hope this is clear now.
John.
On 29 Sep 2019, at 15:57, Roman Lebedev <lebedev.ri at
gmail.com<mailto:lebedev.ri at gmail.com>> wrote:
On Sun, Sep 29, 2019 at 3:35 PM Joan Lluch via llvm-dev
<llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
wrote:
Hi Sanjay,
Actually, the CodeGenPrepare::optimizeSelectInst is not doing the best it could
do in some circumstances: The case of “OptSize" for targets not supporting
Select was already mentioned to be detrimental.
For targets that actually have selects, but branches are cheap and generally
profitable, particularly for expensive operators, the optimizeSelectInst
function does not do good either. The function tries to identify consecutive
selects with the same condition in order to avoid duplicate branches, which is
ok, but then this effort is discarded in isFormingBranchFromSelectProfitable
because the identified condition is used more than once (on the said two
consecutive selects, of course), which defeats the whole purpose of checking for
them, resulting in poor codegen.
Yet another issue is that Clang attempts to replace ‘selects’ in the source
code, by supposedly optimised code that is not ok for all targets. One example
is this:
LLVM, not clang.
long test (long a, long b)
{
int neg = 0;
long res;
if (a < 0)
{
 a = -a;
 neg = 1;
}
if (b < 0)
{
 b = -b;
 neg = !neg;
}
res = a*b; //(unsigned long)a / (unsigned long)b;  // will call __udivsi3
if (neg)
 res = -res;
return res;
}
This gets compiled into
; Function Attrs: norecurse nounwind readnone
define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 {
entry:
%cmp = icmp slt i32 %a, 0
%sub = sub nsw i32 0, %a
%a.addr.0 = select i1 %cmp, i32 %sub, i32 %a
%a.lobit = lshr i32 %a, 31
%0 = trunc i32 %a.lobit to i16
%cmp1 = icmp slt i32 %b, 0
br i1 %cmp1, label %if.then2, label %if.end4
if.then2:                                         ; preds = %entry
%sub3 = sub nsw i32 0, %b
%1 = xor i16 %0, 1
br label %if.end4
if.end4:                                          ; preds = %if.then2, %entry
%b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ]
%neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ]
%mul = mul nsw i32 %b.addr.0, %a.addr.0
%tobool5 = icmp eq i16 %neg.1, 0
%sub7 = sub nsw i32 0, %mul
%res.0 = select i1 %tobool5, i32 %mul, i32 %sub7
ret i32 %res.0
}
The offending part here is this:  %a.lobit = lshr i32 %a, 31 . Instead of just
creating a “select” instruction, as the original code suggested with the if (a
< 0) { neg = 1;} statements, the front-end produces a lshr which is very
expensive for small architectures, and makes it very difficult for the backend
to fold it again into an actual select (or branch). In my opinion, the original
C code should have produced a “select” and give the backend the opportunity to
optimise it if required. I think that the frontend should perform only target
independent optimisations.
You didn't specify how you compile that code.
We could also get: https://godbolt.org/z/B-5lj1
Which can actually be folded further to just
long test(long a, long b) {
 return a * b;
}
Is "test" actually an implementation of a 64-bit-wide multiplication
compiler-rt builtin?
Then i'd think the main problem is that it is being optimized in the
first place, you could end up with endless recursion...
I posted before my view that LLVM is clearly designed to satisfy big boys such
as the x86 and ARM targets. This means that, unfortunately, it makes too many
general assumptions about what’s cheap, without providing enough hooks to cancel
arbitrary optimisations. As I am implementing backends for 8 or 16 bit targets,
I find myself doing a lot of work just to reverse optimisations that should have
not been applied in the first place. My example above is an instance of a code
mutation performed by the frontend that is not desirable. Existing 8 and 16 bit
trunk targets (particularly the MSP430 and the AVR) are also negatively affected
by the excessively liberal use of shifts by LLVM.
The CodeGenPrepare::optimizeSelectInst function needs some changes to respect
targets with no selects, and targets that may want to avoid expensive
speculative executions.
John
Roman
On 25 Sep 2019, at 16:00, Sanjay Patel <spatel at
rotateright.com<mailto:spatel at rotateright.com>> wrote:
Changing the order of the checks in CodeGenPrepare::optimizeSelectInst() sounds
good to me.
But you may need to go further for optimum performance. For example, we may be
canonicalizing math/logic IR patterns into 'select' such as in the
recent:
https://reviews.llvm.org/D67799
So if you want those to become ALU ops again rather than branches, then you need
to do the transform later in the backend. That is, you want to let DAGCombiner
run its set of transforms on 'select' nodes.
On Wed, Sep 25, 2019 at 4:03 AM Joan Lluch via cfe-dev <cfe-dev at
lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Craig,
Thank you for your reply. I have started looking at “CodeGenPrepare” and I
assume you reffer to CodeGenPrepare::optimizeSelectInst. I will try to play a
bit with that possibly later today. At first glance, it looks to me that for
targets that do not support ’select’ at all, the fact that the function exits
early for ‘OptSize’ can be detrimental, because this will just leave ALL
existing selects in the code anyway. As said, I will try to play with that
later, but right now it looks to me that maybe we should check  for
TLI->isSelectSupported earlier in the function, to get some more
opportunities to such targets without explicit ’select’ support?
Thanks
John
On 25 Sep 2019, at 08:59, Craig Topper <craig.topper at
gmail.com<mailto:craig.topper at gmail.com>> wrote:
There is code in CodeGenPrepare.cpp that can turn selects into branches that
tries to account for multiple selects sharing the same condition. It doesn't
look like either AVR or MSP430 enable that code though.
~Craig
On Tue, Sep 24, 2019 at 11:27 PM Joan Lluch via cfe-dev <cfe-dev at
lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi Roman,
Thank you for your reply. I understand your point. I just want to add something
to clarify my original post in relation to your reply.
There are already implemented 8-bit and 16-bit backends, namely the AVR and the
MSP430, which already "aggressively convert selects into branches”, which
already benefit (as they are) from setting "phi-node-folding-threshold’ to
1 or zero. This is because otherwise Clang will generate several selects
depending on the same “icmp”. These backends are unable to optimise that, and
they just create a comparison and a conditional branch for every “select” in the
IR code, in spite that the original C code was already written in a much better
way. So the resulting effect is the presence of redundant comparisons and
branches in the final code, with a detrimental of generated code quality.
The above gets improved by setting "phi-node-folding-threshold’ to 1
because some of these extra ‘selects' are no longer there so the backend
stops generating redundant code.
John.
On 21 Sep 2019, at 14:48, Roman Lebedev <lebedev.ri at
gmail.com<mailto:lebedev.ri at gmail.com>> wrote:
On Sat, Sep 21, 2019 at 3:18 PM Joan Lluch via cfe-dev
<cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>> wrote:
Hi all,
For my custom architecture, I want to relax the CFG simplification pass, and any
other passes replacing conditional branches.
I found that the replacement of conditional branches by “select" and other
instructions is often too aggressive, and this causes inefficient code for my
target as in most cases branches would be cheaper.
For example, considering the following c code:
long test (long a, long b)
{
int neg = 0;
long res;
if (a < 0)
{
a = -a;
neg = 1;
}
res = a*b;
if (neg)
res = -res;
return res;
}
This code can be simplified in c, but it’s just an example to show the point.
The code above gets compiled like this (-Oz flag):
; Function Attrs: minsize norecurse nounwind optsize readnone
define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 {
entry:
%cmp = icmp slt i32 %a, 0
%sub = sub nsw i32 0, %a
%a.addr.0 = select i1 %cmp, i32 %sub, i32 %a
%mul = mul nsw i32 %a.addr.0, %b
%sub2 = sub nsw i32 0, %mul
%res.0 = select i1 %cmp, i32 %sub2, i32 %mul
ret i32 %res.0
}
All branching was removed and replaced by ‘select’ instructions. For my
architecture, it would be desirable to keep the original branches in most cases,
because even simple 32 bit operations are too expensive to speculatively execute
them, and branches are cheap.
Setting  'phi-node-folding-threshold’ to 1 or even 0 (instead of the default
2), definitely improves the situation in many cases, but Clang still creates
many instances of ‘select’ instructions, which are detrimental to my target. I
am unsure about where are they created, as I believe that the simplifycfg pass
does not longer create them.
You definitively can't ban llvm passes/clang from creating select's.
So the question is: Are there any other hooks in clang, or custom code that I
can implement, to relax the creation of ’select’ instructions and make it
preserve branches in the original c code?
I think this is backwards.
Sure, you could maybe disable most of the folds that produce selects.
That may be good for final codegen, but will also affect other passes
since not everything deals with 2-node PHI as good as wit selects.
But, what happens if you still get the select-y IR?
Doesn't matter how, could be hand-written.
I think you might want to instead aggressively convert selects into
branches in backend.
Thanks,
John
Roman
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org<mailto:cfe-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20191002/4c4b7930/attachment-0001.html>
Joan Lluch via llvm-dev
2019-Oct-03  09:26 UTC
[llvm-dev] [cfe-dev] CFG simplification question, and preservation of branching in the original code
Hi all,> On 2 Oct 2019, at 14:34, Sanjay Patel <spatel at rotateright.com> wrote > Providing target options/overrides to code that is supposed to be target-independent sounds self-defeating to me. I doubt that proposal would gain much support. > Of course, if you're customizing LLVM for your own out-of-trunk backend, you can do anything you'd like if you're willing to maintain the custom patches. > I suggest that you file a bug report showing an exact problem caused by 1 of the icmp/select transforms. > If it can be done using an in-trunk backend like AVR or MSP430, then we'll have a real example showing the harm and hopefully better ideas about how to solve the bug.I fully understand your statement about target-independent code. This has no possible discussion in a general way. However one of my points is that at least some of this code is not really that “target independent” because in fact it depends on cheap features available on ALL targets, which is not the case. That’s why I think that some flexibility must be given. I’m not advocating for “target options/overrides”, but general ones that can be optionally chosen by particular targets. I think that at the end of the day it’s just a matter of language, not a matter of fact. If we look at the problem from some distance, we find that there are already options to alter IR code generation. For example the -phi-node-folding-threshold option is one of them. Ok, no target seems to use that particular one, but it’s there already, and it modifies LLVM-IR generation by telling how aggressively LLVM turns branches into selects. I found that this option produces better LLVM-IR input for the target I’m working on, as well as both the MSP430 and AVR targets. I can of course try to apply my custom patches and maintain them, but although I was able to implement a backend, the complexity and extend of the whole thing is beyond my abilities and resources. Furthermore, I regard this as a LLVM wide problem, and that’s why I try to raise some awareness. Following your suggestion, I just filed a bug with a very simple example that to make things easier is not attributable to the InstCombine pass in this case but to transformations happening during LLVM codegen. https://bugs.llvm.org/show_bug.cgi?id=43542 John> On 2 Oct 2019, at 14:34, Sanjay Patel <spatel at rotateright.com> wrote: > > On Tue, Oct 1, 2019 at 5:51 PM Joan Lluch <joan.lluch at icloud.com <mailto:joan.lluch at icloud.com>> wrote: > In order to give better support to current and future implementations of small processor targets, I wonder if instead of attempting to implement a general solution, we could implement a set of compiler flags (or code hooks) that targets could use to optionally disable undesired IR optimiser actions, however not affecting anything of the current behaviour as long as these flags are not used. If we had that, nothing would need to be done on the existing targets, particularly the major ones, and yet, new backend developments and small processor (AVR, MSP430) could potentially benefit from that. My opinion is that the availability of such options will not defeat the “almost entirely target-independent” nature of the IR optimiser, as the transformations would still be target-agnostic, except that targets would be able to decide which ones are more convenient for them, or disable the non desirable ones. Does this sound ok or feasible?. > > Providing target options/overrides to code that is supposed to be target-independent sounds self-defeating to me. I doubt that proposal would gain much support. > Of course, if you're customizing LLVM for your own out-of-trunk backend, you can do anything you'd like if you're willing to maintain the custom patches. > I suggest that you file a bug report showing an exact problem caused by 1 of the icmp/select transforms. > If it can be done using an in-trunk backend like AVR or MSP430, then we'll have a real example showing the harm and hopefully better ideas about how to solve the bug. > > > On Tue, Oct 1, 2019 at 5:51 PM Joan Lluch <joan.lluch at icloud.com <mailto:joan.lluch at icloud.com>> wrote: > Hi Sanjay, > > Thanks for your reply. > >> So yes, the IR optimizer (instcombine is the specific pass) sometimes turns icmp (and select) sequences into ALU ops. Instcombine is almost entirely *target-independent* and should remain that way. The (sometimes unfortunate) decision to create shifts were made based on popular targets of the time (PowerPC and/or x86), and other targets may have suffered because of that. > > Yes, that’s what I actually found that I didn’t anticipate. > >> We've been trying to reverse those canonicalizations in IR over the past few years when the choice is clearly not always optimal, but it's not easy. To avoid perf regressions, you first have to make the backend/codegen aware of the alternate pattern that includes icmp/select and transform that to math/logic (translate instcombine code to DAGCombiner). Then, you have to remove the transform from instcombine and replace it with the reverse transform. This can uncover infinite loops and other problems within instcombine. > > I understand that. In any case, I am glad that at least this is acknowledged as some kind of flaw of the LLVM system, particularly for the optimal implementation of small processor backends. As these targets generally have cheap branches and do not generally have selects or multi-bit shifts, they hardly benefit from transformations involving shifts or aggressively attempting to replace jumps. > >> So to finally answer the question: If you can transform the shift into an alternate sequence with a "setcc" DAG node in your target's "ISelLowering" code, that's the easiest way forward. Otherwise, you have to weigh the impact of each target-independent transform on every target. > > Yes, that’s what I have been doing all the time. My backend contains a lot of code only to reverse undesired LLVM transformations, and yet the resulting final assembly is not as good as it could be, because it is often hard or impossible to identify all sources of improvement. It’s ironical that some older, less sophisticated compilers (and GCC) produce /much/ better code than LLVM for simple architectures. > > In order to give better support to current and future implementations of small processor targets, I wonder if instead of attempting to implement a general solution, we could implement a set of compiler flags (or code hooks) that targets could use to optionally disable undesired IR optimiser actions, however not affecting anything of the current behaviour as long as these flags are not used. If we had that, nothing would need to be done on the existing targets, particularly the major ones, and yet, new backend developments and small processor (AVR, MSP430) could potentially benefit from that. My opinion is that the availability of such options will not defeat the “almost entirely target-independent” nature of the IR optimiser, as the transformations would still be target-agnostic, except that targets would be able to decide which ones are more convenient for them, or disable the non desirable ones. Does this sound ok or feasible?. > > John > > > >> On 1 Oct 2019, at 17:20, Sanjay Patel <spatel at rotateright.com <mailto:spatel at rotateright.com>> wrote: >> >> First, let's agree on terminology: >> 1. We're in LLVM. Clang has little or nothing to do with these questions from the perspective of LLVM developers. >> 2. The IR optimizer (also known as the middle-end and invoked via 'opt') is what takes LLVM IR from a front-end (clang is just 1 example) and transforms it to different LLVM IR for easier target-specific optimization. >> 3. The back-end (invoked using 'llc') is what takes LLVM IR and turns it into assembly for your target. Codegen is 1 part of the backend. >> >> So yes, the IR optimizer (instcombine is the specific pass) sometimes turns icmp (and select) sequences into ALU ops. Instcombine is almost entirely *target-independent* and should remain that way. The (sometimes unfortunate) decision to create shifts were made based on popular targets of the time (PowerPC and/or x86), and other targets may have suffered because of that. >> >> We've been trying to reverse those canonicalizations in IR over the past few years when the choice is clearly not always optimal, but it's not easy. To avoid perf regressions, you first have to make the backend/codegen aware of the alternate pattern that includes icmp/select and transform that to math/logic (translate instcombine code to DAGCombiner). Then, you have to remove the transform from instcombine and replace it with the reverse transform. This can uncover infinite loops and other problems within instcombine. >> >> It's often not clear which form of IR will lead to better optimizations within the IR optimizer itself. We favor the shortest IR sequence in most cases. But if there's a tie, we have to make a judgement about what is easier to analyze/transform when viewed within a longer sequence of IR. >> >> So to finally answer the question: If you can transform the shift into an alternate sequence with a "setcc" DAG node in your target's "ISelLowering" code, that's the easiest way forward. Otherwise, you have to weigh the impact of each target-independent transform on every target. >> >> >> On Mon, Sep 30, 2019 at 5:31 PM Craig Topper <craig.topper at gmail.com <mailto:craig.topper at gmail.com>> wrote: >> For the MSP430 example, I'm guess its InstCombiner::transformSExtICmp or InstCombiner::transformZExtICmp >> >> ~Craig >> >> >> On Mon, Sep 30, 2019 at 2:21 PM Support IMAP <support at sweetwilliamsl.com <mailto:support at sweetwilliamsl.com>> wrote: >> Hi all, >> >> Ok, I just found a much simpler example of the same issue. >> >> Consider the following code >> >> int cmpge32_0(long a) { >> return a>=0; >> } >> >> Compiled for the MSP430 with -O1 or -Os results in the following: >> >> ; Function Attrs: norecurse nounwind readnone >> define dso_local i16 @cmpge32_0(i32 %a) local_unnamed_addr #0 { >> entry: >> %a.lobit = lshr i32 %a, 31 >> %0 = trunc i32 %a.lobit to i16 >> %.not = xor i16 %0, 1 >> ret i16 %.not >> } >> >> The backend then turns this into the following totally suboptimal code: >> >> cmpge32_0: >> mov r13, r12 >> inv r12 >> swpb r12 >> mov.b r12, r12 >> clrc >> rrc r12 >> rra r12 >> rra r12 >> rra r12 >> rra r12 >> rra r12 >> rra r12 >> ret >> .Lfunc_end0: >> .size cmpge32_0, .Lfunc_end0-cmpge32_0 >> >> >> The cause of this anomaly is again the presence of the Shift instruction (%a.lobit = lshr i32 %a, 31) at the IR level, which is hard to handle by the backend. >> >> The same C code compiled with -O0 creates the following IR code excerpt instead of the lshr-trunc code sequence >> >> %cmp = icmp sge i32 %0, 0 >> %conv = zext i1 %cmp to i16 >> >> This compiles into MUCH better code for the MSP430 architecture (and virtually any 16 bit architecture not supporting multiple shifts). >> It would be desirable that LLVM would just leave the comparison as is, also for -O1 and above. >> >> >> So Please, can somebody point me to the LLVM class or function that performs the transformation of the comparison above into the undesired shift, so I can investigate what’s going on, or whether there’s something I can do? >> >> That would be really appreciated. >> >> John >> >> >> Hi Roman >> >>> Not exactly, this is IR after optimizations, this is not what clang produces. >>> To see what clang produces you want to pass -O0. >>> All optimizations beyond that are done by LLVM. >> >> >> Ok, I understand such naming convention, but it is still something that happens at the IR code generation steps, and therefore the backend has little to do about. >> >> So, what are actually the hooks that I can implement or investigate to modify the undesired behaviour? >> >> John >> >> >>> On 30 Sep 2019, at 13:35, Roman Lebedev <lebedev.ri at gmail.com <mailto:lebedev.ri at gmail.com>> wrote: >>> >>> On Mon, Sep 30, 2019 at 11:52 AM Joan Lluch <joan.lluch at icloud.com <mailto:joan.lluch at icloud.com>> wrote: >>>> >>>> Hi Roman, >>>> >>>> Is "test" actually an implementation of a 64-bit-wide multiplication >>>> compiler-rt builtin? >>>> Then i'd think the main problem is that it is being optimized in the >>>> first place, you could end up with endless recursion… >>>> >>>> >>>> No, this is not a compiler-rt builtin. My example is of course incidentally taken from the implementation of a signed multiply, but as said, it has nothing to do with rt-builtins, I'm just using that code to show the issue. This function can’t create a recursion because it’s named ’test’, unlike any rt-buitin. You can replace the multiply in the source code by an addition, if you want to avoid calling rt-functions, but this does not change what I attempt to show. Also It’s not meant to be 64 bit wide, but 32 bit wide, because the targets I’m testing are 16 bit, so ints are 16 bit and longs are 32 bit. This is again the function I am testing: >>>> >>>> >>>> long test (long a, long b) >>>> { >>>> int neg = 0; >>>> long res; >>>> >>>> if (a < 0) >>>> { >>>> a = -a; >>>> neg = 1; >>>> } >>>> >>>> if (b < 0) >>>> { >>>> b = -b; >>>> neg = !neg; >>>> } >>>> >>>> res = a*b; >>>> >>>> if (neg) >>>> res = -res; >>>> >>>> return res; >>>> } >>>> >>>> >>>> >>>> LLVM, not clang. >>>> >>>> >>>> I’m not sure about what you mean by that. The shown LLVM IR code is created by executing "clang” command line, so that’s what I attempt to show. >>> Not exactly, this is IR after optimizations, this is not what clang produces. >>> To see what clang produces you want to pass -O0. >>> All optimizations beyond that are done by LLVM. >>> >>>> So it’s actually the front-end that does such undesired optimisations sometimes, not only the LLVM back-end. This is in part why I am saying this is not right. See copied again the IR code that gets generated for the C code that I posted before. This IR code, including the presence of expensive shifts ( %a.lobit = lshr i32 %a, 31) is generated when -mllvm -phi-node-folding-threshold=1 is specified in the command line, or when the Target implements getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) to return TCC_Expensive for operator types that are bigger than the default target register size. >>>> >>>> >>>> >>>> ; ModuleID = 'main.c' >>>> source_filename = "main.c" >>>> target datalayout = "e-m:e-p:16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S16" >>>> target triple = "msp430" >>>> >>>> ; Function Attrs: norecurse nounwind optsize readnone >>>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 { >>>> entry: >>>> %cmp = icmp slt i32 %a, 0 >>>> %sub = sub nsw i32 0, %a >>>> %spec.select = select i1 %cmp, i32 %sub, i32 %a >>>> %a.lobit = lshr i32 %a, 31 >>>> %0 = trunc i32 %a.lobit to i16 >>>> %cmp1 = icmp slt i32 %b, 0 >>>> br i1 %cmp1, label %if.then2, label %if.end4 >>>> >>>> if.then2: ; preds = %entry >>>> %sub3 = sub nsw i32 0, %b >>>> %1 = xor i16 %0, 1 >>>> br label %if.end4 >>>> >>>> if.end4: ; preds = %if.then2, %entry >>>> %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ] >>>> %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ] >>>> %mul = mul nsw i32 %b.addr.0, %spec.select >>>> %tobool5 = icmp eq i16 %neg.1, 0 >>>> %sub7 = sub nsw i32 0, %mul >>>> %spec.select18 = select i1 %tobool5, i32 %mul, i32 %sub7 >>>> ret i32 %spec.select18 >>>> } >>>> >>>> attributes #0 = { norecurse nounwind optsize readnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" } >>>> >>>> !llvm.module.flags = !{!0} >>>> !llvm.ident = !{!1} >>>> >>>> !0 = !{i32 1, !"wchar_size", i32 2} >>>> !1 = !{!"clang version 9.0.0 (https://github.com/llvm/llvm-project.git <https://github.com/llvm/llvm-project.git> 6f7deba43dd25fb7b3eca70f9c388ec9174f455a)"} >>>> >>>> >>>> >>>> As you can see, Clang produces a 31 bit wide shift right ( %a.lobit = lshr i32 %a, 31) That’s the fourth instruction on the IR code above. So a shift is produced instead of creating a jump to a new block, as it should be the case as per the C source code. >>>> >>>> Just as a matter of information. This is the implementation of the getOperationCost function that causes ‘clang’ to correctly replace selects by branches (desirable), but to generate shifts to fold expensive selects (undesirable) >>>> >>>> >>>> unsigned CPU74TTIImpl::getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy) >>>> { >>>> // Big types are expensive >>>> unsigned OpSize = Ty->getScalarSizeInBits(); >>>> if ( OpSize > 16 ) >>>> return TTI::TCC_Expensive; >>>> >>>> return BaseT::getOperationCost(Opcode, Ty, OpTy); >>>> } >>>> >>>> If the getOperationCost above function was not implemented, then clang would generate the usual series of ‘selects’. But this is even worse because selects imply speculative execution of expensive instructions, or duplicate branching created by the backend, which can’t be easily avoided. >>>> >>>> Ideally, the IR code above should just place an ‘if.then' block for the if (a < 0) statement in the C source code, instead of attempting to replace a select by a shift (!) >>>> >>>> If you want to play with these two scenarios, (1) IR code generated with branches, and (2) IR code generated with selects. This can easily be reproduced for the MSP430 target by compiling with the following options >>>> (1) -mllvm -phi-node-folding-threshold=1 -c -S -Os >>>> (2) -mllvm -phi-node-folding-threshold=2 -c -S -Os >>>> >>>> For 16 bit targets without selects, or expensive selects, the overall code is better with (1) because that prevents the creation of a different jump for every ‘select’ that (2) would cause. However, the presence of the ‘shift’ instruction for (1) spoils it all. >>>> >>>> Again, ideally, the use of shifts as a replacement of selects should be avoided, and an “if.then" block should be used as per the original C code. >>>> >>>> I hope this is clear now. >>>> >>>> John. >>>> >>>> >>>> On 29 Sep 2019, at 15:57, Roman Lebedev <lebedev.ri at gmail.com <mailto:lebedev.ri at gmail.com>> wrote: >>>> >>>> On Sun, Sep 29, 2019 at 3:35 PM Joan Lluch via llvm-dev >>>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: >>>> >>>> >>>> Hi Sanjay, >>>> >>>> Actually, the CodeGenPrepare::optimizeSelectInst is not doing the best it could do in some circumstances: The case of “OptSize" for targets not supporting Select was already mentioned to be detrimental. >>>> >>>> For targets that actually have selects, but branches are cheap and generally profitable, particularly for expensive operators, the optimizeSelectInst function does not do good either. The function tries to identify consecutive selects with the same condition in order to avoid duplicate branches, which is ok, but then this effort is discarded in isFormingBranchFromSelectProfitable because the identified condition is used more than once (on the said two consecutive selects, of course), which defeats the whole purpose of checking for them, resulting in poor codegen. >>>> >>>> Yet another issue is that Clang attempts to replace ‘selects’ in the source code, by supposedly optimised code that is not ok for all targets. One example is this: >>>> >>>> LLVM, not clang. >>>> >>>> long test (long a, long b) >>>> { >>>> int neg = 0; >>>> long res; >>>> >>>> if (a < 0) >>>> { >>>> a = -a; >>>> neg = 1; >>>> } >>>> >>>> if (b < 0) >>>> { >>>> b = -b; >>>> neg = !neg; >>>> } >>>> >>>> res = a*b; //(unsigned long)a / (unsigned long)b; // will call __udivsi3 >>>> >>>> if (neg) >>>> res = -res; >>>> >>>> return res; >>>> } >>>> >>>> >>>> This gets compiled into >>>> >>>> ; Function Attrs: norecurse nounwind readnone >>>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 { >>>> entry: >>>> %cmp = icmp slt i32 %a, 0 >>>> %sub = sub nsw i32 0, %a >>>> %a.addr.0 = select i1 %cmp, i32 %sub, i32 %a >>>> %a.lobit = lshr i32 %a, 31 >>>> %0 = trunc i32 %a.lobit to i16 >>>> %cmp1 = icmp slt i32 %b, 0 >>>> br i1 %cmp1, label %if.then2, label %if.end4 >>>> >>>> if.then2: ; preds = %entry >>>> %sub3 = sub nsw i32 0, %b >>>> %1 = xor i16 %0, 1 >>>> br label %if.end4 >>>> >>>> if.end4: ; preds = %if.then2, %entry >>>> %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ] >>>> %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ] >>>> %mul = mul nsw i32 %b.addr.0, %a.addr.0 >>>> %tobool5 = icmp eq i16 %neg.1, 0 >>>> %sub7 = sub nsw i32 0, %mul >>>> %res.0 = select i1 %tobool5, i32 %mul, i32 %sub7 >>>> ret i32 %res.0 >>>> } >>>> >>>> The offending part here is this: %a.lobit = lshr i32 %a, 31 . Instead of just creating a “select” instruction, as the original code suggested with the if (a < 0) { neg = 1;} statements, the front-end produces a lshr which is very expensive for small architectures, and makes it very difficult for the backend to fold it again into an actual select (or branch). In my opinion, the original C code should have produced a “select” and give the backend the opportunity to optimise it if required. I think that the frontend should perform only target independent optimisations. >>>> >>>> >>>> You didn't specify how you compile that code. >>>> We could also get: https://godbolt.org/z/B-5lj1 <https://godbolt.org/z/B-5lj1> >>>> Which can actually be folded further to just >>>> long test(long a, long b) { >>>> return a * b; >>>> } >>>> Is "test" actually an implementation of a 64-bit-wide multiplication >>>> compiler-rt builtin? >>>> Then i'd think the main problem is that it is being optimized in the >>>> first place, you could end up with endless recursion... >>>> >>>> I posted before my view that LLVM is clearly designed to satisfy big boys such as the x86 and ARM targets. This means that, unfortunately, it makes too many general assumptions about what’s cheap, without providing enough hooks to cancel arbitrary optimisations. As I am implementing backends for 8 or 16 bit targets, I find myself doing a lot of work just to reverse optimisations that should have not been applied in the first place. My example above is an instance of a code mutation performed by the frontend that is not desirable. Existing 8 and 16 bit trunk targets (particularly the MSP430 and the AVR) are also negatively affected by the excessively liberal use of shifts by LLVM. >>>> >>>> The CodeGenPrepare::optimizeSelectInst function needs some changes to respect targets with no selects, and targets that may want to avoid expensive speculative executions. >>>> >>>> John >>>> >>>> Roman >>>> >>>> On 25 Sep 2019, at 16:00, Sanjay Patel <spatel at rotateright.com <mailto:spatel at rotateright.com>> wrote: >>>> >>>> Changing the order of the checks in CodeGenPrepare::optimizeSelectInst() sounds good to me. >>>> >>>> But you may need to go further for optimum performance. For example, we may be canonicalizing math/logic IR patterns into 'select' such as in the recent: >>>> https://reviews.llvm.org/D67799 <https://reviews.llvm.org/D67799> >>>> >>>> So if you want those to become ALU ops again rather than branches, then you need to do the transform later in the backend. That is, you want to let DAGCombiner run its set of transforms on 'select' nodes. >>>> >>>> On Wed, Sep 25, 2019 at 4:03 AM Joan Lluch via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote: >>>> >>>> >>>> Hi Craig, >>>> >>>> Thank you for your reply. I have started looking at “CodeGenPrepare” and I assume you reffer to CodeGenPrepare::optimizeSelectInst. I will try to play a bit with that possibly later today. At first glance, it looks to me that for targets that do not support ’select’ at all, the fact that the function exits early for ‘OptSize’ can be detrimental, because this will just leave ALL existing selects in the code anyway. As said, I will try to play with that later, but right now it looks to me that maybe we should check for TLI->isSelectSupported earlier in the function, to get some more opportunities to such targets without explicit ’select’ support? >>>> >>>> Thanks >>>> >>>> John >>>> >>>> >>>> On 25 Sep 2019, at 08:59, Craig Topper <craig.topper at gmail.com <mailto:craig.topper at gmail.com>> wrote: >>>> >>>> There is code in CodeGenPrepare.cpp that can turn selects into branches that tries to account for multiple selects sharing the same condition. It doesn't look like either AVR or MSP430 enable that code though. >>>> >>>> ~Craig >>>> >>>> >>>> On Tue, Sep 24, 2019 at 11:27 PM Joan Lluch via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote: >>>> >>>> >>>> Hi Roman, >>>> >>>> Thank you for your reply. I understand your point. I just want to add something to clarify my original post in relation to your reply. >>>> >>>> There are already implemented 8-bit and 16-bit backends, namely the AVR and the MSP430, which already "aggressively convert selects into branches”, which already benefit (as they are) from setting "phi-node-folding-threshold’ to 1 or zero. This is because otherwise Clang will generate several selects depending on the same “icmp”. These backends are unable to optimise that, and they just create a comparison and a conditional branch for every “select” in the IR code, in spite that the original C code was already written in a much better way. So the resulting effect is the presence of redundant comparisons and branches in the final code, with a detrimental of generated code quality. >>>> >>>> The above gets improved by setting "phi-node-folding-threshold’ to 1 because some of these extra ‘selects' are no longer there so the backend stops generating redundant code. >>>> >>>> John. >>>> >>>> >>>> >>>> >>>> On 21 Sep 2019, at 14:48, Roman Lebedev <lebedev.ri at gmail.com <mailto:lebedev.ri at gmail.com>> wrote: >>>> >>>> On Sat, Sep 21, 2019 at 3:18 PM Joan Lluch via cfe-dev >>>> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote: >>>> >>>> >>>> Hi all, >>>> >>>> For my custom architecture, I want to relax the CFG simplification pass, and any other passes replacing conditional branches. >>>> >>>> I found that the replacement of conditional branches by “select" and other instructions is often too aggressive, and this causes inefficient code for my target as in most cases branches would be cheaper. >>>> >>>> For example, considering the following c code: >>>> >>>> long test (long a, long b) >>>> { >>>> int neg = 0; >>>> long res; >>>> >>>> if (a < 0) >>>> { >>>> a = -a; >>>> neg = 1; >>>> } >>>> >>>> res = a*b; >>>> >>>> if (neg) >>>> res = -res; >>>> >>>> return res; >>>> } >>>> >>>> >>>> This code can be simplified in c, but it’s just an example to show the point. >>>> >>>> The code above gets compiled like this (-Oz flag): >>>> >>>> ; Function Attrs: minsize norecurse nounwind optsize readnone >>>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 { >>>> entry: >>>> %cmp = icmp slt i32 %a, 0 >>>> %sub = sub nsw i32 0, %a >>>> %a.addr.0 = select i1 %cmp, i32 %sub, i32 %a >>>> %mul = mul nsw i32 %a.addr.0, %b >>>> %sub2 = sub nsw i32 0, %mul >>>> %res.0 = select i1 %cmp, i32 %sub2, i32 %mul >>>> ret i32 %res.0 >>>> } >>>> >>>> >>>> All branching was removed and replaced by ‘select’ instructions. For my architecture, it would be desirable to keep the original branches in most cases, because even simple 32 bit operations are too expensive to speculatively execute them, and branches are cheap. >>>> >>>> Setting 'phi-node-folding-threshold’ to 1 or even 0 (instead of the default 2), definitely improves the situation in many cases, but Clang still creates many instances of ‘select’ instructions, which are detrimental to my target. I am unsure about where are they created, as I believe that the simplifycfg pass does not longer create them. >>>> >>>> You definitively can't ban llvm passes/clang from creating select's. >>>> >>>> So the question is: Are there any other hooks in clang, or custom code that I can implement, to relax the creation of ’select’ instructions and make it preserve branches in the original c code? >>>> >>>> I think this is backwards. >>>> Sure, you could maybe disable most of the folds that produce selects. >>>> That may be good for final codegen, but will also affect other passes >>>> since not everything deals with 2-node PHI as good as wit selects. >>>> >>>> But, what happens if you still get the select-y IR? >>>> Doesn't matter how, could be hand-written. >>>> >>>> I think you might want to instead aggressively convert selects into >>>> branches in backend. >>>> >>>> Thanks, >>>> >>>> John >>>> >>>> Roman >>>> >>>> _______________________________________________ >>>> LLVM Developers mailing list >>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> >>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> >>>> _______________________________________________ >>>> cfe-dev mailing list >>>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> >>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev> >>>> >>>> >>>> _______________________________________________ >>>> cfe-dev mailing list >>>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> >>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev> >>>> >>>> >>>> >>>> _______________________________________________ >>>> cfe-dev mailing list >>>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> >>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev> >>>> >>>> >>>> >>>> _______________________________________________ >>>> LLVM Developers mailing list >>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> >>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> >>>> >>>> >> >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191003/fe12a30d/attachment-0001.html>
Reasonably Related Threads
- [cfe-dev] CFG simplification question, and preservation of branching in the original code
- [cfe-dev] CFG simplification question, and preservation of branching in the original code
- [cfe-dev] CFG simplification question, and preservation of branching in the original code
- [cfe-dev] CFG simplification question, and preservation of branching in the original code
- [cfe-dev] CFG simplification question, and preservation of branching in the original code