Joan Lluch via llvm-dev
2019-Jun-05 15:58 UTC
[llvm-dev] Strange behaviour of post-legalising optimisations(?)
I come across a situation that I am having a hard time to understand.
When I compile the following code :
char *tst( char *dest, const char *src, unsigned int len )
{
for (int i=0 ; i<len ; i++) {
dest[i] = src[i];
}
return dest;
}
Clang generates this for the ‘for’ body:
for.body: ; preds = %for.cond
%arrayidx = getelementptr inbounds i8, i8* %src, i16 %i.0
%0 = load i8, i8* %arrayidx, align 1, !tbaa !2
%arrayidx1 = getelementptr inbounds i8, i8* %dest, i16 %i.0
store i8 %0, i8* %arrayidx1, align 1, !tbaa !2
%inc = add nuw nsw i16 %i.0, 1
br label %for.cond
This gets converted into this by llc:
Creating new node: t2: i16,ch = CopyFromReg t0, Register:i16 %3
Creating new node: t4: i16,ch = CopyFromReg t0, Register:i16 %0
Creating new node: t5: i16 = add t2, t4
Creating constant: t6: i16 = Constant<0>
Creating new node: t7: i16 = undef
Creating new node: t8: i8,ch = load<(load 1 from %ir.scevgep1, !tbaa !2)>
t0, t5, undef:i16
Creating new node: t10: i16,ch = CopyFromReg t0, Register:i16 %2
Creating new node: t11: i16 = add t10, t4
Creating new node: t12: ch = store<(store 1 into %ir.scevgep, !tbaa !2)>
t8:1, t8, t11, undef:i16
Creating constant: t13: i16 = Constant<1>
Creating new node: t14: i16 = add nuw nsw t4, Constant:i16<1>
Creating new node: t16: ch = CopyToReg t0, Register:i16 %1, t14
Creating new node: t17: ch = TokenFactor t16, t12
Creating new node: t19: ch = br t17, BasicBlock:ch<for.cond 0x10983ff38>
Initial selection DAG: %bb.3 'tst:for.body'
SelectionDAG has 20 nodes:
t0: ch = EntryToken
t4: i16,ch = CopyFromReg t0, Register:i16 %0
t6: i16 = Constant<0>
t2: i16,ch = CopyFromReg t0, Register:i16 %3
t5: i16 = add t2, t4
t8: i8,ch = load<(load 1 from %ir.scevgep1, !tbaa !2)> t0, t5, undef:i16
t14: i16 = add nuw nsw t4, Constant:i16<1>
t16: ch = CopyToReg t0, Register:i16 %1, t14
t10: i16,ch = CopyFromReg t0, Register:i16 %2
t11: i16 = add t10, t4
t12: ch = store<(store 1 into %ir.scevgep, !tbaa !2)> t8:1, t8, t11,
undef:i16
t17: ch = TokenFactor t16, t12
t19: ch = br t17, BasicBlock:ch<for.cond 0x10983ff38>
This is ok:
The array indexes get replaced by ‘add’ instructions (see t5 and t11) that later
on can be folded into appropriate load/store addressing modes.
So far so good.
However, if I replace the original code by this:
char *tst( char *dest, const char *src, unsigned int len )
{
if ( dest ) {
for (int i=0 ; i<len ; i++) {
dest[i] = src[i];
}
}
return dest;
}
I get IDENTICAL output from Clang for the ‘for’ body
But totally different code from llc:
Total amount of phi nodes to update: 0
Creating new node: t2: i16,ch = CopyFromReg t0, Register:i16 %1
Creating constant: t3: i16 = Constant<0>
Creating new node: t4: i16 = undef
Creating new node: t5: i8,ch = load<(load 1 from %ir.lsr.iv1, !tbaa !2)>
t0, t2, undef:i16
Creating new node: t7: i16,ch = CopyFromReg t0, Register:i16 %2
Creating new node: t8: ch = store<(store 1 into %ir.lsr.iv, !tbaa !2)>
t5:1, t5, t7, undef:i16
Creating constant: t9: i16 = Constant<1>
Creating new node: t10: i16 = add t7, Constant:i16<1>
Creating new node: t12: ch = CopyToReg t0, Register:i16 %3, t10
Creating new node: t13: i16 = add t2, Constant:i16<1>
Creating new node: t15: ch = CopyToReg t0, Register:i16 %4, t13
Creating new node: t17: i16,ch = CopyFromReg t0, Register:i16 %0
Creating constant: t18: i16 = Constant<-1>
Creating new node: t19: i16 = add t17, Constant:i16<-1>
Creating new node: t21: ch = CopyToReg t0, Register:i16 %5, t19
Creating new node: t22: ch = TokenFactor t12, t15, t21, t8
Creating new node: t24: ch = br t22, BasicBlock:ch<for.cond 0x10985e000>
Initial selection DAG: %bb.3 'tst:for.body'
SelectionDAG has 25 nodes:
t0: ch = EntryToken
t2: i16,ch = CopyFromReg t0, Register:i16 %1
t3: i16 = Constant<0>
t5: i8,ch = load<(load 1 from %ir.lsr.iv1, !tbaa !2)> t0, t2, undef:i16
t7: i16,ch = CopyFromReg t0, Register:i16 %2
t10: i16 = add t7, Constant:i16<1>
t12: ch = CopyToReg t0, Register:i16 %3, t10
t13: i16 = add t2, Constant:i16<1>
t15: ch = CopyToReg t0, Register:i16 %4, t13
t17: i16,ch = CopyFromReg t0, Register:i16 %0
t19: i16 = add t17, Constant:i16<-1>
t21: ch = CopyToReg t0, Register:i16 %5, t19
t8: ch = store<(store 1 into %ir.lsr.iv, !tbaa !2)> t5:1, t5, t7,
undef:i16
t22: ch = TokenFactor t12, t15, t21, t8
t24: ch = br t22, BasicBlock:ch<for.cond 0x10985e000>
Now, instead of using array indexes with ‘add’ instructions (which can be folded
later on into load/store addressing modes), LLVM choses to directly increment
the pointers!. This is suboptimal -at least on my architecture- because at this
point it’s impossible to select optimal addressing modes. Instead, the pointers
must be explicitly incremented with additional instructions as llvm dictates.
Any ideas about why this happens?. Particularly, what could cause this change of
behaviour just by adding an ‘if’ before the ‘for’? Note that the IR code for
the for ‘for’ body is IDENTICAL in both cases, so this is definitely a LLVM
thing.
Joan Lluch
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190605/1e86f68f/attachment.html>
Tim Northover via llvm-dev
2019-Jun-05 17:33 UTC
[llvm-dev] Strange behaviour of post-legalising optimisations(?)
Hi Joan, On Wed, 5 Jun 2019 at 08:58, Joan Lluch via llvm-dev <llvm-dev at lists.llvm.org> wrote:> Any ideas about why this happens?. Particularly, what could cause this change of behaviour just by adding an ‘if’ before the ‘for’? Note that the IR code for the for ‘for’ body is IDENTICAL in both cases, so this is definitely a LLVM thing.If you run "llc -print-after-all" you should be able to see which pass first introduces the behaviour you don't like. In this case it seems to be "Loop Strength Reduction", in lib/Transforms/Scalar/LoopStrengthReduce.cpp. It makes use of a lot of hooks in TargetTransformInfo (search for "TTI.") when making its decisions and if you haven't implemented versions of those matching your target its heuristics are going to get this kind of thing wrong. Beyond that I'm just guessing, but isLegalAddressingMode looks like it could be a critical one for your case (it's the only thing you've mentioned being able to do neatly). Cheers. Tim.
Joan Lluch via llvm-dev
2019-Jun-06 14:28 UTC
[llvm-dev] Strange behaviour of post-legalising optimisations(?)
Hi Tim,
Thank you for your reply. It actually helped a lot to narrow the issue, as
previously I didn’t even know where to look.
I have been following the code in the debugger, specially the
LSRInstance::SolveRecurse function. This function traverses recursively all
possible ‘Formulae’, and determines the best instruction combination for the
loop generation, based on minimal cost. The SolveRecurse function uses
Cost::RateFormula as the basis for the cost computation.
I found that the RateFormula function does not return the what-would-be best
values for my architecture. In particular, my understanding is that the whole
implementation kind of assumes that all architectures will have post-increment
addressing modes, even if they don’t. For example, consider the following
formulas:
reg({%dest,+,1}<nsw><%for.cond>)
reg(%dest) + 1*reg({0,+,1}<nuw><nsw><%for.cond>)
The RateFormula function returns ‘1 instruction’ for either of them, however, in
my case, the first one will require an explicit add instruction (post increment
is not supported), which should account for an instruction count of 2 instead of
just 1. The count should be 2 because this additional add can not be folded with
other instructions, as it’s exclusively to increment the pointer.
So the problem is that in most situations, llvm choses the first form rather
than the second one due to the register overhead in the second one, in spite
that in my architecture it would be much better to chose the second (which would
happen if the first one resulted in a 2 instruction count).
At this point, I advanced significantly on the understanding of the problem, but
I got stuck at a more concrete problem. It seems to me that there’s no way to
instruct LLVM to adopt a more costly result for the first form, as it would be
necessary for my architecture, unless I’m missing something (?).
(As a side note: I implemented “isLegalAddressingMode”, but it turned to be
virtually identical to the default one except for the smaller immediate sizes,
so I think it doesn’t make much difference. Post/Pre increment addressing mode
is not supported in my architecture so this is reflected, and I haven’t either
added the setIndexedLoadAction calls to my TargetLowering constructor. I also
created my own TargetTransformInfo implementation based on the ARM code, which
works fine as the functions get diligently called, but I found that it’s not of
much help because it doesn’t have the necessary hooks to solve this issue)
So any additional help would be appreciated.
Joan Lluch
> On 5 Jun 2019, at 19:33, Tim Northover <t.p.northover at gmail.com>
wrote:
>
> Hi Joan,
>
> On Wed, 5 Jun 2019 at 08:58, Joan Lluch via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Any ideas about why this happens?. Particularly, what could cause this
change of behaviour just by adding an ‘if’ before the ‘for’? Note that the IR
code for the for ‘for’ body is IDENTICAL in both cases, so this is definitely a
LLVM thing.
>
> If you run "llc -print-after-all" you should be able to see which
pass
> first introduces the behaviour you don't like. In this case it seems
> to be "Loop Strength Reduction", in
> lib/Transforms/Scalar/LoopStrengthReduce.cpp. It makes use of a lot of
> hooks in TargetTransformInfo (search for "TTI.") when making its
> decisions and if you haven't implemented versions of those matching
> your target its heuristics are going to get this kind of thing wrong.
>
> Beyond that I'm just guessing, but isLegalAddressingMode looks like it
> could be a critical one for your case (it's the only thing you've
> mentioned being able to do neatly).
>
> Cheers.
>
> Tim.