Jim Grosbach
2014-Jun-27 23:30 UTC
[LLVMdev] Contributing the Apple ARM64 compiler backend
AArch64AddressTypePromotion.cpp does a fair bit of work to help make these things work out well. It could probably be generalized for non-AArch64 targets as per the comment in the file header.

> On Jun 26, 2014, at 10:42 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>
> Cool HW trick. :)
> Are those 'sxtw' ops free?

That'll depend on the details of the microarchitecture. I don't know what is typical.

> I have to look at the HW manuals again, but I don't think x86-64 has that capability.
>
> On Thu, Jun 26, 2014 at 11:23 AM, James Molloy <james.molloy at arm.com> wrote:
>
> Hi Sanjay,
>
> The behaviour I'm talking about I've actually pinned down to CodeGenPrepare not working too well with ISAs that don't have a good scaled load. I have a patch to fix it that is going through performance testing now.
>
> Your testcase seems specific to x86 - for AArch64 we get the rather spiffy:
>
> _Z3fooPii:                              // @_Z3fooPii
> // BB#0:                                // %entry
>         add     w8, w1, #1              // =1
>         add     w9, w1, #2              // =2
>         ldr     w8, [x0, w8, sxtw #2]
>         ldr     w9, [x0, w9, sxtw #2]
>         add     w8, w9, w8
>         str     w8, [x0, w1, sxtw #2]
>         ret
>
> The sext can be matched as part of the addressing mode for AArch64 - maybe it's something in CodeGenPrepare for x86 going awry?
>
> Cheers,
>
> James
>
> From: Sanjay Patel [mailto:spatel at rotateright.com]
> Sent: 26 June 2014 18:11
> To: Manjunath DN
> Cc: James Molloy; llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
>
>>> We've also seen similar instances where multiple registers are used to compute very similar
>>> addresses (such as x+0 and x+4!) and this increases register pressure.
>
> I don't have an ARM-enabled build of the tools to test with, but I suspect what I'm seeing here:
> http://llvm.org/bugs/show_bug.cgi?id=20134
> ...would also be bad on AArch64.
>
> On Wed, Jun 25, 2014 at 8:58 PM, Manjunath DN <manjunath.dn at gmail.com> wrote:
>
> Hi James,
> Thanks for your reply and hints on what can be done for the AArch64 backend optimization in LLVM.
> We have a SPEC license and v8 hardware, so I will start looking into it.
>
> warm regards
> Manjunath
>
> On Wed, Jun 25, 2014 at 8:42 PM, James Molloy <james.molloy at arm.com> wrote:
>
> Hi Manjunath,
>
> At the time of writing that status we had only done our initial analysis.
> This was done without real hardware and attempted to identify poor code
> sequences, but we were unable to quantify how much effect this would
> actually have.
>
> Since then we've done more analysis using Cortex-A57 and Cortex-A53 on an
> internal development platform.
>
> For SPEC, we are between 10% and 0% behind GCC on 9 benchmarks, and 25%
> ahead on one benchmark. Most benchmarks are less than 5% behind GCC.
>
> Because of the licencing of SPEC, I have to be quite restricted in what I
> say and I can't give any numbers - sorry about that.
>
> We are focussing on Cortex-A57, and the things we've identified so far are:
>
> * The CSEL instruction behaves worse than the equivalent branch structure
> in at least one benchmark. In an out-of-order core, select-like
> instructions are going to be slower than their branched equivalent if the
> branch is predictable, due to CSEL having two dependencies.
>
> * Redundant calculations inside if conditions. We've seen:
>   1. "if (a[x].b < c[y].d || a[x].e > c[y].f)" - the calculations of a[x]
>      and c[y] are repeated, when they are common. We've also seen similar
>      instances where multiple registers are used to compute very similar
>      addresses (such as x+0 and x+4!), and this increases register
>      pressure.
>   2. "if (a < 0 && b == c || a > 0 && b == d)" - the first comparison of
>      'a' against zero is done twice, when the flag results of the first
>      comparison could be used for the second comparison.
>
> * For a loop such as "for (i = 0; i < n; ++i) { do_something_with(&x[i]); }",
> GCC uses &x[i] as the loop induction variable, where LLVM uses i and
> performs the calculation &x[i] on every iteration. This only creates one
> more add instruction, but the loop we see it in only has 5 or so
> instructions.
>
> * The inline heuristics are way behind GCC's. If we crank the inline
> threshold up to 1000, we can remove a 6.5% performance regression from one
> benchmark entirely.
>
> * We're generating (due to the SLP vectorizer and a DAG combine) loads into
> Q registers when merging consecutive loads. This is bad, because there are
> no callee-saved Q registers! So if the live range crosses a function call,
> it will have to be immediately spilled again. This can be easily fixed by
> using load-pair instructions instead. I have a patch to fix this.
>
> The list above is non-exhaustive and only contains things that we think
> may affect multiple benchmarks or real-world code.
>
> I've also noticed:
>
> * Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0, [..]"
> pairs, which is less than ideal on A53. If we switched to emitting
> "LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30% better inline
> memcpy performance on A53. A57 seems to deal well with the LDR q sequence.
>
> I'm sorry I'm unable to provide code samples for most of the issues found
> so far - this is an artefact of them having come from SPEC. Trivial
> examples do not always show the same behaviour, and as we're still
> investigating we haven't yet been able to reduce most of these to an
> anonymisable testcase.
>
> Hope this helps, but doubt it does,
>
> James
>
>> -----Original Message-----
>> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
>> On Behalf Of Manjunath N
>> Sent: 24 June 2014 10:45
>> To: llvmdev at cs.uiuc.edu
>> Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
>>
>> Eric Christopher <echristo <at> gmail.com> writes:
>>
>>>> The big pain issues I see merging from ARM64 to AArch64 are:
>>>>   1. Apple have created a fairly complete scheduling model already for
>>>>      ARM64, and we'd have to merge the partial model in AArch64 and
>>>>      theirs. We risk regressing performance on Apple's targets here,
>>>>      and we can't determine ourselves whether we have or not. This is
>>>>      not ideal.
>>>>   2. Porting over the DAG-to-DAG optimizations and any other
>>>>      optimizations that rely on the tablegen layout will be very
>>>>      tricky.
>>>>   3. The conditional compare pass is fairly comprehensive - we'd have
>>>>      to port that over or rewrite it, and that would be a lot of work.
>>>>   4. A very quick analysis last night indicated that ARM64 has
>>>>      implemented just under half of the optimizations we discovered
>>>>      opportunities for in SPEC and EEMBC. That's a fairly comprehensive
>>>>      number of optimizations, and they won't all be easy to port.
>>
>> Eric,
>> You mention that there are quite a few optimization opportunities in
>> SPEC 2000/EEMBC. I am looking to optimize the AArch64 backend. Could you
>> please let me know the big optimizations possible?
>
> --
> warm regards,
> Manjunath DN
>
> --
> Sanjay Patel
> RotateRight, LLC
> http://www.rotateright.com
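For reference, the _Z3fooPii assembly above (and the _foo diff in the messages that follow) corresponds to a function along these lines. This is a plausible reconstruction from the mangled name and the generated code, not the verbatim testcase from PR20134:

    // Reconstructed testcase: foo(int *, int) mangles to _Z3fooPii.
    // The 32-bit index must be sign-extended to 64 bits before it can
    // address memory; whether that sext folds into the loads and store
    // (the "[x0, w8, sxtw #2]" addressing mode) is the question here.
    void foo(int *a, int i) {
      a[i] = a[i + 1] + a[i + 2];
    }

Because a[i + 1] and a[i + 2] are adjacent in memory, computing &a[i] once allows the two loads to merge into a single ldp, which is exactly the improvement shown in the next message.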
Duncan P. N. Exon Smith
2014-Jun-30 23:19 UTC
[LLVMdev] AArch64AddressTypePromotion does nothing (was Re: Contributing the Apple ARM64 compiler backend)
> On 2014-Jun-27, at 16:30, Jim Grosbach <grosbach at apple.com> wrote:
>
> AArch64AddressTypePromotion.cpp does a fair bit of work to help make these things work out well. It could probably be generalized for non-AArch64 targets as per the comment in the file header.

I spent some time today generalizing AArch64AddressTypePromotion (I'll
send the patches when they're ready), but in the process discovered
that this pass does nothing (!) right now. I assume this bug was
introduced when we changed the semantics of `Use` (IIRC, ARM64 was
still private at the time).

After the tiny patch inline below, the code looks even better:

--- old.s	2014-06-30 16:12:52.000000000 -0700
+++ new.s	2014-06-30 16:13:24.000000000 -0700
@@ -5,12 +5,10 @@
 _foo:                                   ; @foo
 	.cfi_startproc
 ; BB#0:                                 ; %entry
-	add	w8, w1, #1              ; =1
-	add	w9, w1, #2              ; =2
-	ldr	w8, [x0, w8, sxtw #2]
-	ldr	w9, [x0, w9, sxtw #2]
-	add	w8, w9, w8
-	str	w8, [x0, w1, sxtw #2]
+	add	x8, x0, w1, sxtw #2
+	ldp	w9, w10, [x8, #4]
+	add	w9, w10, w9
+	str	w9, [x8]
 	ret
 	.cfi_endproc

I'll commit the fix once I've written up a testcase.

diff --git a/lib/Target/AArch64/AArch64AddressTypePromotion.cpp b/lib/Target/AArch64/AArch64AddressTypePromotion.cpp
index 04906f6..bcce295 100644
--- a/lib/Target/AArch64/AArch64AddressTypePromotion.cpp
+++ b/lib/Target/AArch64/AArch64AddressTypePromotion.cpp
@@ -212,12 +212,12 @@ static bool shouldSExtOperand(const Instruction *Inst, int OpIdx) {
 bool
 AArch64AddressTypePromotion::shouldConsiderSExt(const Instruction *SExt) const {
   if (SExt->getType() != ConsideredSExtType)
     return false;
 
-  for (const Use &U : SExt->uses()) {
-    if (isa<GetElementPtrInst>(*U))
+  for (const User *U : SExt->users()) {
+    if (isa<GetElementPtrInst>(U))
       return true;
   }
 
   return false;
 }
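To spell out why the old loop could never fire (my reading of the iterator semantics after the `Use` change, so take the annotations as explanatory rather than authoritative): dereferencing a `Use` yields the *used* `Value`, not the user. In `shouldConsiderSExt` the used value is the sext instruction itself, which can never be a `GetElementPtrInst`:

    // Old loop: *U converts to the used Value, i.e. the SExt itself,
    // so the isa<> check below can never succeed.
    for (const Use &U : SExt->uses()) {
      if (isa<GetElementPtrInst>(*U))  // *U is the sext; always false
        return true;
    }

    // Fixed loop: users() visits the consuming instructions instead,
    // so the check fires when a GEP really takes the sext as an operand.
    for (const User *U : SExt->users()) {
      if (isa<GetElementPtrInst>(U))
        return true;
    }

With the predicate always returning false, the pass never collected any sext candidates to promote, hence the no-op behaviour.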
Duncan P. N. Exon Smith
2014-Jun-30 23:53 UTC
[LLVMdev] AArch64AddressTypePromotion does nothing (was Re: Contributing the Apple ARM64 compiler backend)
> On 2014-Jun-30, at 16:19, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>
>> On 2014-Jun-27, at 16:30, Jim Grosbach <grosbach at apple.com> wrote:
>>
>> AArch64AddressTypePromotion.cpp does a fair bit of work to help make these things work out well. It could probably be generalized for non-AArch64 targets as per the comment in the file header.
>
> I spent some time today generalizing AArch64AddressTypePromotion (I'll
> send the patches when they're ready), but in the process discovered
> that this pass does nothing (!) right now. I assume this bug was
> introduced when we changed the semantics of `Use` (IIRC, ARM64 was
> still private at the time).
>
> After the tiny patch inline below, the code looks even better:
>
> [snip: assembly diff quoted in full in the previous message]
>
> I'll commit the fix once I've written up a testcase.

FYI, I committed this as r212073. I'll do a quick audit of other calls
of `uses()` and `use_begin()` in AArch64.
Quentin Colombet
2014-Jul-01 07:40 UTC
[LLVMdev] AArch64AddressTypePromotion does nothing (was Re: Contributing the Apple ARM64 compiler backend)
> On Jun 30, 2014, at 4:19 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>
>> On 2014-Jun-27, at 16:30, Jim Grosbach <grosbach at apple.com> wrote:
>>
>> AArch64AddressTypePromotion.cpp does a fair bit of work to help make these things work out well. It could probably be generalized for non-AArch64 targets as per the comment in the file header.
>
> I spent some time today generalizing AArch64AddressTypePromotion (I'll
> send the patches when they're ready), but in the process discovered
> that this pass does nothing (!) right now.

I am not surprised, as this pass has been superseded by the refactoring of
the handling of addressing modes in CodeGenPrepare (r200947). If you
disable CGP, you will see this pass doing something :).

In fact, evaluating whether this pass is still useful is on our todo list:
<rdar://problem/16005447>.

Cheers,
-Quentin

> I assume this bug was
> introduced when we changed the semantics of `Use` (IIRC, ARM64 was
> still private at the time).
>
> After the tiny patch inline below, the code looks even better:
>
> [snip: assembly diff and patch quoted in full in the earlier messages]
>
> I'll commit the fix once I've written up a testcase.
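For context on the CodeGenPrepare work Quentin mentions: one of the things CGP's addressing-mode handling does is sink address computations next to the memory instructions that use them, because SelectionDAG selects one basic block at a time and can only fold what it can see. A hypothetical C++ example (not from the thread) of the shape it targets:

    // Hypothetical sketch: the sext-and-scale for &a[i] is computed in
    // the entry block, but the load sits behind a branch. Unless CGP
    // sinks the address computation into the branch's block, isel
    // cannot fold it into a single "ldr w0, [x0, w1, sxtw #2]".
    int load_if(int *a, int i, bool c) {
      int *p = &a[i];  // address computed here...
      if (c)
        return *p;     // ...but used in a different block
      return 0;
    }

This is presumably also why disabling CGP makes the AArch64 pass kick in: with CGP's sinking in place, the sext is typically already somewhere the backend can match it.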