Eric Christopher <echristo <at> gmail.com> writes:

> > The big pain issues I see merging from ARM64 to AArch64 are:
> > 1. Apple have created a fairly complete scheduling model already for
> > ARM64, and we'd have to merge the partial? model in AArch64 and theirs. We
> > risk regressing performance on Apple's targets here, and we can't determine
> > ourselves whether we have or not. This is not ideal.
> > 2. Porting over the DAG-to-DAG optimizations and any other
> > optimizations that rely on the tablegen layout will be very tricky.
> > 3. The conditional compare pass is fairly comprehensive - we'd have to
> > port that over or rewrite it and that would be a lot of work.
> > 4. A very quick analysis last night indicated that ARM64 has
> > implemented just under half of the optimizations we discovered opportunities
> > for in SPEC and EEMBC. That's a fairly comprehensive number of
> > optimizations, and they won't all be easy to port.

Eric,
You mention that there are quite a few optimization opportunities in SPEC 2000/EEMBC.
I am looking to optimize the AArch64 backend. Could you please let me know the big optimizations possible?
Eric,

Feel free to file PRs and CC me on them! :)

Chad

-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Manjunath N
Sent: Tuesday, June 24, 2014 5:45 AM
To: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend

Eric,
You mention that there are quite a few optimization opportunities in SPEC 2000/EEMBC.
I am looking to optimize the AArch64 backend. Could you please let me know the big optimizations possible?

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Eric Christopher
2014-Jun-24 14:23 UTC
[LLVMdev] Contributing the Apple ARM64 compiler backend
On Tue, Jun 24, 2014 at 2:45 AM, Manjunath N <manjunath.dn at gmail.com> wrote:
> Eric,
> You mention that there are quite a few optimization opportunities in SPEC
> 2000/EEMBC.
> I am looking to optimize the AArch64 backend. Could you please let me know
> the big optimizations possible?

Wasn't me here. I was just on the thread. Might have been Amara?

-eric
Amara Emerson
2014-Jun-24 14:29 UTC
[LLVMdev] Contributing the Apple ARM64 compiler backend
*Flexes buck-passing muscles*

Looks like James wrote the original chunk of that quote.

Amara

> -----Original Message-----
> From: Eric Christopher [mailto:echristo at gmail.com]
> Sent: 24 June 2014 15:23
> To: Manjunath N; Amara Emerson
> Cc: llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
>
> Wasn't me here. I was just on the thread. Might have been Amara?
>
> -eric
James Molloy
2014-Jun-25 15:12 UTC
[LLVMdev] Contributing the Apple ARM64 compiler backend
Hi Manjunath,

At the time of writing that status we had only done our initial analysis.
This was done without real hardware and attempted to identify poor code
sequences, but we were unable to quantify how much effect this would
actually have.

Since then we've done more analysis using Cortex-A57 and Cortex-A53 on an
internal development platform.

For SPEC, we are between 10% and 0% behind GCC on 9 benchmarks, and 25%
ahead on one benchmark. Most benchmarks are less than 5% behind GCC.

Because of the licencing of SPEC, I have to be quite restricted in what I
say and I can't give any numbers - sorry about that.

We are focussing on Cortex-A57, and the things we've identified so far are:

* The CSEL instruction behaves worse than the equivalent branch structure
  in at least one benchmark. In an out-of-order core, select-like
  instructions are going to be slower than their branched equivalent when
  the branch is predictable, because CSEL has two dependencies.

* Redundant calculations inside if conditions. We've seen:
  1. "if (a[x].b < c[y].d || a[x].e > c[y].f)" - the calculations of a[x]
     and c[y] are repeated, when they are common. We've also seen similar
     instances where multiple registers are used to compute very similar
     addresses (such as x+0 and x+4!) and this increases register pressure.
  2. "if (a < 0 && b == c || a > 0 && b == d)" - the first comparison of
     'a' against zero is done twice, when the flag results of the first
     comparison could be used for the second comparison.

* For a loop such as "for (i = 0; i < n; ++i) {do_something_with(&x[i]);}",
  GCC is using &x[i] as the loop induction variable where LLVM uses i and
  performs the calculation &x[i] on every iteration. This only creates one
  more add instruction, but the loop we see it in only has 5 or so
  instructions.

* The inline heuristics are way behind GCC's. If we crank the inline
  threshold up to 1000, we can remove a 6.5% performance regression from one
  benchmark entirely.

* We're generating (due to the SLP vectorizer and a DAG combine) loads into Q
  registers when merging consecutive loads. This is bad, because there are no
  callee-saved Q registers! So if the live range crosses a function call, it
  will have to be immediately spilled again. This can be easily fixed by
  using load-pair instructions instead. I have a patch to fix this.

The list above is non-exhaustive and only contains things that we think may
affect multiple benchmarks or real-world code.

I've also noticed:

* Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0, [..]"
  pairs, which is less than ideal on A53. If we switched to emitting
  "LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30% better inline
  memcpy performance on A53. A57 seems to deal well with the LDR q sequence.

I'm sorry I'm unable to provide code samples for most of the issues found so
far - this is an artefact of them having come from SPEC. Trivial examples do
not always show the same behaviour, and as we're still investigating we
haven't yet been able to reduce most of these to an anonymisable testcase.

Hope this helps, but doubt it does,

James

> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
> Behalf Of Manjunath N
> Sent: 24 June 2014 10:45
> To: llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
>
> Eric,
> You mention that there are quite a few optimization opportunities in SPEC
> 2000/EEMBC.
> I am looking to optimize the AArch64 backend. Could you please let me know
> the big optimizations possible?
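
Two of the source-level shapes James describes (the repeated a[x] / c[y]
address calculation and the index-versus-pointer loop induction variable) can
be sketched in C as below. This is only an illustration under assumptions, not
code from SPEC: the struct rec layout, the accumulating stand-in for
do_something_with(), and all four function names are hypothetical. Each pair
of functions computes the same result; whether a particular compiler emits
different code for the two forms depends on its version and flags, so the
generated assembly would need to be inspected to confirm any difference.

```c
#include <stddef.h>

/* Hypothetical element type. The field names follow the thread's
   "a[x].b" / "c[y].d" example; the layout itself is an assumption. */
struct rec { int b, d, e, f; };

/* Condition as written in the thread: a[x] and c[y] each appear twice. */
int cond_repeated(const struct rec *a, const struct rec *c,
                  size_t x, size_t y) {
    return a[x].b < c[y].d || a[x].e > c[y].f;
}

/* Hand-hoisted form: one base address per element, with the fields
   reached at small constant offsets from that base. */
int cond_hoisted(const struct rec *a, const struct rec *c,
                 size_t x, size_t y) {
    const struct rec *ax = &a[x];
    const struct rec *cy = &c[y];
    return ax->b < cy->d || ax->e > cy->f;
}

/* Hypothetical stand-in for do_something_with(): adds *p to *acc. */
static void do_something_with(const int *p, long *acc) { *acc += *p; }

/* Index i is the induction variable; &x[i] is formed on every iteration. */
long loop_indexed(const int *x, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; ++i)
        do_something_with(&x[i], &acc);
    return acc;
}

/* The address itself is the induction variable, as GCC is described
   as doing in the thread. */
long loop_pointer(const int *x, size_t n) {
    long acc = 0;
    for (const int *p = x, *end = x + n; p != end; ++p)
        do_something_with(p, &acc);
    return acc;
}
```

Compiling both variants of each pair for an AArch64 target and comparing the
output is one way to check whether a given compiler version still shows the
behaviour reported in this thread.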
Manjunath DN
2014-Jun-26 02:58 UTC
[LLVMdev] Contributing the Apple ARM64 compiler backend
Hi James,

Thanks for your reply and hints on what can be done for the AArch64 backend
optimization for LLVM.
We have a SPEC license and v8 hardware, so I will start looking into it.

warm regards
Manjunath

On Wed, Jun 25, 2014 at 8:42 PM, James Molloy <james.molloy at arm.com> wrote:
> Hope this helps, but doubt it does,
>
> James