Meador Inge via llvm-dev
2015-Nov-11 19:57 UTC
[llvm-dev] [AArch64] Address computation folding
Hi,

I was looking at some AArch64 benchmarks and noticed some simple cases where addresses are being folded into the address mode computations and was curious as to why. In particular, consider the following simple example:

    void f2(unsigned long *x, unsigned long c)
    {
        x[c] *= 2;
    }

This generates:

    lsl x8, x1, #3
    ldr x9, [x0, x8]
    lsl x9, x9, #1
    str x9, [x0, x8]

Given the two uses of the address computation, I was expecting this:

    add x8, x0, x1, lsl #3
    ldr x9, [x8]
    lsl x9, x9, #1
    str x9, [x8]

From reading 'SelectAddrModeXRO', the computation gets folded only if the add node is used exclusively by memory-related operations. Why wouldn't it also consider the total number of uses? The "expected" code is easy to get by checking the number of uses, and it may be desirable on some micro-architectures depending on the cost of the various loads and stores.

-- Meador
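For illustration, the kind of use-count check being suggested might look roughly like the following LLVM-style C++ sketch. This is pseudocode, not the actual SelectAddrModeXRO implementation; the function name is hypothetical:

    // Hypothetical sketch: decline to fold the "reg + (reg << shift)"
    // computation into the addressing mode when the node has more than
    // one use, so a multiply-used address is materialized once with an
    // ADD and the register is shared between the load and the store.
    static bool shouldFoldIntoAddrMode(SDNode *N) {
      // With multiple uses, folding repeats the scaled-index addressing
      // at every access; emitting one ADD and reusing it can be cheaper
      // on cores where the shifted-register ADD is inexpensive.
      return N->hasOneUse();
    }

Whether this is a win is exactly the micro-architectural question raised below: it trades one extra ALU instruction for simpler addressing modes at each memory access.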
Tim Northover via llvm-dev
2015-Nov-11 21:08 UTC
[llvm-dev] [AArch64] Address computation folding
On 11 November 2015 at 11:57, Meador Inge <meadori at gmail.com> wrote:
> Why wouldn't it consider the number of uses in any operation? The
> "expected" code is easy to get by checking the number of uses. This
> may be desirable on some micro-architectures depending on the cost of
> the various loads and stores.

As you say, very microarchitecture-dependent. The code produced is probably optimal for Cyclone ("[x0, x8]" is no more expensive than "[x8]" and the "lsl" is slightly cheaper than the complicated "add"). If I'm reading the Cortex-A57 optimisation guide correctly, the same reasoning applies there too.

Cheers.

Tim.
James Molloy via llvm-dev
2015-Nov-11 21:15 UTC
[llvm-dev] [AArch64] Address computation folding
Hi,

Indeed, the complex add is more expensive on all Cortex cores I know of. However, there is an important point here: the code sequence we generate requires two registers live instead of one. In high register-pressure loops, we're probably losing performance.

James

On Wed, 11 Nov 2015 at 21:09, Tim Northover via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> On 11 November 2015 at 11:57, Meador Inge <meadori at gmail.com> wrote:
> > Why wouldn't it consider the number of uses in any operation? The
> > "expected" code is easy to get by checking the number of uses. This
> > may be desirable on some micro-architectures depending on the cost of
> > the various loads and stores.
>
> As you say, very microarchitecture-dependent. The code produced is
> probably optimal for Cyclone ("[x0, x8]" is no more expensive than
> "[x8]" and the "lsl" is slightly cheaper than the complicated "add").
> If I'm reading the Cortex-A57 optimisation guide correctly, the same
> reasoning applies there too.
>
> Cheers.
>
> Tim.
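To make the register-pressure point concrete, here are the two sequences from the original example again, annotated with which registers must stay live for the address (illustrative liveness comments only):

    // Folded form: both x0 (base) and x8 (scaled index) must stay
    // live across the load/store pair - two registers held.
    lsl x8, x1, #3
    ldr x9, [x0, x8]    // x0 and x8 both live here
    lsl x9, x9, #1
    str x9, [x0, x8]    // ... and still live here

    // Materialized form: only x8 (the full address) must stay live;
    // x0's value is no longer needed after the add, freeing a
    // register inside a high-pressure loop body.
    add x8, x0, x1, lsl #3
    ldr x9, [x8]        // only x8 live for the address
    lsl x9, x9, #1
    str x9, [x8]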
Meador Inge via llvm-dev
2015-Nov-11 22:44 UTC
[llvm-dev] [AArch64] Address computation folding
On Wed, Nov 11, 2015 at 3:08 PM, Tim Northover <t.p.northover at gmail.com> wrote:
> As you say, very microarchitecture-dependent. The code produced is
> probably optimal for Cyclone ("[x0, x8]" is no more expensive than
> "[x8]" and the "lsl" is slightly cheaper than the complicated "add").
> If I'm reading the Cortex-A57 optimisation guide correctly, the same
> reasoning applies there too.

Yeah, my reading is the same. For Cortex-A57 it looks like the same number of u-ops and latency either way (since LDR [x1, x2] is free).

-- Meador