One piece of code I'm writing has a lot of intermediates, and I'm trying to optimize down the number of memory accesses. Here's a snippet from the start of the function, where I think there is some low-hanging fruit: # BB#0: pushq %rbp pushq %r15 pushq %r14 pushq %r13 pushq %r12 pushq %rbx movq %rdx, %rcx movq %rdi, -16(%rsp) # 8-byte Spill movq (%rsi), %rdi movq 8(%rsi), %r8 movq 8(%rcx), %rax movq %rax, -24(%rsp) # 8-byte Spill movq 16(%rcx), %rax movq %rax, -8(%rsp) # 8-byte Spill movq %rdi, %rax mulq -24(%rsp) # 8-byte Folded Reload You'll note that rbx,r12,r13,r14,r15,rbp are all dead after the pushes. But the spill code still insists on using rax to load the spilled values, forcing them to be reloaded later. Is the register allocator (pbqp, I think) capable of having values in registers and on the stack at the same time? -- Taral <taralx at gmail.com> "Please let me know if there's any further trouble I can give you." -- Unknown
David Blaikie
2011-Jul-28 23:05 UTC
[LLVMdev] Spills and values present in both registers & stack
On Tue, Jul 26, 2011 at 11:35 AM, Taral <taralx at gmail.com> wrote:> > One piece of code I'm writing has a lot of intermediates, and I'm > trying to optimize down the number of memory accesses. Here's a > snippet from the start of the function, where I think there is some > low-hanging fruit: > > # BB#0: > pushq %rbp > pushq %r15 > pushq %r14 > pushq %r13 > pushq %r12 > pushq %rbx > movq %rdx, %rcx > movq %rdi, -16(%rsp) # 8-byte Spill > movq (%rsi), %rdi > movq 8(%rsi), %r8 > movq 8(%rcx), %rax > movq %rax, -24(%rsp) # 8-byte Spill > movq 16(%rcx), %rax > movq %rax, -8(%rsp) # 8-byte Spill > movq %rdi, %rax > mulq -24(%rsp) # 8-byte Folded Reload > > You'll note that rbx,r12,r13,r14,r15,rbp are all dead after the > pushes. But the spill code still insists on using rax to load the > spilled values, forcing them to be reloaded later.I'm not the most familiar with this sort of thing - but a small example (of llvm bitcode) & the optimization flags you used, etc, might be helpful (& I might be able to have a go at explaining it, if no one else does).> Is the register > allocator (pbqp, I think) capable of having values in registers and on > the stack at the same time?I can at least confirm that the default register allocator on x86 isn't PBQP, it's the greedy allocator (unless your'e compiling with -O0, in which case it's the fast allocator). - David
On Thu, Jul 28, 2011 at 4:05 PM, David Blaikie <dblaikie at gmail.com> wrote:> I'm not the most familiar with this sort of thing - but a small > example (of llvm bitcode) & the optimization flags you used, etc, > might be helpful (& I might be able to have a go at explaining it, if > no one else does).The actual C code is shorter. (You'll have to unwrap the lines.) Bonus points if you can identify the algorithm. }:> #include <stdint.h> typedef unsigned int uint128_t __attribute__((mode(TI))); typedef struct{ uint64_t l[5]; } s; void f(s * restrict r, const s * restrict x, const s * restrict y) { uint128_t t[5] = {0, 0, 0, 0, 0}; #define BODY(i,j) { int i_ = i < j ? i : j; int j_ = i < j ? j : i; uint128_t m = (uint128_t) x->l[i_] * (y->l[j_] * (i + j > 4 ? 19 : 1)); if (i + j > 4) { t[i + j - 5] += m; } else { t[i + j] += m; } } #define LOOP(i) BODY(i, 0); BODY(i, 1); BODY(i, 2); BODY(i, 3); BODY(i, 4); LOOP(0); LOOP(1); LOOP(2); LOOP(3); LOOP(4); #define FOLD1(i) r->l[i] = ((uint64_t) t[i] & (1LL << 51) - 1) + (i == 0 ? 19 : 1) * (uint64_t)(t[(i + 4) % 5] >> 51) FOLD1(0); FOLD1(1); FOLD1(2); FOLD1(3); FOLD1(4); #define FOLD2(i) r->l[(i + 1) % 5] += (i == 4 ? 19 : 1) * (r->l[i]>> 51); r->l[i] &= (1LL << 51) - 1;FOLD2(0); FOLD2(1); FOLD2(2); FOLD2(3); FOLD2(4); } % clang -O4 -S -o f.l f.c % llc -O3 --pre-RA-sched=list-hybrid --regalloc=pbqp -o f.s f.l -- Taral <taralx at gmail.com> "Please let me know if there's any further trouble I can give you." -- Unknown
Lang Hames
2011-Jul-31 02:16 UTC
[LLVMdev] Spills and values present in both registers & stack
Hi Taral,> You'll note that rbx,r12,r13,r14,r15,rbp are all dead after the > pushes. But the spill code still insists on using rax to load the > spilled values, forcing them to be reloaded later. Is the register > allocator (pbqp, I think) capable of having values in registers and on > the stack at the same time?There are a couple of problems here: (1) PBQP doesn't split live intervals, so those array elements get spilled everywhere, even in places where there are free registers. (2) The PBQP allocator always introduces a stack slot for spilled values, even when they're already on the stack. I definitely want to tackle these issues, but they'll have to wait for a month or so while I write up my thesis. Cheers, Lang.