Maxim Kazantsev via llvm-dev
2021-Jan-28 05:44 UTC
[llvm-dev] Align loops by 32 to use DSB more efficiently in x86
Hello everyone,

I wanted to discuss the loop alignment choice in X86 codegen. Currently, LLVM unconditionally aligns all loops by 16 bytes, and in some cases this does not interact well with certain processor mechanisms, in particular the DSB (the decoded uop cache). The effect I'm observing now has been discussed before; at least it is mentioned in these slides: https://llvm.org/devmtg/2016-11/Slides/Ansari-Code-Alignment.pdf, but it doesn't seem that any decision has been taken on it since.

Motivation:

The motivating code piece that demonstrated significant score swings is the following example:

define i32 @test(i32* %p, i64 %len, i32 %x) {
entry:
  br label %loop
loop:                                             ; preds = %backedge, %entry
  %iv = phi i64 [ %iv.next, %backedge ], [ %len, %entry ]
  %iv.next = add nsw i64 %iv, -1
  %cond_1 = icmp eq i64 %iv, 0
  br i1 %cond_1, label %exit, label %backedge
backedge:                                         ; preds = %loop
  %addr = getelementptr inbounds i32, i32* %p, i64 %iv.next
  %loaded = load atomic i32, i32* %addr unordered, align 4
  %cond_2 = icmp eq i32 %loaded, %x
  br i1 %cond_2, label %failure, label %loop
exit:                                             ; preds = %loop
  ret i32 -1
failure:                                          ; preds = %backedge
  unreachable
}

Basically, this code searches for element x in an array of values. Here is the llc result for this loop with mtriple=x86_64-apple-macosx:

        .p2align  4, 0x90
LBB0_1:                                 ## %loop
                                        ## =>This Inner Loop Header: Depth=1
        subq    $1, %rax
        jb      LBB0_4
## %bb.2:                               ## %backedge
                                        ## in Loop: Header=BB0_1 Depth=1
        cmpl    %edx, -4(%rdi,%rsi,4)
        movq    %rax, %rsi
        jne     LBB0_1

(Note: the last movq is redundant; it is inserted by LSR, likely due to a cost model bug, filed as https://bugs.llvm.org/show_bug.cgi?id=48355. Regardless, the situation remains the same.)
And here is the assembly on the x64 platform:

 97.34%  ↗  0x30026d50: 83ea01        subl $1, %edx
         │  0x30026d53: 0f820b060000  jb 1547  ; 0x30027364
  0.04%  │  0x30026d59: 89d3          movl %edx, %ebx
         │  0x30026d5b: 394c9810      cmpl %ecx, 16(%rax,%rbx,4)
         ╰  0x30026d5f: 75ef          jne -17  ; 0x30026d50

Important notes here:
* The loop is aligned by 16 bytes;
* The loop size is 17 bytes.

Depending on the particular machine, this loop sometimes also ends up aligned by 32 bytes. The score difference for this example is dramatic: 32 ops/ms when the loop is aligned by 16 (and not by 32), against 51 ops/ms when the loop is aligned by 32. It means the workload is bound by decoding, while the rest of execution works fine (insns/cycle grows from 2.5 to 3.7 when aligned).

Alignment of this particular loop depends on how code for the preceding IR is generated, and in our case it varies:
* Host to host;
* Build to build;
* Run to run (observed at least once, might be a JIT effect?).

Here are some performance counters collected on this test:

Align 16:
    17,502,234,018      cycles                            (81.80%)
        26,147,856      idq.all_dsb_cycles_4_uops         (81.80%)
    17,304,494,357      idq.all_dsb_cycles_any_uops       (81.80%)
    17,302,169,702      idq.dsb_cycles                    (81.80%)
    34,681,358,154      idq.dsb_uops                      (81.80%)
        23,714,772      idq.all_mite_cycles_4_uops        (81.82%)
        87,658,024      idq.all_mite_cycles_any_uops      (81.83%)
        66,740,027      idq.mite_cycles                   (81.84%)
       163,425,243      idq.mite_uops                     (81.84%)
         2,674,281      dsb2mite_switches.penalty_cycles  (72.74%)
         5,537,882      idq.ms_switches                   (72.72%)

       5.001125917 seconds time elapsed

Align 32:
    17,686,896,278      cycles                            (81.79%)
    13,101,510,457      idq.all_dsb_cycles_4_uops         (81.81%)
    13,113,604,250      idq.all_dsb_cycles_any_uops       (81.83%)
    13,112,858,568      idq.dsb_cycles                    (81.83%)
    52,468,480,758      idq.dsb_uops                      (81.83%)
        23,659,985      idq.all_mite_cycles_4_uops        (81.83%)
        64,453,229      idq.all_mite_cycles_any_uops      (81.83%)
        44,043,473      idq.mite_cycles                   (81.83%)
       123,098,484      idq.mite_uops                     (81.83%)
         2,485,361      dsb2mite_switches.penalty_cycles  (72.71%)
         5,676,762      idq.ms_switches                   (72.70%)

       5.051353273 seconds time elapsed

There is a dramatic difference in the counter "idq.all_dsb_cycles_4_uops": with 16-byte alignment only ~0.15% of DSB cycles deliver 4 uops (26M out of 17.3B), while with 32-byte alignment it is nearly 100% (13.10B out of 13.11B). The apparent reason is that the DSB works with 32-byte aligned instruction windows. If the loop crosses the boundary of such a window, it cannot work at maximum efficiency. We only observe this in one workload (out of the Java benchmark set we run regularly), but given how simple this example is, I see no reason why it wouldn't be a generic, widespread issue. Obviously, we want such loops aligned by 32 to benefit from the DSB.

Proposal:

We could have a heuristic that tries to align such loops by 32 bytes. It may be more or less complex. The simplest solution would be "let's align all innermost loops by 32" (we already have an experimental option for that). A more accurate solution would be:

Align loops by 32 if:
* They are innermost;
* The loop size mod 32 is between 16 and 31 (only in this case will alignment by 32 strictly reduce the number of 32-byte window crossings by 1);
* (Optional) The loop is small, e.g. less than 32 bytes;
* (Optional) We could make even sharper checks trying to ensure that all other conditions for maximum DSB utilization are met (may be very complex!).

Possible cons we're aware of:
* Code size. Insertion of NOPs for alignment potentially increases code size.
  * While there is clearly some growth, I think it's negligible: we've measured this on a huge Java code corpus, and the impact is far less than 1% and indistinguishable from noise in JIT runs.
* Code size growth may affect the performance of containing loops (e.g. they may stop fitting into some code window/cache line/icache).
  * This, though theoretically possible, is very unlikely to be a real problem. Outer loops only matter if the inner loops are small w.r.t. them. That means the inner loops will likely be unrolled and fully eliminated, and the parents will become inner loops themselves.
  * If it turns out to be a real problem, we can refrain from applying the 32-byte alignment to loops with parents.
* This may interact with some other processor mechanisms in a negative way. My knowledge here is not strong enough to say which ones and how.
  * We did not see a negative performance impact on our benchmarks. That does not mean there is none, but we may hope such cases are uncommon.

I suggest we think this over and choose how restrictively we want to align loops by 32, given this data.

Regards,
Max