David Chisnall
2014-May-10 14:47 UTC
[LLVMdev] Replacing Platform Specific IR Codes with Generic Implementation and Introducing Macro Facilities
On 10 May 2014, at 13:53, Tim Northover <t.p.northover at gmail.com> wrote:

> It doesn't make sense for everything though, particularly if you want
> target-specific IR to simply not exist. What would you map ARM's
> "ldrex" to on x86?

This isn't a great example. Having load-linked / store-conditional in the IR would make a number of transforms related to atomics easier. We currently can't correctly model the weak compare-and-exchange from the C[++]11 memory model, and we generate terrible code for a number of common atomic idioms on non-x86 platforms as a result.

David
Tim Northover
2014-May-10 15:18 UTC
[LLVMdev] Replacing Platform Specific IR Codes with Generic Implementation and Introducing Macro Facilities
> This isn't a great example. Having load-linked / store-conditional in the
> IR would make a number of transforms related to atomics easier. We
> currently can't correctly model the weak compare-and-exchange from
> the C[++]11 memory model and we generate terrible code for a number
> of common atomic idioms on non-x86 platforms as a result.

Actually, I really agree there. I considered it recently, but decided to leave it as an intrinsic for now (the new IR expansion pass happens after most optimisations so there wouldn't be much benefit, but if we did it earlier and the mid-end understood what an ldrex/strex meant, I could see code getting much better).

Load linked would be fairly easy (perhaps even written as "load linked", a minor extension to "load atomic"). Store conditional would be a bigger change, since stores don't return anything at the moment; passes may not be expecting to have to ReplaceAllUses on them.

I'm hoping to have some more time to spend on atomics soon, after this merge business is done. Perhaps then.

I don't suppose you have any plans to port Mips to the IR-level LL/SC expansion? Now that the infrastructure is present it's quite a simplification (r206490 in ARM64, for example, though you need existing target-specific intrinsics at the moment). It would be good to iron out any ARM-specific assumptions I've made.

But it would still be a construct that probably just couldn't be used on x86 efficiently, not really a step towards a target independent IR.

Cheers.

Tim.
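To make the shape of this suggestion concrete, here is a rough sketch in IR. The "load linked" and "store conditional" spellings are hypothetical and do not exist in LLVM IR; they are simply the "minor extension to load atomic" and the value-returning store described above, shown on a simple atomic increment chosen here purely as an example:

  ; Hypothetical syntax only, sketched from the description above.
  retry:
    ; Like "load atomic", but additionally opens an exclusive reservation.
    %old = load linked i32* %addr seq_cst, align 4
    %new = add i32 %old, 1
    ; Unlike an ordinary store, this yields an i1: false if the reservation
    ; was lost and the store did not happen.
    %ok = store conditional i32 %new, i32* %addr seq_cst
    br i1 %ok, label %done, label %retry

On ARM this would map to ldrex/strex and on MIPS to ll/sc; on x86 there is no direct equivalent, which is the point made at the end of this message about it not being usable efficiently there.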
Suminda Dharmasena
2014-May-10 15:19 UTC
[LLVMdev] Replacing Platform Specific IR Codes with Generic Implementation and Introducing Macro Facilities
What I meant by a macro facility is something along these lines; I need to think more on it, though:

http://luajit.org/dynasm_features.html
David Chisnall
2014-May-10 16:38 UTC
[LLVMdev] Replacing Platform Specific IR Codes with Generic Implementation and Introducing Macro Facilities
On 10 May 2014, at 16:18, Tim Northover <t.p.northover at gmail.com> wrote:

> Actually, I really agree there. I considered it recently, but decided
> to leave it as an intrinsic for now (the new IR expansion pass happens
> after most optimisations so there wouldn't be much benefit, but if we
> did it earlier and the mid-end understood what an ldrex/strex meant, I
> could see code getting much better).
>
> Load linked would be fairly easy (perhaps even written as "load
> linked", a minor extension to "load atomic"). Store conditional would
> be a bigger change since stores don't return anything at the moment;
> passes may not be expecting to have to ReplaceAllUses on them.

The easiest solution would be to extend the cmpxchg instruction with a weak variant. It is then trivial to map load, modify, weak-cmpxchg to load-linked, modify, store-conditional (that is what the weak cmpxchg was intended for in the C[++]11 memory model).

> I'm hoping to have some more time to spend on atomics soon, after this
> merge business is done. Perhaps then.
>
> I don't suppose you have any plans to port Mips to the IR-level LL/SC
> expansion? Now that the infrastructure is present it's quite a
> simplification (r206490 in ARM64 for example, though you need existing
> target-specific intrinsics at the moment). It would be good to iron
> out any ARM-specific assumptions I've made.

I'd rather avoid that, because doing the expansion that late precludes a lot of the optimisations we're interested in. I'd much rather extend the IR to support these operations at a generic level. We have a couple of plans for variations of atomic operations in our architecture, so we'll likely end up trying and throwing away a few approaches over the next couple of years.

> But it would still be a construct that probably just couldn't be used
> on x86 efficiently, not really a step towards a target independent IR.

On x86, we could map a weak cmpxchg to the same thing as a strong cmpxchg, so it would still generate the same code. The same is true for all architectures with a non-blocking compare-and-exchange operation.

David
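As a rough illustration of this suggestion, a weak cmpxchg might be spelled as below. This is a hypothetical sketch, not IR that existed at the time of this thread: because a weak compare-and-exchange is allowed to fail spuriously, it would have to report success explicitly rather than leaving callers to compare the returned value against the expected one.

  ; Hypothetical sketch of a weak cmpxchg variant.
  %pair = cmpxchg weak i32* %addr, i32 %expected, i32 %desired seq_cst
  %old  = extractvalue { i32, i1 } %pair, 0   ; value that was in memory
  %ok   = extractvalue { i32, i1 } %pair, 1   ; true only if the exchange happened
  br i1 %ok, label %done, label %retry

On an LL/SC target the whole sequence can lower to a single ll/sc pair; on x86 it can simply be lowered exactly like the strong form, as noted above.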
Andrew Trick
2014-May-10 18:25 UTC
[LLVMdev] Replacing Platform Specific IR Codes with Generic Implementation and Introducing Macro Facilities
On May 10, 2014, at 7:47 AM, David Chisnall <David.Chisnall at cl.cam.ac.uk> wrote:

> On 10 May 2014, at 13:53, Tim Northover <t.p.northover at gmail.com> wrote:
>
>> It doesn't make sense for everything though, particularly if you want
>> target-specific IR to simply not exist. What would you map ARM's
>> "ldrex" to on x86?
>
> This isn't a great example. Having load-linked / store-conditional in the IR would make a number of transforms related to atomics easier. We currently can't correctly model the weak compare-and-exchange from the C[++]11 memory model and we generate terrible code for a number of common atomic idioms on non-x86 platforms as a result.

The IR is missing a weak variant of cmpxchg. But is there anything else missing at the IR level? My understanding was that LLVM's atomic memory ordering constraints are complete, but that codegen is not highly optimized, and may be conservative for some targets. Which idiom do you have trouble with on non-x86?

-Andy
David Chisnall
2014-May-10 18:35 UTC
[LLVMdev] Replacing Platform Specific IR Codes with Generic Implementation and Introducing Macro Facilities
On 10 May 2014, at 19:25, Andrew Trick <atrick at apple.com> wrote:

> The IR is missing a weak variant of cmpxchg. But is there anything else missing at IR level? My understanding was that LLVM's atomic memory ordering constraints are complete, but that codegen is not highly optimized, and maybe conservative for some targets. Which idiom do you have trouble with on non-x86?

The example from our EuroLLVM talk was this:

  _Atomic(int) a;
  a *= b;

This is (according to the spec) equivalent to this (simplified slightly):

  int expected = a;
  int desired;
  do {
    desired = expected * b;
  } while (!compare_swap_weak(&a, &expected, desired));

What clang generates is almost this, but with a strong compare and swap:

  define void @mul(i32* %a, i32 %b) #0 {
  entry:
    %atomic-load = load atomic i32* %a seq_cst, align 4, !tbaa !1
    br label %atomic_op

  atomic_op:                                ; preds = %atomic_op, %entry
    %0 = phi i32 [ %atomic-load, %entry ], [ %1, %atomic_op ]
    %mul = mul nsw i32 %0, %b
    %1 = cmpxchg i32* %a, i32 %0, i32 %mul seq_cst
    %2 = icmp eq i32 %1, %0
    br i1 %2, label %atomic_cont, label %atomic_op

  atomic_cont:                              ; preds = %atomic_op
    ret void
  }

This maps trivially to x86:

  LBB0_1:
    movl    %ecx, %edx
    imull   %esi, %edx
    movl    %ecx, %eax
    lock cmpxchgl %edx, (%rdi)
    cmpl    %ecx, %eax
    movl    %eax, %ecx
    jne     LBB0_1

For MIPS, what we *should* be generating is:

    sync 0              # Ensure all other loads / stores are globally visible
  retry:
    ll   $t4, 0($a0)    # Load the current value of the atomic int
    mult $t4, $a1       # Multiply by the other argument
    mflo $t4            # Get the result
    sc   $t4, 0($a0)    # Try to write it back atomically
    beqz $t4, retry     # If we failed, try the whole thing again
    sync 0              # branch delay slot - ensure seqcst behaviour here

What we actually generate is this:

  # BB#0:                                 # %entry
    daddiu  $sp, $sp, -16
    sd      $fp, 8($sp)                   # 8-byte Folded Spill
    move    $fp, $sp
    addiu   $3, $zero, 0
  $BB0_1:                                 # %entry
                                          # =>This Inner Loop Header: Depth=1
    ll      $2, 0($4)
    bne     $2, $3, $BB0_3
    nop
  # BB#2:                                 # %entry
                                          #   in Loop: Header=BB0_1 Depth=1
    addiu   $6, $zero, 0
    sc      $6, 0($4)
    beqz    $6, $BB0_1
    nop
  $BB0_3:                                 # %entry
    sync 0
  $BB0_4:                                 # %atomic_op
                                          # =>This Loop Header: Depth=1
                                          #     Child Loop BB0_5 Depth 2
    move    $3, $2
    mul     $6, $3, $5
    sync 0
  $BB0_5:                                 # %atomic_op
                                          #   Parent Loop BB0_4 Depth=1
                                          # =>  This Inner Loop Header: Depth=2
    ll      $2, 0($4)
    bne     $2, $3, $BB0_7
    nop
  # BB#6:                                 # %atomic_op
                                          #   in Loop: Header=BB0_5 Depth=2
    move    $7, $6
    sc      $7, 0($4)
    beqz    $7, $BB0_5
    nop
  $BB0_7:                                 # %atomic_op
                                          #   in Loop: Header=BB0_4 Depth=1
    sync 0
    bne     $2, $3, $BB0_4
    nop
  # BB#8:                                 # %atomic_cont
    move    $sp, $fp
    ld      $fp, 8($sp)                   # 8-byte Folded Reload
    jr      $ra
    daddiu  $sp, $sp, 16

For correctness, we *have* to implement the cmpxchg in the IR as an ll/sc loop, and so we end up with a nested loop for something that is a single line in the source. The idiom of the weak compare-and-exchange loop is a fairly common one, but we generate spectacularly bad code for it.

David
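For comparison, here is how the same function might look with the hypothetical weak cmpxchg sketched earlier in the thread. Because a weak compare-and-exchange may fail spuriously, this single IR loop could legally be lowered straight to one ll/sc loop of the kind shown in the "should be generating" listing above, rather than a loop nested inside a loop:

  ; Sketch only: "cmpxchg weak" is hypothetical syntax in this thread.
  define void @mul(i32* %a, i32 %b) {
  entry:
    %atomic-load = load atomic i32* %a seq_cst, align 4
    br label %atomic_op

  atomic_op:
    %expected = phi i32 [ %atomic-load, %entry ], [ %old, %atomic_op ]
    %mul = mul nsw i32 %expected, %b
    ; Hypothetical weak variant: returns the loaded value plus a success flag.
    %pair = cmpxchg weak i32* %a, i32 %expected, i32 %mul seq_cst
    %old = extractvalue { i32, i1 } %pair, 0
    %ok  = extractvalue { i32, i1 } %pair, 1
    br i1 %ok, label %atomic_cont, label %atomic_op

  atomic_cont:
    ret void
  }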
Tim Northover
2014-May-10 18:35 UTC
[LLVMdev] Replacing Platform Specific IR Codes with Generic Implementation and Introducing Macro Facilities
> The IR is missing a weak variant of cmpxchg. But is there anything else
> missing at IR level? My understanding was that LLVM's atomic memory
> ordering constraints are complete, but that codegen is not highly optimized,
> and maybe conservative for some targets. Which idiom do you have trouble
> with on non-x86?

For myself, I don't like the fact that LLVM's atomicrmw & cmpxchg instructions are so beholden to C. With suitable constraints, an "atomicrmw [](int x) { ... }" isn't unreasonable; but this can only be mapped to a cmpxchg loop with the current IR.

Tim.
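To make that concrete, a generalised atomicrmw along the lines gestured at above might look roughly like the sketch below. This is purely hypothetical syntax, not a proposal that exists anywhere; the point is only that today the same computation can be expressed atomically in IR solely as a cmpxchg loop like the one in the example earlier in the thread.

  ; Purely hypothetical: an atomicrmw whose update is an arbitrary
  ; side-effect-free computation over the old value, rather than one of the
  ; fixed operations (add, sub, and, xchg, ...).
  %old = atomicrmw i32* %addr seq_cst update (i32 %x) {
    %t = mul nsw i32 %x, 3
    ret i32 %t
  }

On an LL/SC target the body could be emitted between the ll and the sc; on x86 it would still have to expand to a cmpxchg loop, which is exactly the constraint being discussed.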