Alex Bradbury via llvm-dev
2018-Jun-13 15:42 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
# RFC: Atomic LL/SC loops in LLVM revisited

## Summary

This proposal gives a brief overview of the challenges of lowering to LL/SC
loops and details the approach I am taking for RISC-V. Beyond getting feedback
on that work, my intention is to find consensus on moving other backends
towards a similar approach and sharing common code where feasible. Scroll down
to 'Questions' for a summary of the issues I think need feedback and
agreement.

For the original discussion of LL/SC lowering, please refer to James Knight's
2016 thread on the topic:
http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html

I'd like to thank James Knight, JF Bastien, and Eli Friedman for being so
generous with their review feedback on this atomics work so far.

## Background: Atomics in LLVM

See the documentation for full details <https://llvm.org/docs/Atomics.html>.
In short: LLVM defines memory ordering constraints to match the C11/C++11
memory model (unordered, monotonic, acquire, release, acqrel, seqcst). These
can be given as parameters to the atomic operations supported in LLVM IR:

* Fences with the fence instruction.
* Atomic load and store with the 'load atomic' and 'store atomic' variants of
the load/store instructions.
* Fetch-and-binop / read-modify-write operations through the atomicrmw
instruction.
* Compare and exchange via the cmpxchg instruction. It takes memory orderings
for both the success and failure cases, and can also be marked 'weak' rather
than 'strong', where the weak variant allows spurious failure.

## Background: Atomics in RISC-V

For full details see a recent draft of the ISA manual
<https://github.com/riscv/riscv-isa-manual/releases/download/draft-20180612-548fd40/riscv-spec.pdf>,
which incorporates work from the Memory Consistency Model Task Group to define
the memory model. RISC-V implements a weak memory model.

For those not familiar, RISC-V is a modular ISA, with standard extensions
indicated by single letters. The baseline 'RV32I' or 'RV64I' instruction sets
don't support atomic operations beyond fences. However, the RV32A and RV64A
instruction set extensions introduce AMOs (Atomic Memory Operations) and LR/SC
(known as load-linked/store-conditional on other architectures). 32-bit atomic
operations are supported natively on RV32, and both 32 and 64-bit atomic
operations are supported natively on RV64.

AMOs such as 'amoadd.w' implement simple fetch-and-binop behaviour. For LR/SC:
LR loads a word and registers a reservation on the source memory address. SC
writes the given word to the memory address and writes success (zero) or
failure (non-zero) into the destination register. LR/SC can be used to
implement compare-and-exchange or to implement AMOs that don't have a native
instruction. To do so, you would typically perform LR and SC in a loop.
However, there are strict limits on the instructions that can be placed
between an LR and an SC while still guaranteeing forward progress:

"""
The static code for the LR/SC sequence plus the code to retry the sequence in
case of failure must comprise at most 16 integer instructions placed
sequentially in memory. For the sequence to be guaranteed to eventually
succeed, the dynamic code executed between the LR and SC instructions can only
contain other instructions from the base “I” subset, excluding loads, stores,
backward jumps or taken backward branches, FENCE, FENCE.I, and SYSTEM
instructions. The code to retry a failing LR/SC sequence can contain backward
jumps and/or branches to repeat the LR/SC sequence, but otherwise has the same
constraints.
"""

The native AMOs and LR/SC allow ordering constraints to be specified in the
instruction. This isn't possible for load/store instructions, so fences must
be inserted to represent the ordering constraints. 8 and 16-bit atomic
load/store are therefore supported using 8 and 16-bit load/store plus
appropriate fences.

See Table A.6 on page 187 in the linked specification for a mapping from C/C++
atomic constructs to RISC-V instructions.

## Background: Lowering atomics in LLVM

The AtomicExpandPass can help you support atomics for your target in a number
of ways, e.g. inserting fences around atomic loads/stores, or converting an
atomicrmw/cmpxchg to an LL/SC loop. It operates as an IR-level pass, meaning
the latter ability is problematic - there is no way to guarantee that the
invariants for the LL/SC loop required by the target architecture will be
maintained. This shows up most frequently when register spills are introduced
at O0, but spills could theoretically still occur at higher optimisation
levels, and there are other potential sources of issues: inappropriate
instruction selection, machine block placement, machine outlining (though see
D47654 and D47655), and likely more.

I highly encourage you to read James Knight's previous post on this topic,
which goes into much more detail about the issues handling LL/SC
<http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html>. The situation
remains pretty much the same:

* ARM and AArch64 expand to LL/SC loops in IR using AtomicExpandPass for O1
and above, but use a custom post-regalloc expansion for O0.
* MIPS doesn't use AtomicExpandPass, but selects atomic pseudo-instructions
which it expands to LL/SC loops in EmitInstrWithCustomInserter. This still has
the problems described above, so MIPS is in the process of moving towards a
two-stage lowering, with the LL/SC loop lowered after register allocation. See
D31287 <https://reviews.llvm.org/D31287>.
* Hexagon unconditionally expands to LL/SC loops in IR using AtomicExpandPass.

Lowering a word-sized atomic operation to an LL/SC loop is typically trivial,
requiring little surrounding code. Part-word atomics require additional
shifting and masking as a word-sized access is used. It would be beneficial if
the code to generate this shifting and masking could be shared between
targets, and if the operations that don't need to be in the LL/SC loop are
exposed for LLVM optimisation.

The path forwards is very clearly to treat the LL/SC loop as an indivisible
operation which is expanded as late as possible (and certainly after register
allocation). However, there are a few ways of achieving this.

If atomic operations of a given size aren't supported, then calls should be
created to the helper functions in libatomic, and this should be done
consistently for all atomic operations of that size. I actually found GCC is
buggy in that respect <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86005>.

## Proposed lowering strategy (RISC-V)

Basic principles:

* The LL/SC loop should be treated as a black box, and expanded post-RA.
* Don't introduce intrinsics that simply duplicate existing IR instructions.
* If code can be safely expanded in the IR, do it there. [I'd welcome feedback
on this one - should I be taking a closer look at expansion in SelectionDAG
legalisation?]

The above can be achieved by extending AtomicExpandPass to support a 'Custom'
expansion method, which uses a TargetLowering call to expand to custom IR,
including an appropriate intrinsic representing the LL+SC loop.
Atomic operations are lowered in the following ways:

* Atomic load/store: Allow AtomicExpandPass to insert appropriate fences.
* Word-sized AMO supported by a native instruction: Leave the IR unchanged and
use the normal instruction selection mechanism.
* Word-sized AMO without a native instruction: Select a pseudo-instruction
using the normal instruction selection mechanism. This pseudo-instruction will
be expanded after register allocation.
* Part-word AMO without a native instruction: Shifting and masking that occurs
outside of the LL/SC loop is expanded in the IR, and a call to a
target-specific intrinsic to implement the LL/SC loop is inserted (e.g.
llvm.riscv.masked.atomicrmw.add.i32). The intrinsic is matched to a
pseudo-instruction which is expanded after register allocation.
* Part-word AMO without a native instruction that can be implemented by a
native word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be implemented
by 32-bit amoand, amoor, amoxor. Perform this conversion as an IR
transformation.
* Word-sized compare-and-exchange: Lower to a pseudo-instruction using the
normal instruction selection mechanism. This pseudo-instruction will be
expanded after register allocation.
* Part-word compare-and-exchange: Handled similarly to part-word AMOs, calling
llvm.riscv.masked.cmpxchg.i32.

Scratch registers for these pseudo-instructions are modelled as in ARM and
AArch64, by specifying multiple outputs and using an @earlyclobber constraint
to ensure the register allocator assigns unique registers. e.g.:

class PseudoCmpXchg
    : Pseudo<(outs GPR:$res, GPR:$scratch),
             (ins GPR:$addr, GPR:$cmpval, GPR:$newval, i32imm:$ordering), []> {
  let Constraints = "@earlyclobber $res,@earlyclobber $scratch";
  let mayLoad = 1;
  let mayStore = 1;
  let hasSideEffects = 0;
}

Note that there are additional complications with cmpxchg such as supporting
weak cmpxchg (which requires returning a success value), or supporting
different failure orderings. It looks like the differentiation between
strong/weak cmpxchg doesn't survive the translation to SelectionDAG right now.
Supporting only strong cmpxchg and using the success ordering for the failure
case is conservative but correct, I believe.

In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at the
latest possible moment. The RISCVExpandPseudoInsts pass is registered with
addPreEmitPass2.

The main aspect I'm unhappy with in this approach is the need to introduce new
intrinsics. Ideally these would be documented as not for use by frontends and
subject to future removal or alteration - is there precedent for this?
Alternatively, see the suggestion below to introduce target-independent masked
AMO intrinsics.

## Alternative options

1. Don't expand anything in IR, and lower to a single monolithic
pseudo-instruction that is expanded at the last minute.
2. Don't expand anything in IR, and lower to pseudo-instructions in stages.
Lower to a monolithic pseudo-instruction where any logic outside of the LL/SC
loop is expanded in EmitInstrWithCustomInserter, but the LL/SC loop is
represented by a new pseudo-instruction. This final pseudo-instruction is then
expanded after register allocation. This minimises the possibility for sharing
logic between backends, but does mean we don't need to expose new intrinsics.
Mips adopts this approach in D31287.
3. Target-independent SelectionDAG expansion code converts unsupported atomic
operations, e.g.
rather than converting `atomicrmw add i8` to AtomicLoadAdd, expand to nodes
that align the address and calculate the mask, as well as an
AtomicLoadAddMasked node. I haven't looked at this in great detail.
4. Introduce masked atomic operations to the IR. Mentioned for completeness;
I don't think anyone wants this.
5. Introduce target-independent intrinsics for masked atomic operations. This
seems worthy of consideration.

For 1. and 2. the possibility for sharing logic between backends is minimised,
and the address calculation, masking and shifting logic is mostly hidden from
optimisations (though option 2. allows e.g. MachineCSE). There is the
advantage of avoiding the need for new intrinsics.

## Patches up for review

I have patches up for review which implement the described strategy. More
could be done to increase the potential for code reuse across targets, but I
thought it would be worth getting feedback on the path forwards first.

* D47587: [RISCV] Codegen support for atomic operations on RV32I.
<https://reviews.llvm.org/D47587>. Simply adds support for lowering fences and
uses AtomicExpandPass to generate libatomic calls otherwise. Committed in
rL334590.
* D47589: [RISCV] Add codegen support for atomic load/stores with RV32A.
<https://reviews.llvm.org/D47589>. Use AtomicExpandPass to insert fences for
atomic load/store. Committed in rL334591.
* D47882: [RISCV] Codegen for i8, i16, and i32 atomicrmw with RV32A.
<https://reviews.llvm.org/D47882>. Implements the lowering strategy described
above for atomicrmw and adds a hook to allow custom atomicrmw expansion in IR.
Under review.
* D48129: [RISCV] Improved lowering for bit-wise atomicrmw {i8, i16} on RV32A.
<https://reviews.llvm.org/D48129>. Uses 32-bit AMO{AND,OR,XOR} with
appropriately manipulated operands to implement 8/16-bit AMOs. Under review.
* D48130: [AtomicExpandPass]: Add a hook for custom cmpxchg expansion in IR.
<https://reviews.llvm.org/D48130>. Separate patch as this modifies the
existing shouldExpandAtomicCmpXchgInIR interface. Under review.
* D48141: [RISCV] Implement codegen for cmpxchg on RV32I.
<https://reviews.llvm.org/D48131>. Implements the lowering strategy described
above. Under review.

## Questions

To pick a few to get started:

* How do you feel about the described lowering strategy? Am I unfairly
overlooking a SelectionDAG approach?
* How much enthusiasm is there for moving ARM, AArch64, Mips, Hexagon, and
other architectures to use such an approach?
* If there is enthusiasm, how worthwhile is it to share logic for generation
of masks+shifts needed for part-word atomics?
* I'd like to see ARM+AArch64+Hexagon move away from the problematic expansion
in IR and to have that code deleted from AtomicExpandPass. Are there any
objections?
* What are your thoughts on the introduction of new target-independent
intrinsics for masked atomics?

Many thanks for your feedback,

Alex Bradbury, lowRISC CIC
James Y Knight via llvm-dev
2018-Jun-13 22:37 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
Great writeup, thanks! I'll respond to the rest in a bit, but I have a
digression first...

One detail I hadn't noticed in my previous investigation is that the
"compare_exchange_weak" (or our IR "cmpxchg weak") operation is theoretically
unsound -- and NECESSARILY so.

As mentioned in this thread, there are architectural guarantees for forward
progress of an LL/SC loop adhering to certain constraints. If your loop does
not meet those constraints, it may well livelock vs another CPU executing the
same loop, or even spuriously fail deterministically every time it's executed
on its own. This thread is all about a proposal to ensure that LLVM emits
LL/SC loops such that they're guaranteed to meet the architectural guarantees
and avoid the possibility of those bad situations.

There is not to my knowledge any architecture which makes any guarantee that
an LL/SC pair standing alone, without a containing loop, will not spuriously
fail. It may in fact fail every time it's executed. Of course, the typical
usage of compare_exchange_weak is to embed it within a loop containing other
user code, but given that it contains arbitrary user code, there's simply no
way we can guarantee that this larger loop meets the LL/SC-loop construction
requirements. Therefore, it's entirely possible for some particular
compare_exchange_weak-based loop to livelock or to deterministically
spuriously fail. That seems poor, and exactly the kind of thing that we'd
very much like to avoid...

So, now I wonder -- is there actually a measurable performance impact (on,
say, ARM) if we were to just always emit the "strong" cmpxchg loop? I'm
feeling rather inclined to say we should not implement the weak variant at
all. (And, if others agree with my diagnosis here, also propose to deprecate
it from the C/C++ languages...)
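To illustrate the structural problem, here is a minimal IR sketch (the
function and value names are invented for this example) of the usual
compare_exchange_weak retry loop. The retry edge is ordinary user-visible
control flow interleaved with arbitrary user code, so no backend can
guarantee that the overall loop satisfies an architecture's LL/SC
forward-progress rules:

    define i32 @weak_increment_sketch(i32* %p) {
    entry:
      %initial = load atomic i32, i32* %p monotonic, align 4
      br label %retry
    retry:
      %old = phi i32 [ %initial, %entry ], [ %observed, %retry ]
      %new = add i32 %old, 1           ; arbitrary user code sits here
      %pair = cmpxchg weak i32* %p, i32 %old, i32 %new seq_cst monotonic
      %observed = extractvalue { i32, i1 } %pair, 0
      %ok = extractvalue { i32, i1 } %pair, 1
      br i1 %ok, label %done, label %retry
    done:
      ret i32 %new
    }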
Alex Bradbury via llvm-dev
2018-Jun-13 23:29 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 13 June 2018 at 23:37, James Y Knight via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Great writeup, thanks!
>
> I'll respond to the rest in a bit, but I have a digression first...
>
> One detail I hadn't noticed in my previous investigation is that the
> "compare_exchange_weak" (or our IR "cmpxchg weak") operation is
> theoretically unsound -- and NECESSARILY so.
>
> As mentioned in this thread, there are architectural guarantees for forward
> progress of an LL/SC loop adhering to certain constraints. If your loop does
> not meet those constraints, it may well livelock vs another CPU executing
> the same loop, or even spuriously fail deterministically every time it's
> executed on its own. This thread is all about a proposal to ensure that LLVM
> emits LL/SC loops such that they're guaranteed to meet the architectural
> guarantees and avoid the possibility of those bad situations.
>
> There is not to my knowledge any architecture which makes any guarantee that
> an LL/SC pair standing alone, without a containing loop, will not spuriously
> fail. It may in fact fail every time it's executed. Of course, the typical
> usage of compare_exchange_weak is to embed it within a loop containing other
> user code, but given that it contains arbitrary user code, there's simply no
> way we can guarantee that this larger loop meets the LL/SC-loop construction
> requirements. Therefore, it's entirely possible for some particular
> compare_exchange_weak-based loop to livelock or to deterministically
> spuriously fail. That seems poor, and exactly the kind of thing that we'd
> very much like to avoid...

I hadn't noticed that either - excellent point! The RISC-V requirements for
forward progress encompass the LR, the SC, the instructions between them, and
the code to retry the LR/SC. A weak cmpxchg by definition separates the LR/SC
from the code to retry it, and so suffers from the same sort of problems that
we see when exposing LL and SC individually as IR intrinsics.

I agree with your diagnosis as far as RISC-V is concerned at least.

Best,

Alex
Alex Bradbury via llvm-dev
2018-Jun-14 09:01 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 13 June 2018 at 23:37, James Y Knight via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> So, now I wonder -- is there actually a measurable performance impact (on,
> say, ARM) if we were to just always emit the "strong" cmpxchg loop? I'm
> feeling rather inclined to say we should not implement the weak variant at
> all. (And, if others agree with my diagnosis here, also propose to deprecate
> it from the C/C++ languages...)

I'm not sure deprecation is justified. A "Sufficiently Smart Compiler" could
make use of the knowledge that a weak cmpxchg was considered sufficient by the
programmer, and use the weak cmpxchg form if the surrounding loop can be shown
to meet the forward progress guarantees of the architecture. In LLVM, this
could be done by analyzing the surrounding basic block when expanding the
cmpxchg pseudo after regalloc. It's unlikely it would be worth the hassle, but
it does seem a viable approach.

Best,

Alex
Tim Northover via llvm-dev
2018-Jun-14 09:28 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
> expansion in IR and to have that code deleted from AtomicExpandPass. Are
> there any objections?

I think it would be a great shame and I'd like to avoid it if at all possible,
though I'm also uncomfortable with the current situation.

The problem with late expansion is that even the simplest variants add some
rather unpleasant pseudo-instructions, and I've never even seen anyone attempt
to get good CodeGen out of it (the simplest example being "add xD, xD, #1" in
the loop for an increment, but there are obviously more). Doing so would
almost certainly involve duplicating a lot of basic arithmetic instructions
into AtomicRMW variants.

Added to that is the fact that actual CPU implementations are often a lot less
restrictive about what can go into a loop than is required (for example, even
spilling is fine on ours as long as the atomic object is not in the same
cache-line; I suspect that's normal). That casts the issue in a distinctly
theoretical light -- we've been doing this for years and as far as I'm aware
nobody has ever hit the issue in the real world, or even had CodeGen go wrong
in a way that *could* do so outside the -O0 situation.

OTOH that *is* an argument for performance over correctness when you get right
down to it, so I'm not sure I can be too forceful about it. At least not
without a counter-proposal to restore guaranteed correctness.

Tim.
Alex Bradbury via llvm-dev
2018-Jun-14 11:32 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 14 June 2018 at 10:28, Tim Northover via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
>> expansion in IR and to have that code deleted from AtomicExpandPass. Are
>> there any objections?
>
> I think it would be a great shame and I'd like to avoid it if at all
> possible, though I'm also uncomfortable with the current situation.

Thanks for the reply Tim. It's definitely fair to highlight that ARM and
AArch64 have few documented restrictions on code within an LL/SC loop when
compared to e.g. RISC-V or MIPS. I'm not sure about Hexagon.

> The problem with late expansion is that even the simplest variants add
> some rather unpleasant pseudo-instructions, and I've never even seen
> anyone attempt to get good CodeGen out of it (simplest example being
> "add xD, xD, #1" in the loop for an increment but there are obviously
> more). Doing so would almost certainly involve duplicating a lot of
> basic arithmetic instructions into AtomicRMW variants.

Let's separate the concerns here:

1) Quality of codegen within the LL/SC loop
   * What is the specific concern here? The LL/SC loop contains a very small
number of instructions, even for the masked atomicrmw case. Are you worried
about an extra arithmetic instruction or two? Sub-optimal control-flow?
Something else?
2) Number of new pseudo-instructions which must be introduced
   * You'd need new pseudos for each word-sized atomicrmw which expands to an
LL/SC loop, and an additional one for the masked form of the operation. You
could reduce the number of pseudos by taking the AtomicRMWInst::BinOp as a
parameter. The code to map the atomic op to the appropriate instruction is
tedious but straight-forward.

> Added to that is the fact that actual CPU implementations are often a
> lot less restrictive about what can go into a loop than is required
> (for example even spilling is fine on ours as long as the atomic
> object is not in the same cache-line; I suspect that's normal). That
> casts the issue in a distinctly theoretical light -- we've been doing
> this for years and as far as I'm aware nobody has ever hit the issue
> in the real world, or even had CodeGen go wrong in a way that *could*
> do so outside the -O0 situation.

That's true as long as the "Exclusives Reservation Granule" == 1 cache-line
and you don't deterministically cause the reservation to be cleared in some
other way: e.g. repeatedly generating a conflict miss or triggering a trap,
etc. I don't think any Arm cores ship with direct-mapped caches, so I'll admit
this is unlikely.

The possibility for issues increases if the Exclusives Reservation Granule is
larger. For the Cortex-M4, the ERG is the entire address range
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100166_0001_00_en/ric1417175928887.html>.
In that case, spills will surely clear the reservation.

There also seem to be documented and very strict forward progress constraints
for ARMv8-M. See
https://static.docs.arm.com/ddi0553/ah/DDI0553A_h_armv8m_arm.pdf p207:

"""
Forward progress can only be made using LoadExcl/StoreExcl loops if, for any
LoadExcl/StoreExcl loop within a single thread of execution, both of the
following are true:
• There are no explicit memory accesses, pre-loads, direct or indirect
register writes, cache maintenance instructions, SVC instructions, or
exception returns between the Load-Exclusive and the Store-Exclusive.
• The following conditions apply between the Store-Exclusive having returned
a fail result and the retry of the Load-Exclusive:
  – There are no stores to any location within the same Exclusives reservation
granule that the Store-Exclusive is accessing.
  – There are no direct or indirect register writes, other than changes to the
flag fields in APSR or FPSCR, caused by data processing or comparison
instructions.
  – There are no direct or indirect cache maintenance instructions, SVC
instructions, or exception returns.
"""

Of course it also states that the upper limit for the Exclusives Reservation
Granule is 2048 bytes, but the Cortex-M33 has an ERG of the entire address
range
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100230_0002_00_en/jfa1443092906126.html>
so something doesn't quite add up...

> OTOH that *is* an argument for performance over correctness when you
> get right down to it, so I'm not sure I can be too forceful about it.
> At least not without a counter-proposal to restore guaranteed
> correctness.

I suppose a machine-level pass could at least scan for any intervening
loads/stores in an LL/SC loop and check some other invariants, then warn/fail
if they occur. As you point out, this would be conservative for many Arm
implementations.

Best,

Alex
Krzysztof Parzyszek via llvm-dev
2018-Jun-15 15:21 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
Hexagon has very few restrictions on what will cause loss of a reservation:
those are stores to the same address (a 64-bit granule) or any sort of
exception/interrupt/system call. Other than that, the reservation should stay.
The architecture doesn't explicitly guarantee that, but in the absence of the
elements listed above, a program with LL/SC can be expected to make progress.

Consequently, the best way for us to handle LL/SC would be to expand them
early and let them be optimized as any other code. The usual optimization
restrictions should be sufficient to prevent introduction of factors causing
a loss of reservation.

With the constraints on LL/SC varying wildly between architectures, maybe we
should have several options available for different targets?

-Krzysztof

On 6/13/2018 10:42 AM, Alex Bradbury wrote:
> [full quote of the original RFC snipped]
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by
The Linux Foundation
James Y Knight via llvm-dev
2018-Jun-15 22:03 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On Wed, Jun 13, 2018 at 11:42 AM Alex Bradbury <asb at lowrisc.org> wrote:
> ## Proposed lowering strategy (RISC-V)
>
> Basic principles:
> * The LL/SC loop should be treated as a black box, and expanded post-RA.

It will probably come as no surprise, but I think this is 100% required for
correctness. I'm quite skeptical that it can be correct on any existing
architecture without doing this. My first inclination upon hearing that some
architecture may not require such careful handling is that their architecture
documentation probably failed to document the careful handling required, not
that it's not required.

However, in order to not block this on every architecture maintainer being
persuaded, it's a good idea to introduce the new functionality into
AtomicExpandPass, and switch architectures over to it as the arch maintainers
are convinced it's a good idea.

> * If code can be safely expanded in the IR, do it there. [I'd welcome
> feedback on this one - should I be taking a closer look at expansion in
> SelectionDAG legalisation?]

Definitely in favor of IR expansion -- we should do as much in IR as is
feasible, because it admits better optimization opportunities. Only the LL/SC
loop should be late-expanded.

> The above can be achieved by extending AtomicExpandPass to support a
> 'Custom' expansion method, which uses a TargetLowering call to expand to
> custom IR, including an appropriate intrinsic representing the LL+SC loop.

I think it'd be better to sink more of the functionality into AtomicExpandPass
itself (rather than adding a "Custom" hook). However, that ties into whether
to introduce a common intrinsic that can be used across architectures...

> * Part-word AMO without a native instruction that can be implemented by a
> native word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be
> implemented by 32-bit amoand, amoor, amoxor. Perform this conversion as an
> IR transformation.

+1, and that transform should be in common code in AtomicExpandPass.

> * Word-sized compare-and-exchange: Lower to a pseudo-instruction using the
> normal instruction selection mechanism. This pseudo-instruction will be
> expanded after register allocation.

On RISCV, implementing the whole thing in the pseudo is probably right, since
you only really have the inner loop.

But for other archs like ARMv7, I think it'll probably make sense to continue
to handle a bunch of the cmpxchg expansion in IR. There, the current cmpxchg
expansion can be quite complex, but only the loop really needs to be a
primitive (we'd need two loop variants, either "strex, ldrex, loop" or "ldrex,
strex, loop", depending on whether it generates an initial ldrex+barrier
first). All the rest -- initial ldrex+barrier, clrex, barriers -- can all
remain IR-level expanded.

I'll note that in all cases, both for RISCV and ARM and others, we _really_
would like to be able to have the compare-failure jump to a different address
than success. That isn't possible for an intrinsic call at the moment, but I
believe it will be possible to make that work soon, due to already ongoing
work for "asm goto", which requires similar. Once we have that ability, I
don't see any reason why the late-lowering cmpxchg pseudo should have any
performance downside vs IR expansion, at least w.r.t. any correct
optimizations.

> * Part-word compare-and-exchange: Handled similarly to part-word AMOs,
> calling llvm.riscv.masked.cmpxchg.i32.

Yep.
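For reference, a simplified sketch of the kind of IR-level cmpxchg expansion
being described for ARMv7 -- the expansion AtomicExpandPass actually emits is
more involved (barriers, weak variants, initial-load variants), so treat the
structure here as an approximation. Only the ldrex/strex loop would need to
become a post-RA primitive; the clrex on the failure path (and any fences)
can stay as ordinary IR:

    define { i32, i1 } @cmpxchg_expansion_sketch(i32* %p, i32 %cmp, i32 %new) {
    entry:
      br label %loop
    loop:
      ; the ldrex/strex loop (this block plus %trystore) is the part that
      ; would become an opaque post-RA pseudo
      %old = call i32 @llvm.arm.ldrex.p0i32(i32* %p)
      %should_store = icmp eq i32 %old, %cmp
      br i1 %should_store, label %trystore, label %nostore
    trystore:
      %status = call i32 @llvm.arm.strex.p0i32(i32 %new, i32* %p)
      %again = icmp ne i32 %status, 0
      br i1 %again, label %loop, label %done
    nostore:
      ; failure path: clearing the exclusive monitor can remain ordinary IR
      call void @llvm.arm.clrex()
      br label %done
    done:
      %ok = phi i1 [ true, %trystore ], [ false, %nostore ]
      %res0 = insertvalue { i32, i1 } undef, i32 %old, 0
      %res1 = insertvalue { i32, i1 } %res0, i1 %ok, 1
      ret { i32, i1 } %res1
    }

    declare i32 @llvm.arm.ldrex.p0i32(i32*)
    declare i32 @llvm.arm.strex.p0i32(i32, i32*)
    declare void @llvm.arm.clrex()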
> Note that there are additional complications with cmpxchg such as supporting
> weak cmpxchg (which requires returning a success value), or supporting
> different failure orderings. It looks like the differentiation between
> strong/weak cmpxchg doesn't survive the translation to SelectionDAG right
> now. Supporting only strong cmpxchg and using the success ordering for the
> failure case is conservative but correct I believe.

Yep. The distinction between strong/weak is only relevant to LL/SC
architectures, where you get to avoid making a loop. AFAIK, all ISAs which
have a native cmpxchg instruction implement the strong semantics for it, and
since the LL/SC loops are expanded in IR right now, they didn't need the weak
indicator in SelectionDAG. While strong doesn't _need_ a success indicator
output, as noted above it'll generate nicer code when we can.

> In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at the
> latest possible moment. The RISCVExpandPseudoInsts pass is registered with
> addPreEmitPass2.
>
> The main aspect I'm unhappy with in this approach is the need to introduce
> new intrinsics. Ideally these would be documented as not for use by
> frontends and subject to future removal or alteration - is there precedent
> for this? Alternatively, see the suggestion below to introduce
> target-independent masked AMO intrinsics.

I agree -- it'd be great to somehow name/annotate these intrinsics, such that
it's clear that they're a private implementation detail _INSIDE_ the llvm
target codegen passes, with no stability guaranteed. Not even "opt" passes
should be emitting them, so they should never end up in a bitcode file where
we'd need to provide backwards compatibility.

Maybe we can call them something like
"llvm.INTERNAL_CODEGEN.f90d461eee5d32a1.masked.atomicrmw.add.i32" (where the
random number in the middle is something that changes), and document that
nobody must use intrinsics in the INTERNAL_CODEGEN, other than llvm CodeGen.

> ## Alternative options

[A bunch of alternatives I don't like, so I'm omitting them]

> 5. Introduce target-independent intrinsics for masked atomic operations.
> This seems worthy of consideration.

I think it's definitely worthwhile looking to see if the set of intrinsics for
the atomicrmw operations (in particular, the set of additional arguments
computed in the pre-loop section, and the return value) are likely going to be
the same on the different architectures we have now. If so, I think it's
definitely worthwhile making a commonly-named intrinsic. However, even so, it
should remain for internal use only, and only implemented on targets which
tell AtomicExpandPass to generate it. I'd like it to be common only to enhance
code-reuse among the targets, not to provide a user-facing API.
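To make the "additional arguments computed in the pre-loop section" concrete,
here is a rough sketch of the IR a part-word atomicrmw could expand to on
RV32 under this proposal. The intrinsic name follows the RFC; its exact
signature, operand order, and the encoding of the ordering argument are
illustrative assumptions rather than a settled interface:

    ; Possible result of expanding: %old = atomicrmw add i8* %ptr, i8 %val seq_cst
    define i8 @masked_amo_add_i8_sketch(i8* %ptr, i8 %val) {
      %addr = ptrtoint i8* %ptr to i32
      %alignedaddr = and i32 %addr, -4              ; word-align the address
      %wordptr = inttoptr i32 %alignedaddr to i32*
      %byteoff = and i32 %addr, 3
      %shamt = shl i32 %byteoff, 3                  ; shift amount in bits
      %mask = shl i32 255, %shamt                   ; selects the addressed byte
      %valext = zext i8 %val to i32
      %valshifted = shl i32 %valext, %shamt
      ; Only this call becomes the post-RA LL/SC loop pseudo; the final
      ; argument stands in for the memory ordering.
      %oldword = call i32 @llvm.riscv.masked.atomicrmw.add.i32(i32* %wordptr, i32 %valshifted, i32 %mask, i32 7)
      %oldshifted = lshr i32 %oldword, %shamt
      %old = trunc i32 %oldshifted to i8
      ret i8 %old
    }

    declare i32 @llvm.riscv.masked.atomicrmw.add.i32(i32*, i32, i32, i32)

The address arithmetic, masking, and shifting stay as ordinary IR visible to
the optimisers; only the intrinsic call is opaque until the post-RA expansion,
and those pre-loop arguments (aligned pointer, shifted value, mask, ordering)
are the candidates for a commonly-named intrinsic.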
James Y Knight via llvm-dev
2018-Jun-15 23:20 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On Thu, Jun 14, 2018 at 5:28 AM Tim Northover <t.p.northover at gmail.com> wrote:
>> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
>> expansion in IR and to have that code deleted from AtomicExpandPass. Are
>> there any objections?
>
> I think it would be a great shame and I'd like to avoid it if at all
> possible, though I'm also uncomfortable with the current situation.
>
> The problem with late expansion is that even the simplest variants add
> some rather unpleasant pseudo-instructions, and I've never even seen
> anyone attempt to get good CodeGen out of it (simplest example being
> "add xD, xD, #1" in the loop for an increment but there are obviously
> more). Doing so would almost certainly involve duplicating a lot of
> basic arithmetic instructions into AtomicRMW variants.

I think it would be potentially useful to also add pseudos for the atomicrmw
operations with an immediate operand of the range allowed by the
corresponding arithmetic operation. That doesn't seem like it'd be difficult.

But I rather doubt whether any further refinement beyond that would be useful
in practice. For example, sure -- ARM has shifted-add-register operations.
And an "add r2, r2, r1, lsl #2" could potentially be (but isn't currently!)
generated inside the ll/sc loop for something like:

  %shl = shl i32 %incr, 2
  %0 = atomicrmw add i32* %i, i32 %shl seq_cst

...but the alternative is simply to generate an lsl instruction before the
loop. I'm quite skeptical whether the ability to omit the 'lsl' would really
be worth caring about. Plus we aren't even doing it now. :)

> Added to that is the fact that actual CPU implementations are often a
> lot less restrictive about what can go into a loop than is required
> (for example even spilling is fine on ours as long as the atomic
> object is not in the same cache-line; I suspect that's normal). That
> casts the issue in a distinctly theoretical light -- we've been doing
> this for years and as far as I'm aware nobody has ever hit the issue
> in the real world, or even had CodeGen go wrong in a way that *could*
> do so outside the -O0 situation.

So, there are two separate issues --

1. Deterministic failure of a loop even if it's the only thing running. For
example, from register spilling to the same cacheline within the loop. As you
note, that has been observed, but only at -O0.

2. Livelock between two such loops on different CPUs. AFAIK this has not been
reported, but I expect it would be very difficult to observe, because one CPU
will likely receive an OS interrupt eventually, and thus break out of the
livelock situation. This would likely exhibit simply as worse performance than
expected, which I expect would be very difficult to track down and report.

I'd suspect that generally, most of the violations of the architectural
requirements would only cause #2, not #1. It seems entirely possible to me
that this change may well be good both for correctness and
performance-under-load.

> OTOH that *is* an argument for performance over correctness when you
> get right down to it, so I'm not sure I can be too forceful about it.
> At least not without a counter-proposal to restore guaranteed
> correctness.

I'd rather look at it from the other direction -- does using a late-lowered
atomicrmw pseudo _actually_ cause any observable performance degradation? I
rather suspect it won't.

Now -- cmpxchg, as mentioned before, is trickier, and I could see that the
inability to jump to a separate failure block may make a noticeable difference
at the moment.
James Y Knight via llvm-dev
2018-Jun-15 23:44 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On Fri, Jun 15, 2018 at 11:21 AM Krzysztof Parzyszek
<kparzysz at codeaurora.org> wrote:
> Hexagon has very few restrictions on what will cause loss of a
> reservation: those are stores to the same address (a 64-bit granule) or
> any sort of exception/interrupt/system call. Other than that the
> reservation should stay. The architecture doesn't explicitly guarantee
> that, but in the absence of the elements listed above, a program with
> LL/SC can be expected to make progress.
>
> Consequently, the best way for us to handle LL/SC would be to expand
> them early and let them be optimized as any other code. The usual
> optimization restrictions should be sufficient to prevent introduction
> of factors causing a loss of reservation.
>
> With the constraints on LL/SC varying wildly between architectures,
> maybe we should have several options available for different targets?

The constraints on LL/SC don't seem to vary that wildly between architectures
(except Hexagon, I guess!) -- sure, the exact details are different, but they
all (the others) have some set of requirements on the structure of their loop
which are not generally possible to guarantee in a high-level abstract
representation. Given that, the details of the requirements aren't
particularly relevant, they just need to be properly adhered to when writing
the target-specific code to generate the machine instructions for the ll/sc
loop.

I'm a bit skeptical that Hexagon is truly different in this regard from all
the other architectures. But, that said, I don't think it'd be at all
problematic to keep the IR-level lowering code around for Hexagon to continue
to use, even if we do migrate everything else away. We definitely need to
introduce new options alongside the current options to start with, anyways --
at the very least simply to allow migrating each architecture separately.
Alex Bradbury via llvm-dev
2018-Jun-17 11:59 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 15 June 2018 at 16:21, Krzysztof Parzyszek via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Hexagon has very few restrictions on what will cause loss of a
> reservation: those are stores to the same address (a 64-bit granule) or any
> sort of exception/interrupt/system call. Other than that the reservation
> should stay. The architecture doesn't explicitly guarantee that, but in the
> absence of the elements listed above, a program with LL/SC can be expected
> to make progress.
>
> Consequently, the best way for us to handle LL/SC would be to expand them
> early and let them be optimized as any other code. The usual optimization
> restrictions should be sufficient to prevent introduction of factors causing
> a loss of reservation.

Thanks Krzysztof, it sounds like Hexagon gives so much freedom for LL/SC loops
that there really isn't a need to be concerned about expansion in IR. As such,
it would clearly make sense to retain that codepath over the long term, even
if architectures are discouraged from using it unless they're _really_ sure
it's safe for their architecture.

> With the constraints on LL/SC varying wildly between architectures, maybe we
> should have several options available for different targets?

Hexagon is the first architecture I've encountered where there are essentially
no restrictions, so that certainly changes things. I was mainly trying to
determine consensus for the ultimate goal. If everyone agreed that late
expansion is the path forwards, we could plan to remove and deprecate the
current early expansion codepath at some point.

Best,

Alex
Alex Bradbury via llvm-dev
2018-Jun-17 12:21 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 15 June 2018 at 23:03, James Y Knight via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> However, in order to not block this on every architecture maintainer being
> persuaded, it's a good idea to introduce the new functionality into
> AtomicExpandPass, and switch architectures over to it as the arch
> maintainers are convinced it's a good idea.

Indeed, even if everyone agreed this was a good idea I wasn't expecting to do
this all at once.

>> The above can be achieved by extending AtomicExpandPass to support a
>> 'Custom' expansion method, which uses a TargetLowering call to expand to
>> custom IR, including an appropriate intrinsic representing the LL+SC loop.
>
> I think it'd be better to sink more of the functionality into
> AtomicExpandPass itself (rather than adding a "Custom" hook). However, that
> ties into whether to introduce a common intrinsic that can be used across
> architectures...

Yes, I'd like to do more in AtomicExpandPass. Adding the 'Custom' hack was the
easiest way of prototyping this, and this thread will hopefully give good
guidance on the level of interest in using this in a target-independent way.

>> * Word-sized compare-and-exchange: Lower to a pseudo-instruction using the
>> normal instruction selection mechanism. This pseudo-instruction will be
>> expanded after register allocation.
>
> On RISCV, implementing the whole thing in the pseudo is probably right,
> since you only really have the inner loop.
>
> But for other archs like ARMv7, I think it'll probably make sense to
> continue to handle a bunch of the cmpxchg expansion in IR. There, the
> current cmpxchg expansion can be quite complex, but only the loop really
> needs to be a primitive (we'd need two loop variants, either "strex, ldrex,
> loop" or "ldrex, strex, loop", depending on whether it generates an initial
> ldrex+barrier first). All the rest -- initial ldrex+barrier, clrex,
> barriers -- can all remain IR-level expanded.

Good point. As you say, the RISC-V expansion is much more straight-forward.
Although the reservation could be cleared eagerly after compare failure by an
SC to a dummy memory location, I don't currently intend to do so:

1) GCC also doesn't intend to use such an expansion.
2) No existing microarchitectural implementations have been shown to benefit
from this manual eager reservation clearing.
3) Sticking to the simplest expansion is a good starting point, and future
microarchitects are most likely to optimise for code that is out in the wild.

> I'll note that in all cases, both for RISCV and ARM and others, we _really_
> would like to be able to have the compare-failure jump to a different
> address than success. That isn't possible for an intrinsic call at the
> moment, but I believe it will be possible to make that work soon, due to
> already ongoing work for "asm goto", which requires similar. Once we have
> that ability, I don't see any reason why the late-lowering cmpxchg pseudo
> should have any performance downside vs IR expansion, at least w.r.t. any
> correct optimizations.

I've seen periodically recurring discussions, but is someone actually actively
working on this?

>> The main aspect I'm unhappy with in this approach is the need to introduce
>> new intrinsics. Ideally these would be documented as not for use by
>> frontends and subject to future removal or alteration - is there precedent
>> for this? Alternatively, see the suggestion below to introduce
>> target-independent masked AMO intrinsics.
>
> I agree -- it'd be great to somehow name/annotate these intrinsics, such
> that it's clear that they're a private implementation detail _INSIDE_ the
> llvm target codegen passes, with no stability guaranteed. Not even "opt"
> passes should be emitting them, so they should never end up in a bitcode
> file where we'd need to provide backwards compatibility.
>
> Maybe we can call them something like
> "llvm.INTERNAL_CODEGEN.f90d461eee5d32a1.masked.atomicrmw.add.i32" (where the
> random number in the middle is something that changes), and document that
> nobody must use intrinsics in the INTERNAL_CODEGEN, other than llvm CodeGen.

llvm.internal_use_only.masked.atomicrmw.add.i32 would get the point across I
think. Is it not possible someone would generate a .bc after AtomicExpandPass
has run? Of course, even now there's no guarantee such a file would work on a
future version of LLVM, e.g. the atomics lowering strategy could change from
one release to the next.

>> ## Alternative options
>
> [A bunch of alternatives I don't like, so I'm omitting them]
>
>> 5. Introduce target-independent intrinsics for masked atomic operations.
>> This seems worthy of consideration.
>
> I think it's definitely worthwhile looking to see if the set of intrinsics
> for the atomicrmw operations (in particular, the set of additional arguments
> computed in the pre-loop section, and the return value) are likely going to
> be the same on the different architectures we have now. If so, I think it's
> definitely worthwhile making a commonly-named intrinsic. However, even so,
> it should remain for internal use only, and only implemented on targets
> which tell AtomicExpandPass to generate it. I'd like it to be common only to
> enhance code-reuse among the targets, not to provide a user-facing API.

Yes, giving the impression of providing a new user-facing API is my concern.
Particularly as we might want to define a comprehensive set of intrinsics but
have targets support only the subset they require for comprehensive atomics
support (as some instructions may go through the usual SDISel route without
transformation to intrinsics). A common set of intrinsics is probably ok, but
there are cases where you might want a different interface. For instance,
clearing a reservation requires an SC to a dummy address on RISC-V. Although
we have no intention of doing this in the RISC-V backend currently, targets
that wanted to implement such an approach might want to have that dummy
address as an argument to the intrinsic.

Best,

Alex
Alex Bradbury via llvm-dev
2018-Sep-21 10:06 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 13 June 2018 at 16:42, Alex Bradbury <asb at lowrisc.org> wrote:> ## Patches up for review > > I have patches up for review which implement the described strategy. More > could be done to increase the potential for code reuse across targets, but I > thought it would be worth getting feedback on the path forwards first. > > * D47587: [RISCV] Codegen support for atomic operations on RV32I. > <https://reviews.llvm.org/D47587>. Simply adds support for lowering fences and > uses AtomicExpandPass to generate libatomic calls otherwise. Committed in > rL334590. > * D47589: [RISCV] Add codegen support for atomic load/stores with RV32A. > <https://reviews.llvm.org/D47589>. Use AtomicExpandPass to insert fences for > atomic load/store. Committed in rL334591. > * D47882: [RISCV] Codegen for i8, i16, and i32 atomicrmw with RV32A. > <https://reviews.llvm.org/D47882>. Implements the lowering strategy described > above for atomicrmw and adds a hook to allow custom atomicrmw expansion in IR. > Under review. > * D48129: [RISCV] Improved lowering for bit-wise atomicrmw {i8, i16} on RV32A. > <https://reviews.llvm.org/D48129>. Uses 32-bit AMO{AND,OR,XOR} with > appropriately manipulated operands to implement 8/16-bit AMOs. Under review. > * D48130: [AtomicExpandPass]: Add a hook for custom cmpxchg expansion in IR. > <https://reviews.llvm.org/D48130> Separated patch as this modifies the > existing shouldExpandAtomicCmpXchgInIR interface. Under review. > * D48141: [RISCV] Implement codegen for cmpxchg on RV32I. > <https://reviews.llvm.org/D48131> Implements the lowering strategy described > above. Under review.To update on this, everything has now landed other than D48141 which implements cmpxchg for RISC-V and introduces target-independent code to help lowering part-word cmpxchg. In response to the comments on this thread and review comments I revised the patches so that as much lowering as possible is done in AtomicExpandPass, thus maximising the potential for code reuse across different targets. I've additionally posted https://reviews.llvm.org/D52234 which updates the AtomicExpandPass docs. If there is consensus, I think both James and I would prefer to more strongly encourage that targets avoid expanding LL/SC loops in IR on the grounds that LLVM has no way to guarantee that the generated code complies with the architecture's requirements for forward progress. I believe it would be worthwhile for the Mips backend to evaluate using AtomicExpandPass in the same way the RISC-V backend now does. It wouldn't provide a correctness benefit over the current logic, but should help reduce the amount of target-specific code for handling atomics. Tim: did you get a chance at looking at a verifier pass? James suggested what the eventual solution for "asm goto" may well help in cleaning up the suboptimal control-flow around a cmpxchg lowered using the approach described in this RFC. Does anybody know of any work in this area? Best, Alex