Alex Bradbury via llvm-dev
2018-Jun-14 11:32 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 14 June 2018 at 10:28, Tim Northover via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
>> expansion in IR and to have that code deleted from AtomicExpandPass. Are
>> there any objections?
>
> I think it would be a great shame and I'd like to avoid it if at all
> possible, though I'm also uncomfortable with the current situation.

Thanks for the reply, Tim. It's definitely fair to highlight that ARM and
AArch64 have few documented restrictions on code within an LL/SC loop when
compared to e.g. RISC-V or MIPS. I'm not sure about Hexagon.

> The problem with late expansion is that even the simplest variants add
> some rather unpleasant pseudo-instructions, and I've never even seen
> anyone attempt to get good CodeGen out of it (simplest example being
> "add xD, xD, #1" in the loop for an increment but there are obviously
> more). Doing so would almost certainly involve duplicating a lot of
> basic arithmetic instructions into AtomicRMW variants.

Let's separate the concerns here:

1) Quality of codegen within the LL/SC loop
   * What is the specific concern here? The LL/SC loop contains a very
     small number of instructions, even for the masked atomicrmw case.
     Are you worried about an extra arithmetic instruction or two?
     Sub-optimal control flow? Something else?

2) Number of new pseudo-instructions which must be introduced
   * You'd need new pseudos for each word-sized atomicrmw which expands
     to an LL/SC loop, and an additional one for the masked form of the
     operation. You could reduce the number of pseudos by taking the
     AtomicRMWInst::BinOp as a parameter. The code to map the atomic op
     to the appropriate instruction is tedious but straightforward
     (sketch below).

> Added to that is the fact that actual CPU implementations are often a
> lot less restrictive about what can go into a loop than is required
> (for example even spilling is fine on ours as long as the atomic
> object is not in the same cache-line; I suspect that's normal). That
> casts the issue in a distinctly theoretical light -- we've been doing
> this for years and as far as I'm aware nobody has ever hit the issue
> in the real world, or even had CodeGen go wrong in a way that *could*
> do so outside the -O0 situation.

That's true as long as the "Exclusives Reservation Granule" == 1 cache
line and you don't deterministically cause the reservation to be cleared
in some other way: e.g. by repeatedly generating a conflict miss or
triggering a trap. I don't think any Arm cores ship with direct-mapped
caches, so I'll admit this is unlikely.

The possibility for issues increases if the Exclusives Reservation Granule
is larger. For the Cortex-M4, the ERG is the entire address range
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100166_0001_00_en/ric1417175928887.html>.
In that case, spills will surely clear the reservation.

There also seem to be documented and very strict forward-progress
constraints for ARMv8-M. See
https://static.docs.arm.com/ddi0553/ah/DDI0553A_h_armv8m_arm.pdf p207:

"""
Forward progress can only be made using LoadExcl/StoreExcl loops if, for
any LoadExcl/StoreExcl loop within a single thread of execution, both of
the following are true:
• There are no explicit memory accesses, pre-loads, direct or indirect
  register writes, cache maintenance instructions, SVC instructions, or
  exception returns between the Load-Exclusive and the Store-Exclusive.
• The following conditions apply between the Store-Exclusive having
  returned a fail result and the retry of the Load-Exclusive:
  – There are no stores to any location within the same Exclusives
    reservation granule that the Store-Exclusive is accessing.
  – There are no direct or indirect register writes, other than changes
    to the flag fields in APSR or FPSCR, caused by data processing or
    comparison instructions.
  – There are no direct or indirect cache maintenance instructions, SVC
    instructions, or exception returns.
"""

Of course, it also states that the upper limit for the Exclusives
Reservation Granule is 2048 bytes, but the Cortex-M33 has an ERG of the
entire address range
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100230_0002_00_en/jfa1443092906126.html>
so something doesn't quite add up...

> OTOH that *is* an argument for performance over correctness when you
> get right down to it, so I'm not sure I can be too forceful about it.
> At least not without a counter-proposal to restore guaranteed
> correctness.

I suppose a machine-level pass could at least scan for any intervening
loads/stores in an LL/SC loop and check some other invariants, then
warn/fail if they occur. As you point out, this would be conservative for
many Arm implementations.

Best,

Alex
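To illustrate point (2) above, the sketch below shows roughly how a single
loop pseudo carrying the AtomicRMWInst::BinOp could emit the arithmetic
step between the load-reserved and the store-conditional during post-ISel
expansion. This is not the in-tree implementation: the helper name and
operand layout are invented for the example, and the RISC-V opcodes are
just one plausible target's.

// Illustrative sketch only: not in-tree code. It assumes one generic
// LL/SC-loop pseudo whose operands carry the AtomicRMWInst::BinOp, and
// shows how the "modify" step of the loop could be emitted during
// post-ISel expansion. Helper name and operand layout are invented.
#include "RISCVInstrInfo.h" // backend-local header; include set is illustrative
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/ErrorHandling.h"

using namespace llvm;

// DestReg holds the value returned by the load-reserved, IncrReg is the
// atomicrmw operand, and ScratchReg receives the value fed to the
// store-conditional.
static void emitAtomicModifyStep(AtomicRMWInst::BinOp Op,
                                 MachineBasicBlock &LoopMBB,
                                 MachineBasicBlock::iterator MBBI,
                                 const TargetInstrInfo &TII,
                                 const DebugLoc &DL, unsigned ScratchReg,
                                 unsigned DestReg, unsigned IncrReg) {
  switch (Op) {
  case AtomicRMWInst::Xchg:
    // The stored value is just the operand: a register-register move.
    BuildMI(LoopMBB, MBBI, DL, TII.get(RISCV::ADDI), ScratchReg)
        .addReg(IncrReg)
        .addImm(0);
    break;
  case AtomicRMWInst::Add:
    BuildMI(LoopMBB, MBBI, DL, TII.get(RISCV::ADD), ScratchReg)
        .addReg(DestReg)
        .addReg(IncrReg);
    break;
  case AtomicRMWInst::Sub:
    BuildMI(LoopMBB, MBBI, DL, TII.get(RISCV::SUB), ScratchReg)
        .addReg(DestReg)
        .addReg(IncrReg);
    break;
  case AtomicRMWInst::And:
    BuildMI(LoopMBB, MBBI, DL, TII.get(RISCV::AND), ScratchReg)
        .addReg(DestReg)
        .addReg(IncrReg);
    break;
  // Or and Xor are analogous; Nand, Max, Min, UMax and UMin each need one
  // extra instruction or a short branch sequence, omitted here.
  default:
    llvm_unreachable("BinOp not handled in this sketch");
  }
}

Whether to use one parameterised pseudo or one pseudo per operation is
mostly a trade-off in backend boilerplate; the loop that is ultimately
emitted is the same either way.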
Alex Bradbury via llvm-dev
2018-Jun-14 12:44 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 14 June 2018 at 12:32, Alex Bradbury <asb at lowrisc.org> wrote:
> On 14 June 2018 at 10:28, Tim Northover via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> The problem with late expansion is that even the simplest variants add
>> some rather unpleasant pseudo-instructions, and I've never even seen
>> anyone attempt to get good CodeGen out of it (simplest example being
>> "add xD, xD, #1" in the loop for an increment but there are obviously
>> more). Doing so would almost certainly involve duplicating a lot of
>> basic arithmetic instructions into AtomicRMW variants.
>
> Let's separate the concerns here:
> 1) Quality of codegen within the LL/SC loop
> * What is the specific concern here? The LL/SC loop contains a very
> small number of instructions, even for the masked atomicrmw case. Are
> you worried about an extra arithmetic instruction or two? Sub-optimal
> control-flow? Something else?

Oh I see what you're saying, it's the fact that by bypassing instruction
selection we're missing cases where an ADDI could be selected rather than
an ADD, which would potentially free up a register and save the
instruction generated for materialising the constant.

I don't like to see the compiler generate code that's obviously dumber
than what a human would write, but in this case do we really think there
would be any sort of measurable impact on performance?

Best,

Alex
Tim Northover via llvm-dev
2018-Jun-15 11:28 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On Thu, 14 Jun 2018 at 13:45, Alex Bradbury <asb at lowrisc.org> wrote:
> Oh I see what you're saying, it's the fact that by bypassing
> instruction selection we're missing cases where an ADDI could be
> selected rather than an ADD, which would potentially free up a
> register and save the instruction generated for materialising the
> constant.

Yes.

> I don't like to see the compiler generate code that's
> obviously dumber than what a human would write, but in this case do we
> really think there would be any sort of measurable impact on
> performance?

It's certainly going to be marginal, but then so is the benefit of late
expansion. There's also the barrier that this genuinely is a place where
people are willing to hand-code assembly, and go rooting around in the
compiler's output to check we're doing what they expect even if they
don't use assembly. The assembly is possibly mostly for historical
reasons, but it's still out there and we want to convince people to move
away from it.

I think I'm going to try to implement that verifier pass you mentioned
and run it across an iOS build.

Cheers.

Tim.
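For concreteness, the kind of checking pass being discussed might look
roughly like the sketch below. The pass name and the isLoadExclusive /
isStoreExclusive predicates are hypothetical placeholders; a real
implementation would match the target's exclusive load/store opcodes
(ideally via a TargetInstrInfo hook) and would also enforce the other
forward-progress rules (no spills, calls, or branches out of the loop),
not just intervening memory accesses.

// Illustrative sketch only: pass name and predicates are hypothetical.
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {

// Placeholder predicates: they always return false here. A real checker
// would return true for the target's exclusive load/store opcodes (e.g.
// the LDREX/LDAEX and STREX/STLEX families on ARM, or LR.W/SC.W on
// RISC-V), ideally queried through a target hook.
static bool isLoadExclusive(const MachineInstr &) { return false; }
static bool isStoreExclusive(const MachineInstr &) { return false; }

struct LLSCLoopChecker : public MachineFunctionPass {
  static char ID;
  LLSCLoopChecker() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    for (MachineBasicBlock &MBB : MF) {
      // Assumes the expanded loop keeps the LL and the SC within a
      // single block, which is how the late pseudo expansion lays it out.
      bool InLLSCRegion = false;
      for (MachineInstr &MI : MBB) {
        if (isLoadExclusive(MI)) {
          InLLSCRegion = true;
          continue;
        }
        if (isStoreExclusive(MI)) {
          InLLSCRegion = false;
          continue;
        }
        // Any other memory access between the LL and the SC may clear
        // the reservation on conservative implementations.
        if (InLLSCRegion && (MI.mayLoad() || MI.mayStore()))
          errs() << "LL/SC checker: memory access inside exclusive "
                    "region:\n"
                 << MI;
      }
    }
    return false; // diagnostic only; the MachineFunction is not modified
  }
};

char LLSCLoopChecker::ID = 0;

} // end anonymous namespace

Run very late in the pipeline (after register allocation and frame
lowering), a pass along these lines could warn by default and be promoted
to an error for targets with strict requirements such as ARMv8-M.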