So, taking PR25526 as context, and after reading the internet, it seems to me that creating an atomic LL/SC loop at the IR level -- as is the modern way to do it in LLVM -- is effectively impossible to do correctly. Nor, for that matter, was it correct to create such a loop at isel time, as the implementation prior to r205525 did (and as some targets still do, not yet having been updated to use IR expansion). In summary (see below for more details): to guarantee forward progress in an LL/SC loop, you typically must use only a subset of the instruction set (excluding loads, stores, floating point, and branches), the entire loop must fit within some amount of contiguous memory, and it must execute only a small number of instructions. I see no way those properties can be guaranteed in LLVM without emitting the entire loop as a fixed blob (a sketch of what such a blob looks like on ARM follows the list below).

For ARM, there's the newly-added ARMExpandPseudo::ExpandCMP_SWAP code, which basically does that. But it seems kinda ugly, and a bit horrible to have to write all that code to generate a couple of instructions -- and that's only handling cmpxchg, not atomicrmw, which would need similar treatment. And even with all that, it's still creating multiple basic blocks prior to basic-block placement, so I think there's no guarantee that the blocks get placed contiguously. (It kinda seems like it'd be easier at this point to just expand to inline asm...) Plus, that's used only at -O0 at the moment, so it generates worse code (which allowed the implementation to be simpler): it can only return a success/fail value instead of branching directly to success/fail successors, and it doesn't handle optimizations like skipping the barrier on early cmpxchg failure. Probably more such issues, too.

Anyone got any other ideas for how to do this, without needing to introduce a bunch of per-target code? AtomicExpandPass was nice in that the target only needed to provide a couple of intrinsics, and it generated the rest. It's always a pain when the abstractions are leaky...

The typical constraints on an LL/SC loop -- details vary between architectures -- are that between the load-linked and store-conditional instructions you must:

1. Not store to the region of memory ("granule") containing the address you're trying to atomically modify. The size of the granule might be only the memory addresses being operated on, or the containing cache line (most common), or maybe even all of memory, or anything in between. Any such store clears the load-linked reservation -- sometimes even if it comes from the same CPU.

2. Not cause data or instruction cache fill/eviction activity. E.g. with a direct-mapped cache, any load, or a branch to a sufficiently far-away instruction, might cause a guaranteed cache-line conflict -- and that might kill the reservation on every loop iteration. With a more typical n-way cache you'd have more leeway, of course.

3. Not cause any other architectural exceptions/traps that clear the reservation state. That typically excludes floating point, memory barriers, etc.

4. Not take any branches. (This seems the rarest of the constraints; it's possible it's only relevant to Alpha and RISC-V, although maybe others just forgot to mention it as being potentially problematic.)

5. Execute only a small number of instructions within the loop.
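For concreteness, here is roughly the shape such a loop has to take on ARMv7 -- a minimal hand-written sketch in GNU-as syntax with an illustrative register assignment, not the code LLVM actually emits -- with the region that has to stay "clean" called out:

    @ Sketch of a 32-bit compare-and-swap emitted as one fixed blob. Everything
    @ between the LDREX and the STREX must avoid other loads/stores (including
    @ register spills), far branches, and anything that can trap.
    @ Assumed convention: r0 = address, r1 = expected value, r2 = new value;
    @ the observed old value comes back in r0.
    cas_word:
        dmb     ish                 @ leading barrier for a seq_cst cmpxchg
    1:  ldrex   r3, [r0]            @ load-linked: take the reservation
        cmp     r3, r1              @ current value == expected?
        bne     2f                  @ no: give up without storing
        strex   r12, r2, [r0]       @ store-conditional: succeeds only if the
                                    @ reservation is still held
        cmp     r12, #0
        bne     1b                  @ reservation lost: retry
    2:  dmb     ish                 @ trailing barrier (the "skip the barrier on
                                    @ early failure" optimization would branch
                                    @ around this)
        mov     r0, r3
        bx      lr

A single compiler-inserted spill between the LDREX and the STREX (say, "str r3, [sp, #4]") is an ordinary store -- exactly what constraint 1 above forbids.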
That last restriction (executing only a small number of instructions) seems most odd as a hard constraint, as opposed to just a performance win. It is apparently because a common implementation <http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka8404.html> is for the LL instruction to pull the cache-line local and invalidate it from all remote caches (as if the LL were actually a write). The subsequent SC will then succeed only if the cacheline has not been invalidated since the LL. With a naive implementation, LL operations on two CPUs executing the same loop might just ping-pong the cacheline back and forth between them forever, without either CPU being able to successfully execute the SC -- by the time each got to it, the cacheline would already have been invalidated by the other CPU. Delaying the remote invalidation request for a little while in this situation is sufficient to prevent that problem. (But how long is a little while?)

If you fail to adhere to the requirements, there are two kinds of issues it might cause. Firstly, it could cause the loop to deterministically never terminate, even when there's nobody else contending on the atomic. E.g. ll <stackaddr>; spill to stack; sc <stackaddr>. On some CPUs, the spill to a nearby stack address will clear the reservation. This issue is what motivated the recent ARM changes for -O0. (A concrete sketch of this shape appears at the end of this message.) Or, apparently, on some Alpha CPUs taking a branch deterministically kills the reservation.

More insidiously, with that same example, you can introduce the possibility of live-lock between the LL/SC loops of two processors. A store from another processor necessarily clears the reservation of other CPUs (unlike an LL, which does so in some implementations but not all). So the problem of indefinite cacheline ping-pong can be introduced by the extra spill in the loop -- even when it appears that it doesn't cause a problem on your processor.

Here's a relevant set of emails <http://yarchive.net/comp/linux/cmpxchg_ll_sc_portability.html> from Linus, also discussing why it's not possible to reliably expose LL/SC intrinsics to C code -- which are the same reasons it's not possible to reliably expose them to LLVM IR.

The MIPS IV manual is nicely specific in its documentation of what you can depend upon: "An LL on one processor must not take action that, by itself, would cause an SC for the same block on another processor to fail." (This avoids the potential for live-lock between two LL/SC loops where one CPU's LL invalidates the other CPU's reservation before the other can get to the SC operation.) Forward progress can then be guaranteed, unless "A load, store, or prefetch is executed on the processor executing the LL/SC." or "The instructions executed starting with the LL and ending with the SC do not lie in a 2048-byte contiguous region of virtual memory." Also: "Exceptions between the LL and SC cause SC to fail, so persistent exceptions must be avoided. Some examples of these are arithmetic operations that trap, system calls, floating-point operations that trap or require software emulation assistance."

Similarly, the RISC-V manual -- which I mention only because it is so clear and concise about its architectural requirements -- says: "We mandate that LR/SC sequences of bounded length (16 consecutive static instructions) will eventually succeed, provided they contain only base ISA instructions other than loads, stores, and taken branches." (Their "base ISA" excludes floating-point and multiply/divide instructions.)

Unfortunately, neither ARM nor PPC appear to precisely document the architectural constraints under which forward progress must be guaranteed by the implementation.
They certainly have the same underlying implementation issues that give rise to the above rules -- that much seems documented -- they just don't appear to make explicit guarantees on how you can guarantee success. ARM does "recommend" that LL/SC loops fit within 128 bytes, though.
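To make the deterministic-failure shape mentioned above concrete, here's a rough sketch (again GNU-as ARM syntax, with illustrative stack offsets -- not the output of any particular compiler) of the kind of code the -O0 problem produces: the atomic variable lives on the stack, and a register-allocator spill lands inside the loop in the same exclusives granule, so the loop can fail forever even with no other CPU touching the location:

    @ The atomic word sits at [sp, #12]; the allocator has spilled a temporary
    @ to [sp, #8] inside the loop. If both addresses fall within the same
    @ reservation granule, the ordinary STR clears the reservation taken by the
    @ LDREX on some cores, and the STREX then fails on every iteration.
        add     r0, sp, #12         @ address of the stack-allocated atomic
    1:  ldrex   r1, [r0]            @ take the reservation
        add     r2, r1, #1          @ compute the new value
        str     r1, [sp, #8]        @ the spill: an ordinary store into the granule
        strex   r3, r2, [r0]        @ fails: the reservation is already gone
        cmp     r3, #0
        bne     1b                  @ ...so retry, lose it again, forever

And even on cores where a same-CPU store leaves the local monitor alone, that extra store is exactly the kind of thing that can reintroduce the cross-CPU ping-pong live-lock described above.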
Hi James,

Thanks for the very detailed summary, there's some really interesting details in there that I wasn't aware of.

On 10 May 2016 at 12:22, James Knight <jyknight at google.com> wrote:
> Anyone got any other ideas for how to do this, without needing to introduce
> a bunch of per-target code?

My only vaguely plausible idea to actually fix it so far was some kind of do-not-touch region for the register allocator to avoid spills (and, in light of your explanations, other things). Maybe some kind of MBB level no-spill/volatile flag would even be enough.

IME most of the optimization benefits come at ISel time (choosing better branch sequences, INC vs ADD etc), so I'm not *too* worried about hobbling later machine passes.

For completeness, I've also been tempted to simply check for spills and report_fatal_error given how often it's actually come up in the last few years. That's really unfriendly though, so probably not a good idea.

> That last restriction seems most odd as a hard constraint, as opposed to
> just a performance win. It is apparently because a common implementation is
> for the LL instruction to pull the cache-line local, and invalidate it from
> all remote caches (as if the LL were actually a write). The subsequent SC
> will then succeed only if the cacheline has not been invalidated since the
> LL. With a naive implementation, LL operations on two CPUs executing the
> same loop might just ping-pong the cacheline back and forth between them
> forever, without either CPU being able to successfully execute the SC -- by
> the time each got to it, the cacheline will have already been invalidated by
> the other CPU. Delaying the remote invalidation request for a little while
> in this situation is sufficient to prevent that problem. (But how long is a
> little while?)

I have a suspicion this kind of unpredictable starvation (or something like it) is why ARM added such CISCy instructions to handle things in v8.1.

Cheers.

Tim.
> On May 10, 2016, at 1:02 PM, Tim Northover via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi James,
>
> Thanks for the very detailed summary, there's some really interesting
> details in there that I wasn't aware of.
>
> On 10 May 2016 at 12:22, James Knight <jyknight at google.com> wrote:
>> Anyone got any other ideas for how to do this, without needing to introduce
>> a bunch of per-target code?
>
> My only vaguely plausible idea to actually fix it so far was some kind
> of do-not-touch region for the register allocator to avoid spills
> (and, in light of your explanations, other things). Maybe some kind of
> MBB level no-spill/volatile flag would even be enough.

One possibility would be to present the sequence as an instruction bundle. That would avoid any instructions/spills/reloads getting moved into the middle of it. On the other hand, that doesn't sound all that different from just expanding a pseudo instruction...

- Matthias
Thanks for the writeup, that is indeed pretty ugly. Simple asm(:::"memory") isn't sufficient either, since the regalloc can decide to spill :-(

On Tue, May 10, 2016 at 12:22 PM, James Knight via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Unfortunately, neither ARM nor PPC appear to precisely document the
> architectural constraints under which forward progress must be guaranteed
> by the implementation. They certainly have the same underlying
> implementation issues that give rise to the above rules -- that much seems
> documented -- they just don't appear to make explicit guarantees on how you
> can guarantee success. ARM does "recommend" that LL/SC loops fit within
> 128 bytes, though.

For ARMv7, from the ARM ARM:

    A Load-Exclusive instruction tags a small block of memory for exclusive
    access. The size of the tagged block is IMPLEMENTATION DEFINED, see Tagging
    and the size of the tagged memory block on page A3-121. A Store-Exclusive
    instruction to the same address clears the tag.

And:

    The value of a in this assignment is IMPLEMENTATION DEFINED, between a
    minimum value of 3 and a maximum value of 11. For example, in an
    implementation where a is 4, a successful LDREX of address 0x000341B4 gives
    a tag value of bits[31:4] of the address, giving 0x000341B. This means that
    the four words of memory from 0x000341B0 to 0x000341BF are tagged for
    exclusive access.

    The size of the tagged memory block is called the Exclusives Reservation
    Granule. The Exclusives Reservation Granule is IMPLEMENTATION DEFINED in
    the range 2-512 words:
    • 2 words in an implementation where a is 3
    • 512 words in an implementation where a is 11
Replied too early... Below:

On Tue, May 10, 2016 at 2:04 PM, JF Bastien <jfb at google.com> wrote:
> Thanks for the writeup, that is indeed pretty ugly. Simple
> asm(:::"memory") isn't sufficient either, since the regalloc can decide to
> spill :-(
>
> On Tue, May 10, 2016 at 12:22 PM, James Knight via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Unfortunately, neither ARM nor PPC appear to precisely document the
>> architectural constraints under which forward progress must be guaranteed
>> by the implementation. They certainly have the same underlying
>> implementation issues that give rise to the above rules -- that much seems
>> documented -- they just don't appear to make explicit guarantees on how you
>> can guarantee success. ARM does "recommend" that LL/SC loops fit within
>> 128 bytes, though.
>
> For ARMv7, from the ARM ARM:
>
>     A Load-Exclusive instruction tags a small block of memory for exclusive
>     access. The size of the tagged block is IMPLEMENTATION DEFINED, see
>     Tagging and the size of the tagged memory block on page A3-121. A
>     Store-Exclusive instruction to the same address clears the tag.
>
> And:
>
>     The value of a in this assignment is IMPLEMENTATION DEFINED, between a
>     minimum value of 3 and a maximum value of 11. For example, in an
>     implementation where a is 4, a successful LDREX of address 0x000341B4
>     gives a tag value of bits[31:4] of the address, giving 0x000341B. This
>     means that the four words of memory from 0x000341B0 to 0x000341BF are
>     tagged for exclusive access.
>
>     The size of the tagged memory block is called the Exclusives Reservation
>     Granule. The Exclusives Reservation Granule is IMPLEMENTATION DEFINED in
>     the range 2-512 words:
>     • 2 words in an implementation where a is 3
>     • 512 words in an implementation where a is 11

There's a bit more info here:

    When a processor writes using any instruction other than a Store-Exclusive:
    • if the write is to a physical address that is not covered by its local
      monitor the write does not affect the state of the local monitor
    • if the write is to a physical address that is covered by its local
      monitor it is IMPLEMENTATION DEFINED whether the write affects the state
      of the local monitor.

    If the local monitor is in the Exclusive Access state and the processor
    performs a Store-Exclusive to any address other than the last one from
    which it performed a Load-Exclusive, it is IMPLEMENTATION DEFINED whether
    the store updates memory, but in all cases the local monitor is reset to
    the Open Access state. This mechanism:
    • is used on a context switch, see Context switch support on page A3-122
    • must be treated as a software programming error in all other cases

And around similar parts of the manual. You can search the web for these; they're all "superseded" versions of the docs and I can't find the canonical one!
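As an aside on the "is used on a context switch" bullet above: the way system code exploits that behaviour is to deliberately knock the local monitor back to Open Access when switching threads, so a reservation taken by the outgoing thread can't let an unrelated STREX in the incoming thread succeed. A sketch of the idea (illustrative only, not any particular kernel's code):

    @ Context-switch path: make sure no exclusive reservation survives into the
    @ next thread.
        clrex                       @ ARMv6K and later: clear the local monitor
    @ Cores without CLREX get the same effect from a dummy STREX to a per-CPU
    @ scratch word, relying on the "Store-Exclusive to a different address
    @ resets the monitor to Open Access" rule quoted above:
    @     strex   r0, r1, [r2]      @ r2 -> scratch word; status result ignored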