So, taking PR25526 as context, and after reading the internet, it seems to me that creating an atomic LL/SC loop at the IR level -- as is the modern way to do it in LLVM -- is effectively impossible to do correctly. Nor, for that matter, was it correct to create such a loop at isel time, as the implementation prior to r205525 did (and as some targets still do, not yet having been updated to use IR expansion). In summary (see below for more details): to guarantee forward progress in an LL/SC loop, you typically must use only a subset of the instruction set (excluding loads, stores, floating point, and branches), the entire loop must fit within some amount of contiguous memory, and it must execute only a small number of instructions. I see no way those properties can be guaranteed in LLVM without emitting the entire loop as a fixed blob (a sketch of what such a blob looks like on ARM follows the list below).

For ARM, there's the newly-added ARMExpandPseudo::ExpandCMP_SWAP code, which basically does that. But it seems kinda ugly, and a bit horrible to have to write all that code to generate a couple of instructions -- and that's only handling cmpxchg, not atomicrmw, which would need similar treatment. And even with all that, it's still creating multiple basic blocks prior to basic-block placement, so I think there's no guarantee that the blocks get placed contiguously. (It kinda seems like it'd be easier at this point to just expand to inline asm...) Plus, that's used only at -O0 at the moment, so it generates worse code (which allowed the implementation to be simpler): it can only return a success/fail value instead of branching directly to success/fail successors, and it doesn't handle optimizations like skipping the barrier on early cmpxchg failure. Probably more such issues, too.

Anyone got any other ideas for how to do this, without needing to introduce a bunch of per-target code? AtomicExpandPass was nice in that the target only needed to provide a couple of intrinsics, and it generated the rest. It's always a pain when the abstractions are leaky...

The typical constraints on an LL/SC loop -- details vary between architectures -- are that between the load-linked and store-conditional instructions you must:

1. Not store to the region of memory ("granule") containing the address you're trying to atomically modify. The size of the granule might be only the memory addresses being operated on, or the containing cache line (most common), or maybe even all of memory, or anything in between. Any such store clears the load-linked reservation -- sometimes even if it comes from the same CPU.

2. Not cause data or instruction cache fill/eviction activity. E.g. with a direct-mapped cache, any load, or a branch to a sufficiently far-away instruction, might cause a guaranteed cache-line conflict -- and that might kill the reservation on every loop iteration. With a more typical n-way cache you'd have more leeway, of course.

3. Not cause any other architectural exceptions/traps that clear the reservation state. That typically excludes floating point, memory barriers, etc.

4. Not take any branches. (This seems the rarest of the constraints; it's possible it's only relevant to Alpha and RISC-V, although maybe others just forgot to mention it as being potentially problematic.)

5. Execute only a small number of instructions within the loop.
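For concreteness, here is roughly the shape such a loop has to take on ARMv7 -- a minimal hand-written sketch in GNU-as syntax with an illustrative register assignment, not the code LLVM actually emits -- with the region that has to stay "clean" called out:

    @ Sketch of a 32-bit compare-and-swap emitted as one fixed blob. Everything
    @ between the LDREX and the STREX must avoid other loads/stores (including
    @ register spills), far branches, and anything that can trap.
    @ Assumed convention: r0 = address, r1 = expected value, r2 = new value;
    @ the observed old value comes back in r0.
    cas_word:
        dmb     ish                 @ leading barrier for a seq_cst cmpxchg
    1:  ldrex   r3, [r0]            @ load-linked: take the reservation
        cmp     r3, r1              @ current value == expected?
        bne     2f                  @ no: give up without storing
        strex   r12, r2, [r0]       @ store-conditional: succeeds only if the
                                    @ reservation is still held
        cmp     r12, #0
        bne     1b                  @ reservation lost: retry
    2:  dmb     ish                 @ trailing barrier (the "skip the barrier on
                                    @ early failure" optimization would branch
                                    @ around this)
        mov     r0, r3
        bx      lr

A single compiler-inserted spill between the LDREX and the STREX (say, "str r3, [sp, #4]") is an ordinary store -- exactly what constraint 1 above forbids.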
That last restriction (executing only a small number of instructions) seems most odd as a hard constraint, as opposed to just a performance win. It is apparently because a common implementation <http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka8404.html> is for the LL instruction to pull the cache-line local and invalidate it from all remote caches (as if the LL were actually a write). The subsequent SC will then succeed only if the cacheline has not been invalidated since the LL. With a naive implementation, LL operations on two CPUs executing the same loop might just ping-pong the cacheline back and forth between them forever, without either CPU being able to successfully execute the SC -- by the time each got to it, the cacheline would already have been invalidated by the other CPU. Delaying the remote invalidation request for a little while in this situation is sufficient to prevent that problem. (But how long is a little while?)

If you fail to adhere to the requirements, there are two kinds of issues it might cause. Firstly, it could cause the loop to deterministically never terminate, even when there's nobody else contending on the atomic. E.g. ll <stackaddr>; spill to stack; sc <stackaddr>. On some CPUs, the spill to a nearby stack address will clear the reservation. This issue is what motivated the recent ARM changes for -O0. (A concrete sketch of this shape appears at the end of this message.) Or, apparently, on some Alpha CPUs taking a branch deterministically kills the reservation.

More insidiously, with that same example, you can introduce the possibility of live-lock between the LL/SC loops of two processors. A store from another processor necessarily clears the reservation of other CPUs (unlike an LL, which does so in some implementations but not all). So the problem of indefinite cacheline ping-pong can be introduced by the extra spill in the loop -- even when it appears that it doesn't cause a problem on your processor.

Here's a relevant set of emails <http://yarchive.net/comp/linux/cmpxchg_ll_sc_portability.html> from Linus, also discussing why it's not possible to reliably expose LL/SC intrinsics to C code -- which are the same reasons it's not possible to reliably expose them to LLVM IR.

The MIPS IV manual is nicely specific in its documentation of what you can depend upon: "An LL on one processor must not take action that, by itself, would cause an SC for the same block on another processor to fail." (This avoids the potential for live-lock between two LL/SC loops where one CPU's LL invalidates the other CPU's reservation before the other can get to the SC operation.) Forward progress can then be guaranteed, unless "A load, store, or prefetch is executed on the processor executing the LL/SC." or "The instructions executed starting with the LL and ending with the SC do not lie in a 2048-byte contiguous region of virtual memory." Also: "Exceptions between the LL and SC cause SC to fail, so persistent exceptions must be avoided. Some examples of these are arithmetic operations that trap, system calls, floating-point operations that trap or require software emulation assistance."

Similarly, the RISC-V manual -- which I mention only because it is so clear and concise about its architectural requirements -- says: "We mandate that LR/SC sequences of bounded length (16 consecutive static instructions) will eventually succeed, provided they contain only base ISA instructions other than loads, stores, and taken branches." (Their "base ISA" excludes floating-point and multiply/divide instructions.)

Unfortunately, neither ARM nor PPC appear to precisely document the architectural constraints under which forward progress must be guaranteed by the implementation.
They certainly have the same underlying implementation issues that give rise to the above rules -- that much seems documented -- they just don't appear to make explicit guarantees on how you can guarantee success. ARM does "recommend" that LL/SC loops fit within 128 bytes, though.
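To make the deterministic-failure shape mentioned above concrete, here's a rough sketch (again GNU-as ARM syntax, with illustrative stack offsets -- not the output of any particular compiler) of the kind of code the -O0 problem produces: the atomic variable lives on the stack, and a register-allocator spill lands inside the loop in the same exclusives granule, so the loop can fail forever even with no other CPU touching the location:

    @ The atomic word sits at [sp, #12]; the allocator has spilled a temporary
    @ to [sp, #8] inside the loop. If both addresses fall within the same
    @ reservation granule, the ordinary STR clears the reservation taken by the
    @ LDREX on some cores, and the STREX then fails on every iteration.
        add     r0, sp, #12         @ address of the stack-allocated atomic
    1:  ldrex   r1, [r0]            @ take the reservation
        add     r2, r1, #1          @ compute the new value
        str     r1, [sp, #8]        @ the spill: an ordinary store into the granule
        strex   r3, r2, [r0]        @ fails: the reservation is already gone
        cmp     r3, #0
        bne     1b                  @ ...so retry, lose it again, forever

And even on cores where a same-CPU store leaves the local monitor alone, that extra store is exactly the kind of thing that can reintroduce the cross-CPU ping-pong live-lock described above.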
Hi James,

Thanks for the very detailed summary, there's some really interesting details in there that I wasn't aware of.

On 10 May 2016 at 12:22, James Knight <jyknight at google.com> wrote:
> Anyone got any other ideas for how to do this, without needing to introduce
> a bunch of per-target code?

My only vaguely plausible idea to actually fix it so far was some kind of do-not-touch region for the register allocator to avoid spills (and, in light of your explanations, other things). Maybe some kind of MBB level no-spill/volatile flag would even be enough.

IME most of the optimization benefits come at ISel time (choosing better branch sequences, INC vs ADD etc), so I'm not *too* worried about hobbling later machine passes.

For completeness, I've also been tempted to simply check for spills and report_fatal_error given how often it's actually come up in the last few years. That's really unfriendly though, so probably not a good idea.

> That last restriction seems most odd as a hard constraint, as opposed to
> just a performance win. It is apparently because a common implementation is
> for the LL instruction to pull the cache-line local, and invalidate it from
> all remote caches (as if the LL were actually a write). The subsequent SC
> will then succeed only if the cacheline has not been invalidated since the
> LL. With a naive implementation, LL operations on two CPUs executing the
> same loop might just ping-pong the cacheline back and forth between them
> forever, without either CPU being able to successfully execute the SC -- by
> the time each got to it, the cacheline will have already been invalidated by
> the other CPU. Delaying the remote invalidation request for a little while
> in this situation is sufficient to prevent that problem. (But how long is a
> little while?)

I have a suspicion this kind of unpredictable starvation (or something like it) is why ARM added such CISCy instructions to handle things in v8.1.

Cheers.

Tim.
> On May 10, 2016, at 1:02 PM, Tim Northover via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi James,
>
> Thanks for the very detailed summary, there's some really interesting
> details in there that I wasn't aware of.
>
> On 10 May 2016 at 12:22, James Knight <jyknight at google.com> wrote:
>> Anyone got any other ideas for how to do this, without needing to introduce
>> a bunch of per-target code?
>
> My only vaguely plausible idea to actually fix it so far was some kind
> of do-not-touch region for the register allocator to avoid spills
> (and, in light of your explanations, other things). Maybe some kind of
> MBB level no-spill/volatile flag would even be enough.

One possibility would be to present the sequence as an instruction bundle. That would avoid any instructions/spills/reloads getting moved into the middle of it. On the other hand, that doesn't sound all that different from just expanding a pseudo instruction...

- Matthias
Thanks for the writeup, that is indeed pretty ugly. Simple asm(:::"memory") isn't sufficient either, since the regalloc can decide to spill :-(

On Tue, May 10, 2016 at 12:22 PM, James Knight via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Unfortunately, neither ARM nor PPC appear to precisely document the
> architectural constraints under which forward progress must be guaranteed
> by the implementation. They certainly have the same underlying
> implementation issues that give rise to the above rules -- that much seems
> documented -- they just don't appear to make explicit guarantees on how you
> can guarantee success. ARM does "recommend" that LL/SC loops fit within
> 128 bytes, though.

For ARMv7, from the ARM ARM:

    A Load-Exclusive instruction tags a small block of memory for exclusive
    access. The size of the tagged block is IMPLEMENTATION DEFINED, see Tagging
    and the size of the tagged memory block on page A3-121. A Store-Exclusive
    instruction to the same address clears the tag.

And:

    The value of a in this assignment is IMPLEMENTATION DEFINED, between a
    minimum value of 3 and a maximum value of 11. For example, in an
    implementation where a is 4, a successful LDREX of address 0x000341B4 gives
    a tag value of bits[31:4] of the address, giving 0x000341B. This means that
    the four words of memory from 0x000341B0 to 0x000341BF are tagged for
    exclusive access.

    The size of the tagged memory block is called the Exclusives Reservation
    Granule. The Exclusives Reservation Granule is IMPLEMENTATION DEFINED in
    the range 2-512 words:
    • 2 words in an implementation where a is 3
    • 512 words in an implementation where a is 11
Replied too early... Below:

On Tue, May 10, 2016 at 2:04 PM, JF Bastien <jfb at google.com> wrote:
> Thanks for the writeup, that is indeed pretty ugly. Simple
> asm(:::"memory") isn't sufficient either, since the regalloc can decide to
> spill :-(
>
> On Tue, May 10, 2016 at 12:22 PM, James Knight via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Unfortunately, neither ARM nor PPC appear to precisely document the
>> architectural constraints under which forward progress must be guaranteed
>> by the implementation. They certainly have the same underlying
>> implementation issues that give rise to the above rules -- that much seems
>> documented -- they just don't appear to make explicit guarantees on how you
>> can guarantee success. ARM does "recommend" that LL/SC loops fit within
>> 128 bytes, though.
>
> For ARMv7, from the ARM ARM:
>
>     A Load-Exclusive instruction tags a small block of memory for exclusive
>     access. The size of the tagged block is IMPLEMENTATION DEFINED, see
>     Tagging and the size of the tagged memory block on page A3-121. A
>     Store-Exclusive instruction to the same address clears the tag.
>
> And:
>
>     The value of a in this assignment is IMPLEMENTATION DEFINED, between a
>     minimum value of 3 and a maximum value of 11. For example, in an
>     implementation where a is 4, a successful LDREX of address 0x000341B4
>     gives a tag value of bits[31:4] of the address, giving 0x000341B. This
>     means that the four words of memory from 0x000341B0 to 0x000341BF are
>     tagged for exclusive access.
>
>     The size of the tagged memory block is called the Exclusives Reservation
>     Granule. The Exclusives Reservation Granule is IMPLEMENTATION DEFINED in
>     the range 2-512 words:
>     • 2 words in an implementation where a is 3
>     • 512 words in an implementation where a is 11

There's a bit more info here:

    When a processor writes using any instruction other than a Store-Exclusive:
    • if the write is to a physical address that is not covered by its local
      monitor the write does not affect the state of the local monitor
    • if the write is to a physical address that is covered by its local
      monitor it is IMPLEMENTATION DEFINED whether the write affects the state
      of the local monitor.

    If the local monitor is in the Exclusive Access state and the processor
    performs a Store-Exclusive to any address other than the last one from
    which it performed a Load-Exclusive, it is IMPLEMENTATION DEFINED whether
    the store updates memory, but in all cases the local monitor is reset to
    the Open Access state. This mechanism:
    • is used on a context switch, see Context switch support on page A3-122
    • must be treated as a software programming error in all other cases

And around similar parts of the manual. You can search the web for these; they're all "superseded" versions of the docs and I can't find the canonical one!
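As an aside on the "is used on a context switch" bullet above: the way system code exploits that behaviour is to deliberately knock the local monitor back to Open Access when switching threads, so a reservation taken by the outgoing thread can't let an unrelated STREX in the incoming thread succeed. A sketch of the idea (illustrative only, not any particular kernel's code):

    @ Context-switch path: make sure no exclusive reservation survives into the
    @ next thread.
        clrex                       @ ARMv6K and later: clear the local monitor
    @ Cores without CLREX get the same effect from a dummy STREX to a per-CPU
    @ scratch word, relying on the "Store-Exclusive to a different address
    @ resets the monitor to Open Access" rule quoted above:
    @     strex   r0, r1, [r2]      @ r2 -> scratch word; status result ignored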