Alex Bradbury via llvm-dev
2018-Jun-13 15:42 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
# RFC: Atomic LL/SC loops in LLVM revisited

## Summary

This proposal gives a brief overview of the challenges of lowering to LL/SC
loops and details the approach I am taking for RISC-V. Beyond getting feedback
on that work, my intention is to find consensus on moving other backends
towards a similar approach and sharing common code where feasible. Scroll down
to 'Questions' for a summary of the issues I think need feedback and
agreement.

For the original discussion of LL/SC lowering, please refer to James Knight's
2016 thread on the topic:
http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html

I'd like to thank James Knight, JF Bastien, and Eli Friedman for being so
generous with their review feedback on this atomics work so far.

## Background: Atomics in LLVM

See the documentation for full details <https://llvm.org/docs/Atomics.html>.
In short: LLVM defines memory ordering constraints to match the C11/C++11
memory model (unordered, monotonic, acquire, release, acqrel, seqcst). These
can be given as parameters to the atomic operations supported in LLVM IR:

* Fences with the fence instruction.
* Atomic load and store with the 'load atomic' and 'store atomic' variants of
the load/store instructions.
* Fetch-and-binop / read-modify-write operations through the atomicrmw
instruction.
* Compare and exchange via the cmpxchg instruction. It takes memory orderings
for both the success and failure cases, and can also be marked 'weak' rather
than 'strong', where the weak variant allows spurious failure.

## Background: Atomics in RISC-V

For full details see a recent draft of the ISA manual
<https://github.com/riscv/riscv-isa-manual/releases/download/draft-20180612-548fd40/riscv-spec.pdf>,
which incorporates work from the Memory Consistency Model Task Group to define
the memory model. RISC-V implements a weak memory model.

For those not familiar, RISC-V is a modular ISA, with standard extensions
indicated by single letters. The baseline 'RV32I' or 'RV64I' instruction sets
don't support atomic operations beyond fences. However, the RV32A and RV64A
instruction set extensions introduce AMOs (Atomic Memory Operations) and LR/SC
(known as load-linked/store-conditional on other architectures). 32-bit atomic
operations are supported natively on RV32, and both 32 and 64-bit atomic
operations are supported natively on RV64.

AMOs such as 'amoadd.w' implement simple fetch-and-binop behaviour. For LR/SC:
LR loads a word and registers a reservation on the source memory address. SC
writes the given word to the memory address and writes success (zero) or
failure (non-zero) into the destination register. LR/SC can be used to
implement compare-and-exchange or to implement AMOs that don't have a native
instruction. To do so, you would typically perform LR and SC in a loop.
However, there are strict limits on the instructions that can be placed
between an LR and an SC while still guaranteeing forward progress:

"""
The static code for the LR/SC sequence plus the code to retry the sequence in
case of failure must comprise at most 16 integer instructions placed
sequentially in memory. For the sequence to be guaranteed to eventually
succeed, the dynamic code executed between the LR and SC instructions can only
contain other instructions from the base “I” subset, excluding loads, stores,
backward jumps or taken backward branches, FENCE, FENCE.I, and SYSTEM
instructions. The code to retry a failing LR/SC sequence can contain backward
jumps and/or branches to repeat the LR/SC sequence, but otherwise has the same
constraints.
"""

The native AMOs and LR/SC allow ordering constraints to be specified in the
instruction. This isn't possible for load/store instructions, so fences must
be inserted to represent the ordering constraints. 8 and 16-bit atomic
load/store are therefore supported using 8 and 16-bit load/store plus
appropriate fences.

See Table A.6 on page 187 in the linked specification for a mapping from C/C++
atomic constructs to RISC-V instructions.

## Background: Lowering atomics in LLVM

The AtomicExpandPass can help you support atomics for your target in a number
of ways, e.g. inserting fences around atomic loads/stores, or converting an
atomicrmw/cmpxchg to an LL/SC loop. It operates as an IR-level pass, meaning
the latter ability is problematic - there is no way to guarantee that the
invariants for the LL/SC loop required by the target architecture will be
maintained. This shows up most frequently when register spills are introduced
at O0, but spills could theoretically still occur at higher optimisation
levels, and there are other potential sources of issues: inappropriate
instruction selection, machine block placement, machine outlining (though see
D47654 and D47655), and likely more.

I highly encourage you to read James Knight's previous post on this topic,
which goes into much more detail about the issues handling LL/SC
<http://lists.llvm.org/pipermail/llvm-dev/2016-May/099490.html>. The situation
remains pretty much the same:

* ARM and AArch64 expand to LL/SC loops in IR using AtomicExpandPass for O1
and above, but use a custom post-regalloc expansion for O0.
* MIPS doesn't use AtomicExpandPass, but selects atomic pseudo-instructions
which it expands to LL/SC loops in EmitInstrWithCustomInserter. This still has
the problems described above, so MIPS is in the process of moving towards a
two-stage lowering, with the LL/SC loop lowered after register allocation. See
D31287 <https://reviews.llvm.org/D31287>.
* Hexagon unconditionally expands to LL/SC loops in IR using AtomicExpandPass.

Lowering a word-sized atomic operation to an LL/SC loop is typically trivial,
requiring little surrounding code. Part-word atomics require additional
shifting and masking as a word-sized access is used. It would be beneficial if
the code to generate this shifting and masking could be shared between
targets, and if the operations that don't need to be in the LL/SC loop are
exposed for LLVM optimisation.

The path forwards is very clearly to treat the LL/SC loop as an indivisible
operation which is expanded as late as possible (and certainly after register
allocation). However, there are a few ways of achieving this.

If atomic operations of a given size aren't supported, then calls should be
created to the helper functions in libatomic, and this should be done
consistently for all atomic operations of that size. I actually found GCC is
buggy in that respect <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86005>.

## Proposed lowering strategy (RISC-V)

Basic principles:

* The LL/SC loop should be treated as a black box, and expanded post-RA.
* Don't introduce intrinsics that simply duplicate existing IR instructions.
* If code can be safely expanded in the IR, do it there. [I'd welcome feedback
on this one - should I be taking a closer look at expansion in SelectionDAG
legalisation?]

The above can be achieved by extending AtomicExpandPass to support a 'Custom'
expansion method, which uses a TargetLowering call to expand to custom IR,
including an appropriate intrinsic representing the LL+SC loop.
Atomic operations are lowered in the following ways:

* Atomic load/store: Allow AtomicExpandPass to insert appropriate fences.
* Word-sized AMO supported by a native instruction: Leave the IR unchanged and
use the normal instruction selection mechanism.
* Word-sized AMO without a native instruction: Select a pseudo-instruction
using the normal instruction selection mechanism. This pseudo-instruction will
be expanded after register allocation.
* Part-word AMO without a native instruction: Shifting and masking that occurs
outside of the LL/SC loop is expanded in the IR, and a call to a
target-specific intrinsic to implement the LL/SC loop is inserted (e.g.
llvm.riscv.masked.atomicrmw.add.i32). The intrinsic is matched to a
pseudo-instruction which is expanded after register allocation.
* Part-word AMO without a native instruction that can be implemented by a
native word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be implemented
by 32-bit amoand, amoor, amoxor. Perform this conversion as an IR
transformation.
* Word-sized compare-and-exchange: Lower to a pseudo-instruction using the
normal instruction selection mechanism. This pseudo-instruction will be
expanded after register allocation.
* Part-word compare-and-exchange: Handled similarly to part-word AMOs, calling
llvm.riscv.masked.cmpxchg.i32.

Scratch registers for these pseudo-instructions are modelled as in ARM and
AArch64, by specifying multiple outputs and using an @earlyclobber constraint
to ensure the register allocator assigns unique registers. e.g.:

class PseudoCmpXchg
    : Pseudo<(outs GPR:$res, GPR:$scratch),
             (ins GPR:$addr, GPR:$cmpval, GPR:$newval, i32imm:$ordering), []> {
  let Constraints = "@earlyclobber $res,@earlyclobber $scratch";
  let mayLoad = 1;
  let mayStore = 1;
  let hasSideEffects = 0;
}

Note that there are additional complications with cmpxchg such as supporting
weak cmpxchg (which requires returning a success value), or supporting
different failure orderings. It looks like the differentiation between
strong/weak cmpxchg doesn't survive the translation to SelectionDAG right now.
Supporting only strong cmpxchg and using the success ordering for the failure
case is conservative but correct, I believe.

In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at the
latest possible moment. The RISCVExpandPseudoInsts pass is registered with
addPreEmitPass2.

The main aspect I'm unhappy with in this approach is the need to introduce new
intrinsics. Ideally these would be documented as not for use by frontends and
subject to future removal or alteration - is there precedent for this?
Alternatively, see the suggestion below to introduce target-independent masked
AMO intrinsics.

## Alternative options

1. Don't expand anything in IR, and lower to a single monolithic
pseudo-instruction that is expanded at the last minute.
2. Don't expand anything in IR, and lower to pseudo-instructions in stages.
Lower to a monolithic pseudo-instruction where any logic outside of the LL/SC
loop is expanded in EmitInstrWithCustomInserter, but the LL/SC loop is
represented by a new pseudo-instruction. This final pseudo-instruction is then
expanded after register allocation. This minimises the possibility for sharing
logic between backends, but does mean we don't need to expose new intrinsics.
Mips adopts this approach in D31287.
3. Target-independent SelectionDAG expansion code converts unsupported atomic
operations, e.g.
rather than converting `atomicrmw add i8` to AtomicLoadAdd, expand to nodes
that align the address and calculate the mask, as well as an
AtomicLoadAddMasked node. I haven't looked at this in great detail.
4. Introduce masked atomic operations to the IR. Mentioned for completeness;
I don't think anyone wants this.
5. Introduce target-independent intrinsics for masked atomic operations. This
seems worthy of consideration.

For 1. and 2. the possibility for sharing logic between backends is minimised,
and the address calculation, masking and shifting logic is mostly hidden from
optimisations (though option 2. allows e.g. MachineCSE). There is the
advantage of avoiding the need for new intrinsics.

## Patches up for review

I have patches up for review which implement the described strategy. More
could be done to increase the potential for code reuse across targets, but I
thought it would be worth getting feedback on the path forwards first.

* D47587: [RISCV] Codegen support for atomic operations on RV32I.
<https://reviews.llvm.org/D47587>. Simply adds support for lowering fences and
uses AtomicExpandPass to generate libatomic calls otherwise. Committed in
rL334590.
* D47589: [RISCV] Add codegen support for atomic load/stores with RV32A.
<https://reviews.llvm.org/D47589>. Use AtomicExpandPass to insert fences for
atomic load/store. Committed in rL334591.
* D47882: [RISCV] Codegen for i8, i16, and i32 atomicrmw with RV32A.
<https://reviews.llvm.org/D47882>. Implements the lowering strategy described
above for atomicrmw and adds a hook to allow custom atomicrmw expansion in IR.
Under review.
* D48129: [RISCV] Improved lowering for bit-wise atomicrmw {i8, i16} on RV32A.
<https://reviews.llvm.org/D48129>. Uses 32-bit AMO{AND,OR,XOR} with
appropriately manipulated operands to implement 8/16-bit AMOs. Under review.
* D48130: [AtomicExpandPass]: Add a hook for custom cmpxchg expansion in IR.
<https://reviews.llvm.org/D48130>. Separate patch as this modifies the
existing shouldExpandAtomicCmpXchgInIR interface. Under review.
* D48141: [RISCV] Implement codegen for cmpxchg on RV32I.
<https://reviews.llvm.org/D48131>. Implements the lowering strategy described
above. Under review.

## Questions

To pick a few to get started:

* How do you feel about the described lowering strategy? Am I unfairly
overlooking a SelectionDAG approach?
* How much enthusiasm is there for moving ARM, AArch64, Mips, Hexagon, and
other architectures to use such an approach?
* If there is enthusiasm, how worthwhile is it to share logic for generation
of masks+shifts needed for part-word atomics?
* I'd like to see ARM+AArch64+Hexagon move away from the problematic expansion
in IR and to have that code deleted from AtomicExpandPass. Are there any
objections?
* What are your thoughts on the introduction of new target-independent
intrinsics for masked atomics?

Many thanks for your feedback,

Alex Bradbury, lowRISC CIC
James Y Knight via llvm-dev
2018-Jun-13 22:37 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
Great writeup, thanks! I'll respond to the rest in a bit, but I have a
digression first...

One detail I hadn't noticed in my previous investigation is that the
"compare_exchange_weak" (or our IR "cmpxchg weak") operation is theoretically
unsound -- and NECESSARILY so.

As mentioned in this thread, there are architectural guarantees for forward
progress of an LL/SC loop adhering to certain constraints. If your loop does
not meet those constraints, it may well livelock vs another CPU executing the
same loop, or even spuriously fail deterministically every time it's executed
on its own. This thread is all about a proposal to ensure that LLVM emits
LL/SC loops such that they're guaranteed to meet the architectural guarantees
and avoid the possibility of those bad situations.

There is not to my knowledge any architecture which makes any guarantee that
an LL/SC pair standing alone, without a containing loop, will not spuriously
fail. It may in fact fail every time it's executed. Of course, the typical
usage of compare_exchange_weak is to embed it within a loop containing other
user code, but given that it contains arbitrary user code, there's simply no
way we can guarantee that this larger loop meets the LL/SC-loop construction
requirements. Therefore, it's entirely possible for some particular
compare_exchange_weak-based loop to livelock or to deterministically
spuriously fail. That seems poor, and exactly the kind of thing that we'd
very much like to avoid...

So, now I wonder -- is there actually a measurable performance impact (on,
say, ARM) if we were to just always emit the "strong" cmpxchg loop? I'm
feeling rather inclined to say we should not implement the weak variant at
all. (And, if others agree with my diagnosis here, also propose to deprecate
it from the C/C++ languages...)
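To illustrate the structural problem, here is a minimal IR sketch (the
function and value names are invented for this example) of the usual
compare_exchange_weak retry loop. The retry edge is ordinary user-visible
control flow interleaved with arbitrary user code, so no backend can
guarantee that the overall loop satisfies an architecture's LL/SC
forward-progress rules:

    define i32 @weak_increment_sketch(i32* %p) {
    entry:
      %initial = load atomic i32, i32* %p monotonic, align 4
      br label %retry
    retry:
      %old = phi i32 [ %initial, %entry ], [ %observed, %retry ]
      %new = add i32 %old, 1           ; arbitrary user code sits here
      %pair = cmpxchg weak i32* %p, i32 %old, i32 %new seq_cst monotonic
      %observed = extractvalue { i32, i1 } %pair, 0
      %ok = extractvalue { i32, i1 } %pair, 1
      br i1 %ok, label %done, label %retry
    done:
      ret i32 %new
    }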
Alex Bradbury via llvm-dev
2018-Jun-13 23:29 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 13 June 2018 at 23:37, James Y Knight via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Great writeup, thanks!
>
> I'll respond to the rest in a bit, but I have a digression first...
>
> One detail I hadn't noticed in my previous investigation is that the
> "compare_exchange_weak" (or our IR "cmpxchg weak") operation is
> theoretically unsound -- and NECESSARILY so.
>
> As mentioned in this thread, there are architectural guarantees for forward
> progress of an LL/SC loop adhering to certain constraints. If your loop does
> not meet those constraints, it may well livelock vs another CPU executing
> the same loop, or even spuriously fail deterministically every time it's
> executed on its own. This thread is all about a proposal to ensure that LLVM
> emits LL/SC loops such that they're guaranteed to meet the architectural
> guarantees and avoid the possibility of those bad situations.
>
> There is not to my knowledge any architecture which makes any guarantee that
> an LL/SC pair standing alone, without a containing loop, will not spuriously
> fail. It may in fact fail every time it's executed. Of course, the typical
> usage of compare_exchange_weak is to embed it within a loop containing other
> user code, but given that it contains arbitrary user code, there's simply no
> way we can guarantee that this larger loop meets the LL/SC-loop construction
> requirements. Therefore, it's entirely possible for some particular
> compare_exchange_weak-based loop to livelock or to deterministically
> spuriously fail. That seems poor, and exactly the kind of thing that we'd
> very much like to avoid...

I hadn't noticed that either - excellent point! The RISC-V requirements for
forward progress encompass the LR, the SC, the instructions between them, and
the code to retry the LR/SC. A weak cmpxchg by definition separates the LR/SC
from the code to retry it, and so suffers from the same sort of problems that
we see when exposing LL and SC individually as IR intrinsics.

I agree with your diagnosis as far as RISC-V is concerned at least.

Best,

Alex
Alex Bradbury via llvm-dev
2018-Jun-14 09:01 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 13 June 2018 at 23:37, James Y Knight via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> So, now I wonder -- is there actually a measurable performance impact (on,
> say, ARM) if we were to just always emit the "strong" cmpxchg loop? I'm
> feeling rather inclined to say we should not implement the weak variant at
> all. (And, if others agree with my diagnosis here, also propose to deprecate
> it from the C/C++ languages...)

I'm not sure deprecation is justified. A "Sufficiently Smart Compiler" could
make use of the knowledge that a weak cmpxchg was considered sufficient by the
programmer, and use the weak cmpxchg form if the surrounding loop can be shown
to meet the forward progress guarantees of the architecture. In LLVM, this
could be done by analyzing the surrounding basic block when expanding the
cmpxchg pseudo after regalloc. It's unlikely it would be worth the hassle, but
it does seem a viable approach.

Best,

Alex
Tim Northover via llvm-dev
2018-Jun-14 09:28 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
> expansion in IR and to have that code deleted from AtomicExpandPass. Are
> there any objections?

I think it would be a great shame and I'd like to avoid it if at all possible,
though I'm also uncomfortable with the current situation.

The problem with late expansion is that even the simplest variants add some
rather unpleasant pseudo-instructions, and I've never even seen anyone attempt
to get good CodeGen out of it (the simplest example being "add xD, xD, #1" in
the loop for an increment, but there are obviously more). Doing so would
almost certainly involve duplicating a lot of basic arithmetic instructions
into AtomicRMW variants.

Added to that is the fact that actual CPU implementations are often a lot less
restrictive about what can go into a loop than is required (for example, even
spilling is fine on ours as long as the atomic object is not in the same
cache-line; I suspect that's normal). That casts the issue in a distinctly
theoretical light -- we've been doing this for years and as far as I'm aware
nobody has ever hit the issue in the real world, or even had CodeGen go wrong
in a way that *could* do so outside the -O0 situation.

OTOH that *is* an argument for performance over correctness when you get right
down to it, so I'm not sure I can be too forceful about it. At least not
without a counter-proposal to restore guaranteed correctness.

Tim.
Alex Bradbury via llvm-dev
2018-Jun-14 11:32 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 14 June 2018 at 10:28, Tim Northover via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
>> expansion in IR and to have that code deleted from AtomicExpandPass. Are
>> there any objections?
>
> I think it would be a great shame and I'd like to avoid it if at all
> possible, though I'm also uncomfortable with the current situation.

Thanks for the reply Tim. It's definitely fair to highlight that ARM and
AArch64 have few documented restrictions on code within an LL/SC loop when
compared to e.g. RISC-V or MIPS. I'm not sure about Hexagon.

> The problem with late expansion is that even the simplest variants add
> some rather unpleasant pseudo-instructions, and I've never even seen
> anyone attempt to get good CodeGen out of it (simplest example being
> "add xD, xD, #1" in the loop for an increment but there are obviously
> more). Doing so would almost certainly involve duplicating a lot of
> basic arithmetic instructions into AtomicRMW variants.

Let's separate the concerns here:

1) Quality of codegen within the LL/SC loop
   * What is the specific concern here? The LL/SC loop contains a very small
number of instructions, even for the masked atomicrmw case. Are you worried
about an extra arithmetic instruction or two? Sub-optimal control-flow?
Something else?
2) Number of new pseudo-instructions which must be introduced
   * You'd need new pseudos for each word-sized atomicrmw which expands to an
LL/SC loop, and an additional one for the masked form of the operation. You
could reduce the number of pseudos by taking the AtomicRMWInst::BinOp as a
parameter. The code to map the atomic op to the appropriate instruction is
tedious but straight-forward.

> Added to that is the fact that actual CPU implementations are often a
> lot less restrictive about what can go into a loop than is required
> (for example even spilling is fine on ours as long as the atomic
> object is not in the same cache-line; I suspect that's normal). That
> casts the issue in a distinctly theoretical light -- we've been doing
> this for years and as far as I'm aware nobody has ever hit the issue
> in the real world, or even had CodeGen go wrong in a way that *could*
> do so outside the -O0 situation.

That's true as long as the "Exclusives Reservation Granule" == 1 cache-line
and you don't deterministically cause the reservation to be cleared in some
other way: e.g. repeatedly generating a conflict miss or triggering a trap,
etc. I don't think any Arm cores ship with direct-mapped caches, so I'll admit
this is unlikely.

The possibility for issues increases if the Exclusives Reservation Granule is
larger. For the Cortex-M4, the ERG is the entire address range
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100166_0001_00_en/ric1417175928887.html>.
In that case, spills will surely clear the reservation.

There also seem to be documented and very strict forward progress constraints
for ARMv8-M. See
https://static.docs.arm.com/ddi0553/ah/DDI0553A_h_armv8m_arm.pdf p207:

"""
Forward progress can only be made using LoadExcl/StoreExcl loops if, for any
LoadExcl/StoreExcl loop within a single thread of execution, both of the
following are true:
• There are no explicit memory accesses, pre-loads, direct or indirect
register writes, cache maintenance instructions, SVC instructions, or
exception returns between the Load-Exclusive and the Store-Exclusive.
• The following conditions apply between the Store-Exclusive having returned
a fail result and the retry of the Load-Exclusive:
  – There are no stores to any location within the same Exclusives reservation
granule that the Store-Exclusive is accessing.
  – There are no direct or indirect register writes, other than changes to the
flag fields in APSR or FPSCR, caused by data processing or comparison
instructions.
  – There are no direct or indirect cache maintenance instructions, SVC
instructions, or exception returns.
"""

Of course it also states that the upper limit for the Exclusives Reservation
Granule is 2048 bytes, but the Cortex-M33 has an ERG of the entire address
range
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100230_0002_00_en/jfa1443092906126.html>
so something doesn't quite add up...

> OTOH that *is* an argument for performance over correctness when you
> get right down to it, so I'm not sure I can be too forceful about it.
> At least not without a counter-proposal to restore guaranteed
> correctness.

I suppose a machine-level pass could at least scan for any intervening
loads/stores in an LL/SC loop and check some other invariants, then warn/fail
if they occur. As you point out, this would be conservative for many Arm
implementations.

Best,

Alex
Krzysztof Parzyszek via llvm-dev
2018-Jun-15 15:21 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
Hexagon has very few restrictions on what will cause loss of a reservation:
those are stores to the same address (a 64-bit granule) or any sort of
exception/interrupt/system call. Other than that, the reservation should stay.
The architecture doesn't explicitly guarantee that, but in the absence of the
elements listed above, a program with LL/SC can be expected to make progress.

Consequently, the best way for us to handle LL/SC would be to expand them
early and let them be optimized as any other code. The usual optimization
restrictions should be sufficient to prevent introduction of factors causing
a loss of reservation.

With the constraints on LL/SC varying wildly between architectures, maybe we
should have several options available for different targets?

-Krzysztof

On 6/13/2018 10:42 AM, Alex Bradbury wrote:
> [full quote of the original RFC snipped]
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by
The Linux Foundation
James Y Knight via llvm-dev
2018-Jun-15 22:03 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On Wed, Jun 13, 2018 at 11:42 AM Alex Bradbury <asb at lowrisc.org> wrote:
> ## Proposed lowering strategy (RISC-V)
>
> Basic principles:
> * The LL/SC loop should be treated as a black box, and expanded post-RA.

It will probably come as no surprise, but I think this is 100% required for
correctness. I'm quite skeptical that it can be correct on any existing
architecture without doing this. My first inclination upon hearing that some
architecture may not require such careful handling is that their architecture
documentation probably failed to document the careful handling required, not
that it's not required.

However, in order to not block this on every architecture maintainer being
persuaded, it's a good idea to introduce the new functionality into
AtomicExpandPass, and switch architectures over to it as the arch maintainers
are convinced it's a good idea.

> * If code can be safely expanded in the IR, do it there. [I'd welcome
> feedback on this one - should I be taking a closer look at expansion in
> SelectionDAG legalisation?]

Definitely in favor of IR expansion -- we should do as much in IR as is
feasible, because it admits better optimization opportunities. Only the LL/SC
loop should be late-expanded.

> The above can be achieved by extending AtomicExpandPass to support a
> 'Custom' expansion method, which uses a TargetLowering call to expand to
> custom IR, including an appropriate intrinsic representing the LL+SC loop.

I think it'd be better to sink more of the functionality into AtomicExpandPass
itself (rather than adding a "Custom" hook). However, that ties into whether
to introduce a common intrinsic that can be used across architectures...

> * Part-word AMO without a native instruction that can be implemented by a
> native word-sized AMO: 8 and 16-bit atomicrmw {and,or,xor} can be
> implemented by 32-bit amoand, amoor, amoxor. Perform this conversion as an
> IR transformation.

+1, and that transform should be in common code in AtomicExpandPass.

> * Word-sized compare-and-exchange: Lower to a pseudo-instruction using the
> normal instruction selection mechanism. This pseudo-instruction will be
> expanded after register allocation.

On RISCV, implementing the whole thing in the pseudo is probably right, since
you only really have the inner loop.

But for other archs like ARMv7, I think it'll probably make sense to continue
to handle a bunch of the cmpxchg expansion in IR. There, the current cmpxchg
expansion can be quite complex, but only the loop really needs to be a
primitive (we'd need two loop variants, either "strex, ldrex, loop" or "ldrex,
strex, loop", depending on whether it generates an initial ldrex+barrier
first). All the rest -- initial ldrex+barrier, clrex, barriers -- can all
remain IR-level expanded.

I'll note that in all cases, both for RISCV and ARM and others, we _really_
would like to be able to have the compare-failure jump to a different address
than success. That isn't possible for an intrinsic call at the moment, but I
believe it will be possible to make that work soon, due to already ongoing
work for "asm goto", which requires similar. Once we have that ability, I
don't see any reason why the late-lowering cmpxchg pseudo should have any
performance downside vs IR expansion, at least w.r.t. any correct
optimizations.

> * Part-word compare-and-exchange: Handled similarly to part-word AMOs,
> calling llvm.riscv.masked.cmpxchg.i32.

Yep.
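For reference, a simplified sketch of the kind of IR-level cmpxchg expansion
being described for ARMv7 -- the expansion AtomicExpandPass actually emits is
more involved (barriers, weak variants, initial-load variants), so treat the
structure here as an approximation. Only the ldrex/strex loop would need to
become a post-RA primitive; the clrex on the failure path (and any fences)
can stay as ordinary IR:

    define { i32, i1 } @cmpxchg_expansion_sketch(i32* %p, i32 %cmp, i32 %new) {
    entry:
      br label %loop
    loop:
      ; the ldrex/strex loop (this block plus %trystore) is the part that
      ; would become an opaque post-RA pseudo
      %old = call i32 @llvm.arm.ldrex.p0i32(i32* %p)
      %should_store = icmp eq i32 %old, %cmp
      br i1 %should_store, label %trystore, label %nostore
    trystore:
      %status = call i32 @llvm.arm.strex.p0i32(i32 %new, i32* %p)
      %again = icmp ne i32 %status, 0
      br i1 %again, label %loop, label %done
    nostore:
      ; failure path: clearing the exclusive monitor can remain ordinary IR
      call void @llvm.arm.clrex()
      br label %done
    done:
      %ok = phi i1 [ true, %trystore ], [ false, %nostore ]
      %res0 = insertvalue { i32, i1 } undef, i32 %old, 0
      %res1 = insertvalue { i32, i1 } %res0, i1 %ok, 1
      ret { i32, i1 } %res1
    }

    declare i32 @llvm.arm.ldrex.p0i32(i32*)
    declare i32 @llvm.arm.strex.p0i32(i32, i32*)
    declare void @llvm.arm.clrex()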
> Note that there are additional complications with cmpxchg such as supporting
> weak cmpxchg (which requires returning a success value), or supporting
> different failure orderings. It looks like the differentiation between
> strong/weak cmpxchg doesn't survive the translation to SelectionDAG right
> now. Supporting only strong cmpxchg and using the success ordering for the
> failure case is conservative but correct I believe.

Yep. The distinction between strong/weak is only relevant to LL/SC
architectures, where you get to avoid making a loop. AFAIK, all ISAs which
have a native cmpxchg instruction implement the strong semantics for it, and
since the LL/SC loops are expanded in IR right now, they didn't need the weak
indicator in SelectionDAG. While strong doesn't _need_ a success indicator
output, as noted above it'll generate nicer code when we can.

> In the RISC-V case, the LL/SC loop pseudo-instructions are lowered at the
> latest possible moment. The RISCVExpandPseudoInsts pass is registered with
> addPreEmitPass2.
>
> The main aspect I'm unhappy with in this approach is the need to introduce
> new intrinsics. Ideally these would be documented as not for use by
> frontends and subject to future removal or alteration - is there precedent
> for this? Alternatively, see the suggestion below to introduce
> target-independent masked AMO intrinsics.

I agree -- it'd be great to somehow name/annotate these intrinsics, such that
it's clear that they're a private implementation detail _INSIDE_ the llvm
target codegen passes, with no stability guaranteed. Not even "opt" passes
should be emitting them, so they should never end up in a bitcode file where
we'd need to provide backwards compatibility.

Maybe we can call them something like
"llvm.INTERNAL_CODEGEN.f90d461eee5d32a1.masked.atomicrmw.add.i32" (where the
random number in the middle is something that changes), and document that
nobody must use intrinsics in the INTERNAL_CODEGEN, other than llvm CodeGen.

> ## Alternative options

[A bunch of alternatives I don't like, so I'm omitting them]

> 5. Introduce target-independent intrinsics for masked atomic operations.
> This seems worthy of consideration.

I think it's definitely worthwhile looking to see if the set of intrinsics for
the atomicrmw operations (in particular, the set of additional arguments
computed in the pre-loop section, and the return value) are likely going to be
the same on the different architectures we have now. If so, I think it's
definitely worthwhile making a commonly-named intrinsic. However, even so, it
should remain for internal use only, and only implemented on targets which
tell AtomicExpandPass to generate it. I'd like it to be common only to enhance
code-reuse among the targets, not to provide a user-facing API.
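To make the "additional arguments computed in the pre-loop section" concrete,
here is a rough sketch of the IR a part-word atomicrmw could expand to on
RV32 under this proposal. The intrinsic name follows the RFC; its exact
signature, operand order, and the encoding of the ordering argument are
illustrative assumptions rather than a settled interface:

    ; Possible result of expanding: %old = atomicrmw add i8* %ptr, i8 %val seq_cst
    define i8 @masked_amo_add_i8_sketch(i8* %ptr, i8 %val) {
      %addr = ptrtoint i8* %ptr to i32
      %alignedaddr = and i32 %addr, -4              ; word-align the address
      %wordptr = inttoptr i32 %alignedaddr to i32*
      %byteoff = and i32 %addr, 3
      %shamt = shl i32 %byteoff, 3                  ; shift amount in bits
      %mask = shl i32 255, %shamt                   ; selects the addressed byte
      %valext = zext i8 %val to i32
      %valshifted = shl i32 %valext, %shamt
      ; Only this call becomes the post-RA LL/SC loop pseudo; the final
      ; argument stands in for the memory ordering.
      %oldword = call i32 @llvm.riscv.masked.atomicrmw.add.i32(i32* %wordptr, i32 %valshifted, i32 %mask, i32 7)
      %oldshifted = lshr i32 %oldword, %shamt
      %old = trunc i32 %oldshifted to i8
      ret i8 %old
    }

    declare i32 @llvm.riscv.masked.atomicrmw.add.i32(i32*, i32, i32, i32)

The address arithmetic, masking, and shifting stay as ordinary IR visible to
the optimisers; only the intrinsic call is opaque until the post-RA expansion,
and those pre-loop arguments (aligned pointer, shifted value, mask, ordering)
are the candidates for a commonly-named intrinsic.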
James Y Knight via llvm-dev
2018-Jun-15 23:20 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On Thu, Jun 14, 2018 at 5:28 AM Tim Northover <t.p.northover at gmail.com> wrote:
>> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
>> expansion in IR and to have that code deleted from AtomicExpandPass. Are
>> there any objections?
>
> I think it would be a great shame and I'd like to avoid it if at all
> possible, though I'm also uncomfortable with the current situation.
>
> The problem with late expansion is that even the simplest variants add
> some rather unpleasant pseudo-instructions, and I've never even seen
> anyone attempt to get good CodeGen out of it (simplest example being
> "add xD, xD, #1" in the loop for an increment but there are obviously
> more). Doing so would almost certainly involve duplicating a lot of
> basic arithmetic instructions into AtomicRMW variants.

I think it would be potentially useful to also add pseudos for the atomicrmw
operations with an immediate operand of the range allowed by the
corresponding arithmetic operation. That doesn't seem like it'd be difficult.

But I rather doubt whether any further refinement beyond that would be useful
in practice. For example, sure -- ARM has shifted-add-register operations.
And an "add r2, r2, r1, lsl #2" could potentially be (but isn't currently!)
generated inside the ll/sc loop for something like:

  %shl = shl i32 %incr, 2
  %0 = atomicrmw add i32* %i, i32 %shl seq_cst

...but the alternative is simply to generate an lsl instruction before the
loop. I'm quite skeptical whether the ability to omit the 'lsl' would really
be worth caring about. Plus we aren't even doing it now. :)

> Added to that is the fact that actual CPU implementations are often a
> lot less restrictive about what can go into a loop than is required
> (for example even spilling is fine on ours as long as the atomic
> object is not in the same cache-line; I suspect that's normal). That
> casts the issue in a distinctly theoretical light -- we've been doing
> this for years and as far as I'm aware nobody has ever hit the issue
> in the real world, or even had CodeGen go wrong in a way that *could*
> do so outside the -O0 situation.

So, there are two separate issues --

1. Deterministic failure of a loop even if it's the only thing running. For
example, from register spilling to the same cacheline within the loop. As you
note, that has been observed, but only at -O0.

2. Livelock between two such loops on different CPUs. AFAIK this has not been
reported, but I expect it would be very difficult to observe, because one CPU
will likely receive an OS interrupt eventually, and thus break out of the
livelock situation. This would likely exhibit simply as worse performance than
expected, which I expect would be very difficult to track down and report.

I'd suspect that generally, most of the violations of the architectural
requirements would only cause #2, not #1. It seems entirely possible to me
that this change may well be good both for correctness and
performance-under-load.

> OTOH that *is* an argument for performance over correctness when you
> get right down to it, so I'm not sure I can be too forceful about it.
> At least not without a counter-proposal to restore guaranteed
> correctness.

I'd rather look at it from the other direction -- does using a late-lowered
atomicrmw pseudo _actually_ cause any observable performance degradation? I
rather suspect it won't.

Now -- cmpxchg, as mentioned before, is trickier, and I could see that the
inability to jump to a separate failure block may make a noticeable difference
at the moment.
James Y Knight via llvm-dev
2018-Jun-15 23:44 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On Fri, Jun 15, 2018 at 11:21 AM Krzysztof Parzyszek
<kparzysz at codeaurora.org> wrote:
> Hexagon has very few restrictions on what will cause loss of a
> reservation: those are stores to the same address (a 64-bit granule) or
> any sort of exception/interrupt/system call. Other than that the
> reservation should stay. The architecture doesn't explicitly guarantee
> that, but in the absence of the elements listed above, a program with
> LL/SC can be expected to make progress.
>
> Consequently, the best way for us to handle LL/SC would be to expand
> them early and let them be optimized as any other code. The usual
> optimization restrictions should be sufficient to prevent introduction
> of factors causing a loss of reservation.
>
> With the constraints on LL/SC varying wildly between architectures,
> maybe we should have several options available for different targets?

The constraints on LL/SC don't seem to vary that wildly between architectures
(except Hexagon, I guess!) -- sure, the exact details are different, but they
all (the others) have some set of requirements on the structure of their loop
which are not generally possible to guarantee in a high-level abstract
representation. Given that, the details of the requirements aren't
particularly relevant, they just need to be properly adhered to when writing
the target-specific code to generate the machine instructions for the ll/sc
loop.

I'm a bit skeptical that Hexagon is truly different in this regard from all
the other architectures. But, that said, I don't think it'd be at all
problematic to keep the IR-level lowering code around for Hexagon to continue
to use, even if we do migrate everything else away. We definitely need to
introduce new options alongside the current options to start with, anyways --
at the very least simply to allow migrating each architecture separately.
Alex Bradbury via llvm-dev
2018-Jun-17 11:59 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 15 June 2018 at 16:21, Krzysztof Parzyszek via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Hexagon has very few restrictions on what will cause loss of a
> reservation: those are stores to the same address (a 64-bit granule) or any
> sort of exception/interrupt/system call. Other than that the reservation
> should stay. The architecture doesn't explicitly guarantee that, but in the
> absence of the elements listed above, a program with LL/SC can be expected
> to make progress.
>
> Consequently, the best way for us to handle LL/SC would be to expand them
> early and let them be optimized as any other code. The usual optimization
> restrictions should be sufficient to prevent introduction of factors causing
> a loss of reservation.

Thanks Krzysztof, it sounds like Hexagon gives so much freedom for LL/SC loops
that there really isn't a need to be concerned about expansion in IR. As such,
it would clearly make sense to retain that codepath over the long term, even
if architectures are discouraged from using it unless they're _really_ sure
it's safe for their architecture.

> With the constraints on LL/SC varying wildly between architectures, maybe we
> should have several options available for different targets?

Hexagon is the first architecture I've encountered where there are essentially
no restrictions, so that certainly changes things. I was mainly trying to
determine consensus for the ultimate goal. If everyone agreed that late
expansion is the path forwards, we could plan to remove and deprecate the
current early expansion codepath at some point.

Best,

Alex
Alex Bradbury via llvm-dev
2018-Jun-17 12:21 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 15 June 2018 at 23:03, James Y Knight via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> However, in order to not block this on every architecture maintainer being
> persuaded, it's a good idea to introduce the new functionality into
> AtomicExpandPass, and switch architectures over to it as the arch
> maintainers are convinced it's a good idea.

Indeed, even if everyone agreed this was a good idea I wasn't expecting to do
this all at once.

>> The above can be achieved by extending AtomicExpandPass to support a
>> 'Custom' expansion method, which uses a TargetLowering call to expand to
>> custom IR, including an appropriate intrinsic representing the LL+SC loop.
>
> I think it'd be better to sink more of the functionality into
> AtomicExpandPass itself (rather than adding a "Custom" hook). However, that
> ties into whether to introduce a common intrinsic that can be used across
> architectures...

Yes, I'd like to do more in AtomicExpandPass. Adding the 'Custom' hack was the
easiest way of prototyping this, and this thread will hopefully give good
guidance on the level of interest in using this in a target-independent way.

>> * Word-sized compare-and-exchange: Lower to a pseudo-instruction using the
>> normal instruction selection mechanism. This pseudo-instruction will be
>> expanded after register allocation.
>
> On RISCV, implementing the whole thing in the pseudo is probably right,
> since you only really have the inner loop.
>
> But for other archs like ARMv7, I think it'll probably make sense to
> continue to handle a bunch of the cmpxchg expansion in IR. There, the
> current cmpxchg expansion can be quite complex, but only the loop really
> needs to be a primitive (we'd need two loop variants, either "strex, ldrex,
> loop" or "ldrex, strex, loop", depending on whether it generates an initial
> ldrex+barrier first). All the rest -- initial ldrex+barrier, clrex,
> barriers -- can all remain IR-level expanded.

Good point. As you say, the RISC-V expansion is much more straight-forward.
Although the reservation could be cleared eagerly after compare failure by an
SC to a dummy memory location, I don't currently intend to do so:

1) GCC also doesn't intend to use such an expansion.
2) No existing microarchitectural implementations have been shown to benefit
from this manual eager reservation clearing.
3) Sticking to the simplest expansion is a good starting point, and future
microarchitects are most likely to optimise for code that is out in the wild.

> I'll note that in all cases, both for RISCV and ARM and others, we _really_
> would like to be able to have the compare-failure jump to a different
> address than success. That isn't possible for an intrinsic call at the
> moment, but I believe it will be possible to make that work soon, due to
> already ongoing work for "asm goto", which requires similar. Once we have
> that ability, I don't see any reason why the late-lowering cmpxchg pseudo
> should have any performance downside vs IR expansion, at least w.r.t. any
> correct optimizations.

I've seen periodically recurring discussions, but is someone actually actively
working on this?

>> The main aspect I'm unhappy with in this approach is the need to introduce
>> new intrinsics. Ideally these would be documented as not for use by
>> frontends and subject to future removal or alteration - is there precedent
>> for this? Alternatively, see the suggestion below to introduce
>> target-independent masked AMO intrinsics.
>
> I agree -- it'd be great to somehow name/annotate these intrinsics, such
> that it's clear that they're a private implementation detail _INSIDE_ the
> llvm target codegen passes, with no stability guaranteed. Not even "opt"
> passes should be emitting them, so they should never end up in a bitcode
> file where we'd need to provide backwards compatibility.
>
> Maybe we can call them something like
> "llvm.INTERNAL_CODEGEN.f90d461eee5d32a1.masked.atomicrmw.add.i32" (where the
> random number in the middle is something that changes), and document that
> nobody must use intrinsics in the INTERNAL_CODEGEN, other than llvm CodeGen.

llvm.internal_use_only.masked.atomicrmw.add.i32 would get the point across I
think. Is it not possible someone would generate a .bc after AtomicExpandPass
has run? Of course, even now there's no guarantee such a file would work on a
future version of LLVM, e.g. the atomics lowering strategy could change from
one release to the next.

>> ## Alternative options
>
> [A bunch of alternatives I don't like, so I'm omitting them]
>
>> 5. Introduce target-independent intrinsics for masked atomic operations.
>> This seems worthy of consideration.
>
> I think it's definitely worthwhile looking to see if the set of intrinsics
> for the atomicrmw operations (in particular, the set of additional arguments
> computed in the pre-loop section, and the return value) are likely going to
> be the same on the different architectures we have now. If so, I think it's
> definitely worthwhile making a commonly-named intrinsic. However, even so,
> it should remain for internal use only, and only implemented on targets
> which tell AtomicExpandPass to generate it. I'd like it to be common only to
> enhance code-reuse among the targets, not to provide a user-facing API.

Yes, giving the impression of providing a new user-facing API is my concern.
Particularly as we might want to define a comprehensive set of intrinsics but
have targets support only the subset they require for comprehensive atomics
support (as some instructions may go through the usual SDISel route without
transformation to intrinsics). A common set of intrinsics is probably ok, but
there are cases where you might want a different interface. For instance,
clearing a reservation requires an SC to a dummy address on RISC-V. Although
we have no intention of doing this in the RISC-V backend currently, targets
that wanted to implement such an approach might want to have that dummy
address as an argument to the intrinsic.

Best,

Alex
Alex Bradbury via llvm-dev
2018-Sep-21 10:06 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 13 June 2018 at 16:42, Alex Bradbury <asb at lowrisc.org> wrote:> ## Patches up for review > > I have patches up for review which implement the described strategy. More > could be done to increase the potential for code reuse across targets, but I > thought it would be worth getting feedback on the path forwards first. > > * D47587: [RISCV] Codegen support for atomic operations on RV32I. > <https://reviews.llvm.org/D47587>. Simply adds support for lowering fences and > uses AtomicExpandPass to generate libatomic calls otherwise. Committed in > rL334590. > * D47589: [RISCV] Add codegen support for atomic load/stores with RV32A. > <https://reviews.llvm.org/D47589>. Use AtomicExpandPass to insert fences for > atomic load/store. Committed in rL334591. > * D47882: [RISCV] Codegen for i8, i16, and i32 atomicrmw with RV32A. > <https://reviews.llvm.org/D47882>. Implements the lowering strategy described > above for atomicrmw and adds a hook to allow custom atomicrmw expansion in IR. > Under review. > * D48129: [RISCV] Improved lowering for bit-wise atomicrmw {i8, i16} on RV32A. > <https://reviews.llvm.org/D48129>. Uses 32-bit AMO{AND,OR,XOR} with > appropriately manipulated operands to implement 8/16-bit AMOs. Under review. > * D48130: [AtomicExpandPass]: Add a hook for custom cmpxchg expansion in IR. > <https://reviews.llvm.org/D48130> Separated patch as this modifies the > existing shouldExpandAtomicCmpXchgInIR interface. Under review. > * D48141: [RISCV] Implement codegen for cmpxchg on RV32I. > <https://reviews.llvm.org/D48131> Implements the lowering strategy described > above. Under review.To update on this, everything has now landed other than D48141 which implements cmpxchg for RISC-V and introduces target-independent code to help lowering part-word cmpxchg. In response to the comments on this thread and review comments I revised the patches so that as much lowering as possible is done in AtomicExpandPass, thus maximising the potential for code reuse across different targets. I've additionally posted https://reviews.llvm.org/D52234 which updates the AtomicExpandPass docs. If there is consensus, I think both James and I would prefer to more strongly encourage that targets avoid expanding LL/SC loops in IR on the grounds that LLVM has no way to guarantee that the generated code complies with the architecture's requirements for forward progress. I believe it would be worthwhile for the Mips backend to evaluate using AtomicExpandPass in the same way the RISC-V backend now does. It wouldn't provide a correctness benefit over the current logic, but should help reduce the amount of target-specific code for handling atomics. Tim: did you get a chance at looking at a verifier pass? James suggested what the eventual solution for "asm goto" may well help in cleaning up the suboptimal control-flow around a cmpxchg lowered using the approach described in this RFC. Does anybody know of any work in this area? Best, Alex