Alex Bradbury via llvm-dev
2018-Jun-14 11:32 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 14 June 2018 at 10:28, Tim Northover via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>> * I'd like to see ARM+AArch64+Hexagon move away from the problematic
>> expansion in IR and to have that code deleted from AtomicExpandPass. Are
>> there any objections?
>
> I think it would be a great shame and I'd like to avoid it if at all
> possible, though I'm also uncomfortable with the current situation.

Thanks for the reply, Tim. It's definitely fair to highlight that ARM and
AArch64 have few documented restrictions on code within an LL/SC loop when
compared to e.g. RISC-V or MIPS. I'm not sure about Hexagon.

> The problem with late expansion is that even the simplest variants add
> some rather unpleasant pseudo-instructions, and I've never even seen
> anyone attempt to get good CodeGen out of it (simplest example being
> "add xD, xD, #1" in the loop for an increment but there are obviously
> more). Doing so would almost certainly involve duplicating a lot of
> basic arithmetic instructions into AtomicRMW variants.

Let's separate the concerns here:

1) Quality of codegen within the LL/SC loop
   * What is the specific concern here? The LL/SC loop contains a very
     small number of instructions, even for the masked atomicrmw case.
     Are you worried about an extra arithmetic instruction or two?
     Sub-optimal control flow? Something else?

2) Number of new pseudo-instructions which must be introduced
   * You'd need new pseudos for each word-sized atomicrmw which expands
     to an LL/SC loop, and an additional one for the masked form of the
     operation. You could reduce the number of pseudos by taking the
     AtomicRMWInst::BinOp as a parameter. The code to map the atomic op
     to the appropriate instruction is tedious but straightforward
     (sketch below).

> Added to that is the fact that actual CPU implementations are often a
> lot less restrictive about what can go into a loop than is required
> (for example even spilling is fine on ours as long as the atomic
> object is not in the same cache-line; I suspect that's normal). That
> casts the issue in a distinctly theoretical light -- we've been doing
> this for years and as far as I'm aware nobody has ever hit the issue
> in the real world, or even had CodeGen go wrong in a way that *could*
> do so outside the -O0 situation.

That's true as long as the "Exclusives Reservation Granule" == 1 cache
line and you don't deterministically cause the reservation to be cleared
in some other way: e.g. by repeatedly generating a conflict miss or
triggering a trap. I don't think any Arm cores ship with direct-mapped
caches, so I'll admit this is unlikely.

The possibility for issues increases if the Exclusives Reservation Granule
is larger. For the Cortex-M4, the ERG is the entire address range
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100166_0001_00_en/ric1417175928887.html>.
In that case, spills will surely clear the reservation.

There also seem to be documented and very strict forward-progress
constraints for ARMv8-M. See
https://static.docs.arm.com/ddi0553/ah/DDI0553A_h_armv8m_arm.pdf p207:

"""
Forward progress can only be made using LoadExcl/StoreExcl loops if, for
any LoadExcl/StoreExcl loop within a single thread of execution, both of
the following are true:
• There are no explicit memory accesses, pre-loads, direct or indirect
  register writes, cache maintenance instructions, SVC instructions, or
  exception returns between the Load-Exclusive and the Store-Exclusive.
• The following conditions apply between the Store-Exclusive having
  returned a fail result and the retry of the Load-Exclusive:
  – There are no stores to any location within the same Exclusives
    reservation granule that the Store-Exclusive is accessing.
  – There are no direct or indirect register writes, other than changes
    to the flag fields in APSR or FPSCR, caused by data processing or
    comparison instructions.
  – There are no direct or indirect cache maintenance instructions, SVC
    instructions, or exception returns.
"""

Of course, it also states that the upper limit for the Exclusives
Reservation Granule is 2048 bytes, but the Cortex-M33 has an ERG of the
entire address range
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.100230_0002_00_en/jfa1443092906126.html>
so something doesn't quite add up...

> OTOH that *is* an argument for performance over correctness when you
> get right down to it, so I'm not sure I can be too forceful about it.
> At least not without a counter-proposal to restore guaranteed
> correctness.

I suppose a machine-level pass could at least scan for any intervening
loads/stores in an LL/SC loop and check some other invariants, then
warn/fail if they occur. As you point out, this would be conservative for
many Arm implementations.

Best,

Alex
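To illustrate point (2) above, the sketch below shows roughly how a single
loop pseudo carrying the AtomicRMWInst::BinOp could emit the arithmetic
step between the load-reserved and the store-conditional during post-ISel
expansion. This is not the in-tree implementation: the helper name and
operand layout are invented for the example, and the RISC-V opcodes are
just one plausible target's.

// Illustrative sketch only: not in-tree code. It assumes one generic
// LL/SC-loop pseudo whose operands carry the AtomicRMWInst::BinOp, and
// shows how the "modify" step of the loop could be emitted during
// post-ISel expansion. Helper name and operand layout are invented.
#include "RISCVInstrInfo.h" // backend-local header; include set is illustrative
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/ErrorHandling.h"

using namespace llvm;

// DestReg holds the value returned by the load-reserved, IncrReg is the
// atomicrmw operand, and ScratchReg receives the value fed to the
// store-conditional.
static void emitAtomicModifyStep(AtomicRMWInst::BinOp Op,
                                 MachineBasicBlock &LoopMBB,
                                 MachineBasicBlock::iterator MBBI,
                                 const TargetInstrInfo &TII,
                                 const DebugLoc &DL, unsigned ScratchReg,
                                 unsigned DestReg, unsigned IncrReg) {
  switch (Op) {
  case AtomicRMWInst::Xchg:
    // The stored value is just the operand: a register-register move.
    BuildMI(LoopMBB, MBBI, DL, TII.get(RISCV::ADDI), ScratchReg)
        .addReg(IncrReg)
        .addImm(0);
    break;
  case AtomicRMWInst::Add:
    BuildMI(LoopMBB, MBBI, DL, TII.get(RISCV::ADD), ScratchReg)
        .addReg(DestReg)
        .addReg(IncrReg);
    break;
  case AtomicRMWInst::Sub:
    BuildMI(LoopMBB, MBBI, DL, TII.get(RISCV::SUB), ScratchReg)
        .addReg(DestReg)
        .addReg(IncrReg);
    break;
  case AtomicRMWInst::And:
    BuildMI(LoopMBB, MBBI, DL, TII.get(RISCV::AND), ScratchReg)
        .addReg(DestReg)
        .addReg(IncrReg);
    break;
  // Or and Xor are analogous; Nand, Max, Min, UMax and UMin each need one
  // extra instruction or a short branch sequence, omitted here.
  default:
    llvm_unreachable("BinOp not handled in this sketch");
  }
}

Whether to use one parameterised pseudo or one pseudo per operation is
mostly a trade-off in backend boilerplate; the loop that is ultimately
emitted is the same either way.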
Alex Bradbury via llvm-dev
2018-Jun-14 12:44 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On 14 June 2018 at 12:32, Alex Bradbury <asb at lowrisc.org> wrote:
> On 14 June 2018 at 10:28, Tim Northover via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> The problem with late expansion is that even the simplest variants add
>> some rather unpleasant pseudo-instructions, and I've never even seen
>> anyone attempt to get good CodeGen out of it (simplest example being
>> "add xD, xD, #1" in the loop for an increment but there are obviously
>> more). Doing so would almost certainly involve duplicating a lot of
>> basic arithmetic instructions into AtomicRMW variants.
>
> Let's separate the concerns here:
> 1) Quality of codegen within the LL/SC loop
> * What is the specific concern here? The LL/SC loop contains a very
> small number of instructions, even for the masked atomicrmw case. Are
> you worried about an extra arithmetic instruction or two? Sub-optimal
> control-flow? Something else?

Oh I see what you're saying, it's the fact that by bypassing instruction
selection we're missing cases where an ADDI could be selected rather than
an ADD, which would potentially free up a register and save the
instruction generated for materialising the constant.

I don't like to see the compiler generate code that's obviously dumber
than what a human would write, but in this case do we really think there
would be any sort of measurable impact on performance?

Best,

Alex
Tim Northover via llvm-dev
2018-Jun-15 11:28 UTC
[llvm-dev] RFC: Atomic LL/SC loops in LLVM revisited
On Thu, 14 Jun 2018 at 13:45, Alex Bradbury <asb at lowrisc.org> wrote:
> Oh I see what you're saying, it's the fact that by bypassing
> instruction selection we're missing cases where an ADDI could be
> selected rather than an ADD, which would potentially free up a
> register and save the instruction generated for materialising the
> constant.

Yes.

> I don't like to see the compiler generate code that's
> obviously dumber than what a human would write, but in this case do we
> really think there would be any sort of measurable impact on
> performance?

It's certainly going to be marginal, but then so is the benefit of late
expansion. There's also the barrier that this genuinely is a place where
people are willing to hand-code assembly, and go rooting around in the
compiler's output to check we're doing what they expect even if they
don't use assembly. The assembly is possibly mostly for historical
reasons, but it's still out there and we want to convince people to move
away from it.

I think I'm going to try to implement that verifier pass you mentioned
and run it across an iOS build.

Cheers.

Tim.
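For concreteness, the kind of checking pass being discussed might look
roughly like the sketch below. The pass name and the isLoadExclusive /
isStoreExclusive predicates are hypothetical placeholders; a real
implementation would match the target's exclusive load/store opcodes
(ideally via a TargetInstrInfo hook) and would also enforce the other
forward-progress rules (no spills, calls, or branches out of the loop),
not just intervening memory accesses.

// Illustrative sketch only: pass name and predicates are hypothetical.
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {

// Placeholder predicates: they always return false here. A real checker
// would return true for the target's exclusive load/store opcodes (e.g.
// the LDREX/LDAEX and STREX/STLEX families on ARM, or LR.W/SC.W on
// RISC-V), ideally queried through a target hook.
static bool isLoadExclusive(const MachineInstr &) { return false; }
static bool isStoreExclusive(const MachineInstr &) { return false; }

struct LLSCLoopChecker : public MachineFunctionPass {
  static char ID;
  LLSCLoopChecker() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    for (MachineBasicBlock &MBB : MF) {
      // Assumes the expanded loop keeps the LL and the SC within a
      // single block, which is how the late pseudo expansion lays it out.
      bool InLLSCRegion = false;
      for (MachineInstr &MI : MBB) {
        if (isLoadExclusive(MI)) {
          InLLSCRegion = true;
          continue;
        }
        if (isStoreExclusive(MI)) {
          InLLSCRegion = false;
          continue;
        }
        // Any other memory access between the LL and the SC may clear
        // the reservation on conservative implementations.
        if (InLLSCRegion && (MI.mayLoad() || MI.mayStore()))
          errs() << "LL/SC checker: memory access inside exclusive "
                    "region:\n"
                 << MI;
      }
    }
    return false; // diagnostic only; the MachineFunction is not modified
  }
};

char LLSCLoopChecker::ID = 0;

} // end anonymous namespace

Run very late in the pipeline (after register allocation and frame
lowering), a pass along these lines could warn by default and be promoted
to an error for targets with strict requirements such as ARMv8-M.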