> > Longer term, I hope to improve the fence elimination of the ARM backend
> > with a kind of PRE algorithm. Both of these improvements to the ARM
> > backend should be fairly straightforward to port to the POWER
> > architecture later, and I hope to also do that.
>
> Any reason these couldn't be done at the IR level?

I definitely agree here. At the time, it was a plausible idea (the
barriers didn't even exist in IR most of the time). But the
implementation was always going to be much more complicated and less
portable than in IR, and what we actually have is very flawed in its own
right (only applies to ARM mode, unmaintained, ...).

Actually, I think we decided to remove it a while back, but I haven't
gotten around to it yet.

Cheers.

Tim.
I am planning on doing it in IR, but with target-specific passes (such as
X86ExpandAtomicPass) that just share some of the code (possibly by having
each of the target-specific passes inherit from and override a
target-independent pass). The reasons for doing it in IR are the following:

- easier sharing of target-independent code
- easier dealing with control flow (especially useful for advanced fence
  elimination)

But it must be target-dependent, as for example on Power a seq_cst store
has a fence before it, while on ARM it has a fence both before and after
it (per http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html).

For this exact reason, I am planning on splitting
AtomicExpandLoadLinkedPass into a target-independent and a
target-dependent file: the current pass claims to be target-independent
but is actually designed for ARM. For example, it puts a release fence
before a seq_cst CAS, which would be unsound on Power if the backend were
more aggressive and used lwsync for release fences.

Since these fences are not chosen based on the LLVM fence semantics, but
on the hardware memory model, I was thinking of inserting target-specific
intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power) to make it
clearer that these passes are target-specific and unsound outside of
their target.

Another thing I would have to move to this IR pass is the insertion of
fences around atomic stores/loads when insertFencesForAtomic==true. It is
currently happening in SelectionDAGBuilder, which makes it impossible to
do fence elimination at the IR level.

Is this reasonable, or is there some rule against using hardware-specific
intrinsics at the IR level (or some other problem with this approach)?

Cheers,
Robin Morisset

On Thu, Aug 7, 2014 at 11:34 PM, Tim Northover <t.p.northover at gmail.com>
wrote:
> > > Longer term, I hope to improve the fence elimination of the ARM
> > > backend with a kind of PRE algorithm. Both of these improvements to
> > > the ARM backend should be fairly straightforward to port to the
> > > POWER architecture later, and I hope to also do that.
> >
> > Any reason these couldn't be done at the IR level?
>
> I definitely agree here. At the time, it was a plausible idea (the
> barriers didn't even exist in IR most of the time). But the
> implementation was always going to be much more complicated and less
> portable than in IR, and what we actually have is very flawed in its own
> right (only applies to ARM mode, unmaintained, ...).
>
> Actually, I think we decided to remove it a while back, but I haven't
> gotten around to it yet.
>
> Cheers.
>
> Tim.
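A minimal, self-contained C++ sketch of the split proposed above, assuming
a shared expansion pass parametrised by per-target hooks. Every name here
(AtomicExpansionHooks, PPCAtomicHooks, ARMAtomicHooks) is hypothetical and
not existing LLVM API; the barrier placement simply transcribes the
seq_cst-store rows of the cited mapping table:

  // Hypothetical sketch only: a target-independent expansion pass would own
  // the walk over atomic instructions and the control flow, and each target
  // would override hooks saying which barriers to emit.

  enum class Barrier { None, Full };  // "Full" stands in for dmb ish / hwsync.

  struct AtomicExpansionHooks {
    virtual ~AtomicExpansionHooks() = default;
    // Barriers surrounding a seq_cst store, per the C++11 mappings table.
    virtual Barrier leadingFenceForSeqCstStore() const = 0;
    virtual Barrier trailingFenceForSeqCstStore() const = 0;
  };

  // Power: hwsync before the store, nothing after.
  struct PPCAtomicHooks final : AtomicExpansionHooks {
    Barrier leadingFenceForSeqCstStore() const override { return Barrier::Full; }
    Barrier trailingFenceForSeqCstStore() const override { return Barrier::None; }
  };

  // ARM: dmb both before and after the store.
  struct ARMAtomicHooks final : AtomicExpansionHooks {
    Barrier leadingFenceForSeqCstStore() const override { return Barrier::Full; }
    Barrier trailingFenceForSeqCstStore() const override { return Barrier::Full; }
  };

With a shape like this, only the hook implementations differ per target,
which is what would let an IR-level fence-elimination pass stay common.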
> I am planning on doing it in IR, but with target-specific passes (such as
> X86ExpandAtomicPass) that just share some of the code

This would more normally be done via target hooks in LLVM, though the
principle is sound.

> But it must be target-dependent, as for example on Power a seq_cst store
> has a fence before it, while on ARM it has a fence both before and after
> it (per http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html).

That certainly seems to suggest some kind of parametrisation.

> For this exact reason, I am planning on splitting
> AtomicExpandLoadLinkedPass into a target-independent and a
> target-dependent file: the current pass claims to be target-independent
> but is actually designed for ARM. For example, it puts a release fence
> before a seq_cst CAS, which would be unsound on Power if the backend were
> more aggressive and used lwsync for release fences.

I don't know the Power architecture, but this failure ought to be
describable in terms of LLVM's own memory model: if a valid Power
implementation of LLVM IR can trigger it, then the transformation itself
is wrong. Do you have an example execution in mind that shows it?

> Since these fences are not chosen based on the LLVM fence semantics, but
> on the hardware memory model, I was thinking of inserting target-specific
> intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power) to make it
> clearer that these passes are target-specific and unsound outside of
> their target.

If they *are* unsound, that should be changed immediately (and I'll almost
certainly make time to do so, hence my questions here). It's completely
unacceptable to give LLVM's "fence whatever" a target-specific meaning, in
my opinion, even briefly.

> Another thing I would have to move to this IR pass is the insertion of
> fences around atomic stores/loads when insertFencesForAtomic==true. It is
> currently happening in SelectionDAGBuilder, which makes it impossible to
> do fence elimination at the IR level.

I'm a little worried about this becoming just one monster "do stuff with
atomics" pass, especially if it ends up one-per-target, even if the bulk
can remain generic. I'd prefer a more compositional approach if it can be
managed.

> Is this reasonable, or is there some rule against using hardware-specific
> intrinsics at the IR level (or some other problem with this approach)?

Lots of the changes sound like they're going in the right direction. I'd
be particularly pleased to see other architectures using (via whatever
adaptations are necessary) the atomic expansion pass; I think that could
significantly simplify other backends. I'm a little concerned about
changing the "fence XYZ" conversion into target intrinsics, but it looks
likely it'd be necessary for performance even if the current scheme does
turn out to be correct, so I say go for it!

Cheers.

Tim.
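For the compositional shape being asked for here, a minimal sketch of
pulling the insertFencesForAtomic work out of SelectionDAGBuilder into its
own small IR pass that a later fence-elimination pass could run over. The
pass name is hypothetical, the code assumes the legacy FunctionPass and
IRBuilder interfaces with today's enum-class ordering spellings, and the
fence mapping shown is only a placeholder, since whether a plain IR fence
is even the right vehicle is exactly the open question above:

  #include "llvm/IR/Function.h"
  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/Pass.h"

  using namespace llvm;

  namespace {
  // Small, single-purpose pass: surround atomic accesses with explicit IR
  // fences and weaken the accesses, leaving LL/SC and cmpxchg expansion to
  // the existing expansion pass and any fence elimination to a third pass.
  struct InsertFencesForAtomicSketch : public FunctionPass {
    static char ID;
    InsertFencesForAtomicSketch() : FunctionPass(ID) {}

    bool runOnFunction(Function &F) override {
      bool Changed = false;
      for (BasicBlock &BB : F)
        for (Instruction &I : BB) {
          if (auto *SI = dyn_cast<StoreInst>(&I)) {
            if (SI->getOrdering() == AtomicOrdering::SequentiallyConsistent) {
              // Placeholder mapping: emit a fence before and weaken the
              // store, in the spirit of what the DAG-building code does when
              // insertFencesForAtomic is set (it uses DAG fence nodes).
              IRBuilder<> B(SI);
              B.CreateFence(AtomicOrdering::SequentiallyConsistent);
              SI->setOrdering(AtomicOrdering::Monotonic);
              Changed = true;
            }
          }
          // Atomic loads, cmpxchg and atomicrmw would be handled analogously.
        }
      return Changed;
    }
  };
  } // end anonymous namespace

  char InsertFencesForAtomicSketch::ID = 0;

Keeping this step as its own pass is what makes it composable: the fence
placement, the LL/SC expansion, and the fence elimination can then be
enabled, tested, and replaced independently per target.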