> I am planning on doing it in IR, but with target-specific passes (such as
> X86ExpandAtomicPass) that just share some of the code

This would more normally be done via target hooks in LLVM, though the
principle is sound.

> But it must be target-dependent as for example on Power a
> seq_cst store has a fence before it, while on ARM it has a fence
> both before and after it (per http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)

That certainly seems to suggest some kind of parametrisation.

> For this exact reason, I am planning on splitting AtomicExpandLoadLinkedPass
> in a target-independent and a target-dependent file: the current pass claims
> to be target-independent but is actually designed for ARM: for example it
> puts a release-fence before a seq_cst CAS, which would be unsound on Power
> if the backend was more aggressive and using lwsync for release_fences.

I don't know the Power architecture, but this failure ought to be
describable in terms of LLVM's own memory model (if a valid Power
implementation of LLVM IR can trigger it, then the transformation itself
is wrong). Do you have an example execution in mind that shows it?

> Since these fences are not chosen based on the LLVM fences semantics, but
> on the hardware memory model, I was thinking of inserting target-specific
> intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power), to make it
> clearer that these passes are target-specific and unsound outside of their
> target.

If they *are* unsound, that should be changed immediately (and I'll
almost certainly make time to do so, hence my questions here). It's
completely unacceptable to give LLVM's "fence whatever" a target-specific
meaning in my opinion, even briefly.

> Another thing I would have to move to this IR pass is the insertion of
> fences around atomic stores/loads when insertFencesForAtomic==true. It is
> currently happening in SelectionDAGBuilder, which makes it impossible to
> do fence elimination at the IR level.

I'm a little worried about this being just one monster "do stuff with
atomics" pass, especially if it ends up one-per-target, even if the bulk
can remain generic. I'd prefer a more compositional approach if it can be
managed.

> Is it reasonable, or is there some rule against using hardware-specific
> intrinsics at the IR level (or some other problem with this approach)?

Lots of the changes sound like they're going in the right direction. I'd
be particularly pleased to see other architectures using (via whatever
adaptations are necessary) the atomic expansion pass; I think that could
significantly simplify other backends.

I'm a little concerned about changing the "fence XYZ" conversion into
target intrinsics, but it looks likely it'd be necessary for performance
even if the current scheme does turn out to be correct, so I say go for
it!

Cheers.

Tim.
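As a rough sketch of the kind of target hook and parametrisation being
suggested here (the names AtomicFenceHooks, emitLeadingFence and
emitTrailingFence are illustrative, not LLVM's actual API): a generic
expansion pass would ask the target which fences, if any, to place around
an atomic operation, and each backend overrides the defaults to match its
own mapping.

// Illustrative C++ sketch only; the real hooks would presumably live on
// TargetLowering or similar, and the method names here are made up.
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

struct AtomicFenceHooks {
  virtual ~AtomicFenceHooks() = default;

  // Fence (if any) to place before an atomic operation with ordering Ord.
  // Targets whose lowering of the operation already provides the ordering
  // return nullptr.
  virtual Instruction *emitLeadingFence(IRBuilder<> &Builder,
                                        AtomicOrdering Ord) const {
    return nullptr;
  }

  // Fence (if any) to place after the operation.
  virtual Instruction *emitTrailingFence(IRBuilder<> &Builder,
                                         AtomicOrdering Ord) const {
    return nullptr;
  }
};

An ARM-like target would override both hooks (barrier before and after a
seq_cst store), while a Power-like target would only need the leading one
for stores, matching the mappings cited above.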
Here is an example of code on which ExpandLoadLinkedPass looks unsound to
me (it is a variant of the SB pattern in the literature):

> x, y two globals of type *i32 (atomic)

Thread 0:

> store atomic i32 1, i32* @x seq_cst, align 4
> %res = cmpxchg i32* @y, i32 0, i32 0, seq_cst, seq_cst
> %old_y = extractvalue {i32, i1} %res, 0

Thread 1:

> store atomic i32 1, i32* @y seq_cst, align 4
> %res = cmpxchg i32* @x, i32 0, i32 0, seq_cst, seq_cst
> %old_x = extractvalue {i32, i1} %res, 0

The pass turns the cmpxchg into (approximately, there is a bunch of
control-flow in addition):

> fence release
> load-linked monotonic
> store-conditional monotonic
> fence seq_cst

From my reading of Atomics.rst, it would be sound to reorder (it does not
say much about load-linked, so I am treating it as a normal load here)

> store seq_cst
> fence release
> load-linked monotonic

into

> load-linked monotonic
> store seq_cst
> fence release

This would make an execution ending in %old_x = %old_y = 0 possible, while
it is impossible in the original program. Since some atomic stores may be
turned into AtomicCmpXchg if they are too big, it may even occur in code
with originally only atomic stores and loads.

Fixing it is a two-line change in insertLeadingFence, but it triggers some
test failures, both because of tests looking specifically for a fence
release here, and because of tests looking for a dmb ishst, which is
asserted to be correct for fence release on Swift processors (I have no
idea if it is correct in this specific scenario; I could not find
documentation on the exact quirks of the memory model of Swift, assuming
it is public).

About your other comments:

I will look at target hooks, I had not thought about using them.

I agree that some modularity is required. But currently there is lots of
code duplication, for example RMWs are lowered into monotonic operations +
fences in ExpandLoadLinked (as we just saw), and load/stores are lowered
into monotonic operations + fences in SelectionDAGBuilder. I hope to share
some code between those by pushing things into a common ExpandAtomicsPass.

Thanks,
Robin
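For concreteness, the ordering choice behind that two-line fix could look
roughly like the helper below. This is an illustrative C++ sketch only:
leadingFenceOrdering is a made-up name, not the actual insertLeadingFence
code in the ARM backend, and the real change may well differ in detail.

// Sketch of choosing the leading fence for an atomic operation so that a
// preceding seq_cst store cannot sink below the load-linked.
#include "llvm/IR/Instructions.h"

using namespace llvm;

static AtomicOrdering leadingFenceOrdering(AtomicOrdering Ord) {
  switch (Ord) {
  case AtomicOrdering::SequentiallyConsistent:
    // A release leading fence allows "store seq_cst; fence release; load"
    // to be reordered into "load; store seq_cst; fence release" (the SB
    // outcome above); a full seq_cst fence does not.
    return AtomicOrdering::SequentiallyConsistent;
  case AtomicOrdering::Release:
  case AtomicOrdering::AcquireRelease:
    return AtomicOrdering::Release;
  default:
    return AtomicOrdering::NotAtomic; // no leading fence needed
  }
}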
I forgot to say that x and y are to be initially 0 for the example to work.
Hi Robin,

> But currently there is lots of code duplication, for example RMWs are
> lowered into monotonic operations + fences in ExpandLoadLinked (as we just
> saw), and load/stores are lowered into monotonic operations + fences in
> SelectionDAGBuilder. I hope to share some code between those by pushing
> things into a common ExpandAtomicsPass.

I've not had time to think about your examples yet (I will!), but thought
I should comment on the history behind this one anyway.

Until very recently SelectionDAG was the only place that did this kind of
transformation. I added ExpandLoadLinked to give a neater route (primarily
for ARM, as you've noticed). But we still had to leave the legacy path in
place for the other architectures.

I'd love to see them do things before SDAG (if nothing else, we have a
long-term goal to nuke the DAG; but I also think it's a much neater
solution anyway). But I didn't have time to port all of the existing
backends, so we ended up with the duplication you're seeing.

I do like the direction you're suggesting here though. The more we can
move out of SelectionDAG, the less we have to port to GlobalISel when it
finally gets here.

Cheers.

Tim.
On 08/08/2014 11:42 AM, Tim Northover wrote:
>> I am planning on doing it in IR, but with target-specific passes (such as
>> X86ExpandAtomicPass) that just share some of the code
> This would more normally be done via target hooks in LLVM, though the
> principle is sound.
>
>> But it must be target-dependent as for example on Power a
>> seq_cst store has a fence before it, while on ARM it has a fence
>> both before and after it (per http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
> That certainly seems to suggest some kind of parametrisation.

An alternate way of saying this might be that both ARM and Power require
the store to be fenced before and after. On Power the fence after is
implicit, whereas on ARM it is not. (Is this actually correct? I don't
know either of these models well.)

Could you use that framing to factor the arch-specific and general parts?
I'd really love to have a generic barrier combine pass which can work on
the IR semantics independent of the architecture barrier semantics.

>> For this exact reason, I am planning on splitting AtomicExpandLoadLinkedPass
>> in a target-independent and a target-dependent file: the current pass claims
>> to be target-independent but is actually designed for ARM: for example it
>> puts a release-fence before a seq_cst CAS, which would be unsound on Power
>> if the backend was more aggressive and using lwsync for release_fences.
> I don't know the Power architecture, but this failure ought to be
> describable in terms of LLVM's own memory model (if a valid Power
> implementation of LLVM IR can trigger it, then the transformation itself
> is wrong). Do you have an example execution in mind that shows it?
>
>> Since these fences are not chosen based on the LLVM fences semantics, but
>> on the hardware memory model, I was thinking of inserting target-specific
>> intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power), to make it
>> clearer that these passes are target-specific and unsound outside of their
>> target.
> If they *are* unsound, that should be changed immediately (and I'll
> almost certainly make time to do so, hence my questions here). It's
> completely unacceptable to give LLVM's "fence whatever" a target-specific
> meaning in my opinion, even briefly.

I strongly agree with Tim's point here. A latent bug is still a bug.

>> Another thing I would have to move to this IR pass is the insertion of
>> fences around atomic stores/loads when insertFencesForAtomic==true. It is
>> currently happening in SelectionDAGBuilder, which makes it impossible to
>> do fence elimination at the IR level.
> I'm a little worried about this being just one monster "do stuff with
> atomics" pass, especially if it ends up one-per-target, even if the bulk
> can remain generic. I'd prefer a more compositional approach if it can be
> managed.
>
>> Is it reasonable, or is there some rule against using hardware-specific
>> intrinsics at the IR level (or some other problem with this approach)?
> Lots of the changes sound like they're going in the right direction. I'd
> be particularly pleased to see other architectures using (via whatever
> adaptations are necessary) the atomic expansion pass; I think that could
> significantly simplify other backends.
>
> I'm a little concerned about changing the "fence XYZ" conversion into
> target intrinsics, but it looks likely it'd be necessary for performance
> even if the current scheme does turn out to be correct, so I say go for
> it!

I would say there's a burden of justification that the target intrinsic
approach is substantially better performance-wise. This doesn't have to
be extensive, but something should be presented. (If the generic approach
is actually possible.)

Philip
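One small piece of the generic barrier combining Philip asks for can be
done purely at the IR level, independent of any target: a fence that is
immediately followed by another fence with the same ordering and
synchronization scope is redundant. Below is a minimal sketch, assuming
nothing beyond the core IR API (combineAdjacentFences is a hypothetical
helper, not an existing LLVM pass); the more interesting combines are
exactly where the target-dependence discussed below comes in.

// Drop a fence immediately followed by an identical fence; the remaining
// fence already provides everything the removed one did.
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

static bool combineAdjacentFences(Function &F) {
  bool Changed = false;
  for (BasicBlock &BB : F) {
    for (auto It = BB.begin(); It != BB.end();) {
      auto *First = dyn_cast<FenceInst>(&*It++);
      if (!First || It == BB.end())
        continue;
      auto *Second = dyn_cast<FenceInst>(&*It);
      if (Second && Second->getOrdering() == First->getOrdering() &&
          Second->getSyncScopeID() == First->getSyncScopeID()) {
        First->eraseFromParent();
        Changed = true;
      }
    }
  }
  return Changed;
}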
On Fri, Aug 8, 2014 at 1:49 PM, Philip Reames <listmail at philipreames.com> wrote:

> An alternate way of saying this might be that both ARM and Power require
> the store to be fenced before and after. On Power the fence after is
> implicit, whereas on ARM it is not. (Is this actually correct? I don't
> know either of these models well.)
>
> Could you use that framing to factor the arch-specific and general parts?
> I'd really love to have a generic barrier combine pass which can work on
> the IR semantics independent of the architecture barrier semantics.

More precisely, both ARM and Power require a barrier between every store
seq_cst and every later load seq_cst (among lots of other requirements).
On Power the mapping achieves this by a full fence before every load
seq_cst, whereas ARM uses a full fence after every store seq_cst.

I would also love to have a generic barrier combine pass, but I strongly
doubt it is at all possible.

> I would say there's a burden of justification that the target intrinsic
> approach is substantially better performance-wise. This doesn't have to
> be extensive, but something should be presented. (If the generic approach
> is actually possible.)

For one simple example: acquire loads on ARM that are followed by a
dependent branch can be implemented by putting an isb fence at each target
of the branch (I can look up the reference for this if you want), which is
supposedly cheaper (I am hoping to get some benchmarks on this and similar
things soon). But all the C11 fences, including the acquire fence, require
a full dmb barrier. So it is impossible to express this optimized mapping
of acquire loads in a target-independent way.
> From my reading of Atomics.rst, it would be sound to reorder (it does not
> say much about load-linked, so I am treating it as a normal load here)
>
>> store seq_cst
>> fence release
>> load-linked monotonic
>
> into
>
>> load-linked monotonic
>> store seq_cst
>> fence release
>
> This would make an execution ending in %old_x = %old_y = 0 possible, while
> it is impossible in the original program.

Hmm, evil. Well, I'm convinced. Thanks very much for taking the time to
come up with an example and telling us about it.

> Fixing it is a two-line change in insertLeadingFence, but it triggers some
> test failures, both because of tests looking specifically for a fence
> release here,

That's fine, we can change those easily enough. And the same goes for the
"dmb ishst" (as I understand it, it *is* a release fence, but almost
certainly not suitable for preventing this reordering).

Cheers.

Tim.