Tim Northover
2014-May-29 11:55 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Hi,

I've been looking at improving atomicrmw & cmpxchg code more,
particularly on architectures using the load-linked/store-conditional
model.

The summary is that the current expansion for cmpxchg seems to happen too
late for LLVM to make meaningful use of the opportunities it provides.
I'd like to move it earlier and express it in terms of a first-class
pair of "load linked" and "store conditional" instructions.

More details are attached and inlined below. Currently the only code I
have is for one or two AtomicExpand peepholes, so any suggestions or
alternatives would be very welcome. It's possible this is entirely the
wrong approach.

What do people think?

Cheers.

Tim.

The Problem
-----------

We now expand atomic instructions to load-linked & store-conditional
intrinsics just before ISel, but this expansion opens up more
optimisation opportunities that LLVM *would* be able to exploit if it
saw the control flow earlier.

For example, the return value of the C++11 and C11 compare_exchange
operations is actually whether the exchange succeeded, which leads to
some common idioms in Clang-produced IR.

From "if (__c11_compare_exchange_strong(...))":

    %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
    %success = icmp eq i32 %loaded, %oldval
    br i1 %success, label %true, label %false

the control flow here should be something like:

  loop:
    %loaded = load linked i32* %addr seq_cst
    %trystore = icmp eq i32 %loaded, %oldval
    br i1 %trystore, label %store.cond, label %false
  store.cond:
    %success = store conditional i32 %newval, i32* %addr seq_cst
    br i1 %success, label %true, label %loop

From "return __c11_compare_exchange_strong(...);":

    %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
    %success = icmp eq i32 %loaded, %oldval
    %res = zext i1 %success to i32
    ret i32 %res

here, we'd want something like:

  loop:
    %loaded = load linked i32* %addr seq_cst
    %trystore = icmp eq i32 %loaded, %oldval
    br i1 %trystore, label %store.cond, label %false
  store.cond:
    %success = store conditional i32 %newval, i32* %addr seq_cst
    br i1 %success, label %true, label %loop
  true:
    ret i32 1
  false:
    ret i32 0

In both cases the compare is largely redundant on LL/SC architectures:
you know from the control-flow graph what its result will be, and if
LLVM sees this early enough it can produce something close to optimal
code.

Unfortunately, most of LLVM has finished by the time this expansion
occurs, so we end up with redundant compares.

Now, it's certainly possible to put various peephole optimisations into
the expansion pass to cover these cases (at least), but to me that
feels like fragile duplication of effort we've already made in the
mid-end.

My Proposal
-----------

I think the best way to solve this longer term is to add explicit "load
linked" and "store conditional" instructions to the LLVM IR, with equal
status to "cmpxchg" and "atomicrmw".

We could then simplify the AtomicExpandLoadLinked pass to use these,
change it to use something like a TargetTransformInfo hook instead of
TargetLowering, and schedule it much earlier in the mid-end. X86 &
SystemZ would simply request that no expansion happen.

This means Clang could still emit cmpxchg and atomicrmw, which closely
match C & C++ semantics, but most of LLVM would see the control flow
and operations in a form much closer to what the targets will actually
emit.

In addition, other languages would be able to provide more general
atomic operations if they want (slightly more complex bit twiddling
seems like it could be particularly useful in some algorithms). Current
atomic instructions cover (most of) C and C++, but nothing else.
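As a purely illustrative sketch of that extra generality (written in
the proposed, and therefore entirely hypothetical, syntax): an atomic
"clear the lowest set bit", which no single atomicrmw operation can
express today, becomes a short LL/SC loop:

  loop:
    %loaded = load linked i32* %addr seq_cst
    %dec = sub i32 %loaded, 1
    %newval = and i32 %loaded, %dec            ; clears the lowest set bit
    %success = store conditional i32 %newval, i32* %addr seq_cst
    br i1 %success, label %done, label %loop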
Potential issues
----------------

1. Can invalid transformations be done on them? Separating an LL from
   its corresponding SC by too much almost guarantees failure (an OS
   task switch happens, clearing the monitor).
2. We'd need to document their interaction with LLVM's memory model.
   I'm not sure I've seen a good happens-before model that explicitly
   uses LL/SC. The common ones people came up with in the Java & C++11
   standardisation processes seem to assume atomic RMW operations
   instead.
3. People might be assuming StoreInst produces no value and so not
   calling ReplaceAllUses. A separate StoreConditionalInst is one
   option, but that risks the reverse: people not considering it a
   store when it is.
4. Not all targets can support them. There are very few LLVM
   instructions like that, which is a little concerning. But not all
   targets can support the current atomics either.

If we make vulnerability to ABA problems target-specific in our
documentation, then X86 & SystemZ could use the same trick as for
min/max to implement them generically: a load/cmpxchg loop (a sketch
follows at the end of this message).

Alternatives
------------

1. Enhancing MachineCSE and friends so they can optimise this.
   - Works for "%old = atomicrmw OP %addr, %inc; OP %old, %inc"
     already.
   - Cmpxchg is more difficult. It's the control-flow optimisations
     that are particularly powerful in these cases.
   - It's more duplicated effort between different parts of LLVM.

2. Peephole optimisations in the AtomicExpandLoadLinked pass.
   - Feels like a hack, spotting particular patterns that the mid-end
     can already cope with.
   - Deals with toy functions well, but it's not quite clear that
     those are the only patterns arising in more general code (after
     inlining, say).
   - Potentially the safest option: atomicrmw & cmpxchg are preserved
     until very late, so nothing is going to come in and convert valid
     code to invalid.

3. Moving expansion further forward but still using intrinsics.
   - The intrinsics need to have fully general side effects to be
     correct: effectively an "asm volatile" for optimisation purposes,
     which is very heavy-handed for LLVM's other optimisations.
   - We'd still need target hooks to create the calls, because
     intrinsics don't get type-lowered, so you can't write a fully
     generic one (e.g. an i64 ldrex on ARM needs to return
     { i32, i32 }).

4. Changing the cmpxchg operation to return (say) { iN, i1 }, where
   the second value indicates success.
   - Probably a good idea anyway as part of support for weak
     compare-exchange operations.
   - Doesn't actually help this issue much: it's mostly control-flow
     optimisations that seem to be helpful, and this would leave us
     with phi nodes in need of materialisation when lowered.
   - Might make the peephole optimisations in 2 slightly simpler.

5. Some kind of generalisation that can cover both LL/SC and
   transactional memory, to avoid doing this again if and when that
   works out?
   - I can't think of one that's both neat and practically
     implementable for both.
   - Suggestions welcome.
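For reference, the generic load/cmpxchg fallback mentioned under issue
4 would look roughly like this (a hypothetical expansion, which is
ABA-vulnerable where genuine LL/SC is not; exactly how the i1 result
gets materialised is the sort of detail that would need nailing down):

    %ll = load atomic i32* %addr seq_cst, align 4    ; stands in for "load linked"
    ; ... compute %newval from %ll ...
    %old = cmpxchg i32* %addr, i32 %ll, i32 %newval seq_cst seq_cst
    %sc = icmp eq i32 %old, %ll                      ; stands in for "store conditional"'s i1 result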
Amara Emerson
2014-May-29 12:35 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Hi Tim,

> 1. Can invalid transformations be done on them? Separating an LL from
>    its corresponding SC by too much almost guarantees failure (an OS
>    task switch happens, clearing the monitor).

From my admittedly weak understanding, LICM shouldn't hoist loads
unless they're unordered, and this would also apply here to the new
load linked instruction. In what other cases could the LL/SC become
separated enough for OS context switches to become a problem?

Amara
Tim Northover
2014-May-29 13:53 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Hi Amara,

I've had a chat with Chandler on IRC since sending this, and it looks
like it's at the very least a less pressing issue. The summary was that
we may want LL/SC instructions in future (possibly to represent
situations currently handled by cmpxchg loops), but they wouldn't
actually solve my problem anyway, because cmpxchg would still be the
canonical representation when on its own.

Fortunately, he also suggested a very neat solution which deals with my
issues: just run EarlyCSE to tidy up my mess (see the P.S. below for a
sketch of why that's enough).

On 29 May 2014 13:35, Amara Emerson <amara.emerson at gmail.com> wrote:
> From my admittedly weak understanding, LICM shouldn't hoist loads
> unless they're unordered, and this would also apply here to the new
> load linked instruction. In what other cases could the LL/SC become
> separated enough for OS context switches to become a problem?

I think the most likely would be instrumentation passes. They could
even insert another atomicrmw between the pair, which would be
guaranteed to make the store conditional fail regardless of the time
taken.

Cheers.

Tim.
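P.S. For anyone wondering why EarlyCSE is enough here: once the
expansion runs early, the compare Clang emits on the compare_exchange
result is structurally identical to the one the expansion itself
creates, so dominance-based CSE removes it. A much-simplified sketch
(ignoring the loop-carried phi a real expansion needs):

    %loaded = load linked i32* %addr seq_cst
    %trystore = icmp eq i32 %loaded, %oldval    ; created by the expansion
    ; ...
    %success = icmp eq i32 %loaded, %oldval     ; Clang's compare of the "result"
    ; EarlyCSE rewrites uses of %success to %trystore, removing the
    ; redundant compare.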
Philip Reames
2014-May-29 16:03 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
I have some reservations about this proposal. I don't have anything
particularly concrete, but the idea of supporting both LL/SC and
atomicrmw in the IR concerns me from a complexity perspective. If we
find that there is significant profit in supporting both, we can, but I
would like to see documentation of significant benefit before changing
the IR.

Tim, for those of us not directly involved, could you share a selection
of bugs or other background? I'd like to read through and try to get a
better sense for the problem you're trying to solve.

Philip

On 05/29/2014 04:55 AM, Tim Northover wrote:
> Hi,
>
> I've been looking at improving atomicrmw & cmpxchg code more,
> particularly on architectures using the load-linked/store-conditional
> model.
>
> The summary is that the current expansion for cmpxchg seems to happen
> too late for LLVM to make meaningful use of the opportunities it
> provides. I'd like to move it earlier and express it in terms of a
> first-class pair of "load linked" and "store conditional"
> instructions.
>
> More details are attached and inlined below. Currently the only code I
> have is for one or two AtomicExpand peepholes, so any suggestions or
> alternatives would be very welcome. It's possible this is entirely the
> wrong approach.
>
> What do people think?
>
> Cheers.
>
> Tim.
>
> The Problem
> -----------
>
> We now expand atomic instructions to load-linked & store-conditional
> intrinsics just before ISel, but this expansion opens up more
> optimisation opportunities that LLVM *would* be able to exploit if it
> saw the control flow earlier.
>
> For example, the return value of the C++11 and C11 compare_exchange
> operations is actually whether the exchange succeeded, which leads to
> some common idioms in Clang-produced IR.
>
> From "if (__c11_compare_exchange_strong(...))":
>
>     %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
>     %success = icmp eq i32 %loaded, %oldval
>     br i1 %success, label %true, label %false
>
> the control flow here should be something like:
>
>   loop:
>     %loaded = load linked i32* %addr seq_cst
>     %trystore = icmp eq i32 %loaded, %oldval
>     br i1 %trystore, label %store.cond, label %false
>   store.cond:
>     %success = store conditional i32 %newval, i32* %addr seq_cst
>     br i1 %success, label %true, label %loop
>
> From "return __c11_compare_exchange_strong(...);":
>
>     %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
>     %success = icmp eq i32 %loaded, %oldval
>     %res = zext i1 %success to i32
>     ret i32 %res
>
> here, we'd want something like:
>
>   loop:
>     %loaded = load linked i32* %addr seq_cst
>     %trystore = icmp eq i32 %loaded, %oldval
>     br i1 %trystore, label %store.cond, label %false
>   store.cond:
>     %success = store conditional i32 %newval, i32* %addr seq_cst
>     br i1 %success, label %true, label %loop
>   true:
>     ret i32 1
>   false:
>     ret i32 0
>
> In both cases the compare is largely redundant on LL/SC architectures:
> you know from the control-flow graph what its result will be, and if
> LLVM sees this early enough it can produce something close to optimal
> code.
>
> Unfortunately, most of LLVM has finished by the time this expansion
> occurs, so we end up with redundant compares.
>
> Now, it's certainly possible to put various peephole optimisations
> into the expansion pass to cover these cases (at least), but to me
> that feels like fragile duplication of effort we've already made in
> the mid-end.
>
> My Proposal
> -----------
>
> I think the best way to solve this longer term is to add explicit
> "load linked" and "store conditional" instructions to the LLVM IR,
> with equal status to "cmpxchg" and "atomicrmw".
>
> We could then simplify the AtomicExpandLoadLinked pass to use these,
> change it to use something like a TargetTransformInfo hook instead of
> TargetLowering, and schedule it much earlier in the mid-end. X86 &
> SystemZ would simply request that no expansion happen.
>
> This means Clang could still emit cmpxchg and atomicrmw, which closely
> match C & C++ semantics, but most of LLVM would see the control flow
> and operations in a form much closer to what the targets will actually
> emit.
>
> In addition, other languages would be able to provide more general
> atomic operations if they want (slightly more complex bit twiddling
> seems like it could be particularly useful in some algorithms).
> Current atomic instructions cover (most of) C and C++, but nothing
> else.
>
> Potential issues
> ----------------
>
> 1. Can invalid transformations be done on them? Separating an LL from
>    its corresponding SC by too much almost guarantees failure (an OS
>    task switch happens, clearing the monitor).
> 2. We'd need to document their interaction with LLVM's memory model.
>    I'm not sure I've seen a good happens-before model that explicitly
>    uses LL/SC. The common ones people came up with in the Java & C++11
>    standardisation processes seem to assume atomic RMW operations
>    instead.
> 3. People might be assuming StoreInst produces no value and so not
>    calling ReplaceAllUses. A separate StoreConditionalInst is one
>    option, but that risks the reverse: people not considering it a
>    store when it is.
> 4. Not all targets can support them. There are very few LLVM
>    instructions like that, which is a little concerning. But not all
>    targets can support the current atomics either.
>
> If we make vulnerability to ABA problems target-specific in our
> documentation, then X86 & SystemZ could use the same trick as for
> min/max to implement them generically: a load/cmpxchg loop.
>
> Alternatives
> ------------
>
> 1. Enhancing MachineCSE and friends so they can optimise this.
>    - Works for "%old = atomicrmw OP %addr, %inc; OP %old, %inc"
>      already.
>    - Cmpxchg is more difficult. It's the control-flow optimisations
>      that are particularly powerful in these cases.
>    - It's more duplicated effort between different parts of LLVM.
>
> 2. Peephole optimisations in the AtomicExpandLoadLinked pass.
>    - Feels like a hack, spotting particular patterns that the mid-end
>      can already cope with.
>    - Deals with toy functions well, but it's not quite clear that
>      those are the only patterns arising in more general code (after
>      inlining, say).
>    - Potentially the safest option: atomicrmw & cmpxchg are preserved
>      until very late, so nothing is going to come in and convert
>      valid code to invalid.
>
> 3. Moving expansion further forward but still using intrinsics.
>    - The intrinsics need to have fully general side effects to be
>      correct: effectively an "asm volatile" for optimisation
>      purposes, which is very heavy-handed for LLVM's other
>      optimisations.
>    - We'd still need target hooks to create the calls, because
>      intrinsics don't get type-lowered, so you can't write a fully
>      generic one (e.g. an i64 ldrex on ARM needs to return
>      { i32, i32 }).
>
> 4. Changing the cmpxchg operation to return (say) { iN, i1 }, where
>    the second value indicates success.
>    - Probably a good idea anyway as part of support for weak
>      compare-exchange operations.
>    - Doesn't actually help this issue much: it's mostly control-flow
>      optimisations that seem to be helpful, and this would leave us
>      with phi nodes in need of materialisation when lowered.
>    - Might make the peephole optimisations in 2 slightly simpler.
>
> 5. Some kind of generalisation that can cover both LL/SC and
>    transactional memory, to avoid doing this again if and when that
>    works out?
>    - I can't think of one that's both neat and practically
>      implementable for both.
>    - Suggestions welcome.
Tim Northover
2014-May-29 17:21 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Hi Philip,

On 29 May 2014 17:03, Philip Reames <listmail at philipreames.com> wrote:
> I have some reservations about this proposal. I don't have anything
> particularly concrete, but the idea of supporting both LL/SC and
> atomicrmw in the IR concerns me from a complexity perspective.

Well, I'll start by saying my particular optimisation use case looks
like it's not enough to justify the addition. I've got something
basically working for my efficiency worries, with less effort than I
thought. So I'm a lot less keen on it myself than I was a few hours
ago.

But I'm still worried about how closely LLVM IR is tied to both C and
X86 in this matter. A weak cmpxchg would go a long way to resolving
this, but it's still difficult to see a path from an IR-level "cmpxchg
weak" to optimal "atomicrmw lambda" support in LL/SC backends. Given C
like:

    void atomic_foo(int *addr) {
      int oldval = *addr;
      int newval;
      do {
        newval = foo(oldval);
      } while (!__c11_compare_exchange_weak(addr, &oldval, newval));
    }

the cmpxchg representation would be something like:

    define void @atomic_foo(i32* %addr) {
    entry:
      %firstval = load i32* %addr
      br label %loop

    loop:
      %oldval = phi i32 [%firstval, %entry], [%wrongval, %loop]
      %newval = call i32 @foo(i32 %oldval)
      %res = cmpxchg weak i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
      %wrongval = extractvalue { i32, i1 } %res, 0
      %success = extractvalue { i32, i1 } %res, 1
      br i1 %success, label %end, label %loop

    end:
      ret void
    }

But the optimal LL/SC form would be more like:

    define void @atomic_foo(i32* %addr) {
    entry:
      br label %loop

    loop:
      %oldval = load linked i32* %addr seq_cst
      %newval = call i32 @foo(i32 %oldval)
      %success = store conditional i32 %newval, i32* %addr seq_cst
      br i1 %success, label %end, label %loop

    end:
      ret void
    }

That kind of analysis is a very big burden to put on any pass. On the
other hand, mapping the other way doesn't seem much simpler either.

I feel like there ought to be a good way to combine this with Haswell's
xbegin/xend functionality in an even more generic IR construct too, but
I can't quite come up with a representation. More thought needed.

> Tim, for those of us not directly involved, could you share a
> selection of bugs or other background? I'd like to read through and
> try to get a better sense for the problem you're trying to solve.

My immediate concern was the C++11 compare_exchange, for which Clang
inserts an automatic and mostly redundant icmp on the result. Variants
of the IR in my original message are representative, though possibly
not exhaustive, of what might be seen.

Of course, it's all much more speculative now. Except, possibly, "how
should we handle compare_exchange_weak?".

Cheers.

Tim.