Tim Northover
2014-May-29 17:21 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Hi Philip,

On 29 May 2014 17:03, Philip Reames <listmail at philipreames.com> wrote:
> I have some reservations about this proposal. I don't have anything
> particularly concrete, but the idea of supporting both LL/SC and
> atomicrmw in the IR concerns me from a complexity perspective.

Well, I'll start by saying my particular optimisation use case looks like it's not enough to justify the addition. I've got something basically working for my efficiency worries, with less effort than I thought. So I'm a lot less keen on it myself than I was a few hours ago.

But I'm still worried about how closely LLVM IR is tied to both C and X86 in this matter. A weak cmpxchg would go a long way to resolving this, but it's still difficult to see a path from an IR-level "cmpxchg weak" to optimal "atomicrmw lambda" support in LL/SC backends.

Given C like:

    void atomic_foo(int *addr) {
      int oldval = *addr;
      int newval;
      do {
        newval = foo(oldval);
      } while (!__c11_compare_exchange_weak(addr, &oldval, newval));
    }

the cmpxchg representation would be something like:

    define void @atomic_foo(i32* %addr) {
    entry:
      %firstval = load i32* %addr
      br label %loop
    loop:
      %oldval = phi i32 [%firstval, %entry], [%wrongval, %loop]
      %newval = call i32 @foo(i32 %oldval)
      %res = cmpxchg weak i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
      %wrongval = extractvalue { i32, i1 } %res, 0
      %success = extractvalue { i32, i1 } %res, 1
      br i1 %success, label %end, label %loop
    end:
      ret void
    }

But the optimal LL/SC form would be more like:

    define void @atomic_foo(i32* %addr) {
    entry:
      br label %loop
    loop:
      %oldval = load linked i32* %addr
      %newval = call i32 @foo(i32 %oldval)
      %success = store conditional i32 %newval, i32* %addr
      br i1 %success, label %end, label %loop
    end:
      ret void
    }

That kind of analysis is a very big burden to put on any pass. On the other hand, mapping the other way doesn't seem much simpler either.
I feel like there ought to be a good way to combine this with Haswell's xbegin/xend functionality in an even more generic IR construct too, but I can't quite come up with a representation. More thought needed.

> Tim, for those of us not directly involved, could you share a selection
> of bugs or other background? I'd like to read through and try to get a
> better sense for the problem you're trying to solve.

My immediate concern was the C++11 compare_exchange, which inserts an automatic and mostly redundant icmp after the result. Variants of the IR in my original message are representative, though possibly not exhaustive, of what might be seen.

Of course, it's all much more speculative now. Except, possibly, "how should we handle compare_exchange_weak?".

Cheers.

Tim.
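For reference, the retry loop from the example above can be written against the standard C11 <stdatomic.h> API rather than the `__c11_*` builtin spelling. This is just a sketch: the body of `foo()` here is made up purely for illustration (the thread treats it as an opaque transform).

```c
#include <stdatomic.h>

/* Stand-in for the opaque transform in the example; any pure function
   of the old value would do (this particular body is invented here). */
static int foo(int v) { return v * 2 + 1; }

/* Same pattern as atomic_foo above: keep retrying until the weak
   compare-exchange succeeds. On failure (including spurious weak
   failures), oldval is refreshed with the value currently in *addr. */
void atomic_foo(_Atomic int *addr) {
    int oldval = atomic_load(addr);
    int newval;
    do {
        newval = foo(oldval);
    } while (!atomic_compare_exchange_weak(addr, &oldval, newval));
}
```

Even uncontended, the weak exchange may fail spuriously on LL/SC targets, which is exactly why the source-level loop is required.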
Philip Reames
2014-May-29 18:31 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
On 05/29/2014 10:21 AM, Tim Northover wrote:
> Hi Philip,
>
> On 29 May 2014 17:03, Philip Reames <listmail at philipreames.com> wrote:
>> I have some reservations about this proposal. I don't have anything
>> particularly concrete, but the idea of supporting both LL/SC and
>> atomicrmw in the IR concerns me from a complexity perspective.
> Well, I'll start by saying my particular optimisation use case looks
> like it's not enough to justify the addition. I've got something
> basically working for my efficiency worries, with less effort than I
> thought. So I'm a lot less keen on it myself than I was a few hours
> ago.

Good to know.

> But I'm still worried about how closely LLVM IR is tied to both C and
> X86 in this matter. A weak cmpxchg would go a long way to resolving
> this, but it's still difficult to see a path from an IR-level "cmpxchg
> weak" to optimal "atomicrmw lambda" support in LL/SC backends.

I share your concerns, actually. It doesn't affect my current usage, so it's not a high priority for me, but from an idealist perspective it is worrying. On the other hand, overly generic IR is an evil in and of itself.
So it's a delicate balance.

> Given C like:
>
>     void atomic_foo(int *addr) {
>       int oldval = *addr;
>       int newval;
>       do {
>         newval = foo(oldval);
>       } while (!__c11_compare_exchange_weak(addr, &oldval, newval));
>     }
>
> the cmpxchg representation would be something like:
>
>     define void @atomic_foo(i32* %addr) {
>     entry:
>       %firstval = load i32* %addr
>       br label %loop
>     loop:
>       %oldval = phi i32 [%firstval, %entry], [%wrongval, %loop]
>       %newval = call i32 @foo(i32 %oldval)
>       %res = cmpxchg weak i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
>       %wrongval = extractvalue { i32, i1 } %res, 0
>       %success = extractvalue { i32, i1 } %res, 1
>       br i1 %success, label %end, label %loop
>     end:
>       ret void
>     }
>
> But the optimal LL/SC form would be more like:
>
>     define void @atomic_foo(i32* %addr) {
>     entry:
>       br label %loop
>     loop:
>       %oldval = load linked i32* %addr
>       %newval = call i32 @foo(i32 %oldval)
>       %success = store conditional i32 %newval, i32* %addr
>       br i1 %success, label %end, label %loop
>     end:
>       ret void
>     }
>
> That kind of analysis is a very big burden to put on any pass. On the
> other hand, mapping the other way doesn't seem much simpler either.
>
> I feel like there ought to be a good way to combine this with
> Haswell's xbegin/xend functionality in an even more generic IR
> construct too, but I can't quite come up with a representation. More
> thought needed.

I agree with both points, but particularly the "more thought needed" one. :) While it's tempting to introduce scoped atomic constructs which map nicely to all three, this loses much of the power of the xbegin/xend scheme. Being able to spread transactions across function boundaries is essential. I suspect we'll have to end up modelling the transaction boundaries as some form of memory fence. This doesn't get all of their semantics, but it does prevent a number of illegal transforms:

  xbegin -> loadstore, storestore fence after (i.e. stores can't float out of the atomic region!)
  xend -> storestore, storeload fence before (nor in the other direction)

You probably do want to allow load reordering into a transaction past an xend. Doing so past an xbegin is legal (I think?), but likely not profitable: it can turn a potentially succeeding transaction into an always-failing one (or an always-succeeding one). There are a lot of cases to be explored here, both w.r.t. legality and profitability. It would also be good to get input from folks who've built previous compilers with T.M. constructs. I just don't know enough about prior art to propose a good design.

>> Tim, for those of us not directly involved, could you share a selection
>> of bugs or other background? I'd like to read through and try to get a
>> better sense for the problem you're trying to solve.
> My immediate concern was the C++11 compare_exchange which inserts an
> automatic and mostly redundant icmp after the result. Variants of the
> IR in my original message are representative, though possibly not
> exhaustive, of what might be seen.
>
> Of course, it's all much more speculative now. Except, possibly, "how
> should we handle compare_exchange_weak".
>
> Cheers.
>
> Tim.
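The fence-based approximation above can be sketched with C11 fences. The mapping below is an illustrative assumption, not an established design: a release fence is precisely a loadstore+storestore barrier, while C11 has no standalone storeload fence short of seq_cst, so the xend side is modelled conservatively. Real RTM transactions additionally have abort/retry semantics that no fence captures, and the names here are invented for the sketch.

```c
#include <stdatomic.h>

/* Illustrative model only (names and mapping are assumptions): treat
   transaction boundaries as one-way barriers so the region's stores
   cannot float out in either direction. */

/* xbegin ~ loadstore + storestore fence placed after the boundary; in
   C11 terms a release fence keeps subsequent (in-region) stores from
   moving up above the boundary. */
static inline void tx_begin_model(void) {
    atomic_thread_fence(memory_order_release);
}

/* xend ~ storestore + storeload fence placed before the boundary; C11
   only expresses storeload ordering via seq_cst, so model it
   conservatively with a full fence. */
static inline void tx_end_model(void) {
    atomic_thread_fence(memory_order_seq_cst);
}
```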
JF Bastien
2014-May-29 21:51 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
What I'm missing from this discussion is *when* LL/SC would be emitted: is it Clang that lowers code differently when it knows about the target; is it the user that calls these directly; or is it a pass that does this IR transform early enough that more optimizations can be applied?

The latter seems like the best solution to me, but it still requires adding this new construct to the IR and documenting it, and I think doing so may uncover more issues that we'll need to discuss. I'm somewhat wary of deviating further from LLVM's weird superset of C++11's memory model and operations (yes yes, I know there's history there) before it gets "fixed" in the language.
David Chisnall
2014-May-30 11:24 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
On 29 May 2014, at 18:21, Tim Northover <t.p.northover at gmail.com> wrote:
> void atomic_foo(int *addr) {
>   int oldval = *addr;
>   int newval;
>   do {
>     newval = foo(oldval);
>   } while (!__c11_compare_exchange_weak(addr, &oldval, newval));
> }

This particular example is a bit difficult, because the best representation depends on the complexity of foo(). If foo() is simple (and guaranteed not to contain any atomics), then you'd want the load-linked at the start of the loop and the store-conditional at the end, something like this:

    int oldval, newval;
    do {
      oldval = load_linked(addr);
      newval = foo(oldval);
    } while (!store_conditional(addr, newval));

If, however, foo() is not visible, is complex, or has some other memory accesses[1], then the correct lowering would be a plain load at the start and then a single load-linked/store-conditional pair (without an inner loop) at the end:

    int oldval = *addr;
    for (;;) {
      newval = foo(oldval);
      int oldval2 = load_linked(addr);
      if (oldval == oldval2 && store_conditional(addr, newval))
        break;
      oldval = oldval2;  /* retry with the value we actually observed */
    }

The lowering that we currently generate, however, is more like this:

    int oldval = *addr;
    int success;
    do {
      newval = foo(oldval);
      success = 1;
      int oldval2;
      do {
        oldval2 = load_linked(addr);
        if (oldval2 != oldval) {
          success = 0;
          oldval = oldval2;
          break;
        }
      } while (!store_conditional(addr, newval));
    } while (!success);

This is clearly suboptimal: the inner loop gives each compare-exchange strong semantics that the source never asked for. My problem with your current proposal is that we really need to preserve the weak cmpxchg semantics through optimisation to ensure correctness. The front end does not know (except in trivial cases) which of the two earlier forms is correct. It could always emit the second form for any weak cmpxchg, but that then makes optimisation harder, because optimisers have to infer that it means a weak cmpxchg[2]. When we get to CodeGenPrepare, it's fairly easy to transform the original structure into the first correct output if there are no memory accesses in the middle. Transforming the second version into the first is harder unless the structure is preserved.
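The nested-loop lowering discussed above is essentially what falls out of rebuilding strong compare-exchange semantics from a weak primitive: every spurious failure must be retried internally before the caller sees a result. A minimal C11 sketch of that relationship (the function name is mine, for illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative only: build strong compare-exchange semantics on top of
   the weak primitive by retrying spurious failures, i.e. failures where
   the observed value still equals the expected one. This mirrors the
   inner retry loop an LL/SC backend must emit for a strong cmpxchg. */
static bool cas_strong_from_weak(_Atomic int *addr, int *expected,
                                 int desired) {
    const int wanted = *expected;
    for (;;) {
        int observed = wanted;
        if (atomic_compare_exchange_weak(addr, &observed, desired))
            return true;              /* desired was stored */
        if (observed != wanted) {
            *expected = observed;     /* genuine failure: value differed */
            return false;
        }
        /* spurious failure: value still matched, so retry */
    }
}
```

An LL/SC backend has to materialise this inner retry loop for every strong cmpxchg, which is one reason preserving the weak form through optimisation matters.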
In short, I believe that adding ll/sc to the IR will:

- Not make generating IR easier than adding weak cmpxchg
- Not make optimising IR easier than adding weak cmpxchg
- Not make code generation easier than adding weak cmpxchg
- Be more invasive than adding weak cmpxchg

David

[1] RISC-V, for example, requires that there be no memory accesses *at all* between the ll and the sc; some ARM implementations require that there be no memory accesses within the same cache line (see a bug report from a year or so ago on Apple hardware).

[2] There's a lot of work being done by Peter Sewell's group and others currently on what are semantically valid transforms for the C11 memory model, so as long as we preserve those semantics in the IR, we have lots of opportunities to improve optimisations.