Tim Northover
2014-May-29 17:21 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Hi Philip,

On 29 May 2014 17:03, Philip Reames <listmail at philipreames.com> wrote:
> I have some reservations about this proposal. I don't have anything
> particularly concrete, but the idea of supporting both LL/SC and
> atomicrmw in the IR concerns me from a complexity perspective.

Well, I'll start by saying my particular optimisation use case looks like it's not enough to justify the addition. I've got something basically working for my efficiency worries, with less effort than I thought. So I'm a lot less keen on it myself than I was a few hours ago.

But I'm still worried about how closely LLVM IR is tied to both C and X86 in this matter. A weak cmpxchg would go a long way to resolving this, but it's still difficult to see a path from an IR-level "cmpxchg weak" to optimal "atomicrmw lambda" support in LL/SC backends.

Given C like:

    void atomic_foo(int *addr) {
      int oldval = *addr;
      int newval;
      do {
        newval = foo(oldval);
      } while (!__c11_compare_exchange_weak(addr, &oldval, newval));
    }

the cmpxchg representation would be something like:

    define void @atomic_foo(i32* %addr) {
    entry:
      %firstval = load i32* %addr
      br label %loop
    loop:
      %oldval = phi i32 [%firstval, %entry], [%wrongval, %loop]
      %newval = call i32 @foo(i32 %oldval)
      %res = cmpxchg weak i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
      %wrongval = extractvalue { i32, i1 } %res, 0
      %success = extractvalue { i32, i1 } %res, 1
      br i1 %success, label %end, label %loop
    end:
      ret void
    }

But the optimal LL/SC form would be more like:

    define void @atomic_foo(i32* %addr) {
    entry:
      br label %loop
    loop:
      %oldval = load linked i32* %addr
      %newval = call i32 @foo(i32 %oldval)
      %success = store conditional i32 %newval, i32* %addr
      br i1 %success, label %end, label %loop
    end:
      ret void
    }

That kind of analysis is a very big burden to put on any pass. On the other hand, mapping the other way doesn't seem much simpler either.
I feel like there ought to be a good way to combine this with Haswell's xbegin/xend functionality in an even more generic IR construct too, but I can't quite come up with a representation. More thought needed.

> Tim, for those of us not directly involved, could you share a selection
> of bugs or other background? I'd like to read through and try to get a
> better sense for the problem you're trying to solve.

My immediate concern was the C++11 compare_exchange, which inserts an automatic and mostly redundant icmp after the result. Variants of the IR in my original message are representative, though possibly not exhaustive, of what might be seen.

Of course, it's all much more speculative now. Except, possibly, "how should we handle compare_exchange_weak?".

Cheers.

Tim.
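For reference, the retry loop from the example above can be written against the standard C11 <stdatomic.h> API rather than the `__c11_*` builtin spelling. This is just a sketch: the body of `foo()` here is made up purely for illustration (the thread treats it as an opaque transform).

```c
#include <stdatomic.h>

/* Stand-in for the opaque transform in the example; any pure function
   of the old value would do (this particular body is invented here). */
static int foo(int v) { return v * 2 + 1; }

/* Same pattern as atomic_foo above: keep retrying until the weak
   compare-exchange succeeds. On failure (including spurious weak
   failures), oldval is refreshed with the value currently in *addr. */
void atomic_foo(_Atomic int *addr) {
    int oldval = atomic_load(addr);
    int newval;
    do {
        newval = foo(oldval);
    } while (!atomic_compare_exchange_weak(addr, &oldval, newval));
}
```

Even uncontended, the weak exchange may fail spuriously on LL/SC targets, which is exactly why the source-level loop is required.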
Philip Reames
2014-May-29 18:31 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
On 05/29/2014 10:21 AM, Tim Northover wrote:
> Hi Philip,
>
> On 29 May 2014 17:03, Philip Reames <listmail at philipreames.com> wrote:
>> I have some reservations about this proposal. I don't have anything
>> particularly concrete, but the idea of supporting both LL/SC and
>> atomicrmw in the IR concerns me from a complexity perspective.
> Well, I'll start by saying my particular optimisation use case looks
> like it's not enough to justify the addition. I've got something
> basically working for my efficiency worries, with less effort than I
> thought. So I'm a lot less keen on it myself than I was a few hours
> ago.

Good to know.

> But I'm still worried about how closely LLVM IR is tied to both C and
> X86 in this matter. A weak cmpxchg would go a long way to resolving
> this, but it's still difficult to see a path from an IR-level "cmpxchg
> weak" to optimal "atomicrmw lambda" support in LL/SC backends.

I share your concerns, actually. It doesn't affect my current usage, so it's not a high priority for me, but from an idealist perspective it is worrying. On the other hand, overly generic IR is an evil in and of itself.
So it's a delicate balance.

> Given C like:
>
>     void atomic_foo(int *addr) {
>       int oldval = *addr;
>       int newval;
>       do {
>         newval = foo(oldval);
>       } while (!__c11_compare_exchange_weak(addr, &oldval, newval));
>     }
>
> the cmpxchg representation would be something like:
>
>     define void @atomic_foo(i32* %addr) {
>     entry:
>       %firstval = load i32* %addr
>       br label %loop
>     loop:
>       %oldval = phi i32 [%firstval, %entry], [%wrongval, %loop]
>       %newval = call i32 @foo(i32 %oldval)
>       %res = cmpxchg weak i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
>       %wrongval = extractvalue { i32, i1 } %res, 0
>       %success = extractvalue { i32, i1 } %res, 1
>       br i1 %success, label %end, label %loop
>     end:
>       ret void
>     }
>
> But the optimal LL/SC form would be more like:
>
>     define void @atomic_foo(i32* %addr) {
>     entry:
>       br label %loop
>     loop:
>       %oldval = load linked i32* %addr
>       %newval = call i32 @foo(i32 %oldval)
>       %success = store conditional i32 %newval, i32* %addr
>       br i1 %success, label %end, label %loop
>     end:
>       ret void
>     }
>
> That kind of analysis is a very big burden to put on any pass. On the
> other hand, mapping the other way doesn't seem much simpler either.
>
> I feel like there ought to be a good way to combine this with
> Haswell's xbegin/xend functionality in an even more generic IR
> construct too, but I can't quite come up with a representation. More
> thought needed.

I agree with both points, but particularly the "more thought needed" one. :) While it's tempting to introduce scoped atomic constructs which map nicely to all three, this loses much of the power of the xbegin/xend scheme. Being able to spread transactions across function boundaries is essential. I suspect we'll have to end up modelling the transaction boundaries as some form of memory fence. This doesn't get all of their semantics, but it does prevent a number of illegal transforms:

  xbegin -> loadstore, storestore fence after (i.e. stores can't float out of the atomic region!)
  xend -> storestore, storeload fence before (nor in the other direction)

You probably do want to allow load reordering into a transaction past an xend. Doing so past an xbegin is legal (I think?), but likely not profitable: it can turn a potentially succeeding transaction into an always-failing one (or an always-succeeding one). There are a lot of cases to be explored here, both w.r.t. legality and profitability. It would also be good to get input from folks who've built previous compilers with T.M. constructs. I just don't know enough about prior art to propose a good design.

>> Tim, for those of us not directly involved, could you share a selection
>> of bugs or other background? I'd like to read through and try to get a
>> better sense for the problem you're trying to solve.
> My immediate concern was the C++11 compare_exchange which inserts an
> automatic and mostly redundant icmp after the result. Variants of the
> IR in my original message are representative, though possibly not
> exhaustive, of what might be seen.
>
> Of course, it's all much more speculative now. Except, possibly, "how
> should we handle compare_exchange_weak".
>
> Cheers.
>
> Tim.
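The fence-based approximation above can be sketched with C11 fences. The mapping below is an illustrative assumption, not an established design: a release fence is precisely a loadstore+storestore barrier, while C11 has no standalone storeload fence short of seq_cst, so the xend side is modelled conservatively. Real RTM transactions additionally have abort/retry semantics that no fence captures, and the names here are invented for the sketch.

```c
#include <stdatomic.h>

/* Illustrative model only (names and mapping are assumptions): treat
   transaction boundaries as one-way barriers so the region's stores
   cannot float out in either direction. */

/* xbegin ~ loadstore + storestore fence placed after the boundary; in
   C11 terms a release fence keeps subsequent (in-region) stores from
   moving up above the boundary. */
static inline void tx_begin_model(void) {
    atomic_thread_fence(memory_order_release);
}

/* xend ~ storestore + storeload fence placed before the boundary; C11
   only expresses storeload ordering via seq_cst, so model it
   conservatively with a full fence. */
static inline void tx_end_model(void) {
    atomic_thread_fence(memory_order_seq_cst);
}
```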
JF Bastien
2014-May-29 21:51 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
What I'm missing from this discussion is *when* LL/SC would be emitted: is it Clang that lowers code differently when it knows about the target; is it the user that calls these directly; or is it a pass that does this IR transform early enough that more optimizations can be applied?

The latter seems like the best solution to me, but it still requires adding this new construct to the IR and documenting it, and I think doing so may uncover more issues that we'll need to discuss. I'm somewhat wary of deviating further from LLVM's weird superset of C++11's memory model and operations (yes yes, I know there's history there) before it gets "fixed" in the language.
David Chisnall
2014-May-30 11:24 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
On 29 May 2014, at 18:21, Tim Northover <t.p.northover at gmail.com> wrote:
> void atomic_foo(int *addr) {
>   int oldval = *addr;
>   int newval;
>   do {
>     newval = foo(oldval);
>   } while (!__c11_compare_exchange_weak(addr, &oldval, newval));
> }

This particular example is a bit difficult, because the best representation depends on the complexity of foo(). If foo() is simple (and guaranteed not to contain any atomics), then you'd want the load-linked at the start of the loop and the store-conditional at the end, something like this:

    int oldval, newval;
    do {
      oldval = load_linked(addr);
      newval = foo(oldval);
    } while (!store_conditional(addr, newval));

If, however, foo() is not visible, is complex, or has some other memory accesses[1], then the correct lowering would be a plain load at the start and then a single load-linked/store-conditional pair (without an inner loop) at the end:

    int oldval = *addr;
    for (;;) {
      newval = foo(oldval);
      int oldval2 = load_linked(addr);
      if (oldval == oldval2 && store_conditional(addr, newval))
        break;
      oldval = oldval2;  /* retry with the value we actually observed */
    }

The lowering that we currently generate, however, is more like this:

    int oldval = *addr;
    int success;
    do {
      newval = foo(oldval);
      success = 1;
      int oldval2;
      do {
        oldval2 = load_linked(addr);
        if (oldval2 != oldval) {
          success = 0;
          oldval = oldval2;
          break;
        }
      } while (!store_conditional(addr, newval));
    } while (!success);

This is clearly suboptimal: the inner loop gives each compare-exchange strong semantics that the source never asked for. My problem with your current proposal is that we really need to preserve the weak cmpxchg semantics through optimisation to ensure correctness. The front end does not know (except in trivial cases) which of the two earlier forms is correct. It could always emit the second form for any weak cmpxchg, but that then makes optimisation harder, because optimisers have to infer that it means a weak cmpxchg[2]. When we get to CodeGenPrepare, it's fairly easy to transform the original structure into the first correct output if there are no memory accesses in the middle. Transforming the second version into the first is harder unless the structure is preserved.
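The nested-loop lowering discussed above is essentially what falls out of rebuilding strong compare-exchange semantics from a weak primitive: every spurious failure must be retried internally before the caller sees a result. A minimal C11 sketch of that relationship (the function name is mine, for illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative only: build strong compare-exchange semantics on top of
   the weak primitive by retrying spurious failures, i.e. failures where
   the observed value still equals the expected one. This mirrors the
   inner retry loop an LL/SC backend must emit for a strong cmpxchg. */
static bool cas_strong_from_weak(_Atomic int *addr, int *expected,
                                 int desired) {
    const int wanted = *expected;
    for (;;) {
        int observed = wanted;
        if (atomic_compare_exchange_weak(addr, &observed, desired))
            return true;              /* desired was stored */
        if (observed != wanted) {
            *expected = observed;     /* genuine failure: value differed */
            return false;
        }
        /* spurious failure: value still matched, so retry */
    }
}
```

An LL/SC backend has to materialise this inner retry loop for every strong cmpxchg, which is one reason preserving the weak form through optimisation matters.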
In short, I believe that adding ll/sc to the IR will:

- Not make generating IR easier than adding weak cmpxchg
- Not make optimising IR easier than adding weak cmpxchg
- Not make code generation easier than adding weak cmpxchg
- Be more invasive than adding weak cmpxchg

David

[1] RISC-V, for example, requires that there be no memory accesses *at all* between the ll and the sc; some ARM implementations require that there be no memory accesses within the same cache line (see a bug report from a year or so ago on Apple hardware).

[2] There's a lot of work being done by Peter Sewell's group and others currently on what are semantically valid transforms for the C11 memory model, so as long as we preserve those semantics in the IR, we have lots of opportunities to improve optimisations.