Tim Northover
2014-May-29 11:55 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Hi,

I've been looking at improving atomicrmw & cmpxchg code more, particularly on architectures using the load-linked/store-conditional model.

The summary is that the current expansion for cmpxchg seems to happen too late for LLVM to make meaningful use of the opportunities it provides. I'd like to move it earlier and express it in terms of a first-class pair of "load linked" and "store conditional" instructions.

More details are attached and inlined below. Currently the only code I have is for one or two AtomicExpand peepholes, so any suggestions or alternatives would be very welcome. It's possible this is entirely the wrong approach.

What do people think?

Cheers.

Tim.

The Problem
-----------

We now expand atomic instructions to load-linked & store-conditional intrinsics just before ISel, but this expansion opens up more optimisation opportunities that LLVM *would* be able to exploit if it saw the control flow earlier.

For example, the return value of the C++11 and C11 compare_exchange operations is actually whether the exchange succeeded, which leads to some common idioms in Clang-produced IR.

From "if(__c11_compare_exchange_strong(...))":

    %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
    %success = icmp eq i32 %loaded, %oldval
    br i1 %success, label %true, label %false

the control flow here should be something like:

    loop:
      %loaded = load linked i32* %addr seq_cst
      %trystore = icmp eq i32 %loaded, %oldval
      br i1 %trystore, label %store.cond, label %false
    store.cond:
      %success = store conditional i32 %newval, i32* %addr seq_cst
      br i1 %success, label %true, label %loop

From "return __c11_compare_exchange_strong(...);":

    %loaded = cmpxchg i32* %addr, i32 %oldval, i32 %newval seq_cst seq_cst
    %success = icmp eq i32 %loaded, %oldval
    %res = zext i1 %success to i32
    ret i32 %res

here, we'd want something like:

    loop:
      %loaded = load linked i32* %addr seq_cst
      %trystore = icmp eq i32 %loaded, %oldval
      br i1 %trystore, label %store.cond, label %false
    store.cond:
      %success = store conditional i32 %newval, i32* %addr seq_cst
      br i1 %success, label %true, label %loop
    true:
      ret i32 1
    false:
      ret i32 0

In both cases the compare is largely redundant on LL/SC architectures: the control-flow graph already tells you what its result will be, and if LLVM sees this early enough it can produce something close to optimal code.

Unfortunately, most of LLVM has finished by the time this expansion occurs, so we end up with redundant compares.

Now, it's certainly possible to put various peephole optimisations into the expansion pass to cover these cases (at least), but to me that feels like fragile duplication of effort we've already made in the mid-end.

My Proposal
-----------

I think the best way to solve this longer term is to add explicit "load linked" and "store conditional" instructions to the LLVM IR, with equal status to "cmpxchg" and "atomicrmw".

We could then simplify the AtomicExpandLoadLinked pass to use these, change it to use something like a TargetTransformInfo hook instead of TargetLowering, and schedule it much earlier in the mid-end. X86 & SystemZ would simply request that no expansion happen.

This means Clang could still emit cmpxchg and atomicrmw, which closely match C & C++ semantics, but most of LLVM would see the control flow and operations in a form much closer to how the targets will actually emit them.

In addition, other languages would be able to provide more general atomic operations if they want (slightly more complex bit twiddling seems like it could be particularly useful in some algorithms; see the sketch below).
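For instance, a saturating unsigned increment, which no single atomicrmw operation can express, might look like this in the proposed (hypothetical) syntax from the examples above. This is a sketch only; %counter, %inc and the labels are placeholder names:

    loop:
      %loaded = load linked i32* %counter seq_cst
      %sum = add i32 %loaded, %inc
      %wrapped = icmp ult i32 %sum, %loaded            ; unsigned overflow?
      %newval = select i1 %wrapped, i32 -1, i32 %sum   ; clamp to UINT_MAX
      %success = store conditional i32 %newval, i32* %counter seq_cst
      br i1 %success, label %done, label %loop

On an LL/SC target the whole loop lowers directly, with no nested cmpxchg loop inside it.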
Current atomic instructions cover (most of) C and C++, but nothing else.

Potential issues
----------------

1. Can invalid transformations be done on them? Separating an LL from its
   corresponding SC by too much almost guarantees failure (an OS task switch
   happens, clearing the monitor).
2. We'd need to document their interaction with LLVM's memory model. I'm not
   sure I've seen a good happens-before model that explicitly uses LL/SC. The
   common ones people came up with in the Java & C++11 standardisation process
   seem to assume atomic RMW operations instead.
3. People might be assuming StoreInst produces no value and not calling
   ReplaceAllUses. A separate StoreConditionalInst is one option, but that
   risks the reverse: people not considering it a store when it is.
4. Not all targets can support them. There are very few LLVM instructions
   like that, which is a little concerning. But not all targets can support
   the current atomics either.

If we make vulnerability to ABA problems target-specific in our documentation, then X86 & SystemZ could use the same trick as for min/max to implement them generically: a load/cmpxchg loop (sketched after the alternatives below).

Alternatives
------------

1. Enhancing MachineCSE and friends so they can optimise this.
   - Works for "%old = atomicrmw OP %addr, %inc; OP %old, %inc" already.
   - Cmpxchg is more difficult. It's the control-flow optimisations that are
     particularly powerful in these cases.
   - It's more duplicated effort between different parts of LLVM.

2. Peephole optimisations in the AtomicExpandLoadLinked pass.
   - Feels like a hack, spotting particular patterns that the mid-end can
     already cope with.
   - Deals with toy functions well, but it's not quite clear that those are
     the only options in more general code (after inlining, say).
   - Potentially the safest option: atomicrmw & cmpxchg are preserved until
     very late, so nothing's going to come in and convert valid code to
     invalid.

3. Moving expansion further forward but still using intrinsics.
   - The intrinsics need to have fully general side effects to be correct:
     effectively an "asm volatile" for optimisation purposes, which is very
     heavy-handed for LLVM's other optimisations.
   - Still need target hooks to create the calls, because intrinsics don't
     get type-lowered and so you can't write a fully generic one (e.g. an i64
     ldrex on ARM needs to return { i32, i32 }).

4. Changing the cmpxchg operation to return (say) { iN, i1 }, where the
   second value indicates success.
   - Probably a good idea anyway, as part of support for weak
     compare-exchange operations.
   - Doesn't actually help this issue much: it's mostly the control-flow
     optimisations that seem to be helpful, and this would leave us with phi
     nodes in need of materialisation when lowered.
   - Might make the peephole optimisations in 2 slightly simpler.

5. Some kind of generalisation that can cover both LL/SC and transactional
   memory, to avoid doing this again if and when that works out?
   - I can't think of one that's both neat and practically implementable for
     both.
   - Suggestions welcome.
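That load/cmpxchg fallback would look roughly like this (a sketch only; %addr is a placeholder and the increment stands in for whatever computation the LL/SC pair guards). Each load linked becomes a plain atomic load, and each store conditional becomes a strong cmpxchg against the value that load produced, with the comparison result standing in for the SC success bit:

    loop:
      %loaded = load atomic i32* %addr seq_cst, align 4    ; was: load linked
      %newval = add i32 %loaded, 1                         ; guarded computation
      %old = cmpxchg i32* %addr, i32 %loaded, i32 %newval seq_cst seq_cst
      %success = icmp eq i32 %old, %loaded                 ; was: SC's result
      br i1 %success, label %done, label %loop

Unlike a real store conditional, this succeeds whenever the value merely matches, which is exactly the ABA vulnerability the documentation would have to call out as target-specific.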
Amara Emerson
2014-May-29 12:35 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Hi Tim,

> 1. Can invalid transformations be done on them? Separating an LL from its
>    corresponding SC by too much almost guarantees failure (an OS task switch
>    happens, clearing the monitor).

From my admittedly weak understanding, LICM shouldn't hoist loads unless they're unordered, and this would also apply here to the new load linked instruction. In what other cases could the LL/SC become separated enough for OS context switches to become a problem?

Amara
Tim Northover
2014-May-29 13:53 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Hi Amara,

I've had a chat with Chandler on IRC since sending this, and it looks like it's at the very least a less pressing issue. The summary was that we may want LL/SC instructions in future (possibly to represent situations currently handled by cmpxchg loops), but they wouldn't actually solve my problem anyway because cmpxchg would still be the canonical representation when on its own.

Fortunately, he also suggested a very neat solution which deals with my issues: just run EarlyCSE to tidy up my mess.

On 29 May 2014 13:35, Amara Emerson <amara.emerson at gmail.com> wrote:
> From my admittedly weak understanding, LICM shouldn't hoist loads
> unless they're unordered and this would also apply here to the new
> load linked instruction. In what other cases could the LL/SC become
> separated enough for OS context switches to become a problem?

I think the most likely would be instrumentation passes. They could even insert another atomicrmw, which would be guaranteed to cause failure regardless of time taken (see the sketch below).

Cheers.

Tim.
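A minimal sketch of that guaranteed failure, again in the proposed syntax (the @counter update is a made-up stand-in for whatever an instrumentation pass might emit; on an LL/SC target its own expansion clears the exclusive monitor):

    loop:
      %loaded = load linked i32* %addr seq_cst
      ; inserted by instrumentation, between the LL and the SC:
      %count = atomicrmw add i32* @counter, i32 1 seq_cst
      ; the monitor is now clear, so this can never succeed:
      %success = store conditional i32 %newval, i32* %addr seq_cst
      br i1 %success, label %done, label %loop    ; loops forever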
Philip Reames
2014-May-29 16:03 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
I have some reservations about this proposal. I don't have anything particularly concrete, but the idea of supporting both LL/SC and atomicrmw in the IR concerns me from a complexity perspective. If we find that there is significant profit in supporting both, we can, but I would like to see documentation of significant benefit before changing the IR.

Tim, for those of us not directly involved, could you share a selection of bugs or other background? I'd like to read through and try to get a better sense for the problem you're trying to solve.

Philip

On 05/29/2014 04:55 AM, Tim Northover wrote:
> [...]
Tim Northover
2014-May-29 17:21 UTC
[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions
Hi Philip,

On 29 May 2014 17:03, Philip Reames <listmail at philipreames.com> wrote:
> I have some reservations about this proposal. I don't have anything
> particularly concrete, but the idea of supporting both LL/SC and atomicrmw
> in the IR concerns me from a complexity perspective.

Well, I'll start by saying my particular optimisation use case looks like it's not enough to justify the addition. I've got something basically working for my efficiency worries, with less effort than I thought. So I'm a lot less keen on it myself than I was a few hours ago.

But I'm still worried about how closely LLVM IR is tied to both C and X86 in this matter. A weak cmpxchg would go a long way to resolving this, but it's still difficult to see a path from an IR-level "cmpxchg weak" to optimal "atomicrmw lambda" support in LL/SC backends. Given C like:

    void atomic_foo(int *addr) {
      int oldval = *addr;
      int newval;
      do {
        newval = foo(oldval);
      } while (!__c11_compare_exchange_weak(addr, &oldval, newval));
    }

the cmpxchg representation would be something like:

    define void @atomic_foo(i32* %addr) {
    entry:
      %firstval = load i32* %addr
      br label %loop

    loop:
      %oldval = phi i32 [%firstval, %entry], [%wrongval, %loop]
      %newval = call i32 @foo(i32 %oldval)
      %res = cmpxchg weak i32* %addr, i32 %oldval, i32 %newval
      %wrongval = extractvalue { i32, i1 } %res, 0
      %success = extractvalue { i32, i1 } %res, 1
      br i1 %success, label %end, label %loop

    end:
      ret void
    }

But the optimal LL/SC form would be more like:

    define void @atomic_foo(i32* %addr) {
    entry:
      br label %loop

    loop:
      %oldval = load linked i32* %addr
      %newval = call i32 @foo(i32 %oldval)
      %success = store conditional i32 %newval, i32* %addr
      br i1 %success, label %end, label %loop

    end:
      ret void
    }

That kind of analysis is a very big burden to put on any pass. On the other hand, mapping the other way doesn't seem much simpler either.

I feel like there ought to be a good way to combine this with Haswell's xbegin/xend functionality in an even more generic IR construct too, but I can't quite come up with a representation. More thought needed.

> Tim, for those of us not directly involved, could you share a selection of
> bugs or other background? I'd like to read through and try to get a better
> sense for the problem you're trying to solve.

My immediate concern was the C++11 compare_exchange, which inserts an automatic and mostly redundant icmp after the result. Variants of the IR in my original message are representative, though possibly not exhaustive, of what might be seen.

Of course, it's all much more speculative now. Except, possibly, "how should we handle compare_exchange_weak".

Cheers.

Tim.