In working on an LLVM backend for SBCL (a lisp compiler), there are certain sequences of code that must be atomic with regards to async signals. So, for example, on x86, a single SUB on a memory location should be used, not a load/sub/store sequence. LLVM's IR doesn't currently have any way to express this kind of constraint (...and really, that's essentially impossible since different architectures have different possibilities, so I'm not asking for this...).

All I really would like is to be able to specify the exact instruction sequence to emit there. I'd hoped that inline asm would be the way to do so, but LLVM doesn't appear to support asm output when using the JIT compiler. Is there any hope for inline asm being supported with the JIT anytime soon? Or is there an alternative suggested way of doing this?

I'm using llvm.atomic.load.sub.i64.p0i64 for the moment, but that's both more expensive than I need, as it has an unnecessary LOCK prefix, and also theoretically incorrect. While it generates correct code currently on x86-64, LLVM doesn't actually *guarantee* that it generates a single instruction; that's just "luck".

Additionally, I think there will be some situations where a particular ordering of memory operations is required. LLVM makes no guarantees about the order of stores, unless there's some way that you could tell the difference in a linear program. Unfortunately, I don't have a linear program; I have a program which can run signal handlers between arbitrary instructions. So, I think I'll need something like an llvm.memory.barrier of type "ss", except only affecting the codegen, not actually inserting a processor memory barrier.

Is there already some way to insert a codegen barrier with no additional runtime cost (beyond the opportunity cost of not being able to reorder/delete stores across the barrier)? If not, can such a thing be added?
On x86, this is a non-issue, since the processor already implicitly has inter-processor store-store barriers, so using:

  call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)

is fine: it's a no-op at runtime but ensures the correct sequence of stores. But I'm thinking ahead here to other architectures where that would actually require expensive instructions to be emitted.

Thanks,
James
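A C-level sketch of the pattern being asked for may clarify it (this is illustrative, not SBCL's actual runtime code; the GCC empty-asm idiom stands in for the hypothetical "codegen-only barrier" — it emits no instructions but forbids the compiler from moving memory operations across it):

```c
#include <assert.h>
#include <signal.h>

/* Thread-local pseudo-atomic flag; sig_atomic_t is the one type the C
   standard guarantees can be read/written atomically w.r.t. a signal
   handler in the same thread. */
static volatile sig_atomic_t pseudo_atomic = 0;
static volatile sig_atomic_t pending = 0;

/* Compiler-only barrier: no runtime cost, but stores may not be
   reordered or deleted across it by codegen. */
#define CODEGEN_BARRIER() __asm__ volatile("" ::: "memory")

static long slots[2];   /* stands in for a freshly allocated object */

long *allocate_object(void) {
    pseudo_atomic = 1;
    CODEGEN_BARRIER();   /* object stores must not float above the flag-set */
    slots[0] = 5;        /* object header word */
    slots[1] = 2;        /* object length slot */
    CODEGEN_BARRIER();   /* flag-clear must not float above the stores */
    pseudo_atomic = 0;
    if (pending) { pending = 0; /* a deferred interrupt would run here */ }
    return slots;
}
```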
On Jan 3, 2010, at 10:10 PM, James Y Knight wrote:

> In working on an LLVM backend for SBCL (a lisp compiler), there are
> certain sequences of code that must be atomic with regards to async
> signals. So, for example, on x86, a single SUB on a memory location
> should be used, not a load/sub/store sequence. LLVM's IR doesn't
> currently have any way to express this kind of constraint (...and
> really, that's essentially impossible since different architectures
> have different possibilities, so I'm not asking for this...).

Why do you want to do this? As far as I'm aware, there's no guarantee that a memory-memory SUB will be observed atomically across all processors. Remember that most processors are going to be breaking x86 instructions up into micro-ops, which might get reordered/interleaved in any number of different ways.

> All I really would like is to be able to specify the exact instruction
> sequence to emit there. I'd hoped that inline asm would be the way to
> do so, but LLVM doesn't appear to support asm output when using the
> JIT compiler. Is there any hope for inline asm being supported with
> the JIT anytime soon? Or is there an alternative suggested way of
> doing this? I'm using llvm.atomic.load.sub.i64.p0i64 for the moment,
> but that's both more expensive than I need as it has an unnecessary
> LOCK prefix, and is also theoretically incorrect. While it generates
> correct code currently on x86-64, LLVM doesn't actually *guarantee*
> that it generates a single instruction, that's just "luck".

It's not luck. That's exactly what the atomic intrinsics guarantee: that no other processor can observe an intermediate state of the operation. What they don't guarantee, per the LangRef, is sequential consistency. If you care about that, you need to use explicit fencing.

--Owen
On Mon, Jan 4, 2010 at 12:20 AM, Owen Anderson <resistor at mac.com> wrote:

> On Jan 3, 2010, at 10:10 PM, James Y Knight wrote:
>
>> In working on an LLVM backend for SBCL (a lisp compiler), there are
>> certain sequences of code that must be atomic with regards to async
>> signals. So, for example, on x86, a single SUB on a memory location
>> should be used, not a load/sub/store sequence. LLVM's IR doesn't
>> currently have any way to express this kind of constraint (...and
>> really, that's essentially impossible since different architectures
>> have different possibilities, so I'm not asking for this...).
>
> Why do you want to do this? As far as I'm aware, there's no guarantee
> that a memory-memory SUB will be observed atomically across all
> processors. Remember that most processors are going to be breaking x86
> instructions up into micro-ops, which might get reordered/interleaved
> in any number of different ways.

I'm assuming 'memory-memory' there is a typo, and we're just talking about a 'sub' instruction with a memory destination. In that case, I'll go further: the Intel IA-32 manual explicitly tells you that x86 processors are allowed to do the read and write halves of that single instruction interleaved with other writes to that memory location from other processors (see section 8.2.3.1 of [1]). =[ I can tell you from bitter experience debugging code that assumed this was atomic: it does in fact happen. I have watched reference counters miss both increments and decrements because of it, on both Intel and AMD systems.

>> All I really would like is to be able to specify the exact instruction
>> sequence to emit there. I'd hoped that inline asm would be the way to
>> do so, but LLVM doesn't appear to support asm output when using the
>> JIT compiler. Is there any hope for inline asm being supported with
>> the JIT anytime soon? Or is there an alternative suggested way of
>> doing this?
>> I'm using llvm.atomic.load.sub.i64.p0i64 for the moment,
>> but that's both more expensive than I need as it has an unnecessary
>> LOCK prefix, and is also theoretically incorrect.

As I've mentioned above, I assure you the LOCK prefix matters. The strange thing is that you think this is inefficient. Modern processors don't lock the bus given this prefix to a 'sub' instruction; they just lock the cache line and use the coherency model to resolve the issue. This is much cheaper than, say, an 'xchg' instruction on an x86 processor. What is the performance problem you are actually trying to solve here?

> What they don't guarantee per the LangRef is sequential consistency.
> If you care about that, you need to use explicit fencing.

Side note: I regret greatly that I didn't know enough of the sequential consistency concerns here to address them more fully when I was working on this. =/ Even explicit fencing has subtle problems with it as currently specified. Is this causing problems for people (other than jyasskin, who clued me in on the whole matter)?
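A concrete illustration of why the LOCK'd read-modify-write matters across threads — sketched in the (later) C11 atomics, which are anachronistic for this 2010 thread but capture exactly the semantics of llvm.atomic.load.sub; all names here are illustrative:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* atomic_fetch_sub compiles on x86 to a lock-prefixed instruction,
   which locks the cache line (not the bus).  With two threads
   hammering one counter, no decrement is ever lost: the final value
   is exact on every run.  A plain non-atomic sub-to-memory gives no
   such guarantee. */

#define ITERS 100000

static _Atomic long counter = 2 * ITERS;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_sub(&counter, 1);  /* atomic read-modify-write */
    return NULL;
}

long run_counter_demo(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&counter);
}
```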
Responding to the original email...

On Sun, Jan 3, 2010 at 10:10 PM, James Y Knight <foom at fuhm.net> wrote:

> In working on an LLVM backend for SBCL (a lisp compiler), there are
> certain sequences of code that must be atomic with regards to async
> signals.

Can you define exactly what 'atomic with regards to async signals' entails? Your descriptions led me to think you may mean something other than the POSIX definition, but maybe I'm just misinterpreting it. Are these signals guaranteed to run in the same thread? On the same processor? Is there concurrent code running in the address space when they run?

<snip, this seems to be well handled on sibling email...>

> Additionally, I think there will be some situations where a particular
> ordering of memory operations is required. LLVM makes no guarantees
> about the order of stores, unless there's some way that you could tell
> the difference in a linear program. Unfortunately, I don't have a
> linear program, I have a program which can run signal handlers between
> arbitrary instructions. So, I think I'll need something like an
> llvm.memory.barrier of type "ss", except only affecting the codegen,
> not actually inserting a processor memory barrier.

The processor can reorder memory operations as well (within limits). Consider that 'memset' to zero is often codegened to a non-temporal store to memory. This exempts it from all ordering considerations except for an explicit memory fence in the processor. If code were to execute between those two instructions, the contents of the memory could read "andthenumberofcountingshallbethree", or 'feedbeef', or '0000...', or '1111...'; there's just no telling.

> Is there already some way to insert a codegen-barrier with no
> additional runtime cost (beyond the opportunity-cost of not being able
> to reorder/delete stores across the barrier)? If not, can such a thing
> be added?
> On x86, this is a non-issue, since the processor already
> implicitly has inter-processor store-store barriers, so using:
>   call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)
> is fine: it's a noop at runtime but ensures the correct sequence of
> stores...but I'm thinking ahead here to other architectures where that
> would actually require expensive instructions to be emitted.

But... if it *did* require expensive instructions, wouldn't you want them?!?! The reason we don't emit them on x86 is because of its memory ordering guarantees. If it didn't have them, we would emit instructions to impose an ordering, because otherwise the wrong thing might happen. I think you should trust LLVM to only emit expensive instructions to achieve the ordering semantics you specify when they are necessary for the architecture, and file bugs if it ever fails.

The only useful thing I can think of is if you happen to know that you execute on some "uniprocessor" with at most one thread of execution, and thus gain memory ordering constraints beyond those which can be assumed across an entire architecture (this is certainly true for x86). If it is useful to leverage this to optimize codegen, it should be at the target level, with some target options to specify that consistency assumptions should be stronger than normal. The intrinsics and their semantics should remain the same regardless.
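The distinction being requested here — an ordering that binds the compiler but not the processor — is, in later C11 terms, exactly the difference between atomic_signal_fence and atomic_thread_fence. A minimal sketch (names and the publish example are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

static volatile int data = 0;
static volatile int flag = 0;

/* atomic_signal_fence constrains only the compiler: it emits no
   instructions, but stores may not migrate across it, which is
   sufficient for a handler running in the same thread.  Swapping it
   for atomic_thread_fence with the same ordering may emit a real
   fence instruction (mfence, dmb, ...) on weakly ordered targets --
   the "expensive instructions" under discussion. */
void publish_for_signal_handler(int value) {
    data = value;
    atomic_signal_fence(memory_order_seq_cst);  /* codegen-only barrier */
    flag = 1;   /* a same-thread handler seeing flag==1 may read data */
}
```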
On Jan 4, 2010, at 4:35 AM, Chandler Carruth wrote:

> Responding to the original email...
>
> On Sun, Jan 3, 2010 at 10:10 PM, James Y Knight <foom at fuhm.net> wrote:
>> In working on an LLVM backend for SBCL (a lisp compiler), there are
>> certain sequences of code that must be atomic with regards to async
>> signals.
>
> Can you define exactly what 'atomic with regards to async signals'
> entails? Your descriptions led me to think you may mean something
> other than the POSIX definition, but maybe I'm just misinterpreting
> it. Are these signals guaranteed to run in the same thread? On the
> same processor? Is there concurrent code running in the address space
> when they run?

Hi, thanks everyone for all the comments. I think maybe I wasn't clear that I *only* care about atomicity w.r.t. a signal handler interruption in the same thread, *not* across threads. Therefore, many of the problems of cross-CPU atomicity are not relevant. The signal handler gets invoked via pthread_kill, and is thus necessarily running in the same thread as the code being interrupted. The memory in question can be considered thread-local here, so I'm not worried about other threads touching it at all.

I also realize I had (at least :) one error in my original email: of course, the atomic operations LLVM provides *are* guaranteed to do the right thing w.r.t. atomicity against signal handlers... they in fact just do more than I need, not less. I'm not sure why I thought they were both more and less than I needed before, and sorry if it confused you about what I'm trying to accomplish.
Here's a concrete example, in hopes it will clarify matters:

  @pseudo_atomic = thread_local global i64 0

  declare i64* @alloc(i64)
  declare void @do_pending_interrupt()
  declare i64 @llvm.atomic.load.sub.i64.p0i64(i64* nocapture, i64) nounwind
  declare void @llvm.memory.barrier(i1, i1, i1, i1, i1)

  define i64* @foo() {
    ;; Note that we're in an allocation section
    store i64 1, i64* @pseudo_atomic
    ;; Barrier only to ensure instruction ordering, not needed as a true memory barrier
    call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)

    ;; Call might actually be inlined, so cannot depend upon an unknown call causing correct codegen effects.
    %obj = call i64* @alloc(i64 32)
    %obj_header = getelementptr i64* %obj, i64 0
    store i64 5, i64* %obj_header   ;; store obj type (5) in header word
    %obj_len = getelementptr i64* %obj, i64 1
    store i64 2, i64* %obj_len      ;; store obj length (2) in length slot
    ...etc...

    ;; Check if we were interrupted:
    %res = call i64 @llvm.atomic.load.sub.i64.p0i64(i64* @pseudo_atomic, i64 1)
    %was_interrupted = icmp eq i64 %res, 1
    br i1 %was_interrupted, label %do-interruption, label %continue

  continue:
    ret i64* %obj

  do-interruption:
    call void @do_pending_interrupt()
    br label %continue
  }

A signal handler will check the thread-local @pseudo_atomic variable: if it was already set, it will just change the value to 2 and return, waiting to be reinvoked by do_pending_interrupt at the end of the pseudo-atomic section. This is because it may get confused by the proto-object being built up in this code.

The sequence SBCL emits today with its internal codegen is basically like:

    MOV <pseudo_atomic>, 1
    [[do allocation, fill in object, etc]]
    XOR <pseudo_atomic>, 1
    JEQ continue
    <<call do_pending_interrupt>>
  continue:
    ...

The important things here are:

1) Stores cannot be migrated from within the MOV/XOR instructions to outside by the codegen.
2) There's no way an interruption can be missed: the XOR is atomic with regards to signals executing in the same thread; it's either fully executed or not (both load+store). But I don't care whether it's visible on other CPUs or not: it's a thread-local variable in any case.

Those are the two properties I'd like to get from LLVM, without actually ever invoking superfluous processor synchronization.

> The processor can reorder memory operations as well (within limits).
> Consider that 'memset' to zero is often codegened to a non-temporal
> store to memory. This exempts it from all ordering considerations

My understanding is that processor reordering only affects what you might see from another CPU: the processor will undo speculatively executed operations if the sequence of instructions actually executed is not the sequence it predicted, so within a single CPU you should never be able to tell the difference. But I must admit I don't know anything about non-temporal stores. Within a single thread, if I do a non-temporal store, followed by a load, am I not guaranteed to get back the value I stored?

James
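The whole protocol above can be sketched as a runnable C program, using raise() to deliver the signal synchronously in the same thread (all names are illustrative stand-ins, not SBCL's actual runtime API, and the flag-check at exit is two instructions here rather than the single atomic XOR the thread is asking for):

```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t pa_flag = 0;         /* pseudo_atomic */
static volatile sig_atomic_t deferred = 0;
static volatile sig_atomic_t handled_inline = 0;

static void handler(int sig) {
    (void)sig;
    if (pa_flag)
        pa_flag = 2;          /* mid-allocation: defer until section end */
    else
        handled_inline = 1;   /* safe to act immediately */
}

int pseudo_atomic_demo(void) {
    signal(SIGUSR1, handler);
    pa_flag = 1;                        /* enter pseudo-atomic section */
    __asm__ volatile("" ::: "memory");  /* codegen-only barrier */
    /* ... object would be allocated and filled in here ... */
    raise(SIGUSR1);                     /* signal arrives mid-section */
    __asm__ volatile("" ::: "memory");
    if (pa_flag == 2)                   /* stands in for the XOR+JEQ test */
        deferred = 1;                   /* do_pending_interrupt() runs here */
    pa_flag = 0;
    return deferred;
}
```

Because raise() delivers to the calling thread before returning, the handler always observes pa_flag == 1, so the demo deterministically takes the deferred path.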