In working on an LLVM backend for SBCL (a lisp compiler), there are certain sequences of code that must be atomic with regards to async signals. So, for example, on x86, a single SUB on a memory location should be used, not a load/sub/store sequence. LLVM's IR doesn't currently have any way to express this kind of constraint (...and really, that's essentially impossible since different architectures have different possibilities, so I'm not asking for this...).

All I really would like is to be able to specify the exact instruction sequence to emit there. I'd hoped that inline asm would be the way to do so, but LLVM doesn't appear to support asm output when using the JIT compiler. Is there any hope for inline asm being supported with the JIT anytime soon? Or is there an alternative suggested way of doing this?

I'm using llvm.atomic.load.sub.i64.p0i64 for the moment, but that's both more expensive than I need, as it has an unnecessary LOCK prefix, and also theoretically incorrect. While it generates correct code currently on x86-64, LLVM doesn't actually *guarantee* that it generates a single instruction; that's just "luck".

Additionally, I think there will be some situations where a particular ordering of memory operations is required. LLVM makes no guarantees about the order of stores, unless there's some way that you could tell the difference in a linear program. Unfortunately, I don't have a linear program; I have a program which can run signal handlers between arbitrary instructions. So, I think I'll need something like an llvm.memory.barrier of type "ss", except only affecting the codegen, not actually inserting a processor memory barrier.

Is there already some way to insert a codegen barrier with no additional runtime cost (beyond the opportunity cost of not being able to reorder/delete stores across the barrier)? If not, can such a thing be added?
On x86, this is a non-issue, since the processor already implicitly has inter-processor store-store barriers, so using:

  call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)

is fine: it's a no-op at runtime but ensures the correct sequence of stores. But I'm thinking ahead here to other architectures where that would actually require expensive instructions to be emitted.

Thanks,
James
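A C-level sketch of the pattern being asked for may clarify it (this is illustrative, not SBCL's actual runtime code; the GCC empty-asm idiom stands in for the hypothetical "codegen-only barrier" — it emits no instructions but forbids the compiler from moving memory operations across it):

```c
#include <assert.h>
#include <signal.h>

/* Thread-local pseudo-atomic flag; sig_atomic_t is the one type the C
   standard guarantees can be read/written atomically w.r.t. a signal
   handler in the same thread. */
static volatile sig_atomic_t pseudo_atomic = 0;
static volatile sig_atomic_t pending = 0;

/* Compiler-only barrier: no runtime cost, but stores may not be
   reordered or deleted across it by codegen. */
#define CODEGEN_BARRIER() __asm__ volatile("" ::: "memory")

static long slots[2];   /* stands in for a freshly allocated object */

long *allocate_object(void) {
    pseudo_atomic = 1;
    CODEGEN_BARRIER();   /* object stores must not float above the flag-set */
    slots[0] = 5;        /* object header word */
    slots[1] = 2;        /* object length slot */
    CODEGEN_BARRIER();   /* flag-clear must not float above the stores */
    pseudo_atomic = 0;
    if (pending) { pending = 0; /* a deferred interrupt would run here */ }
    return slots;
}
```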
On Jan 3, 2010, at 10:10 PM, James Y Knight wrote:

> In working on an LLVM backend for SBCL (a lisp compiler), there are
> certain sequences of code that must be atomic with regards to async
> signals. So, for example, on x86, a single SUB on a memory location
> should be used, not a load/sub/store sequence. LLVM's IR doesn't
> currently have any way to express this kind of constraint (...and
> really, that's essentially impossible since different architectures
> have different possibilities, so I'm not asking for this...).

Why do you want to do this? As far as I'm aware, there's no guarantee that a memory-memory SUB will be observed atomically across all processors. Remember that most processors are going to be breaking x86 instructions up into micro-ops, which might get reordered/interleaved in any number of different ways.

> All I really would like is to be able to specify the exact instruction
> sequence to emit there. I'd hoped that inline asm would be the way to
> do so, but LLVM doesn't appear to support asm output when using the
> JIT compiler. Is there any hope for inline asm being supported with
> the JIT anytime soon? Or is there an alternative suggested way of
> doing this? I'm using llvm.atomic.load.sub.i64.p0i64 for the moment,
> but that's both more expensive than I need as it has an unnecessary
> LOCK prefix, and is also theoretically incorrect. While it generates
> correct code currently on x86-64, LLVM doesn't actually *guarantee*
> that it generates a single instruction, that's just "luck".

It's not luck. That's exactly what the atomic intrinsics guarantee: that no other processor can observe an intermediate state of the operation. What they don't guarantee, per the LangRef, is sequential consistency. If you care about that, you need to use explicit fencing.

--Owen
On Mon, Jan 4, 2010 at 12:20 AM, Owen Anderson <resistor at mac.com> wrote:

> On Jan 3, 2010, at 10:10 PM, James Y Knight wrote:
>
>> In working on an LLVM backend for SBCL (a lisp compiler), there are
>> certain sequences of code that must be atomic with regards to async
>> signals. So, for example, on x86, a single SUB on a memory location
>> should be used, not a load/sub/store sequence. LLVM's IR doesn't
>> currently have any way to express this kind of constraint (...and
>> really, that's essentially impossible since different architectures
>> have different possibilities, so I'm not asking for this...).
>
> Why do you want to do this? As far as I'm aware, there's no guarantee
> that a memory-memory SUB will be observed atomically across all
> processors. Remember that most processors are going to be breaking x86
> instructions up into micro-ops, which might get reordered/interleaved
> in any number of different ways.

I'm assuming 'memory-memory' there is a typo, and we're just talking about a 'sub' instruction with a memory destination. In that case, I'll go further: the Intel IA-32 manual explicitly tells you that x86 processors are allowed to do the read and write halves of that single instruction interleaved with other writes to that memory location from other processors (see section 8.2.3.1 of [1]). =[ I can tell you from bitter experience debugging code that assumed this was atomic: it does in fact happen. I have watched reference counters miss both increments and decrements because of it, on both Intel and AMD systems.

>> All I really would like is to be able to specify the exact instruction
>> sequence to emit there. I'd hoped that inline asm would be the way to
>> do so, but LLVM doesn't appear to support asm output when using the
>> JIT compiler. Is there any hope for inline asm being supported with
>> the JIT anytime soon? Or is there an alternative suggested way of
>> doing this?
>> I'm using llvm.atomic.load.sub.i64.p0i64 for the moment,
>> but that's both more expensive than I need as it has an unnecessary
>> LOCK prefix, and is also theoretically incorrect.

As I've mentioned above, I assure you the LOCK prefix matters. The strange thing is that you think this is inefficient. Modern processors don't lock the bus given this prefix to a 'sub' instruction; they just lock the cache line and use the coherency model to resolve the issue. This is much cheaper than, say, an 'xchg' instruction on an x86 processor. What is the performance problem you are actually trying to solve here?

> What they don't guarantee per the LangRef is sequential consistency.
> If you care about that, you need to use explicit fencing.

Side note: I regret greatly that I didn't know enough of the sequential consistency concerns here to address them more fully when I was working on this. =/ Even explicit fencing has subtle problems with it as currently specified. Is this causing problems for people (other than jyasskin, who clued me in on the whole matter)?
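A concrete illustration of why the LOCK'd read-modify-write matters across threads — sketched in the (later) C11 atomics, which are anachronistic for this 2010 thread but capture exactly the semantics of llvm.atomic.load.sub; all names here are illustrative:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* atomic_fetch_sub compiles on x86 to a lock-prefixed instruction,
   which locks the cache line (not the bus).  With two threads
   hammering one counter, no decrement is ever lost: the final value
   is exact on every run.  A plain non-atomic sub-to-memory gives no
   such guarantee. */

#define ITERS 100000

static _Atomic long counter = 2 * ITERS;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_sub(&counter, 1);  /* atomic read-modify-write */
    return NULL;
}

long run_counter_demo(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&counter);
}
```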
Responding to the original email...

On Sun, Jan 3, 2010 at 10:10 PM, James Y Knight <foom at fuhm.net> wrote:

> In working on an LLVM backend for SBCL (a lisp compiler), there are
> certain sequences of code that must be atomic with regards to async
> signals.

Can you define exactly what 'atomic with regards to async signals' entails? Your descriptions led me to think you may mean something other than the POSIX definition, but maybe I'm just misinterpreting it. Are these signals guaranteed to run in the same thread? On the same processor? Is there concurrent code running in the address space when they run?

<snip, this seems to be well handled on sibling email...>

> Additionally, I think there will be some situations where a particular
> ordering of memory operations is required. LLVM makes no guarantees
> about the order of stores, unless there's some way that you could tell
> the difference in a linear program. Unfortunately, I don't have a
> linear program, I have a program which can run signal handlers between
> arbitrary instructions. So, I think I'll need something like an
> llvm.memory.barrier of type "ss", except only affecting the codegen,
> not actually inserting a processor memory barrier.

The processor can reorder memory operations as well (within limits). Consider that 'memset' to zero is often codegened to a non-temporal store to memory. This exempts it from all ordering considerations except for an explicit memory fence in the processor. If code were to execute between those two instructions, the contents of the memory could read "andthenumberofcountingshallbethree", or 'feedbeef', or '0000...', or '1111...'; there's just no telling.

> Is there already some way to insert a codegen-barrier with no
> additional runtime cost (beyond the opportunity-cost of not being able
> to reorder/delete stores across the barrier)? If not, can such a thing
> be added?
> On x86, this is a non-issue, since the processor already
> implicitly has inter-processor store-store barriers, so using:
>   call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)
> is fine: it's a noop at runtime but ensures the correct sequence of
> stores...but I'm thinking ahead here to other architectures where that
> would actually require expensive instructions to be emitted.

But... if it *did* require expensive instructions, wouldn't you want them?!?! The reason we don't emit them on x86 is because of its memory ordering guarantees. If it didn't have them, we would emit instructions to impose an ordering, because otherwise the wrong thing might happen. I think you should trust LLVM to only emit expensive instructions to achieve the ordering semantics you specify when they are necessary for the architecture, and file bugs if it ever fails.

The only useful thing I can think of is if you happen to know that you execute on some "uniprocessor" with at most one thread of execution, and thus gain memory ordering constraints beyond those which can be assumed across an entire architecture (this is certainly true for x86). If it is useful to leverage this to optimize codegen, it should be at the target level, with some target options to specify that consistency assumptions should be stronger than normal. The intrinsics and their semantics should remain the same regardless.
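The distinction being requested here — an ordering that binds the compiler but not the processor — is, in later C11 terms, exactly the difference between atomic_signal_fence and atomic_thread_fence. A minimal sketch (names and the publish example are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

static volatile int data = 0;
static volatile int flag = 0;

/* atomic_signal_fence constrains only the compiler: it emits no
   instructions, but stores may not migrate across it, which is
   sufficient for a handler running in the same thread.  Swapping it
   for atomic_thread_fence with the same ordering may emit a real
   fence instruction (mfence, dmb, ...) on weakly ordered targets --
   the "expensive instructions" under discussion. */
void publish_for_signal_handler(int value) {
    data = value;
    atomic_signal_fence(memory_order_seq_cst);  /* codegen-only barrier */
    flag = 1;   /* a same-thread handler seeing flag==1 may read data */
}
```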
On Jan 4, 2010, at 4:35 AM, Chandler Carruth wrote:

> Responding to the original email...
>
> On Sun, Jan 3, 2010 at 10:10 PM, James Y Knight <foom at fuhm.net> wrote:
>> In working on an LLVM backend for SBCL (a lisp compiler), there are
>> certain sequences of code that must be atomic with regards to async
>> signals.
>
> Can you define exactly what 'atomic with regards to async signals'
> entails? Your descriptions led me to think you may mean something
> other than the POSIX definition, but maybe I'm just misinterpreting
> it. Are these signals guaranteed to run in the same thread? On the
> same processor? Is there concurrent code running in the address space
> when they run?

Hi, thanks everyone for all the comments. I think maybe I wasn't clear that I *only* care about atomicity w.r.t. a signal handler interruption in the same thread, *not* across threads. Therefore, many of the problems of cross-CPU atomicity are not relevant. The signal handler gets invoked via pthread_kill, and is thus necessarily running in the same thread as the code being interrupted. The memory in question can be considered thread-local here, so I'm not worried about other threads touching it at all.

I also realize I had (at least :) one error in my original email: of course, the atomic operations LLVM provides *are* guaranteed to do the right thing w.r.t. atomicity against signal handlers... they in fact just do more than I need, not less. I'm not sure why I thought they were both more and less than I needed before, and sorry if it confused you about what I'm trying to accomplish.
Here's a concrete example, in hopes it will clarify matters:

  @pseudo_atomic = thread_local global i64 0

  declare i64* @alloc(i64)
  declare void @do_pending_interrupt()
  declare i64 @llvm.atomic.load.sub.i64.p0i64(i64* nocapture, i64) nounwind
  declare void @llvm.memory.barrier(i1, i1, i1, i1, i1)

  define i64* @foo() {
    ;; Note that we're in an allocation section
    store i64 1, i64* @pseudo_atomic
    ;; Barrier only to ensure instruction ordering, not needed as a true memory barrier
    call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)

    ;; Call might actually be inlined, so cannot depend upon an unknown call causing correct codegen effects.
    %obj = call i64* @alloc(i64 32)
    %obj_header = getelementptr i64* %obj, i64 0
    store i64 5, i64* %obj_header   ;; store obj type (5) in header word
    %obj_len = getelementptr i64* %obj, i64 1
    store i64 2, i64* %obj_len      ;; store obj length (2) in length slot
    ...etc...

    ;; Check if we were interrupted:
    %res = call i64 @llvm.atomic.load.sub.i64.p0i64(i64* @pseudo_atomic, i64 1)
    %was_interrupted = icmp eq i64 %res, 1
    br i1 %was_interrupted, label %do-interruption, label %continue

  continue:
    ret i64* %obj

  do-interruption:
    call void @do_pending_interrupt()
    br label %continue
  }

A signal handler will check the thread-local @pseudo_atomic variable: if it was already set, it will just change the value to 2 and return, waiting to be reinvoked by do_pending_interrupt at the end of the pseudo-atomic section. This is because it may get confused by the proto-object being built up in this code.

The sequence SBCL emits today with its internal codegen is basically like:

    MOV <pseudo_atomic>, 1
    [[do allocation, fill in object, etc]]
    XOR <pseudo_atomic>, 1
    JEQ continue
    <<call do_pending_interrupt>>
  continue:
    ...

The important things here are:

1) Stores cannot be migrated from within the MOV/XOR instructions to outside by the codegen.
2) There's no way an interruption can be missed: the XOR is atomic with regards to signals executing in the same thread; it's either fully executed or not (both load+store). But I don't care whether it's visible on other CPUs or not: it's a thread-local variable in any case.

Those are the two properties I'd like to get from LLVM, without actually ever invoking superfluous processor synchronization.

> The processor can reorder memory operations as well (within limits).
> Consider that 'memset' to zero is often codegened to a non-temporal
> store to memory. This exempts it from all ordering considerations

My understanding is that processor reordering only affects what you might see from another CPU: the processor will undo speculatively executed operations if the sequence of instructions actually executed is not the sequence it predicted, so within a single CPU you should never be able to tell the difference. But I must admit I don't know anything about non-temporal stores. Within a single thread, if I do a non-temporal store, followed by a load, am I not guaranteed to get back the value I stored?

James
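The whole protocol above can be sketched as a runnable C program, using raise() to deliver the signal synchronously in the same thread (all names are illustrative stand-ins, not SBCL's actual runtime API, and the flag-check at exit is two instructions here rather than the single atomic XOR the thread is asking for):

```c
#include <assert.h>
#include <signal.h>

static volatile sig_atomic_t pa_flag = 0;         /* pseudo_atomic */
static volatile sig_atomic_t deferred = 0;
static volatile sig_atomic_t handled_inline = 0;

static void handler(int sig) {
    (void)sig;
    if (pa_flag)
        pa_flag = 2;          /* mid-allocation: defer until section end */
    else
        handled_inline = 1;   /* safe to act immediately */
}

int pseudo_atomic_demo(void) {
    signal(SIGUSR1, handler);
    pa_flag = 1;                        /* enter pseudo-atomic section */
    __asm__ volatile("" ::: "memory");  /* codegen-only barrier */
    /* ... object would be allocated and filled in here ... */
    raise(SIGUSR1);                     /* signal arrives mid-section */
    __asm__ volatile("" ::: "memory");
    if (pa_flag == 2)                   /* stands in for the XOR+JEQ test */
        deferred = 1;                   /* do_pending_interrupt() runs here */
    pa_flag = 0;
    return deferred;
}
```

Because raise() delivers to the calling thread before returning, the handler always observes pa_flag == 1, so the demo deterministically takes the deferred path.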