On Jan 4, 2010, at 4:35 AM, Chandler Carruth wrote:

> Responding to the original email...
>
> On Sun, Jan 3, 2010 at 10:10 PM, James Y Knight <foom at fuhm.net> wrote:
>> In working on an LLVM backend for SBCL (a lisp compiler), there are
>> certain sequences of code that must be atomic with regards to async
>> signals.
>
> Can you define exactly what 'atomic with regards to async signals'
> entails? Your descriptions led me to think you may mean something
> other than the POSIX definition, but maybe I'm just misinterpreting
> it. Are these signals guaranteed to run in the same thread? On the
> same processor? Is there concurrent code running in the address space
> when they run?

Hi, thanks everyone for all the comments. I think maybe I wasn't clear that I *only* care about atomicity w.r.t. a signal handler interruption in the same thread, *not* across threads. Therefore, many of the problems of cross-CPU atomicity are not relevant. The signal handler gets invoked via pthread_kill, and is thus necessarily running in the same thread as the code being interrupted. The memory in question can be considered thread-local here, so I'm not worried about other threads touching it at all.

I also realize I had (at least :) one error in my original email: of course, the atomic operations LLVM provides *are* guaranteed to do the right thing w.r.t. atomicity against signal handlers... they in fact just do more than I need, not less. I'm not sure why I thought they were both more and less than I needed before; sorry if that confused you about what I'm trying to accomplish.

Here's a concrete example, in hopes it will clarify matters:

@pseudo_atomic = thread_local global i64 0

declare i64* @alloc(i64)
declare void @do_pending_interrupt()
declare i64 @llvm.atomic.load.sub.i64.p0i64(i64* nocapture, i64) nounwind
declare void @llvm.memory.barrier(i1, i1, i1, i1, i1)

define i64* @foo() {
  ;; Note that we're in an allocation section
  store i64 1, i64* @pseudo_atomic
  ;; Barrier only to ensure instruction ordering; not needed as a true memory barrier
  call void @llvm.memory.barrier(i1 0, i1 0, i1 0, i1 1, i1 0)

  ;; The call might actually be inlined, so we cannot depend on an unknown
  ;; call causing the correct codegen effects.
  %obj = call i64* @alloc(i64 32)
  %obj_header = getelementptr i64* %obj, i64 0
  store i64 5, i64* %obj_header  ;; store obj type (5) in header word
  %obj_len = getelementptr i64* %obj, i64 1
  store i64 2, i64* %obj_len     ;; store obj length (2) in length slot
  ...etc...

  ;; Check if we were interrupted. load.sub yields the *original* value,
  ;; so a result of 2 means the signal handler ran during the section.
  %res = call i64 @llvm.atomic.load.sub.i64.p0i64(i64* @pseudo_atomic, i64 1)
  %was_interrupted = icmp eq i64 %res, 2
  br i1 %was_interrupted, label %do-interruption, label %continue

continue:
  ret i64* %obj

do-interruption:
  call void @do_pending_interrupt()
  br label %continue
}

A signal handler will check the thread-local @pseudo_atomic variable: if it is already set, the handler just changes the value to 2 and returns, waiting to be reinvoked by do_pending_interrupt at the end of the pseudo-atomic section. This is because it may get confused by the proto-object being built up in this code.

The sequence SBCL emits today with its internal codegen is basically:

  MOV <pseudo_atomic>, 1
  [[do allocation, fill in object, etc]]
  XOR <pseudo_atomic>, 1
  JEQ continue
  <<call do_pending_interrupt>>
continue:
  ...

The important things here are:
1) Stores cannot be migrated from within the MOV/XOR instructions to outside by the codegen.
2) There's no way an interruption can be missed: the XOR is atomic with regards to signals executing in the same thread; it's either fully executed or not (both load+store). But I don't care whether it's visible on other CPUs or not: it's a thread-local variable in any case.

Those are the two properties I'd like to get from LLVM, without ever invoking superfluous processor synchronization.

> The processor can reorder memory operations as well (within limits).
> Consider that 'memset' to zero is often codegened to a non-temporal
> store to memory. This exempts it from all ordering considerations.

My understanding is that processor reordering only affects what you might see from another CPU: the processor will undo speculatively executed operations if the sequence of instructions actually executed is not the sequence it predicted, so within a single CPU you should never be able to tell the difference.

But I must admit I don't know anything about non-temporal stores. Within a single thread, if I do a non-temporal store, followed by a load, am I not guaranteed to get back the value I stored?

James
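(For concreteness, here is a rough C sketch of the pseudo-atomic protocol James describes. It is a reconstruction from the description above, not SBCL's actual code: do_pending_interrupt and handle_interrupt_now are assumed helper names, and a GCC __sync builtin stands in for the single XOR instruction.)

#include <signal.h>

static __thread volatile sig_atomic_t pseudo_atomic = 0;

void do_pending_interrupt(void);   /* assumed: re-delivers the deferred signal */
void handle_interrupt_now(void);   /* assumed: performs the real interrupt work */

/* Runs in the same thread, delivered via pthread_kill. */
void signal_handler(int sig) {
    if (pseudo_atomic != 0) {
        pseudo_atomic = 2;         /* defer: a proto-object is being built */
        return;
    }
    handle_interrupt_now();
}

void *allocate(void) {
    pseudo_atomic = 1;                      /* MOV <pseudo_atomic>, 1 */
    __asm__ __volatile__("" ::: "memory");  /* compiler-only barrier */

    void *obj = 0;  /* ... allocate and fill in the proto-object here ... */

    __asm__ __volatile__("" ::: "memory");
    /* Atomic check-and-clear, like the XOR: the old value is 2 exactly when
       the handler ran during the section.  A separate read-then-write would
       leave a window in which an interruption could be missed. */
    if (__sync_fetch_and_sub(&pseudo_atomic, 1) == 2) {
        pseudo_atomic = 0;
        do_pending_interrupt();
    }
    return obj;
}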
On Mon, Jan 4, 2010 at 1:13 PM, James Y Knight <foom at fuhm.net> wrote:

> Hi, thanks everyone for all the comments. I think maybe I wasn't clear that
> I *only* care about atomicity w.r.t. a signal handler interruption in the
> same thread, *not* across threads. Therefore, many of the problems of
> cross-CPU atomicity are not relevant. The signal handler gets invoked via
> pthread_kill, and is thus necessarily running in the same thread as the
> code being interrupted. The memory in question can be considered
> thread-local here, so I'm not worried about other threads touching it at
> all.

Ok, this helps make sense, but it is still confusing to phrase this as "single threaded". While the signal handler code may execute to the exclusion of any other code, it does not share the stack frame, etc. I'd describe this more as two threads of mutually exclusive execution or some such. I'm not familiar with what synchronization occurs as part of the interrupt process, but I'd verify it before making too many assumptions.

> This sequence that SBCL does today with its internal codegen is basically
> like:
>   MOV <pseudo_atomic>, 1
>   [[do allocation, fill in object, etc]]
>   XOR <pseudo_atomic>, 1
>   JEQ continue
>   <<call do_pending_interrupt>>
> continue:
>   ...
>
> The important things here are:
> 1) Stores cannot be migrated from within the MOV/XOR instructions to
> outside by the codegen.

Basically, the problem is merely that x86 places a stricter requirement on memory ordering than LLVM does: where x86 requires that stores occur in program order, LLVM reserves the right to change that. I have no idea whether it is worthwhile to support memory barriers solely within the flow of execution, but it seems highly suspicious. On at least some non-x86 architectures, I suspect you'll need a memory barrier here anyway, so it seems reasonable to place one regardless. I *highly* doubt these fences are an overriding performance concern on x86; do you have any benchmarks that indicate they are?

> 2) There's no way an interruption can be missed: the XOR is atomic with
> regards to signals executing in the same thread; it's either fully
> executed or not (both load+store). But I don't care whether it's visible
> on other CPUs or not: it's a thread-local variable in any case.
>
> Those are the two properties I'd like to get from LLVM, without ever
> invoking superfluous processor synchronization.

Before we start extending LLVM to express the finest points of the x86 memory model in an optimal fashion given a single thread of execution, I'd really need to see some compelling benchmarks showing that this is a major performance problem. My understanding of the implementation of these aspects of the x86 architecture is that they shouldn't have a particularly high overhead.

>> The processor can reorder memory operations as well (within limits).
>> Consider that 'memset' to zero is often codegened to a non-temporal
>> store to memory. This exempts it from all ordering considerations.
>
> My understanding is that processor reordering only affects what you might
> see from another CPU: the processor will undo speculatively executed
> operations if the sequence of instructions actually executed is not the
> sequence it predicted, so within a single CPU you should never be able to
> tell the difference.
>
> But I must admit I don't know anything about non-temporal stores.
> Within a single thread, if I do a non-temporal store, followed by a load,
> am I not guaranteed to get back the value I stored?

If you read the *same address*, then the ordering is guaranteed, but the Intel documentation specifically exempts these instructions from the general rule that writes will not be reordered with other writes. This means that a non-temporal store might be reordered to occur after the "xor" to your atomic integer, even if the instruction came prior to the xor.

> James
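(To illustrate the non-temporal store behavior Chandler describes, here is a small sketch using the SSE2 intrinsics. Whether a compiler actually lowers a given memset this way depends on the target and library; the fence shown is what restores ordering against later ordinary stores, such as the XOR to the pseudo-atomic flag.)

#include <emmintrin.h>  /* _mm_stream_si128 (SSE2); _mm_sfence comes along from SSE */

/* Zero a block with non-temporal (streaming) stores, as a memset-to-zero
   lowering might.  dst must be 16-byte aligned.  Streaming stores are
   weakly ordered: without the sfence, they may become visible to other
   processors after later ordinary stores. */
void clear_block(__m128i *dst, int n) {
    __m128i zero = _mm_setzero_si128();
    for (int i = 0; i < n; ++i)
        _mm_stream_si128(&dst[i], zero);
    _mm_sfence();
}

(As the next message argues, this exemption concerns cross-processor visibility; for the same-thread signal case, compiler reordering is the remaining concern.)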
On Mon, Jan 4, 2010 at 8:43 PM, Chandler Carruth <chandlerc at google.com> wrote:

> On Mon, Jan 4, 2010 at 1:13 PM, James Y Knight <foom at fuhm.net> wrote:
>> Hi, thanks everyone for all the comments. I think maybe I wasn't clear
>> that I *only* care about atomicity w.r.t. a signal handler interruption
>> in the same thread, *not* across threads. Therefore, many of the problems
>> of cross-CPU atomicity are not relevant. The signal handler gets invoked
>> via pthread_kill, and is thus necessarily running in the same thread as
>> the code being interrupted. The memory in question can be considered
>> thread-local here, so I'm not worried about other threads touching it at
>> all.
>
> Ok, this helps make sense, but it is still confusing to phrase this as
> "single threaded". While the signal handler code may execute to the
> exclusion of any other code, it does not share the stack frame, etc. I'd
> describe this more as two threads of mutually exclusive execution or some
> such.

I'm pretty sure James's way of describing it is accurate. It's a single thread with an asynchronous signal, and C allows things in that situation that it disallows for the multi-threaded case. In particular, global objects of type "volatile sig_atomic_t" can be read and written between signal handlers in a thread and that thread's main control flow without locking. C++0x also defines an atomic_signal_fence(memory_order) that only synchronizes with signal handlers, in addition to the atomic_thread_fence(memory_order) that synchronizes with other threads. See [atomics.fences].

> I'm not familiar with what synchronization occurs as part of the interrupt
> process, but I'd verify it before making too many assumptions.
>
>> This sequence that SBCL does today with its internal codegen is basically
>> like:
>>   MOV <pseudo_atomic>, 1
>>   [[do allocation, fill in object, etc]]
>>   XOR <pseudo_atomic>, 1
>>   JEQ continue
>>   <<call do_pending_interrupt>>
>> continue:
>>   ...
>>
>> The important things here are:
>> 1) Stores cannot be migrated from within the MOV/XOR instructions to
>> outside by the codegen.
>
> Basically, the problem is merely that x86 places a stricter requirement on
> memory ordering than LLVM does: where x86 requires that stores occur in
> program order, LLVM reserves the right to change that. I have no idea
> whether it is worthwhile to support memory barriers solely within the flow
> of execution, but it seems highly suspicious.

It's needed to support std::atomic_signal_fence. gcc will initially implement that with asm volatile("":::"memory"), but as James points out, that kills the JIT, and probably will keep doing so until llvm-mc is finished or someone implements a special case for it.

> On at least some non-x86 architectures, I suspect you'll need a memory
> barrier here anyway, so it seems reasonable to place one regardless. I
> *highly* doubt these fences are an overriding performance concern on x86;
> do you have any benchmarks that indicate they are?

Memory fences are as expensive as atomic operations on x86 (quite expensive), but you're right that benchmarks are a good idea anyway.

>> 2) There's no way an interruption can be missed: the XOR is atomic with
>> regards to signals executing in the same thread; it's either fully
>> executed or not (both load+store). But I don't care whether it's visible
>> on other CPUs or not: it's a thread-local variable in any case.
>>
>> Those are the two properties I'd like to get from LLVM, without ever
>> invoking superfluous processor synchronization.
> Before we start extending LLVM to express the finest points of the x86
> memory model in an optimal fashion given a single thread of execution, I'd
> really need to see some compelling benchmarks showing that this is a major
> performance problem. My understanding of the implementation of these
> aspects of the x86 architecture is that they shouldn't have a particularly
> high overhead.
>
>>> The processor can reorder memory operations as well (within limits).
>>> Consider that 'memset' to zero is often codegened to a non-temporal
>>> store to memory. This exempts it from all ordering considerations.
>>
>> My understanding is that processor reordering only affects what you might
>> see from another CPU: the processor will undo speculatively executed
>> operations if the sequence of instructions actually executed is not the
>> sequence it predicted, so within a single CPU you should never be able to
>> tell the difference.
>>
>> But I must admit I don't know anything about non-temporal stores. Within
>> a single thread, if I do a non-temporal store, followed by a load, am I
>> not guaranteed to get back the value I stored?
>
> If you read the *same address*, then the ordering is guaranteed, but the
> Intel documentation specifically exempts these instructions from the
> general rule that writes will not be reordered with other writes. This
> means that a non-temporal store might be reordered to occur after the
> "xor" to your atomic integer, even if the instruction came prior to the
> xor.

It exempts those instructions from the cross-processor guarantees, but I don't see anything saying that, for example, a temporal store appearing later in a single processor's instruction stream may be overwritten by an earlier non-temporal store. Do you see something I'm missing? If not, then for single-thread signals, I think it's only compiler reordering James has to worry about.
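(The facilities mentioned above, volatile sig_atomic_t plus a signal fence, later shipped in C11 as well as C++0x. Here is a minimal sketch of the idiom for the same-thread-handler case James describes; the function names are illustrative. On x86 the signal fence constrains only the compiler and emits no fence instruction.)

#include <signal.h>
#include <stdatomic.h>

static volatile sig_atomic_t flag = 0;  /* shared with a same-thread handler */
static int payload;                     /* data published to the handler */

void publish(int value) {
    payload = value;
    /* Keep the payload store ordered before the flag store with respect to
       a signal handler running in this thread. */
    atomic_signal_fence(memory_order_release);
    flag = 1;
}

void on_signal(int sig) {
    if (flag != 0) {
        atomic_signal_fence(memory_order_acquire);
        (void)payload;  /* payload is now safe to read in this handler */
    }
}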