On Mon, Jan 4, 2010 at 8:43 PM, Chandler Carruth <chandlerc at google.com> wrote:
> On Mon, Jan 4, 2010 at 1:13 PM, James Y Knight <foom at fuhm.net> wrote:
>> Hi, thanks everyone for all the comments. I think maybe I wasn't clear
>> that I *only* care about atomicity w.r.t. a signal handler interruption
>> in the same thread, *not* across threads. Therefore, many of the problems
>> of cross-CPU atomicity are not relevant. The signal handler gets invoked
>> via pthread_kill, and is thus necessarily running in the same thread as
>> the code being interrupted. The memory in question can be considered
>> thread-local here, so I'm not worried about other threads touching it at
>> all.
>
> Ok, this helps make sense, but it still is confusing to phrase this as
> "single threaded". While the signal handler code may execute
> exclusively to any other code, it does not share the stack frame, etc.
> I'd describe this more as two threads of mutually exclusive execution
> or some such.

I'm pretty sure James's way of describing it is accurate. It's a
single thread with an asynchronous signal, and C allows things in that
situation that it disallows for the multi-threaded case. In
particular, global objects of type "volatile sig_atomic_t" can be read
and written between signal handlers in a thread and that thread's main
control flow without locking. C++0x also defines an
atomic_signal_fence(memory_order) that only synchronizes with signal
handlers, in addition to the atomic_thread_fence(memory_order) that
synchronizes with other threads. See [atomics.fences].

> I'm not familiar with what synchronization occurs as
> part of the interrupt process, but I'd verify it before making too
> many assumptions.
>
>> This sequence that SBCL does today with its internal codegen is
>> basically like:
>>
>>   MOV <pseudo_atomic>, 1
>>   [[do allocation, fill in object, etc]]
>>   XOR <pseudo_atomic>, 1
>>   JEQ continue
>>   <<call do_pending_interrupt>>
>>   continue:
>>   ...
>>
>> The important things here are:
>> 1) Stores cannot be migrated from within the MOV/XOR instructions to
>> outside by the codegen.
>
> Basically, this is merely the problem that x86 places a stricter
> requirement on memory ordering than LLVM. Where x86 requires that
> stores occur in program order, LLVM reserves the right to change that.
> I have no idea if it is worthwhile to support memory barriers solely
> within the flow of execution, but it seems highly suspicious.

It's needed to support std::atomic_signal_fence. gcc will initially
implement that with

  asm volatile("":::"memory")

but as James points out, that kills the JIT, and probably will keep
doing so until llvm-mc is finished or someone implements a special
case for it.

> On at least some non-x86 architectures, I suspect you'll need a memory
> barrier here anyways, so it seems reasonable to place one anyways. I
> *highly* doubt these fences are an overriding performance concern on
> x86, do you have any benchmarks that indicate they are?

Memory fences are as expensive as atomic operations on x86 (quite
expensive), but you're right that benchmarks are a good idea anyway.

>> 2) There's no way an interruption can be missed: the XOR is atomic with
>> regards to signals executing in the same thread, it's either fully
>> executed or not (both load+store). But I don't care whether it's
>> visible on other CPUs or not: it's a thread-local variable in any case.
>>
>> Those are the two properties I'd like to get from LLVM, without
>> actually ever invoking superfluous processor synchronization.
>
> Before we start extending LLVM to support expressing the finest points
> of the x86 memory model in an optimal fashion given a single thread of
> execution, I'd really need to see some compelling benchmarks that it
> is a major performance problem. My understanding of the implementation
> of these aspects of the x86 architecture is that they shouldn't have a
> particularly high overhead.
>
>>> The processor can reorder memory operations as well (within limits).
>>> Consider that 'memset' to zero is often codegened to a non-temporal
>>> store to memory. This exempts it from all ordering considerations.
>>
>> My understanding is that processor reordering only affects what you
>> might see from another CPU: the processor will undo speculatively
>> executed operations if the sequence of instructions actually executed
>> is not the sequence it predicted, so within a single CPU you should
>> never be able to tell the difference.
>>
>> But I must admit I don't know anything about non-temporal stores.
>> Within a single thread, if I do a non-temporal store, followed by a
>> load, am I not guaranteed to get back the value I stored?
>
> If you read the *same address*, then the ordering is guaranteed, but
> the Intel documentation specifically exempts these instructions from
> the general rule that writes will not be reordered with other writes.
> This means that a non-temporal store might be reordered to occur after
> the "xor" to your atomic integer, even if the instruction came prior
> to the xor.

It exempts these instructions from the cross-processor guarantees, but
I don't see anything saying that, for example, a temporal store in a
single processor's instruction stream after a non-temporal store may
be overwritten by the non-temporal store. Do you see something I'm
missing? If not, for single-thread signals, I think it's only compiler
reordering James has to worry about.
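To make the pseudo-atomic pattern concrete, here is a minimal C++ sketch
of it using the C++0x fences mentioned above. The names pseudo_atomic,
interrupt_pending, and do_pending_interrupt are illustrative; SBCL emits
the machine code directly, and its real XOR trick makes the clear-and-test
a single instruction, which the two-step version below does not:

  #include <atomic>
  #include <csignal>

  static volatile sig_atomic_t pseudo_atomic = 0;      // "in allocation" flag
  static volatile sig_atomic_t interrupt_pending = 0;  // set by the handler

  void do_pending_interrupt();  // hypothetical deferred-interrupt routine

  extern "C" void handler(int) {
      if (pseudo_atomic)
          interrupt_pending = 1;  // defer: we interrupted the critical region
      // ... otherwise service the interrupt immediately ...
  }

  void allocate_object() {
      pseudo_atomic = 1;
      std::atomic_signal_fence(std::memory_order_seq_cst);
      // ... do allocation, fill in object, etc ...
      std::atomic_signal_fence(std::memory_order_seq_cst);
      pseudo_atomic = 0;
      if (interrupt_pending) {      // did a signal arrive meanwhile?
          interrupt_pending = 0;
          do_pending_interrupt();
      }
  }

The signal fences only keep the compiler from migrating the allocation's
stores outside the flag writes; no processor barrier need be emitted for
them on x86.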
On Mon, Jan 4, 2010 at 8:51 PM, Jeffrey Yasskin <jyasskin at google.com> wrote:
> On Mon, Jan 4, 2010 at 8:43 PM, Chandler Carruth <chandlerc at google.com> wrote:
>> On Mon, Jan 4, 2010 at 1:13 PM, James Y Knight <foom at fuhm.net> wrote:
>>> Hi, thanks everyone for all the comments. I think maybe I wasn't clear
>>> that I *only* care about atomicity w.r.t. a signal handler interruption
>>> in the same thread, *not* across threads. Therefore, many of the
>>> problems of cross-CPU atomicity are not relevant. The signal handler
>>> gets invoked via pthread_kill, and is thus necessarily running in the
>>> same thread as the code being interrupted. The memory in question can
>>> be considered thread-local here, so I'm not worried about other threads
>>> touching it at all.
>>
>> Ok, this helps make sense, but it still is confusing to phrase this as
>> "single threaded". While the signal handler code may execute
>> exclusively to any other code, it does not share the stack frame, etc.
>> I'd describe this more as two threads of mutually exclusive execution
>> or some such.
>
> I'm pretty sure James's way of describing it is accurate. It's a
> single thread with an asynchronous signal, and C allows things in that
> situation that it disallows for the multi-threaded case. In
> particular, global objects of type "volatile sig_atomic_t" can be read
> and written between signal handlers in a thread and that thread's main
> control flow without locking. C++0x also defines an
> atomic_signal_fence(memory_order) that only synchronizes with signal
> handlers, in addition to the atomic_thread_fence(memory_order) that
> synchronizes with other threads. See [atomics.fences].

Very interesting, and thanks for the clarifications. I'm not
particularly familiar with either of those parts of C or C++0x,
although it's on the list... =D

>> I'm not familiar with what synchronization occurs as
>> part of the interrupt process, but I'd verify it before making too
>> many assumptions.
>>
>>> This sequence that SBCL does today with its internal codegen is
>>> basically like:
>>>
>>>   MOV <pseudo_atomic>, 1
>>>   [[do allocation, fill in object, etc]]
>>>   XOR <pseudo_atomic>, 1
>>>   JEQ continue
>>>   <<call do_pending_interrupt>>
>>>   continue:
>>>   ...
>>>
>>> The important things here are:
>>> 1) Stores cannot be migrated from within the MOV/XOR instructions to
>>> outside by the codegen.
>>
>> Basically, this is merely the problem that x86 places a stricter
>> requirement on memory ordering than LLVM. Where x86 requires that
>> stores occur in program order, LLVM reserves the right to change that.
>> I have no idea if it is worthwhile to support memory barriers solely
>> within the flow of execution, but it seems highly suspicious.
>
> It's needed to support std::atomic_signal_fence. gcc will initially
> implement that with
>
>   asm volatile("":::"memory")
>
> but as James points out, that kills the JIT, and probably will keep
> doing so until llvm-mc is finished or someone implements a special
> case for it.

Want to propose an extension to the current atomics of LLVM? Could we
potentially clarify your previous concern regarding the pairing of
barriers to operations, as it seems like they would involve related
bits of the lang ref? Happy to work with you on that sometime this Q
if you're interested; I'll certainly have more time. =]

>> On at least some non-x86 architectures, I suspect you'll need a memory
>> barrier here anyways, so it seems reasonable to place one anyways.
>> I *highly* doubt these fences are an overriding performance concern on
>> x86, do you have any benchmarks that indicate they are?
>
> Memory fences are as expensive as atomic operations on x86 (quite
> expensive), but you're right that benchmarks are a good idea anyway.
>
>>> 2) There's no way an interruption can be missed: the XOR is atomic
>>> with regards to signals executing in the same thread, it's either
>>> fully executed or not (both load+store). But I don't care whether it's
>>> visible on other CPUs or not: it's a thread-local variable in any case.
>>>
>>> Those are the two properties I'd like to get from LLVM, without
>>> actually ever invoking superfluous processor synchronization.
>>
>> Before we start extending LLVM to support expressing the finest points
>> of the x86 memory model in an optimal fashion given a single thread of
>> execution, I'd really need to see some compelling benchmarks that it
>> is a major performance problem. My understanding of the implementation
>> of these aspects of the x86 architecture is that they shouldn't have a
>> particularly high overhead.
>>
>>>> The processor can reorder memory operations as well (within limits).
>>>> Consider that 'memset' to zero is often codegened to a non-temporal
>>>> store to memory. This exempts it from all ordering considerations.
>>>
>>> My understanding is that processor reordering only affects what you
>>> might see from another CPU: the processor will undo speculatively
>>> executed operations if the sequence of instructions actually executed
>>> is not the sequence it predicted, so within a single CPU you should
>>> never be able to tell the difference.
>>>
>>> But I must admit I don't know anything about non-temporal stores.
>>> Within a single thread, if I do a non-temporal store, followed by a
>>> load, am I not guaranteed to get back the value I stored?
>>
>> If you read the *same address*, then the ordering is guaranteed, but
>> the Intel documentation specifically exempts these instructions from
>> the general rule that writes will not be reordered with other writes.
>> This means that a non-temporal store might be reordered to occur after
>> the "xor" to your atomic integer, even if the instruction came prior
>> to the xor.
>
> It exempts these instructions from the cross-processor guarantees, but
> I don't see anything saying that, for example, a temporal store in a
> single processor's instruction stream after a non-temporal store may
> be overwritten by the non-temporal store. Do you see something I'm
> missing? If not, for single-thread signals, I think it's only compiler
> reordering James has to worry about.

The exemption I'm referring to (Section 8.2.2 of the System Programming
Guide from Intel) is to the write-write ordering of the
*single-processor* model. Reading the referenced section on the
non-temporal behavior of these instructions (10.4.6 of volume 1 of the
architecture manual) doesn't entirely clarify the matter for me either.
It specifically says that the non-temporal writes may occur outside of
program order, but doesn't seem to clarify exactly what the result of
overlapping temporal writes is without fences within the same program
thread. The only examples I'm finding are for multiprocessor
scenarios. =/
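For concreteness, the single-processor question left open above can be
put in code. This is a sketch using SSE2 intrinsics; the variable names
are illustrative and nothing here is quoted from the Intel manuals:

  #include <emmintrin.h>  // SSE2: _mm_stream_si32 (movnti), _mm_sfence

  int v;

  void unclear() {
      _mm_stream_si32(&v, 0);  // non-temporal store: exempt from the usual
                               // x86 write-write ordering guarantee
      v = 1;                   // overlapping temporal store in the same
                               // thread; the open question is whether the
                               // movnti can complete "late" and leave v == 0
  }

  void conservative() {
      _mm_stream_si32(&v, 0);
      _mm_sfence();            // drain write-combining buffers first; the
                               // store below is then ordered after the movnti
      v = 1;
  }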
On Tue, Jan 5, 2010 at 12:09 AM, Chandler Carruth <chandlerc at google.com> wrote:
> On Mon, Jan 4, 2010 at 8:51 PM, Jeffrey Yasskin <jyasskin at google.com> wrote:
>> On Mon, Jan 4, 2010 at 8:43 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>> On Mon, Jan 4, 2010 at 1:13 PM, James Y Knight <foom at fuhm.net> wrote:
>>>> The important things here are:
>>>> 1) Stores cannot be migrated from within the MOV/XOR instructions to
>>>> outside by the codegen.
>>>
>>> Basically, this is merely the problem that x86 places a stricter
>>> requirement on memory ordering than LLVM. Where x86 requires that
>>> stores occur in program order, LLVM reserves the right to change that.
>>> I have no idea if it is worthwhile to support memory barriers solely
>>> within the flow of execution, but it seems highly suspicious.
>>
>> It's needed to support std::atomic_signal_fence. gcc will initially
>> implement that with
>>
>>   asm volatile("":::"memory")
>>
>> but as James points out, that kills the JIT, and probably will keep
>> doing so until llvm-mc is finished or someone implements a special
>> case for it.
>
> Want to propose an extension to the current atomics of LLVM? Could we
> potentially clarify your previous concern regarding the pairing of
> barriers to operations, as it seems like they would involve related
> bits of the lang ref? Happy to work with you on that sometime this Q
> if you're interested; I'll certainly have more time. =]

I have some ideas for that, and will be happy to help.

>>>>> The processor can reorder memory operations as well (within limits).
>>>>> Consider that 'memset' to zero is often codegened to a non-temporal
>>>>> store to memory. This exempts it from all ordering considerations.
>>>>
>>>> My understanding is that processor reordering only affects what you
>>>> might see from another CPU: the processor will undo speculatively
>>>> executed operations if the sequence of instructions actually executed
>>>> is not the sequence it predicted, so within a single CPU you should
>>>> never be able to tell the difference.
>>>>
>>>> But I must admit I don't know anything about non-temporal stores.
>>>> Within a single thread, if I do a non-temporal store, followed by a
>>>> load, am I not guaranteed to get back the value I stored?
>>>
>>> If you read the *same address*, then the ordering is guaranteed, but
>>> the Intel documentation specifically exempts these instructions from
>>> the general rule that writes will not be reordered with other writes.
>>> This means that a non-temporal store might be reordered to occur after
>>> the "xor" to your atomic integer, even if the instruction came prior
>>> to the xor.
>>
>> It exempts these instructions from the cross-processor guarantees, but
>> I don't see anything saying that, for example, a temporal store in a
>> single processor's instruction stream after a non-temporal store may
>> be overwritten by the non-temporal store. Do you see something I'm
>> missing? If not, for single-thread signals, I think it's only compiler
>> reordering James has to worry about.
>
> The exemption I'm referring to (Section 8.2.2 of the System Programming
> Guide from Intel) is to the write-write ordering of the
> *single-processor* model. Reading the referenced section on the
> non-temporal behavior of these instructions (10.4.6 of volume 1 of the
> architecture manual) doesn't entirely clarify the matter for me either.
> It specifically says that the non-temporal writes may occur outside of
> program order, but doesn't seem to clarify exactly what the result of
> overlapping temporal writes is without fences within the same program
> thread. The only examples I'm finding are for multiprocessor
> scenarios. =/

Yeah, it's not 100% clear. I'm pretty sure that x86 maintains the
fiction of a linear "instruction stream" within each processor, even in
the presence of interrupts (which underlie pthread_kill and OS-level
thread switching). For example, in 6.6, we have: "The ability of a P6
family processor to speculatively execute instructions does not affect
the taking of interrupts by the processor. Interrupts are taken at
instruction boundaries located during the retirement phase of
instruction execution; so they are always taken in the “in-order”
instruction stream." But I'm not an expert in non-temporal anything.
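The upshot for the single-thread case, sketched with the two C++0x
fences discussed in this thread (the codegen notes in the comments
describe typical x86 behavior and are a summary, not text from the
manuals):

  #include <atomic>

  void fences() {
      // Signal fence: constrains only the compiler, so nothing need be
      // emitted for it on x86. This is what asm volatile("":::"memory")
      // approximates, and it is all James's use case requires.
      std::atomic_signal_fence(std::memory_order_seq_cst);

      // Thread fence: orders memory against other processors, typically
      // an mfence on x86, roughly as costly as a locked atomic operation.
      std::atomic_thread_fence(std::memory_order_seq_cst);
  }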
James Y Knight
2010-Jan-05 17:53 UTC
[LLVMdev] Non-temporal moves in memset [Was: ASM output with JIT / codegen barriers]
On Jan 5, 2010, at 1:09 AM, Chandler Carruth wrote:
>>>>> Consider that 'memset' to zero is often codegened to a non-temporal
>>>>> store to memory. This exempts it from all ordering considerations.

Hm... off topic from my original email since I think this is only
relevant for multithreaded code... But from what I can tell, an
implementation of memset that does not contain an sfence after using
movnti is considered broken. Callers of memset would not (and should
not need to) know that they must use an actual memory barrier (sfence)
after the memset call to get the usual x86 store-store guarantee.

Thread describing that bug in the glibc memset implementation:
http://sourceware.org/ml/libc-alpha/2007-11/msg00017.html

Redhat errata including that fix in a stable update:
http://rhn.redhat.com/errata/RHBA-2008-0083.html

Then there's a recent discussion on the topic of who is responsible for
calling sfence on the gcc mailing list:
http://www.mail-archive.com/gcc at gcc.gnu.org/msg45939.html

Unfortunately, that thread didn't seem to have any firm conclusion, but
ISTM that the current default assumption is (b): anything that uses
movnti is assumed to surround such uses with memory fences so that
other code doesn't need to.

James
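The hazard those threads describe can be sketched as follows (the buffer
and flag names are hypothetical; assumption (b) above is stated in the
comment, and the problem only bites when another CPU is observing):

  #include <cstring>

  char buf[4096];
  volatile int data_ready;  // polled by another CPU/thread

  void producer() {
      std::memset(buf, 0, sizeof buf);  // may use movnti internally
      // Under assumption (b), memset itself issues sfence before
      // returning. If it did not, the ordinary store below could become
      // visible to another CPU before the non-temporal zeroing, and a
      // reader seeing data_ready == 1 could still read stale bytes
      // from buf.
      data_ready = 1;
  }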