thr3ads.net - llvm dev - [llvm-dev] RFC: non-temporal fencing in LLVM IR [Jan 2016]

Philip Reames via llvm-dev

2016-Jan-15 00:27 UTC

[llvm-dev] RFC: non-temporal fencing in LLVM IR

On 01/14/2016 04:05 PM, Hans Boehm via llvm-dev wrote:
>
> On Thu, Jan 14, 2016 at 1:37 PM, JF Bastien <jfb at google.com
> <mailto:jfb at google.com>> wrote:
>
>     On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer
>     <david.majnemer at gmail.com <mailto:david.majnemer at gmail.com>> wrote:
>
>
>
>         On Thu, Jan 14, 2016 at 1:13 PM, JF Bastien <jfb at google.com
>         <mailto:jfb at google.com>> wrote:
>
>             On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via
>             llvm-dev <llvm-dev at lists.llvm.org
>             <mailto:llvm-dev at lists.llvm.org>> wrote:
>
>
>
>                 On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via
>                 llvm-dev <llvm-dev at lists.llvm.org
>                 <mailto:llvm-dev at lists.llvm.org>> wrote:
>
>                     I agree with Tim's assessment for ARM.  That's
>                     interesting; I wasn't previously aware of that
>                     instruction.
>
>                     My understanding is that Alpha would have the same
>                     problem for normal loads.
>
>                     I'm all in favor of more systematic handling of
>                     the fences associated with x86 non-temporal accesses.
>
>                     AFAICT, nontemporal loads and stores seem to have
>                     different fencing rules on x86, none of them very
>                     clear. Nontemporal stores should probably ideally
>                     use an SFENCE. Locked instructions seem to be
>                     documented to work with MOVNTDQA.  In both cases,
>                     there seems to be only empirical evidence as to
>                     which side(s) of the nontemporal operations they
>                     should go on?
>
>                     I finally decided that I was OK with using a
>                     LOCKed top-of-stack update as a fence in Java on
>                     x86.  I'm significantly less enthusiastic for
>                     C++.  I also think that risks unexpected coherence
>                     miss problems, though they would probably be very
>                     rare. But they would be very surprising if they
>                     did occur.
>
>
>                 Today's LLVM already emits 'lock or %eax, (%esp)' for
>                 'fence seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST)
>                 when targeting 32-bit x86 machines which do not
>                 support mfence.  What instruction sequence should we
>                 be using instead?
>
>
>             Do they have non-temporal accesses in the ISA?
>
>
>         I thought not but there appear to be instructions
>         like movntps.  mfence was introduced in SSE2 while movntps and
>         sfence were introduced in SSE.
>
>
>     So the new builtin could be sfence? I think the codegen you point
>     out for SEQ_CST is fine if we fix the memory model as suggested.
>
>
> I agree that it's fine to use a locked instruction as a seq_cst fence
> if MFENCE is not available.

It's not clear to me this is true if the seq_cst fence is expected to
fence non-temporal stores.  I think in practice, you'd be very unlikely
to notice a difference, but I can't point to anything in the Intel docs
which justifies a lock-prefixed instruction as sufficient to fence any
non-temporal access.

> If you have to dirty a cache line, (%esp) seems like relatively safe one.

Agreed.  As we discussed previously, it is possible to get false sharing in
C++, but this would require one thread to be accessing information
stored in the last frame of another running thread's stack.  That seems
sufficiently unlikely to be ignored.

> (I'm assuming that CPUID is appreciably slower and out of the
> running?  I haven't tried.  But it also probably clobbers too many
> registers.)

This is my belief.  I haven't actually tried this experiment, but I've
seen no reports that CPUID is a good choice here.

> It's only the idea of writing to a memory location when MFENCE is
> available, and could be used instead, that seems questionable.

While in principle I agree, it appears in practice that this tradeoff is
worthwhile.  The hardware doesn't seem to optimize for the MFENCE case
whereas lock-prefixed instructions appear to be handled much better.

> What exactly would the non-temporal fences be?  It seems that on x86,
> the load and store case may differ.  In theory, there's also a before
> vs. after question.  In practice code using MOVNTA seems to assume
> that you only need an SFENCE afterwards.  I can't back that up with
> spec verbiage.  I don't know about MOVNTDQA.  What about ARM?

I'll leave this to JF to answer.  I'm not knowledgeable enough about
non-temporals to answer without substantial research first.

JF Bastien via llvm-dev

2016-Jan-15 08:15 UTC

[llvm-dev] RFC: non-temporal fencing in LLVM IR

> I agree that it's fine to use a locked instruction as a seq_cst fence if
> MFENCE is not available.
>
> It's not clear to me this is true if the seq_cst fence is expected to
> fence non-temporal stores.  I think in practice, you'd be very unlikely to
> notice a difference, but I can't point to anything in the Intel docs which
> justifies a lock prefixed instruction as sufficient to fence any
> non-temporal access.

Correct, that's why changing the memory model is critical: seq_cst fence
wouldn't have any guarantee w.r.t. non-temporal.

> What exactly would the non-temporal fences be?  It seems that on x86, the
> load and store case may differ.  In theory, there's also a before vs. after
> question.  In practice code using MOVNTA seems to assume that you only need
> an SFENCE afterwards.  I can't back that up with spec verbiage.  I don't
> know about MOVNTDQA.  What about ARM?
>
> I'll leave this to JF to answer.  I'm not knowledgeable enough about
> non-temporals to answer without substantial research first.

I'm proposing two builtins:
- __builtin_nontemporal_load_fence
- __builtin_nontemporal_store_fence

If I've got this right, on x86 they would respectively be a nop and
sfence.

They otherwise act as memory code motion barriers unless accesses are
proven to not alias. I think it may be possible to loosen the rule so they
act closer to acquire/release (allowing accesses to move into the pair) but
I'm not convinced that this works for every ISA so I'd err on the side of
caution (since this can be loosened later).
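
To make the proposed x86 mapping concrete, here is a minimal C++ sketch. The function names below are hypothetical stand-ins, not existing builtins. Using std::atomic_signal_fence as the compiler-level code motion barrier, and falling back to a seq_cst fence on targets without SSE2, are assumptions of this sketch and not part of the proposal.

```cpp
#include <atomic>
#include <cassert>
#if defined(__SSE2__)
#include <immintrin.h>
#endif

// Hypothetical stand-in for __builtin_nontemporal_load_fence.
// Proposed x86 lowering: no instruction, only a compiler-level barrier.
inline void nontemporal_load_fence() {
    std::atomic_signal_fence(std::memory_order_seq_cst);
}

// Hypothetical stand-in for __builtin_nontemporal_store_fence.
// Proposed x86 lowering: SFENCE.
inline void nontemporal_store_fence() {
#if defined(__SSE2__)
    _mm_sfence();
#else
    // Assumption: conservative fallback where SFENCE is unavailable.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
    std::atomic_signal_fence(std::memory_order_seq_cst);
}

// Illustrative helper: a 32-bit non-temporal store where available.
inline void stream_store(int* dst, int value) {
#if defined(__SSE2__)
    _mm_stream_si32(dst, value);  // MOVNTI
#else
    *dst = value;                 // ordinary store on other targets
#endif
}

// Store non-temporally, fence, then read the value back on the same thread.
inline int store_then_fence(int value) {
    int slot = 0;
    stream_store(&slot, value);
    nontemporal_store_fence();
    return slot;
}
```

Calling store_then_fence(42) only shows where the fence would sit relative to the store; it does not demonstrate any cross-thread ordering.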

Hans Boehm via llvm-dev

2016-Jan-15 19:21 UTC

[llvm-dev] RFC: non-temporal fencing in LLVM IR

On Thu, Jan 14, 2016 at 4:27 PM, Philip Reames <listmail at philipreames.com>
wrote:

> It's not clear to me this is true if the seq_cst fence is expected to
> fence non-temporal stores.  I think in practice, you'd be very unlikely to
> notice a difference, but I can't point to anything in the Intel docs which
> justifies a lock prefixed instruction as sufficient to fence any
> non-temporal access.

Agreed.  I think it's not guaranteed.  And the most rational explanation
for the fact that LOCK; X is faster than MFENCE seems to be that LOCK only
deals with normal write-back cacheable accesses, and hence may not work for
cases like this.
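
For reference, the two fence flavors being compared can be sketched in C++ roughly as follows. The function names are hypothetical, the inline assembly assumes a GCC- or Clang-compatible compiler targeting x86-64 (the thread discusses the 32-bit form 'lock or %eax, (%esp)'), and other targets fall back to a standard seq_cst fence. This only exercises ordinary write-back accesses; it says nothing about whether either form orders non-temporal stores.

```cpp
#include <atomic>
#include <cassert>
#if defined(__SSE2__)
#include <immintrin.h>
#endif

// Flavor 1: a dedicated MFENCE instruction.
inline void fence_mfence() {
#if defined(__SSE2__)
    _mm_mfence();
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Flavor 2: a LOCKed read-modify-write on the top of the stack.
// OR-ing with 0 leaves the stack contents unchanged.
inline void fence_locked_stack_op() {
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ __volatile__("lock; orl $0, (%%rsp)" ::: "memory", "cc");
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Store, run the chosen fence, then load the value back.
inline int store_fence_load(void (*fence)(), int value) {
    std::atomic<int> x{0};
    x.store(value, std::memory_order_relaxed);
    fence();
    return x.load(std::memory_order_relaxed);
}
```

Timing the two functions in a loop would be one way to reproduce the performance gap mentioned above, though results will vary by microarchitecture.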

> If you have to dirty a cache line, (%esp) seems like relatively safe one.
>
> Agreed.  As we discussed previously, it is possible to get false sharing in
> C++, but this would require one thread to be accessing information stored
> in the last frame of another running thread's stack.  That seems
> sufficiently unlikely to be ignored.
>
I disagree with the reasoning, but not really with the conclusion.
Starting a thread with a lambda that captures locals by reference is likely
to do this, and is a common C++ idiom, especially in textbook examples.
This is aggravated by the fact that I don't understand the hardware
prefetcher, and that it sometimes seems to fetch an adjacent line.  (Note
that C, unlike C++, allows implementations to make thread stacks
inaccessible to other threads.  Some of us consider that a bug and would
refuse to use a general purpose implementation that actually did this.  I
suspect there are enough of us that it doesn't matter.)
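
The idiom in question looks roughly like this in C++. The function name sum_on_worker is made up for illustration. The worker thread writes through a reference into the spawning thread's current stack frame, which is the kind of cross-thread stack access that could share a cache line with a LOCKed top-of-stack update.

```cpp
#include <cassert>
#include <thread>

// Illustrative only: the lambda captures a local by reference, so the
// worker thread writes into the caller's stack frame.
inline int sum_on_worker(int a, int b) {
    int result = 0;  // lives in the calling thread's current frame
    std::thread worker([&result, a, b] {
        result = a + b;  // store to another thread's stack
    });
    worker.join();  // join synchronizes, so the read below is race-free
    return result;
}
```

If the calling thread executed a stack-based fence while the worker was writing result, both threads could be touching the same or an adjacent cache line.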

I think a stronger argument is that the compiler is always allowed to push
temporaries on the stack.  So this looks exactly as though a sequentially
consistent fence required a stack temporary.

> It's only the idea of writing to a memory location when MFENCE is
> available, and could be used instead, that seems questionable.
>
> While in principle I agree, it appears in practice that this tradeoff is
> worthwhile.  The hardware doesn't seem to optimize for the MFENCE case
> whereas lock prefix instructions appear to be handled much better.

The concern is that it is actually fairly easy to get contention as a
result in C++.  And programmers might think they know that certain fences
shouldn't use temporaries and the rest of their code should run in
registers.  But I agree this is not a completely clear call.  I wish x86
provided a plain fence instruction that handled the common case
efficiently, so we could avoid these trade-offs.  (A "sequentially
consistent store" instruction might be even better, in that it should
largely eliminate fences and allow other optimizations.)

Hans

Hans Boehm via llvm-dev

2016-Jan-15 20:04 UTC

[llvm-dev] RFC: non-temporal fencing in LLVM IR

On Fri, Jan 15, 2016 at 12:15 AM, JF Bastien <jfb at google.com> wrote:

>> What exactly would the non-temporal fences be?  It seems that on x86, the
>> load and store case may differ.  In theory, there's also a before vs. after
>> question.  In practice code using MOVNTA seems to assume that you only need
>> an SFENCE afterwards.  I can't back that up with spec verbiage.  I don't
>> know about MOVNTDQA.  What about ARM?
>>
>> I'll leave this to JF to answer.  I'm not knowledgeable enough about
>> non-temporals to answer without substantial research first.
>
> I'm proposing two builtins:
> - __builtin_nontemporal_load_fence
> - __builtin_nontemporal_store_fence
>
> If I've got this right, on x86 they would respectively be a nop and
> sfence.
>
> They otherwise act as memory code motion barriers unless accesses are
> proven to not alias. I think it may be possible to loosen the rule so they
> act closer to acquire/release (allowing accesses to move into the pair) but
> I'm not convinced that this works for every ISA so I'd err on the side of
> caution (since this can be loosened later).

What would the semantics be?  They restore the normal architectural
ordering guarantees relied upon by the synchronization primitives, so that
non-temporal accesses don't need to be considered when implementing
synchronization?

Then I think an SFENCE following x86 non-temporal stores would be correct.
And empirically we don't need anything before a non-temporal store to
order it with respect to earlier normal stores.  But I don't think the latter
conclusion follows from the spec.
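
A minimal C++ sketch of the store-side pattern: non-temporal stores, then SFENCE, then an ordinary release store that publishes the data. The function name publish_and_sum is hypothetical, the intrinsics are only used when SSE2 is available, and other targets fall back to plain stores. This is an illustration of where the fence goes, not a claim about what the spec guarantees.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__SSE2__)
#include <immintrin.h>
#endif

// The producer fills a buffer with 1..n using streaming stores, issues an
// SFENCE, then sets a flag. The consumer waits for the flag and sums.
inline long publish_and_sum(int n) {
    std::vector<int> buffer(n, 0);
    std::atomic<bool> ready{false};
    long sum = 0;

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        for (int v : buffer) sum += v;
    });

    for (int i = 0; i < n; ++i) {
#if defined(__SSE2__)
        _mm_stream_si32(&buffer[i], i + 1);  // non-temporal store (MOVNTI)
#else
        buffer[i] = i + 1;                   // plain store on other targets
#endif
    }
#if defined(__SSE2__)
    _mm_sfence();  // the fence after the non-temporal stores
#endif
    ready.store(true, std::memory_order_release);

    consumer.join();
    return sum;
}
```

Note that nothing is placed before the streaming stores, matching the empirical observation above rather than any documented guarantee.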

I looked at the MOVNTDQA non-temporal load documentation again, and I'm
confused.  It sounds like so long as the memory is WB-cacheable, we may be
OK without any fences.  But I can't tell that for sure.  In the WC case, a
LOCKed instruction seems to be documented to work as a fence.

In the ARM LDNP case, things seem to be messy.  I don't think we currently
need fences for C++, since we don't normally use the dependency-based
ordering guarantees.  (Except to prevent out-of-thin-air results, which
don't seem to be precluded by the ARM spec.  Intentional or bug?)  But the
difference does matter when implementing Java final fields or
memory_order_consume.

I'm actually getting a little worried that these things are just too
idiosyncratic to reflect in portable intrinsics.