Philip Reames via llvm-dev
2016-Jan-15 00:27 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
On 01/14/2016 04:05 PM, Hans Boehm via llvm-dev wrote:
> On Thu, Jan 14, 2016 at 1:37 PM, JF Bastien <jfb at google.com> wrote:
> > On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer
> > <david.majnemer at gmail.com> wrote:
> > > On Thu, Jan 14, 2016 at 1:13 PM, JF Bastien <jfb at google.com> wrote:
> > > > On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via llvm-dev
> > > > <llvm-dev at lists.llvm.org> wrote:
> > > > > On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via llvm-dev
> > > > > <llvm-dev at lists.llvm.org> wrote:
> > > > > > I agree with Tim's assessment for ARM. That's interesting;
> > > > > > I wasn't previously aware of that instruction.
> > > > > >
> > > > > > My understanding is that Alpha would have the same problem
> > > > > > for normal loads.
> > > > > >
> > > > > > I'm all in favor of more systematic handling of the fences
> > > > > > associated with x86 non-temporal accesses.
> > > > > >
> > > > > > AFAICT, nontemporal loads and stores seem to have different
> > > > > > fencing rules on x86, none of them very clear. Nontemporal
> > > > > > stores should probably ideally use an SFENCE. Locked
> > > > > > instructions seem to be documented to work with MOVNTDQA.
> > > > > > In both cases, there seems to be only empirical evidence as
> > > > > > to which side(s) of the nontemporal operations they should
> > > > > > go on?
> > > > > >
> > > > > > I finally decided that I was OK with using a LOCKed
> > > > > > top-of-stack update as a fence in Java on x86. I'm
> > > > > > significantly less enthusiastic for C++. I also think that
> > > > > > risks unexpected coherence miss problems, though they would
> > > > > > probably be very rare. But they would be very surprising if
> > > > > > they did occur.
> > > > >
> > > > > Today's LLVM already emits 'lock or %eax, (%esp)' for
> > > > > 'fence seq_cst'/__sync_synchronize/
> > > > > __atomic_thread_fence(__ATOMIC_SEQ_CST) when targeting 32-bit
> > > > > x86 machines which do not support mfence. What instruction
> > > > > sequence should we be using instead?
> > > >
> > > > Do they have non-temporal accesses in the ISA?
> > >
> > > I thought not but there appear to be instructions like movntps.
> > > mfence was introduced in SSE2 while movntps and sfence were
> > > introduced in SSE.
> >
> > So the new builtin could be sfence? I think the codegen you point
> > out for SEQ_CST is fine if we fix the memory model as suggested.
>
> I agree that it's fine to use a locked instruction as a seq_cst fence
> if MFENCE is not available.

It's not clear to me this is true if the seq_cst fence is expected to
fence non-temporal stores. I think in practice, you'd be very unlikely
to notice a difference, but I can't point to anything in the Intel docs
which justifies a lock-prefixed instruction as sufficient to fence any
non-temporal access.

> If you have to dirty a cache line, (%esp) seems like a relatively
> safe one.

Agreed. As we discussed previously, it is possible to get false sharing
in C++, but this would require one thread to be accessing information
stored in the last frame of another running thread's stack. That seems
sufficiently unlikely to be ignored.

> (I'm assuming that CPUID is appreciably slower and out of the
> running? I haven't tried. But it also probably clobbers too many
> registers.)

This is my belief. I haven't actually tried this experiment, but I've
seen no reports that CPUID is a good choice here.

> It's only the idea of writing to a memory location when MFENCE is
> available, and could be used instead, that seems questionable.

While in principle I agree, it appears in practice that this tradeoff
is worthwhile. The hardware doesn't seem to optimize for the MFENCE
case, whereas lock-prefixed instructions appear to be handled much
better.

> What exactly would the non-temporal fences be? It seems that on x86,
> the load and store case may differ. In theory, there's also a before
> vs. after question. In practice code using MOVNTA seems to assume
> that you only need an SFENCE afterwards. I can't back that up with
> spec verbiage. I don't know about MOVNTDQA. What about ARM?

I'll leave this to JF to answer. I'm not knowledgeable enough about
non-temporals to answer without substantial research first.
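For illustration only, here is a minimal C++ sketch of the trade-off discussed in the message above: a seq_cst fence spelled through the standard library versus a hand-written locked no-op on the top of the stack. The helper names `standard_fence`, `locked_stack_fence` and `store_then_load` are hypothetical and are not part of LLVM. The inline assembly assumes GCC or Clang on x86 and falls back to the standard fence elsewhere. As the message points out, nothing here demonstrates that the locked form orders non-temporal accesses.

```cpp
#include <atomic>
#include <cassert>

// Portable spelling: the compiler chooses the lowering for the target.
inline void standard_fence() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Hypothetical helper: a locked read-modify-write of the stack top.
// OR-ing zero leaves the stack slot unchanged but dirties its cache line,
// which is the cost the thread is weighing against MFENCE.
inline void locked_stack_fence() {
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ __volatile__("lock; orl $0, (%%rsp)" ::: "memory", "cc");
#elif defined(__GNUC__) && defined(__i386__)
    __asm__ __volatile__("lock; orl $0, (%%esp)" ::: "memory", "cc");
#else
    standard_fence();  // assumption: no hand-written form on other targets
#endif
}

// Store to one flag, full fence, then load the other flag: the
// store-then-load pattern that needs a seq_cst fence on x86.
inline int store_then_load(std::atomic<int>& mine,
                           const std::atomic<int>& theirs,
                           bool use_locked) {
    assert(&mine != &theirs);  // the two flags must be distinct locations
    mine.store(1, std::memory_order_relaxed);
    if (use_locked) {
        locked_stack_fence();
    } else {
        standard_fence();
    }
    return theirs.load(std::memory_order_relaxed);
}
```

Both helpers are interchangeable for ordinary write-back accesses in this sketch; whether they are interchangeable for non-temporal accesses is exactly the open question.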
JF Bastien via llvm-dev
2016-Jan-15 08:15 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
> > I agree that it's fine to use a locked instruction as a seq_cst
> > fence if MFENCE is not available.
>
> It's not clear to me this is true if the seq_cst fence is expected to
> fence non-temporal stores. I think in practice, you'd be very
> unlikely to notice a difference, but I can't point to anything in the
> Intel docs which justifies a lock-prefixed instruction as sufficient
> to fence any non-temporal access.

Correct, that's why changing the memory model is critical: a seq_cst
fence wouldn't have any guarantee w.r.t. non-temporal accesses.

> > What exactly would the non-temporal fences be? It seems that on
> > x86, the load and store case may differ. In theory, there's also a
> > before vs. after question. In practice code using MOVNTA seems to
> > assume that you only need an SFENCE afterwards. I can't back that
> > up with spec verbiage. I don't know about MOVNTDQA. What about ARM?
>
> I'll leave this to JF to answer. I'm not knowledgeable enough about
> non-temporals to answer without substantial research first.

I'm proposing two builtins:
- __builtin_nontemporal_load_fence
- __builtin_nontemporal_store_fence

If I've got this right, on x86 they would be a nop and an sfence,
respectively.

They otherwise act as memory code motion barriers unless accesses are
proven to not alias. I think it may be possible to loosen the rule so
they act closer to acquire/release (allowing accesses to move into the
pair), but I'm not convinced that this works for every ISA, so I'd err
on the side of caution (since this can be loosened later).
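To make the proposal above concrete, here is a hedged C++ sketch of how the two fences might be used. The proposed builtins do not exist, so `nontemporal_store_fence` and `nontemporal_load_fence` below are hypothetical stand-ins, as are `publish_buffer` and `consume_sum`. The x86 path assumes SSE2 and uses the real intrinsics `_mm_stream_si32` and `_mm_sfence`; other targets get a conservative seq_cst fence, which is an assumption of this sketch and not something the thread settles.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <immintrin.h>
#endif

// Hypothetical stand-in for __builtin_nontemporal_store_fence.
inline void nontemporal_store_fence() {
#if defined(__SSE2__)
    _mm_sfence();  // proposed x86 lowering: SFENCE
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);  // conservative guess
#endif
    // Compiler-only barrier: no memory access may be moved across this point.
    std::atomic_signal_fence(std::memory_order_seq_cst);
}

// Hypothetical stand-in for __builtin_nontemporal_load_fence.
inline void nontemporal_load_fence() {
#if defined(__SSE2__)
    // Proposed x86 lowering: no instruction, code motion barrier only.
    std::atomic_signal_fence(std::memory_order_seq_cst);
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);  // conservative guess
#endif
}

// Fill a buffer with non-temporal stores, fence, then publish a flag.
inline void publish_buffer(int* buf, std::size_t n, int value,
                           std::atomic<bool>& ready) {
    assert(buf != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__SSE2__)
        _mm_stream_si32(buf + i, value);  // non-temporal 32-bit store
#else
        buf[i] = value;                   // ordinary store on other targets
#endif
    }
    nontemporal_store_fence();  // restore normal ordering before the release
    ready.store(true, std::memory_order_release);
}

// Return the sum of the buffer, or -1 if it has not been published yet.
inline int consume_sum(const int* buf, std::size_t n,
                       const std::atomic<bool>& ready) {
    if (!ready.load(std::memory_order_acquire)) {
        return -1;
    }
    nontemporal_load_fence();
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += buf[i];
    }
    return sum;
}
```

The design point is that the release store on `ready` is written as if non-temporal accesses did not exist; the explicit fence before it is what carries the extra ordering.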
Hans Boehm via llvm-dev
2016-Jan-15 19:21 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
On Thu, Jan 14, 2016 at 4:27 PM, Philip Reames
<listmail at philipreames.com> wrote:
> It's not clear to me this is true if the seq_cst fence is expected to
> fence non-temporal stores. I think in practice, you'd be very
> unlikely to notice a difference, but I can't point to anything in the
> Intel docs which justifies a lock-prefixed instruction as sufficient
> to fence any non-temporal access.

Agreed. I think it's not guaranteed. And the most rational explanation
for the fact that LOCK; X is faster than MFENCE seems to be that LOCK
only deals with normal write-back cacheable accesses, and hence may not
work for cases like this.

> > If you have to dirty a cache line, (%esp) seems like a relatively
> > safe one.
>
> Agreed. As we discussed previously, it is possible to get false
> sharing in C++, but this would require one thread to be accessing
> information stored in the last frame of another running thread's
> stack. That seems sufficiently unlikely to be ignored.

I disagree with the reasoning, but not really with the conclusion.
Starting a thread with a lambda that captures locals by reference is
likely to do this, and is a common C++ idiom, especially in textbook
examples. This is aggravated by the fact that I don't understand the
hardware prefetcher, and that it sometimes seems to fetch an adjacent
line. (Note that C, unlike C++, allows implementations to make thread
stacks inaccessible to other threads. Some of us consider that a bug
and would refuse to use a general purpose implementation that actually
did this. I suspect there are enough of us that it doesn't matter.)

I think a stronger argument is that the compiler is always allowed to
push temporaries on the stack. So this looks exactly as though a
sequentially consistent fence required a stack temporary.

> > It's only the idea of writing to a memory location when MFENCE is
> > available, and could be used instead, that seems questionable.
>
> While in principle I agree, it appears in practice that this tradeoff
> is worthwhile. The hardware doesn't seem to optimize for the MFENCE
> case, whereas lock-prefixed instructions appear to be handled much
> better.

The concern is that it is actually fairly easy to get contention as a
result in C++. And programmers might think they know that certain
fences shouldn't use temporaries and the rest of their code should run
in registers. But I agree this is not a completely clear call.

I wish x86 provided a plain fence instruction that handled the common
case efficiently, so we could avoid these trade-offs. (A "sequentially
consistent store" instruction might be even better, in that it should
largely eliminate fences and allow other optimizations.)

Hans
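The idiom mentioned in the message above, a thread started with a lambda that captures locals by reference, can be sketched as follows. The function name `sum_on_worker` is hypothetical. The sketch only shows that the worker reads and writes the parent thread's stack frame; it does not measure or demonstrate any contention with a locked top-of-stack update.

```cpp
#include <cassert>
#include <thread>

// The worker thread accesses a, b and result, all of which live in the
// calling thread's stack frame, close to that thread's stack top.
inline int sum_on_worker(int a, int b) {
    int result = 0;
    std::thread worker([&] {   // captures the locals by reference
        result = a + b;        // write lands in the parent's stack frame
    });
    worker.join();             // join orders the worker's write before the read
    assert(!worker.joinable());
    return result;
}
```

If the parent ran a fence implemented as a locked update of its own stack top while the worker was still running, both threads could be touching the same or an adjacent cache line, which is the scenario under discussion.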
Hans Boehm via llvm-dev
2016-Jan-15 20:04 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
On Fri, Jan 15, 2016 at 12:15 AM, JF Bastien <jfb at google.com> wrote:
> > > What exactly would the non-temporal fences be? It seems that on
> > > x86, the load and store case may differ. In theory, there's also
> > > a before vs. after question. In practice code using MOVNTA seems
> > > to assume that you only need an SFENCE afterwards. I can't back
> > > that up with spec verbiage. I don't know about MOVNTDQA. What
> > > about ARM?
> >
> > I'll leave this to JF to answer. I'm not knowledgeable enough about
> > non-temporals to answer without substantial research first.
>
> I'm proposing two builtins:
> - __builtin_nontemporal_load_fence
> - __builtin_nontemporal_store_fence
>
> If I've got this right, on x86 they would be a nop and an sfence,
> respectively.
>
> They otherwise act as memory code motion barriers unless accesses are
> proven to not alias. I think it may be possible to loosen the rule so
> they act closer to acquire/release (allowing accesses to move into
> the pair), but I'm not convinced that this works for every ISA, so
> I'd err on the side of caution (since this can be loosened later).

What would the semantics be? They restore the normal architectural
ordering guarantees relied upon by the synchronization primitives, so
that non-temporal accesses don't need to be considered when
implementing synchronization?

Then I think an SFENCE following x86 non-temporal stores would be
correct. And empirically we don't need anything before a non-temporal
store to order it with respect to earlier normal stores. But I don't
think the latter conclusion follows from the spec.

I looked at the MOVNTDQA non-temporal load documentation again, and I'm
confused. It sounds like so long as the memory is WB-cacheable, we may
be OK without any fences. But I can't tell that for sure. In the WC
case, a LOCKed instruction seems to be documented to work as a fence.

In the ARM LDNP case, things seem to be messy. I don't think we
currently need fences for C++, since we don't normally use the
dependency-based ordering guarantees. (Except to prevent
out-of-thin-air results, which don't seem to be precluded by the ARM
spec. Intentional or bug?) But the difference does matter when
implementing Java final fields or memory_order_consume.

I'm actually getting a little worried that these things are just too
idiosyncratic to reflect in portable intrinsics.
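As a rough illustration of the dependency-based ordering mentioned in the message above, here is a small C++ sketch of pointer publication read with `memory_order_consume`. The type `Config` and the functions `publish_config` and `read_value` are hypothetical. Mainstream compilers are generally understood to strengthen consume to acquire, so this shows the source-level pattern only; it does not show how a non-temporal load such as ARM's LDNP would interact with it.

```cpp
#include <atomic>
#include <cassert>

struct Config {
    int value;
};

// Writer: initialize the object first, then publish its address.
inline void publish_config(std::atomic<const Config*>& slot,
                           const Config* cfg) {
    assert(cfg != nullptr);
    slot.store(cfg, std::memory_order_release);
}

// Reader: the load of cfg->value carries an address dependency on the
// pointer load. Dependency ordering relies on the hardware honoring that
// dependency, which is what the thread questions for non-temporal loads.
inline int read_value(const std::atomic<const Config*>& slot) {
    const Config* cfg = slot.load(std::memory_order_consume);
    return cfg ? cfg->value : -1;  // -1 means "not published yet"
}
```

A Java final-field read follows the same shape: the reader has no explicit fence and depends on the dereference being ordered after the pointer load.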