JF Bastien via llvm-dev
2016-Jan-14 21:37 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer <david.majnemer at gmail.com> wrote:

> On Thu, Jan 14, 2016 at 1:13 PM, JF Bastien <jfb at google.com> wrote:
>
>> On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>>> On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>>> I agree with Tim's assessment for ARM. That's interesting; I wasn't
>>>> previously aware of that instruction.
>>>>
>>>> My understanding is that Alpha would have the same problem for normal
>>>> loads.
>>>>
>>>> I'm all in favor of more systematic handling of the fences associated
>>>> with x86 non-temporal accesses.
>>>>
>>>> AFAICT, nontemporal loads and stores seem to have different fencing
>>>> rules on x86, none of them very clear. Nontemporal stores should probably
>>>> ideally use an SFENCE. Locked instructions seem to be documented to work
>>>> with MOVNTDQA. In both cases, there seems to be only empirical evidence as
>>>> to which side(s) of the nontemporal operations they should go on?
>>>>
>>>> I finally decided that I was OK with using a LOCKed top-of-stack update
>>>> as a fence in Java on x86. I'm significantly less enthusiastic for C++. I
>>>> also think that risks unexpected coherence miss problems, though they would
>>>> probably be very rare. But they would be very surprising if they did occur.
>>>
>>> Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence
>>> seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when
>>> targeting 32-bit x86 machines which do not support mfence. What
>>> instruction sequence should we be using instead?
>>
>> Do they have non-temporal accesses in the ISA?
>
> I thought not but there appear to be instructions like movntps. mfence
> was introduced in SSE2 while movntps and sfence were introduced in SSE.

So the new builtin could be sfence? I think the codegen you point out for
SEQ_CST is fine if we fix the memory model as suggested.

>>>> On Wed, Jan 13, 2016 at 10:59 AM, Tim Northover <t.p.northover at gmail.com> wrote:
>>>>
>>>>> > I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
>>>>> > details for that ISA. I lifted this example from here:
>>>>> >
>>>>> > http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
>>>>> >
>>>>> > Which is correct?
>>>>>
>>>>> FWIW, I agree with John here. The example I'd give for the unexpected
>>>>> behaviour allowed in the spec is:
>>>>>
>>>>>     .Lwait_for_data:
>>>>>         ldr x0, [x3]
>>>>>         cbz x0, .Lwait_for_data
>>>>>         ldnp x2, x1, [x0]
>>>>>
>>>>> where another thread first writes to a buffer then tells us where that
>>>>> buffer is. For a normal ldp, the address dependency rule means we
>>>>> don't need a barrier or acquiring load to ensure we see the real data
>>>>> in the buffer. For ldnp, we would need a barrier to prevent stale
>>>>> data.
>>>>>
>>>>> I suspect this is actually even closer to the x86 situation than what
>>>>> the guide implies (which looks like a straight-up exposed pipeline to
>>>>> me, beyond even what Alpha would have done).
>>>>>
>>>>> Cheers.
>>>>>
>>>>> Tim.
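The handoff Tim describes in the quoted text (one thread fills a buffer, then tells another thread where it is) can be sketched in portable C++. This is only an illustration under stated assumptions: `Buffer`, `publish`, `wait_and_sum` and `run_handoff` are hypothetical names, not anything from LLVM or from the thread, and the sketch uses an acquire load on the consumer side instead of relying on the address dependency, since that dependency is exactly what Tim says stops being sufficient once the data load is an ldnp.

```cpp
#include <atomic>
#include <thread>

// Hypothetical two-word buffer standing in for the data read by ldp/ldnp.
struct Buffer {
    long lo;
    long hi;
};

static Buffer g_buffer{0, 0};
static std::atomic<Buffer*> g_published{nullptr};

// Producer: write the buffer first, then tell the consumer where it is.
void publish(long lo, long hi) {
    g_buffer.lo = lo;
    g_buffer.hi = hi;
    g_published.store(&g_buffer, std::memory_order_release);
}

// Consumer: spin until the pointer is non-null, then read through it.
// The acquire load orders the buffer reads after the pointer load at the
// language level, without depending on address-dependency ordering.
long wait_and_sum() {
    Buffer* p;
    while ((p = g_published.load(std::memory_order_acquire)) == nullptr) {
        std::this_thread::yield();
    }
    return p->lo + p->hi;
}

// Runs one consumer and one producer and returns what the consumer saw.
long run_handoff(long lo, long hi) {
    g_published.store(nullptr, std::memory_order_relaxed);
    long seen = 0;
    std::thread consumer([&] { seen = wait_and_sum(); });
    std::thread producer([=] { publish(lo, hi); });
    producer.join();
    consumer.join();
    return seen;
}
```

Calling `run_handoff(40, 2)` should return the sum of the two published words. How a backend would have to lower the acquire when the subsequent loads are non-temporal is the open question of the thread; the sketch does not answer it.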
Hans Boehm via llvm-dev
2016-Jan-15 00:05 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
On Thu, Jan 14, 2016 at 1:37 PM, JF Bastien <jfb at google.com> wrote:

> On Thu, Jan 14, 2016 at 1:35 PM, David Majnemer <david.majnemer at gmail.com> wrote:
>
>> On Thu, Jan 14, 2016 at 1:13 PM, JF Bastien <jfb at google.com> wrote:
>>
>>> On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence
>>>> seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when
>>>> targeting 32-bit x86 machines which do not support mfence. What
>>>> instruction sequence should we be using instead?
>>>
>>> Do they have non-temporal accesses in the ISA?
>>
>> I thought not but there appear to be instructions like movntps. mfence
>> was introduced in SSE2 while movntps and sfence were introduced in SSE.
>
> So the new builtin could be sfence? I think the codegen you point out for
> SEQ_CST is fine if we fix the memory model as suggested.

I agree that it's fine to use a locked instruction as a seq_cst fence if
MFENCE is not available. If you have to dirty a cache line, (%esp) seems
like a relatively safe one. (I'm assuming that CPUID is appreciably slower
and out of the running? I haven't tried. But it also probably clobbers too
many registers.) It's only the idea of writing to a memory location when
MFENCE is available, and could be used instead, that seems questionable.

What exactly would the non-temporal fences be? It seems that on x86, the
load and store cases may differ. In theory, there's also a before vs. after
question. In practice, code using MOVNTA seems to assume that you only need
an SFENCE afterwards. I can't back that up with spec verbiage. I don't know
about MOVNTDQA. What about ARM?
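The "SFENCE afterwards" convention Hans mentions can be sketched with compiler intrinsics. This is a hedged illustration, not a statement of what the Intel manuals guarantee: `g_data`, `g_ready`, `produce`, `consume` and `run_streaming` are hypothetical names, `_mm_stream_si32` and `_mm_sfence` are the standard SSE2/SSE intrinsics, and the non-x86 fallback to plain stores is an assumption added so the sketch builds elsewhere.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#if defined(__SSE2__)
#include <immintrin.h>  // _mm_stream_si32 (SSE2) and _mm_sfence (SSE)
#endif

constexpr std::size_t kWords = 64;  // hypothetical buffer size
static int g_data[kWords];
static std::atomic<bool> g_ready{false};

// Producer: stream the data out, fence once afterwards, then publish.
void produce(int value) {
    for (std::size_t i = 0; i < kWords; ++i) {
#if defined(__SSE2__)
        _mm_stream_si32(&g_data[i], value);  // non-temporal 32-bit store
#else
        g_data[i] = value;                   // fallback: ordinary store
#endif
    }
#if defined(__SSE2__)
    _mm_sfence();  // the single fence placed after the streaming stores
#endif
    g_ready.store(true, std::memory_order_release);
}

// Consumer: wait for the flag, then read the buffer with ordinary loads.
long consume() {
    while (!g_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    long sum = 0;
    for (std::size_t i = 0; i < kWords; ++i) {
        sum += g_data[i];
    }
    return sum;
}

// Runs one reader and one writer and returns the sum the reader computed.
long run_streaming(int value) {
    g_ready.store(false, std::memory_order_relaxed);
    long sum = 0;
    std::thread reader([&] { sum = consume(); });
    std::thread writer([=] { produce(value); });
    writer.join();
    reader.join();
    return sum;
}
```

The sketch fences only on the store side and only after the streaming stores, matching the practice described above. Whether a fence is also needed before them, and what the load side (MOVNTDQA) requires, are the questions the message leaves open.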
Philip Reames via llvm-dev
2016-Jan-15 00:27 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
On 01/14/2016 04:05 PM, Hans Boehm via llvm-dev wrote:

> On Thu, Jan 14, 2016 at 1:37 PM, JF Bastien <jfb at google.com> wrote:
>
>> So the new builtin could be sfence? I think the codegen you point
>> out for SEQ_CST is fine if we fix the memory model as suggested.
>
> I agree that it's fine to use a locked instruction as a seq_cst fence
> if MFENCE is not available.

It's not clear to me this is true if the seq_cst fence is expected to
fence non-temporal stores. I think in practice, you'd be very unlikely to
notice a difference, but I can't point to anything in the Intel docs which
justifies a lock-prefixed instruction as sufficient to fence any
non-temporal access.

> If you have to dirty a cache line, (%esp) seems like a relatively safe one.

Agreed. As we discussed previously, it is possible to get false sharing in
C++, but this would require one thread to be accessing information stored
in the last frame of another running thread's stack. That seems
sufficiently unlikely to be ignored.

> (I'm assuming that CPUID is appreciably slower and out of the
> running? I haven't tried. But it also probably clobbers too many
> registers.)

This is my belief. I haven't actually tried this experiment, but I've seen
no reports that CPUID is a good choice here.

> It's only the idea of writing to a memory location when MFENCE is
> available, and could be used instead, that seems questionable.

While in principle I agree, it appears in practice that this tradeoff is
worthwhile. The hardware doesn't seem to optimize for the MFENCE case,
whereas lock-prefixed instructions appear to be handled much better.

> What exactly would the non-temporal fences be? It seems that on x86,
> the load and store cases may differ. In theory, there's also a before
> vs. after question. In practice, code using MOVNTA seems to assume
> that you only need an SFENCE afterwards. I can't back that up with
> spec verbiage. I don't know about MOVNTDQA. What about ARM?

I'll leave this to JF to answer. I'm not knowledgeable enough about
non-temporals to answer without substantial research first.
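For reference, what a seq_cst fence has to guarantee for ordinary atomic accesses can be shown with the classic store-buffering test; whichever instruction implements the fence (MFENCE or a LOCKed read-modify-write) must rule out the outcome counted below. This is a sketch under stated assumptions: `count_both_zero` is a hypothetical helper, and it exercises only regular atomics, so it says nothing about whether the same instruction also orders non-temporal stores, which is the point in doubt above.

```cpp
#include <atomic>
#include <thread>

// Runs the store-buffering test `rounds` times and returns how many runs
// observed both loads returning 0. With a seq_cst fence between each
// thread's store and load, the C++ memory model forbids that outcome.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;
        std::thread a([&] {
            x.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

Removing the two fences makes the both-zero outcome permitted (and observable on x86, where stores may sit in the store buffer past later loads), which is why the choice of fence instruction matters for `fence seq_cst` codegen.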