JF Bastien via llvm-dev
2016-Jan-13 18:44 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
On Wed, Jan 13, 2016 at 10:32 AM, John Brawn <John.Brawn at arm.com> wrote:

> *What about non-x86 architectures?*
>
> Architectures such as ARMv8 support non-temporal instructions and require
> barriers such as DMB nshld to order loads and DMB nshst to order stores.
>
> Even ARM's address-dependency rule (a.k.a. the ill-fated
> std::memory_order_consume) fails to hold with non-temporals:
>
>   LDR X0, [X3]
>   LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!
>
> What exactly do you mean by ‘X0 may not be loaded’ in your example here?
> If you mean that the LDNP could start executing with the value of X0 from
> before the LDR, e.g. initially X0=0x100, the LDR loads X0=0x200 but the
> LDNP uses the old value of X0=0x100, then I don’t think that’s true.
> According to section C3.2.4 of the ARMv8 ARMARM *other* observers may
> observe the LDR and the LDNP in the wrong order, but the CPU executing the
> instructions will observe them in program order.

I haven't touched ARMv8 in a few years, so I'm rusty on the non-temporal
details for that ISA. I lifted this example from here:

  http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html

Which is correct?

> I have no idea if that affects anything in this RFC though.

Agreed, but I don't want to be misleading! The current example serves as a
good justification for non-temporal read barriers; it would be a shame to
justify myself on incorrect data :-)

> John
>
> *From:* llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] *On Behalf Of
> *JF Bastien via llvm-dev
> *Sent:* 13 January 2016 07:16
> *To:* llvm-dev
> *Cc:* Hans Boehm
> *Subject:* [llvm-dev] RFC: non-temporal fencing in LLVM IR
>
> Hello, fencing enthusiasts!
>
> *TL;DR:* We'd like to propose an addition to the LLVM memory model
> requiring that non-temporal accesses be surrounded by non-temporal load
> barriers and non-temporal store barriers, and we'd like to add such
> orderings to the fence IR opcode.
>
> We are open to different approaches, hence this email instead of a patch.
>
> *Who's "we"?*
>
> Philip Reames brought this to my attention, and we've had numerous
> discussions with Hans Boehm on the topic. Any mistakes below are my own;
> all the clever bits are theirs.
>
> *Why?*
>
> Ignoring non-temporals for a moment: on most x86 targets LLVM generates an
> mfence for seq_cst atomic fencing. One could instead use a locked
> idempotent atomic access to top-of-stack such as lock or4i [RSP-8] 0.
> Philip has measured this as equivalent on micro-benchmarks, but as ~25%
> faster in macro-benchmarks (other codebases confirm this). There's one
> problem with this approach: non-temporal accesses on x86 are only ordered
> by fence instructions! This means that code using non-temporal accesses
> can't rely on LLVM's fence opcode to do the right thing, and instead has
> to rely on architecture-specific _mm*fence intrinsics.
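To make that last quoted point concrete, here is a minimal sketch of what x86
code has to do by hand today, using SSE2 intrinsics; the buffer, flag, and
loop bound are invented for illustration and aren't from the RFC:

    #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define N 64
    static __m128i buffer[N];
    static atomic_bool ready;

    void publish(__m128i value) {
      for (int i = 0; i < N; i++)
        _mm_stream_si128(&buffer[i], value);  /* movntdq: non-temporal store */

      /* Without this sfence, the release store below does NOT order the
         non-temporal stores above: another thread could observe ready == true
         while the streamed data is still sitting in a write-combining
         buffer. Only the architecture-specific fence intrinsic helps here. */
      _mm_sfence();

      atomic_store_explicit(&ready, true, memory_order_release);
    }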
> *But wait! Who said developers need to issue any type of fence when using
> non-temporals?*
>
> Well, the LLVM memory model sure didn't. The x86 memory model does (volume
> 3, section 8.2.2, Memory Ordering), but LLVM targets more than x86, the
> backends are free to ignore the !nontemporal metadata, and AFAICT the x86
> backend doesn't add those fences.
>
> Therefore even without the above optimization the LLVM language reference
> is incorrect: non-temporals should be bracketed by barriers. This applies
> even without threading! Non-temporal accesses aren't guaranteed to
> interact well with regular accesses, which means that regular loads cannot
> move "down" across a non-temporal barrier, and regular stores cannot move
> "up" across a non-temporal barrier.
>
> *Why not just have the compiler add the fences?*
>
> LLVM could do this, either as a per-backend thing or in a hookable pass
> such as AtomicExpandPass. It seems more natural to ask the programmer to
> express intent, just as is done with atomics. In fact, a backend is
> currently free to ignore !nontemporal on load and store and could
> therefore generate only half of what's requested, leading to incorrect
> code. That would of course be silly; backends should either honor all
> !nontemporal or none of them, but who knows what the middle-end does.
>
> Put another way: some optimized C libraries use non-temporal accesses
> (when string instructions aren't du jour) and they terminate their copying
> with an sfence. It's a de-facto convention; the ABI doesn't say anything,
> but let's avoid divergence.
>
> Aside: one day we may live in the fence elimination promised land
> <http://lists.llvm.org/pipermail/llvm-dev/2014-September/076701.html>
> where fences are exactly where they need to be, no more, no less.
>
> *Isn't x86's lfence just a no-op?*
>
> Yes, but we're proposing the addition of a target-independent non-temporal
> load barrier. It'll be up to the x86 backend to make it an
> X86ISD::MEMBARRIER and up to other backends to get it right (hint: it's
> not always a no-op).
>
> *Won't this optimization cause coherency misses? C++ accesses the thread
> stack concurrently all the time!*
>
> Maybe, but then it isn't much of an optimization if it's slowing code
> down. LLVM doesn't just target C++, and it's really up to the backend to
> decide whether one fence type is better than another (on x86, whether a
> locked top-of-stack idempotent operation is better than mfence). Other
> languages have private stacks where this isn't an issue, and where the
> stack top can reasonably be assumed to be in cache.
>
> *How will this affect non-user-mode code (i.e. kernel code)?*
>
> Kernel code still has to ask for _mm_mfence if it wants mfence: C11 and
> C++11 barriers aren't specified as a specific instruction.
>
> *Is it safe to access top-of-stack?*
>
> AFAIK yes, and the ABI-specified red zone has our back (or front, if the
> stack grows up ☻).
>
> *What about non-x86 architectures?*
>
> Architectures such as ARMv8 support non-temporal instructions and require
> barriers such as DMB nshld to order loads and DMB nshst to order stores.
>
> Even ARM's address-dependency rule (a.k.a. the ill-fated
> std::memory_order_consume) fails to hold with non-temporals:
>
>   LDR X0, [X3]
>   LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!
>
> *Who uses non-temporals anyways?*
>
> That's an awfully personal question!
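The de-facto library convention mentioned above ("terminate their copying with
an sfence") looks roughly like the sketch below; the function name and the
alignment assumptions are mine, not taken from any particular library:

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>

    /* Hypothetical streaming copy: dst and src are assumed 16-byte aligned
       and n a multiple of 16; real libraries handle the unaligned head and
       tail with ordinary stores. */
    void nt_copy(void *dst, const void *src, size_t n) {
      __m128i *d = (__m128i *)dst;
      const __m128i *s = (const __m128i *)src;
      for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));  /* movntdq stores */

      /* The single trailing sfence makes the streamed data ordered with
         respect to later normal stores. This is the fence the RFC would
         like to be expressible as a target-independent non-temporal store
         barrier instead of an x86-specific intrinsic. */
      _mm_sfence();
    }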
Tim Northover via llvm-dev
2016-Jan-13 18:59 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
> I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
> details for that ISA. I lifted this example from here:
>
>   http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
>
> Which is correct?

FWIW, I agree with John here. The example I'd give for the unexpected
behaviour allowed in the spec is:

  .Lwait_for_data:
    ldr x0, [x3]
    cbz x0, .Lwait_for_data
    ldnp x2, x1, [x0]

where another thread first writes to a buffer then tells us where that
buffer is. For a normal ldp, the address dependency rule means we don't
need a barrier or acquiring load to ensure we see the real data in the
buffer. For ldnp, we would need a barrier to prevent stale data.

I suspect this is actually even closer to the x86 situation than what the
guide implies (which looks like a straight-up exposed pipeline to me,
beyond even what Alpha would have done).

Cheers.

Tim.
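In C terms, Tim's example is the classic dependency-ordered publish pattern.
A rough sketch, with the type and variable names invented for illustration,
showing where the ldnp caveat would bite:

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct { long a, b; } payload;

    static payload slot;
    static _Atomic(payload *) published;   /* starts out NULL */

    /* Producer: fill the buffer, then publish its address. */
    void produce(long a, long b) {
      slot.a = a;
      slot.b = b;
      atomic_store_explicit(&published, &slot, memory_order_release);
    }

    /* Consumer: spin until the pointer appears, then dereference it. */
    payload consume(void) {
      payload *p;
      while ((p = atomic_load_explicit(&published,
                                       memory_order_consume)) == NULL)
        ;
      /* With ordinary loads (ldp), ARM's address dependency guarantees the
         fields of *p are seen after the pointer load, with no extra barrier.
         If this dereference were compiled to ldnp, that guarantee would not
         apply and a barrier (e.g. dmb nshld) would be needed to avoid
         reading stale data. */
      return *p;
    }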
Hans Boehm via llvm-dev
2016-Jan-14 03:00 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
I agree with Tim's assessment for ARM. That's interesting; I wasn't
previously aware of that instruction.

My understanding is that Alpha would have the same problem for normal loads.

I'm all in favor of more systematic handling of the fences associated with
x86 non-temporal accesses. AFAICT, non-temporal loads and stores seem to
have different fencing rules on x86, none of them very clear. Non-temporal
stores should probably ideally use an SFENCE. Locked instructions seem to
be documented to work with MOVNTDQA. In both cases, there seems to be only
empirical evidence as to which side(s) of the non-temporal operations they
should go on?

I finally decided that I was OK with using a LOCKed top-of-stack update as a
fence in Java on x86. I'm significantly less enthusiastic about doing the
same for C++. I also think that risks unexpected coherence-miss problems,
though they would probably be very rare. But they would be very surprising
if they did occur.

On Wed, Jan 13, 2016 at 10:59 AM, Tim Northover <t.p.northover at gmail.com>
wrote:

> > I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
> > details for that ISA. I lifted this example from here:
> >
> >   http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
> >
> > Which is correct?
>
> FWIW, I agree with John here. The example I'd give for the unexpected
> behaviour allowed in the spec is:
>
>   .Lwait_for_data:
>     ldr x0, [x3]
>     cbz x0, .Lwait_for_data
>     ldnp x2, x1, [x0]
>
> where another thread first writes to a buffer then tells us where that
> buffer is. For a normal ldp, the address dependency rule means we don't
> need a barrier or acquiring load to ensure we see the real data in the
> buffer. For ldnp, we would need a barrier to prevent stale data.
>
> I suspect this is actually even closer to the x86 situation than what the
> guide implies (which looks like a straight-up exposed pipeline to me,
> beyond even what Alpha would have done).
>
> Cheers.
>
> Tim.
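For the load side Hans mentions, a hedged sketch of what the pattern might
look like; _mm_stream_load_si128 (SSE4.1) maps to movntdqa, the
write-combining buffer is an assumption of the example, and the exact
placement of the fence is precisely the unclear part Hans points out:

    #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */
    #include <emmintrin.h>   /* _mm_mfence */

    /* Hypothetical: src points into a write-combining (WC) mapped region,
       e.g. a buffer mapped WC by a driver; on ordinary write-back memory
       movntdqa behaves like a normal load. */
    __m128i read_wc_slot(__m128i *src) {
      __m128i v = _mm_stream_load_si128(src);  /* movntdqa: weakly ordered
                                                   on WC memory */

      /* Per the discussion above: Intel's documentation suggests an mfence
         or a locked instruction to order movntdqa loads when multiple
         agents touch the same WC memory; which side of the load it belongs
         on seems to rest on empirical evidence. */
      _mm_mfence();
      return v;
    }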