Hal Finkel via llvm-dev
2016-Jan-14 20:51 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
Hi JF, Philip,

Clang currently has __builtin_nontemporal_store and __builtin_nontemporal_load. How will the usage model for those change?

Thanks again,
Hal

----- Original Message -----
> From: "Philip Reames via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "JF Bastien" <jfb at google.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Cc: "Hans Boehm" <hboehm at google.com>
> Sent: Wednesday, January 13, 2016 11:45:35 AM
> Subject: Re: [llvm-dev] RFC: non-temporal fencing in LLVM IR
>
> On 01/12/2016 11:16 PM, JF Bastien wrote:
> > Hello, fencing enthusiasts!
> >
> > TL;DR: We'd like to propose an addition to the LLVM memory model
> > requiring non-temporal accesses be surrounded by non-temporal load
> > barriers and non-temporal store barriers, and we'd like to add such
> > orderings to the fence IR opcode.
> >
> > We are open to different approaches, hence this email instead of a
> > patch.
> >
> > Who's "we"?
> >
> > Philip Reames brought this to my attention, and we've had numerous
> > discussions with Hans Boehm on the topic. Any mistakes below are my
> > own, all the clever bits are theirs.
> >
> > Why?
> >
> > Ignore non-temporals for a moment: on most x86 targets LLVM generates
> > an mfence for seq_cst atomic fencing. One could instead use a locked
> > idempotent atomic access to top-of-stack such as lock or4i [RSP-8] 0.
> > Philip has measured this as equivalent on micro-benchmarks, but as
> > ~25% faster in macro-benchmarks (other codebases confirm this).
> > There's one problem with this approach: non-temporal accesses on x86
> > are only ordered by fence instructions! This means that code using
> > non-temporal accesses can't rely on LLVM's fence opcode to do the
> > right thing; it instead has to rely on architecture-specific
> > _mm*fence intrinsics.
>
> Just to clarify: the proposal to change the implementation of
> seq_cst is arguably separate from this proposal. It will go through
> normal patch review once the semantics are addressed. Whatever we
> end up doing with seq_cst, we currently have a semantic hole in our
> specification around non-temporals that needs to be addressed.
>
> Another approach would be to define the current fences as fencing
> non-temporals and introduce new ones that don't. Either approach
> is workable. I believe that new fences for non-temporals are the
> appropriate choice, given that this would more closely match existing
> practice.
>
> We could also consider forward-serializing bitcode to the stronger
> form, whichever choice we make. That would be the conservatively
> correct thing to do for older bitcode which might be assuming stronger
> semantics than our barriers explicitly provided.
>
> > But wait! Who said developers need to issue any type of fence when
> > using non-temporals?
> >
> > Well, the LLVM memory model sure didn't. The x86 memory model does
> > (volume 3 section 8.2.2 Memory Ordering) but LLVM targets more than
> > x86 and the backends are free to ignore the !nontemporal metadata,
> > and AFAICT the x86 backend doesn't add those fences.
> >
> > Therefore even without the above optimization the LLVM language
> > reference is incorrect: non-temporals should be bracketed by
> > barriers. This applies even without threading! Non-temporal accesses
> > aren't guaranteed to interact well with regular accesses, which
> > means that regular loads cannot move "down" a non-temporal barrier,
> > and regular stores cannot move "up" a non-temporal barrier.
> >
> > Why not just have the compiler add the fences?
> >
> > LLVM could do this, either as a per-backend thing or a hookable pass
> > such as AtomicExpandPass. It seems more natural to ask the
> > programmer to express intent, just as is done with atomics. In fact,
> > a backend is currently free to ignore !nontemporal on load and store
> > and could therefore generate only half of what's requested, leading
> > to incorrect code. That would of course be silly: backends should
> > either honor all !nontemporal or none of them, but who knows what
> > the middle-end does.
> >
> > Put another way: some optimized C libraries use non-temporal accesses
> > (when string instructions aren't du jour) and they terminate their
> > copying with an sfence. It's a de-facto convention, the ABI doesn't
> > say anything, but let's avoid divergence.
> >
> > Aside: one day we may live in the fence elimination promised land
> > where fences are exactly where they need to be, no more, no less.
> >
> > Isn't x86's lfence just a no-op?
> >
> > Yes, but we're proposing the addition of a target-independent
> > non-temporal load barrier. It'll be up to the x86 backend to make it
> > an X86ISD::MEMBARRIER and other backends to get it right (hint: it's
> > not always a no-op).
> >
> > Won't this optimization cause coherency misses? C++ accesses the
> > thread stack concurrently all the time!
> >
> > Maybe, but then it isn't much of an optimization if it's slowing code
> > down. LLVM doesn't just target C++, and it's really up to the
> > backend to decide whether one fence type is better than another (on
> > x86, whether a locked top-of-stack idempotent operation is better
> > than mfence). Other languages have private stacks where this isn't
> > an issue, and where the stack top can reasonably be assumed to be in
> > cache.
> >
> > How will this affect non-user-mode code (i.e. kernel code)?
> >
> > Kernel code still has to ask for _mm_mfence if it wants mfence: C11
> > and C++11 barriers aren't specified as a specific instruction.
> >
> > Is it safe to access top-of-stack?
> >
> > AFAIK yes, and the ABI-specified red zone has our back (or front if
> > the stack grows up ☻).
> >
> > What about non-x86 architectures?
> >
> > Architectures such as ARMv8 support non-temporal instructions and
> > require barriers such as DMB nshld to order loads and DMB nshst to
> > order stores.
> >
> > Even ARM's address-dependency rule (a.k.a. the ill-fated
> > std::memory_order_consume) fails to hold with non-temporals:
> >
> >   LDR X0, [X3]
> >   LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!
> >
> > Who uses non-temporals anyways?
> >
> > That's an awfully personal question!
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
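
For readers following the thread, the de-facto convention the RFC mentions (a copy loop built from non-temporal stores that ends with an sfence) might look roughly like the sketch below. This is not code from any particular C library: nt_copy is a hypothetical helper name, the SSE2 intrinsics are the standard ones from <emmintrin.h>, and the non-x86 branch simply falls back to memcpy as a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper (not from the thread): copy `bytes` bytes into a
 * 16-byte-aligned destination with non-temporal stores, then fence.
 * Assumes `bytes` is a multiple of 16. */
static void nt_copy(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst % 16) == 0 && bytes % 16 == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; ++i) {
        __m128i v = _mm_loadu_si128(s + i); /* regular (unaligned) load */
        _mm_stream_si128(d + i, v);         /* non-temporal store */
    }
    /* The convention from the RFC: terminate the copy with an sfence so
     * the streaming stores are ordered before any later stores. */
    _mm_sfence();
#else
    /* Placeholder for targets without SSE2: an ordinary copy. */
    memcpy(dst, src, bytes);
#endif
}
```

The point of the sketch is only the placement of the trailing fence: the caller gets back a buffer whose contents are ordered like ordinary stores, which is the behavior the RFC wants to be able to express in target-independent IR.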
JF Bastien via llvm-dev
2016-Jan-14 21:02 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
On Thu, Jan 14, 2016 at 12:51 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> Hi JF, Philip,
>
> Clang currently has __builtin_nontemporal_store and
> __builtin_nontemporal_load. How will the usage model for those change?

I think you would use them in the same way, but you'd also have to use __builtin_nontemporal_store_fence and __builtin_nontemporal_load_fence.

Unless we have LLVM automagically figure out where non-temporal fences should go, which I think isn't as good of an approach.

> Thanks again,
> Hal
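
To make the proposed usage model concrete, here is a minimal sketch of how code might pair the existing builtins with explicit fences. The fence builtins named above are only proposed in this thread, so the sketch does not call them: nt_store_fence and nt_load_fence are hypothetical stand-ins, their mapping to sfence/lfence on x86 is an assumption, and the seq_cst fence used elsewhere is just a conservative placeholder. NT_STORE and NT_LOAD are also illustrative macros; they use Clang's builtins when available and fall back to plain accesses otherwise.

```c
#include <assert.h>
#include <stdatomic.h>

/* Use Clang's non-temporal builtins if present, else plain accesses. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_nontemporal_store) && \
      __has_builtin(__builtin_nontemporal_load)
#    define NT_STORE(val, ptr) __builtin_nontemporal_store((val), (ptr))
#    define NT_LOAD(ptr) __builtin_nontemporal_load(ptr)
#  endif
#endif
#ifndef NT_STORE
#  define NT_STORE(val, ptr) (*(ptr) = (val))
#  define NT_LOAD(ptr) (*(ptr))
#endif

/* Hypothetical stand-ins for the proposed fence builtins. */
#if defined(__SSE2__)
#include <emmintrin.h>
static inline void nt_store_fence(void) { _mm_sfence(); }
static inline void nt_load_fence(void) { _mm_lfence(); }
#else
static inline void nt_store_fence(void) { atomic_thread_fence(memory_order_seq_cst); }
static inline void nt_load_fence(void) { atomic_thread_fence(memory_order_seq_cst); }
#endif

static int payload[4];
static atomic_int ready;

/* Writer: non-temporal stores, then a store fence, then publish a flag. */
static void publish(const int *vals)
{
    for (int i = 0; i < 4; ++i)
        NT_STORE(vals[i], &payload[i]);
    nt_store_fence();
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: check the flag, fence, then read with non-temporal loads.
 * Returns -1 if nothing has been published yet. */
static int consume_sum(void)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return -1;
    nt_load_fence();
    int sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += NT_LOAD(&payload[i]);
    return sum;
}
```

The design point being illustrated is that the programmer states where the fences go, just as with atomics, rather than relying on the compiler to infer them.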
Hal Finkel via llvm-dev
2016-Jan-14 21:05 UTC
[llvm-dev] RFC: non-temporal fencing in LLVM IR
----- Original Message -----
> From: "JF Bastien" <jfb at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Philip Reames" <listmail at philipreames.com>, "Hans Boehm" <hboehm at google.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Thursday, January 14, 2016 3:02:20 PM
> Subject: Re: [llvm-dev] RFC: non-temporal fencing in LLVM IR
>
> On Thu, Jan 14, 2016 at 12:51 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > Hi JF, Philip,
> >
> > Clang currently has __builtin_nontemporal_store and
> > __builtin_nontemporal_load. How will the usage model for those
> > change?
>
> I think you would use them in the same way, but you'd have to also
> use __builtin_nontemporal_store_fence and
> __builtin_nontemporal_load_fence.

So we'll add new fence intrinsics. That makes sense.

> Unless we have LLVM automagically figure out where non-temporal
> fences should go, which I think isn't as good of an approach.

I agree. Such a determination is likely to be too conservative in practice.

 -Hal

> > Thanks again,
> > Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory