thr3ads.net - llvm dev - [llvm-dev] RFC: non-temporal fencing in LLVM IR [Jan 2016]

Hal Finkel via llvm-dev

2016-Jan-14 20:51 UTC

[llvm-dev] RFC: non-temporal fencing in LLVM IR

Hi JF, Philip,

Clang currently has __builtin_nontemporal_store and __builtin_nontemporal_load.
How will the usage model for those change?

Thanks again,
Hal

----- Original Message ----- 
> From: "Philip Reames via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "JF Bastien" <jfb at google.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Cc: "Hans Boehm" <hboehm at google.com>
> Sent: Wednesday, January 13, 2016 11:45:35 AM
> Subject: Re: [llvm-dev] RFC: non-temporal fencing in LLVM IR
> On 01/12/2016 11:16 PM, JF Bastien wrote:
> > Hello, fencing enthusiasts!
> 
> > TL;DR: We'd like to propose an addition to the LLVM memory model
> > requiring non-temporal accesses be surrounded by non-temporal load
> > barriers and non-temporal store barriers, and we'd like to add such
> > orderings to the fence IR opcode.
> 
> > We are open to different approaches, hence this email instead of a
> > patch.
> 
> > Who's "we"?
> 
> > Philip Reames brought this to my attention, and we've had numerous
> > discussions with Hans Boehm on the topic. Any mistakes below are my
> > own, all the clever bits are theirs.
> 
> > Why?
> 
> > Ignore non-temporals for a moment: on most x86 targets LLVM generates
> > an mfence for seq_cst atomic fencing. One could instead use a locked
> > idempotent atomic access to top-of-stack such as lock or4i [RSP-8] 0.
> > Philip has measured this as equivalent on micro-benchmarks, but as
> > ~25% faster in macro-benchmarks (other codebases confirm this).
> > There's one problem with this approach: non-temporal accesses on x86
> > are only ordered by fence instructions! This means that code using
> > non-temporal accesses can't rely on LLVM's fence opcode to do the
> > right thing; it instead has to rely on architecture-specific
> > _mm*fence intrinsics.
> 
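For readers less familiar with the fence under discussion, here is a minimal C11 sketch of what a seq_cst fence is typically used for (the classic store-buffering pattern). The names flag_a, flag_b, side_a and side_b are made up for illustration and do not come from the RFC. How the fence is lowered on x86 (mfence or a locked operation) is the backend's choice, which is the point made above.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared flags; static atomics start out as zero. */
static atomic_int flag_a;
static atomic_int flag_b;

/* Each side publishes its own flag, issues a seq_cst fence, then reads
 * the other side's flag. With both fences in place, two threads running
 * side_a and side_b concurrently cannot both observe 0. */
static int side_a(void) {
    atomic_store_explicit(&flag_a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag_b, memory_order_relaxed);
}

static int side_b(void) {
    atomic_store_explicit(&flag_b, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag_a, memory_order_relaxed);
}
```

Nothing in this sketch involves non-temporal accesses. The RFC's concern is that a fence lowered to a locked operation would still be fine for code like this, but would no longer order non-temporal accesses on x86.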
> Just to clarify: the proposal to change the implementation of
> seq_cst is arguably separate from this proposal. It will go through
> normal patch review once the semantics are addressed. Whatever we
> end up doing with seq_cst, we currently have a semantic hole in our
> specification around non-temporals that needs to be addressed.
>
> Another approach would be to define the current fences as fencing
> non-temporals and introduce new ones that don't. Either approach
> is workable. I believe that new fences for non-temporals are the
> appropriate choice given that it would more closely match existing
> practice.
>
> We could also consider forward-serializing bitcode to the stronger
> form whichever choice we make. That would be the conservatively
> correct thing to do for older bitcode which might be assuming
> stronger semantics than our barriers explicitly provided.
>
> > But wait! Who said developers need to issue any type of fence when
> > using non-temporals?
> 
> > Well, the LLVM memory model sure didn't. The x86 memory model does
> > (volume 3 section 8.2.2 Memory Ordering) but LLVM targets more than
> > x86 and the backends are free to ignore the !nontemporal metadata,
> > and AFAICT the x86 backend doesn't add those fences.
> 
> > Therefore even without the above optimization the LLVM language
> > reference is incorrect: non-temporals should be bracketed by
> > barriers. This applies even without threading! Non-temporal accesses
> > aren't guaranteed to interact well with regular accesses, which
> > means that regular loads cannot move "down" a non-temporal barrier,
> > and regular stores cannot move "up" a non-temporal barrier.
> 
> > Why not just have the compiler add the fences?
> 
> > LLVM could do this, either as a per-backend thing or a hookable pass
> > such as AtomicExpandPass. It seems more natural to ask the
> > programmer to express intent, just as is done with atomics. In fact,
> > a backend is currently free to ignore !nontemporal on load and store
> > and could therefore generate only half of what's requested, leading
> > to incorrect code. That would of course be silly: backends should
> > either honor all !nontemporal or none of them, but who knows what
> > the middle-end does.
> 
> > Put another way: some optimized C libraries use non-temporal accesses
> > (when string instructions aren't du jour) and they terminate their
> > copying with an sfence. It's a de-facto convention, the ABI doesn't
> > say anything, but let's avoid divergence.
> 
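The de-facto convention described above (a copy loop made of non-temporal stores, terminated by a single store fence) might look roughly like the following C sketch. The function name stream_copy32 and the NT_STORE32 / NT_STORE_FENCE macros are invented for this illustration. On targets without SSE2 the sketch falls back to plain stores and a C11 seq_cst fence, which is an assumption made only to keep the example portable, not something the thread prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_stream_si32 (MOVNTI) and _mm_sfence */
#define NT_STORE32(p, v) _mm_stream_si32((int *)(p), (int)(v))
#define NT_STORE_FENCE() _mm_sfence()
#else
#include <stdatomic.h>
/* Assumed fallback: no non-temporal hint, plain store plus a full fence. */
#define NT_STORE32(p, v) (*(p) = (v))
#define NT_STORE_FENCE() atomic_thread_fence(memory_order_seq_cst)
#endif

/* Copy n 32-bit words using non-temporal stores, then fence once at the
 * end so the streamed data is ordered before any later stores. */
static void stream_copy32(uint32_t *dst, const uint32_t *src, size_t n) {
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
        NT_STORE32(&dst[i], src[i]);
    }
    NT_STORE_FENCE();
}
```

The single fence after the loop is the part that corresponds to the convention: callers of such a routine never see the non-temporal stores unfenced, so they do not need to know how the copy was implemented.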
> > Aside: one day we may live in the fence elimination promised land
> > where fences are exactly where they need to be, no more, no less.
> 
> > Isn't x86's lfence just a no-op?
> 
> > Yes, but we're proposing the addition of a target-independent
> > non-temporal load barrier. It'll be up to the x86 backend to make it
> > an X86ISD::MEMBARRIER and other backends to get it right (hint: it's
> > not always a no-op).
> 
> > Won't this optimization cause coherency misses? C++ accesses the
> > thread stack concurrently all the time!
> 
> > Maybe, but then it isn't much of an optimization if it's slowing code
> > down. LLVM doesn't just target C++, and it's really up to the
> > backend to decide whether one fence type is better than another (on
> > x86, whether a locked top-of-stack idempotent operation is better
> > than mfence). Other languages have private stacks where this isn't
> > an issue, and where the stack top can reasonably be assumed to be in
> > cache.
> 
> > How will this affect non-user-mode code (i.e. kernel code)?
> 
> > Kernel code still has to ask for _mm_mfence if it wants mfence: C11
> > and C++11 barriers aren't specified as a specific instruction.
> 
> > Is it safe to access top-of-stack?
> 
> > AFAIK yes, and the ABI-specified red zone has our back (or front if
> > the stack grows up ☻).
> 
> > What about non-x86 architectures?
> 
> > Architectures such as ARMv8 support non-temporal instructions and
> > require barriers such as DMB nshld to order loads and DMB nshst to
> > order stores.
> 
> > Even ARM's address-dependency rule (a.k.a. the ill-fated
> > std::memory_order_consume) fails to hold with non-temporals:
> 
> > > LDR X0, [X3]
> > > LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!
> 
> > Who uses non-temporals anyways?
> 
> > That's an awfully personal question!
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

JF Bastien via llvm-dev

2016-Jan-14 21:02 UTC


[llvm-dev] RFC: non-temporal fencing in LLVM IR

On Thu, Jan 14, 2016 at 12:51 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> Hi JF, Philip,
>
> Clang currently has __builtin_nontemporal_store and
> __builtin_nontemporal_load. How will the usage model for those change?
>
I think you would use them in the same way, but you'd have to also use
__builtin_nontemporal_store_fence and __builtin_nontemporal_load_fence.

Unless we have LLVM automagically figure out where non-temporal fences
should go, which I think isn't as good of an approach.
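To make the usage model above concrete, here is a hedged C sketch. __builtin_nontemporal_store and __builtin_nontemporal_load are existing Clang builtins; the two fence builtins named above are only proposed in this thread, so the sketch uses stand-in functions (nontemporal_store_fence and nontemporal_load_fence, both hypothetical) that simply issue a C11 seq_cst fence. Whether a seq_cst fence really orders non-temporal accesses is exactly the question the RFC raises, so treat the stand-ins as placeholders. With compilers that lack the Clang builtins the sketch falls back to plain loads and stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#ifndef __has_builtin
#define __has_builtin(x) 0 /* compilers without __has_builtin */
#endif

#if __has_builtin(__builtin_nontemporal_store) && \
    __has_builtin(__builtin_nontemporal_load)
#define NT_STORE(v, p) __builtin_nontemporal_store((v), (p))
#define NT_LOAD(p) __builtin_nontemporal_load(p)
#else
/* Assumed fallback: ordinary accesses when the builtins are missing. */
#define NT_STORE(v, p) (*(p) = (v))
#define NT_LOAD(p) (*(p))
#endif

/* Hypothetical stand-ins for the proposed fence builtins. */
static inline void nontemporal_store_fence(void) {
    atomic_thread_fence(memory_order_seq_cst);
}
static inline void nontemporal_load_fence(void) {
    atomic_thread_fence(memory_order_seq_cst);
}

/* Non-temporal stores bracketed by store fences. */
static void fill_nt(int *dst, size_t n, int value) {
    nontemporal_store_fence();
    for (size_t i = 0; i < n; ++i) {
        NT_STORE(value, &dst[i]);
    }
    nontemporal_store_fence();
}

/* Non-temporal loads bracketed by load fences. */
static long sum_nt(int *src, size_t n) {
    long total = 0;
    nontemporal_load_fence();
    for (size_t i = 0; i < n; ++i) {
        total += NT_LOAD(&src[i]);
    }
    nontemporal_load_fence();
    return total;
}
```

The programmer states where the fences go, in the same way as with atomics, instead of relying on the compiler to infer them.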



Hal Finkel via llvm-dev

2016-Jan-14 21:05 UTC


[llvm-dev] RFC: non-temporal fencing in LLVM IR

----- Original Message -----
> From: "JF Bastien" <jfb at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Philip Reames" <listmail at philipreames.com>, "Hans Boehm" <hboehm at google.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Thursday, January 14, 2016 3:02:20 PM
> Subject: Re: [llvm-dev] RFC: non-temporal fencing in LLVM IR
> 
> 
> 
> 
> On Thu, Jan 14, 2016 at 12:51 PM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> Hi JF, Philip,
> 
> Clang currently has __builtin_nontemporal_store and
> __builtin_nontemporal_load. How will the usage model for those
> change?
> 
> 
> 
> I think you would use them in the same way, but you'd have to also
> use __builtin_nontemporal_store_fence and
> __builtin_nontemporal_load_fence.
So we'll add new fence intrinsics. That makes sense.
> 
> 
> Unless we have LLVM automagically figure out where non-temporal
> fences should go, which I think isn't as good of an approach.
> 
I agree. Such a determination is likely to be too conservative in practice.

 -Hal
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
