thr3ads.net - llvm dev - [llvm-dev] RFC: non-temporal fencing in LLVM IR [Jan 2016]

JF Bastien via llvm-dev

2016-Jan-13 07:16 UTC

[llvm-dev] RFC: non-temporal fencing in LLVM IR

Hello, fencing enthusiasts!

*TL;DR:* We'd like to propose an addition to the LLVM memory model
requiring non-temporal accesses be surrounded by non-temporal load barriers
and non-temporal store barriers, and we'd like to add such orderings to the
fence IR opcode.

We are open to different approaches, hence this email instead of a patch.


*Who's "we"?*

Philip Reames brought this to my attention, and we've had numerous
discussions with Hans Boehm on the topic. Any mistakes below are my own,
all the clever bits are theirs.


*Why?*

Ignore non-temporals for a moment: on most x86 targets LLVM generates an
mfence for seq_cst atomic fencing. One could instead use a locked
idempotent atomic access to top-of-stack such as lock or4i [RSP-8] 0.
Philip has measured this as equivalent on micro-benchmarks, but as ~25%
faster in macro-benchmarks (other codebases confirm this). There's one
problem with this approach: non-temporal accesses on x86 are only ordered
by fence instructions! This means that code using non-temporal accesses
can't rely on LLVM's fence opcode to do the right thing; it instead has to
rely on architecture-specific _mm*fence intrinsics.
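
To make the two fence flavors concrete, here is a minimal C++ sketch for
x86-64. It is an illustration only, not LLVM's actual lowering: the helper
names fence_mfence and fence_locked_or are hypothetical, the inline assembly
assumes a GCC- or Clang-compatible compiler, and the 32-bit "or" at -8(%rsp)
is just one reading of the "lock or4i [RSP-8] 0" shorthand above. Other
targets fall back to a standard seq_cst fence.

```cpp
#include <atomic>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

// Full fence as an explicit mfence instruction.
inline void fence_mfence() { _mm_mfence(); }

// Candidate replacement: a locked read-modify-write that ORs 0 into a stack
// slot. The memory contents never change, so the access is idempotent, but
// the lock prefix still acts as a full barrier for regular accesses.
// Whether it also orders non-temporal accesses is exactly the concern
// raised in this thread.
inline void fence_locked_or() {
  __asm__ __volatile__("lock; orl $0, -8(%%rsp)" : : : "memory", "cc");
}

#else

// Fallback for other targets (an assumption of this sketch): both helpers
// are a plain sequentially consistent fence.
inline void fence_mfence() {
  std::atomic_thread_fence(std::memory_order_seq_cst);
}
inline void fence_locked_or() {
  std::atomic_thread_fence(std::memory_order_seq_cst);
}

#endif
```

Either helper can stand in for a seq_cst fence between regular accesses;
only the mfence form is claimed above to order non-temporal accesses too.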


*But wait! Who said developers need to issue any type of fence when using
non-temporals?*

Well, the LLVM memory model sure didn't. The x86 memory model does (volume
3 section 8.2.2 Memory Ordering) but LLVM targets more than x86 and the
backends are free to ignore the !nontemporal metadata, and AFAICT the x86
backend doesn't add those fences.

Therefore even without the above optimization the LLVM language reference
is incorrect: non-temporals should be bracketed by barriers. This applies
even without threading! Non-temporal accesses aren't guaranteed to interact
well with regular accesses, which means that regular loads cannot move
"down" a non-temporal barrier, and regular stores cannot move "up" a
non-temporal barrier.
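
As a sketch of what bracketing looks like today on the store side, the
following C++ writes a payload with streaming stores, fences, and only then
publishes a flag. The Message, publish and try_consume names are
hypothetical. On x86-64 it uses the existing _mm_stream_si32 and _mm_sfence
intrinsics; on other targets it falls back to regular stores (an assumption
made so the example compiles everywhere). It shows the intrinsic-based
workaround, not the fence orderings proposed here.

```cpp
#include <atomic>

#if defined(__x86_64__)
#include <immintrin.h>
#endif

struct Message {
  alignas(64) int payload[16];
  std::atomic<bool> ready{false};
};

// Producer: fill the payload with non-temporal stores, fence, then publish.
inline void publish(Message &m, int seed) {
  for (int i = 0; i < 16; ++i) {
#if defined(__x86_64__)
    _mm_stream_si32(&m.payload[i], seed + i);  // non-temporal store
#else
    m.payload[i] = seed + i;                   // regular store fallback
#endif
  }
#if defined(__x86_64__)
  // Order the streaming stores before the flag store below.
  _mm_sfence();
#endif
  m.ready.store(true, std::memory_order_release);
}

// Consumer: returns false until the flag is set, then sums the payload.
inline bool try_consume(const Message &m, int &sum) {
  if (!m.ready.load(std::memory_order_acquire)) return false;
  sum = 0;
  for (int i = 0; i < 16; ++i) sum += m.payload[i];
  return true;
}
```

Without the explicit store fence, the release store of the flag alone is
not something this thread considers sufficient to order the streaming
stores.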


*Why not just have the compiler add the fences?*

LLVM could do this, either as a per-backend thing or a hookable pass such
as AtomicExpandPass. It seems more natural to ask the programmer to express
intent, just as is done with atomics. In fact, a backend is currently free
to ignore !nontemporal on load and store and could therefore generate only
half of what's requested, leading to incorrect code. That would of course
be silly: backends should either honor all !nontemporal or none of them,
but who knows what the middle-end does.

Put another way: some optimized C libraries use non-temporal accesses (when
string instructions aren't du jour) and they terminate their copying with
an sfence. It's a de-facto convention; the ABI doesn't say anything, but
let's avoid divergence.
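
The convention can be sketched as a streaming copy that ends with a store
fence. This is a simplified illustration, not any particular C library's
implementation: stream_copy is a hypothetical name, the destination is
assumed to be 16-byte aligned, and non-x86 targets simply call memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__)
#include <immintrin.h>
#endif

// Copies n bytes from src to dst. dst must be 16-byte aligned.
inline void stream_copy(void *dst, const void *src, std::size_t n) {
  assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
#if defined(__x86_64__)
  auto *d = static_cast<unsigned char *>(dst);
  const auto *s = static_cast<const unsigned char *>(src);
  std::size_t i = 0;
  // Bulk of the copy: unaligned loads, non-temporal 16-byte stores.
  for (; i + 16 <= n; i += 16) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(s + i));
    _mm_stream_si128(reinterpret_cast<__m128i *>(d + i), v);
  }
  // Tail of fewer than 16 bytes goes through regular stores.
  std::memcpy(d + i, s + i, n - i);
  // Terminate the copy with a store fence, as the convention above does.
  _mm_sfence();
#else
  std::memcpy(dst, src, n);
#endif
}
```

Callers then see the copy as complete once the function returns, without
having to know that streaming stores were used inside.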

Aside: one day we may live in the fence elimination promised land
<http://lists.llvm.org/pipermail/llvm-dev/2014-September/076701.html>
where fences are exactly where they need to be, no more, no less.


*Isn't x86's lfence just a no-op?*

Yes, but we're proposing the addition of a target-independent non-temporal
load barrier. It'll be up to the x86 backend to make it an
X86ISD::MEMBARRIER and other backends to get it right (hint: it's not
always a no-op).


*Won't this optimization cause coherency misses? C++ accesses the thread
stack concurrently all the time!*

Maybe, but then it isn't much of an optimization if it's slowing code down.
LLVM doesn't just target C++, and it's really up to the backend to decide
whether one fence type is better than another (on x86, whether a locked
top-of-stack idempotent operation is better than mfence). Other languages
have private stacks where this isn't an issue, and where the stack top can
reasonably be assumed to be in cache.


*How will this affect non-user-mode code (i.e. kernel code)?*

Kernel code still has to ask for _mm_mfence if it wants mfence: C11 and
C++11 barriers aren't specified as a specific instruction.


*Is it safe to access top-of-stack?*

AFAIK yes, and the ABI-specified red zone has our back (or front if the
stack grows up ☻).


*What about non-x86 architectures?*

Architectures such as ARMv8 support non-temporal instructions and require
barriers such as DMB nshld to order loads and DMB nshst to order stores.
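
As a rough picture of what a target-independent pair of barriers might map
to, here is a C++ sketch. The nt_load_barrier and nt_store_barrier names are
hypothetical, not a proposed API. The x86 load barrier is written as a
compiler-only barrier to mirror the X86ISD::MEMBARRIER remark above, the
AArch64 variants use the DMB options named in this section, and the generic
seq_cst fallback is an assumption that, per this RFC, is not guaranteed to
order non-temporal accesses.

```cpp
#include <atomic>

#if defined(__x86_64__)
#include <immintrin.h>
#endif

// Barrier intended to order non-temporal loads.
inline void nt_load_barrier() {
#if defined(__aarch64__) && defined(__GNUC__)
  __asm__ __volatile__("dmb nshld" : : : "memory");
#elif defined(__x86_64__)
  // No instruction emitted: only prevents compiler reordering.
  std::atomic_signal_fence(std::memory_order_seq_cst);
#else
  std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Barrier intended to order non-temporal stores.
inline void nt_store_barrier() {
#if defined(__aarch64__) && defined(__GNUC__)
  __asm__ __volatile__("dmb nshst" : : : "memory");
#elif defined(__x86_64__)
  _mm_sfence();
#else
  std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}
```

Hand-written wrappers like these are what code has to carry today; the
proposal would let the fence IR opcode express the same intent.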

Even ARM's address-dependency rule (a.k.a. the ill-fated
std::memory_order_consume) fails to hold with non-temporals:

LDR X0, [X3]
LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!



*Who uses non-temporals anyways?*

That's an awfully personal question!

Philip Reames via llvm-dev

2016-Jan-13 17:45 UTC

On 01/12/2016 11:16 PM, JF Bastien wrote:
> Hello, fencing enthusiasts!
>
> *Why?*
>
> Ignore non-temporals for a moment, on most x86 targets LLVM generates 
> an mfence for seq_cst atomic fencing. One could instead use a locked 
> idempotent atomic accesses to top-of-stack such as lock or4i [RSP-8] 
> 0. Philip has measured this as equivalent on micro-benchmarks, but as 
> ~25% faster in macro-benchmarks (other codebases confirm this). 
> There's one problem with this approach: non-temporal accesses on x86 
> are only ordered by fence instructions! This means that code using 
> non-temporal accesses can't rely on LLVM's fence opcode to do the 
> right thing, they instead have to rely on architecture-specific 
> _mm*fence intrinsics.

Just to clarify: the proposal to change the implementation of seq_cst
is arguably separate from this proposal.  It will go through normal
patch review once the semantics are addressed.  Whatever we end up doing
with seq_cst, we currently have a semantic hole in our specification
around non-temporals that needs to be addressed.

Another approach would be to define the current fences as fencing
non-temporals and introduce new ones that don't.  Either approach is
workable.  I believe that new fences for non-temporals are the
appropriate choice given that this would more closely match existing
practice.

We could also consider forward-serializing bitcode to the stronger form,
whichever choice we make.  That would be the conservatively correct thing
to do for older bitcode, which might be assuming stronger semantics than
our barriers explicitly provided.

John Brawn via llvm-dev

2016-Jan-13 18:32 UTC

> What about non-x86 architectures?
>
> Architectures such as ARMv8 support non-temporal instructions and require
> barriers such as DMB nshld to order loads and DMB nshst to order stores.
>
> Even ARM's address-dependency rule (a.k.a. the ill-fated
> std::memory_order_consume) fails to hold with non-temporals:
>
> LDR X0, [X3]
> LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!

What exactly do you mean by ‘X0 may not be loaded’ in your example here? If
you mean that the LDNP could start executing with the value of X0 from
before the LDR, e.g. initially X0=0x100, the LDR loads X0=0x200 but the LDNP
uses the old value of X0=0x100, then I don’t think that’s true. According to
section C3.2.4 of the ARMv8 ARMARM other observers may observe the LDR and
the LDNP in the wrong order, but the CPU executing the instructions will
observe them in program order.

I have no idea if that affects anything in this RFC though.

John


JF Bastien via llvm-dev

2016-Jan-13 18:44 UTC

On Wed, Jan 13, 2016 at 10:32 AM, John Brawn <John.Brawn at arm.com> wrote:
> *What about non-x86 architectures?*
>
> Architectures such as ARMv8 support non-temporal instructions and require
> barriers such as DMB nshld to order loads and DMB nshst to order stores.
>
> Even ARM's address-dependency rule (a.k.a. the ill-fated
> std::memory_order_consume) fails to hold with non-temporals:
>
> LDR X0, [X3]
> LDNP X2, X1, [X0] // X0 may not be loaded when the instruction executes!
>
> What exactly do you mean by ‘X0 may not be loaded’ in your example here?
> If you mean that the LDNP could start executing with the value of X0 from
> before the LDR, e.g. initially X0=0x100, the LDR loads X0=0x200 but the
> LDNP uses the old value of X0=0x100, then I don’t think that’s true.
> According to section C3.2.4 of the ARMv8 ARMARM *other* observers may
> observe the LDR and the LDNP in the wrong order, but the CPU executing the
> instructions will observe them in program order.

I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
details for that ISA. I lifted this example from here:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html


Which is correct?


> I have no idea if that affects anything in this RFC though.

Agreed, but I don't want to be misleading! The current example serves as a
good justification for non-temporal read barriers, it would be a shame to
justify myself on incorrect data :-)



Hal Finkel via llvm-dev

2016-Jan-14 20:51 UTC

Hi JF, Philip,

Clang currently has __builtin_nontemporal_store and __builtin_nontemporal_load.
How will the usage model for those change?

Thanks again,
Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
