thr3ads.net - llvm dev - [llvm-dev] RFC: non-temporal fencing in LLVM IR [Jan 2016]

Hans Boehm via llvm-dev

2016-Jan-14 03:00 UTC

[llvm-dev] RFC: non-temporal fencing in LLVM IR

I agree with Tim's assessment for ARM.  That's interesting; I wasn't
previously aware of that instruction.

My understanding is that Alpha would have the same problem for normal loads.

I'm all in favor of more systematic handling of the fences associated with
x86 non-temporal accesses.

AFAICT, nontemporal loads and stores seem to have different fencing rules
on x86, none of them very clear.  Nontemporal stores should probably
ideally use an SFENCE.  Locked instructions seem to be documented to work
with MOVNTDQA.  In both cases, there seems to be only empirical evidence as
to which side(s) of the nontemporal operations they should go on?

I finally decided that I was OK with using a LOCKed top-of-stack update as
a fence in Java on x86.  I'm significantly less enthusiastic for C++.  I
also think that risks unexpected coherence miss problems, though they would
probably be very rare.  But they would be very surprising if they did occur.
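
To make the SFENCE point concrete, here is a minimal sketch of a writer that fills a buffer with non-temporal stores and then sets a ready flag. The helper name `fill_and_publish` and its parameters are hypothetical and do not come from the thread. Putting the SFENCE after the streaming stores and before the flag store is an assumption, given that the placement question above is left open. On targets without SSE2 the sketch falls back to ordinary stores.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_si32 (MOVNTI) and _mm_sfence (SFENCE)
#endif

// Hypothetical helper: fill dst[0..n) with value, then signal completion.
void fill_and_publish(int* dst, std::size_t n, int value,
                      std::atomic<bool>& ready) {
    assert(dst != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__SSE2__)
        _mm_stream_si32(&dst[i], value);  // non-temporal, weakly ordered store
#else
        dst[i] = value;                   // fallback: ordinary store
#endif
    }
#if defined(__SSE2__)
    // Assumed placement: order the streaming stores before the flag store.
    // A release store on x86 is typically a plain MOV, which by itself is
    // not documented to order earlier non-temporal stores.
    _mm_sfence();
#endif
    ready.store(true, std::memory_order_release);
}
```

A reader would poll `ready` with an acquire load before touching the buffer. Whether that is sufficient when the reader itself uses non-temporal loads is the open question in this thread.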

On Wed, Jan 13, 2016 at 10:59 AM, Tim Northover <t.p.northover at gmail.com> wrote:
> > I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
> > details for that ISA. I lifted this example from here:
> >
> > http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
> >
> > Which is correct?
>
> FWIW, I agree with John here. The example I'd give for the unexpected
> behaviour allowed in the spec is:
>
> .Lwait_for_data:
>     ldr x0, [x3]
>     cbz x0, .Lwait_for_data
>     ldnp x2, x1, [x0]
>
> where another thread first writes to a buffer then tells us where that
> buffer is. For a normal ldp, the address dependency rule means we
> don't need a barrier or acquiring load to ensure we see the real data
> in the buffer. For ldnp, we would need a barrier to prevent stale
> data.
>
> I suspect this is actually even closer to the x86 situation than what
> the guide implies (which looks like a straight-up exposed pipeline to
> me, beyond even what Alpha would have done).
>
> Cheers.
>
> Tim.
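
Tim's hand-off pattern can be sketched in C++ roughly as follows. All names (`Buffer`, `g_ready`, `publish`, `try_consume`) are hypothetical and chosen only for illustration. The reader uses a relaxed load followed by an explicit acquire fence, which stands in for the barrier that would be required if the dependent loads were non-temporal (ldnp). With ordinary dependent loads, the ARM address dependency rule would already give the ordering at the hardware level.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical two-word payload, matching the register pair loaded by ldp/ldnp.
struct Buffer {
    std::uint64_t lo = 0;
    std::uint64_t hi = 0;
};

// Plays the role of the location addressed by x3: null until data is ready.
std::atomic<Buffer*> g_ready{nullptr};

// Writer: fill the buffer first, then tell the reader where it is.
void publish(Buffer* buf, std::uint64_t lo, std::uint64_t hi) {
    assert(buf != nullptr);
    buf->lo = lo;
    buf->hi = hi;
    // The release store keeps the buffer writes ordered before the pointer store.
    g_ready.store(buf, std::memory_order_release);
}

// Reader: one iteration of the .Lwait_for_data loop.
// Returns false while the pointer is still null.
bool try_consume(std::uint64_t& lo, std::uint64_t& hi) {
    Buffer* buf = g_ready.load(std::memory_order_relaxed);  // ldr x0, [x3]
    if (buf == nullptr) {                                   // cbz x0, ...
        return false;
    }
    // Explicit barrier: needed if the loads below cannot rely on the
    // address dependency, as Tim describes for ldnp.
    std::atomic_thread_fence(std::memory_order_acquire);
    lo = buf->lo;
    hi = buf->hi;
    return true;
}
```

In real use `publish` and `try_consume` would run on different threads, with the reader looping until `try_consume` returns true.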

David Majnemer via llvm-dev

2016-Jan-14 21:10 UTC

On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> I agree with Tim's assessment for ARM.  That's interesting; I wasn't
> previously aware of that instruction.
>
> My understanding is that Alpha would have the same problem for normal
> loads.
>
> I'm all in favor of more systematic handling of the fences associated with
> x86 non-temporal accesses.
>
> AFAICT, nontemporal loads and stores seem to have different fencing rules
> on x86, none of them very clear.  Nontemporal stores should probably
> ideally use an SFENCE.  Locked instructions seem to be documented to work
> with MOVNTDQA.  In both cases, there seems to be only empirical evidence as
> to which side(s) of the nontemporal operations they should go on?
>
> I finally decided that I was OK with using a LOCKed top-of-stack update as
> a fence in Java on x86.  I'm significantly less enthusiastic for C++.  I
> also think that risks unexpected coherence miss problems, though they would
> probably be very rare.  But they would be very surprising if they did occur.

Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence
seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when
targeting 32-bit x86 machines which do not support mfence.  What
instruction sequence should we be using instead?
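
For reference, the kind of source that depends on this fence is the classic store-buffering pattern. The sketch below uses `std::atomic_thread_fence(std::memory_order_seq_cst)`, which Clang typically lowers to the same `fence seq_cst` as the builtins named above. The names `g_x`, `g_y`, `side_a`, `side_b` and `check_outcome` are hypothetical.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> g_x{0};
std::atomic<int> g_y{0};

// Meant to run on thread A: store to g_x, full fence, then read g_y.
int side_a() {
    g_x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // the fence in question
    return g_y.load(std::memory_order_relaxed);
}

// Meant to run on thread B: store to g_y, full fence, then read g_x.
int side_b() {
    g_y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return g_x.load(std::memory_order_relaxed);
}

// With both fences in place, the two sides must not both observe 0.
void check_outcome(int seen_by_a, int seen_by_b) {
    assert(!(seen_by_a == 0 && seen_by_b == 0));
}
```

Without the fences, x86 store buffering allows both loads to return 0, which is why some full-barrier instruction sequence has to be emitted here.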


JF Bastien via llvm-dev

2016-Jan-14 21:13 UTC

On Thu, Jan 14, 2016 at 1:10 PM, David Majnemer via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> On Wed, Jan 13, 2016 at 7:00 PM, Hans Boehm via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> I agree with Tim's assessment for ARM.  That's interesting; I wasn't
>> previously aware of that instruction.
>>
>> My understanding is that Alpha would have the same problem for normal
>> loads.
>>
>> I'm all in favor of more systematic handling of the fences associated
>> with x86 non-temporal accesses.
>>
>> AFAICT, nontemporal loads and stores seem to have different fencing rules
>> on x86, none of them very clear.  Nontemporal stores should probably
>> ideally use an SFENCE.  Locked instructions seem to be documented to work
>> with MOVNTDQA.  In both cases, there seems to be only empirical evidence as
>> to which side(s) of the nontemporal operations they should go on?
>>
>> I finally decided that I was OK with using a LOCKed top-of-stack update
>> as a fence in Java on x86.  I'm significantly less enthusiastic for C++.  I
>> also think that risks unexpected coherence miss problems, though they would
>> probably be very rare.  But they would be very surprising if they did occur.
>
> Today's LLVM already emits 'lock or %eax, (%esp)' for 'fence
> seq_cst'/__sync_synchronize/__atomic_thread_fence(__ATOMIC_SEQ_CST) when
> targeting 32-bit x86 machines which do not support mfence.  What
> instruction sequence should we be using instead?

Do they have non-temporal accesses in the ISA?

John Brawn via llvm-dev

2016-Jan-15 18:22 UTC

> I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
> details for that ISA. I lifted this example from here:
>
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
>
> Which is correct?

I’ve confirmed that this example in the Cortex-A programmers guide is wrong, and
it should hopefully be corrected in a future version.

John

Hans Boehm via llvm-dev

2016-Jan-15 18:56 UTC

It seems to me the intent of that section is intelligible to those of us
who have been spending too much time dealing with these issues, but seems
wrong to everyone else:  If another thread updates [X0] and then [X3] (with
an intervening fence), this thread may see the new value of [X3], but the
old value of [X0], violating the data dependence.  This makes it incorrect
to use such a load for e.g. Java final fields without a fence.  I agree
that the text is at best unclear, but presumably that was indeed the intent?

On Fri, Jan 15, 2016 at 10:22 AM, John Brawn <John.Brawn at arm.com> wrote:
> > I haven't touched ARMv8 in a few years so I'm rusty on the non-temporal
> > details for that ISA. I lifted this example from here:
> >
> > http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJACGJJF.html
> >
> > Which is correct?
>
> I’ve confirmed that this example in the Cortex-A programmers guide is
> wrong, and it should hopefully be corrected in a future version.
>
> John
