thr3ads.net - llvm dev - [llvm-dev] [AVX512BW] Nasty KAND issue [Oct 2016]

If this information is useful, please help other people find it:
Share via:

Friedman, Eli via llvm-dev

2016-Oct-20 17:05 UTC

[llvm-dev] [AVX512BW] Nasty KAND issue

On 10/20/2016 9:28 AM, Cameron McInally via llvm-dev
wrote:> I should have attached the generated asm to save some trouble.
> Apologies for that and attaching now...
>
>
>
> On Thu, Oct 20, 2016 at 12:26 PM, Cameron McInally
> <cameron.mcinally at nyu.edu> wrote:
>> On Thu, Oct 20, 2016 at 12:05 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>>>> On Oct 20, 2016, at 8:54 AM, Cameron McInally via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>>>>
>>>> Hey guys,
>>>>
>>>> I've hit a pretty nasty issue on SKX with ANDs of masks
<= 4 bits.
>>>>
>>>> In the IR, we represent a 4b vector mask as <4 x i1>.
This assumes
>>>> that the storage container for this type is also 4b, but
it's not.
>>> The storage type is not relevant, these bits are “unreachable” from
the IR point of view.
>>> The backend is supposed to lower the operation in a safe way when
it is needed to clear these bit.
>>>
>>> For example if you were to perform some arithmetic operation on
these, it is likely that they would get zero extended to 8bits first and this is
where the upper bits would be cleared.
>>>
>>>
>>>> The
>>>> smallest mask register on SKX is 8b. This also implies that the
>>>> smallest load/store moves 8b.
>>>>
>>>> We run into problems when we try to optimize ANDs (full test
case attached):
>>>>
>>>>   %r1 = and <4 x i1> %r0, <i1 -1, i1 -1, i1 -1, i1
-1>
>>>>
>>>> At the IR level the all1s mask looks like the Identity for this
>>>> operation, so LLVM will remove it. But it is not the Identity
since
>>>> this operation should clear the top 4 bits of the 8 bit
hardware
>>>> register in play. E.g.
>>> No, this operation alone does not need to clear the upper bit, they
are undefined before and after.
>>>
>>>
>>>>         kmovb   -4(%rsp), %k0
>>>>         kandb   %k0, %k1, %k0
>>>>         kmovb   %k0, -4(%rsp)
>>>>
>>>> I began tracking down this issue and found that InstCombine
will
>>>> incorrectly remove the AND. Then I noticed that the Reassociate
pass
>>>> would also remove the AND if InstCombine did not. That made me
>>>> nervous. My current thinking is that this might be a larger
problem
>>>> that shouldn't be patched up. Or maybe I made a faulty
assumption with
>>>> the IR I choose for this operation.
>>> There might be a legitimate issue, but your example fails short to
illustrate it right now: you’re not showing how these upper bits are leaking
into the computation somewhere?
>>>
>>> —
>>> Mehdi
>> Hi Mehdi,
>>
>> Yes, good point. Updated test case exhibiting the dirty bits attached.
>> Notice that the kortest will operate on the dirty bits that should
>> have been zeroed.
>>
>> Perhaps the problem is that the zext of the i4 to i16 does not get
>> generated correctly.
>>
The problem with your IR is actually the load.  When you load a value 
whose size in bits is not a multiple of 8 (like i1, or <4 x i1>, the 
result is undefined unless the unused bits are zero. You can see this in 
the debug output from llc:

SelectionDAG has 15 nodes:
   t0: ch = EntryToken
         t21: i32 = X86ISD::KORTEST t19, t19
       t22: i8 = X86ISD::SETCC Constant:i8<4>, t21
     t23: i32 = zero_extend t22
   t14: ch,glue = CopyToReg t0, Register:i32 %EAX, t23
       t24: i16,ch = load<LD1[%XXX](align=4)(dereferenceable), zext from 
i8> t0, FrameIndex:i64<0>, undef:i64
     t26: i16 = AssertZext t24, ValueType:ch:i4
   t19: v16i1 = bitcast t26
   t15: ch = X86ISD::RET_FLAG t14, TargetConstant:i32<0>, Register:i32 
%EAX, t14:1

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project

Mehdi Amini via llvm-dev

2016-Oct-20 17:14 UTC

head link

[llvm-dev] [AVX512BW] Nasty KAND issue

> On Oct 20, 2016, at 10:05 AM, Friedman, Eli <efriedma at
codeaurora.org> wrote:
> 
> On 10/20/2016 9:28 AM, Cameron McInally via llvm-dev wrote:
>> I should have attached the generated asm to save some trouble.
>> Apologies for that and attaching now...
>> 
>> 
>> 
>> On Thu, Oct 20, 2016 at 12:26 PM, Cameron McInally
>> <cameron.mcinally at nyu.edu> wrote:
>>> On Thu, Oct 20, 2016 at 12:05 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>>>>> On Oct 20, 2016, at 8:54 AM, Cameron McInally via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>>>>> 
>>>>> Hey guys,
>>>>> 
>>>>> I've hit a pretty nasty issue on SKX with ANDs of masks
<= 4 bits.
>>>>> 
>>>>> In the IR, we represent a 4b vector mask as <4 x i1>.
This assumes
>>>>> that the storage container for this type is also 4b, but
it's not.
>>>> The storage type is not relevant, these bits are “unreachable”
from the IR point of view.
>>>> The backend is supposed to lower the operation in a safe way
when it is needed to clear these bit.
>>>> 
>>>> For example if you were to perform some arithmetic operation on
these, it is likely that they would get zero extended to 8bits first and this is
where the upper bits would be cleared.
>>>> 
>>>> 
>>>>> The
>>>>> smallest mask register on SKX is 8b. This also implies that
the
>>>>> smallest load/store moves 8b.
>>>>> 
>>>>> We run into problems when we try to optimize ANDs (full
test case attached):
>>>>> 
>>>>>  %r1 = and <4 x i1> %r0, <i1 -1, i1 -1, i1 -1, i1
-1>
>>>>> 
>>>>> At the IR level the all1s mask looks like the Identity for
this
>>>>> operation, so LLVM will remove it. But it is not the
Identity since
>>>>> this operation should clear the top 4 bits of the 8 bit
hardware
>>>>> register in play. E.g.
>>>> No, this operation alone does not need to clear the upper bit,
they are undefined before and after.
>>>> 
>>>> 
>>>>>        kmovb   -4(%rsp), %k0
>>>>>        kandb   %k0, %k1, %k0
>>>>>        kmovb   %k0, -4(%rsp)
>>>>> 
>>>>> I began tracking down this issue and found that InstCombine
will
>>>>> incorrectly remove the AND. Then I noticed that the
Reassociate pass
>>>>> would also remove the AND if InstCombine did not. That made
me
>>>>> nervous. My current thinking is that this might be a larger
problem
>>>>> that shouldn't be patched up. Or maybe I made a faulty
assumption with
>>>>> the IR I choose for this operation.
>>>> There might be a legitimate issue, but your example fails short
to illustrate it right now: you’re not showing how these upper bits are leaking
into the computation somewhere?
>>>> 
>>>> —
>>>> Mehdi
>>> Hi Mehdi,
>>> 
>>> Yes, good point. Updated test case exhibiting the dirty bits
attached.
>>> Notice that the kortest will operate on the dirty bits that should
>>> have been zeroed.
>>> 
>>> Perhaps the problem is that the zext of the i4 to i16 does not get
>>> generated correctly.
>>> 
> 
> The problem with your IR is actually the load.  When you load a value whose
size in bits is not a multiple of 8 (like i1, or <4 x i1>, the result is
undefined unless the unused bits are zero.
Almost: LangRef says "When loading a value of a type like i20 with a size
that is not an integral number of bytes, the result is undefined if the value
was not originally written using a store of the same type.”

This is even stronger than what you described:  even if the bits are explicitly 
0, the results can be undefined.

Also, while I guess the most straightforward lowering of the store would clear
the upper bits, that’s not a requirement of LangRef I believe (it seems legal to
me to clear the bits on the load instead for example).
> You can see this in the debug output from llc:
> 
> SelectionDAG has 15 nodes:
>  t0: ch = EntryToken
>        t21: i32 = X86ISD::KORTEST t19, t19
>      t22: i8 = X86ISD::SETCC Constant:i8<4>, t21
>    t23: i32 = zero_extend t22
>  t14: ch,glue = CopyToReg t0, Register:i32 %EAX, t23
>      t24: i16,ch = load<LD1[%XXX](align=4)(dereferenceable), zext from
i8> t0, FrameIndex:i64<0>, undef:i64
If I read correctly, the issue is the "zext from i8” part of the load
right?  The backend assumed they were cleared on the store.

Thanks,

— 
Mehdi

>    t26: i16 = AssertZext t24, ValueType:ch:i4
>  t19: v16i1 = bitcast t26
>  t15: ch = X86ISD::RET_FLAG t14, TargetConstant:i32<0>, Register:i32
%EAX, t14:1
> 
> -Eli
> 
> -- 
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161020/79a0446f/attachment.html>

Cameron McInally via llvm-dev

2016-Oct-20 17:36 UTC

head link

[llvm-dev] [AVX512BW] Nasty KAND issue

On Thu, Oct 20, 2016 at 1:14 PM, Mehdi Amini <mehdi.amini at apple.com>
wrote:>
> On Oct 20, 2016, at 10:05 AM, Friedman, Eli <efriedma at
codeaurora.org> wrote:
>
> On 10/20/2016 9:28 AM, Cameron McInally via llvm-dev wrote:
>
> I should have attached the generated asm to save some trouble.
> Apologies for that and attaching now...
>
>
>
> On Thu, Oct 20, 2016 at 12:26 PM, Cameron McInally
> <cameron.mcinally at nyu.edu> wrote:
>
> On Thu, Oct 20, 2016 at 12:05 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>
> On Oct 20, 2016, at 8:54 AM, Cameron McInally via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> Hey guys,
>
> I've hit a pretty nasty issue on SKX with ANDs of masks <= 4 bits.
>
> In the IR, we represent a 4b vector mask as <4 x i1>. This assumes
> that the storage container for this type is also 4b, but it's not.
>
> The storage type is not relevant, these bits are “unreachable” from the IR
> point of view.
> The backend is supposed to lower the operation in a safe way when it is
> needed to clear these bit.
>
> For example if you were to perform some arithmetic operation on these, it
is
> likely that they would get zero extended to 8bits first and this is where
> the upper bits would be cleared.
>
>
> The
> smallest mask register on SKX is 8b. This also implies that the
> smallest load/store moves 8b.
>
> We run into problems when we try to optimize ANDs (full test case
attached):
>
>  %r1 = and <4 x i1> %r0, <i1 -1, i1 -1, i1 -1, i1 -1>
>
> At the IR level the all1s mask looks like the Identity for this
> operation, so LLVM will remove it. But it is not the Identity since
> this operation should clear the top 4 bits of the 8 bit hardware
> register in play. E.g.
>
> No, this operation alone does not need to clear the upper bit, they are
> undefined before and after.
>
>
>        kmovb   -4(%rsp), %k0
>        kandb   %k0, %k1, %k0
>        kmovb   %k0, -4(%rsp)
>
> I began tracking down this issue and found that InstCombine will
> incorrectly remove the AND. Then I noticed that the Reassociate pass
> would also remove the AND if InstCombine did not. That made me
> nervous. My current thinking is that this might be a larger problem
> that shouldn't be patched up. Or maybe I made a faulty assumption with
> the IR I choose for this operation.
>
> There might be a legitimate issue, but your example fails short to
> illustrate it right now: you’re not showing how these upper bits are
leaking
> into the computation somewhere?
>
> —
> Mehdi
>
> Hi Mehdi,
>
> Yes, good point. Updated test case exhibiting the dirty bits attached.
> Notice that the kortest will operate on the dirty bits that should
> have been zeroed.
>
> Perhaps the problem is that the zext of the i4 to i16 does not get
> generated correctly.
>
>
> The problem with your IR is actually the load.  When you load a value whose
> size in bits is not a multiple of 8 (like i1, or <4 x i1>, the result
is
> undefined unless the unused bits are zero.
>
>
> Almost: LangRef says "When loading a value of a type like i20 with a
size
> that is not an integral number of bytes, the result is undefined if the
> value was not originally written using a store of the same type.”
>
> This is even stronger than what you described:  even if the bits are
> explicitly  0, the results can be undefined.
>
> Also, while I guess the most straightforward lowering of the store would
> clear the upper bits, that’s not a requirement of LangRef I believe (it
> seems legal to me to clear the bits on the load instead for example).
>
> You can see this in the debug output from llc:
>
> SelectionDAG has 15 nodes:
>  t0: ch = EntryToken
>        t21: i32 = X86ISD::KORTEST t19, t19
>      t22: i8 = X86ISD::SETCC Constant:i8<4>, t21
>    t23: i32 = zero_extend t22
>  t14: ch,glue = CopyToReg t0, Register:i32 %EAX, t23
>      t24: i16,ch = load<LD1[%XXX](align=4)(dereferenceable), zext from
i8>
> t0, FrameIndex:i64<0>, undef:i64
>
>
> If I read correctly, the issue is the "zext from i8” part of the load
right?
> The backend assumed they were cleared on the store.
>
> Thanks,
>
> —
> Mehdi
Well that wouldn't be too bad. I.e. explicitly zeroing the top bits on
the <4 x i1> store. I think it would require an extra instruction
though.

Handling <4 x i1> loads/stores as <8 x i1> at my interface sounds
awful. Just FYI.

-Cam

llvm dev - Oct 2016 - [AVX512BW] Nasty KAND issue

[llvm-dev] [AVX512BW] Nasty KAND issue

[llvm-dev] [AVX512BW] Nasty KAND issue

[llvm-dev] [AVX512BW] Nasty KAND issue