thr3ads.net - llvm dev - [llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0 [Feb 2017]

Saurabh Verma via llvm-dev

2017-Feb-18 07:33 UTC

[llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

Thanks Sanjay. Interestingly for me, -disable-llvm-optzns did not make a
difference in the way the shift was handled. Does the initial IR generated
for you show this difference when the option is passed?

Best regards
Saurabh


On 17 February 2017 at 19:03, Sanjay Patel <spatel at rotateright.com>
wrote:
> I think this is caused by a front-end change (cc'ing clang-dev) because
> the IR with "-Xclang -disable-llvm-optzns" shows the difference.
>
> But independently of that, there's a missing IR canonicalization -
> instcombine doesn't currently do anything with either version.
>
> And the version where we trunc later survives through the backend and
> produces worse code even for x86 with AVX2:
> before:
>     vmovd    %edi, %xmm1
>     vpmovzxwq    %xmm1, %xmm1
>     vpsraw    %xmm1, %xmm0, %xmm0
>     retq
>
> after:
>     vmovd    %edi, %xmm1
>     vpbroadcastd    %xmm1, %ymm1
>     vmovdqa    LCPI1_0(%rip), %ymm2
>     vpshufb    %ymm2, %ymm1, %ymm1
>     vpermq    $232, %ymm1, %ymm1
>     vpmovzxwd    %xmm1, %ymm1
>     vpmovsxwd    %xmm0, %ymm0
>     vpsravd    %ymm1, %ymm0, %ymm0
>     vpshufb    %ymm2, %ymm0, %ymm0
>     vpermq    $232, %ymm0, %ymm0
>     vzeroupper
>
>
> So this example may have won the bug lottery by exposing all of front-,
> middle-, back-end bugs. :)
>
>
>
> On Fri, Feb 17, 2017 at 9:38 AM, Saurabh Verma via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Correction in the C snippet:
>>
>> typedef signed short v8i16_t   __attribute__((ext_vector_type(8)));
>>
>> v8i16_t foo (v8i16_t a, int n)
>> {
>>    return a >> n;
>> }
>>
>> Best regards
>> Saurabh
>>
>>
>>
>> On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
>> wrote:
>>
>>> Hello,
>>>
>>> We are investigating a difference in code generation for vector splat
>>> instructions between llvm-3.9 and llvm-4.0, which could lead to a
>>> performance regression for our target. Here is the C snippet
>>>
>>> typedef signed v8i16_t __attribute__((ext_vector_type(8)))
>>>
>>> v8i16_t foo (v8i16 a, int n)
>>> {
>>>    return result = a >> n;
>>> }
>>>
>>> With llvm-3.9, the generated sequence does a trunc followed by splat,
>>> but with llvm-4.0 it is reversed to a splat to a bigger vector followed by
>>> a v8i32->v8i16 trunc. Is this by design? The earlier code sequence is
>>> definitely better for our target, but are there known scenarios where the
>>> new sequence would lead to better code?
>>>
>>> Here are the instruction sequences generated in the two cases:
>>>
>>> With llvm 3.9:
>>>
>>> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>>>   %3 = trunc i32 %1 to i16
>>>   %4 = insertelement <8 x i16> undef, i16 %3, i32 0
>>>   %5 = shufflevector <8 x i16> %4, <8 x i16> undef, <8 x i32> zeroinitializer
>>>   %6 = ashr <8 x i16> %0, %5
>>>   ret <8 x i16> %6
>>> }
>>>
>>>
>>> With llvm 4.0:
>>>
>>> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>>>   %3 = insertelement <8 x i32> undef, i32 %1, i32 0
>>>   %4 = shufflevector <8 x i32> %3, <8 x i32> undef, <8 x i32> zeroinitializer
>>>   %5 = trunc <8 x i32> %4 to <8 x i16>
>>>   %6 = ashr <8 x i16> %0, %5
>>>   ret <8 x i16> %6
>>> }
>>>
>>> Best regards
>>> Saurabh Verma
>>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>

Sanjay Patel via llvm-dev

2017-Feb-18 16:11 UTC

Yes, there is an IR difference between clang 3.9.1 and clang trunk before
any IR transforms are done:
https://godbolt.org/g/FuBqIb

We can't solve this problem (moving a trunc ahead of other vector ops) in
general in IR because we take a conservative approach to vector transforms
in IR. That means the burden for solving the general problem falls on the
front-end or the back-end. If you can bisect to find the clang commit where
this changed, that would be very helpful.

However, I think we can handle a very specific case (a too fat splat) in IR
in instcombine, and it will resolve this exact example. This will take a
couple of patches to restore your example. Here's a proposal for the first
one:
https://reviews.llvm.org/D30123
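
To illustrate why narrowing a "too fat" splat is safe, here is a minimal sketch in plain C that models the two shift-amount computations from the IR quoted earlier in the thread. It is not the instcombine patch itself; the function names and the use of fixed-size arrays in place of vector types are assumptions made for illustration, and unsigned types are used so that the narrowing conversion is well defined in C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8

/* llvm-3.9 shape: truncate the scalar first, then splat the narrow value. */
static void trunc_then_splat(uint32_t n, uint16_t out[LANES]) {
    uint16_t narrow = (uint16_t)n;      /* models: trunc i32 %1 to i16 */
    for (int i = 0; i < LANES; ++i)
        out[i] = narrow;                /* models: insertelement + shufflevector */
}

/* llvm-4.0 shape: splat the wide scalar, then truncate every lane. */
static void splat_then_trunc(uint32_t n, uint16_t out[LANES]) {
    uint32_t wide[LANES];
    for (int i = 0; i < LANES; ++i)
        wide[i] = n;                    /* models: splat into <8 x i32> */
    for (int i = 0; i < LANES; ++i)
        out[i] = (uint16_t)wide[i];     /* models: trunc <8 x i32> to <8 x i16> */
}

/* Returns 1 when both orderings produce the same shift-amount vector. */
static int same_shift_amounts(uint32_t n) {
    uint16_t a[LANES], b[LANES];
    trunc_then_splat(n, a);
    splat_then_trunc(n, b);
    return memcmp(a, b, sizeof a) == 0;
}

/* Spot-check a few values, including one that does not fit in 16 bits. */
static void check_examples(void) {
    assert(same_shift_amounts(0u));
    assert(same_shift_amounts(3u));
    assert(same_shift_amounts(0x12345u));
    assert(same_shift_amounts(0xFFFFFFFFu));
}
```

Because every lane of the splat holds the same scalar, truncating each lane gives the same result as truncating the scalar once, so the narrower form does the same work with one scalar trunc instead of a vector trunc.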



Sanjay Patel via llvm-dev

2017-Mar-08 15:21 UTC

The regression for the reported case should be avoided after:
https://reviews.llvm.org/rL297232
https://reviews.llvm.org/rL297242
https://reviews.llvm.org/rL297280

It would still be good to understand if the clang change was intentional or
if that was a side effect that can be limited.

