thr3ads.net - llvm dev - [llvm-dev] [cfe-dev] Vector trunc code generation difference between llvm-3.9 and 4.0 [Mar 2017]

Sanjay Patel via llvm-dev

2017-Mar-08 15:21 UTC

[llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

The regression for the reported case should be avoided after:
https://reviews.llvm.org/rL297232
https://reviews.llvm.org/rL297242
https://reviews.llvm.org/rL297280

It would still be good to understand if the clang change was intentional or
if that was a side effect that can be limited.

On Sat, Feb 18, 2017 at 9:11 AM, Sanjay Patel <spatel at rotateright.com>
wrote:
> Yes, there is an IR difference between clang 3.9.1 and clang trunk before
> any IR transforms are done:
> https://godbolt.org/g/FuBqIb
>
> We can't solve this problem (moving a trunc ahead of other vector ops) in
> general in IR because we take a conservative approach to vector transforms
> in IR. That means the burden for solving the general problem falls on the
> front-end or the back-end. If you can bisect to find the clang commit where
> this changed, that would be very helpful.
>
> However, I think we can handle a very specific case (a too fat splat) in
> IR in instcombine, and it will resolve this exact example. This will take a
> couple of patches to restore your example. Here's a proposal for the first
> one:
> https://reviews.llvm.org/D30123
>
>
> On Sat, Feb 18, 2017 at 12:33 AM, Saurabh Verma <
> saurabh.verma at movidius.com> wrote:
>
>> Thanks Sanjay. Interestingly for me, disable-llvm-optzns did not make a
>> difference in the way the shift was handled. Does the initial IR generated
>> for you show this difference when the option is passed?
>>
>> Best regards
>> Saurabh
>>
>>
>> On 17 February 2017 at 19:03, Sanjay Patel <spatel at rotateright.com>
>> wrote:
>>
>>> I think this is caused by a front-end change (cc'ing clang-dev) because
>>> the IR with "-Xclang -disable-llvm-optzns" shows the difference.
>>>
>>> But independently of that, there's a missing IR canonicalization -
>>> instcombine doesn't currently do anything with either version.
>>>
>>> And the version where we trunc later survives through the backend and
>>> produces worse code even for x86 with AVX2:
>>> before:
>>>     vmovd    %edi, %xmm1
>>>     vpmovzxwq    %xmm1, %xmm1
>>>     vpsraw    %xmm1, %xmm0, %xmm0
>>>     retq
>>>
>>> after:
>>>     vmovd    %edi, %xmm1
>>>     vpbroadcastd    %xmm1, %ymm1
>>>     vmovdqa    LCPI1_0(%rip), %ymm2
>>>     vpshufb    %ymm2, %ymm1, %ymm1
>>>     vpermq    $232, %ymm1, %ymm1
>>>     vpmovzxwd    %xmm1, %ymm1
>>>     vpmovsxwd    %xmm0, %ymm0
>>>     vpsravd    %ymm1, %ymm0, %ymm0
>>>     vpshufb    %ymm2, %ymm0, %ymm0
>>>     vpermq    $232, %ymm0, %ymm0
>>>     vzeroupper
>>>
>>>
>>> So this example may have won the bug lottery by exposing all of front-,
>>> middle-, back-end bugs. :)
>>>
>>>
>>>
>>> On Fri, Feb 17, 2017 at 9:38 AM, Saurabh Verma via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Correction in the C snippet:
>>>>
>>>> typedef signed short v8i16_t __attribute__((ext_vector_type(8)));
>>>>
>>>> v8i16_t foo (v8i16_t a, int n)
>>>> {
>>>>    return a >> n;
>>>> }
>>>>
>>>> Best regards
>>>> Saurabh
>>>>
>>>>
>>>>
>>>> On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are investigating a difference in code generation for vector splat
>>>>> instructions between llvm-3.9 and llvm-4.0, which could lead to a
>>>>> performance regression for our target. Here is the C snippet:
>>>>>
>>>>> typedef signed v8i16_t __attribute__((ext_vector_type(8)))
>>>>>
>>>>> v8i16_t foo (v8i16 a, int n)
>>>>> {
>>>>>    return result = a >> n;
>>>>> }
>>>>>
>>>>> With llvm-3.9, the generated sequence does a trunc followed by splat,
>>>>> but with llvm-4.0 it is reversed to a splat to a bigger vector followed by
>>>>> a v8i32->v8i16 trunc. Is this by design? The earlier code sequence is
>>>>> definitely better for our target, but are there known scenarios where the
>>>>> new sequence would lead to better code?
>>>>>
>>>>> Here are the instruction sequences generated in the two cases:
>>>>>
>>>>> With llvm 3.9:
>>>>>
>>>>> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>>>>>   %3 = trunc i32 %1 to i16
>>>>>   %4 = insertelement <8 x i16> undef, i16 %3, i32 0
>>>>>   %5 = shufflevector <8 x i16> %4, <8 x i16> undef, <8 x i32> zeroinitializer
>>>>>   %6 = ashr <8 x i16> %0, %5
>>>>>   ret <8 x i16> %6
>>>>> }
>>>>>
>>>>>
>>>>> With llvm 4.0:
>>>>>
>>>>> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>>>>>   %3 = insertelement <8 x i32> undef, i32 %1, i32 0
>>>>>   %4 = shufflevector <8 x i32> %3, <8 x i32> undef, <8 x i32> zeroinitializer
>>>>>   %5 = trunc <8 x i32> %4 to <8 x i16>
>>>>>   %6 = ashr <8 x i16> %0, %5
>>>>>   ret <8 x i16> %6
>>>>> }
>>>>>
>>>>> Best regards
>>>>> Saurabh Verma
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>
>>
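
The "too fat splat" case mentioned in the quoted discussion can be pictured
as a small tree rewrite: a vector trunc whose operand is a splat of a wide
scalar becomes a splat of the truncated scalar. The following is only a toy
model in C, not LLVM's data structures and not the code in the linked
patches. The names Node, Kind and narrow_splat are hypothetical, and the
in-place update assumes the splat has no other users.

```c
#include <assert.h>
#include <stddef.h>

/* Toy expression kinds. This is a model for illustration only. */
enum Kind { ARG, TRUNC, SPLAT };

struct Node {
    enum Kind kind;
    int bits;          /* element width in bits */
    int lanes;         /* 1 for a scalar, 8 for <8 x iN> */
    struct Node *src;  /* single operand, NULL for ARG */
};

/* Rewrite trunc(splat(x)) into splat(trunc(x)) in place.
   Assumes the splat node has no other users.
   Returns 1 if the pattern matched, 0 otherwise. */
static int narrow_splat(struct Node *t)
{
    assert(t != NULL);
    if (t->kind != TRUNC || t->src == NULL || t->src->kind != SPLAT)
        return 0;

    struct Node *s = t->src;
    int narrow_bits = t->bits;
    int lanes = s->lanes;

    /* The inner node becomes the scalar trunc of x (its operand is kept). */
    s->kind = TRUNC;
    s->bits = narrow_bits;
    s->lanes = 1;

    /* The outer node becomes the splat of the narrow scalar. */
    t->kind = SPLAT;
    t->bits = narrow_bits;
    t->lanes = lanes;
    return 1;
}
```

Building the llvm-4.0 shape (an i32 argument, a splat to <8 x i32>, a trunc
to <8 x i16>) and calling narrow_splat on the trunc node yields the llvm-3.9
shape: a scalar trunc to i16 feeding an <8 x i16> splat. A second call
returns 0 because the pattern no longer matches.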

Akira Hatanaka via llvm-dev

2017-Mar-09 03:28 UTC

[llvm-dev] [cfe-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

There were several patches (r278501 was the first) that fixed vector shift bugs.
I don’t think the IR changes were intentional.

I’m not sure if it’s the right solution, but inserting an integral cast before
the CK_VectorSplat cast in checkVectorShift makes IRGen emit the trunc before
the splat.
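
To illustrate why emitting the trunc before the splat preserves the value, here
is a small scalar model in plain C. It is only a sketch: the lanes are modeled
as arrays of unsigned 16-bit values, truncation is modeled as modular
conversion, and the function names are made up for this example. It does not
show the clang change itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8

/* llvm-3.9 shape: truncate the scalar once, then splat the 16-bit value. */
static void trunc_then_splat(int32_t n, uint16_t out[LANES])
{
    uint16_t narrow = (uint16_t)(uint32_t)n;   /* trunc i32 -> i16 */
    for (int i = 0; i < LANES; ++i)
        out[i] = narrow;                       /* splat to 8 lanes */
}

/* llvm-4.0 shape: splat the 32-bit value, then truncate every lane. */
static void splat_then_trunc(int32_t n, uint16_t out[LANES])
{
    uint32_t wide[LANES];
    for (int i = 0; i < LANES; ++i)
        wide[i] = (uint32_t)n;                 /* splat to <8 x i32> */
    for (int i = 0; i < LANES; ++i)
        out[i] = (uint16_t)wide[i];            /* trunc <8 x i32> -> <8 x i16> */
}

/* Returns 1 when both orderings produce identical lanes for n. */
static int orderings_agree(int32_t n)
{
    uint16_t a[LANES], b[LANES];
    trunc_then_splat(n, a);
    splat_then_trunc(n, b);
    return memcmp(a, b, sizeof a) == 0;
}

/* Asserts agreement for every n in [lo, hi]. */
static void check_range(int32_t lo, int32_t hi)
{
    for (int64_t n = lo; n <= hi; ++n)
        assert(orderings_agree((int32_t)n));
}
```

Both orderings do the same truncation on the same input; the first does it
once on a scalar, the second does it eight times on identical lanes. So the
choice between them is about cost on the target, not about the result.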

Sanjay Patel via llvm-dev

2017-Mar-09 15:26 UTC

[llvm-dev] [cfe-dev] Vector trunc code generation difference between llvm-3.9 and 4.0

Thanks, Akira.

I don't know enough about vectors in the front-end to be much use here.
cc'ing authors/reviewers of some of the patches that might be related:
https://reviews.llvm.org/rL284579
https://reviews.llvm.org/rL281669
https://reviews.llvm.org/rL278501

