Sanjay Patel via llvm-dev
2017-Mar-08 15:21 UTC
[llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0
The regression for the reported case should be avoided after:
https://reviews.llvm.org/rL297232
https://reviews.llvm.org/rL297242
https://reviews.llvm.org/rL297280

It would still be good to understand if the clang change was intentional or if that was a side effect that can be limited.

On Sat, Feb 18, 2017 at 9:11 AM, Sanjay Patel <spatel at rotateright.com> wrote:

> Yes, there is an IR difference between clang 3.9.1 and clang trunk before
> any IR transforms are done:
> https://godbolt.org/g/FuBqIb
>
> We can't solve this problem (moving a trunc ahead of other vector ops) in
> general in IR because we take a conservative approach to vector transforms
> in IR. That means the burden for solving the general problem falls on the
> front-end or the back-end. If you can bisect to find the clang commit where
> this changed, that would be very helpful.
>
> However, I think we can handle a very specific case (a too fat splat) in
> IR in instcombine, and it will resolve this exact example. This will take a
> couple of patches to restore your example. Here's a proposal for the first
> one:
> https://reviews.llvm.org/D30123
>
> On Sat, Feb 18, 2017 at 12:33 AM, Saurabh Verma <saurabh.verma at movidius.com> wrote:
>
>> Thanks Sanjay. Interestingly for me, -disable-llvm-optzns did not make a
>> difference in the way the shift was handled. Does the initial IR generated
>> for you show this difference when the option is passed?
>>
>> Best regards
>> Saurabh
>>
>> On 17 February 2017 at 19:03, Sanjay Patel <spatel at rotateright.com> wrote:
>>
>>> I think this is caused by a front-end change (cc'ing clang-dev) because
>>> the IR with "-Xclang -disable-llvm-optzns" shows the difference.
>>>
>>> But independently of that, there's a missing IR canonicalization -
>>> instcombine doesn't currently do anything with either version.
>>>
>>> And the version where we trunc later survives through the backend and
>>> produces worse code even for x86 with AVX2:
>>>
>>> before:
>>>     vmovd %edi, %xmm1
>>>     vpmovzxwq %xmm1, %xmm1
>>>     vpsraw %xmm1, %xmm0, %xmm0
>>>     retq
>>>
>>> after:
>>>     vmovd %edi, %xmm1
>>>     vpbroadcastd %xmm1, %ymm1
>>>     vmovdqa LCPI1_0(%rip), %ymm2
>>>     vpshufb %ymm2, %ymm1, %ymm1
>>>     vpermq $232, %ymm1, %ymm1
>>>     vpmovzxwd %xmm1, %ymm1
>>>     vpmovsxwd %xmm0, %ymm0
>>>     vpsravd %ymm1, %ymm0, %ymm0
>>>     vpshufb %ymm2, %ymm0, %ymm0
>>>     vpermq $232, %ymm0, %ymm0
>>>     vzeroupper
>>>
>>> So this example may have won the bug lottery by exposing all of front-,
>>> middle-, back-end bugs. :)
>>>
>>> On Fri, Feb 17, 2017 at 9:38 AM, Saurabh Verma via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Correction in the C snippet:
>>>>
>>>> typedef signed short v8i16_t __attribute__((ext_vector_type(8)));
>>>>
>>>> v8i16_t foo (v8i16_t a, int n)
>>>> {
>>>>     return a >> n;
>>>> }
>>>>
>>>> Best regards
>>>> Saurabh
>>>>
>>>> On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are investigating a difference in code generation for vector splat
>>>>> instructions between llvm-3.9 and llvm-4.0, which could lead to a
>>>>> performance regression for our target. Here is the C snippet
>>>>>
>>>>> typedef signed v8i16_t __attribute__((ext_vector_type(8)))
>>>>>
>>>>> v8i16_t foo (v8i16 a, int n)
>>>>> {
>>>>>     return result = a >> n;
>>>>> }
>>>>>
>>>>> With llvm-3.9, the generated sequence does a trunc followed by splat,
>>>>> but with llvm-4.0 it is reversed to a splat to a bigger vector followed by
>>>>> a v8i32->v8i16 trunc. Is this by design? The earlier code sequence is
>>>>> definitely better for our target, but are there known scenarios where the
>>>>> new sequence would lead to better code?
>>>>>
>>>>> Here are the instruction sequences generated in the two cases:
>>>>>
>>>>> With llvm 3.9:
>>>>>
>>>>> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>>>>>   %3 = trunc i32 %1 to i16
>>>>>   %4 = insertelement <8 x i16> undef, i16 %3, i32 0
>>>>>   %5 = shufflevector <8 x i16> %4, <8 x i16> undef, <8 x i32> zeroinitializer
>>>>>   %6 = ashr <8 x i16> %0, %5
>>>>>   ret <8 x i16> %6
>>>>> }
>>>>>
>>>>> With llvm 4.0:
>>>>>
>>>>> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>>>>>   %3 = insertelement <8 x i32> undef, i32 %1, i32 0
>>>>>   %4 = shufflevector <8 x i32> %3, <8 x i32> undef, <8 x i32> zeroinitializer
>>>>>   %5 = trunc <8 x i32> %4 to <8 x i16>
>>>>>   %6 = ashr <8 x i16> %0, %5
>>>>>   ret <8 x i16> %6
>>>>> }
>>>>>
>>>>> Best regards
>>>>> Saurabh Verma
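
A minimal sketch of why the two IR sequences quoted above compute the same shift amount, which is the property a narrow "too fat splat" fold would rely on: truncating the scalar and then splatting it yields the same lanes as splatting the wide scalar and truncating every lane. This is plain C that models the 8 lanes with arrays rather than real vector types, and uses unsigned lanes so the truncation is fully defined in C. The helper names (trunc_scalar, trunc_then_splat, splat_then_trunc, orders_agree, check_samples) are hypothetical and are not LLVM or clang APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8

/* Models the scalar `trunc i32 to i16`: keep only the low 16 bits. */
static uint16_t trunc_scalar(uint32_t x) { return (uint16_t)(x & 0xFFFFu); }

/* llvm-3.9 order: truncate the scalar once, then splat it to 8 x i16. */
static void trunc_then_splat(uint32_t n, uint16_t out[LANES]) {
    uint16_t narrow = trunc_scalar(n);
    for (int i = 0; i < LANES; ++i) out[i] = narrow;
}

/* llvm-4.0 order: splat to 8 x i32, then truncate every lane to i16. */
static void splat_then_trunc(uint32_t n, uint16_t out[LANES]) {
    uint32_t wide[LANES];
    for (int i = 0; i < LANES; ++i) wide[i] = n;
    for (int i = 0; i < LANES; ++i) out[i] = trunc_scalar(wide[i]);
}

/* Returns 1 when both orders produce identical lanes for n, else 0. */
static int orders_agree(uint32_t n) {
    uint16_t a[LANES], b[LANES];
    trunc_then_splat(n, a);
    splat_then_trunc(n, b);
    return memcmp(a, b, sizeof a) == 0;
}

/* Spot-checks a few values, including ones whose high 16 bits are set. */
static void check_samples(void) {
    const uint32_t samples[] = {0u, 3u, 15u, 0xFFFFu, 0x10000u, 0x12345678u};
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(orders_agree(samples[i]));
}
```

Calling check_samples() from a test harness exercises both orders. The model only shows that the results match; it says nothing about cost, which is target-dependent and is the actual subject of this thread.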
Akira Hatanaka via llvm-dev
2017-Mar-09 03:28 UTC
[llvm-dev] [cfe-dev] Vector trunc code generation difference between llvm-3.9 and 4.0
There were several patches (r278501 was the first) that fixed vector shift bugs. I don’t think the IR changes were intentional.

I’m not sure if it’s the right solution, but inserting an integral cast before the CK_VectorSplat cast in checkVectorShift makes IRGen emit the trunc before the splat.

> On Mar 8, 2017, at 7:21 AM, Sanjay Patel via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>
> The regression for the reported case should be avoided after:
> https://reviews.llvm.org/rL297232
> https://reviews.llvm.org/rL297242
> https://reviews.llvm.org/rL297280
>
> It would still be good to understand if the clang change was intentional or if that was a side effect that can be limited.
Sanjay Patel via llvm-dev
2017-Mar-09 15:26 UTC
[llvm-dev] [cfe-dev] Vector trunc code generation difference between llvm-3.9 and 4.0
Thanks, Akira. I don't know enough about vectors in the front-end to be much use here.

cc'ing authors/reviewers of some of the patches that might be related:
https://reviews.llvm.org/rL284579
https://reviews.llvm.org/rL281669
https://reviews.llvm.org/rL278501

On Wed, Mar 8, 2017 at 8:28 PM, Akira Hatanaka <ahatanaka at apple.com> wrote:

> There were several patches (r278501 was the first) that fixed vector shift
> bugs. I don’t think the IR changes were intentional.
>
> I’m not sure if it’s the right solution, but inserting an integral cast
> before the CK_VectorSplat cast in checkVectorShift makes IRGen emit the
> trunc before the splat.