Saurabh Verma via llvm-dev
2017-Feb-17 16:38 UTC
[llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0
Correction in the C snippet:

typedef signed short v8i16_t __attribute__((ext_vector_type(8)));

v8i16_t foo (v8i16_t a, int n)
{
  return a >> n;
}

Best regards
Saurabh

On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com> wrote:
> Hello,
>
> We are investigating a difference in code generation for vector splat
> instructions between llvm-3.9 and llvm-4.0, which could lead to a
> performance regression for our target. Here is the C snippet
>
> typedef signed v8i16_t __attribute__((ext_vector_type(8)))
>
> v8i16_t foo (v8i16 a, int n)
> {
>   return result = a >> n;
> }
>
> With llvm-3.9, the generated sequence does a trunc followed by splat, but
> with llvm-4.0 it is reversed to a splat to a bigger vector followed by a
> v8i32->v8i16 trunc. Is this by design? The earlier code sequence is
> definitely better for our target, but are there known scenarios where the
> new sequence would lead to better code?
>
> Here are the instruction sequences generated in the two cases:
>
> With llvm 3.9:
>
> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>   %3 = trunc i32 %1 to i16
>   %4 = insertelement <8 x i16> undef, i16 %3, i32 0
>   %5 = shufflevector <8 x i16> %4, <8 x i16> undef, <8 x i32> zeroinitializer
>   %6 = ashr <8 x i16> %0, %5
>   ret <8 x i16> %6
> }
>
> With llvm 4.0:
>
> define <8 x i16> @foo(<8 x i16>, i32) #0 {
>   %3 = insertelement <8 x i32> undef, i32 %1, i32 0
>   %4 = shufflevector <8 x i32> %3, <8 x i32> undef, <8 x i32> zeroinitializer
>   %5 = trunc <8 x i32> %4 to <8 x i16>
>   %6 = ashr <8 x i16> %0, %5
>   ret <8 x i16> %6
> }
>
> Best regards
> Saurabh Verma
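For readers following the thread, here is a minimal sketch in plain C of the two orderings described in the question. It is only a model: the vectors are 8-element arrays rather than ext_vector_type values, and the names (trunc_then_splat, splat_then_trunc, same_lanes, check_orders_agree) are illustrative, not taken from LLVM or from the posts. Lanes are kept unsigned so that the narrowing conversion is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* llvm-3.9 order: truncate the scalar once, then splat the narrow value. */
void trunc_then_splat(int32_t n, uint16_t out[LANES]) {
    uint16_t narrow = (uint16_t)n;       /* models: trunc i32 to i16 */
    for (int i = 0; i < LANES; ++i)
        out[i] = narrow;                 /* models: insertelement + shufflevector */
}

/* llvm-4.0 order: splat the wide scalar, then truncate every lane. */
void splat_then_trunc(int32_t n, uint16_t out[LANES]) {
    uint32_t wide[LANES];
    for (int i = 0; i < LANES; ++i)
        wide[i] = (uint32_t)n;           /* models: splat into <8 x i32> */
    for (int i = 0; i < LANES; ++i)
        out[i] = (uint16_t)wide[i];      /* models: trunc <8 x i32> to <8 x i16> */
}

/* Returns 1 when both orders produce identical lanes for n, else 0. */
int same_lanes(int32_t n) {
    uint16_t a[LANES], b[LANES];
    trunc_then_splat(n, a);
    splat_then_trunc(n, b);
    for (int i = 0; i < LANES; ++i)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* Sweeps a range of scalars, including values that do not fit in 16 bits. */
void check_orders_agree(void) {
    for (int32_t n = -70000; n <= 70000; n += 7)
        assert(same_lanes(n));
}
```

In this model the first form narrows one scalar, while the second narrows eight lanes of a wider vector. That is the extra work the question is concerned about, even though the resulting shift-amount vector is the same.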
Sanjay Patel via llvm-dev
2017-Feb-17 19:03 UTC
[llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0
I think this is caused by a front-end change (cc'ing clang-dev) because the
IR with "-Xclang -disable-llvm-optzns" shows the difference.

But independently of that, there's a missing IR canonicalization -
instcombine doesn't currently do anything with either version.

And the version where we trunc later survives through the backend and
produces worse code even for x86 with AVX2:

before:
    vmovd       %edi, %xmm1
    vpmovzxwq   %xmm1, %xmm1
    vpsraw      %xmm1, %xmm0, %xmm0
    retq

after:
    vmovd       %edi, %xmm1
    vpbroadcastd %xmm1, %ymm1
    vmovdqa     LCPI1_0(%rip), %ymm2
    vpshufb     %ymm2, %ymm1, %ymm1
    vpermq      $232, %ymm1, %ymm1
    vpmovzxwd   %xmm1, %ymm1
    vpmovsxwd   %xmm0, %ymm0
    vpsravd     %ymm1, %ymm0, %ymm0
    vpshufb     %ymm2, %ymm0, %ymm0
    vpermq      $232, %ymm0, %ymm0
    vzeroupper

So this example may have won the bug lottery by exposing all of front-,
middle-, back-end bugs. :)
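A canonicalization of the kind mentioned above would depend on both forms of foo computing the same result. The following is a rough sketch in portable C that models the whole function both ways and can be used to compare them lane by lane. The helper names (ashr16, foo_trunc_first, foo_trunc_last) are hypothetical, and the model only covers shift amounts 0..15, since larger amounts do not produce a defined result for a 16-bit ashr in LLVM IR.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Portable arithmetic shift right of one 16-bit lane; requires s < 16. */
int16_t ashr16(int16_t x, unsigned s) {
    assert(s < 16);
    if (x >= 0)
        return (int16_t)(x >> s);
    /* For negative x, shift the (non-negative) complement and flip back. */
    return (int16_t)~(~x >> s);
}

/* "before" form: narrow the scalar once, splat it, then shift each lane. */
void foo_trunc_first(const int16_t a[LANES], int32_t n, int16_t out[LANES]) {
    uint16_t amt = (uint16_t)n;              /* models: trunc i32 to i16 */
    for (int i = 0; i < LANES; ++i)
        out[i] = ashr16(a[i], amt);          /* models: splat + ashr */
}

/* "after" form: splat the wide scalar, narrow each lane, then shift. */
void foo_trunc_last(const int16_t a[LANES], int32_t n, int16_t out[LANES]) {
    uint32_t wide[LANES];
    for (int i = 0; i < LANES; ++i)
        wide[i] = (uint32_t)n;               /* models: splat into <8 x i32> */
    for (int i = 0; i < LANES; ++i)
        out[i] = ashr16(a[i], (uint16_t)wide[i]); /* models: vector trunc + ashr */
}
```

Calling both functions with the same input array and the same n in the range 0..15 and comparing the outputs is enough to exercise the model. It says nothing about what either LLVM release actually emits for a given target.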
Saurabh Verma via llvm-dev
2017-Feb-18 07:33 UTC
[llvm-dev] Vector trunc code generation difference between llvm-3.9 and 4.0
Thanks Sanjay. Interestingly, for me -disable-llvm-optzns did not make a
difference in the way the shift was handled. Does the initial IR generated
for you show this difference when the option is passed?

Best regards
Saurabh