Andrea Di Biagio
2014-Sep-19 20:22 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

I have tested the new shuffle lowering on an AMD Jaguar CPU (which is AVX but not AVX2). On this particular target, there is a delay when output data from an execution unit is used as input to an execution unit of a different cluster. For example, there are 6 execution units, divided into 3 execution clusters: Float (FPM, FPA), Vector Integer (MMXA, MMXB, IMM), and Store (STC). Moving data between clusters costs an additional 1-cycle latency penalty.

Your new shuffle lowering algorithm is very good at keeping the computation inside clusters. This is an improvement with respect to the "old" shuffle lowering algorithm.

I haven't observed any significant regression in our internal codebase. In one particular case I observed a slowdown (around 1%); here is what I found when investigating it.

1. With the new shuffle lowering, there is one case where we end up
   producing the following sequence:

     vmovss .LCPxx(%rip), %xmm1
     vxorps %xmm0, %xmm0, %xmm0
     vblendps $1, %xmm1, %xmm0, %xmm0

   Before, we used to generate the simpler:

     vmovss .LCPxx(%rip), %xmm1

   In this particular case the 'vblendps' is redundant, since the
   vmovss already zeroes the upper bits of %xmm1. I am not sure why we
   get this poor codegen with your new shuffle lowering. I will
   investigate this bug further (maybe we no longer trigger some ISel
   patterns?) and I will try to give you a small reproducer for this
   particular case.

2. There are cases where we no longer fold a vector load into one of
   the operands of a shuffle. This is an example:

     vmovaps 320(%rsp), %xmm0
     vshufps $-27, %xmm0, %xmm0, %xmm0  # %xmm0 = %xmm0[1,1,2,3]

   Before, we used to emit the following sequence:

     # 16-byte Folded reload.
     vpshufd $1, 320(%rsp), %xmm0       # %xmm0 = mem[1,0,0,0]

   Note: the reason why the shuffle masks are different but still
   valid is that the upper bits of %xmm0 are unused. Later on, the
   code uses %xmm0 in a 'vcvtss2sd' instruction, so only the lower
   32 bits of %xmm0 are meaningful in this context. As for 1, I'll try
   to create a small reproducer.

3. When zero-extending 2 packed 32-bit integers, we should try to emit
   a single vpmovzxdq (see the sketch in the P.S. below). Example:

     vmovq 20(%rbx), %xmm0
     vpshufd $80, %xmm0, %xmm0  # %xmm0 = %xmm0[0,0,1,1]

   Before:

     vpmovzxdq 20(%rbx), %xmm0

4. We no longer emit a simple 'vmovq' in the following case:

     vxorpd %xmm4, %xmm4, %xmm4
     vblendpd $2, %xmm4, %xmm2, %xmm4  # %xmm4 = %xmm2[0],%xmm4[1]

   Before, we used to generate:

     vmovq %xmm2, %xmm4

   The vmovq implicitly zero-extended the quadword in %xmm2 to 128
   bits; now we always do this with a vxorpd+vblendpd pair.

As I said, I will try to create a small reproducer for each of the problems I found.

I hope this helps. I will keep testing.

Thanks,
Andrea
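P.S. As a rough sketch, a reproducer for case 3 might look something like the following C (using Clang's vector extensions; the type and function names are illustrative, not from an actual test case):

    typedef unsigned int v2u32 __attribute__((vector_size(8)));
    typedef unsigned long long v2u64 __attribute__((vector_size(16)));

    /* Zero-extend two packed 32-bit integers loaded from memory.
       Ideally this lowers to a single folded-load vpmovzxdq rather
       than a vmovq followed by a vpshufd. */
    v2u64 zext_v2u32(const v2u32 *p) {
        v2u32 v = *p;
        return (v2u64){ v[0], v[1] };  /* each lane zero-extends to 64 bits */
    }

Whether the backend recognizes this particular pattern depends on how the zero-extension is modelled, so treat it only as a starting point.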
Chandler Carruth
2014-Sep-19 20:39 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
FWIW, I agree with Quentin. I'm actually confident I have test cases covering all of these.

On Fri, Sep 19, 2014 at 1:22 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> 3. When zero-extending 2 packed 32-bit integers, we should try to emit
>    a single vpmovzxdq. Example:
>
>      vmovq 20(%rbx), %xmm0
>      vpshufd $80, %xmm0, %xmm0  # %xmm0 = %xmm0[0,0,1,1]
>
>    Before:
>
>      vpmovzxdq 20(%rbx), %xmm0

This one is already fixed.
Andrea Di Biagio
2014-Sep-19 20:44 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Fri, Sep 19, 2014 at 9:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> FWIW, I agree with Quentin. I'm actually confident I have test cases
> covering all of these.

Ok then. In that case, I will keep investigating to see if I can find something more.

Cheers,
Andrea
Simon Pilgrim
2014-Sep-20 14:12 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On 19 Sep 2014, at 21:22, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> 2. There are cases where we no longer fold a vector load into one of
>    the operands of a shuffle. This is an example:
>
>      vmovaps 320(%rsp), %xmm0
>      vshufps $-27, %xmm0, %xmm0, %xmm0  # %xmm0 = %xmm0[1,1,2,3]
>
>    Before, we used to emit the following sequence:
>
>      # 16-byte Folded reload.
>      vpshufd $1, 320(%rsp), %xmm0       # %xmm0 = mem[1,0,0,0]
>
>    Note: the reason why the shuffle masks are different but still
>    valid is that the upper bits of %xmm0 are unused. Later on, the
>    code uses %xmm0 in a 'vcvtss2sd' instruction, so only the lower
>    32 bits of %xmm0 are meaningful in this context. As for 1, I'll
>    try to create a small reproducer.

Hi Andrea / Chandler / Quentin,

If AVX is available, I would expect the vpermilps/vpermilpd instructions to be used for all float/double single-vector shuffles, especially as they can deal with the folded-load case as well; this would avoid the integer/float execution-domain transfer issue of using vpshufd.

Thanks,
Simon.
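P.S. To illustrate the idea (a hypothetical sketch, not code from this thread): the shuffle from case 2 above can be written with the AVX _mm_permute_ps intrinsic, which maps to vpermilps and can take its source operand straight from memory:

    #include <immintrin.h>

    /* Shuffle mask 0xE5 = [1,1,2,3], the same mask as the
       vshufps $-27 in the example above. vpermilps stays in the
       floating-point domain and lets the load fold into the shuffle. */
    __m128 shuffle_1123(const __m128 *p) {
        return _mm_permute_ps(*p, 0xE5);  /* vpermilps $0xE5, (%rdi), %xmm0 */
    }

Whether the load actually gets folded depends on alignment and on the selection patterns, of course.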
Chandler Carruth
2014-Sep-20 18:44 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Sat, Sep 20, 2014 at 7:12 AM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
> Hi Andrea / Chandler / Quentin,
>
> If AVX is available, I would expect the vpermilps/vpermilpd
> instructions to be used for all float/double single-vector shuffles,
> especially as they can deal with the folded-load case as well; this
> would avoid the integer/float execution-domain transfer issue of
> using vpshufd.

Yes, this is the obvious solution to folding memory loads. It just isn't implemented yet. Well, actually it is, but I haven't finished writing tests for it. =]