Romanova, Katya
2013-Apr-09 02:20 UTC
[LLVMdev] inefficient code generation for 128-bit->256-bit typecast intrinsics
Hello,

LLVM generates two additional instructions for 128-bit->256-bit typecasts (e.g. _mm256_castsi128_si256()) to clear out the upper 128 bits of the YMM register corresponding to the source XMM register:

    vxorps      xmm2,xmm2,xmm2
    vinsertf128 ymm0,ymm2,xmm0,0x0

Most of the industry-standard C/C++ compilers (GCC, Intel's compiler, the Visual Studio compiler) don't generate any extra moves for 128-bit->256-bit typecast intrinsics, and none of them zero-extend the upper 128 bits of the 256-bit YMM register. Intel's documentation for the _mm256_castsi128_si256 intrinsic explicitly states that "the upper bits of the resulting vector are undefined" and that "this intrinsic does not introduce extra moves to the generated code".

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_avx_castsi128_si256.htm

Clang implements these typecast intrinsics differently. Is this intentional? I suspect it was done to avoid the hardware penalty caused by partial register writes. But isn't the overall cost of two additional instructions (vxor + vinsertf128) for *every* 128-bit->256-bit typecast intrinsic higher than the hardware penalty caused by partial register writes in the *rare* cases when the upper part of the YMM register corresponding to the source XMM register is not already cleared?

Thanks!
Katya.
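For reference, here is a minimal sketch (my own illustrative snippet, not part of the original message) that exercises the cast in question; compiling it with -mavx and inspecting the generated assembly shows whether the compiler emits the extra vxorps/vinsertf128 pair described above. The function name widen128 is just a placeholder:

    #include <immintrin.h>

    /* Widen a 128-bit integer vector to 256 bits via the cast intrinsic.
       Per Intel's documentation the upper 128 bits of the result are
       undefined, so the cast itself should ideally cost no instructions. */
    __m256i widen128(__m128i v)
    {
        return _mm256_castsi128_si256(v);
    }

A caller that actually relies on the upper lane being zero would have to clear it explicitly (for example by inserting a zeroed 128-bit vector into the upper lane with _mm256_insertf128_si256), which is exactly the work the quoted vxorps/vinsertf128 sequence performs unconditionally.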
Nadav Rotem
2013-Apr-09 05:05 UTC
[LLVMdev] inefficient code generation for 128-bit->256-bit typecast intrinsics
Hi Katya,

Can you please open a bugzilla bug report (llvm.org/bugs)?

Thanks,
Nadav