Romanova, Katya
2013-Apr-09 02:20 UTC
[LLVMdev] inefficient code generation for 128-bit->256-bit typecast intrinsics
Hello,

LLVM generates two additional instructions for 128-bit->256-bit typecasts (e.g. _mm256_castsi128_si256()) to clear out the upper 128 bits of the YMM register corresponding to the source XMM register:

    vxorps      xmm2,xmm2,xmm2
    vinsertf128 ymm0,ymm2,xmm0,0x0

Most of the industry-standard C/C++ compilers (GCC, Intel's compiler, the Visual Studio compiler) don't generate any extra moves for 128-bit->256-bit typecast intrinsics, and none of them zero-extend the upper 128 bits of the 256-bit YMM register. Intel's documentation for the _mm256_castsi128_si256 intrinsic explicitly states that "the upper bits of the resulting vector are undefined" and that "this intrinsic does not introduce extra moves to the generated code".

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011Update/compiler_c/intref_cls/common/intref_avx_castsi128_si256.htm

Clang implements these typecast intrinsics differently. Is this intentional? I suspect it was done to avoid the hardware penalty caused by partial register writes. But isn't the overall cost of two additional instructions (vxor + vinsertf128) for *every* 128-bit->256-bit typecast intrinsic higher than the hardware penalty caused by partial register writes in the *rare* cases when the upper part of the YMM register corresponding to the source XMM register is not already cleared?

Thanks!
Katya.
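For reference, here is a minimal sketch (my own illustrative snippet, not part of the original message) that exercises the cast in question; compiling it with -mavx and inspecting the generated assembly shows whether the compiler emits the extra vxorps/vinsertf128 pair described above. The function name widen128 is just a placeholder:

    #include <immintrin.h>

    /* Widen a 128-bit integer vector to 256 bits via the cast intrinsic.
       Per Intel's documentation the upper 128 bits of the result are
       undefined, so the cast itself should ideally cost no instructions. */
    __m256i widen128(__m128i v)
    {
        return _mm256_castsi128_si256(v);
    }

A caller that actually relies on the upper lane being zero would have to clear it explicitly (for example by inserting a zeroed 128-bit vector into the upper lane with _mm256_insertf128_si256), which is exactly the work the quoted vxorps/vinsertf128 sequence performs unconditionally.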
Nadav Rotem
2013-Apr-09 05:05 UTC
[LLVMdev] inefficient code generation for 128-bit->256-bit typecast intrinsics
Hi Katya,

Can you please open a bugzilla bug report (llvm.org/bugs)?

Thanks,
Nadav