Displaying 13 results from an estimated 13 matches for "vpbroadcastd".
2016 Jun 23
2
AVX512 instruction generated when JIT compiling for an avx2 architecture
...$2, %rcx
movq (%r10), %r8
movq 8(%r10), %r10
movq %r8, %rdi
shrq $32, %rdi
movq %r10, %rsi
shrq $32, %rsi
movq %rax, %rdx
shlq $6, %rdx
leaq 48(%rdx,%r9), %rdx
.align 16, 0x90
.LBB0_1:
vmovd %r8d, %xmm0
vpbroadcastd %xmm0, %xmm0
vmovd %edi, %xmm1
vpbroadcastd %xmm1, %xmm1
vmovd %r10d, %xmm2
vpbroadcastd %xmm2, %xmm2
vmovd %esi, %xmm3
vpbroadcastd %xmm3, %xmm3
vmovdqa32 %xmm0, -48(%rdx)
vmovdqa32 %xmm1, -32(%rdx)
vmovdqa32 %xmm2, -16(%rd...
2016 Jun 23
2
AVX512 instruction generated when JIT compiling for an avx2 architecture
..., %rdi
> shrq $32, %rdi
> movq %r10, %rsi
> shrq $32, %rsi
> movq %rax, %rdx
> shlq $6, %rdx
> leaq 48(%rdx,%r9), %rdx
> .align 16, 0x90
> .LBB0_1:
> vmovd %r8d, %xmm0
> vpbroadcastd %xmm0, %xmm0
> vmovd %edi, %xmm1
> vpbroadcastd %xmm1, %xmm1
> vmovd %r10d, %xmm2
> vpbroadcastd %xmm2, %xmm2
> vmovd %esi, %xmm3
> vpbroadcastd %xmm3, %xmm3
> vmovdqa32 %xmm0, -48(%rdx)
>...
2017 Feb 18
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...e version where we trunc later survives through the backend and
> produces worse code even for x86 with AVX2:
> before:
> vmovd %edi, %xmm1
> vpmovzxwq %xmm1, %xmm1
> vpsraw %xmm1, %xmm0, %xmm0
> retq
>
> after:
> vmovd %edi, %xmm1
> vpbroadcastd %xmm1, %ymm1
> vmovdqa LCPI1_0(%rip), %ymm2
> vpshufb %ymm2, %ymm1, %ymm1
> vpermq $232, %ymm1, %ymm1
> vpmovzxwd %xmm1, %ymm1
> vpmovsxwd %xmm0, %ymm0
> vpsravd %ymm1, %ymm0, %ymm0
> vpshufb %ymm2, %ymm0, %ymm0
> vperm...
2017 Feb 17
2
Vector trunc code generation difference between llvm-3.9 and 4.0
Correction in the C snippet:
typedef signed short v8i16_t __attribute__((ext_vector_type(8)));
v8i16_t foo (v8i16_t a, int n)
{
return a >> n;
}
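For reference, here is a scalar sketch of what `a >> n` means for this vector type: the single count n is applied to every one of the eight 16-bit lanes, which is why the backend has to splat n before shifting. The helper name shift_lanes and its array-based signature are illustrative only and do not come from the thread. The sketch assumes `>>` on a negative signed value is an arithmetic shift, which holds for GCC and Clang but is implementation-defined in ISO C.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 8 };

/* Scalar reference for `a >> n` on eight signed 16-bit lanes.
 * Hypothetical helper, not taken from the thread. */
static void shift_lanes(const int16_t in[LANES], int n, int16_t out[LANES])
{
    /* A 16-bit lane can only be shifted meaningfully by 0..15 bits. */
    assert(n >= 0 && n < 16);

    for (int i = 0; i < LANES; ++i) {
        /* The same count n is used for every lane (the "splat").
         * Assumes arithmetic right shift for negative values. */
        out[i] = (int16_t)(in[i] >> n);
    }
}
```

Comparing a vectorized build against such a loop is one way to check that both code sequences quoted in this thread compute the same result.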
Best regards
Saurabh
On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
wrote:
> Hello,
>
> We are investigating a difference in code generation for vector splat
> instructions between llvm-3.9
2017 Mar 08
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...duces worse code even for x86 with AVX2:
>>> before:
>>> vmovd %edi, %xmm1
>>> vpmovzxwq %xmm1, %xmm1
>>> vpsraw %xmm1, %xmm0, %xmm0
>>> retq
>>>
>>> after:
>>> vmovd %edi, %xmm1
>>> vpbroadcastd %xmm1, %ymm1
>>> vmovdqa LCPI1_0(%rip), %ymm2
>>> vpshufb %ymm2, %ymm1, %ymm1
>>> vpermq $232, %ymm1, %ymm1
>>> vpmovzxwd %xmm1, %ymm1
>>> vpmovsxwd %xmm0, %ymm0
>>> vpsravd %ymm1, %ymm0, %ymm0
>>...
2017 Oct 11
1
[PATCH v1 01/27] x86/crypto: Adapt assembly for PIE support
...arch/x86/crypto/camellia-aesni-avx2-asm_64.S
index eee5b3982cfd..93da327fec83 100644
--- a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
@@ -69,12 +69,12 @@
/* \
* S-function with AES subbytes \
*/ \
- vbroadcasti128 .Linv_shift_row, t4; \
- vpbroadcastd .L0f0f0f0f, t7; \
- vbroadcasti128 .Lpre_tf_lo_s1, t5; \
- vbroadcasti128 .Lpre_tf_hi_s1, t6; \
- vbroadcasti128 .Lpre_tf_lo_s4, t2; \
- vbroadcasti128 .Lpre_tf_hi_s4, t3; \
+ vbroadcasti128 .Linv_shift_row(%rip), t4; \
+ vpbroadcastd .L0f0f0f0f(%rip), t7; \
+ vbroadcasti128 .Lpre_tf_lo_s1(%rip), t...
2018 Mar 13
32
[PATCH v2 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v2:
- Adapt patch to work post KPTI and compiler changes
- Redo all performance testing with latest configs and compilers
- Simplify mov macro on PIE (MOVABS now)
- Reduce GOT footprint
- patch v1:
- Simplify ftrace implementation.
- Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
- rfc v3:
- Use --emit-relocs instead of -pie to reduce
2017 Oct 04
28
x86: PIE support and option to extend KASLR randomization
These patches make the changes necessary to build the kernel as Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
the top 2G of the virtual address space. This optionally allows extending the
KASLR randomization range from 1G to 3G.
Thanks a lot to Ard Biesheuvel & Kees Cook for their feedback on compiler
changes, PIE support and KASLR in general. Thanks to
2018 May 23
33
[PATCH v3 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v3:
- Update on message to describe longer term PIE goal.
- Minor change on ftrace if condition.
- Changed code using xchgq.
- patch v2:
- Adapt patch to work post KPTI and compiler changes
- Redo all performance testing with latest configs and compilers
- Simplify mov macro on PIE (MOVABS now)
- Reduce GOT footprint
- patch v1:
- Simplify ftrace
2017 Oct 11
32
[PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v1:
- Simplify ftrace implementation.
- Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
- rfc v3:
- Use --emit-relocs instead of -pie to reduce dynamic relocation space on
mapped memory. It also simplifies the relocation process.
- Move the start of the module section next to the kernel. Remove the need for
-mcmodel=large on modules. Extends