Displaying 17 results from an estimated 22 matches for "vmovd".
2016 Jun 23
2
AVX512 instruction generated when JIT compiling for an avx2 architecture
...ddq %r8, %rcx
sarq $2, %rcx
movq (%r10), %r8
movq 8(%r10), %r10
movq %r8, %rdi
shrq $32, %rdi
movq %r10, %rsi
shrq $32, %rsi
movq %rax, %rdx
shlq $6, %rdx
leaq 48(%rdx,%r9), %rdx
.align 16, 0x90
.LBB0_1:
vmovd %r8d, %xmm0
vpbroadcastd %xmm0, %xmm0
vmovd %edi, %xmm1
vpbroadcastd %xmm1, %xmm1
vmovd %r10d, %xmm2
vpbroadcastd %xmm2, %xmm2
vmovd %esi, %xmm3
vpbroadcastd %xmm3, %xmm3
vmovdqa32 %xmm0, -48(%rdx)
vmovdqa32 %xmm1, -32(%rdx)...
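Note that vmovdqa32 is an AVX-512F/VL encoding; with only AVX2 enabled the same aligned stores should come out as plain vmovdqa. A minimal C intrinsics sketch of the loop body above (not the original JIT source; the function name, arguments, and layout are assumed):

#include <immintrin.h>
#include <stdint.h>

/* Sketch only: broadcast the 32-bit halves of two 64-bit values and store
 * each 128-bit splat, roughly the pattern in the loop above. Built with
 * -mavx2 (and no AVX-512 features), the broadcasts should lower to
 * vmovd + vpbroadcastd and the stores to vmovdqa, not vmovdqa32. */
void splat_store(uint64_t a, uint64_t b, int32_t *out /* 16-byte aligned */)
{
    __m128i v0 = _mm_set1_epi32((int32_t)a);          /* vmovd + vpbroadcastd */
    __m128i v1 = _mm_set1_epi32((int32_t)(a >> 32));
    __m128i v2 = _mm_set1_epi32((int32_t)b);
    __m128i v3 = _mm_set1_epi32((int32_t)(b >> 32));
    _mm_store_si128((__m128i *)(out +  0), v0);
    _mm_store_si128((__m128i *)(out +  4), v1);
    _mm_store_si128((__m128i *)(out +  8), v2);
    _mm_store_si128((__m128i *)(out + 12), v3);
}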
2016 Jun 23
2
AVX512 instruction generated when JIT compiling for an avx2 architecture
...10), %r10
> movq %r8, %rdi
> shrq $32, %rdi
> movq %r10, %rsi
> shrq $32, %rsi
> movq %rax, %rdx
> shlq $6, %rdx
> leaq 48(%rdx,%r9), %rdx
> .align 16, 0x90
> .LBB0_1:
> vmovd %r8d, %xmm0
> vpbroadcastd %xmm0, %xmm0
> vmovd %edi, %xmm1
> vpbroadcastd %xmm1, %xmm1
> vmovd %r10d, %xmm2
> vpbroadcastd %xmm2, %xmm2
> vmovd %esi, %xmm3
> vpbroadcastd %xmm3, %xmm3
>...
2017 Oct 11
1
[PATCH v1 01/27] x86/crypto: Adapt assembly for PIE support
...h/x86/crypto/camellia-aesni-avx-asm_64.S b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
index f7c495e2863c..46feaea52632 100644
--- a/arch/x86/crypto/camellia-aesni-avx-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
@@ -52,10 +52,10 @@
/* \
* S-function with AES subbytes \
*/ \
- vmovdqa .Linv_shift_row, t4; \
- vbroadcastss .L0f0f0f0f, t7; \
- vmovdqa .Lpre_tf_lo_s1, t0; \
- vmovdqa .Lpre_tf_hi_s1, t1; \
+ vmovdqa .Linv_shift_row(%rip), t4; \
+ vbroadcastss .L0f0f0f0f(%rip), t7; \
+ vmovdqa .Lpre_tf_lo_s1(%rip), t0; \
+ vmovdqa .Lpre_tf_hi_s1(%rip), t1; \
\
/* AES inverse sh...
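For context, the reason for the change: a PIE binary cannot rely on link-time absolute addresses, so constant-pool references must be made relative to %rip. A minimal, hypothetical C illustration (not kernel code; the names here are made up):

/* Hypothetical example: with -fpie the compiler addresses 'masks' relative
 * to %rip, which is what the .L*(%rip) operands above express in assembly. */
static const unsigned int masks[4] __attribute__((aligned(16))) = {
    0x0f0f0f0f, 0x33333333, 0x55555555, 0xff00ff00
};

unsigned int mask_word(unsigned int i)
{
    return masks[i & 3];   /* emits a masks(%rip)-based access under -fpie */
}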
2017 Feb 18
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...> But independently of that, there's a missing IR canonicalization -
> instcombine doesn't currently do anything with either version.
>
> And the version where we trunc later survives through the backend and
> produces worse code even for x86 with AVX2:
> before:
> vmovd %edi, %xmm1
> vpmovzxwq %xmm1, %xmm1
> vpsraw %xmm1, %xmm0, %xmm0
> retq
>
> after:
> vmovd %edi, %xmm1
> vpbroadcastd %xmm1, %ymm1
> vmovdqa LCPI1_0(%rip), %ymm2
> vpshufb %ymm2, %ymm1, %ymm1
> vpermq $232, %ymm1...
2017 Feb 17
2
Vector trunc code generation difference between llvm-3.9 and 4.0
Correction in the C snippet:
typedef signed short v8i16_t __attribute__((ext_vector_type(8)));
v8i16_t foo (v8i16_t a, int n)
{
return a >> n;
}
Best regards
Saurabh
On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
wrote:
> Hello,
>
> We are investigating a difference in code generation for vector splat
> instructions between llvm-3.9
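For anyone wanting to reproduce the comparison, a self-contained version of the snippet; the file name and compiler invocations below are illustrative, not taken from the thread:

/* Build with each compiler and diff the assembly, e.g.:
 *   clang-3.9 -O2 -mavx2 -S splat_shift.c -o before.s
 *   clang-4.0 -O2 -mavx2 -S splat_shift.c -o after.s */
typedef signed short v8i16_t __attribute__((ext_vector_type(8)));

v8i16_t foo(v8i16_t a, int n)
{
    return a >> n;   /* the scalar n is splat across all eight lanes */
}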
2017 Mar 08
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...issing IR canonicalization -
>>> instcombine doesn't currently do anything with either version.
>>>
>>> And the version where we trunc later survives through the backend and
>>> produces worse code even for x86 with AVX2:
>>> before:
>>> vmovd %edi, %xmm1
>>> vpmovzxwq %xmm1, %xmm1
>>> vpsraw %xmm1, %xmm0, %xmm0
>>> retq
>>>
>>> after:
>>> vmovd %edi, %xmm1
>>> vpbroadcastd %xmm1, %ymm1
>>> vmovdqa LCPI1_0(%rip), %ymm2
>>>...
2014 Mar 26
3
[LLVMdev] [cfe-dev] computing a conservatively rounded square of a double
On 03/26/2014 11:36 AM, Geoffrey Irving wrote:
> I am trying to compute conservative lower and upper bounds for the
> square of a double. I have set the rounding mode to FE_UPWARD
> elsewhere, so the code is
>
> struct Interval {
> double nlo, hi;
> };
>
> Interval inspect_singleton_sqr(const double x) {
> Interval s;
> s.nlo = x * -x;
> s.hi = x *
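The snippet is cut off above; presumably the upper bound is s.hi = x * x. A completed sketch under that assumption:

#include <fenv.h>

/* Assumes the rounding mode is FE_UPWARD, as stated in the post. Both
 * products then round toward +infinity, so storing the negated lower bound
 * keeps both bounds conservative:
 *   nlo = upward-rounded -(x*x), hence -nlo <= x*x (a valid lower bound).
 * A real build also needs #pragma STDC FENV_ACCESS ON or -frounding-math
 * to stop the compiler from folding the multiplies at compile time. */
struct Interval { double nlo, hi; };

struct Interval inspect_singleton_sqr(const double x)
{
    struct Interval s;
    s.nlo = x * -x;   /* upper bound of -(x*x) */
    s.hi  = x * x;    /* upper bound of x*x (assumed completion) */
    return s;
}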
2013 Aug 09
2
[LLVMdev] [RFC] Poor code generation for paired load
...e available on the target. Truncate and shift instructions are useless (instructions 2., 4., and 5.).
Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair
** To Reproduce **
Here is a way to reproduce the poor code generation for x86-64.
opt -sroa current_input.ll -S -o - | llc -O3 -o -
You will see 2 vmovd and 1 shrq that could be avoided, as illustrated by the next command.
Here is the nicer code produced by modifying the input so that SROA generates friendlier code for this case.
opt -sroa mod_input.ll -S -o - | llc -O3 -o -
Basically the difference between both inputs is that memcpy has not been e...
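As I read the thread, the pattern in question looks roughly like this in C (a sketch, not the actual current_input.ll):

#include <stdint.h>

/* One 64-bit load whose two 32-bit halves are consumed separately.
 * Current lowering: ldi64 + 2 trunc + 1 shift; the RFC's goal: 1 ldpair. */
uint32_t use_halves(const uint64_t *p)
{
    uint64_t v  = *p;                    /* ldi64 */
    uint32_t lo = (uint32_t)v;           /* trunc */
    uint32_t hi = (uint32_t)(v >> 32);   /* shift + trunc */
    return lo ^ hi;
}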
2013 Aug 12
2
[LLVMdev] [RFC] Poor code generation for paired load
...., and 5.).
>> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair
>>
>>
>> ** To Reproduce **
>>
>> Here is a way to reproduce the poor code generation for x86-64.
>>
>> opt -sroa current_input.ll -S -o - | llc -O3 -o -
>>
>> You will see 2 vmovd and 1 shrq that could be avoided, as illustrated by the
>> next command.
>>
>> Here is the nicer code produced by modifying the input so that SROA generates
>> friendlier code for this case.
>>
>> opt -sroa mod_input.ll -S -o - | llc -O3 -o -
>>
>> Ba...
2019 Jan 22
4
_Float16 support
...ment 0 from single to half
vcvtph2ps xmm0, xmm0 # Convert argument 0 back to single
vmulss xmm0, xmm0, xmm1 # xmm0 = xmm0*xmm1 (single precision)
vcvtps2ph xmm1, xmm0, 4 # Convert the single precision result to half
vmovd eax, xmm1 # Move the half precision result to eax
mov word ptr [rip + x], ax # Store the half precision result in the global, x
ret # Return the single precision result s...
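A plausible C source for that sequence (the function name and exact signature are guesses): _Float16 is storage-only on x86 without AVX512-FP16, so operands are widened to single precision for the arithmetic and the result is narrowed back to half for the store.

_Float16 x;   /* the global "x" the asm stores to */

float mul_store(_Float16 a, float b)
{
    float s = (float)a * b;   /* vcvtph2ps widens a, vmulss multiplies */
    x = (_Float16)s;          /* vcvtps2ph narrows, vmovd/mov store to x */
    return s;                 /* returned in single precision */
}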
2013 Aug 10
0
[LLVMdev] [RFC] Poor code generation for paired load
...ructions
> are useless (instructions 2., 4., and 5.).
> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair
>
>
> ** To Reproduce **
>
> Here is a way to reproduce the poor code generation for x86-64.
>
> opt -sroa current_input.ll -S -o - | llc -O3 -o -
>
> You will see 2 vmovd and 1 shrq that could be avoided, as illustrated by the
> next command.
>
> Here is the nicer code produced by modifying the input so that SROA generates
> friendlier code for this case.
>
> opt -sroa mod_input.ll -S -o - | llc -O3 -o -
>
> Basically the difference between both...
2013 Aug 12
0
[LLVMdev] [RFC] Poor code generation for paired load
...ructions
> are useless (instructions 2., 4., and 5.).
> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair
>
>
> ** To Reproduce **
>
> Here is a way to reproduce the poor code generation for x86-64.
>
> opt -sroa current_input.ll -S -o - | llc -O3 -o -
>
> You will see 2 vmovd and 1 shrq that could be avoided, as illustrated by the
> next command.
>
> Here is the nicer code produced by modifying the input so that SROA generates
> friendlier code for this case.
>
> opt -sroa mod_input.ll -S -o - | llc -O3 -o -
>
> Basically the difference between both...
2018 Mar 13
32
[PATCH v2 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v2:
- Adapt patch to work post KPTI and compiler changes
- Redo all performance testing with latest configs and compilers
- Simplify mov macro on PIE (MOVABS now)
- Reduce GOT footprint
- patch v1:
- Simplify ftrace implementation.
- Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
- rfc v3:
- Use --emit-relocs instead of -pie to reduce
2017 Oct 04
28
x86: PIE support and option to extend KASLR randomization
These patches make the changes necessary to build the kernel as Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
the top 2G of the virtual address space. It allows the KASLR randomization
range to be optionally extended from 1G to 3G.
Thanks a lot to Ard Biesheuvel & Kees Cook for their feedback on compiler
changes, PIE support and KASLR in general. Thanks to
2018 May 23
33
[PATCH v3 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v3:
- Update the message to describe the longer-term PIE goal.
- Minor change on ftrace if condition.
- Changed code using xchgq.
- patch v2:
- Adapt patch to work post KPTI and compiler changes
- Redo all performance testing with latest configs and compilers
- Simplify mov macro on PIE (MOVABS now)
- Reduce GOT footprint
- patch v1:
- Simplify ftrace
2017 Oct 11
32
[PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v1:
- Simplify ftrace implementation.
- Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
- rfc v3:
- Use --emit-relocs instead of -pie to reduce dynamic relocation space on
mapped memory. It also simplifies the relocation process.
- Move the start of the module section next to the kernel. Remove the need for
-mcmodel=large on modules. Extends
2019 Jan 24
2
[cfe-dev] _Float16 support
...o half
> vcvtph2ps xmm0, xmm0 # Convert argument 0 back to single
> vmulss xmm0, xmm0, xmm1 # xmm0 = xmm0*xmm1 (single precision)
> vcvtps2ph xmm1, xmm0, 4 # Convert the single precision result to half
> vmovd eax, xmm1 # Move the half precision result to eax
> mov word ptr [rip + x], ax # Store the half precision result in the global, x
> ret # Return the single precisio...