search for: vmovd

Displaying 20 results from an estimated 22 matches for "vmovd".

2016 Jun 23
2
AVX512 instruction generated when JIT compiling for an avx2 architecture
...ddq %r8, %rcx sarq $2, %rcx movq (%r10), %r8 movq 8(%r10), %r10 movq %r8, %rdi shrq $32, %rdi movq %r10, %rsi shrq $32, %rsi movq %rax, %rdx shlq $6, %rdx leaq 48(%rdx,%r9), %rdx .align 16, 0x90 .LBB0_1: vmovd %r8d, %xmm0 vpbroadcastd %xmm0, %xmm0 vmovd %edi, %xmm1 vpbroadcastd %xmm1, %xmm1 vmovd %r10d, %xmm2 vpbroadcastd %xmm2, %xmm2 vmovd %esi, %xmm3 vpbroadcastd %xmm3, %xmm3 vmovdqa32 %xmm0, -48(%rdx) vmovdqa32 %xmm1, -32(%rdx)...
2016 Jun 23
2
AVX512 instruction generated when JIT compiling for an avx2 architecture
...10), %r10 > movq %r8, %rdi > shrq $32, %rdi > movq %r10, %rsi > shrq $32, %rsi > movq %rax, %rdx > shlq $6, %rdx > leaq 48(%rdx,%r9), %rdx > .align 16, 0x90 > .LBB0_1: > vmovd %r8d, %xmm0 > vpbroadcastd %xmm0, %xmm0 > vmovd %edi, %xmm1 > vpbroadcastd %xmm1, %xmm1 > vmovd %r10d, %xmm2 > vpbroadcastd %xmm2, %xmm2 > vmovd %esi, %xmm3 > vpbroadcastd %xmm3, %xmm3 >...
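
The vmovdqa32 in these excerpts is an AVX512 (EVEX) form, which is the thread's complaint: the JIT picked up more ISA extensions than the intended avx2 target. Below is a minimal sketch, assuming MCJIT via llvm::EngineBuilder, of pinning the JIT's CPU and feature set explicitly instead of relying on host detection; the function name and feature strings are illustrative, not taken from the thread.

    #include "llvm/ExecutionEngine/ExecutionEngine.h"
    #include "llvm/ExecutionEngine/MCJIT.h"
    #include "llvm/IR/Module.h"
    #include <memory>
    #include <string>
    #include <vector>

    // Build an MCJIT engine limited to AVX2: request a Haswell-class CPU and
    // explicitly subtract the AVX512 feature so no EVEX-encoded instructions
    // (such as vmovdqa32) can be emitted.
    llvm::ExecutionEngine *makeAvx2Engine(std::unique_ptr<llvm::Module> M) {
        std::vector<std::string> attrs = {"+avx2", "-avx512f"};
        return llvm::EngineBuilder(std::move(M))
                   .setMCPU("core-avx2")
                   .setMAttrs(attrs)
                   .create();
    }
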
2017 Oct 11
1
[PATCH v1 01/27] x86/crypto: Adapt assembly for PIE support
...h/x86/crypto/camellia-aesni-avx-asm_64.S b/arch/x86/crypto/camellia-aesni-avx-asm_64.S index f7c495e2863c..46feaea52632 100644 --- a/arch/x86/crypto/camellia-aesni-avx-asm_64.S +++ b/arch/x86/crypto/camellia-aesni-avx-asm_64.S @@ -52,10 +52,10 @@ /* \ * S-function with AES subbytes \ */ \ - vmovdqa .Linv_shift_row, t4; \ - vbroadcastss .L0f0f0f0f, t7; \ - vmovdqa .Lpre_tf_lo_s1, t0; \ - vmovdqa .Lpre_tf_hi_s1, t1; \ + vmovdqa .Linv_shift_row(%rip), t4; \ + vbroadcastss .L0f0f0f0f(%rip), t7; \ + vmovdqa .Lpre_tf_lo_s1(%rip), t0; \ + vmovdqa .Lpre_tf_hi_s1(%rip), t1; \ \ /* AES inverse sh...
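
The hunk above shows the core substitution the series makes in hand-written assembly: constant loads must use RIP-relative operands so the code still works when the kernel is linked as a PIE. A hypothetical stand-alone illustration (not kernel code; the symbol and function names are made up) of how an "m" constraint yields the same sym(%rip) form under -fpie that the patch adds by hand:

    #include <cstdint>

    // A 16-byte constant table, standing in for .Linv_shift_row and friends.
    alignas(16) static const std::uint8_t lut[16] = {0};

    extern "C" std::uint64_t load_lut_low() {
        std::uint64_t v;
        // With an "m" constraint the compiler picks the addressing mode,
        // emitting lut(%rip) under -fpie/-fpic instead of an absolute
        // address, which is the substitution the patch performs manually
        // in the crypto assembly.
        asm("movq %1, %0" : "=r"(v) : "m"(lut[0]));
        return v;
    }
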
2017 Feb 18
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...> But independently of that, there's a missing IR canonicalization - > instcombine doesn't currently do anything with either version. > > And the version where we trunc later survives through the backend and > produces worse code even for x86 with AVX2: > before: > vmovd %edi, %xmm1 > vpmovzxwq %xmm1, %xmm1 > vpsraw %xmm1, %xmm0, %xmm0 > retq > > after: > vmovd %edi, %xmm1 > vpbroadcastd %xmm1, %ymm1 > vmovdqa LCPI1_0(%rip), %ymm2 > vpshufb %ymm2, %ymm1, %ymm1 > vpermq $232, %ymm1...
2017 Feb 17
2
Vector trunc code generation difference between llvm-3.9 and 4.0
Correction in the C snippet: typedef signed short v8i16_t __attribute__((ext_vector_type(8))); v8i16_t foo (v8i16_t a, int n) { return a >> n; } Best regards Saurabh On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com> wrote: > Hello, > > We are investigating a difference in code generation for vector splat > instructions between llvm-3.9
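
For reference, the corrected snippet above compiles stand-alone; the build flags here are an assumption chosen to match the AVX2 output quoted elsewhere in the thread (e.g. clang -O2 -mavx2 -S):

    // Corrected reproducer from the message above: arithmetically shift every
    // 16-bit lane of the vector right by the scalar amount n.
    typedef signed short v8i16_t __attribute__((ext_vector_type(8)));

    v8i16_t foo(v8i16_t a, int n) {
        return a >> n;
    }
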
2017 Mar 08
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...issing IR canonicalization - >>> instcombine doesn't currently do anything with either version. >>> >>> And the version where we trunc later survives through the backend and >>> produces worse code even for x86 with AVX2: >>> before: >>> vmovd %edi, %xmm1 >>> vpmovzxwq %xmm1, %xmm1 >>> vpsraw %xmm1, %xmm0, %xmm0 >>> retq >>> >>> after: >>> vmovd %edi, %xmm1 >>> vpbroadcastd %xmm1, %ymm1 >>> vmovdqa LCPI1_0(%rip), %ymm2 >&g...
2014 Mar 26
3
[LLVMdev] [cfe-dev] computing a conservatively rounded square of a double
On 03/26/2014 11:36 AM, Geoffrey Irving wrote: > I am trying to compute conservative lower and upper bounds for the > square of a double. I have set the rounding mode to FE_UPWARDS > elsewhere, so the code is > > struct Interval { > double nlo, hi; > }; > > Interval inspect_singleton_sqr(const double x) { > Interval s; > s.nlo = x * -x; > s.hi = x *
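
A minimal sketch of the idiom being asked about, assuming the rounding mode has already been switched to round-toward-+infinity (the <cfenv> name is FE_UPWARD) and assuming the truncated line completes as s.hi = x * x: with round-up in effect, x * -x rounds -(x*x) upward, so negating the stored value gives a conservative lower bound, while x * x rounded upward gives a conservative upper bound.

    struct Interval {
        double nlo;   // negated lower bound of the true value
        double hi;    // upper bound of the true value
    };

    Interval inspect_singleton_sqr(const double x) {
        // The caller is assumed to have run std::fesetround(FE_UPWARD)
        // already, as stated in the original message.
        Interval s;
        s.nlo = x * -x;   // rounds up, so -s.nlo <= x*x
        s.hi  = x * x;    // rounds up, so x*x <= s.hi
        return s;
    }
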
2013 Aug 09
2
[LLVMdev] [RFC] Poor code generation for paired load
...e available on the target. Truncate and shift instructions are useless (instructions 2., 4., and 5.). Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair ** To Reproduce ** Here is a way to reproduce the poor code generation for x86-64. opt -sroa current_input.ll -S -o - | llc -O3 -o - You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the next command. Here is a nicer code produced by modifying the input so that SROA generates friendlier code for this case. opt -sroa mod_input.ll -S -o - | llc -O3 -o - Basically the difference between both inputs is that memcpy has not been e...
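
A hypothetical C++ rendering of the pattern the RFC describes (the names are made up; the real reproducer is the attached current_input.ll): one 64-bit load whose halves are peeled off with a shift and two truncations, which currently lowers to the ldi64 + trunc + shift sequence (hence the extra vmovd and shrq) instead of a paired 32-bit load.

    #include <cstdint>

    struct Halves {
        std::uint32_t lo;
        std::uint32_t hi;
    };

    Halves split(const std::uint64_t *p) {
        std::uint64_t v = *p;                        // one 64-bit load
        Halves r;
        r.lo = static_cast<std::uint32_t>(v);        // truncate the low half
        r.hi = static_cast<std::uint32_t>(v >> 32);  // shift, then truncate the high half
        return r;                                    // ideally a paired 32-bit load
    }
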
2013 Aug 12
2
[LLVMdev] [RFC] Poor code generation for paired load
...., and 5.). >> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair >> >> >> ** To Reproduce ** >> >> Here is a way to reproduce the poor code generation for x86-64. >> >> opt -sroa current_input.ll -S -o - | llc -O3 -o - >> >> You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the >> next command. >> >> Here is a nicer code produced by modifying the input so that SROA generates >> friendlier code for this case. >> >> opt -sroa mod_input.ll -S -o - | llc -O3 -o - >> >> Ba...
2019 Jan 22
4
_Float16 support
...ment 0 from single to half vcvtph2ps xmm0, xmm0 # Convert argument 0 back to single vmulss xmm0, xmm0, xmm1 # xmm0 = xmm0*xmm1 (single precision) vcvtps2ph xmm1, xmm0, 4 # Convert the single precision result to half vmovd eax, xmm1 # Move the half precision result to eax mov word ptr [rip + x], ax # Store the half precision result in the global, x ret # Return the single precision result s...
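
A hypothetical source shape consistent with the annotations above (not the exact code from the thread; it assumes a compiler that enables _Float16 on x86 with F16C): argument 0 is narrowed to half and promoted back, the multiply runs in single precision (vmulss), and the product is narrowed with vcvtps2ph before being stored to the global x.

    _Float16 x;   // half-precision global, as in the quoted assembly

    float scale(float a, float b) {
        _Float16 ha = (_Float16)a;   // narrow argument 0 to half
        float r = (float)ha * b;     // promote back to single and multiply (vmulss)
        x = (_Float16)r;             // narrow the product and store it in x
        return r;                    // return the single-precision result
    }
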
2013 Aug 10
0
[LLVMdev] [RFC] Poor code generation for paired load
...ructions > are useless (instructions 2., 4., and 5.). > Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair > > > ** To Reproduce ** > > Here is a way to reproduce the poor code generation for x86-64. > > opt -sroa current_input.ll -S -o - | llc -O3 -o - > > You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the > next command. > > Here is a nicer code produced by modifying the input so that SROA generates > friendlier code for this case. > > opt -sroa mod_input.ll -S -o - | llc -O3 -o - > > Basically the difference between both...
2013 Aug 12
0
[LLVMdev] [RFC] Poor code generation for paired load
...ructions > are useless (instructions 2., 4., and 5.). > Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair > > > ** To Reproduce ** > > Here is a way to reproduce the poor code generation for x86-64. > > opt -sroa current_input.ll -S -o - | llc -O3 -o - > > You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the > next command. > > Here is a nicer code produced by modifying the input so that SROA generates > friendlier code for this case. > > opt -sroa mod_input.ll -S -o - | llc -O3 -o - > > Basically the difference between both...
2018 Mar 13
32
[PATCH v2 00/27] x86: PIE support and option to extend KASLR randomization
Changes: - patch v2: - Adapt patch to work post KPTI and compiler changes - Redo all performance testing with latest configs and compilers - Simplify mov macro on PIE (MOVABS now) - Reduce GOT footprint - patch v1: - Simplify ftrace implementation. - Use gcc mstack-protector-guard-reg=%gs with PIE when possible. - rfc v3: - Use --emit-relocs instead of -pie to reduce
2018 Mar 13
32
[PATCH v2 00/27] x86: PIE support and option to extend KASLR randomization
Changes: - patch v2: - Adapt patch to work post KPTI and compiler changes - Redo all performance testing with latest configs and compilers - Simplify mov macro on PIE (MOVABS now) - Reduce GOT footprint - patch v1: - Simplify ftrace implementation. - Use gcc mstack-protector-guard-reg=%gs with PIE when possible. - rfc v3: - Use --emit-relocs instead of -pie to reduce
2017 Oct 04
28
x86: PIE support and option to extend KASLR randomization
These patches make the changes necessary to build the kernel as a Position Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below the top 2G of the virtual address space, which allows the KASLR randomization range to be optionally extended from 1G to 3G. Thanks a lot to Ard Biesheuvel & Kees Cook for their feedback on compiler changes, PIE support and KASLR in general. Thanks to
2017 Oct 04
28
x86: PIE support and option to extend KASLR randomization
These patches make the changes necessary to build the kernel as a Position Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below the top 2G of the virtual address space, which allows the KASLR randomization range to be optionally extended from 1G to 3G. Thanks a lot to Ard Biesheuvel & Kees Cook for their feedback on compiler changes, PIE support and KASLR in general. Thanks to
2018 May 23
33
[PATCH v3 00/27] x86: PIE support and option to extend KASLR randomization
Changes: - patch v3: - Update on message to describe longer term PIE goal. - Minor change on ftrace if condition. - Changed code using xchgq. - patch v2: - Adapt patch to work post KPTI and compiler changes - Redo all performance testing with latest configs and compilers - Simplify mov macro on PIE (MOVABS now) - Reduce GOT footprint - patch v1: - Simplify ftrace
2017 Oct 11
32
[PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization
Changes: - patch v1: - Simplify ftrace implementation. - Use gcc mstack-protector-guard-reg=%gs with PIE when possible. - rfc v3: - Use --emit-relocs instead of -pie to reduce dynamic relocation space on mapped memory. It also simplifies the relocation process. - Move the start of the module section next to the kernel. Remove the need for -mcmodel=large on modules. Extends
2017 Oct 11
32
[PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization
Changes: - patch v1: - Simplify ftrace implementation. - Use gcc mstack-protector-guard-reg=%gs with PIE when possible. - rfc v3: - Use --emit-relocs instead of -pie to reduce dynamic relocation space on mapped memory. It also simplifies the relocation process. - Move the start of the module section next to the kernel. Remove the need for -mcmodel=large on modules. Extends
2019 Jan 24
2
[cfe-dev] _Float16 support
...o half > vcvtph2ps xmm0, xmm0 # Convert argument 0 back to single > vmulss xmm0, xmm0, xmm1 # xmm0 = xmm0*xmm1 (single precision) > vcvtps2ph xmm1, xmm0, 4 # Convert the single precision result to half > vmovd eax, xmm1 # Move the half precision result to eax > mov word ptr [rip + x], ax # Store the half precision result in the global, x > ret # Return the single precisio...