Displaying 20 results from an estimated 22 matches for "ymm2".
2013 Aug 28
3
[PATCH] x86: AVX instruction emulation fixes
...cpu_has_mmx )
+ case X86EMUL_FPU_ymm:
+ if ( cpu_has_avx )
break;
default:
return X86EMUL_UNHANDLEABLE;
@@ -629,6 +641,73 @@ int main(int argc, char **argv)
else
printf("skipped\n");
+ printf("%-40s", "Testing vmovdqu %ymm2,(%ecx)...");
+ if ( stack_exec && cpu_has_avx )
+ {
+ extern const unsigned char vmovdqu_to_mem[];
+
+ asm volatile ( "vpcmpeqb %%xmm2, %%xmm2, %%xmm2\n"
+ ".pushsection .test, \"a\", @progbits\n"
+...
2017 Feb 18
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...> produces worse code even for x86 with AVX2:
> before:
> vmovd %edi, %xmm1
> vpmovzxwq %xmm1, %xmm1
> vpsraw %xmm1, %xmm0, %xmm0
> retq
>
> after:
> vmovd %edi, %xmm1
> vpbroadcastd %xmm1, %ymm1
> vmovdqa LCPI1_0(%rip), %ymm2
> vpshufb %ymm2, %ymm1, %ymm1
> vpermq $232, %ymm1, %ymm1
> vpmovzxwd %xmm1, %ymm1
> vpmovsxwd %xmm0, %ymm0
> vpsravd %ymm1, %ymm0, %ymm0
> vpshufb %ymm2, %ymm0, %ymm0
> vpermq $232, %ymm0, %ymm0
> vzeroupper
>
>
>...
2017 Feb 17
2
Vector trunc code generation difference between llvm-3.9 and 4.0
Correction in the C snippet:
typedef signed short v8i16_t __attribute__((ext_vector_type(8)));
v8i16_t foo (v8i16_t a, int n)
{
  return a >> n;
}
Best regards
Saurabh
On 17 February 2017 at 16:21, Saurabh Verma <saurabh.verma at movidius.com>
wrote:
> Hello,
>
> We are investigating a difference in code generation for vector splat
> instructions between llvm-3.9
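For context, a scalar restatement of that snippet (my paraphrase, not from the thread): every lane is shifted by the same run-time count n, which is why the llvm-3.9 output can use vpsraw with one scalar count, whereas the 4.0 output widens the lanes and falls back to the per-lane-count vpsravd sequence.

/* Scalar equivalent of foo() above: one shift count for all lanes. */
void foo_scalar(short a[8], int n)
{
    for (int i = 0; i < 8; i++)
        a[i] = (short)(a[i] >> n);
}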
2017 Mar 08
2
Vector trunc code generation difference between llvm-3.9 and 4.0
...>>> vmovd %edi, %xmm1
>>> vpmovzxwq %xmm1, %xmm1
>>> vpsraw %xmm1, %xmm0, %xmm0
>>> retq
>>>
>>> after:
>>> vmovd %edi, %xmm1
>>> vpbroadcastd %xmm1, %ymm1
>>> vmovdqa LCPI1_0(%rip), %ymm2
>>> vpshufb %ymm2, %ymm1, %ymm1
>>> vpermq $232, %ymm1, %ymm1
>>> vpmovzxwd %xmm1, %ymm1
>>> vpmovsxwd %xmm0, %ymm0
>>> vpsravd %ymm1, %ymm0, %ymm0
>>> vpshufb %ymm2, %ymm0, %ymm0
>>> vpermq...
2012 Jan 09
3
[LLVMdev] Calling conventions for YMM registers on AVX
On Jan 9, 2012, at 10:00 AM, Jakob Stoklund Olesen wrote:
>
> On Jan 8, 2012, at 11:18 PM, Demikhovsky, Elena wrote:
>
>> I'll explain what we see in the code.
>> 1. The caller saves XMM registers across the call if needed (according to DEFS definition).
>> YMMs are not in the set, so the caller does not take care of them.
>
> This is not how the register allocator
2012 Jan 10
0
[LLVMdev] Calling conventions for YMM registers on AVX
...# @test
# BB#0: # %entry
pushq %rbp
movq %rsp, %rbp
subq $64, %rsp
vmovaps %xmm7, -32(%rbp) # 16-byte Spill
vmovaps %xmm6, -16(%rbp) # 16-byte Spill
vmovaps %ymm3, %ymm6
vmovaps %ymm2, %ymm7
vaddps %ymm7, %ymm0, %ymm0
vaddps %ymm6, %ymm1, %ymm1
callq foo
vsubps %ymm7, %ymm0, %ymm0
vsubps %ymm6, %ymm1, %ymm1
vmovaps -16(%rbp), %xmm6 # 16-byte Reload
vmovaps -32(%rbp), %xmm7 # 16-byte Reload
addq...
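A guess at the kind of source behind this listing (signatures simplified; foo is a stand-in for the real call): values held in ymm6/ymm7 across the call are spilled and reloaded as 16-byte xmm values, so their upper 128 bits are not actually preserved.

#include <immintrin.h>

void foo(void);   /* stand-in for the real call */

__m256 test(__m256 a, __m256 b, __m256 c, __m256 d)
{
    a = _mm256_add_ps(a, c);
    b = _mm256_add_ps(b, d);
    foo();                     /* c and d must survive this call; a  */
    a = _mm256_sub_ps(a, c);   /* 16-byte spill/reload of xmm6/xmm7  */
    b = _mm256_sub_ps(b, d);   /* loses their upper 128 bits         */
    return _mm256_add_ps(a, b);
}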
2020 Sep 01
2
Vector evolution?
...xor %eax,%eax
1e2: c4 e2 7d 18 05 00 00 vbroadcastss 0x0(%rip),%ymm0 # 1eb
<_Z4fct7Pf+0xb>
1e9: 00 00
1eb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
1f0: c5 fc 59 0c 87 vmulps (%rdi,%rax,4),%ymm0,%ymm1
1f5: c5 fc 59 54 87 20 vmulps 0x20(%rdi,%rax,4),%ymm0,%ymm2
1fb: c5 fc 59 5c 87 40 vmulps 0x40(%rdi,%rax,4),%ymm0,%ymm3
201: c5 fc 59 64 87 60 vmulps 0x60(%rdi,%rax,4),%ymm0,%ymm4
207: c5 fc 11 0c 87 vmovups %ymm1,(%rdi,%rax,4)
20c: c5 fc 11 54 87 20 vmovups %ymm2,0x20(%rdi,%rax,4)
212: c5 fc 11 5c 87 40 vmovups %ymm3,0x40(%rdi,%rax,...
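A guess at the source behind this disassembly: _Z4fct7Pf demangles to fct7(float*), and the vbroadcastss / vmulps / vmovups sequence is an in-place multiply by a loop-invariant float, vectorized 8 lanes wide and unrolled four YMM vectors deep (the factor and trip count below are made up):

void fct7(float *p)
{
    for (int i = 0; i < 1024; i++)
        p[i] *= 2.5f;
}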
2013 Apr 09
1
[LLVMdev] inefficient code generation for 128-bit->256-bit typecast intrinsics
Hello,
LLVM generates two additional instructions for 128->256 bit typecasts
(e.g. _mm256_castsi128_si256()) to clear out the upper 128 bits of the YMM register corresponding to the source XMM register.
vxorps xmm2,xmm2,xmm2
vinsertf128 ymm0,ymm2,xmm0,0x0
Most of the industry-standard C/C++ compilers (GCC, Intel's compiler, Visual Studio compiler) don't
generate any extra moves for 128-bit->256-bit typecast intrinsics.
None of these compilers zero-extend the upper 128 bits of the 256-bit YMM register. Intel's
documentatio...
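A minimal use of the intrinsic under discussion; since Intel documents the upper 128 bits of the result as undefined, this should ideally compile to zero instructions (function name is illustrative):

#include <immintrin.h>

__m256i widen(__m128i lo)
{
    /* pure reinterpretation: upper 128 bits are undefined by contract */
    return _mm256_castsi128_si256(lo);
}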
2013 Dec 11
2
[LLVMdev] AVX code gen
...vm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang / llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I cannot get clang/llvm to generate such instructions (using the 3.3 release or either 3.4 rc1 or 3.4 rc2). I am new to clang / llvm so I may not be invoking the tools correctly but given that -fvectorize and -fslp-vectorize are on by default at 3.4 I would have thought that if t...
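The kind of loop in question would be something like the following (names and types are illustrative): with AVX enabled and the loop vectorizer active, each group of iterations should become a packed vmulps on a full-width YMM register.

void mul(float *restrict a, const float *restrict b,
         const float *restrict c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}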
2017 Aug 17
4
unable to emit vectorized code in LLVM IR
I assume the compiler knows that you only have 2 input values that you just
added together 1000 times.
Despite the fact that you stored to a[i] and b[i] here, nothing reads them
other than the addition in the same loop iteration. So the compiler easily
removed the a and b arrays. Same with 'c', it's not read outside the loop
so it doesn't need to exist. So the compiler turned your
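A sketch of the pattern being described (names and bounds are guesses): every store is consumed only inside the same iteration, so the arrays, and with them the loop the poster hoped to see vectorized, are optimized away entirely.

void kernel(int x, int y)
{
    int a[1000], b[1000], c[1000];
    for (int i = 0; i < 1000; i++) {
        a[i] = x;
        b[i] = y;
        c[i] = a[i] + b[i];   /* the only reads of a[i] and b[i] */
    }
    /* nothing reads a, b or c past this point, so it all folds away */
}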
2013 Dec 12
0
[LLVMdev] AVX code gen
...vm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang / llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I cannot get clang/llvm to generate such instructions (using the 3.3 release or either 3.4 rc1 or 3.4 rc2). I am new to clang / llvm so I may not be invoking the tools correctly but given that -fvectorize and -fslp-vectorize are on by default at 3.4 I would have thought that if t...
2017 Oct 11
1
[PATCH v1 01/27] x86/crypto: Adapt assembly for PIE support
...vmovdqu %ymm13, 15 * 32(%rax);
@@ -1158,7 +1158,7 @@ ENTRY(camellia_ctr_32way)
/* inpack32_pre: */
vpbroadcastq (key_table)(CTX), %ymm15;
- vpshufb .Lpack_bswap, %ymm15, %ymm15;
+ vpshufb .Lpack_bswap(%rip), %ymm15, %ymm15;
vpxor %ymm0, %ymm15, %ymm0;
vpxor %ymm1, %ymm15, %ymm1;
vpxor %ymm2, %ymm15, %ymm2;
@@ -1242,13 +1242,13 @@ camellia_xts_crypt_32way:
subq $(16 * 32), %rsp;
movq %rsp, %rax;
- vbroadcasti128 .Lxts_gf128mul_and_shl1_mask_0, %ymm12;
+ vbroadcasti128 .Lxts_gf128mul_and_shl1_mask_0(%rip), %ymm12;
/* load IV and construct second IV */
vmovdqu (%rcx), %xmm0;...
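The shape of the change, reduced to C (names illustrative): under PIE, data must be reached relative to %rip rather than through an absolute address, which is exactly what appending (%rip) to the assembly operands above accomplishes.

static const int table[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };

int lookup(int i)
{
    /* with -fPIE this is typically a lea table(%rip), ... access */
    return table[i];
}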
2014 Feb 21
2
[LLVMdev] [lldb-dev] How is variable info retrieved in debugging for executables generated by llvm backend?
...size:256;offset:275;encoding:vector;format:vector-uint8;set:Floating
> Point Registers;gcc:17;dwarf:17;#00
> $qRegisterInfo5c#da
> $name:ymm1;bitsize:256;offset:307;encoding:vector;format:vector-uint8;set:Floating
> Point Registers;gcc:18;dwarf:18;#00
> $qRegisterInfo5d#db
> $name:ymm2;bitsize:256;offset:339;encoding:vector;format:vector-uint8;set:Floating
> Point Registers;gcc:19;dwarf:19;#00
> $qRegisterInfo5e#dc
> $name:ymm3;bitsize:256;offset:371;encoding:vector;format:vector-uint8;set:Floating
> Point Registers;gcc:20;dwarf:20;#00
> $qRegisterInfo5f#dd
> $n...
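(Note that the offsets advance in 32-byte steps: 275, 307, 339, 371, consistent with bitsize:256, i.e. 32 bytes per YMM register.)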
2018 Mar 13
32
[PATCH v2 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v2:
- Adapt patch to work post KPTI and compiler changes
- Redo all performance testing with latest configs and compilers
- Simplify mov macro on PIE (MOVABS now)
- Reduce GOT footprint
- patch v1:
- Simplify ftrace implementation.
- Use gcc -mstack-protector-guard-reg=%gs with PIE when possible.
- rfc v3:
- Use --emit-relocs instead of -pie to reduce
2017 Oct 04
28
x86: PIE support and option to extend KASLR randomization
These patches make the changes necessary to build the kernel as Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
the top 2G of the virtual address space. It allows the KASLR randomization
range to be optionally extended from 1G to 3G.
Thanks a lot to Ard Biesheuvel & Kees Cook for their feedback on compiler
changes, PIE support and KASLR in general. Thanks to
2014 Feb 20
2
[LLVMdev] [lldb-dev] How is variable info retrieved in debugging for executables generated by llvm backend?
Thank you, Clayton. This is very helpful.
We use the LLDB-specific GDB remote extensions, and our debugger server
supports the "qRegisterInfo" packet. "reg 0x3c" is the frame pointer.
In the example mentioned above, we have SP = FP - 40 for the current call
frame, and variable "a" is stored at address (FP - 24), per the asm
instruction [FP + -24] = R3;;
Thus we can conclude
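(Presumably along the lines of: FP = SP + 40, so FP - 24 = SP + 16, i.e. "a" lives 16 bytes above the stack pointer in this frame.)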
2018 May 23
33
[PATCH v3 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v3:
- Update on message to describe longer term PIE goal.
- Minor change on ftrace if condition.
- Changed code using xchgq.
- patch v2:
- Adapt patch to work post KPTI and compiler changes
- Redo all performance testing with latest configs and compilers
- Simplify mov macro on PIE (MOVABS now)
- Reduce GOT footprint
- patch v1:
- Simplify ftrace
2017 Oct 11
32
[PATCH v1 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v1:
- Simplify ftrace implementation.
- Use gcc -mstack-protector-guard-reg=%gs with PIE when possible.
- rfc v3:
- Use --emit-relocs instead of -pie to reduce dynamic relocation space on
mapped memory. It also simplifies the relocation process.
- Move the start of the module section next to the kernel. Remove the need for
-mcmodel=large on modules. Extends