thr3ads.net - similar to: "AVX Scheduling and Parallelism"

Displaying 20 results from an estimated 300 matches similar to: "AVX Scheduling and Parallelism"

2017 Jun 25

AVX Scheduling and Parallelism

Hi, Zvi, I agree. In the context of targeting the KNL, however, I'm a bit concerned about the addressing, and specifically, the size of the resulting encoding:

> vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ; load b[401280] in zmm0
>
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344] ; zmm1 <- zmm1 + b[401344]

The KNL can only


2017 Jun 25

AVX Scheduling and Parallelism

Hi Ahmed, From what can be seen in the code snippet you provided, the reuse of XMM0 and XMM1 across loop-unroll instances does not inhibit instruction-level parallelism. Modern x86 processors use register renaming, which can eliminate the false (name) dependencies in the instruction stream. In the example you provided, the processor should be able to identify the 2-vloads + vadd + vstore sequences as
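The point about register reuse across unrolled iterations can be sketched in C. This is only an illustration: the function name `add_unrolled2` and the unroll factor of two are assumptions, not code from the thread. Each unrolled copy is its own load, load, add, store chain with no data dependence on the other copy, which is why renaming can let the two chains overlap even when the compiler assigns them the same architectural registers.

```c
#include <stddef.h>

/* Hypothetical example: c[i] = a[i] + b[i], unrolled by two.
 * Copy 0 and copy 1 share no data, so the only possible link between
 * them is the reuse of a register name, which renaming removes. */
void add_unrolled2(const int *a, const int *b, int *c, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        int t0 = a[i] + b[i];         /* copy 0: two loads and an add */
        int t1 = a[i + 1] + b[i + 1]; /* copy 1: independent of copy 0 */
        c[i] = t0;                    /* store for copy 0 */
        c[i + 1] = t1;                /* store for copy 1 */
    }
    if (i < n)                        /* tail element when n is odd */
        c[i] = a[i] + b[i];
}
```

Calling `add_unrolled2(a, b, c, n)` on three arrays of length `n` fills `c` with the element-wise sums; whether the chains actually overlap depends on the hardware, not on the C source.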


2017 Jul 01

KNL Assembly Code for Matrix Multiplication

Thank you. It means that

vmovdqa64 zmm22, zmmword ptr [rip + .LCPI0_0] # zmm22 = [8,9,10,11,12,13,14,15]

loads 64-bit constant values into zmm22. These are indexes (zmm22 = 8, 9, 10, 11, 12, 13, 14, 15), not the values loaded from those locations. zmm2 contains the constant 4000, so

vpmuludq zmm14, zmm10, zmm2

multiplies the index values by 4000, because the stride for array b is 4000. zmm14=
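The index scaling described above can be written out per lane in C. This is a rough sketch under stated assumptions: the helper name `scale_indices` and the `STRIDE` macro are hypothetical, and the input indexes 8 to 15 are taken from the zmm22 constant in the post rather than from zmm10, whose contents the snippet does not show. The cast mirrors the unsigned 32-bit by 32-bit to 64-bit multiply that vpmuludq performs in each lane.

```c
#include <stddef.h>
#include <stdint.h>

#define STRIDE 4000u /* stride of array b, as stated in the post */

/* Hypothetical helper: scale each index by the stride, one lane at a time.
 * The widening cast keeps the full 64-bit product, as vpmuludq does. */
void scale_indices(const uint32_t *idx, uint64_t *out, size_t lanes)
{
    for (size_t k = 0; k < lanes; ++k)
        out[k] = (uint64_t)idx[k] * STRIDE;
}
```

With the indexes 8 through 15 as input, the eight results run from 32000 to 60000 in steps of 4000.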


2017 Aug 06

VBROADCAST Implementation Issues

I want to implement gather for v64i32. I wrote the following code.

def GATHER_256B : I<0x68, MRMSrcMem, (outs VR_2048:$dst), (ins i2048mem:$src),
                    "GATHER_256B\t{$src, $dst|$dst, $src}",
                    [(set VR_2048:$dst, (v64i32 (masked_gather addr:$src)))],
                    IIC_MOV_MEM>, TA;

def: Pat<(v64f32 (masked_gather addr:$src)), (GATHER_256B


2017 Aug 07

VBROADCAST Implementation Issues

Hello, I did as you said. Please tell me whether the following is correct now:

def GATHER_256B : I<0x68, MRMSrcMem, (outs VR_2048:$dst, _.KRCWM:$mask_wb),
                    (VR_2048:$src1, _.KRCWM:$mask, ins i2048mem:$src2),
                    "GATHER_256B\t{$src2, {$dst}{${mask}}|${dst} {${mask}}, $src2}"),
                    [(set VR_2048:$dst, _.KRCWM:$mask_wb, (v64i32 (GatherNode


2017 Aug 07

VBROADCAST Implementation Issues

Thank you. I am still getting errors. I have modified my instructions as you said, as follows:

def GATHER_256B : I<0x68, MRMSrcMem, (outs VR_2048:$dst, VK64WM:$mask_wb),
                    (ins VR_2048:$src1, VK64WM:$mask, i2048mem:$src2),
                    "GATHER_256B\t{$src2, {$dst} {${mask}}|${dst} {${mask}}, $src2}",
                    [(set VR_2048:$dst, VK64WM:$mask_wb, (v64i32 (masked_gather


2016 Nov 30

RFC: Adding Support For Vectorcall Calling Convention

Adding Support For Vectorcall Calling Convention
=====================================================

Vectorcall Calling Convention for x64
----------------------------------------------------

The __vectorcall calling convention specifies that arguments to functions are to be passed in registers, when possible. __vectorcall uses more registers for arguments than __fastcall or the default x64


2015 Jul 23

[LLVMdev] Intel asm syntax and variable names

So, there is no prior art for escaping the name of a global symbol with the same name as a register? If there is, I'd rather we just implement it and leave it at that. We can probably fix the 'flags' case easily in LLVM, but I'd rather not bend over backwards to make ZMM0 be a global name when AVX is disabled.

On Thu, Jul 23, 2015 at 9:12 AM, Yatsina, Marina <marina.yatsina at


2018 Jun 29

[RFC][VECLIB] how should we legalize VECLIB calls?

Illustrative Example:

clang -fveclib=SVML -O3 svml.c -mavx

#include <math.h>
void foo(double *a, int N){
  int i;
#pragma clang loop vectorize_width(8)
  for (i=0;i<N;i++){
    a[i] = sin(i);
  }
}

Currently, this results in a call to <8 x double> __svml_sin8(<8 x double>) after the vectorizer. This is the 8-element SVML sin() called with an 8-element argument. On the surface,


2015 Jul 23

[LLVMdev] Intel asm syntax and variable names

Some targets don't have the problem because they prefix all names with an underscore. Apart from that, I am not aware of any solution to the problem of keywords clashing with variable names in Intel syntax.

- Matthias

> On Jul 23, 2015, at 9:18 AM, Reid Kleckner <rnk at google.com> wrote:
>
> So, there is no prior art for escaping the name of a global symbol with the same name


2015 Jul 23

[LLVMdev] Intel asm syntax and variable names

The Microsoft assembler treats EAX in a mov as a register, even if there is also a global memory location named EAX; that is, the register takes precedence. But here I have a slightly different situation: I have a global variable whose name happens to match an implicit register, or a register that does not exist in the current architecture, only in future ones. The Microsoft assembler treats these cases as memory


2015 Jul 23

[LLVMdev] Intel asm syntax and variable names

Hi all, I've encountered an issue with x86 Intel asm syntax when using certain variable names. If you look at the following example, where I try to do a mov to a memory location named "flags2", llvm-mc works fine:

>cat test_good.s
mov eax, flags2
>llvm-mc.exe -x86-asm-syntax=intel test_good.s -o -
.text
movl flags2, %eax

But if the memory location is


2016 May 06

Unnecessary spill/fill issue

Hi, I am using MCJIT in LLVM 3.6 to JIT kernels to x86 AVX2. I've noticed some inefficient use of the stack around constant vectors. In one example, I have code that computes a series of constant vectors at compile time. Each vector has a single use. In the final asm, I see a series of spills at the top of the function that immediately store all the constant vectors to the stack, then each use references


2018 Jul 23

[LoopVectorizer] Improving the performance of dot product reduction loop

Hello all, This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign-extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The
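For readers without access to the linked code, the kind of loop being described can be sketched in C as follows. This is a hypothetical reconstruction from the prose only, not the source behind the godbolt link, and the name `dot_i16` is an assumption. Each 16-bit element is sign-extended to 32 bits before the multiply, and the products are accumulated in a 32-bit sum.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar form of the reduction: widen both 16-bit operands
 * to 32 bits (sign extension), multiply, and add into a 32-bit sum. */
int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```

A vectorizer working on a loop of this shape would produce the pattern the post mentions: narrow loads, sign extensions to 32-bit lanes, then 32-bit multiplies and adds.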


2018 Jul 23

[LoopVectorizer] Improving the performance of dot product reduction loop

~Craig

On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 07/23/2018 05:22 PM, Craig Topper wrote:
>
> Hello all,
>
> This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign-extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc


2018 Jul 24

[LoopVectorizer] Improving the performance of dot product reduction loop

On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 07/23/2018 06:37 PM, Craig Topper wrote:
>
> ~Craig
>
> On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <hfinkel at anl.gov> wrote:
>>
>> On 07/23/2018 05:22 PM, Craig Topper wrote:
>>
>> Hello all,
>>
>> This code https://godbolt.org/g/tTyxpf


2018 Jun 29

[RFC][VECLIB] how should we legalize VECLIB calls?

Ashutosh, Thanks for the reply. A related earlier discussion of this appears in the review of the SVML patch (@mmasten). Adding a few names from there. https://reviews.llvm.org/D19544 There, I see Hal's review comment "let's start only with the directly-legal calls". Apparently, what we have right now in the trunk is "not legal enough". I'll work on the patch to stop


2018 Jul 23

[LoopVectorizer] Improving the performance of dot product reduction loop

On 07/23/2018 06:23 PM, Hal Finkel via llvm-dev wrote:
>
> On 07/23/2018 05:22 PM, Craig Topper wrote:
>> Hello all,
>>
>> This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign-extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc.


2015 Jul 23

[LLVMdev] Intel asm syntax and variable names

Suppose I have a global variable named 'EAX'. How do Intel assemblers normally escape register names to access such a global variable?

On Thu, Jul 23, 2015 at 1:42 AM, Yatsina, Marina <marina.yatsina at intel.com> wrote:
> Hi all,
>
> I’ve encountered an issue with x86 Intel asm syntax when using certain variable names.
>
> If you look at


2016 Nov 23

RFC: code size reduction in X86 by replacing EVEX with VEX encoding

Hi All. This is an RFC for a proposed target-specific X86 optimization for reducing code size in the encoding of AVX-512 instructions when possible. When the AVX512F instruction set was introduced in X86, it added 32 registers of 512 bits each (ZMM0-ZMM31), as well as 16 additional XMM registers (XMM16-XMM31) and 16 additional YMM registers (YMM16-YMM31). In order to encode the new registers of
