Displaying 20 results from an estimated 300 matches similar to: "AVX Scheduling and Parallelism"
2017 Jun 25
0
AVX Scheduling and Parallelism
Hi, Zvi,
I agree. In the context of targeting the KNL, however, I'm a bit
concerned about the addressing, and specifically, the size of the
resulting encoding:
> vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ;load b[401280] in
> zmm0
>
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344]
> ; zmm1<-zmm1+b[401344]
The KNL can only
2017 Jun 25
2
AVX Scheduling and Parallelism
Hi Ahmed,
>From what can be seen in the code snippet you provided, the reuse of XMM0 and XMM1 across loop-unroll instances does not inhibit instruction-level parallelism.
Modern X86 processors use register renaming that can eliminate the dependencies in the instruction stream. In the example you provided, the processor should be able to identify the 2-vloads + vadd + vstore sequences as
2017 Jul 01
2
KNL Assembly Code for Matrix Multiplication
Thank You,
It means vmovdqa64 zmm22, zmmword ptr [rip + .LCPI0_0] # zmm22 =
[8,9,10,11,12,13,14,15] zmm22 will contain 64 bit constant values which are
indexes here zmm22=8, 9, 10, 11, 12,13,14,15. not the values loaded from
these locations. and zmm2 contains constant 4000. so,
vpmuludq zmm14, zmm10, zmm2 ; will multiply the indexes values with 4000,
as for array b the stride is 4000.
zmm14=
2017 Aug 06
2
VBROADCAST Implementation Issues
i want to implement gather for v64i32. i wrote following code.
def GATHER_256B : I<0x68, MRMSrcMem, (outs VR_2048:$dst), (ins
i2048mem:$src),
"GATHER_256B\t{$src, $dst|$dst, $src}",
[(set VR_2048:$dst, (v64i32 (masked_gather
addr:$src)))],
IIC_MOV_MEM>, TA;
def: Pat<(v64f32 (masked_gather addr:$src)), (GATHER_256B
2017 Aug 07
2
VBROADCAST Implementation Issues
Hello,
I did as you said,
Please tell me whether the following correct now??
def GATHER_256B : I<0x68, MRMSrcMem, (outs VR_2048:$dst, _.KRCWM:$mask_wb),
(VR_2048:$src1, _.KRCWM:$mask, ins i2048mem:$src2),
"GATHER_256B\t{$src2, {$dst}{${mask}}|${dst} {${mask}},
$src2}"),
[(set VR_2048:$dst, _.KRCWM:$mask_wb, (v64i32
(GatherNode
2017 Aug 07
3
VBROADCAST Implementation Issues
Thank You. Still getting errors.I have modified my instructions as you said
as follows:
def GATHER_256B : I<0x68, MRMSrcMem, (outs VR_2048:$dst, VK64WM:$mask_wb),
(ins VR_2048:$src1, VK64WM:$mask, i2048mem:$src2),
"GATHER_256B\t{$src2, {$dst} {${mask}}|${dst}
{${mask}}, $src2}",
[(set VR_2048:$dst, VK64WM:$mask_wb, (v64i32
(masked_gather
2016 Nov 30
2
RFC: Adding Support For Vectorcall Calling Convention
Adding Support For Vectorcall Calling Convention
=====================================================
Vectorcall Calling Convention for x64
----------------------------------------------------
The __vectorcall calling convention specifies that arguments to
functions are to be passed in registers, when possible. __vectorcall
uses more registers for arguments than __fastcall or the default x64
2015 Jul 23
0
[LLVMdev] Intel asm syntax and variable names
So, there is no prior art for escaping the name of a global symbol with the
same name as a register? If there is, I'd rather we just implement it and
leave it at that.
We can probably fix the 'flags' case easily in LLVM, but I'd rather not
bend over backwards to make ZMM0 be a global name when AVX is disabled.
On Thu, Jul 23, 2015 at 9:12 AM, Yatsina, Marina <marina.yatsina at
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
Illustrative Example:
clang -fveclib=SVML -O3 svml.c -mavx
#include <math.h>
void foo(double *a, int N){
int i;
#pragma clang loop vectorize_width(8)
for (i=0;i<N;i++){
a[i] = sin(i);
}
}
Currently, this results in a call to <8 x double> __svml_sin8(<8 x double>) after the vectorizer.
This is 8-element SVML sin() called with 8-element argument. On the surface,
2015 Jul 23
1
[LLVMdev] Intel asm syntax and variable names
Some targets don't have the problem because they prefix all names with an undercore. Apart from that I am not aware of any solution to the problem of keywords clashing with variable names in intel syntax.
- Matthias
> On Jul 23, 2015, at 9:18 AM, Reid Kleckner <rnk at google.com> wrote:
>
> So, there is no prior art for escaping the name of a global symbol with the same name
2015 Jul 23
2
[LLVMdev] Intel asm syntax and variable names
Microsoft assembler treats mov to EAX as a register, even if there is a global memory also named EAX – meaning the register takes precedence.
But here I have a bit of a different situation – I have a global variable, which name happens to match an implicit register or a register that does not exist in the current arch, just in future ones. Microsoft assembler treats these cases as memory
2015 Jul 23
2
[LLVMdev] Intel asm syntax and variable names
Hi all,
I've encountered an issue with x86 Intel asm syntax when using certain variable names.
If you look at the following example, where I try to do a mov to a memory location named "flags2", llvm- mc works fine:
>cat test_good.s
mov eax, flags2
>llvm-mc.exe -x86-asm-syntax=intel test_good.s -o -
.text
movl flags2, %eax
But if the memory location is
2016 May 06
3
Unnecessary spill/fill issue
Hi, I am using mcjit in llvm 3.6 to jit kernels to x86 avx2. I've noticed
some inefficient use of the stack around constant vectors. In one example,
I have code that computes a series of constant vectors at compile time.
Each vector has a single use. In the final asm, I see a series of spills at
the top of the function of all the constant vectors immediately to stack,
then each use references
2018 Jul 23
3
[LoopVectorizer] Improving the performance of dot product reduction loop
Hello all,
This code https://godbolt.org/g/tTyxpf is a dot product reduction loop
multipying sign extended 16-bit values to produce a 32-bit accumulated
result. The x86 backend is currently not able to optimize it as well as gcc
and icc. The IR we are getting from the loop vectorizer has several v8i32
adds and muls inside the loop. These are fed by v8i16 loads and sexts from
v8i16 to v8i32. The
2018 Jul 23
4
[LoopVectorizer] Improving the performance of dot product reduction loop
~Craig
On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 07/23/2018 05:22 PM, Craig Topper wrote:
>
> Hello all,
>
> This code https://godbolt.org/g/tTyxpf is a dot product reduction loop
> multipying sign extended 16-bit values to produce a 32-bit accumulated
> result. The x86 backend is currently not able to optimize it as well as gcc
2018 Jul 24
4
[LoopVectorizer] Improving the performance of dot product reduction loop
On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 07/23/2018 06:37 PM, Craig Topper wrote:
>
>
> ~Craig
>
>
> On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> On 07/23/2018 05:22 PM, Craig Topper wrote:
>>
>> Hello all,
>>
>> This code https://godbolt.org/g/tTyxpf
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
Ashutosh,
Thanks for the repy.
Related earlier topic on this appears in the review of the SVML patch (@mmasten). Adding few names from there.
https://reviews.llvm.org/D19544
There, I see Hal's review comment "let's start only with the directly-legal calls". Apparently, what we have right now
in the trunk is "not legal enough". I'll work on the patch to stop
2018 Jul 23
2
[LoopVectorizer] Improving the performance of dot product reduction loop
On 07/23/2018 06:23 PM, Hal Finkel via llvm-dev wrote:
>
> On 07/23/2018 05:22 PM, Craig Topper wrote:
>> Hello all,
>>
>> This code https://godbolt.org/g/tTyxpf is a dot product reduction
>> loop multipying sign extended 16-bit values to produce a 32-bit
>> accumulated result. The x86 backend is currently not able to optimize
>> it as well as gcc and icc.
2015 Jul 23
0
[LLVMdev] Intel asm syntax and variable names
Suppose I have a global variable named 'EAX'. How do Intel assemblers
normally escape register names to access such a global variable?
On Thu, Jul 23, 2015 at 1:42 AM, Yatsina, Marina <marina.yatsina at intel.com>
wrote:
> Hi all,
>
>
>
> I’ve encountered an issue with x86 Intel asm syntax when using certain
> variable names.
>
>
>
> If you look at
2016 Nov 23
4
RFC: code size reduction in X86 by replacing EVEX with VEX encoding
Hi All.
This is an RFC for a proposed target specific X86 optimization for reducing code size in the encoding of AVX-512 instructions when possible.
When the AVX512F instruction set was introduced in X86 it included additional 32 registers of 512bit size each ZMM0 - ZMM31, as well as additional 16 XMM registers XMM16-XMM31 and 16 YMM registers YMM16-YMM31.
In order to encode the new registers of