Displaying 20 results from an estimated 1000 matches similar to: "AVX 512 Assembly Code Generation issues"
2016 Jun 23
2
AVX512 instruction generated when JIT compiling for an avx2 architecture
With LLVM 3.8 the JIT compiler engine generates an AVX512 instruction
although I target an 'avx2' CPU (Intel Core i7).
I just downloaded the most recent 3.8 and it still happens.
It happens with this input module:
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
define void @module_cFFEMJ(i64 %lo, i64 %hi, i64 %myId, i1 %ordered, i64
%start, i32* noalias align 32
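(Assuming the module is handed to MCJIT through EngineBuilder, a minimal sketch of how one would pin the target to 'core-avx2' instead of letting the engine auto-detect the host; the function name makeEngine is illustrative:)

#include <llvm/ExecutionEngine/ExecutionEngine.h>
#include <llvm/ExecutionEngine/MCJIT.h>
#include <llvm/IR/Module.h>
#include <memory>
#include <string>
#include <vector>

// Build an MCJIT engine pinned to 'core-avx2' regardless of the host CPU.
// M is the already-parsed input module.
llvm::ExecutionEngine *makeEngine(std::unique_ptr<llvm::Module> M) {
  llvm::EngineBuilder builder(std::move(M));
  builder.setMCPU("core-avx2");  // instead of the auto-detected host CPU
  builder.setMAttrs(std::vector<std::string>{"-avx512f"});  // mask AVX-512 explicitly
  return builder.create();
}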
2016 Jun 23
2
AVX512 instruction generated when JIT compiling for an avx2 architecture
On 06/23/2016 12:56 PM, Craig Topper wrote:
> Can you check what value "getHostCPUName" returned?
getHostCPUName() = skylake
>
> On Thu, Jun 23, 2016 at 9:53 AM, Frank Winter via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> With LLVM 3.8 the JIT compiler engine generates an AVX512
> instruction although I
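(For reference, these host-detection APIs can be queried directly; a small self-contained sketch using real LLVM functions, with the printing purely illustrative:)

#include <llvm/ADT/StringMap.h>
#include <llvm/Support/Host.h>
#include <cstdio>

int main() {
  // What the JIT picks up when no CPU is set explicitly.
  printf("host cpu: %s\n", llvm::sys::getHostCPUName().str().c_str());

  // Per-feature detection; returns false if unsupported on this platform.
  llvm::StringMap<bool> Features;
  if (llvm::sys::getHostCPUFeatures(Features))
    for (const auto &F : Features)
      printf("  %c%s\n", F.second ? '+' : '-', F.first().str().c_str());
  return 0;
}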
2013 Sep 20
0
[LLVMdev] Passing a 256 bit integer vector with XMM registers
I am implementing a new calling convention for X86 which requires passing a 256-bit integer vector in two XMM registers rather than one YMM register. For example:
define <8 x i32> @add(<8 x i32> %a, <8 x i32> %b) {
%add = add <8 x i32> %a, %b
ret <8 x i32> %add
}
With -march=x86-64 and -mcpu=corei7-avx, llc with the default calling convention generates the
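(Illustratively, the split this convention implies amounts to performing the 256-bit operation in two 128-bit halves; a hedged C++ intrinsics sketch of the idea, not the backend change itself:)

#include <emmintrin.h>  // SSE2: __m128i, _mm_add_epi32

// An <8 x i32> value carried as two XMM-sized halves, mirroring a
// convention that passes the vector in two XMM registers.
struct i32x8 { __m128i lo, hi; };

static i32x8 add8(i32x8 a, i32x8 b) {
  i32x8 r;
  r.lo = _mm_add_epi32(a.lo, b.lo);  // lanes 0..3
  r.hi = _mm_add_epi32(a.hi, b.hi);  // lanes 4..7
  return r;
}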
2017 Jun 25
0
AVX Scheduling and Parallelism
Hi, Zvi,
I agree. In the context of targeting the KNL, however, I'm a bit
concerned about the addressing, and specifically, the size of the
resulting encoding:
> vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ; load b[401280] into zmm0
>
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344] ; zmm1 <- zmm1 + b[401344]
The KNL can only
2017 Jun 25
2
AVX Scheduling and Parallelism
Hi Ahmed,
From what can be seen in the code snippet you provided, the reuse of XMM0 and XMM1 across loop-unroll instances does not inhibit instruction-level parallelism.
Modern X86 processors use register renaming that can eliminate the dependencies in the instruction stream. In the example you provided, the processor should be able to identify the 2-vloads + vadd + vstore sequences as
2014 Oct 13
2
[LLVMdev] Unexpected spilling of vector register during lane extraction on some x86_64 targets
Hello,
Depending on how I extract integer lanes from an x86_64 xmm register, the
backend may spill that register in order to load scalars. The effect was
observed on two targets: corei7-avx and btver1 (I haven't checked other
targets).
Here's a test case with the spilling and no-spilling code under conditional compilation:
#if __SSE4_1__ != 0
#include <smmintrin.h>
#else
#include
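(For context, the two extraction styles being compared typically look like the following; a hedged sketch assuming the SSE4.1 path, with function names illustrative:)

#include <smmintrin.h>  // SSE4.1: _mm_extract_epi32 (pextrd)

// Path 1: extract the lane directly from the register.
static int lane_direct(__m128i v) {
  return _mm_extract_epi32(v, 2);
}

// Path 2: go through memory; written naively this is the pattern that
// can tempt the backend into spilling the whole register.
static int lane_via_memory(__m128i v) {
  alignas(16) int tmp[4];
  _mm_store_si128((__m128i *)tmp, v);
  return tmp[2];
}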
2017 Jun 24
4
AVX Scheduling and Parallelism
Hello,
After generating AVX code for a large number of iterations I came to realize that it still uses only 2 registers, zmm0 and zmm1, when the loop unroll factor is 1024. I wonder if this register allocation allows operations in parallel? Also, I know all the elements within a single vector instruction are computed in parallel, but are the elements of multiple instructions computed in parallel? Like, are
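(The usual way to make that parallelism explicit is to give unrolled iterations independent accumulators; a hedged scalar C++ sketch of the idea, not the generated zmm code:)

#include <cstddef>

// Two independent accumulators form two dependency chains that can
// execute in parallel; a single accumulator would serialize the adds
// even though register renaming removes the false register reuse.
float sum_pairwise(const float *a, std::size_t n) {
  float acc0 = 0.0f, acc1 = 0.0f;
  for (std::size_t i = 0; i + 1 < n; i += 2) {
    acc0 += a[i];      // chain 0
    acc1 += a[i + 1];  // chain 1
  }
  return acc0 + acc1;
}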
2016 Jun 29
0
avx512 JIT backend generates wrong code on <4 x float>
Hi Frank,
I recommend trying trunk LLVM. AVX-512 development has been very active recently.
-Hal
----- Original Message -----
> From: "Frank Winter via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "LLVM Dev" <llvm-dev at lists.llvm.org>
> Sent: Wednesday, June 29, 2016 2:41:39 PM
> Subject: [llvm-dev] avx512 JIT backend generates wrong code on <4
2016 Jun 30
1
avx512 JIT backend generates wrong code on <4 x float>
Hi Hal!
Thanks, but unfortunately it didn't help. The exact same assembler
instructions are generated for both 3.8 (yesterday) and trunk (from today).
So, this really looks like a bug.
Best,
Frank
On 06/29/2016 03:48 PM, Hal Finkel wrote:
> Hi Frank,
>
> I recommend trying trunk LLVM. AVX-512 development has been very active recently.
>
> -Hal
>
> ----- Original
2011 Mar 22
0
[LLVMdev] sitofp inst selection in x86/AVX target [PR9473]
Hello LLVMers,
I am now trying to fix bug PR9473.
The sitofp instruction in LLVM IR is converted to vcvtsi2sd (the same applies to the vcvtsi2ss case) for the x86/AVX backend, but vcvtsi2sd has a somewhat odd instruction format:
VCVTSI2SD xmm1, xmm2, r/m32
VCVTSI2SD xmm1, xmm2, r/m64
Bits 127:64 of xmm2 are copied to the corresponding bits of xmm1, so in many cases xmm1 and xmm2 could be the same register.
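(The intrinsic form makes the odd extra operand visible; a hedged sketch: passing a zeroed register as the xmm2 source is the usual way to avoid a false dependency on a stale register:)

#include <emmintrin.h>  // SSE2: _mm_cvtsi32_sd, _mm_setzero_pd

// VCVTSI2SD xmm1, xmm2, r/m32 corresponds to this intrinsic: the first
// argument supplies bits 127:64 of the result, so a zeroed first
// argument avoids depending on whatever the destination last held.
static __m128d int_to_double(int i) {
  return _mm_cvtsi32_sd(_mm_setzero_pd(), i);
}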
2016 Jun 29
2
avx512 JIT backend generates wrong code on <4 x float>
Hi!
When compiling the attached module with the JIT engine on an Intel KNL I see wrong code getting emitted. I attach a complete exploit program which shows the bug in LLVM 3.8. It loads and JIT compiles the module and prints the assembler. I stumbled on this since the result of an actual calculation was wrong, so it's not only the text version of the assembler but also the machine
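(A harness of the kind described would look roughly like this; a hedged sketch built on LLVM 3.8-era APIs, with "module.ll" standing in for the attached module:)

#include <llvm/ExecutionEngine/ExecutionEngine.h>
#include <llvm/ExecutionEngine/MCJIT.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IRReader/IRReader.h>
#include <llvm/Support/SourceMgr.h>
#include <llvm/Support/TargetSelect.h>

int main() {
  llvm::InitializeNativeTarget();
  llvm::InitializeNativeTargetAsmPrinter();

  llvm::LLVMContext Ctx;
  llvm::SMDiagnostic Err;
  auto M = llvm::parseIRFile("module.ll", Err, Ctx);  // placeholder name
  if (!M) return 1;

  llvm::ExecutionEngine *EE = llvm::EngineBuilder(std::move(M)).create();
  EE->finalizeObject();  // forces code emission for inspection
  return 0;
}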
2018 Jul 24
2
[LoopVectorizer] Improving the performance of dot product reduction loop
On 07/24/2018 02:58 AM, Nema, Ashutosh wrote:
>
> *From:*Hal Finkel <hfinkel at anl.gov>
> *Sent:* Tuesday, July 24, 2018 5:05 AM
> *To:* Craig Topper <craig.topper at gmail.com>; hideki.saito at intel.com;
> estotzer at ti.com; Nemanja Ivanovic <nemanja.i.ibm at gmail.com>; Adam
> Nemet <anemet at apple.com>; graham.hunter at
2012 May 24
4
[LLVMdev] use AVX automatically if present
I wonder why AVX is not used automatically if available on the host machine. In contrast, SSE4.1 instructions (like pmulld) are automatically used if the host machine supports SSE4.1.
E.g.
$ cat avx.ll
define void @_fun1(<8 x float>*, <8 x float>*) {
_L1:
%x = load <8 x float>* %0
%y = load <8 x float>* %1
%z = fadd <8 x float> %x, %y
store
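(One way to opt in explicitly is to hand the host's detected feature set to the engine; a hedged sketch mirroring the pattern LLVM's own tools use for -mcpu=native, with hostFeatureString an illustrative name:)

#include <llvm/ADT/StringMap.h>
#include <llvm/MC/SubtargetFeature.h>
#include <llvm/Support/Host.h>
#include <string>

// Build a "+avx,+sse4.1,..." feature string from the host CPU, which
// can then be used as the features string when creating a TargetMachine.
std::string hostFeatureString() {
  llvm::SubtargetFeatures Features;
  llvm::StringMap<bool> HostFeatures;
  if (llvm::sys::getHostCPUFeatures(HostFeatures))
    for (const auto &F : HostFeatures)
      Features.AddFeature(F.first(), F.second);
  return Features.getString();
}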
2013 Sep 05
1
[LLVMdev] AVX calling convention?
I am tracking down an x86-64 code generation problem that has to do with AVX instructions. The symptom is: a function is called, and the upper half of the function argument (which is short16) is zero. This happens only when I compile code with pocl, but not when I use clang and/or llc manually.
I tracked this down to the following. The call site looks like
vmovdqa 24064(%rsp), %ymm0
vmovdqa
2017 Jul 01
2
KNL Assembly Code for Matrix Multiplication
Thank you. So
vmovdqa64 zmm22, zmmword ptr [rip + .LCPI0_0] # zmm22 = [8,9,10,11,12,13,14,15]
means zmm22 will contain 64-bit constant values which are indexes here (zmm22 = 8, 9, 10, 11, 12, 13, 14, 15), not the values loaded from those locations. And zmm2 contains the constant 4000. So
vpmuludq zmm14, zmm10, zmm2
will multiply the index values by 4000, as the stride for array b is 4000.
zmm14=
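(In scalar form the two instructions just compute per-lane byte offsets; a hedged C++ sketch of the arithmetic, with the 4000-byte stride taken from the thread:)

#include <cstdint>

// zmm22 holds the indexes 8..15 and zmm2 broadcasts the stride (4000),
// so the multiply produces per-lane byte offsets into array b.
void row_offsets(std::uint64_t out[8]) {
  const std::uint64_t stride = 4000;  // stride of array b, per the thread
  for (std::uint64_t lane = 0; lane < 8; ++lane)
    out[lane] = (8 + lane) * stride;  // index * stride, as the vpmuludq does
}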
2016 May 06
3
Unnecessary spill/fill issue
Hi, I am using MCJIT in LLVM 3.6 to JIT kernels for x86 AVX2. I've noticed some inefficient use of the stack around constant vectors. In one example, I have code that computes a series of constant vectors at compile time. Each vector has a single use. In the final asm, I see a series of spills at the top of the function storing all the constant vectors immediately to the stack; then each use references
2018 Jul 23
3
[LoopVectorizer] Improving the performance of dot product reduction loop
Hello all,
This code https://godbolt.org/g/tTyxpf is a dot product reduction loop
multiplying sign-extended 16-bit values to produce a 32-bit accumulated
result. The x86 backend is currently not able to optimize it as well as gcc
and icc. The IR we are getting from the loop vectorizer has several v8i32
adds and muls inside the loop. These are fed by v8i16 loads and sexts from
v8i16 to v8i32. The
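(The source shape behind such a loop is roughly the following; a hedged C++ sketch matching the description of sign-extended 16-bit inputs and a 32-bit accumulator:)

#include <cstddef>

// Multiply sign-extended 16-bit values and accumulate in 32 bits; this
// is the shape one would hope the backend can lower to vpmaddwd-style
// multiply-add instructions.
int dot(const short *a, const short *b, std::size_t n) {
  int sum = 0;
  for (std::size_t i = 0; i < n; ++i)
    sum += static_cast<int>(a[i]) * static_cast<int>(b[i]);
  return sum;
}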
2013 Jul 19
0
[LLVMdev] llvm.x86.sse2.sqrt.pd not using sqrtpd, calling a function that modifies ECX
(Changing subject line as diagnosis has changed)
I'm attaching the compiled code that I've been getting, both with
CodeGenOpt::Default and CodeGenOpt::None . The crash isn't occurring
with CodeGenOpt::None, but that seems to be because ECX isn't being used
- it still gets set to 0x7fffffff by one of the calls to 76719BA1
I notice that X86::SQRTPD[m|r] appear in
2018 Aug 06
2
[PATCH] D50328: [X86][SSE] Combine (some) target shuffles with multiple uses
[NOTE: Removed Phab and reviewers]
> ================
> Comment at: test/CodeGen/X86/2012-01-12-extract-sv.ll:12
> +; CHECK-NEXT: vblendps {{.*#+}} xmm1 = xmm1[0],xmm2[1,2,3]
> +; CHECK-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,0,0,0]
> ; CHECK-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
> ----------------
> greened wrote:
>> Can we make this test less brittle by
2004 Aug 06
2
[PATCH] Make SSE Run Time option. Add Win32 SSE code
All,
Attached is a patch that does two things. First, it makes the use of the current SSE code a run-time option through the use of speex_decoder_ctl() and speex_encoder_ctl(). It does this twofold. First, there is a modification to the configure.in script which introduces a check based upon platform. It will compile in the SSE assembly if you are on an i?86-based platform by making a
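(A run-time toggle of this sort typically gates the SSE path behind a CPU check plus a user flag; a hedged C++ sketch of the dispatch pattern, not the actual Speex patch; all function names here are illustrative:)

// Choose between the SSE and plain-C inner products at run time.
// sse_supported() would wrap a CPUID check; use_sse is the flag the
// encoder/decoder ctl() request would toggle.
extern float inner_prod_sse(const float *a, const float *b, int len);
extern float inner_prod_c(const float *a, const float *b, int len);
extern bool sse_supported();

static bool use_sse = false;

float inner_prod(const float *a, const float *b, int len) {
  if (use_sse && sse_supported())
    return inner_prod_sse(a, b, len);
  return inner_prod_c(a, b, len);
}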