2019 Sep 02
2
AVX2 codegen - question reg. FMA generation
On Mon, 2 Sep 2019 at 16:59, Roman Lebedev <lebedev.ri at gmail.com> wrote:
>
> It appears you need 'reassoc' on fmul/fadd:
> https://godbolt.org/z/nuTzx2
Thanks very much, that was it. Either that or providing
-enable-unsafe-fp-math to llc yielded FMAs. I didn't expect this since
using FMAs here instead of mul/add appears to be safer (the reverse is
unsafe).
~ Uday
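A minimal C-level sketch of the same contraction question (not code from the thread; the flags shown are standard clang options):

/* mad.c -- one fused-multiply-add candidate. Compiled with, e.g.,
 *   clang -O3 -mfma -ffp-contract=fast -S mad.c
 * this becomes a single vfmadd213sd; with contraction disabled it stays
 * vmulsd + vaddsd. As noted above, the fused form is the *more* accurate
 * one (one rounding instead of two); it is gated behind flags because it
 * changes bit-exact results, not because it is numerically unsafe. */
double mad(double a, double b, double c) {
  return a * b + c;
}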
2013 Dec 12
0
[LLVMdev] AVX code gen
It probably does not pick the right processor architecture.
You could try "clang -mavx" or "clang -march=corei7-avx" for Ivy Bridge, and "clang -march=core-avx2" or "clang -mavx2" for Haswell.
$ clang -march=core-avx2 -O3 -S -o - test.c
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_f
	.align	4, 0x90
_f:                                     ## @f
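The poster's test.c isn't shown; a hypothetical stand-in that exercises the loop vectorizer the same way:

/* test.c (hypothetical, not the original file). With
 *   clang -march=core-avx2 -O3 -S test.c
 * the loop below compiles to packed vmulps on %ymm registers. */
void f(float *restrict a, const float *restrict b, int n) {
  for (int i = 0; i < n; ++i)
    a[i] = b[i] * 2.0f;
}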
2013 Dec 11
2
[LLVMdev] AVX code gen
Hello -
I found this post on the llvm blog: http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that clang/llvm are capable of generating AVX with packed instructions as well as utilizing the full width of the YMM registers… I have an environment where icc generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I cannot get clang/llvm to generate such
2020 Sep 01
2
Vector evolution?
Hi,
Please consider the following loop:
using v4f32 = float __attribute__((__vector_size__(16)));
void fct6(v4f32 *x)
{
#pragma clang loop vectorize(enable)
  for (int i = 0; i < 256; ++i)
    x[i] = 7 * x[i];
}
After compiling it with:
clang++ -O3 -march=native -mtune=native \
  -Rpass=loop-vectorize,slp-vectorize \
  -Rpass-missed=loop-vectorize,slp-vectorize
2012 May 24
4
[LLVMdev] use AVX automatically if present
I wonder why AVX is not used automatically if available on the host
machine. In contrast to that, SSE41 instructions (like pmulld) are
automatically used if the host machine supports SSE41.
E.g.
$ cat avx.ll
define void @_fun1(<8 x float>*, <8 x float>*) {
_L1:
  %x = load <8 x float>* %0
  %y = load <8 x float>* %1
  %z = fadd <8 x float> %x, %y
  store <8 x float> %z, <8 x float>* %0  ; completion assumed -- the excerpt cuts off at "store"
  ret void
}
2012 May 24
0
[LLVMdev] use AVX automatically if present
On Thu, 24 May 2012, Pan, Wei wrote:
> Very likely AVX is not enabled in your llc. This feature was enabled
> just recently (in late April).
I forgot to mention that I am using a recent LLVM 3.1, and in principle my
llc knows about AVX, as I have shown in the second example. But AVX does
not seem to be used by default.
On Thu, 24 May 2012, Henning Thielemann wrote:
> $ llc -o - -mattr
2012 Jan 10
0
[LLVMdev] Calling conventions for YMM registers on AVX
This is the wrong code:
declare <16 x float> @foo(<16 x float>)
define <16 x float> @test(<16 x float> %x, <16 x float> %y) nounwind {
entry:
  %x1 = fadd <16 x float> %x, %y
  %call = call <16 x float> @foo(<16 x float> %x1) nounwind
  %y1 = fsub <16 x float> %call, %y
  ret <16 x float> %y1
}
./llc -mattr=+avx
2012 Jan 09
3
[LLVMdev] Calling conventions for YMM registers on AVX
On Jan 9, 2012, at 10:00 AM, Jakob Stoklund Olesen wrote:
>
> On Jan 8, 2012, at 11:18 PM, Demikhovsky, Elena wrote:
>
>> I'll explain what we see in the code.
>> 1. The caller saves XMM registers across the call if needed (according to DEFS definition).
>> YMMs are not in the set, so the caller does not save them.
>
> This is not how the register allocator
2012 Mar 01
3
[LLVMdev] Stack alignment on X86 AVX seems incorrect
Hi Elena,
You're correct. LLVM does not align the stack to 32 bytes for AVX, and
unaligned moves should be used for YMM spills.
I wrote some code to align the stack to 32 bytes when AVX spills are
present; it does break the x86-64 ABI though. If upstream would be
interested in this code, I can arrange with my employer to send a patch to
the mailing list.
-Cameron
On Mar 1, 2012, at 4:09 PM,
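A sketch of the spill situation under discussion (illustrative; the helper names are invented, not from the thread):

#include <immintrin.h>

void use(float *);                 /* external call; the SysV ABI makes all
                                    * vector registers caller-saved */

float spill_demo(const float *p) {
  __m256 v = _mm256_loadu_ps(p);   /* 32-byte value living in a %ymm reg */
  float tmp[8];
  use(tmp);                        /* v is live across the call -> spilled */
  __m256 w = _mm256_loadu_ps(p + 8);
  return _mm256_cvtss_f32(_mm256_add_ps(v, w));
  /* The spill slot sits in a frame the x86-64 ABI aligns to only 16 bytes,
   * so the spill must use vmovups (any alignment) rather than vmovaps
   * (faults unless the slot is 32-byte aligned) -- unless the frame is
   * realigned to 32 bytes as in the patch described above. */
}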
2013 Dec 20
2
[LLVMdev] Commutability of X86 FMA3 instructions.
Hi all,
The 213 variant of the FMA3 instructions is currently marked
commutable (see X86InstrFMA.td). Is that safe? According to the ISA
the FMA3 instructions aren't commutable for non-numeric results, so
I'd have thought commuting this would only be valid in fast-math mode?
For the curious, the reason that I'm asking is that we currently
always select the 213 variant, but this
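For reference, the operand orders at issue, expressed in C (a sketch derived from the ISA naming scheme; Intel-syntax operand numbering):

#include <math.h>

/* vfmadd213sd op1, op2, op3 computes op1 = (op2 * op1) + op3;
 * vfmadd231sd op1, op2, op3 computes op1 = (op2 * op3) + op1.
 * Commuting the two multiplicands never changes a numeric result, but
 * with NaN inputs it can change which operand's payload propagates --
 * the "non-numeric results" caveat in the question above. */
static double fma213(double op1, double op2, double op3) {
  return fma(op2, op1, op3);
}
static double fma231(double op1, double op2, double op3) {
  return fma(op2, op3, op1);
}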
2013 Sep 05
1
[LLVMdev] AVX calling convention?
I am tracking down an x86-64 code generation problem that has to do with AVX instructions. The symptom is: a function is called, and the upper half of the function argument (which is short16) is zero. This happens only when I compile code with pocl, but not when I use clang and/or llc manually.
I tracked this down to the following. The call site looks like
vmovdqa 24064(%rsp), %ymm0
vmovdqa
2013 Dec 20
2
[LLVMdev] Commutability of X86 FMA3 instructions.
Hi Kay,
My patch will partially address your bug. For now I'm just looking to
switch the default FMA from vfmadd213xx to vfmadd231xx. That will
cause the code in PR17229 to compile as desired, but would regress
code like:
double foo(double a, double b, double c) {
  return a * b + c;
}
which will now require a vmovaps + vfmadd231.
If this impacts real benchmarks we could add an
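Why the extra move appears, sketched for foo above (AT&T syntax; register assignments illustrative: a in %xmm0, b in %xmm1, c in %xmm2, result returned in %xmm0):

/* With the 213 form the result lands directly in the return register:
 *   vfmadd213sd %xmm2, %xmm1, %xmm0  # xmm0 = (xmm1 * xmm0) + xmm2 = a*b + c
 * With the 231 form the destination must hold the addend, so a copy appears:
 *   vfmadd231sd %xmm1, %xmm0, %xmm2  # xmm2 = (xmm0 * xmm1) + xmm2 = a*b + c
 *   vmovapd     %xmm2, %xmm0         # move the result into the return reg */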
2017 Feb 18
2
Vector trunc code generation difference between llvm-3.9 and 4.0
Thanks Sanjay. Interestingly for me, -disable-llvm-optzns did not make a
difference in the way the shift was handled. Does the initial IR generated
for you show this difference when the option is passed?
Best regards
Saurabh
On 17 February 2017 at 19:03, Sanjay Patel <spatel at rotateright.com> wrote:
> I think this is caused by a front-end change (cc'ing clang-dev) because
>
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
Illustrative Example:
clang -fveclib=SVML -O3 svml.c -mavx
#include <math.h>

void foo(double *a, int N) {
  int i;
#pragma clang loop vectorize_width(8)
  for (i = 0; i < N; i++) {
    a[i] = sin(i);
  }
}
Currently, this results in a call to <8 x double> __svml_sin8(<8 x double>) after the vectorizer.
This is the 8-element SVML sin() called with an 8-element argument. On the surface,
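Spelled out in C terms (illustrative; __svml_sin4 is SVML's 4-element variant, assumed available), legalizing by splitting would amount to:

typedef double v4f64 __attribute__((__vector_size__(32)));

v4f64 __svml_sin4(v4f64);            /* 256-bit SVML entry point */

/* An <8 x double> sin is not legal on an AVX target (256-bit vectors),
 * so one possible legalization splits it into two legal 4-element calls: */
void sin8_via_two_sin4(v4f64 *lo, v4f64 *hi) {
  *lo = __svml_sin4(*lo);            /* low half  */
  *hi = __svml_sin4(*hi);            /* high half */
}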
2017 Mar 08
2
Vector trunc code generation difference between llvm-3.9 and 4.0
The regression for the reported case should be avoided after:
https://reviews.llvm.org/rL297232
https://reviews.llvm.org/rL297242
https://reviews.llvm.org/rL297280
It would still be good to understand if the clang change was intentional or
if that was a side effect that can be limited.
On Sat, Feb 18, 2017 at 9:11 AM, Sanjay Patel <spatel at rotateright.com>
wrote:
> Yes, there is an
2016 Jun 29
2
avx512 JIT backend generates wrong code on <4 x float>
Hi!
When compiling the attached module with the JIT engine on an Intel KNL I
see wrong code getting emitted. I attach a complete exploit program
which shows the bug in LLVM 3.8. It loads and JIT compiles the module
and prints the assembler. I stumbled on this since the result of an
actual calculation was wrong. So it's not only the text version of the
assembler but also the machine
2013 Dec 23
2
[LLVMdev] Commutability of X86 FMA3 instructions.
Hi Elena,
Thank you very much for looking into that.
I'll go ahead and remove the isCommutable flag from those
instructions, since it sounds like that's the right thing to do. I
would still like to change the default from the 213 variant to 231
too, as this will reduce code size for accumulator-style loops. I have
at least one benchmark that shows significant speedups when this
change
2013 Dec 20
0
[LLVMdev] Commutability of X86 FMA3 instructions.
Hi Lang,
Unfortunately, I don't have an answer on the commutability question, but I
wanted to let you know that I filed a bug on this:
http://llvm.org/bugs/show_bug.cgi?id=17229
This also shows a memory operand variant of the fma that you may want to
consider in your patch and testcases.
Thanks!
On Thu, Dec 19, 2013 at 10:45 PM, Lang Hames <lhames at gmail.com> wrote:
> Hi all,
2018 Jun 29
2
[RFC][VECLIB] how should we legalize VECLIB calls?
Ashutosh,
Thanks for the reply.
A related earlier topic appears in the review of the SVML patch (@mmasten). Adding a few names from there.
https://reviews.llvm.org/D19544
There, I see Hal's review comment "let's start only with the directly-legal calls". Apparently, what we have right now
in the trunk is "not legal enough". I'll work on the patch to stop
2013 Jul 10
4
[LLVMdev] unaligned AVX store gets split into two instructions
I'm seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector loads
on AVX.
3.3 splits up an unaligned vector load, but 3.2 emitted it as a single
instruction (details below).
In a matrix-matrix inner-kernel, I see a ~25% decrease in performance,
which seems to be due to this.
Any ideas why this changed? Thanks!
Zach
LLVM Code:
define <4 x double> @vstore(<4 x