thr3ads.net - llvm dev - [llvm-dev] AVX2 codegen - question reg. FMA generation [Sep 2019]

Uday Kumar Reddy B via llvm-dev

2019-Sep-02 11:20 UTC

[llvm-dev] AVX2 codegen - question reg. FMA generation

Hello,

On the appended, reasonably simple test case, which has an fmul/fadd
sequence on <8 x float> vector types, I don't see the x86-64 code
generator (with the CPU set to haswell or later) turning it into
AVX2 FMA instructions. Here's the snippet of the output it generates:

$ llc -O3 -mcpu=skylake

---------------------
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
vbroadcastss (%rsi,%rdx,4), %ymm0
vmulps (%rdi,%rcx), %ymm0, %ymm0
vaddps (%rax,%rcx), %ymm0, %ymm0
vmovups %ymm0, (%rax,%rcx)
incq %rdx
addq $32, %rcx
cmpq $15, %rdx
jle .LBB0_2
-----------------------

$ llc --version
LLVM (http://llvm.org/):
  LLVM version 8.0.0
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: skylake
(llvm commit 198009ae8db11d7c0b0517f17358870dc486fcfb from Aug 31)

Using opt -O3 followed by llc leads to the same vmulps / vaddps
sequence. (Adding -mattr=fma doesn't help, although I assume it
isn't needed given the CPU type.) The result is the same even with
-mcpu=haswell.

This is a common pattern in a reduction with two operands on
the RHS. The three operands in play here are (%rax,%rcx), (%rdi,%rcx),
and %ymm0. If another register were used to hold a loaded value, the
vfmadd instruction could be used in multiple ways. I suspect I'm
missing something, which is why I'm not already posting this on
llvm-bugs. Is this expected behavior?

-------------------------------------------------------------------------------------------
; ModuleID = 'LLVMDialectModule'
source_filename = "LLVMDialectModule"

declare i8* @malloc(i64)

declare void @free(i8*)

define <8 x float>* @fma(<8 x float>* %0, float* %1, <8 x float>* %2) {
  br label %4

4:                                                ; preds = %7, %3
  %5 = phi i64 [ %19, %7 ], [ 0, %3 ]
  %6 = icmp slt i64 %5, 16
  br i1 %6, label %7, label %20

7:                                                ; preds = %4
  %8 = getelementptr <8 x float>, <8 x float>* %0, i64 %5
  %9 = load <8 x float>, <8 x float>* %8, align 16
  %10 = getelementptr float, float* %1, i64 %5
  %11 = load float, float* %10, align 16
  %12 = getelementptr <8 x float>, <8 x float>* %2, i64 %5
  %13 = load <8 x float>, <8 x float>* %12, align 16
  %14 = insertelement <8 x float> undef, float %11, i32 0
  %15 = shufflevector <8 x float> %14, <8 x float> undef, <8 x i32> zeroinitializer
  %16 = fmul <8 x float> %15, %9
  %17 = fadd <8 x float> %16, %13
  %18 = getelementptr <8 x float>, <8 x float>* %2, i64 %5
  store <8 x float> %17, <8 x float>* %18, align 16
  %19 = add i64 %5, 1
  br label %4

20:                                               ; preds = %4
  ret <8 x float>* %2
}
-------------------------------------------------------------------------------------------------------
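For readers who find the numbered IR hard to follow, here is a rough scalar C rendering of what the loop computes. It is only a sketch: the function name, parameter names, and the ROWS/LANES constants are made up for illustration and do not come from the original test case. Each of the 16 iterations multiplies one <8 x float> row by a broadcast scalar and adds the result into the third buffer, which is the fmul/fadd pair in question. Whether a C compiler marks such an expression as fusable depends on its floating-point contraction setting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants mirroring the IR: 16 iterations over <8 x float>. */
enum { ROWS = 16, LANES = 8 };

/* acc[i][lane] = scale[i] * in[i][lane] + acc[i][lane]
 * Parameter order follows the IR arguments %0, %1, %2. */
float *scaled_accumulate(const float *in, const float *scale, float *acc)
{
    assert(in != NULL && scale != NULL && acc != NULL);
    for (size_t i = 0; i < ROWS; ++i) {
        const float s = scale[i];   /* the scalar that the IR broadcasts */
        for (size_t lane = 0; lane < LANES; ++lane) {
            const size_t idx = i * LANES + lane;
            /* one multiply followed by one add: the fmul/fadd pair */
            acc[idx] = s * in[idx] + acc[idx];
        }
    }
    return acc;   /* like the IR, return the third argument */
}
```

Calling it with three buffers of 128, 16, and 128 floats updates the third buffer in place.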

Roman Lebedev via llvm-dev

2019-Sep-02 11:29 UTC

It appears you need 'reassoc' on fmul/fadd:
https://godbolt.org/z/nuTzx2

Roman

Sanjay Patel via llvm-dev

2019-Sep-02 12:41 UTC

Fusing of the fadd and fmul is not allowed by default.
http://llvm.org/docs/LangRef.html#floating-point-environment

'contract' on the fadd (and an FMA-capable target) are the minimum
requirements; 'reassoc' will also work, but that may enable other
(possibly unintended) transforms.
https://godbolt.org/z/-k6G2h

define float @fma(float %x, float %y, float %z) {
   %m = fmul float %x, %y
   %a = fadd contract float %m, %z
   ret float %a
}
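
To illustrate why the fusion needs explicit permission: a fused multiply-add rounds once, while a separate fmul and fadd round twice, so the two forms can produce different results for the same inputs. The C sketch below picks inputs where the difference shows up. All function names are hypothetical, and the single-rounding path is emulated in double rather than using a hardware FMA, which is an assumption that holds for these particular inputs only.

```c
#include <assert.h>
#include <float.h>

/* Two roundings, as a separate fmul and fadd would do.
 * The volatile store forces the product to be rounded to float first. */
float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* One rounding, as a fused multiply-add does. Emulated in double here
 * (an assumption of this sketch): a float*float product is exact in
 * double, and for the inputs used below the sum is exact as well.
 * This is not a general-purpose replacement for a real fma. */
float fused_emulated(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}

/* a*a = 1 + 2^-22 + 2^-46; rounding the product to float drops 2^-46. */
void check_fusion_changes_result(void)
{
    const float a = 1.0f + FLT_EPSILON;            /* 1 + 2^-23    */
    const float c = -(1.0f + 2.0f * FLT_EPSILON);  /* -(1 + 2^-22) */
    assert(mul_then_add(a, a, c) == 0.0f);
    assert(fused_emulated(a, a, c) == FLT_EPSILON * FLT_EPSILON);
}
```

The fused form is the more accurate one here, but it is still a different answer, and that is why the fadd has to carry a flag such as 'contract' before the two instructions may be merged.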


Uday Kumar Reddy B via llvm-dev

2019-Sep-02 12:49 UTC

On Mon, 2 Sep 2019 at 16:59, Roman Lebedev <lebedev.ri at gmail.com> wrote:
>
> It appears you need 'reassoc' on fmul/fadd:
> https://godbolt.org/z/nuTzx2

Thanks very much, that was it. Either that or passing
-enable-unsafe-fp-math to llc yielded FMAs. I didn't expect this, since
using FMAs here instead of mul/add appears to be the safer direction (the
reverse is unsafe).

~ Uday


-- 
Founder and Director, PolyMage Labs
