Uday Kumar Reddy B via llvm-dev
2019-Sep-02 11:20 UTC
[llvm-dev] AVX2 codegen - question reg. FMA generation
Hello,

On the appended, reasonably simple test case that has an fmul/fadd sequence on <8 x float> vector types, I don't see the x86-64 code generator (with cpu set to haswell or later types) turning it into AVX2 FMA instructions. Here's the snippet in the output it generates:

$ llc -O3 -mcpu=skylake

---------------------
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vbroadcastss    (%rsi,%rdx,4), %ymm0
        vmulps  (%rdi,%rcx), %ymm0, %ymm0
        vaddps  (%rax,%rcx), %ymm0, %ymm0
        vmovups %ymm0, (%rax,%rcx)
        incq    %rdx
        addq    $32, %rcx
        cmpq    $15, %rdx
        jle     .LBB0_2
-----------------------

$ llc --version
LLVM (http://llvm.org/):
  LLVM version 8.0.0
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: skylake
(llvm commit 198009ae8db11d7c0b0517f17358870dc486fcfb from Aug 31)

Using opt -O3 followed by llc leads to the same vmulps / vaddps sequence. (Adding -mattr=fma doesn't help, although I assume this isn't needed given the cpu type.) The result is the same even with -mcpu=haswell.

This is a common pattern involved in a reduction with two operands on the RHS. The three operands in play here are (%rax,%rcx), (%rdi,%rcx), and %ymm0. If another register is used to hold a loaded value, the vfmadd instruction could be used in multiple ways. I suspect I'm missing something, which is why I'm not already posting this on llvm-bugs. Is this expected behavior?
-------------------------------------------------------------------------------------------
; ModuleID = 'LLVMDialectModule'
source_filename = "LLVMDialectModule"

declare i8* @malloc(i64)

declare void @free(i8*)

define <8 x float>* @fma(<8 x float>* %0, float* %1, <8 x float>* %2) {
  br label %4

4:                                                ; preds = %7, %3
  %5 = phi i64 [ %19, %7 ], [ 0, %3 ]
  %6 = icmp slt i64 %5, 16
  br i1 %6, label %7, label %20

7:                                                ; preds = %4
  %8 = getelementptr <8 x float>, <8 x float>* %0, i64 %5
  %9 = load <8 x float>, <8 x float>* %8, align 16
  %10 = getelementptr float, float* %1, i64 %5
  %11 = load float, float* %10, align 16
  %12 = getelementptr <8 x float>, <8 x float>* %2, i64 %5
  %13 = load <8 x float>, <8 x float>* %12, align 16
  %14 = insertelement <8 x float> undef, float %11, i32 0
  %15 = shufflevector <8 x float> %14, <8 x float> undef, <8 x i32> zeroinitializer
  %16 = fmul <8 x float> %15, %9
  %17 = fadd <8 x float> %16, %13
  %18 = getelementptr <8 x float>, <8 x float>* %2, i64 %5
  store <8 x float> %17, <8 x float>* %18, align 16
  %19 = add i64 %5, 1
  br label %4

20:                                               ; preds = %4
  ret <8 x float>* %2
}
-------------------------------------------------------------------------------------------
Roman Lebedev via llvm-dev
2019-Sep-02 11:29 UTC
[llvm-dev] AVX2 codegen - question reg. FMA generation
It appears you need 'reassoc' on fmul/fadd:
https://godbolt.org/z/nuTzx2

Roman
Sanjay Patel via llvm-dev
2019-Sep-02 12:41 UTC
[llvm-dev] AVX2 codegen - question reg. FMA generation
Fusing of the fadd and fmul is not allowed by default:
http://llvm.org/docs/LangRef.html#floating-point-environment

'contract' on the fadd (and an fma-capable target) are the minimum requirements; 'reassoc' will also work, but that may enable other (possibly unintended) transforms.
https://godbolt.org/z/-k6G2h

define float @fma(float %x, float %y, float %z) {
  %m = fmul float %x, %y
  %a = fadd contract float %m, %z
  ret float %a
}
Uday Kumar Reddy B via llvm-dev
2019-Sep-02 12:49 UTC
[llvm-dev] AVX2 codegen - question reg. FMA generation
On Mon, 2 Sep 2019 at 16:59, Roman Lebedev <lebedev.ri at gmail.com> wrote:
> It appears you need 'reassoc' on fmul/fadd:
> https://godbolt.org/z/nuTzx2

Thanks very much, that was it. Either that or providing -enable-unsafe-fp-math to llc yielded FMAs. I didn't expect this since using FMAs here instead of mul/add appears to be safer (the reverse is unsafe).

~ Uday

--
Founder and Director, PolyMage Labs
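
One way to see why fusion is gated behind a flag, even though the fused result is usually the more accurate one: an FMA rounds once, while an fmul followed by an fadd rounds twice, so the two sequences can return different bits for the same inputs. The C sketch below is an illustration only and is not code from the thread. The function names and input values are made up for the example, and the fused result is emulated by forming the product in double (which is exact for float inputs) instead of calling a hardware FMA. For the inputs discussed afterwards the double arithmetic is exact, so the emulation agrees with what a single-rounded FMA would return; in general the standard C99 routine for this is fmaf from <math.h>.

```c
#include <stdio.h>

/* Separate multiply and add: the product is rounded to float before the add.
   The volatile store forces that rounding and stops the compiler from
   contracting the expression into an FMA by itself. */
float mul_then_add(float a, float b, float c) {
    volatile float product = a * b;   /* first rounding */
    return product + c;               /* second rounding */
}

/* Emulated fused multiply-add (an assumption made for this sketch): the
   product of two floats is exact in double, so it reaches the add unrounded,
   and the only rounding to float happens at the final conversion. */
float fused_via_double(float a, float b, float c) {
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}

/* Prints both results in hex-float notation; returns 1 if they differ. */
int report_difference(float a, float b, float c) {
    float separate = mul_then_add(a, b, c);
    float fused = fused_via_double(a, b, c);
    printf("mul+add = %a, fused = %a\n", separate, fused);
    return separate != fused;
}
```

With a = 1 + 2^-13, b = 1 - 2^-13 and c = -1 (that is, report_difference(1.0f + 0x1p-13f, 1.0f - 0x1p-13f, -1.0f)), the exact product is 1 - 2^-26, which rounds to 1.0 in float. The separate mul/add therefore returns 0, while the fused form returns -2^-26. The fused answer is closer to the true value, but it is a different answer, which is why the IR has to carry a fast-math flag such as 'contract' before the backend may fuse.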