Yichao Yu via llvm-dev
2017-Jun-10 03:04 UTC
[llvm-dev] Fusing contract fadd/fsub with normal fmul
Hi,

On LLVM 5.0 (current trunk), fadd/fsub and fmul that are both marked with `contract` or `fast` can be merged into an fma instruction by the backend.

I'm wondering about the exact semantics of this new flag as well as `fast`. In particular, would it be valid to do this when only the `fadd`/`fsub` (and not the `fmul`) is marked with `contract`, or at least `fast`? The reasoning is that doing this has a similar effect to performing the `fadd`/`fsub` not to IEEE spec, so a single flag on this instruction should be enough for the transformation.

The particular case I'm interested in is a vectorized loop with a reduction, like the pseudo C code `s += a[i] * b[i]`. Our front end will recognize this and mark the `+` as `fast` to enable vectorization. It would be great if this could enable the reduction to be done with `fma` instructions.

Yichao Yu
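To make the question concrete: fusing changes the numeric result because an fma rounds once, while a separate fmul and fadd round twice. The following is a minimal sketch in Python, not LLVM code. `fused_multiply_add` is a hypothetical helper that emulates the single rounding with exact rational arithmetic (finite inputs assumed); it is not a hardware fma.

```python
from fractions import Fraction

def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated fma (assumption: finite inputs): compute a*b + c exactly,
    then round to double once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

# The exact product is 1 - 2**-54, which rounds to 1.0 as a double.
print(unfused_multiply_add(a, b, c))  # 0.0
print(fused_multiply_add(a, b, c))    # -5.551115123125783e-17, i.e. -2**-54
```

The two results differ only because the intermediate product was or was not rounded, which is the behaviour the `contract` flag is meant to license.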
Hal Finkel via llvm-dev
2017-Jun-12 13:36 UTC
[llvm-dev] Fusing contract fadd/fsub with normal fmul
It seems like the contract flag is underspecified in this regard. I'd lean, however, toward requiring it on both instructions in order to contract them. That way inlining a function where contraction was prohibited into a function where contraction was permitted would not be able to effectively remove the final-result rounding from the callee.

-Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
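The two candidate rules can be written down as a toy model. This is a sketch only: `FPOp`, `may_fuse_both` and `may_fuse_fadd_only` are hypothetical names for illustration and do not correspond to any LLVM API. It treats `fast` as also permitting contraction, as the first message in this thread describes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FPOp:
    """Toy stand-in for an IR instruction: an opcode plus fast-math flags."""
    opcode: str
    flags: frozenset = frozenset()

def allows_contract(op: FPOp) -> bool:
    # Either `contract` or `fast` permits contraction in this model.
    return "contract" in op.flags or "fast" in op.flags

def may_fuse_both(fmul: FPOp, fadd: FPOp) -> bool:
    """Conservative rule: both instructions must permit contraction."""
    return allows_contract(fmul) and allows_contract(fadd)

def may_fuse_fadd_only(fmul: FPOp, fadd: FPOp) -> bool:
    """Rule asked about: the flag on the fadd/fsub alone is enough."""
    return allows_contract(fadd)

# An fmul from a callee where contraction was prohibited, inlined next to
# an fadd in a caller where contraction is permitted.
callee_mul = FPOp("fmul")
caller_add = FPOp("fadd", frozenset({"contract"}))

print(may_fuse_both(callee_mul, caller_add))       # False
print(may_fuse_fadd_only(callee_mul, caller_add))  # True
```

Under the fadd-only rule the callee's product would lose its rounding after inlining; under the both-flags rule it keeps it.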
Sanjay Patel via llvm-dev
2017-Jun-12 13:40 UTC
[llvm-dev] Fusing contract fadd/fsub with normal fmul
For reference, the FMF 'contract' patches are listed here:
https://bugs.llvm.org/show_bug.cgi?id=25721#c6

If we can make the documentation better, that would certainly be a welcome patch.

It would be better to see the IR for your example(s), but I think you'd need 'contract' on both the fmul and fadd to generate an FMA. Conservatively, we wouldn't alter the result if either component somehow required strict FP. To vectorize, you probably need 'fast' on both ops because vectorization would be changing the order of operations (reassociation).
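The reassociation point can be illustrated outside of LLVM. The sketch below, in Python, compares a strictly ordered reduction with one that keeps a partial sum per lane, the way a vectorized loop would. The function names and the two-lane width are assumptions chosen for illustration.

```python
def dot_sequential(a, b):
    """Scalar loop: s += a[i] * b[i], strictly in index order."""
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s

def dot_lanes(a, b, lanes=2):
    """Vector-style loop: one partial sum per lane, combined at the end."""
    partial = [0.0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % lanes] += x * y
    total = 0.0
    for p in partial:
        total += p
    return total

a = [1e16, 1.0, -1e16, 1.0]
b = [1.0, 1.0, 1.0, 1.0]

# In index order, 1e16 + 1.0 rounds back to 1e16, so one of the 1.0 terms is lost.
print(dot_sequential(a, b))  # 1.0
# Per lane, the large terms cancel first and both 1.0 terms survive.
print(dot_lanes(a, b))       # 2.0
```

The additions are the same in both versions; only their order differs, which is why the reduction needs a flag that permits reassociation.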
Yichao Yu via llvm-dev
2017-Jun-12 18:22 UTC
[llvm-dev] Fusing contract fadd/fsub with normal fmul
On Mon, Jun 12, 2017 at 9:40 AM, Sanjay Patel <spatel at rotateright.com> wrote:
> It would be better to see the IR for your example(s), but I think you'd need
> 'contract' on both the fmul and fadd to generate an FMA.

The IR of the scalar loop is

```
if13:                                             ; preds = %scalar.ph, %if13
  %s.124 = phi double [ %51, %if13 ], [ %bc.merge.rdx, %scalar.ph ]
  %"i#672.023" = phi i64 [ %52, %if13 ], [ %bc.resume.val, %scalar.ph ]
  %46 = getelementptr double, double* %13, i64 %"i#672.023"
  %47 = load double, double* %46, align 8
  %48 = getelementptr double, double* %15, i64 %"i#672.023"
  %49 = load double, double* %48, align 8
  %50 = fmul double %47, %49
  %51 = fadd fast double %s.124, %50
  %52 = add nuw nsw i64 %"i#672.023", 1
  %53 = icmp slt i64 %52, %9
  br i1 %53, label %if13, label %L11.outer.split.L11.outer.split.split_crit_edge.outer.loopexit
```

And it can be vectorized to

```
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <4 x double> [ %19, %vector.ph ], [ %40, %vector.body ]
  %vec.phi94 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %41, %vector.body ]
  %vec.phi95 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %42, %vector.body ]
  %vec.phi96 = phi <4 x double> [ zeroinitializer, %vector.ph ], [ %43, %vector.body ]
  %20 = getelementptr double, double* %13, i64 %index
  %21 = bitcast double* %20 to <4 x double>*
  %wide.load = load <4 x double>, <4 x double>* %21, align 8
  %22 = getelementptr double, double* %20, i64 4
  %23 = bitcast double* %22 to <4 x double>*
  %wide.load100 = load <4 x double>, <4 x double>* %23, align 8
  %24 = getelementptr double, double* %20, i64 8
  %25 = bitcast double* %24 to <4 x double>*
  %wide.load101 = load <4 x double>, <4 x double>* %25, align 8
  %26 = getelementptr double, double* %20, i64 12
  %27 = bitcast double* %26 to <4 x double>*
  %wide.load102 = load <4 x double>, <4 x double>* %27, align 8
  %28 = getelementptr double, double* %15, i64 %index
  %29 = bitcast double* %28 to <4 x double>*
  %wide.load103 = load <4 x double>, <4 x double>* %29, align 8
  %30 = getelementptr double, double* %28, i64 4
  %31 = bitcast double* %30 to <4 x double>*
  %wide.load104 = load <4 x double>, <4 x double>* %31, align 8
  %32 = getelementptr double, double* %28, i64 8
  %33 = bitcast double* %32 to <4 x double>*
  %wide.load105 = load <4 x double>, <4 x double>* %33, align 8
  %34 = getelementptr double, double* %28, i64 12
  %35 = bitcast double* %34 to <4 x double>*
  %wide.load106 = load <4 x double>, <4 x double>* %35, align 8
  %36 = fmul <4 x double> %wide.load, %wide.load103
  %37 = fmul <4 x double> %wide.load100, %wide.load104
  %38 = fmul <4 x double> %wide.load101, %wide.load105
  %39 = fmul <4 x double> %wide.load102, %wide.load106
  %40 = fadd fast <4 x double> %vec.phi, %36
  %41 = fadd fast <4 x double> %vec.phi94, %37
  %42 = fadd fast <4 x double> %vec.phi95, %38
  %43 = fadd fast <4 x double> %vec.phi96, %39
  %index.next = add i64 %index, 16
  %44 = icmp eq i64 %index.next, %n.vec
  br i1 %44, label %middle.block, label %vector.body
```

If contracting a normal mul with a fast add is allowed, both loops can use fma.
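For the scalar loop, the difference between the two lowerings of `s += a[i] * b[i]` can be sketched in Python. This is an illustration under stated assumptions, not what the LLVM backend emits: `fma_emulated` is a hypothetical helper that models a single-rounding fma using exact rational arithmetic, and the inputs are chosen so that the product's rounding matters.

```python
from fractions import Fraction

def fma_emulated(a: float, b: float, c: float) -> float:
    """Model of fma for finite inputs: exact a*b + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def dot_unfused(a, b):
    """fmul followed by fadd: the product is rounded before accumulating."""
    s = 0.0
    for x, y in zip(a, b):
        s = s + x * y
    return s

def dot_fused(a, b):
    """Each multiply and accumulate is done as one fma step."""
    s = 0.0
    for x, y in zip(a, b):
        s = fma_emulated(x, y, s)
    return s

a = [1.0, 1.0 + 2.0**-27]
b = [-1.0, 1.0 - 2.0**-27]

print(dot_unfused(a, b))  # 0.0
print(dot_fused(a, b))    # -5.551115123125783e-17, i.e. -2**-54
```

The fused loop keeps the low-order bits of the second product that the unfused loop rounds away, so the transformation is observable and needs to be permitted by the flags on the instructions involved.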