Craig Topper via llvm-dev
2019-Feb-09 19:05 UTC
[llvm-dev] how experimental are the llvm.experimental.vector.reduce.* functions?
Something like this should work, I think:

; ModuleID = 'test.ll'
source_filename = "test.ll"

define void @entry(<4 x i32>* %a, <4 x i32>* %b, <4 x i32>* %x) {
Entry:
  %tmp = load <4 x i32>, <4 x i32>* %a, align 16
  %tmp1 = load <4 x i32>, <4 x i32>* %b, align 16
  %tmp2 = add <4 x i32> %tmp, %tmp1
  %tmpsign = icmp slt <4 x i32> %tmp, zeroinitializer
  %tmp1sign = icmp slt <4 x i32> %tmp1, zeroinitializer
  %sumsign = icmp slt <4 x i32> %tmp2, zeroinitializer
  %signsequal = icmp eq <4 x i1> %tmpsign, %tmp1sign
  %summismatch = icmp ne <4 x i1> %sumsign, %tmpsign
  %overflow = and <4 x i1> %signsequal, %summismatch
  %tmp5 = bitcast <4 x i1> %overflow to i4
  %tmp6 = icmp ne i4 %tmp5, 0
  br i1 %tmp6, label %OverflowFail, label %OverflowOk

OverflowFail:                                     ; preds = %Entry
  tail call fastcc void @panic()
  unreachable

OverflowOk:                                       ; preds = %Entry
  store <4 x i32> %tmp2, <4 x i32>* %x, align 16
  ret void
}

declare fastcc void @panic()

~Craig

On Sat, Feb 9, 2019 at 10:05 AM Andrew Kelley <andrew at ziglang.org> wrote:

> >> On Sat, Feb 9, 2019 at 1:42 AM Craig Topper via llvm-dev
> >> <llvm-dev at lists.llvm.org> wrote:
> >>
> >> I don't think I understand your pseudocode using
> >> llvm.experimental.vector.reduce.umax. All of the types you showed
> >> are scalar, but that intrinsic doesn't work on scalars, so I'm
> >> having a hard time understanding what you're trying to do with it.
> >> llvm.experimental.vector.reduce.umax takes a vector input and
> >> returns a scalar result. Are you wanting to find out whether any of
> >> the additions overflowed, or to get a mask of which additions
> >> overflowed?
>
> Apologies for the confusion - let me try to clarify. Here is frontend
> code that works now:
>
>     export fn entry() void {
>         var a: @Vector(4, i32) = []i32{ 1, 2, 3, 4 };
>         var b: @Vector(4, i32) = []i32{ 5, 6, 7, 8 };
>         var x = a +% b;
>     }
>
> This generates the following LLVM IR code:
>
> define void @entry() #2 !dbg !41 {
> Entry:
>   %a = alloca <4 x i32>, align 16
>   %b = alloca <4 x i32>, align 16
>   %x = alloca <4 x i32>, align 16
>   store <4 x i32> <i32 1, i32 2, i32 3, i32 4>, <4 x i32>* %a, align 16, !dbg !55
>   call void @llvm.dbg.declare(metadata <4 x i32>* %a, metadata !45, metadata !DIExpression()), !dbg !55
>   store <4 x i32> <i32 5, i32 6, i32 7, i32 8>, <4 x i32>* %b, align 16, !dbg !56
>   call void @llvm.dbg.declare(metadata <4 x i32>* %b, metadata !51, metadata !DIExpression()), !dbg !56
>   %0 = load <4 x i32>, <4 x i32>* %a, align 16, !dbg !57
>   %1 = load <4 x i32>, <4 x i32>* %b, align 16, !dbg !58
>   %2 = add <4 x i32> %0, %1, !dbg !59
>   store <4 x i32> %2, <4 x i32>* %x, align 16, !dbg !60
>   call void @llvm.dbg.declare(metadata <4 x i32>* %x, metadata !53, metadata !DIExpression()), !dbg !60
>   ret void, !dbg !61
> }
>
> However, I used the +% operator, which in Zig is wrapping addition.
> Now I want to implement the + operator for vectors, which Zig defines
> to panic if any of the elements overflowed.
> Here is how the IR could look for this:
>
> define void @entry() #2 !dbg !41 {
> Entry:
>   %a = alloca <4 x i32>, align 16
>   %b = alloca <4 x i32>, align 16
>   %x = alloca <4 x i32>, align 16
>   store <4 x i32> <i32 1, i32 2, i32 3, i32 4>, <4 x i32>* %a, align 16, !dbg !55
>   store <4 x i32> <i32 5, i32 6, i32 7, i32 8>, <4 x i32>* %b, align 16, !dbg !56
>   %0 = load <4 x i32>, <4 x i32>* %a, align 16, !dbg !57
>   %1 = load <4 x i32>, <4 x i32>* %b, align 16, !dbg !58
>   %2 = call { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32> %0, <4 x i32> %1)
>   %3 = extractvalue { <4 x i32>, <4 x i1> } %2, 0, !dbg !56
>   %4 = extractvalue { <4 x i32>, <4 x i1> } %2, 1, !dbg !56
>   %5 = call i1 @llvm.experimental.vector.reduce.umax.i1.v4i1(<4 x i1> %4)
>   br i1 %5, label %OverflowFail, label %OverflowOk, !dbg !56
>
> OverflowFail:                                     ; preds = %Entry
>   tail call fastcc void @panic(%"[]u8"* @2, %StackTrace* null), !dbg !56
>   unreachable, !dbg !56
>
> OverflowOk:                                       ; preds = %Entry
>   store <4 x i32> %3, <4 x i32>* %x, align 16, !dbg !60
>   ret void, !dbg !61
> }
>
> You can see that it depends on @llvm.sadd.with.overflow working on
> vector types, and it relies on @llvm.experimental.vector.reduce.umax.
> I will note that my strategy with sign extension and icmp would be a
> semantically equivalent alternative to @llvm.sadd.with.overflow.
>
> On 2/9/19 12:37 PM, Nikita Popov wrote:
> > On Sat, Feb 9, 2019 at 6:25 PM Simon Pilgrim
> > <llvm-dev at redking.me.uk> wrote:
> >
> > > Regarding the reduction functions - I think the integer intrinsics
> > > at least are relatively stable, and we can probably investigate
> > > dropping the experimental tag before the next release (assuming
> > > someone has the time to take on the work) - it'd be nice to have
> > > the SLP vectorizer emit reduction intrinsics directly for these.
> >
> > The vector reduction intrinsics still need quite a lot of work.
> > Apart from SplitVecOp, all legalizations are currently missing. This
> > is only noticeable on AArch64 right now, because all other targets
> > expand vector reductions prior to codegen.
>
> My follow-up question, then, is this:
>
> What do you recommend, in terms of LLVM IR, in order to obtain the %5
> value above?
>
> Thanks for the help,
> Andrew
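
A note on the sign test in Craig's IR above: in two's complement, a
signed addition overflows exactly when both operands have the same sign
and the sum's sign differs from theirs. A minimal scalar sketch of the
same rule (illustrative only, not from the original thread; the
function name is invented):

    define i1 @sadd_overflows(i32 %a, i32 %b) {
      %sum = add i32 %a, %b
      ; compute the sign bit of each value
      %asign = icmp slt i32 %a, 0
      %bsign = icmp slt i32 %b, 0
      %ssign = icmp slt i32 %sum, 0
      ; overflow iff the operands agree in sign and the sum disagrees
      %same = icmp eq i1 %asign, %bsign
      %flip = icmp ne i1 %ssign, %asign
      %ovf = and i1 %same, %flip
      ret i1 %ovf
    }

Craig's vector version performs this elementwise and then collapses the
<4 x i1> mask into a single branch condition via the bitcast to i4.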
Andrew Kelley via llvm-dev
2019-Feb-09 20:56 UTC
[llvm-dev] how experimental are the llvm.experimental.vector.reduce.* functions?
On 2/9/19 2:05 PM, Craig Topper wrote:
> Something like this should work, I think:
> [...]

Thanks! I was able to get it working with your hint:

> %tmp5 = bitcast <4 x i1> %overflow to i4

(Thanks also to LebedevRI, who pointed this out on IRC.)

Until LLVM 9, when the llvm.*.with.overflow.* intrinsics gain vector
support, here's what I ended up with:

  %a = alloca <4 x i32>, align 16
  %b = alloca <4 x i32>, align 16
  %x = alloca <4 x i32>, align 16
  store <4 x i32> <i32 1, i32 2, i32 3, i32 4>, <4 x i32>* %a, align 16, !dbg !55
  store <4 x i32> <i32 5, i32 6, i32 7, i32 8>, <4 x i32>* %b, align 16, !dbg !56
  %0 = load <4 x i32>, <4 x i32>* %a, align 16, !dbg !57
  %1 = load <4 x i32>, <4 x i32>* %b, align 16, !dbg !58
  %2 = sext <4 x i32> %0 to <4 x i33>, !dbg !59
  %3 = sext <4 x i32> %1 to <4 x i33>, !dbg !59
  %4 = add <4 x i33> %2, %3, !dbg !59
  %5 = trunc <4 x i33> %4 to <4 x i32>, !dbg !59
  %6 = sext <4 x i32> %5 to <4 x i33>, !dbg !59
  %7 = icmp ne <4 x i33> %4, %6, !dbg !59
  %8 = bitcast <4 x i1> %7 to i4, !dbg !59
  %9 = icmp ne i4 %8, 0, !dbg !59
  br i1 %9, label %OverflowFail, label %OverflowOk, !dbg !59

The idea: extend and do the operation with more bits, truncate to get
the result, then re-extend the result and check whether it matches the
pre-truncated value.

This works pretty well unless the vector integer size is as big as or
larger than the native vector register. Here's a quick performance test:

https://gist.github.com/andrewrk/b9734f9c310d8b79ec7271e7c0df4023

Summary: safety-checked integer addition with no optimizations

<4 x i32>:
  scalar = 893 MiB/s
  vector = 3.58 GiB/s

<16 x i128>:
  scalar = 3.6 GiB/s
  vector = 2.5 GiB/s
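
The widen-and-compare trick is easiest to see on a single element:
sign-extending both i32 operands to i33 makes the add exact, and the
sum fits back in i32 precisely when truncating and re-extending
round-trips to the same value. A minimal scalar sketch (illustrative
only, not from the original thread; the function name is invented):

    define i1 @sadd_overflows_via_widening(i32 %a, i32 %b) {
      ; widen so the add cannot wrap: i33 holds any sum of two i32 values
      %wa = sext i32 %a to i33
      %wb = sext i32 %b to i33
      %wide = add i33 %wa, %wb
      ; round-trip through i32; the values differ iff the true sum did not fit
      %narrow = trunc i33 %wide to i32
      %again = sext i32 %narrow to i33
      %ovf = icmp ne i33 %wide, %again
      ret i1 %ovf
    }

Andrew's code above is the <4 x i32> version of this, with the
resulting <4 x i1> mask collapsed via the bitcast-to-i4 hint.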
Craig Topper via llvm-dev
2019-Feb-09 21:08 UTC
[llvm-dev] how experimental are the llvm.experimental.vector.reduce.* functions?
I should mention that the code I provided is equivalent to how the
vector version of sadd.with.overflow is expanded in the backend by
default. Having the vector intrinsic probably won't improve the
generated code over what you can achieve with plain IR, unless the
target has some native way of detecting overflow on vectors, like the
overflow flag on scalars.

On Sat, Feb 9, 2019 at 12:56 PM Andrew Kelley <andrew at ziglang.org> wrote:

> On 2/9/19 2:05 PM, Craig Topper wrote:
> [...]

~Craig
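
For reference, once the llvm.*.with.overflow.* intrinsics accept vector
types (slated for LLVM 9, as discussed above), the whole check could
collapse to something like the sketch below; the mangled intrinsic name
and its availability should be verified against the actual release:

    declare { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32>, <4 x i32>)
    declare fastcc void @panic()

    define void @checked_add(<4 x i32>* %a, <4 x i32>* %b, <4 x i32>* %x) {
    Entry:
      %va = load <4 x i32>, <4 x i32>* %a, align 16
      %vb = load <4 x i32>, <4 x i32>* %b, align 16
      ; one call yields both the wrapped sum and a per-lane overflow mask
      %res = call { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32> %va, <4 x i32> %vb)
      %sum = extractvalue { <4 x i32>, <4 x i1> } %res, 0
      %ovf = extractvalue { <4 x i32>, <4 x i1> } %res, 1
      ; "did any lane overflow?" via the same bitcast trick
      %bits = bitcast <4 x i1> %ovf to i4
      %any = icmp ne i4 %bits, 0
      br i1 %any, label %OverflowFail, label %OverflowOk

    OverflowFail:                                   ; preds = %Entry
      tail call fastcc void @panic()
      unreachable

    OverflowOk:                                     ; preds = %Entry
      store <4 x i32> %sum, <4 x i32>* %x, align 16
      ret void
    }

As Craig notes, the backend expands this to essentially the
sign-comparison sequence anyway unless the target has native vector
overflow detection.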