thr3ads.net - llvm dev - [LLVMdev] supporting SAD in loop vectorizer [Nov 2014]

Hal Finkel

2014-Nov-11 14:54 UTC

[LLVMdev] supporting SAD in loop vectorizer

----- Original Message -----
> From: "James Molloy" <james at jamesmolloy.co.uk>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Dibyendu Das" <Dibyendu.Das at amd.com>, llvmdev at cs.uiuc.edu
> Sent: Tuesday, November 11, 2014 8:21:37 AM
> Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> 
> 
> If you'd like to contribute support for this, look at
> isHorizontalBinOp and go from there. Feel free to ask questions if
> you get stuck.
> 
> 
> 
> FWIW, I've looked at isHorizontalBinOp for inspiration for matching
> AArch64 ADDV-and-friends (horizontal reduction operations), and
> thought it was rather temperamental and noticed it being prone to
> breaking depending on the exact format of the IR. Given that we
> don't have a canonical form for reductions, I think it wrong that we
> expect targets to undo quite complex patterns.
> 
> 
> The reduction pattern is a log2(n) sequence of shuffles and binops,
> which is really rather complex. These sorts of things should, IMHO,
> be intrinsics. I chatted with Arnold about this at the devmtg and
> was going to send a patch to do exactly that in a week or so.
Sounds good. We should try hard to canonicalize into the intrinsic in
InstCombine from the shuffles (in addition to emitting it directly from the
vectorizer), but it is likely easier to do there than in the backend.

 -Hal
> 
> 
> Cheers,
> 
> 
> James
> 
> 
> On 11 November 2014 13:35, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> 
> ----- Original Message -----
> > From: "Dibyendu Das" < Dibyendu.Das at amd.com >
> > To: "Hal Finkel" < hfinkel at anl.gov >, "Renato Golin" < renato.golin at linaro.org >
> > Cc: llvmdev at cs.uiuc.edu
> > Sent: Tuesday, November 4, 2014 12:15:12 PM
> > Subject: RE: [LLVMdev] supporting SAD in loop vectorizer
> > 
> > Here's the simple SAD code:
> > ---------------------------------------------------
> > #include <stdlib.h>
> >
> > extern int ly,lx;
> > int sad_c( unsigned char *pix1, unsigned char *pix2)
> > {
> >     int i_sum = 0;
> >     for( int x = 0; x < lx; x++ )
> >         i_sum += abs( pix1[x] - pix2[x] );
> >     return i_sum;
> > }
> > -----------------------------------------------------
> > 
> > The loop vectorizer does vectorize the loop and then unrolls it
> > twice. The main body of the loop ends up looking like the code below,
> > where the icmp, neg, select pattern appears twice.
> > Are we saying we pattern-match this to PSADBW in the target?
> 
> Yes.
> 
> > That seems
> > to have some challenges
> 
> It does, but we already have code in the backend that matches other
> horizontal operations (see isHorizontalBinOp and its callers in
> lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be
> significantly more complicated.
> 
> > including the fact that we would need a
> > 4-way unroll to use all of 128b PSADBWs. Or am I
> > missing something ?
> 
> No, each unrolling will get its own, so you'll get a PSADBW from each
> time the loop is unrolled. Each unrolling is vectorized in terms of
> <4 x i32>, and that is the 128 bits you need.
> 
> If you'd like to contribute support for this, look at
> isHorizontalBinOp and go from there. Feel free to ask questions if
> you get stuck.
> 
> -Hal
> 
> 
> 
> > 
> > vector.body:                  ; preds = %vector.body.preheader, %vector.body
> >   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
> >   %vec.phi = phi <4 x i32> [ %24, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
> >   %vec.phi9 = phi <4 x i32> [ %25, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
> >   %4 = getelementptr inbounds i8* %pix1, i64 %index
> >   %5 = bitcast i8* %4 to <4 x i8>*
> >   %wide.load = load <4 x i8>* %5, align 1
> >   %.sum19 = or i64 %index, 4
> >   %6 = getelementptr i8* %pix1, i64 %.sum19
> >   %7 = bitcast i8* %6 to <4 x i8>*
> >   %wide.load10 = load <4 x i8>* %7, align 1
> >   %8 = zext <4 x i8> %wide.load to <4 x i32>
> >   %9 = zext <4 x i8> %wide.load10 to <4 x i32>
> >   %10 = getelementptr inbounds i8* %pix2, i64 %index
> >   %11 = bitcast i8* %10 to <4 x i8>*
> >   %wide.load11 = load <4 x i8>* %11, align 1
> >   %.sum20 = or i64 %index, 4
> >   %12 = getelementptr i8* %pix2, i64 %.sum20
> >   %13 = bitcast i8* %12 to <4 x i8>*
> >   %wide.load12 = load <4 x i8>* %13, align 1
> >   %14 = zext <4 x i8> %wide.load11 to <4 x i32>
> >   %15 = zext <4 x i8> %wide.load12 to <4 x i32>
> >   %16 = sub nsw <4 x i32> %8, %14
> >   %17 = sub nsw <4 x i32> %9, %15
> >   %18 = icmp sgt <4 x i32> %16, <i32 -1, i32 -1, i32 -1, i32 -1>
> >   %19 = icmp sgt <4 x i32> %17, <i32 -1, i32 -1, i32 -1, i32 -1>
> >   %20 = sub <4 x i32> zeroinitializer, %16
> >   %21 = sub <4 x i32> zeroinitializer, %17
> >   %22 = select <4 x i1> %18, <4 x i32> %16, <4 x i32> %20
> >   %23 = select <4 x i1> %19, <4 x i32> %17, <4 x i32> %21
> >   %24 = add nsw <4 x i32> %22, %vec.phi
> >   %25 = add nsw <4 x i32> %23, %vec.phi9
> >   %index.next = add i64 %index, 8
> >   %26 = icmp eq i64 %index.next, %n.vec
> >   br i1 %26, label %middle.block.loopexit, label %vector.body, !llvm.loop !1
> > -----------------------------------------------------
> > 
> > -----Original Message-----
> > From: Hal Finkel [mailto: hfinkel at anl.gov ]
> > Sent: Tuesday, November 04, 2014 9:54 PM
> > To: Renato Golin
> > Cc: llvmdev at cs.uiuc.edu ; Das, Dibyendu
> > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> > 
> > ----- Original Message -----
> > > From: "Renato Golin" < renato.golin at linaro.org >
> > > To: "Dibyendu Das" < Dibyendu.Das at amd.com >
> > > Cc: llvmdev at cs.uiuc.edu
> > > Sent: Tuesday, November 4, 2014 5:23:30 AM
> > > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> > > 
> > > On 4 November 2014 11:06, Das, Dibyendu < Dibyendu.Das at amd.com >
> > > wrote:
> > > > Is there any plan to support special idioms in the loop
> > > > vectorizer like sum of absolute difference (SAD)? We see some
> > > > useful cases where llvm is losing performance at -O3 due to SADs
> > > > not being vectorized (hence PSADBWs not being generated).
> > > 
> > > It's been a while, but this could either be that the legalisation
> > > phase is not recognising the reduction or that the cost is not
> > > taking into account the lowered abs().
> > > 
> > > What does -debug-only=loop-vectorize say about it?
> > 
> > FWIW, I agree, this sounds like a cost-model problem. The
> > loop-vectorizer should be able to vectorize the 'icmp; neg; select'
> > pattern, and then the backend can pattern-match that with the
> > reduction (which is a series of shuffles and extract_element) into
> > the single instruction PSADBW -- we're quite likely missing the
> > target code to do that.
> > 
> > -Hal
> > 
> > > 
> > > cheers,
> > > --renato
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > > 
> > 
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
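
For readers following the quoted exchange above, here is a minimal C sketch of the two pieces being matched. It assumes the usual PSADBW behaviour on a 128-bit register (one sum of absolute differences per 8-byte half) and writes abs() the way it shows up in the vectorized IR (compare against -1, negate, select). The names abs_select, sad_scalar and psadbw_model are made up for illustration and are not LLVM or intrinsic names.

```c
#include <assert.h>
#include <stdint.h>

/* abs() in the shape seen in the vectorized IR:
   icmp sgt d, -1 ; sub 0, d ; select.
   Only meant for byte differences, so 0 - d cannot overflow here. */
static int32_t abs_select(int32_t d) {
    int32_t neg = 0 - d;
    return (d > -1) ? d : neg;
}

/* Scalar reference: the SAD loop from the thread over n bytes. */
static int32_t sad_scalar(const uint8_t *a, const uint8_t *b, int n) {
    assert(n >= 0);
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += abs_select((int32_t)a[i] - (int32_t)b[i]);
    return sum;
}

/* Hypothetical model of a 128-bit PSADBW: each 8-byte half of the
   inputs produces one sum of absolute byte differences. */
static void psadbw_model(const uint8_t a[16], const uint8_t b[16],
                         uint64_t out[2]) {
    for (int half = 0; half < 2; half++) {
        uint64_t s = 0;
        for (int i = 0; i < 8; i++) {
            int d = (int)a[half * 8 + i] - (int)b[half * 8 + i];
            s += (uint64_t)(d < 0 ? -d : d);
        }
        out[half] = s;
    }
}
```

Under these assumptions, adding the two values written by psadbw_model gives the same result as sad_scalar over the same 16 bytes, which is the equivalence a backend pattern match would rely on.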

Hal Finkel

2014-Nov-11 15:00 UTC

[LLVMdev] supporting SAD in loop vectorizer

----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "James Molloy" <james at jamesmolloy.co.uk>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Tuesday, November 11, 2014 8:54:01 AM
> Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> 
> Sounds good. We should try hard to canonicalize into the intrinsic in
> InstCombine from the shuffles
Or maybe we should do this in CGP -- would we want to do this if there is no
actual target support?

 -Hal
> (in addition to emitting it directly
> from the vectorizer), but it is likely easier to do there than in
> the backend.
> 
>  -Hal
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

James Molloy

2014-Nov-11 15:17 UTC

[LLVMdev] supporting SAD in loop vectorizer

Hi Hal,

My current implementation creates new SDNodes (ISD::REDUCE_ADD and friends).
If the target doesn't support the SDNode, it scalarises in SelectionDAGBuilder
to the log2(N) method currently employed by the loop vectorizer.

James
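
As a rough illustration of the log2(N) method mentioned above, here is a scalar C sketch. It assumes the lane count is a power of two and that each step folds the upper half of the live lanes onto the lower half, which stands in for one shuffle plus one add. The name reduce_add_log2 is invented for this sketch and is not an LLVM API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sum n lanes (n a power of two) in log2(n) steps.
   Each outer iteration models one shuffle + add pair: the upper half of
   the live lanes is added onto the lower half. Modifies lanes in place. */
static int32_t reduce_add_log2(int32_t *lanes, int n) {
    assert(n > 0 && (n & (n - 1)) == 0);   /* power-of-two lane count */
    for (int half = n / 2; half >= 1; half /= 2) {
        for (int i = 0; i < half; i++)
            lanes[i] += lanes[i + half];
    }
    return lanes[0];                       /* models extracting lane 0 */
}
```

For a <4 x i32> accumulator this takes two steps (4 lanes to 2, then 2 to 1); a target with a native horizontal add could replace the whole loop with a single instruction.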

On 11 November 2014 15:00, Hal Finkel <hfinkel at anl.gov> wrote:
> > Sounds good. We should try hard to canonicalize into the intrinsic in
> > InstCombine from the shuffles
>
> Or maybe we should do this in CGP -- would we want to do this if there is
> no actual target support?
>
>  -Hal
>
> > (in addition to emitting it directly
> > from the vectorizer), but it is likely easier to do there than in
> > the backend.
> >
> >  -Hal

Bruce Hoult

2014-Nov-12 02:31 UTC

[LLVMdev] supporting SAD in loop vectorizer

While ARMv7, for example, has no single instruction that performs a
horizontal reduction, you can still do better than the loop in the
source code. The easy way to get the optimum result is probably to
transform the loop into a horizontal reduction intrinsic and then lower it
to a target-appropriate sequence of instructions.

e.g.

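vpadd.i8 d1, d16, d17    @ add adjacent 8-bit pairs from d16 and d17
vpaddl.u8 d1, d1         @ add adjacent pairs, widening to 16 bits
vpaddl.u16 d1, d1        @ add adjacent pairs, widening to 32 bits
vmov.32 r1, d1[1]        @ move the two 32-bit partial sums
vmov.32 r0, d1[0]        @   into integer registers
add r0, r1               @ final scalar add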

(I'm not sure offhand whether an additional reduction stage in the vector
unit and transferring just one result to the integer registers is possible
or desirable)
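
To make the data flow of that sequence easier to follow, here is a scalar C model of it. This is only a sketch: it assumes d16 and d17 each hold eight unsigned bytes, and the function name neon_pairwise_sum_model is invented for illustration. In this model the first step keeps 8-bit lanes, so a pair sum above 255 wraps, and the sketch reproduces that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of:
   vpadd.i8 ; vpaddl.u8 ; vpaddl.u16 ; vmov x2 ; add */
static uint32_t neon_pairwise_sum_model(const uint8_t d16[8],
                                        const uint8_t d17[8]) {
    assert(d16 && d17);
    uint8_t  s8[8];
    uint16_t s16[4];
    uint32_t s32[2];

    /* vpadd.i8: adjacent pairs, results stay 8 bits wide (wrap mod 256).
       Low four lanes come from d16, high four from d17. */
    for (int i = 0; i < 4; i++) {
        s8[i]     = (uint8_t)(d16[2 * i] + d16[2 * i + 1]);
        s8[i + 4] = (uint8_t)(d17[2 * i] + d17[2 * i + 1]);
    }
    /* vpaddl.u8: adjacent pairs, widened to 16 bits. */
    for (int i = 0; i < 4; i++)
        s16[i] = (uint16_t)(s8[2 * i] + s8[2 * i + 1]);
    /* vpaddl.u16: adjacent pairs, widened to 32 bits. */
    for (int i = 0; i < 2; i++)
        s32[i] = (uint32_t)s16[2 * i] + (uint32_t)s16[2 * i + 1];
    /* vmov.32 to r0/r1, then the scalar add. */
    return s32[0] + s32[1];
}
```

With small inputs the model returns the plain sum of all sixteen bytes; with large bytes the first step loses the carries, so a real lowering would presumably want a widening first step.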

On Wed, Nov 12, 2014 at 4:00 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> > Sounds good. We should try hard to canonicalize into the intrinsic in
> > InstCombine from the shuffles
>
> Or maybe we should do this in CGP -- would we want to do this if there is
> no actual target support?
>
>  -Hal
>
> > (in addition to emitting it directly
> > from the vectorizer), but it is likely easier to do there than in
> > the backend.
> >
> >  -Hal
> >
> > >
> > >
> > > Cheers,
> > >
> > >
> > > James
> > >
> > >
> > > On 11 November 2014 13:35, Hal Finkel < hfinkel at anl.gov
> wrote:
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Dibyendu Das" < Dibyendu.Das at amd.com
>
> > > > To: "Hal Finkel" < hfinkel at anl.gov >,
"Renato Golin" <
> > > > renato.golin at linaro.org >
> > > > Cc: llvmdev at cs.uiuc.edu
> > > > Sent: Tuesday, November 4, 2014 12:15:12 PM
> > > > Subject: RE: [LLVMdev] supporting SAD in loop vectorizer
> > > >
> > > > Here's the simple SAD code:
> > > > ---------------------------------------------------
> > > > 1 #include <stdlib.h>
> > > > 2
> > > > 3 extern int ly,lx;
> > > > 4 int sad_c( unsigned char *pix1, unsigned char *pix2)
> > > > 5 {
> > > > 6 int i_sum = 0;
> > > > 7 for( int x = 0; x < lx; x++ )
> > > > 8 i_sum += abs( pix1[x] - pix2[x] );
> > > > 9 return i_sum;
> > > > 10 }
> > > > 11
> > > > -----------------------------------------------------
> > > >
> > > > The loop vectorizer does vectorize the loop and then unrolls
it
> > > > twice. The main body of the loop at the end looks like below
> > > > where
> > > > we see the icmp, neg select pattern appearing twice.
> > > > Are we saying we pattern match this to PSADBW in target ?
> > >
> > > Yes.
> > >
> > > > That seems
> > > > to have some challenges
> > >
> > > It does, but we already have code in the backend that matches
other
> > > horizontal operations (see isHorizontalBinOp and its callers in
> > > lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't
be
> > > significantly more complicated.
> > >
> > > > including the fact that we would need a
> > > > 4-way unroll to use all of 128b PSADBWs. Or am I
> > > > missing something ?
> > >
> > > No, each unrolling will get its own, so you'll get a PSADBW
from
> > > each
> > > time the loop is unrolled. Each unrolling is vectorized in terms
of
> > > <4 x i32>, and that is the 128 bits you need.
> > >
> > > If you'd like to contribute support for this, look at
> > > isHorizontalBinOp and go from there. Feel free to ask questions
if
> > > you get stuck.
> > >
> > > -Hal
> > >
> > >
> > >
> > > >
> > > > vector.body: ; preds = %vector.body.preheader, %vector.body
> > > >   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
> > > >   %vec.phi = phi <4 x i32> [ %24, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
> > > >   %vec.phi9 = phi <4 x i32> [ %25, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
> > > >   %4 = getelementptr inbounds i8* %pix1, i64 %index
> > > >   %5 = bitcast i8* %4 to <4 x i8>*
> > > >   %wide.load = load <4 x i8>* %5, align 1
> > > >   %.sum19 = or i64 %index, 4
> > > >   %6 = getelementptr i8* %pix1, i64 %.sum19
> > > >   %7 = bitcast i8* %6 to <4 x i8>*
> > > >   %wide.load10 = load <4 x i8>* %7, align 1
> > > >   %8 = zext <4 x i8> %wide.load to <4 x i32>
> > > >   %9 = zext <4 x i8> %wide.load10 to <4 x i32>
> > > >   %10 = getelementptr inbounds i8* %pix2, i64 %index
> > > >   %11 = bitcast i8* %10 to <4 x i8>*
> > > >   %wide.load11 = load <4 x i8>* %11, align 1
> > > >   %.sum20 = or i64 %index, 4
> > > >   %12 = getelementptr i8* %pix2, i64 %.sum20
> > > >   %13 = bitcast i8* %12 to <4 x i8>*
> > > >   %wide.load12 = load <4 x i8>* %13, align 1
> > > >   %14 = zext <4 x i8> %wide.load11 to <4 x i32>
> > > >   %15 = zext <4 x i8> %wide.load12 to <4 x i32>
> > > >   %16 = sub nsw <4 x i32> %8, %14
> > > >   %17 = sub nsw <4 x i32> %9, %15
> > > >   %18 = icmp sgt <4 x i32> %16, <i32 -1, i32 -1, i32 -1, i32 -1>
> > > >   %19 = icmp sgt <4 x i32> %17, <i32 -1, i32 -1, i32 -1, i32 -1>
> > > >   %20 = sub <4 x i32> zeroinitializer, %16
> > > >   %21 = sub <4 x i32> zeroinitializer, %17
> > > >   %22 = select <4 x i1> %18, <4 x i32> %16, <4 x i32> %20
> > > >   %23 = select <4 x i1> %19, <4 x i32> %17, <4 x i32> %21
> > > >   %24 = add nsw <4 x i32> %22, %vec.phi
> > > >   %25 = add nsw <4 x i32> %23, %vec.phi9
> > > >   %index.next = add i64 %index, 8
> > > >   %26 = icmp eq i64 %index.next, %n.vec
> > > >   br i1 %26, label %middle.block.loopexit, label %vector.body, !llvm.loop !1
> > > > -----------------------------------------------------
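For readers who do not want to trace the IR by hand, the shape of the vector body can be sketched in plain C. This is only a scalar model under stated assumptions, not what the vectorizer emits: the function name sad_two_accumulators and the explicit length parameter n are hypothetical, and the model covers only the vector body (n a multiple of 8), not the scalar remainder loop. The two arrays stand in for %vec.phi and %vec.phi9, and each iteration consumes eight bytes, matching "%index.next = add i64 %index, 8".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of the unrolled vector body: two 4-lane
 * accumulators (standing in for %vec.phi and %vec.phi9), eight bytes
 * consumed per iteration. */
int sad_two_accumulators(const uint8_t *pix1, const uint8_t *pix2, size_t n)
{
    /* Assumption: only the vector body is modelled, so no scalar tail. */
    assert(n % 8 == 0);

    int32_t acc0[4] = {0, 0, 0, 0};
    int32_t acc1[4] = {0, 0, 0, 0};

    for (size_t index = 0; index < n; index += 8) {
        for (int lane = 0; lane < 4; lane++) {
            /* zext to i32, sub, abs, add into the matching accumulator lane */
            acc0[lane] += abs((int)pix1[index + lane] - (int)pix2[index + lane]);
            acc1[lane] += abs((int)pix1[index + 4 + lane] - (int)pix2[index + 4 + lane]);
        }
    }

    /* After the loop: combine the two accumulators, then sum the lanes. */
    int sum = 0;
    for (int lane = 0; lane < 4; lane++)
        sum += acc0[lane] + acc1[lane];
    return sum;
}
```

Each accumulator is four 32-bit lanes, i.e. 128 bits, but each one is fed by only four bytes of input per iteration.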
> > > >
> > > > -----Original Message-----
> > > > From: Hal Finkel [mailto: hfinkel at anl.gov ]
> > > > Sent: Tuesday, November 04, 2014 9:54 PM
> > > > To: Renato Golin
> > > > Cc: llvmdev at cs.uiuc.edu ; Das, Dibyendu
> > > > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> > > >
> > > > ----- Original Message -----
> > > > > From: "Renato Golin" < renato.golin at linaro.org >
> > > > > To: "Dibyendu Das" < Dibyendu.Das at amd.com >
> > > > > Cc: llvmdev at cs.uiuc.edu
> > > > > Sent: Tuesday, November 4, 2014 5:23:30 AM
> > > > > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> > > > >
> > > > > On 4 November 2014 11:06, Das, Dibyendu < Dibyendu.Das at amd.com >
> > > > > wrote:
> > > > > > Is there any plan to support special idioms in the loop
> > > > > > vectorizer like sum of absolute difference (SAD) ? We see some
> > > > > > useful cases where llvm is losing performance at -O3 due to
> > > > > > SADs not being vectorized (hence PSADBWs not being generated).
> > > > >
> > > > > It's been a while, but this could either be that the
> > > > > legalisation phase is not recognising the reduction or that the
> > > > > cost is not taking into account the lowered abs().
> > > > >
> > > > > What does -debug-only=loop-vectorize say about it?
> > > >
> > > > FWIW, I agree, this sounds like a cost-model problem. The
> > > > loop-vectorizer should be able to vectorize the 'icmp; neg;
> > > > select' pattern, and then the backend can pattern-match that with
> > > > the reduction (which is a series of shuffles and extract_element)
> > > > into the single instruction PSADBW -- we're quite likely missing
> > > > the target code to do that.
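To make the two halves of that match concrete, here is a small scalar sketch in C. It is an illustration under stated assumptions rather than backend code: the function names are hypothetical, the first function spells out abs() as the compare / negate / select sequence, and the second models what one 64-bit half of PSADBW computes (the sum of absolute differences of eight unsigned bytes, which is at most 8 * 255 = 2040 and so fits in 16 bits).

```c
#include <assert.h>
#include <stdint.h>

/* abs() written as the 'icmp; neg; select' sequence:
 *   cmp = x > -1;  neg = 0 - x;  result = cmp ? x : neg
 * Assumption: x is a difference of two zero-extended bytes, so it lies
 * in [-255, 255] and the negation cannot overflow. */
int32_t abs_icmp_neg_select(int32_t x)
{
    assert(x != INT32_MIN);            /* 0 - INT32_MIN would overflow */
    int is_non_negative = x > -1;      /* icmp sgt x, -1 */
    int32_t negated = 0 - x;           /* sub 0, x       */
    return is_non_negative ? x : negated;  /* select     */
}

/* Hypothetical scalar model of one 64-bit half of PSADBW: the sum of
 * |a[i] - b[i]| over eight unsigned bytes. */
uint16_t psadbw_half_model(const uint8_t a[8], const uint8_t b[8])
{
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (uint16_t)abs_icmp_neg_select((int32_t)a[i] - (int32_t)b[i]);
    return sum;
}
```

A 128-bit PSADBW performs this on two independent groups of eight bytes, so a full instruction consumes 16 bytes from each operand.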
> > > >
> > > > -Hal
> > > >
> > > > >
> > > > > cheers,
> > > > > --renato
> > > > > _______________________________________________
> > > > > LLVM Developers mailing list
> > > > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > > > >
> > > >
> > > > --
> > > > Hal Finkel
> > > > Assistant Computational Scientist
> > > > Leadership Computing Facility
> > > > Argonne National Laboratory
> > > >
> > >

Das, Dibyendu

2014-Nov-14 04:51 UTC


[LLVMdev] supporting SAD in loop vectorizer

Hal, James and others-

We have started looking into this support. We may get back for some
clarifications.

-dibyendu

-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
Behalf Of Hal Finkel
Sent: Tuesday, November 11, 2014 8:30 PM
To: James Molloy
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] supporting SAD in loop vectorizer

----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "James Molloy" <james at jamesmolloy.co.uk>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Tuesday, November 11, 2014 8:54:01 AM
> Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> 
> ----- Original Message -----
> > From: "James Molloy" <james at jamesmolloy.co.uk>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "Dibyendu Das" <Dibyendu.Das at amd.com>, llvmdev at cs.uiuc.edu
> > Sent: Tuesday, November 11, 2014 8:21:37 AM
> > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> > 
> > 
> > If you'd like to contribute support for this, look at 
> > isHorizontalBinOp and go from there. Feel free to ask questions if 
> > you get stuck.
> > 
> > 
> > 
> > FWIW, I've looked at isHorizontalBinOp for inspiration for matching
> > AArch64 ADDV-and-friends (horizontal reduction operations), and
> > thought it was rather temperamental and noticed it being prone to
> > breaking depending on the exact format of the IR. Given that we
> > don't have a canonical form for reductions, I think it wrong that we
> > expect targets to undo quite complex patterns.
> > 
> > 
> > The reduction pattern is a log2(n) sequence of shuffles and binops, 
> > that are really rather complex. These sort of things should, IMHO, 
> > be intrinsics. I chatted with Arnold about this at the devmtg and 
> > was going to send a patch to do exactly that in a week or so.
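As a rough picture of the log2(n) shape being described, the following C sketch reduces a four-lane vector with two "shuffle, then add" steps. It is a scalar model under assumptions, not the proposed intrinsic: the name horizontal_add and the LANES constant are made up for illustration, and the inner loop stands in for a shuffle that moves the upper half of the live lanes down onto the lower half.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4  /* assumed power-of-two vector width, e.g. <4 x i32> */

static_assert((LANES & (LANES - 1)) == 0, "LANES must be a power of two");

/* Horizontal add as log2(LANES) steps. Each step adds the upper half of
 * the live lanes onto the lower half, halving the live width. */
int32_t horizontal_add(const int32_t vec[LANES])
{
    int32_t v[LANES];
    for (int i = 0; i < LANES; i++)
        v[i] = vec[i];

    for (int width = LANES / 2; width >= 1; width /= 2) {
        for (int i = 0; i < width; i++)
            v[i] += v[i + width];  /* "shuffle" upper half down, then binop */
    }
    return v[0];                   /* extract_element of lane 0 */
}
```

With LANES equal to 4 the outer loop runs twice (widths 2 and 1). A target matcher has to recognise that whole shuffle-and-add chain, plus the final extract, as one horizontal operation.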
> 
> Sounds good. We should try hard to canonicalize into the intrinsic in 
> InstCombine from the shuffles
Or maybe we should do this in CGP -- would we want to do this if there is no
actual target support?

 -Hal
> (in addition to emitting it directly
> from the vectorizer), but it is likely easier to do there than in the 
> backend.
> 
>  -Hal
> 
> > 
> > 
> > Cheers,
> > 
> > 
> > James
> > 
> > 
> > On 11 November 2014 13:35, Hal Finkel < hfinkel at anl.gov > wrote:
> > 
> > 
> > ----- Original Message -----
> > > From: "Dibyendu Das" < Dibyendu.Das at amd.com >
> > > To: "Hal Finkel" < hfinkel at anl.gov >, "Renato Golin" < renato.golin at linaro.org >
> > > Cc: llvmdev at cs.uiuc.edu
> > > Sent: Tuesday, November 4, 2014 12:15:12 PM
> > > Subject: RE: [LLVMdev] supporting SAD in loop vectorizer
> > > 
> > > Here's the simple SAD code:
> > > ---------------------------------------------------
> > > #include <stdlib.h>
> > >
> > > extern int ly,lx;
> > > int sad_c( unsigned char *pix1, unsigned char *pix2)
> > > {
> > >     int i_sum = 0;
> > >     for( int x = 0; x < lx; x++ )
> > >         i_sum += abs( pix1[x] - pix2[x] );
> > >     return i_sum;
> > > }
> > > -----------------------------------------------------
> > > 
> > > The loop vectorizer does vectorize the loop and then unrolls it
> > > twice. The main body of the loop at the end looks like below, where
> > > we see the icmp, neg, select pattern appearing twice.
> > > Are we saying we pattern match this to PSADBW in target ?
> > 
> > Yes.
> > 
> > > That seems
> > > to have some challenges
> > 
> > It does, but we already have code in the backend that matches other 
> > horizontal operations (see isHorizontalBinOp and its callers in 
> > lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be 
> > significantly more complicated.
> > 
> > > including the fact that we would need a 4-way unroll to use all of
> > > 128b PSADBWs. Or am I missing something ?
> > 
> > No, each unrolling will get its own, so you'll get a PSADBW from 
> > each time the loop is unrolled. Each unrolling is vectorized in 
> > terms of
> > <4 x i32>, and that is the 128 bits you need.
> > 
> > If you'd like to contribute support for this, look at 
> > isHorizontalBinOp and go from there. Feel free to ask questions if 
> > you get stuck.
> > 
> > -Hal
> > 
> > 
> > 
> > > 
> > > vector.body: ; preds = %vector.body.preheader, %vector.body
> > >   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
> > >   %vec.phi = phi <4 x i32> [ %24, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
> > >   %vec.phi9 = phi <4 x i32> [ %25, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
> > >   %4 = getelementptr inbounds i8* %pix1, i64 %index
> > >   %5 = bitcast i8* %4 to <4 x i8>*
> > >   %wide.load = load <4 x i8>* %5, align 1
> > >   %.sum19 = or i64 %index, 4
> > >   %6 = getelementptr i8* %pix1, i64 %.sum19
> > >   %7 = bitcast i8* %6 to <4 x i8>*
> > >   %wide.load10 = load <4 x i8>* %7, align 1
> > >   %8 = zext <4 x i8> %wide.load to <4 x i32>
> > >   %9 = zext <4 x i8> %wide.load10 to <4 x i32>
> > >   %10 = getelementptr inbounds i8* %pix2, i64 %index
> > >   %11 = bitcast i8* %10 to <4 x i8>*
> > >   %wide.load11 = load <4 x i8>* %11, align 1
> > >   %.sum20 = or i64 %index, 4
> > >   %12 = getelementptr i8* %pix2, i64 %.sum20
> > >   %13 = bitcast i8* %12 to <4 x i8>*
> > >   %wide.load12 = load <4 x i8>* %13, align 1
> > >   %14 = zext <4 x i8> %wide.load11 to <4 x i32>
> > >   %15 = zext <4 x i8> %wide.load12 to <4 x i32>
> > >   %16 = sub nsw <4 x i32> %8, %14
> > >   %17 = sub nsw <4 x i32> %9, %15
> > >   %18 = icmp sgt <4 x i32> %16, <i32 -1, i32 -1, i32 -1, i32 -1>
> > >   %19 = icmp sgt <4 x i32> %17, <i32 -1, i32 -1, i32 -1, i32 -1>
> > >   %20 = sub <4 x i32> zeroinitializer, %16
> > >   %21 = sub <4 x i32> zeroinitializer, %17
> > >   %22 = select <4 x i1> %18, <4 x i32> %16, <4 x i32> %20
> > >   %23 = select <4 x i1> %19, <4 x i32> %17, <4 x i32> %21
> > >   %24 = add nsw <4 x i32> %22, %vec.phi
> > >   %25 = add nsw <4 x i32> %23, %vec.phi9
> > >   %index.next = add i64 %index, 8
> > >   %26 = icmp eq i64 %index.next, %n.vec
> > >   br i1 %26, label %middle.block.loopexit, label %vector.body, !llvm.loop !1
> > > -----------------------------------------------------
> > > 
> > > -----Original Message-----
> > > From: Hal Finkel [mailto: hfinkel at anl.gov ]
> > > Sent: Tuesday, November 04, 2014 9:54 PM
> > > To: Renato Golin
> > > Cc: llvmdev at cs.uiuc.edu ; Das, Dibyendu
> > > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> > > 
> > > ----- Original Message -----
> > > > From: "Renato Golin" < renato.golin at linaro.org >
> > > > To: "Dibyendu Das" < Dibyendu.Das at amd.com >
> > > > Cc: llvmdev at cs.uiuc.edu
> > > > Sent: Tuesday, November 4, 2014 5:23:30 AM
> > > > Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> > > > 
> > > > On 4 November 2014 11:06, Das, Dibyendu < Dibyendu.Das at amd.com >
> > > > wrote:
> > > > > Is there any plan to support special idioms in the loop
> > > > > vectorizer like sum of absolute difference (SAD) ? We see some
> > > > > useful cases where llvm is losing performance at -O3 due to
> > > > > SADs not being vectorized (hence PSADBWs not being generated).
> > > > 
> > > > It's been a while, but this could either be that the
> > > > legalisation phase is not recognising the reduction or that the
> > > > cost is not taking into account the lowered abs().
> > > > 
> > > > What does -debug-only=loop-vectorize say about it?
> > > 
> > > FWIW, I agree, this sounds like a cost-model problem. The
> > > loop-vectorizer should be able to vectorize the 'icmp; neg;
> > > select' pattern, and then the backend can pattern-match that with
> > > the reduction (which is a series of shuffles and extract_element)
> > > into the single instruction PSADBW -- we're quite likely missing
> > > the target code to do that.
> > > 
> > > -Hal
> > > 
> > > > 
> > > > cheers,
> > > > --renato
> > > > _______________________________________________
> > > > LLVM Developers mailing list
> > > > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu 
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > > > 
> > > 
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > > 
> > 
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
