thr3ads.net - llvm dev - [LLVMdev] supporting SAD in loop vectorizer [Nov 2014]

Hal Finkel

2014-Nov-04 16:24 UTC

[LLVMdev] supporting SAD in loop vectorizer

----- Original Message -----
> From: "Renato Golin" <renato.golin at linaro.org>
> To: "Dibyendu Das" <Dibyendu.Das at amd.com>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Tuesday, November 4, 2014 5:23:30 AM
> Subject: Re: [LLVMdev] supporting SAD in loop vectorizer
> 
> On 4 November 2014 11:06, Das, Dibyendu <Dibyendu.Das at amd.com> wrote:
> > Is there any plan to support special idioms in the loop vectorizer
> > like sum of absolute difference (SAD) ? We see some useful cases
> > where llvm is losing performance at -O3 due to SADs not being
> > vectorized (hence PSADBWs not being generated).
> 
> It's been a while, but this could either be that the legalisation
> phase is not recognising the reduction or that the cost is not taking
> into account the lowered abs().
> 
> What does -debug-only=loop-vectorize say about it?

FWIW, I agree, this sounds like a cost-model problem. The loop vectorizer
should be able to vectorize the 'icmp; neg; select' pattern, and the backend
can then pattern-match that, together with the reduction (which is a series of
shuffles and an extract_element), into the single instruction PSADBW. We are
quite likely missing the target code to do that.

 -Hal
> 
> cheers,
> --renato
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
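
For readers unfamiliar with the pattern named above, here is a minimal C sketch of what 'icmp; neg; select' computes and of the arithmetic a PSADBW performs on one 64-bit half of its operands. The helper names abs_select and psadbw_half_model are hypothetical and exist only for illustration; this is a scalar model, not code from LLVM or from the thread.

```c
#include <stdint.h>

/* abs() spelled out as compare, negate, select. */
static int abs_select(int d)
{
    int is_nonneg = d > -1;          /* icmp sgt d, -1 */
    int negated   = 0 - d;           /* sub 0, d       */
    return is_nonneg ? d : negated;  /* select         */
}

/* Hypothetical scalar model of one 64-bit half of PSADBW:
   the sum of |a[i] - b[i]| over 8 unsigned bytes.
   The largest possible result is 8 * 255 = 2040, so 16 bits suffice. */
static uint16_t psadbw_half_model(const uint8_t a[8], const uint8_t b[8])
{
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (uint16_t)abs_select((int)a[i] - (int)b[i]);
    return sum;
}
```

A 128-bit PSADBW applies this to two groups of eight bytes at once and leaves one partial sum in each 64-bit lane, which is why both the absolute-difference pattern and the reduction have to be matched together.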

Das, Dibyendu

2014-Nov-04 16:26 UTC

[LLVMdev] supporting SAD in loop vectorizer

I will get the debug dump and get back on this.


Das, Dibyendu

2014-Nov-04 18:15 UTC

[LLVMdev] supporting SAD in loop vectorizer

Here's the simple SAD code:
---------------------------------------------------
#include <stdlib.h>

extern int ly, lx;

/* Sum of absolute differences over the first lx bytes of pix1 and pix2. */
int sad_c(unsigned char *pix1, unsigned char *pix2)
{
    int i_sum = 0;
    for (int x = 0; x < lx; x++)
        i_sum += abs(pix1[x] - pix2[x]);
    return i_sum;
}
-----------------------------------------------------

The loop vectorizer does vectorize the loop and then unrolls it twice. The main
body of the loop ends up looking like the IR below, where the icmp, neg, select
pattern appears twice.
Are we saying we pattern-match this to PSADBW in the target? That seems to have
some challenges, including the fact that we would need a 4-way unroll to use
the full 128 bits of a PSADBW. Or am I missing something?

vector.body:                                      ; preds = %vector.body.preheader, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %vec.phi = phi <4 x i32> [ %24, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
  %vec.phi9 = phi <4 x i32> [ %25, %vector.body ], [ zeroinitializer, %vector.body.preheader ]
  %4 = getelementptr inbounds i8* %pix1, i64 %index
  %5 = bitcast i8* %4 to <4 x i8>*
  %wide.load = load <4 x i8>* %5, align 1
  %.sum19 = or i64 %index, 4
  %6 = getelementptr i8* %pix1, i64 %.sum19
  %7 = bitcast i8* %6 to <4 x i8>*
  %wide.load10 = load <4 x i8>* %7, align 1
  %8 = zext <4 x i8> %wide.load to <4 x i32>
  %9 = zext <4 x i8> %wide.load10 to <4 x i32>
  %10 = getelementptr inbounds i8* %pix2, i64 %index
  %11 = bitcast i8* %10 to <4 x i8>*
  %wide.load11 = load <4 x i8>* %11, align 1
  %.sum20 = or i64 %index, 4
  %12 = getelementptr i8* %pix2, i64 %.sum20
  %13 = bitcast i8* %12 to <4 x i8>*
  %wide.load12 = load <4 x i8>* %13, align 1
  %14 = zext <4 x i8> %wide.load11 to <4 x i32>
  %15 = zext <4 x i8> %wide.load12 to <4 x i32>
  %16 = sub nsw <4 x i32> %8, %14
  %17 = sub nsw <4 x i32> %9, %15
  %18 = icmp sgt <4 x i32> %16, <i32 -1, i32 -1, i32 -1, i32 -1>
  %19 = icmp sgt <4 x i32> %17, <i32 -1, i32 -1, i32 -1, i32 -1>
  %20 = sub <4 x i32> zeroinitializer, %16
  %21 = sub <4 x i32> zeroinitializer, %17
  %22 = select <4 x i1> %18, <4 x i32> %16, <4 x i32> %20
  %23 = select <4 x i1> %19, <4 x i32> %17, <4 x i32> %21
  %24 = add nsw <4 x i32> %22, %vec.phi
  %25 = add nsw <4 x i32> %23, %vec.phi9
  %index.next = add i64 %index, 8
  %26 = icmp eq i64 %index.next, %n.vec
  br i1 %26, label %middle.block.loopexit, label %vector.body, !llvm.loop !1
-----------------------------------------------------
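
As a reading aid for the IR above, the following C sketch mirrors the same structure in scalar form: two 4-lane i32 accumulators (standing in for %vec.phi and %vec.phi9), each fed by four bytes per iteration, followed by a horizontal reduction and a scalar remainder loop. The function name sad_vf4_uf2 and the explicit length parameter n are assumptions made for illustration; the original sad_c reads the extern lx instead.

```c
/* Scalar mirror of vector.body: vector width 4, unrolled twice. */
static int sad_vf4_uf2(const unsigned char *pix1, const unsigned char *pix2, int n)
{
    int acc0[4] = {0, 0, 0, 0};  /* plays the role of %vec.phi  */
    int acc1[4] = {0, 0, 0, 0};  /* plays the role of %vec.phi9 */
    int i = 0;

    /* Each trip consumes 8 bytes from each input (index.next = index + 8). */
    for (; i + 8 <= n; i += 8) {
        for (int lane = 0; lane < 4; lane++) {
            /* zext to i32, then sub */
            int d0 = (int)pix1[i + lane]     - (int)pix2[i + lane];
            int d1 = (int)pix1[i + 4 + lane] - (int)pix2[i + 4 + lane];
            /* icmp sgt -1; sub 0, d; select; add into the accumulator */
            acc0[lane] += (d0 > -1) ? d0 : 0 - d0;
            acc1[lane] += (d1 > -1) ? d1 : 0 - d1;
        }
    }

    /* Horizontal reduction of both accumulators (done after the loop). */
    int sum = 0;
    for (int lane = 0; lane < 4; lane++)
        sum += acc0[lane] + acc1[lane];

    /* Scalar remainder for the bytes that did not fill a whole trip. */
    for (; i < n; i++) {
        int d = (int)pix1[i] - (int)pix2[i];
        sum += (d > -1) ? d : 0 - d;
    }
    return sum;
}
```

Written this way, it is easy to see that each accumulator is 128 bits wide after zero-extension, while the data loaded for it is only four bytes (32 bits) per iteration.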


Hal Finkel

2014-Nov-11 13:35 UTC

[LLVMdev] supporting SAD in loop vectorizer

----- Original Message -----
> From: "Dibyendu Das" <Dibyendu.Das at amd.com>
> To: "Hal Finkel" <hfinkel at anl.gov>, "Renato Golin" <renato.golin at linaro.org>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Tuesday, November 4, 2014 12:15:12 PM
> Subject: RE: [LLVMdev] supporting SAD in loop vectorizer
> 
> The loop vectorizer does vectorize the loop and then unrolls it
> twice. The main body of the loop at the end looks like below where
> we see the icmp, neg select pattern appearing twice.
> Are we saying we pattern match this to PSADBW in target ?

Yes.

> That seems
> to have some challenges

It does, but we already have code in the backend that matches other horizontal
operations (see isHorizontalBinOp and its callers in
lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be
significantly more complicated.

> including the fact that we would need a
> 4-way unroll to use all of 128b PSADBWs. Or am I
> missing something ?

No, each unrolled copy gets its own, so you'll get one PSADBW for each time
the loop is unrolled. Each unrolled copy is vectorized in terms of <4 x i32>,
and that is the 128 bits you need.

If you'd like to contribute support for this, look at isHorizontalBinOp and
go from there. Feel free to ask questions if you get stuck.

 -Hal
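
To make the target of that pattern match concrete, here is a hand-written C sketch of the SAD loop using the SSE2 intrinsic _mm_sad_epu8, which maps to PSADBW. It is only an illustration of the instruction's use under stated assumptions, not what LLVM emits and not code from the thread. The function name sad_psadbw and the length parameter n are hypothetical, and the SSE2 path is guarded by __SSE2__ so the function falls back to the scalar loop elsewhere.

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* SAD over n bytes, using PSADBW on 16-byte chunks where SSE2 is available. */
static int sad_psadbw(const unsigned char *pix1, const unsigned char *pix2, int n)
{
    int sum = 0;
    int i = 0;

#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(pix1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(pix2 + i));
        /* One PSADBW: a partial sum in the low 16 bits of each 64-bit lane. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }
    /* Fold the high 64-bit lane onto the low one and extract the total. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    sum = _mm_cvtsi128_si32(acc);
#endif

    /* Scalar remainder, and the whole loop when SSE2 is not available. */
    for (; i < n; i++) {
        int d = (int)pix1[i] - (int)pix2[i];
        sum += d < 0 ? -d : d;
    }
    return sum;
}
```

Calling sad_psadbw(pix1, pix2, lx) would compute the same value as sad_c for non-negative lx, provided the total fits in an int.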
