Hi all!

It takes the current SLP vectorizer too long to vectorize my scalar code. I am talking about functions that have a single, huge basic block with O(10^6) instructions. Here's an example:

%0 = getelementptr float* %arg1, i32 49
%1 = load float* %0
%2 = getelementptr float* %arg1, i32 4145
%3 = load float* %2
%4 = getelementptr float* %arg2, i32 49
%5 = load float* %4
%6 = getelementptr float* %arg2, i32 4145
%7 = load float* %6
%8 = fmul float %7, %1
%9 = fmul float %5, %3
%10 = fadd float %9, %8
%11 = fmul float %7, %3
%12 = fmul float %5, %1
%13 = fsub float %12, %11
%14 = getelementptr float* %arg3, i32 16
%15 = load float* %14
%16 = getelementptr float* %arg3, i32 4112
%17 = load float* %16
%18 = getelementptr float* %arg4, i32 0
%19 = load float* %18
%20 = getelementptr float* %arg4, i32 4096
%21 = load float* %20
%22 = fmul float %21, %15
%23 = fmul float %19, %17
%24 = fadd float %23, %22
%25 = fmul float %21, %17
%26 = fmul float %19, %15
%27 = fsub float %26, %25
%28 = fadd float %24, %10
%29 = fadd float %27, %13
%30 = getelementptr float* %arg0, i32 0
store float %29, float* %30
%31 = getelementptr float* %arg0, i32 4096
store float %28, float* %31
... and so on ...

The SLP vectorizer would create code like this:

%219 = insertelement <4 x float> %218, float %185, i32 2
%220 = insertelement <4 x float> %219, float %197, i32 3
%221 = fmul <4 x float> %216, %220
%222 = fadd <4 x float> %221, %212
%223 = fmul <4 x float> %207, %220
...
%234 = bitcast float* %165 to <4 x float>*
store <4 x float> %233, <4 x float>* %234, align 4

With the current SLP implementation, 99.5% of the compile time is spent in the SLP vectorizer, and I have the impression that this can be improved for my case. I believe the SLP vectorizer has far more capabilities than these simple (but huge) functions need. I was hoping some of you might have an idea of how to strip functionality out of the SLP vectorizer so that it can still vectorize these simple functions?

Thanks,
Frank
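For readers trying to follow the example: judging only from the constant offsets, each unrolled group in the IR above appears to compute a split-complex multiply-add, out = arg1*arg2 + arg3*arg4, with imaginary parts stored 4096 floats after the real parts (hence the paired offsets 49/4145, 16/4112, 0/4096). The C++ sketch below is a speculative reconstruction of one such group; the layout assumption, all names, and the function signatures are inferred for illustration, not stated anywhere in the thread.

    // Speculative reconstruction of one unrolled group from the IR above.
    // Assumption: split-complex layout, imaginary parts IM floats after
    // the real parts. Names and signatures are illustrative only.
    static const long IM = 4096;

    // re + i*im = x[ix] * y[iy] over split-complex arrays.
    static void cmul(const float *x, long ix, const float *y, long iy,
                     float &re, float &im) {
      re = y[iy] * x[ix] - y[iy + IM] * x[ix + IM]; // cf. %12 - %11
      im = y[iy] * x[ix + IM] + y[iy + IM] * x[ix]; // cf. %9 + %8
    }

    // One group: out = a*b + c*d at the given element indices.
    void group(float *out, long io,
               const float *a, const float *b, long iab, // %arg1, %arg2 at 49
               const float *c, long ic,                  // %arg3 at 16
               const float *d, long id) {                // %arg4 at 0
      float re1, im1, re2, im2;
      cmul(a, iab, b, iab, re1, im1);
      cmul(c, ic, d, id, re2, im2);
      out[io]      = re1 + re2; // cf. %29, stored at %arg0 + 0
      out[io + IM] = im1 + im2; // cf. %28, stored at %arg0 + 4096
    }

If the pattern really is this regular across the whole basic block, it also bears on the later question in this thread of whether the code could be rolled back into a loop.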
Hi Frank,

The most time-consuming part of the SLP vectorizer (especially in cases like yours) is finding sets of consecutive stores. This is currently performed by a quadratic search (see the routine SLPVectorizer::vectorizeStores): we do pairwise comparisons between all pointers (though we limit ourselves to looking at at most 16 stores). I think it should be possible to group pointers with a common base, compute constant relative offsets, and just sort them all - this way we'd save a lot of expensive computations. However, I haven't tried implementing this, and I guess there might be some hard corner cases too. Patches would be welcome here :)

Thanks,
Michael

> On Jul 7, 2015, at 11:31 AM, Frank Winter <fwinter at jlab.org> wrote:
>
> [...]
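To make the suggestion concrete, here is a minimal C++ sketch of the sort-based search Michael describes. It assumes each store's pointer has already been decomposed into a base value plus a constant byte offset (in LLVM one might use GetPointerBaseWithConstantOffset for that step, which is omitted here); the Store struct and the function name are illustrative stand-ins, not the actual SLPVectorizer code.

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <vector>

    // Illustrative stand-in; in LLVM this would carry a StoreInst*.
    struct Store { const void *Base; int64_t Offset; };

    // Group stores by base pointer, sort each group by constant offset,
    // and collect maximal runs whose offsets differ by exactly ElemSize.
    // One O(n log n) sort per base replaces the O(n^2) pairwise search.
    // Duplicate offsets and non-constant offsets are deliberately ignored.
    std::vector<std::vector<Store>>
    findConsecutiveRuns(const std::vector<Store> &Stores, int64_t ElemSize) {
      std::map<const void *, std::vector<Store>> ByBase;
      for (const Store &S : Stores)
        ByBase[S.Base].push_back(S);

      std::vector<std::vector<Store>> Runs;
      for (auto &Entry : ByBase) {
        auto &Group = Entry.second;
        std::sort(Group.begin(), Group.end(),
                  [](const Store &A, const Store &B) { return A.Offset < B.Offset; });
        std::vector<Store> Run{Group.front()};
        for (size_t I = 1, E = Group.size(); I != E; ++I) {
          if (Group[I].Offset == Run.back().Offset + ElemSize) {
            Run.push_back(Group[I]);
          } else {
            if (Run.size() > 1)
              Runs.push_back(Run);
            Run = {Group[I]};
          }
        }
        if (Run.size() > 1)
          Runs.push_back(Run);
      }
      return Runs;
    }

For the float stores in Frank's example, ElemSize would be 4 and the GEP element indices would first be scaled to byte offsets; the hard corner cases Michael mentions would need handling on top of this.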
I forgot to update the SLP thread from last week. I have a patch up for review that would allow creating wider vectors as requested, but it may increase SLP compile time:

http://reviews.llvm.org/D10950

An audit of the trunk backends shows that only PowerPC + QPX and x86 + AVX / AVX512 would potentially get an extra round of store merging from the use of getRegisterBitWidth(). As reported in the Phabricator comments, I didn't see any compile-time hit on test-suite for an AVX machine. I'm very curious to know whether that patch causes further blowup in this example.

Frank, what causes a 10^6 instruction function to be generated? Can this be rolled into a loop?

On Tue, Jul 7, 2015 at 12:55 PM, Michael Zolotukhin <mzolotukhin at apple.com> wrote:
> [...]
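For reference, the gist of the idea behind D10950, paraphrased as a hedged sketch rather than the actual patch: size the widest store-chain width from the target's vector register width instead of a hard-coded 128 bits. TargetTransformInfo::getRegisterBitWidth(bool Vector) is the real hook named above; the helper function wrapped around it is an assumption for illustration.

    #include "llvm/Analysis/TargetTransformInfo.h"
    #include "llvm/IR/DataLayout.h"
    #include "llvm/IR/Type.h"
    using namespace llvm;

    // Sketch only, not the patch itself: derive the widest SLP bundle
    // from the target's vector register width.
    static unsigned maxStoreChainWidth(const TargetTransformInfo &TTI,
                                       const DataLayout &DL, Type *ScalarTy) {
      unsigned MaxRegBits = TTI.getRegisterBitWidth(/*Vector=*/true); // e.g. 256 with AVX
      uint64_t EltBits = DL.getTypeSizeInBits(ScalarTy);              // 32 for float
      return EltBits ? unsigned(MaxRegBits / EltBits) : 0;            // e.g. 8 x float on AVX
    }

On a 128-bit-register target this degenerates to the current behavior (4 x float); the extra round of store merging on AVX/AVX512/QPX is what could add compile time in a case as large as Frank's.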