Hal Finkel
2012-Jan-25 00:41 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, 2012-01-24 at 16:08 -0600, Sebastian Pop wrote:
> On Mon, Jan 23, 2012 at 10:13 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > On Tue, 2012-01-17 at 13:25 -0600, Sebastian Pop wrote:
> >> Hi,
> >>
> >> On Fri, Dec 30, 2011 at 3:09 AM, Tobias Grosser <tobias at grosser.es> wrote:
> >> > As it seems my intuition is wrong, I am very eager to see and understand an example where a search limit of 4000 is really needed.
> >>
> >> To get the ball rolling again, I attached a testcase that can be tuned to understand the impact on compile time for different sizes of a basic block. One can also set the number of iterations in the loop to 1 to test the vectorizer with no loops around.
> >>
> >> Hal, could you please report the compile times with/without the vectorizer for different basic block sizes?
> >
> > I've looked at your test case, and I am pleased to report a negligible compile-time increase! Also, there is no vectorization of the main loop :)
>
> Good!
>
> > Here's why: (as you know) the main part of the loop is essentially one long dependency chain, so there is nothing to vectorize there. The only vectorization opportunities come from unrolling the loop. Using the default thresholds, the loop will not even partially unroll (because the body is too large). As a result, essentially nothing happens.
> >
> > I've prepared a reduced version of your test case (attached). Using -unroll-threshold=300 (along with -unroll-allow-partial), I can make the loop unroll partially (the reduced loop size is 110, so this allows unrolling 2 iterations). Once this is done, the vectorizer finds candidate pairs and vectorizes [as a practical matter, you need -basicaa too].
> >
> > I think that even this is probably too big for a regression test. I don't think that the basic structure really adds anything over existing tests (although I need to make sure that alias-analysis use is otherwise covered), but I'll copy-and-paste a small portion into a regression test to cover the search-limit logic (which is currently uncovered). We should probably discuss the different situations that we'd like to see covered in the regression suite (perhaps post-commit).
> >
> > Thanks for working on this! I'll post an updated patch for review shortly.
>
> Thanks for the new patch.
>
> I will send you some more comments on the patch as I'm advancing through testing: I found some interesting benchmarks in which enabling vectorization gets the performance down by 80% on ARM. I will prepare a reduced testcase and try to find out the reason. As a first shot, I would say that this comes from the vectorization of code in a loop and the overhead of transfer between scalar and vector registers.

This is good; as has been pointed out, we'll need to develop a vectorization cost model for this kind of thing to really be successful, so we should start thinking about that.

The pass, as implemented, has a semi-implicit cost model which says that permutations followed by another vector operation are free, scalar -> vector transfers are free, and vectorizing a memory operation is just as good as vectorizing an arithmetic operation. Depending on the system, these may all be untrue (although on some systems they are true).
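To make that concrete, here is a minimal sketch of how those implicit assumptions could be turned into explicit, target-tunable costs. The names and the numbers are hypothetical and are not taken from the patch:

// Hypothetical sketch only -- not code from the patch. It shows how the
// currently implicit assumptions (free permutations, free scalar<->vector
// transfers, memory ops as profitable as arithmetic ops) could be expressed
// as explicit, per-target costs.
#include <cstdio>

struct TargetVecCosts {
  unsigned PermuteCost;        // cost of a shuffle feeding another vector op
  unsigned InsertExtractCost;  // cost of moving a scalar into/out of a vector
  unsigned MemOpSavings;       // benefit of fusing two scalar loads/stores
  unsigned ArithOpSavings;     // benefit of fusing two scalar arithmetic ops
};

// The pass as posted effectively assumes these values:
constexpr TargetVecCosts ImplicitModel = {0, 0, 1, 1};

int netBenefit(const TargetVecCosts &C, unsigned NumPermutes,
               unsigned NumTransfers, unsigned NumMemPairs,
               unsigned NumArithPairs) {
  int Savings  = NumMemPairs * C.MemOpSavings + NumArithPairs * C.ArithOpSavings;
  int Overhead = NumPermutes * C.PermuteCost + NumTransfers * C.InsertExtractCost;
  return Savings - Overhead; // vectorize only if positive
}

int main() {
  // Example: 4 pairable arithmetic ops reached through 2 shuffles and
  // 3 scalar<->vector transfers. The ARM-like numbers are made up, purely
  // for illustration of how nonzero transfer costs flip the decision.
  constexpr TargetVecCosts TransferHeavyGuess = {1, 3, 1, 1};
  std::printf("implicit model: %d, with transfer costs: %d\n",
              netBenefit(ImplicitModel, 2, 3, 0, 4),
              netBenefit(TransferHeavyGuess, 2, 3, 0, 4));
  return 0;
}

The point of the sketch is only that a target reporting nonzero transfer costs would reject chains that repeatedly cross the scalar/vector boundary, which matches the kind of slowdown being described.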
If you can generate a test case, that would be great; I'd like to look at it.

> I would like to not stop you from committing the patch just because of performance issues: let's address any further improvements once the patch is installed on ToT.

Sounds good to me. Thanks again,
Hal

> Thanks again,
> Sebastian
> --
> Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
Sebastian Pop
2012-Jan-26 20:34 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, Jan 24, 2012 at 6:41 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> enabling vectorization gets the performance down by 80% on ARM. I will prepare a reduced testcase and try to find out the reason. As a first shot, I would say that this comes from the vectorization of code in a loop and the overhead of transfer between scalar and vector registers.
>
> This is good; as has been pointed out, we'll need to develop a vectorization cost model for this kind of thing to really be successful, so we should start thinking about that.
>
> The pass, as implemented, has a semi-implicit cost model which says that permutations followed by another vector operation are free, scalar -> vector transfers are free, and vectorizing a memory operation is just as good as vectorizing an arithmetic operation. Depending on the system, these may all be untrue (although on some systems they are true).
>
> If you can generate a test case, that would be great; I'd like to look at it.

Here is the testcase, with calls to gettimeofday to measure the time spent in the kernel rather than in the init/fini phases. On ARM I saw around a 5-6x slowdown in the vector version. I haven't tried this on x86 yet, but it should also show a slowdown, since the cost of moving between scalar and vector registers is nonzero there as well.

Sebastian
--
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.c
Type: text/x-csrc
Size: 891 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120126/08013f8f/attachment.c>
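The attached test.c is not reproduced in the archive. Purely as an illustration of the measurement technique described above (gettimeofday around the kernel so that the init and fini phases are excluded), a sketch might look like the following; the kernel, array sizes, and iteration count are placeholders, not the actual benchmark:

// Hypothetical sketch, not the attached test.c: it only illustrates timing
// the kernel with gettimeofday so that initialization and result checking
// are excluded from the measurement.
#include <sys/time.h>
#include <cstdio>

#define N    (1 << 20)
#define ITER 100

static float A[N], B[N];

// Placeholder kernel; the real testcase exercises code that the basic-block
// vectorizer pairs up after partial unrolling.
static void kernel() {
  for (int i = 0; i < N; ++i)
    B[i] = A[i] * A[i] + 1.0f;
}

static double now_sec() {
  struct timeval tv;
  gettimeofday(&tv, nullptr);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main() {
  for (int i = 0; i < N; ++i)   // init phase, outside the timed region
    A[i] = (float)i;

  double start = now_sec();
  for (int it = 0; it < ITER; ++it)
    kernel();
  double end = now_sec();

  std::printf("kernel time: %f s (checksum %f)\n",
              end - start, (double)B[N - 1]);
  return 0;
}

Comparing the time reported by such a harness with and without vectorization enabled is then the intended experiment.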
Hal Finkel
2012-Jan-26 20:49 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Thu, 2012-01-26 at 14:34 -0600, Sebastian Pop wrote:
> On Tue, Jan 24, 2012 at 6:41 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> enabling vectorization gets the performance down by 80% on ARM. I will prepare a reduced testcase and try to find out the reason. As a first shot, I would say that this comes from the vectorization of code in a loop and the overhead of transfer between scalar and vector registers.
> >
> > This is good; as has been pointed out, we'll need to develop a vectorization cost model for this kind of thing to really be successful, so we should start thinking about that.
> >
> > The pass, as implemented, has a semi-implicit cost model which says that permutations followed by another vector operation are free, scalar -> vector transfers are free, and vectorizing a memory operation is just as good as vectorizing an arithmetic operation. Depending on the system, these may all be untrue (although on some systems they are true).
> >
> > If you can generate a test case, that would be great; I'd like to look at it.
>
> Here is the testcase, with calls to gettimeofday to measure the time spent in the kernel rather than in the init/fini phases. On ARM I saw around a 5-6x slowdown in the vector version. I haven't tried this on x86 yet, but it should also show a slowdown, since the cost of moving between scalar and vector registers is nonzero there as well.

Thanks! Did you compile with any non-default flags other than -mllvm -vectorize?

 -Hal

> Sebastian
> --
> Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory