Hal Finkel
2012-Jan-25 00:41 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, 2012-01-24 at 16:08 -0600, Sebastian Pop wrote:
> On Mon, Jan 23, 2012 at 10:13 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > On Tue, 2012-01-17 at 13:25 -0600, Sebastian Pop wrote:
> >> Hi,
> >>
> >> On Fri, Dec 30, 2011 at 3:09 AM, Tobias Grosser <tobias at grosser.es> wrote:
> >> > As it seems my intuition is wrong, I am very eager to see and understand
> >> > an example where a search limit of 4000 is really needed.
> >> >
> >>
> >> To make the ball roll again, I attached a testcase that can be tuned
> >> to understand the impact on compile time for different sizes of a
> >> basic block. One can also set the number of iterations in the loop to
> >> 1 to test the vectorizer with no loops around.
> >>
> >> Hal, could you please report the compile times with/without the
> >> vectorizer for different basic block sizes?
> >
> > I've looked at your test case, and I am pleased to report a negligible
> > compile-time increase! Also, there is no vectorization of the main
>
> Good!
>
> > loop :) Here's why: (as you know) the main part of the loop is
> > essentially one long dependency chain, and so there is nothing to
> > vectorize there. The only vectorization opportunities come from
> > unrolling the loop. Using the default thresholds, the loop will not even
> > partially unroll (because the body is too large). As a result,
> > essentially nothing happens.
> >
> > I've prepared a reduced version of your test case (attached). Using
> > -unroll-threshold=300 (along with -unroll-allow-partial), I can make the
> > loop unroll partially (the reduced loop size is 110, so this allows
> > unrolling 2 iterations). Once this is done, the vectorizer finds
> > candidate pairs and vectorizes [as a practical matter, you need -basicaa
> > too].
> >
> > I think that even this is probably too big for a regression test. I
> > don't think that the basic structure really adds anything over existing
> > tests (although I need to make sure that alias-analysis use is otherwise
> > covered), but I'll copy-and-paste a small portion into a regression test
> > to cover the search limit logic (which is currently uncovered). We
> > should probably discuss different situations that we'd like to see
> > covered in the regression suite (perhaps post-commit).
> >
> > Thanks for working on this! I'll post an updated patch for review
> > shortly.
>
> Thanks for the new patch.
>
> I will send you some more comments on the patch as I'm advancing
> through testing: I found some interesting benchmarks in which
> enabling vectorization gets the performance down by 80% on ARM.
> I will prepare a reduced testcase and try to find out the reason.
> As a first shot, I would say that this comes from the vectorization of
> code in a loop and the overhead of transfer between scalar and
> vector registers.

This is good; as has been pointed out, we'll need to develop a
vectorization cost model for this kind of thing to really be successful,
and so we should start thinking about that.

The pass, as implemented, has a semi-implicit cost model which says
that permutations followed by another vector operation are free, scalar
-> vector transfers are free, and vectorizing a memory operation is just
as good as vectorizing an arithmetic operation. Depending on the system,
these may all be untrue (although on some systems they are true).

If you can generate a test case, that would be great; I'd like to look at
it.

>
> I would like to not stop you from committing the patch just because
> of performance issues: let's address any further improvements once
> the patch is installed on tot.

Sounds good to me.

Thanks again,
Hal

>
> Thanks again,
> Sebastian
> --
> Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
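The implicit cost model described in the message above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not code from the patch: the struct `vec_costs`, the function `net_benefit`, and all numeric costs are hypothetical, and the sketch simplifies by charging every permutation, whether or not another vector operation follows it.

```c
#include <assert.h>

/* Hypothetical per-target costs in arbitrary units (illustrative only,
 * not taken from the patch). */
struct vec_costs {
    int permute;          /* cost of one vector permutation */
    int scalar_to_vector; /* cost of one scalar <-> vector transfer */
    int mem_saving;       /* units saved by fusing a pair of memory ops */
    int arith_saving;     /* units saved by fusing a pair of arithmetic ops */
};

/* The implicit model from the message: permutations and transfers are
 * free, and a memory pair is worth as much as an arithmetic pair. */
static const struct vec_costs implicit_model = {0, 0, 1, 1};

/* An assumed target where transfers and permutations are not free. */
static const struct vec_costs example_costly_model = {1, 3, 1, 1};

/* Net benefit of vectorizing a group of candidate pairs: savings from
 * fused pairs minus the overhead of permutations and transfers.
 * A result <= 0 means vectorizing is not expected to pay off. */
static int net_benefit(const struct vec_costs *c, int mem_pairs,
                       int arith_pairs, int permutes, int transfers)
{
    assert(mem_pairs >= 0 && arith_pairs >= 0);
    assert(permutes >= 0 && transfers >= 0);
    return mem_pairs * c->mem_saving
         + arith_pairs * c->arith_saving
         - permutes * c->permute
         - transfers * c->scalar_to_vector;
}
```

Under `implicit_model` any group with at least one fused pair looks profitable, while under `example_costly_model` the same group can come out negative once it needs a few scalar/vector transfers, which is the kind of slowdown reported on ARM.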
Sebastian Pop
2012-Jan-26 20:34 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, Jan 24, 2012 at 6:41 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> enabling vectorization gets the performance down by 80% on ARM.
>> I will prepare a reduced testcase and try to find out the reason.
>> As a first shot, I would say that this comes from the vectorization of
>> code in a loop and the overhead of transfer between scalar and
>> vector registers.
>
> This is good; as has been pointed out, we'll need to develop a
> vectorization cost model for this kind of thing to really be successful,
> and so we should start thinking about that.
>
> The pass, as implemented, has a semi-implicit cost model which says
> that permutations followed by another vector operation are free, scalar
> -> vector transfers are free, and vectorizing a memory operation is just
> as good as vectorizing an arithmetic operation. Depending on the system,
> these may all be untrue (although on some systems they are true).
>
> If you can generate a test case that would be great, I'd like to look at
> it.

Here is the testcase with calls to gettimeofday to measure time spent
in the kernel and not in the init/fini phases.
On ARM I saw around a 5 to 6x slowdown in the vector version.
I haven't tried this on x86 yet, but that should also produce slowdowns,
as the cost of transfers between scalar and vector regs is nonzero there
as well.

Sebastian
--
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.c
Type: text/x-csrc
Size: 891 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120126/08013f8f/attachment.c>
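The attached test.c is not reproduced in the archive. As a rough sketch of the timing approach described above (gettimeofday calls placed around the kernel only, so setup and teardown are excluded), something like the following could be used. The names `elapsed_usec`, `time_kernel`, and `example_kernel` are hypothetical, and the kernel body is a stand-in, not the code from the attachment.

```c
#include <stddef.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples. */
static long long elapsed_usec(const struct timeval *start,
                              const struct timeval *end)
{
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}

/* Time only the kernel call; any initialization or cleanup done by the
 * caller before or after this function is not measured. */
static long long time_kernel(void (*kernel)(void))
{
    struct timeval start, end;
    gettimeofday(&start, NULL);
    kernel();
    gettimeofday(&end, NULL);
    return elapsed_usec(&start, &end);
}

/* Stand-in kernel (an assumption for illustration): a simple loop over
 * a volatile accumulator so the compiler cannot remove the work. */
static void example_kernel(void)
{
    volatile float acc = 0.0f;
    for (int i = 0; i < 100000; ++i)
        acc = acc + (float)i * 0.5f;
}
```

Calling `time_kernel(example_kernel)` in a scalar build and again in a vectorized build gives two wall-clock figures to compare. Because gettimeofday reports wall-clock time, repeating the measurement several times helps separate a real slowdown from noise.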
Hal Finkel
2012-Jan-26 20:49 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Thu, 2012-01-26 at 14:34 -0600, Sebastian Pop wrote:
> On Tue, Jan 24, 2012 at 6:41 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> enabling vectorization gets the performance down by 80% on ARM.
> >> I will prepare a reduced testcase and try to find out the reason.
> >> As a first shot, I would say that this comes from the vectorization of
> >> code in a loop and the overhead of transfer between scalar and
> >> vector registers.
> >
> > This is good; as has been pointed out, we'll need to develop a
> > vectorization cost model for this kind of thing to really be successful,
> > and so we should start thinking about that.
> >
> > The pass, as implemented, has a semi-implicit cost model which says
> > that permutations followed by another vector operation are free, scalar
> > -> vector transfers are free, and vectorizing a memory operation is just
> > as good as vectorizing an arithmetic operation. Depending on the system,
> > these may all be untrue (although on some systems they are true).
> >
> > If you can generate a test case that would be great, I'd like to look at
> > it.
>
> Here is the testcase with calls to gettimeofday to measure time spent
> in the kernel and not in the ini/fini phases.
> On ARM I saw around 5 to 6x slowdown in the vector version.
> I haven't tried this on x86 yet but that should also produce slowdowns
> as the cost between scalar and vector regs is non null there as well.

Thanks! Did you compile with any non-default flags other than -mllvm
-vectorize?

 -Hal

>
> Sebastian
> --
> Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory