thr3ads.net - llvm dev - [LLVMdev] Vectorization: Next Steps [Feb 2012]

Hal Finkel

2012-Feb-13 16:38 UTC

[LLVMdev] Vectorization: Next Steps

On Mon, 2012-02-13 at 11:11 +0100, Carl-Philip Hänsch wrote:
> I will test your suggestion, but I designed the test case to load the
> memory directly into <4 x float> registers. So there is absolutely no
> permutation and other swizzle or move operations. Maybe the heuristic
> should not only count the depth but also the surrounding load/store
> operations.

I've attached two variants of your file, both of which vectorize as you'd
expect. The core difference between these and your original file is that
I added the 'restrict' keyword so that the compiler can assume that the
arrays don't alias (or, in the first case, I made them globals). You
also probably need to specify some alignment information; otherwise the
memory operations will be scalarized in codegen.

 -Hal
> 
> Are the load/store operations vectorized, too? (I designed the test
> case to completely fit the SSE registers)
> 
> 2012/2/10 Hal Finkel <hfinkel at anl.gov>
>         Carl-Philip,
>         
>         The reason that this does not vectorize is that it cannot
>         vectorize the stores; this leaves only the mul-add chains (and
>         some chains with loads), and they only have a depth of 2 (the
>         threshold is 6).
>         
>         If you give clang -mllvm -bb-vectorize-req-chain-depth=2 then it
>         will vectorize. The reason the heuristic has such a large default
>         value is to prevent cases where it costs more to permute all of
>         the necessary values into and out of the vector registers than is
>         saved by vectorizing. Does the code generated with
>         -bb-vectorize-req-chain-depth=2 run faster than the unvectorized
>         code?
>         
>         The heuristic can certainly be improved, and these kinds of test
>         cases are very important to that improvement process.
>         
>          -Hal
>         
>         On Thu, 2012-02-09 at 13:27 +0100, Carl-Philip Hänsch wrote:
>         > I have a super-simple test case, 4x4 matrix * 4-vector, which
>         > gets correctly unrolled, but is not vectorized by -bb-vectorize.
>         > (I used llvm 3.1svn)
>         > I attached the test case so you can see what is going wrong
>         > there.
>         >
>         > 2012/2/3 Hal Finkel <hfinkel at anl.gov>
>         >         As some of you may know, I committed my basic-block
>         >         autovectorization pass a few days ago. I encourage anyone
>         >         interested to try it out (pass -vectorize to opt or
>         >         -mllvm -vectorize to clang) and provide feedback.
>         >         Especially in combination with -unroll-allow-partial, I
>         >         have observed some significant benchmark speedups, but I
>         >         have also observed some significant slowdowns. I would
>         >         like to share my thoughts, and hopefully get feedback, on
>         >         next steps.
>         >
>         >         1. "Target Data" for vectorization - I think that in order
>         >         to improve the vectorization quality, the vectorizer will
>         >         need more information about the target. This information
>         >         could be provided in the form of a kind of extended target
>         >         data. This extended target data might contain:
>         >          - What basic types can be vectorized, and how many of
>         >         them will fit into (the largest) vector registers
>         >          - What classes of operations can be vectorized (division,
>         >         conversions / sign extension, etc. are not always
>         >         supported)
>         >          - What alignment is necessary for loads and stores
>         >          - Is scalar-to-vector free?
>         >
>         >         2. Feedback between passes - We may need to implement a
>         >         closer coupling between optimization passes than currently
>         >         exists. Specifically, I have in mind two things:
>         >          - The vectorizer should communicate more closely with the
>         >         loop unroller. First, the loop unroller should try to
>         >         unroll to preserve maximal load/store alignments. Second,
>         >         I think it would make a lot of sense to be able to unroll
>         >         and, only if this helps vectorization, keep the unrolled
>         >         version in preference to the original. With basic-block
>         >         vectorization, it is often necessary to (partially) unroll
>         >         in order to vectorize. Even when we also have real loop
>         >         vectorization, however, I still think that it will be
>         >         important for the loop unroller to communicate with the
>         >         vectorizer.
>         >          - After vectorization, it would make sense for the
>         >         vectorization pass to request further simplification, but
>         >         only on those parts of the code that it modified.
>         >
>         >         3. Loop vectorization - It would be nice to have, in
>         >         addition to basic-block vectorization, a more-traditional
>         >         loop vectorization pass. I think that we'll need a better
>         >         loop analysis pass in order for this to happen. Some of
>         >         this was started in LoopDependenceAnalysis, but that pass
>         >         is not yet finished. We'll need something like this to
>         >         recognize affine memory references, etc.
>         >
>         >         I look forward to hearing everyone's thoughts.
>         >
>         >          -Hal
>         >
>         >         --
>         >         Hal Finkel
>         >         Postdoctoral Appointee
>         >         Leadership Computing Facility
>         >         Argonne National Laboratory
>         >
>         >         _______________________________________________
>         >         LLVM Developers mailing list
>         >         LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>         >         http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>         >
>         
> 
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: matrix2.c
Type: text/x-csrc
Size: 424 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120213/00c55781/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: matrix3.c
Type: text/x-csrc
Size: 480 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120213/00c55781/attachment-0001.c>
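
The two attachments were scrubbed from the archive, so their contents are
not known. Purely as an illustration of the advice in the message above, the
following C sketch shows a 4x4 matrix times 4-vector kernel that uses
'restrict'-qualified pointers and 16-byte-aligned globals. The names
matvec4, matvec4_globals, g_matrix, g_vector and g_result, the column-major
layout, and the 16-byte alignment figure are all assumptions made for this
sketch; none of them is taken from matrix2.c or matrix3.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical globals with explicit alignment (C11 _Alignas).
   16 bytes is assumed here because that is the width of an SSE register. */
_Alignas(16) float g_matrix[16];   /* column-major 4x4 matrix (assumed layout) */
_Alignas(16) float g_vector[4];
_Alignas(16) float g_result[4];

/* 'restrict' promises the compiler that out, m and v never overlap,
   so a store to out[] cannot change a later load from m[] or v[]. */
void matvec4(float *restrict out, const float *restrict m,
             const float *restrict v)
{
    /* Check the alignment assumption of this sketch at run time. */
    assert((uintptr_t)(const void *)out % 16u == 0);
    assert((uintptr_t)(const void *)m % 16u == 0);
    assert((uintptr_t)(const void *)v % 16u == 0);

    /* Column-major: column j of the matrix starts at m[4 * j].
       Each output element is an independent multiply-add chain. */
    out[0] = m[0] * v[0] + m[4] * v[1] + m[8]  * v[2] + m[12] * v[3];
    out[1] = m[1] * v[0] + m[5] * v[1] + m[9]  * v[2] + m[13] * v[3];
    out[2] = m[2] * v[0] + m[6] * v[1] + m[10] * v[2] + m[14] * v[3];
    out[3] = m[3] * v[0] + m[7] * v[1] + m[11] * v[2] + m[15] * v[3];
}

/* Variant that works on the globals: distinct global arrays are
   distinct objects, so they cannot overlap one another. */
void matvec4_globals(void)
{
    matvec4(g_result, g_matrix, g_vector);
}
```

Whether a given compiler version actually vectorizes this kernel is not
something the sketch can promise; it only shows the two ingredients named in
the message, no-alias information and alignment information, in source form.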

Carl-Philip Hänsch

2012-Feb-14 16:10 UTC

I tested the 'restrict' keyword and it works well :)

The generated code is a bunch of shufflevector instructions, but after a
second -O3 pass, everything looks fine.
This problem is described in my ML post "passes propose passes" and occurs
here again. LLVM has so many great passes, but they cannot run again once
the code has been simplified :(
Maybe that's one more reason to tell the pass scheduler to redo some passes
to find all optimizations. The code really simplifies to what I expected.


Hal Finkel

2012-Feb-14 16:15 UTC

If you run with -vectorize instead of -bb-vectorize it will schedule the cleanup
passes for you.

 -Hal
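
Point 2 of the original proposal, quoted earlier in the thread, notes that
basic-block vectorization often needs a (partially) unrolled loop to work
on. As a hand-written illustration of that idea, the C sketch below puts a
rolled loop next to a version unrolled by four, where the four statements in
the loop body are independent and sit in one basic block. The function names
axpy_rolled and axpy_unrolled4, the choice of kernel, and the unroll factor
are assumptions made for this sketch; in practice the unrolling would be
done by the compiler (for example with -unroll-allow-partial, as mentioned
in the thread), not by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: the loop body holds a single multiply-add, so one basic
   block offers nothing to pair up. */
void axpy_rolled(float *restrict y, const float *restrict x,
                 float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Unrolled by 4: the body now holds four independent multiply-adds on
   adjacent elements, which is the shape a basic-block vectorizer can
   consider combining into <4 x float> operations. */
void axpy_unrolled4(float *restrict y, const float *restrict x,
                    float a, size_t n)
{
    /* This sketch has no remainder loop, so n must be a multiple of 4. */
    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
        y[i + 0] = a * x[i + 0] + y[i + 0];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
}
```

Both functions compute the same result; the unrolled form only changes how
much independent work is visible inside a single basic block.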

