thr3ads.net - llvm dev - [llvm-dev] Vectorizing remainder loop [Aug 2018]

If this information is useful, please help other people find it:
Share via:

Saito, Hideki via llvm-dev

2018-Aug-03 19:30 UTC

[llvm-dev] Vectorizing remainder loop

>it cannot afford large size masks for large vectors
So, even a standard way of vectorizing remainder in masked or unmasked fashion
wouldn’t work, I suppose. Ouch.

I suppose VPlan should be able to model this kind of gigantic remainder vector
code (when the time comes). Not pretty at all, though.

Now, be fully aware that Direction #2 is really a poor (or rather extremely
poor) person’s remainder loop vectorization approach. Nowhere close to do it
right. If you really have to go down that path…… the following would be the
basics steps.


1)      First, look at the comments in createVectorizedLoopSkeleton(). “old
scalar loop to handle remainder” is the one you want to vectorize again.
Near the end of that function, LoopScalarBody is set. This is the loop body of
interest. OrigLoop becomes the scalar Loop. First, try attaching
a new metadata (with VF info) to it.

2)      Create a RemainderVectorizePass, very similar to LoopVectorizePass, but
let it work only on the loops that has the metadata. Add it to
IPO/PassManagerBuilder
right after LoopVectorize. You could also try adding a state in LoopVectorize
instead of creating a RemainderVectorizePass. You can then read the metadata
and try to vectorize the loop with half the VF. This will result in another
remainder loop. Attach (or modify) the metadata with the halved VF.

3)      You may have to do some cleanup of primary IV between LoopVectorize and
RemainderVectorize. If LoopVectorizationLegality chokes in RemainderVectorize
(i.e., thinks the loop is not legal to vectorize), try resolving what it is
complaining. Legality should be the same between main vector loop and remainder
------ in theory.

4)      Once you get to that point, just run RemainderVectorize enough times,
until VF is small enough.

This shouldn’t require much knowledge on LoopVectorize if things go nice and
smooth.

If you opt to go with tweaking Ashutosh’s patch,

Ashutosh’s patch (https://reviews.llvm.org/D30247 ) is doing what I wrote above
within LoopVectorize (thus it can skip many things since data structures are
still alive), through createVectorEpilog(), instantiating InnerLoopVectorizer
inside InnerLoopVectorizer. That takes care of one remainder loop vectorization
step. You need to modify so that it can do more than one step (like using an
vector of EpilogLoopInfo). If you struggle in convincing
LoopVectorizationLegality to think remainder is just as legal to vectorize as
main vector loop, you should be able to avoid that if you take this approach.

You still have two approaches to unblock yourself in the short term.

Thanks,
Hideki

From: hameeza ahmed [mailto:hahmed2305 at gmail.com]
Sent: Friday, August 03, 2018 10:58 AM
To: Saito, Hideki <hideki.saito at intel.com>
Cc: Craig Topper <craig.topper at gmail.com>; Hal Finkel <hfinkel at
anl.gov>; Friedman, Eli <efriedma at codeaurora.org>; ashutosh.nema at
amd.com; llvm-dev at lists.llvm.org
Subject: Re: Vectorizing remainder loop

Thank You so much...
The hardware is designed already and it cannot afford large size masks for large
vectors. So I m opting for direction 2. Also I did try the patch but i was
getting some errors.

Can you please guide me how to proceed with direction 2?

Thank You
Regards

On Fri, Aug 3, 2018 at 3:28 AM, Saito, Hideki <hideki.saito at
intel.com<mailto:hideki.saito at intel.com>> wrote:

Hi Hameeza,

Aside from Ashutosh's patch.....

When the vector width is that large, we can't keep vectorizing remainder
like below. It'll be a huge code size if nothing else ---- hitting ITLB miss
because of this is very bad, for example.
        VF=2048 // main vector loop
        VF=1024 // vectorized remainder 1
        VF=512   // vectorized remainder 2
        ...
        Vectorize remainder until trip count is small enough for scalar
execution.

Direction #1
Does your HW support efficient masking? If so, the first thing to try is VF=2048
with masking so that you won't have any remainder loop. In other words, bump
up the trip count to the multiple of 2048 and then have an IF branch inside the
loop body so that beyond the original trip count is a no-op. Then vectorize that
loop.

        For (i=0;i<N;i++){
                body
        }
==>
        For (i=0;i<M;i++){ // where M is a multiple of 2048
                If (I < N) {
                        Body
                }
        }

If your HW can't execute vector version of the above loop efficiently
enough, it's already busted. Typically, when VF is that large, what
you'll get in the remainder is masked vector like below, and
vec_remainder_body is reasonably hot as you say in your original mail. As such,
remainder loop vectorization isn't a solution for that problem.

        for (i=0;i<N;i+=2048){
                Vec_body
        }
        for (i<M;i+=1024){ // where M is the smallest multiple of 1024 over N
                If (I < N) {
                        Vec_Remainder_Body
                }
        }

If your HW designers insist that the compiler to generate
        VF=2048 // main vector loop
        VF=1024 // vectorized remainder 1
        VF=512  // vectorized remainder 2
        ...
        Remainder is small enough for scalar.
I suggest you go back and tell them to reconsider the HW design such that the
Direction #1 works well enough on the HW.

Direction #2
In the meantime, if you are really stuck in the situation (i.e,, HW is already
built and you don't have much time), the simplest thing for you to do is to
run the LV second (third/fourth/...) time, after marking the remainder loop with
the metadata so that you know which loops you want to deal with in the second
round. It's very much of a hack but it'll be a small change you need to
make and that way you are not much impacted by other changes VPlan project is
making. If you have a major change outside of the trunk, you may be hit hard.

Direction #3
If you are given time to do the right implementation of remainder loop
vectorization, please join the VPlan bandwagon and work on it there. Major
development like this should happen on VPlans. Please let us know if you can do
that. Ashutosh, how about you?

Hopefully, one or more of the four alternative directions to consider, including
Ashutosh's patch, would work for you.

Thanks,
Hideki
-------------------
Date: Mon, 30 Jul 2018 05:16:15 +0000
From: "Nema, Ashutosh via llvm-dev" <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
To: hameeza ahmed <hahmed2305 at gmail.com<mailto:hahmed2305 at
gmail.com>>, Craig Topper
        <craig.topper at gmail.com<mailto:craig.topper at
gmail.com>>, Hal Finkel <hfinkel at anl.gov<mailto:hfinkel at
anl.gov>>, "Friedman,
        Eli" <efriedma at codeaurora.org<mailto:efriedma at
codeaurora.org>>
Cc: llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>
Subject: Re: [llvm-dev] Vectorizing remainder loop

Hi Hameeza,

At this point Loop Vectorizer does not have capability to vectorize
epilog/remainder loop.
Sometime back there is an RFC on epilog loop vectorization but it did not went
through because of concerns.
This RFC has a patch as well, maybe you can give a try with it.
http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none

- Ashutosh

From: llvm-dev <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces
at lists.llvm.org>> On Behalf Of hameeza ahmed via llvm-dev
Sent: Sunday, July 29, 2018 10:24 PM
To: llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>; Craig Topper <craig.topper at
gmail.com<mailto:craig.topper at gmail.com>>; Hal Finkel <hfinkel at
anl.gov<mailto:hfinkel at anl.gov>>; Friedman, Eli <efriedma at
codeaurora.org<mailto:efriedma at codeaurora.org>>
Subject: Re: [llvm-dev] Vectorizing remainder loop

Please help in solving this issue. the issue of scalar remainder loop is really
big and significant with large vector widths.

Please help

Thank You
On Sun, Jul 29, 2018 at 2:52 PM, hameeza ahmed <hahmed2305 at
gmail.com<mailto:hahmed2305 at gmail.com><mailto:hahmed2305 at
gmail.com<mailto:hahmed2305 at gmail.com>>> wrote:
Hello, I m working on a hardware with very large vector width till v2048. Now
when I vectorize using llvm default vectorizer maximum 2047 iterations are
scalar remainder loop. These are not vectorized by llvm which increases the
cost. However these should be vectorized using next available vector width I.e
v1024, v512, v256, v128, v64, v32, v16, v8, v4.....

The issue of scalar remainder loop has been there in llvm but this issue is
enhanced and can't be ignored with large vector width. This is very
important and significant to solve this issue.

Please help. I m trying to see loopvectorizer.cpp but unable to figure out
actual code to make changes.

It's very important for me to solve this issue.

Please help.

Thank you


-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180803/5484f9f0/attachment.html>

hameeza ahmed via llvm-dev

2018-Aug-03 21:36 UTC

head link

[llvm-dev] Vectorizing remainder loop

Thank You so much.

I plan to begin with the 2nd option i.e tweaking Ashutosh’s patch.
How should i begin working on this?
Please guide

On Sat, Aug 4, 2018 at 12:30 AM, Saito, Hideki <hideki.saito at intel.com>
wrote:
>
>
> >it cannot afford large size masks for large vectors
>
>
>
> So, even a standard way of vectorizing remainder in masked or unmasked
> fashion wouldn’t work, I suppose. Ouch.
>
>
>
> I suppose VPlan should be able to model this kind of gigantic remainder
> vector code (when the time comes). Not pretty at all, though.
>
>
>
> Now, be fully aware that Direction #2 is really a poor (or rather
> extremely poor) person’s remainder loop vectorization approach. Nowhere
> close to do it right. If you really have to go down that path…… the
> following would be the basics steps.
>
>
>
> 1)      First, look at the comments in createVectorizedLoopSkeleton().
> “old scalar loop to handle remainder” is the one you want to vectorize
> again.
> Near the end of that function, LoopScalarBody is set. This is the loop
> body of interest. OrigLoop becomes the scalar Loop. First, try attaching
> a new metadata (with VF info) to it.
>
> 2)      Create a RemainderVectorizePass, very similar to
> LoopVectorizePass, but let it work only on the loops that has the metadata.
> Add it to IPO/PassManagerBuilder
> right after LoopVectorize. You could also try adding a state in
> LoopVectorize instead of creating a RemainderVectorizePass. You can then
> read the metadata
> and try to vectorize the loop with half the VF. This will result in
> another remainder loop. Attach (or modify) the metadata with the halved VF.
>
> 3)      You may have to do some cleanup of primary IV between
> LoopVectorize and RemainderVectorize. If LoopVectorizationLegality chokes
> in RemainderVectorize
> (i.e., thinks the loop is not legal to vectorize), try resolving what it
> is complaining. Legality should be the same between main vector loop and
> remainder ------ in theory.
>
> 4)      Once you get to that point, just run RemainderVectorize enough
> times, until VF is small enough.
>
>
>
> This shouldn’t require much knowledge on LoopVectorize if things go nice
> and smooth.
>
>
>
> If you opt to go with tweaking Ashutosh’s patch,
>
>
>
> Ashutosh’s patch (https://reviews.llvm.org/D30247 ) is doing what I wrote
> above within LoopVectorize (thus it can skip many things since data
> structures are still alive), through createVectorEpilog(), instantiating
> InnerLoopVectorizer inside InnerLoopVectorizer. That takes care of one
> remainder loop vectorization step. You need to modify so that it can do
> more than one step (like using an vector of EpilogLoopInfo). If you
> struggle in convincing LoopVectorizationLegality to think remainder is just
> as legal to vectorize as main vector loop, you should be able to avoid that
> if you take this approach.
>
>
>
> You still have two approaches to unblock yourself in the short term.
>
>
>
> Thanks,
>
> Hideki
>
>
>
> *From:* hameeza ahmed [mailto:hahmed2305 at gmail.com]
> *Sent:* Friday, August 03, 2018 10:58 AM
> *To:* Saito, Hideki <hideki.saito at intel.com>
> *Cc:* Craig Topper <craig.topper at gmail.com>; Hal Finkel
<hfinkel at anl.gov>;
> Friedman, Eli <efriedma at codeaurora.org>; ashutosh.nema at amd.com;
> llvm-dev at lists.llvm.org
> *Subject:* Re: Vectorizing remainder loop
>
>
>
> Thank You so much...
>
> The hardware is designed already and it cannot afford large size masks for
> large vectors. So I m opting for direction 2. Also I did try the patch but
> i was getting some errors.
>
>
>
> Can you please guide me how to proceed with direction 2?
>
>
>
> Thank You
>
> Regards
>
>
>
> On Fri, Aug 3, 2018 at 3:28 AM, Saito, Hideki <hideki.saito at
intel.com>
> wrote:
>
>
> Hi Hameeza,
>
> Aside from Ashutosh's patch.....
>
> When the vector width is that large, we can't keep vectorizing
remainder
> like below. It'll be a huge code size if nothing else ---- hitting ITLB
> miss because of this is very bad, for example.
>         VF=2048 // main vector loop
>         VF=1024 // vectorized remainder 1
>         VF=512   // vectorized remainder 2
>         ...
>         Vectorize remainder until trip count is small enough for scalar
> execution.
>
> Direction #1
> Does your HW support efficient masking? If so, the first thing to try is
> VF=2048 with masking so that you won't have any remainder loop. In
other
> words, bump up the trip count to the multiple of 2048 and then have an IF
> branch inside the loop body so that beyond the original trip count is a
> no-op. Then vectorize that loop.
>
>         For (i=0;i<N;i++){
>                 body
>         }
> ==>
>         For (i=0;i<M;i++){ // where M is a multiple of 2048
>                 If (I < N) {
>                         Body
>                 }
>         }
>
> If your HW can't execute vector version of the above loop efficiently
> enough, it's already busted. Typically, when VF is that large, what
you'll
> get in the remainder is masked vector like below, and vec_remainder_body is
> reasonably hot as you say in your original mail. As such, remainder loop
> vectorization isn't a solution for that problem.
>
>         for (i=0;i<N;i+=2048){
>                 Vec_body
>         }
>         for (i<M;i+=1024){ // where M is the smallest multiple of 1024
> over N
>                 If (I < N) {
>                         Vec_Remainder_Body
>                 }
>         }
>
> If your HW designers insist that the compiler to generate
>         VF=2048 // main vector loop
>         VF=1024 // vectorized remainder 1
>         VF=512  // vectorized remainder 2
>         ...
>         Remainder is small enough for scalar.
> I suggest you go back and tell them to reconsider the HW design such that
> the Direction #1 works well enough on the HW.
>
> Direction #2
> In the meantime, if you are really stuck in the situation (i.e,, HW is
> already built and you don't have much time), the simplest thing for you
to
> do is to run the LV second (third/fourth/...) time, after marking the
> remainder loop with the metadata so that you know which loops you want to
> deal with in the second round. It's very much of a hack but it'll
be a
> small change you need to make and that way you are not much impacted by
> other changes VPlan project is making. If you have a major change outside
> of the trunk, you may be hit hard.
>
> Direction #3
> If you are given time to do the right implementation of remainder loop
> vectorization, please join the VPlan bandwagon and work on it there. Major
> development like this should happen on VPlans. Please let us know if you
> can do that. Ashutosh, how about you?
>
> Hopefully, one or more of the four alternative directions to consider,
> including Ashutosh's patch, would work for you.
>
> Thanks,
> Hideki
> -------------------
> Date: Mon, 30 Jul 2018 05:16:15 +0000
> From: "Nema, Ashutosh via llvm-dev" <llvm-dev at
lists.llvm.org>
> To: hameeza ahmed <hahmed2305 at gmail.com>, Craig Topper
>         <craig.topper at gmail.com>, Hal Finkel <hfinkel at
anl.gov>, "Friedman,
>         Eli" <efriedma at codeaurora.org>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: Re: [llvm-dev] Vectorizing remainder loop
>
> Hi Hameeza,
>
> At this point Loop Vectorizer does not have capability to vectorize
> epilog/remainder loop.
> Sometime back there is an RFC on epilog loop vectorization but it did not
> went through because of concerns.
> This RFC has a patch as well, maybe you can give a try with it.
> http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-
> Epilog-loop-vectorization-tt106322.html#none
>
> - Ashutosh
>
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of
hameeza
> ahmed via llvm-dev
> Sent: Sunday, July 29, 2018 10:24 PM
> To: llvm-dev <llvm-dev at lists.llvm.org>; Craig Topper <
> craig.topper at gmail.com>; Hal Finkel <hfinkel at anl.gov>;
Friedman, Eli <
> efriedma at codeaurora.org>
> Subject: Re: [llvm-dev] Vectorizing remainder loop
>
> Please help in solving this issue. the issue of scalar remainder loop is
> really big and significant with large vector widths.
>
> Please help
>
> Thank You
>
> On Sun, Jul 29, 2018 at 2:52 PM, hameeza ahmed <hahmed2305 at gmail.com
> <mailto:hahmed2305 at gmail.com>> wrote:
> Hello, I m working on a hardware with very large vector width till v2048.
> Now when I vectorize using llvm default vectorizer maximum 2047 iterations
> are scalar remainder loop. These are not vectorized by llvm which increases
> the cost. However these should be vectorized using next available vector
> width I.e v1024, v512, v256, v128, v64, v32, v16, v8, v4.....
>
> The issue of scalar remainder loop has been there in llvm but this issue
> is enhanced and can't be ignored with large vector width. This is very
> important and significant to solve this issue.
>
> Please help. I m trying to see loopvectorizer.cpp but unable to figure out
> actual code to make changes.
>
> It's very important for me to solve this issue.
>
> Please help.
>
> Thank you
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180804/e2f121ca/attachment.html>

Saito, Hideki via llvm-dev

2018-Aug-03 22:04 UTC

head link

[llvm-dev] Vectorizing remainder loop

I’m looking at his patch deeply enough for the first time. Please take that into
account.


1)      Start from making his patch work again.

2)      Study how EpilogVectorizationInfo is passed around.

First vectorization happens at Line 8013-8016 in the patch. It passes
EpilogVectorizationInfo with nullptr to EpilogLoopInfo.
Remainder vectorization happens at line 3400-3403 in createVectorEpilog(). This
one passes EpilogVectorizationInfo with newly created EpilogLoopInfo.
Notice that this second vectorization is called from
InnerLoopVectorizer::vectorize().


So, the call structure is already like below, which is actually what you need.
So, what you need to do is to check whether this recursion is working.



vectorize() // first instance of InnerLoopVectorizer

                createVectorEpilog()

                                vectorize()  // second instance of
InnerLoopVectorizer

                                                createVectorEpilog()

                                                                …


Hopefully, Ashutosh, the author of the patch, can help you more.



From: hameeza ahmed [mailto:hahmed2305 at gmail.com]
Sent: Friday, August 03, 2018 2:36 PM
To: Saito, Hideki <hideki.saito at intel.com>
Cc: Craig Topper <craig.topper at gmail.com>; Hal Finkel <hfinkel at
anl.gov>; Friedman, Eli <efriedma at codeaurora.org>; ashutosh.nema at
amd.com; llvm-dev at lists.llvm.org
Subject: Re: Vectorizing remainder loop

Thank You so much.

I plan to begin with the 2nd option i.e tweaking Ashutosh’s patch.
How should i begin working on this?
Please guide

On Sat, Aug 4, 2018 at 12:30 AM, Saito, Hideki <hideki.saito at
intel.com<mailto:hideki.saito at intel.com>> wrote:
>it cannot afford large size masks for large vectors
So, even a standard way of vectorizing remainder in masked or unmasked fashion
wouldn’t work, I suppose. Ouch.

I suppose VPlan should be able to model this kind of gigantic remainder vector
code (when the time comes). Not pretty at all, though.

Now, be fully aware that Direction #2 is really a poor (or rather extremely
poor) person’s remainder loop vectorization approach. Nowhere close to do it
right. If you really have to go down that path…… the following would be the
basics steps.


1)      First, look at the comments in createVectorizedLoopSkeleton(). “old
scalar loop to handle remainder” is the one you want to vectorize again.
Near the end of that function, LoopScalarBody is set. This is the loop body of
interest. OrigLoop becomes the scalar Loop. First, try attaching
a new metadata (with VF info) to it.

2)      Create a RemainderVectorizePass, very similar to LoopVectorizePass, but
let it work only on the loops that has the metadata. Add it to
IPO/PassManagerBuilder
right after LoopVectorize. You could also try adding a state in LoopVectorize
instead of creating a RemainderVectorizePass. You can then read the metadata
and try to vectorize the loop with half the VF. This will result in another
remainder loop. Attach (or modify) the metadata with the halved VF.

3)      You may have to do some cleanup of primary IV between LoopVectorize and
RemainderVectorize. If LoopVectorizationLegality chokes in RemainderVectorize
(i.e., thinks the loop is not legal to vectorize), try resolving what it is
complaining. Legality should be the same between main vector loop and remainder
------ in theory.

4)      Once you get to that point, just run RemainderVectorize enough times,
until VF is small enough.

This shouldn’t require much knowledge on LoopVectorize if things go nice and
smooth.

If you opt to go with tweaking Ashutosh’s patch,

Ashutosh’s patch (https://reviews.llvm.org/D30247 ) is doing what I wrote above
within LoopVectorize (thus it can skip many things since data structures are
still alive), through createVectorEpilog(), instantiating InnerLoopVectorizer
inside InnerLoopVectorizer. That takes care of one remainder loop vectorization
step. You need to modify so that it can do more than one step (like using an
vector of EpilogLoopInfo). If you struggle in convincing
LoopVectorizationLegality to think remainder is just as legal to vectorize as
main vector loop, you should be able to avoid that if you take this approach.

You still have two approaches to unblock yourself in the short term.

Thanks,
Hideki

From: hameeza ahmed [mailto:hahmed2305 at gmail.com<mailto:hahmed2305 at
gmail.com>]
Sent: Friday, August 03, 2018 10:58 AM
To: Saito, Hideki <hideki.saito at intel.com<mailto:hideki.saito at
intel.com>>
Cc: Craig Topper <craig.topper at gmail.com<mailto:craig.topper at
gmail.com>>; Hal Finkel <hfinkel at anl.gov<mailto:hfinkel at
anl.gov>>; Friedman, Eli <efriedma at codeaurora.org<mailto:efriedma
at codeaurora.org>>; ashutosh.nema at amd.com<mailto:ashutosh.nema at
amd.com>; llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
Subject: Re: Vectorizing remainder loop

Thank You so much...
The hardware is designed already and it cannot afford large size masks for large
vectors. So I m opting for direction 2. Also I did try the patch but i was
getting some errors.

Can you please guide me how to proceed with direction 2?

Thank You
Regards

On Fri, Aug 3, 2018 at 3:28 AM, Saito, Hideki <hideki.saito at
intel.com<mailto:hideki.saito at intel.com>> wrote:

Hi Hameeza,

Aside from Ashutosh's patch.....

When the vector width is that large, we can't keep vectorizing remainder
like below. It'll be a huge code size if nothing else ---- hitting ITLB miss
because of this is very bad, for example.
        VF=2048 // main vector loop
        VF=1024 // vectorized remainder 1
        VF=512   // vectorized remainder 2
        ...
        Vectorize remainder until trip count is small enough for scalar
execution.

Direction #1
Does your HW support efficient masking? If so, the first thing to try is VF=2048
with masking so that you won't have any remainder loop. In other words, bump
up the trip count to the multiple of 2048 and then have an IF branch inside the
loop body so that beyond the original trip count is a no-op. Then vectorize that
loop.

        For (i=0;i<N;i++){
                body
        }
==>
        For (i=0;i<M;i++){ // where M is a multiple of 2048
                If (I < N) {
                        Body
                }
        }

If your HW can't execute vector version of the above loop efficiently
enough, it's already busted. Typically, when VF is that large, what
you'll get in the remainder is masked vector like below, and
vec_remainder_body is reasonably hot as you say in your original mail. As such,
remainder loop vectorization isn't a solution for that problem.

        for (i=0;i<N;i+=2048){
                Vec_body
        }
        for (i<M;i+=1024){ // where M is the smallest multiple of 1024 over N
                If (I < N) {
                        Vec_Remainder_Body
                }
        }

If your HW designers insist that the compiler to generate
        VF=2048 // main vector loop
        VF=1024 // vectorized remainder 1
        VF=512  // vectorized remainder 2
        ...
        Remainder is small enough for scalar.
I suggest you go back and tell them to reconsider the HW design such that the
Direction #1 works well enough on the HW.

Direction #2
In the meantime, if you are really stuck in the situation (i.e,, HW is already
built and you don't have much time), the simplest thing for you to do is to
run the LV second (third/fourth/...) time, after marking the remainder loop with
the metadata so that you know which loops you want to deal with in the second
round. It's very much of a hack but it'll be a small change you need to
make and that way you are not much impacted by other changes VPlan project is
making. If you have a major change outside of the trunk, you may be hit hard.

Direction #3
If you are given time to do the right implementation of remainder loop
vectorization, please join the VPlan bandwagon and work on it there. Major
development like this should happen on VPlans. Please let us know if you can do
that. Ashutosh, how about you?

Hopefully, one or more of the four alternative directions to consider, including
Ashutosh's patch, would work for you.

Thanks,
Hideki
-------------------
Date: Mon, 30 Jul 2018 05:16:15 +0000
From: "Nema, Ashutosh via llvm-dev" <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
To: hameeza ahmed <hahmed2305 at gmail.com<mailto:hahmed2305 at
gmail.com>>, Craig Topper
        <craig.topper at gmail.com<mailto:craig.topper at
gmail.com>>, Hal Finkel <hfinkel at anl.gov<mailto:hfinkel at
anl.gov>>, "Friedman,
        Eli" <efriedma at codeaurora.org<mailto:efriedma at
codeaurora.org>>
Cc: llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>
Subject: Re: [llvm-dev] Vectorizing remainder loop

Hi Hameeza,

At this point Loop Vectorizer does not have capability to vectorize
epilog/remainder loop.
Sometime back there is an RFC on epilog loop vectorization but it did not went
through because of concerns.
This RFC has a patch as well, maybe you can give a try with it.
http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none

- Ashutosh

From: llvm-dev <llvm-dev-bounces at lists.llvm.org<mailto:llvm-dev-bounces
at lists.llvm.org>> On Behalf Of hameeza ahmed via llvm-dev
Sent: Sunday, July 29, 2018 10:24 PM
To: llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>; Craig Topper <craig.topper at
gmail.com<mailto:craig.topper at gmail.com>>; Hal Finkel <hfinkel at
anl.gov<mailto:hfinkel at anl.gov>>; Friedman, Eli <efriedma at
codeaurora.org<mailto:efriedma at codeaurora.org>>
Subject: Re: [llvm-dev] Vectorizing remainder loop

Please help in solving this issue. the issue of scalar remainder loop is really
big and significant with large vector widths.

Please help

Thank You
On Sun, Jul 29, 2018 at 2:52 PM, hameeza ahmed <hahmed2305 at
gmail.com<mailto:hahmed2305 at gmail.com><mailto:hahmed2305 at
gmail.com<mailto:hahmed2305 at gmail.com>>> wrote:
Hello, I m working on a hardware with very large vector width till v2048. Now
when I vectorize using llvm default vectorizer maximum 2047 iterations are
scalar remainder loop. These are not vectorized by llvm which increases the
cost. However these should be vectorized using next available vector width I.e
v1024, v512, v256, v128, v64, v32, v16, v8, v4.....

The issue of scalar remainder loop has been there in llvm but this issue is
enhanced and can't be ignored with large vector width. This is very
important and significant to solve this issue.

Please help. I m trying to see loopvectorizer.cpp but unable to figure out
actual code to make changes.

It's very important for me to solve this issue.

Please help.

Thank you


-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180803/6b4b2db9/attachment-0001.html>

llvm dev - Aug 2018 - Vectorizing remainder loop

[llvm-dev] Vectorizing remainder loop

[llvm-dev] Vectorizing remainder loop

[llvm-dev] Vectorizing remainder loop