thr3ads.net - llvm dev - [llvm-dev] enabling interleaved access loop vectorization [Aug 2016]

Michael Kuperstein via llvm-dev

2016-Aug-09 18:25 UTC

[llvm-dev] enabling interleaved access loop vectorization

Thanks Ayal!

I'll take a look at DENBench.

As another data point - I tried enabling this on our internal benchmarks.
I'm seeing one regression, and it seems to be a regression of the "good"
kind - without interleaving we don't vectorize the innermost loop, and with
interleaving we do. The vectorized loop is actually significantly faster
when benchmarked in isolation, but in this specific instance, the static
loop count is unknown, and the dynamic loop count happens to almost always
be 1 - and this lives inside a hot outer loop.
That's something we ought to be handling through PGO (or, conceivably,
outer loop vectorization :-) ).

Michael
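
[Editor's sketch] The shape of the regression described above can be illustrated with a small C example. This is not the internal benchmark, which is not shown in this thread; the function names, the stride-2 layout and the per-row counts are all assumptions made for illustration. The inner loop reads in[2*i] and in[2*i + 1], which together form an interleaved access group, and its trip count is only known at run time.

```c
#include <stddef.h>

/* Hypothetical kernel: pairs are stored interleaved as {a0, b0, a1, b1, ...}.
 * The loads in[2*i] and in[2*i + 1] form a stride-2 interleaved group.
 * n is a runtime value, so the compiler cannot tell statically whether
 * a vectorized version of this loop will ever execute its vector body. */
void sum_pairs(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[2 * i] + in[2 * i + 1];
}

/* Hypothetical hot outer loop: row r holds counts[r] pairs. When counts[r]
 * is almost always 1, a vectorized inner loop pays for its trip-count check
 * and falls through to scalar code on nearly every call. */
void sum_rows(const int *in, int *out, const size_t *counts, size_t rows) {
    size_t offset = 0; /* number of pairs consumed so far */
    for (size_t r = 0; r < rows; ++r) {
        sum_pairs(in + 2 * offset, out + offset, counts[r]);
        offset += counts[r];
    }
}
```

Benchmarked alone with a large n, sum_pairs would be expected to benefit from vectorization; called as in sum_rows with counts of 1, the extra checks are pure overhead. Profile data is one way the compiler could learn the typical trip count.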

On Mon, Aug 8, 2016 at 3:21 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:
> > We also need to understand what to do with edge elements in the vector
> > if their loading is not required. We, probably, should issue a masked load
> > in this case.
>
> The existing code solves such edge cases where the last element of an
> InterleaveGroup is absent by making sure the last iteration (and up to last
> VF iterations) are peeled and executed scalarly; see requiresScalarEpilogue.
>
> > All regressions that we see are in 32-bit mode.
>
> One place to find them, using the default BaseT::getInterleavedMemoryOpCost(),
> is DENBench’s RGB conversions.
>
> Ayal.
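
[Editor's sketch] The scalar-epilogue idea Ayal describes can be shown with a hand-written C stand-in. This is not LLVM's implementation; only the name requiresScalarEpilogue comes from the message above. The vector width VF, the interleave factor, the function names and the use of memcpy to model a wide load are assumptions. The loop reads members 0 and 1 of each group of three, so member 2 (the last element of the group) is absent and the buffer may end right after the final member 1.

```c
#include <stddef.h>
#include <string.h>

enum { VF = 4, FACTOR = 3 }; /* assumed vector width and interleave factor */

/* Scalar reference: uses members 0 and 1 of every group of 3 floats;
 * member 2 is never read, so `in` only needs 3*n - 1 elements. */
void add_xy_scalar(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[FACTOR * i] + in[FACTOR * i + 1];
}

/* Stand-in for a vectorized loop. Each wide load covers VF whole groups,
 * including the unused member 2 of the last group in the chunk. To keep
 * that load inside the buffer, the wide loop stops while at least one
 * iteration remains; the remaining 1..VF iterations run as scalar code. */
void add_xy_wide(const float *in, float *out, size_t n) {
    size_t i = 0;
    for (; i + VF < n; i += VF) {
        float wide[VF * FACTOR];
        memcpy(wide, in + FACTOR * i, sizeof wide); /* models one wide load */
        for (size_t lane = 0; lane < VF; ++lane)
            out[i + lane] = wide[FACTOR * lane] + wide[FACTOR * lane + 1];
    }
    for (; i < n; ++i) /* scalar epilogue */
        out[i] = in[FACTOR * i] + in[FACTOR * i + 1];
}
```

Had the wide loop used the usual bound i + VF <= n, its last wide load could touch in[3*n - 1], one element past a buffer of 3*n - 1 floats. A masked load is the alternative raised in the quoted question.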
>
> *From:* Demikhovsky, Elena
> *Sent:* Monday, August 08, 2016 00:09
> *To:* Michael Kuperstein <mkuper at google.com>; Renato Golin
> <renato.golin at linaro.org>
> *Cc:* Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh
> <Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>;
> llvm-dev <llvm-dev at lists.llvm.org>; Zaks, Ayal <ayal.zaks at intel.com>
> *Subject:* RE: [llvm-dev] enabling interleaved access loop vectorization
>
> We checked the gathered data again. All regressions that we see are in
> 32-bit mode. The 64-bit mode looks good overall.
>
> - Elena
>
> *From:* Michael Kuperstein <mkuper at google.com>
> *Sent:* Saturday, August 06, 2016 02:56
> *To:* Renato Golin <renato.golin at linaro.org>
> *Cc:* Demikhovsky, Elena <elena.demikhovsky at intel.com>; Matthew Simpson
> <mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>;
> Sanjay Patel <spatel at rotateright.com>; llvm-dev
> <llvm-dev at lists.llvm.org>; Zaks, Ayal <ayal.zaks at intel.com>
> *Subject:* Re: [llvm-dev] enabling interleaved access loop vectorization
>
> On Fri, Aug 5, 2016 at 4:37 PM, Renato Golin <renato.golin at linaro.org>
> wrote:
>
> > On 6 August 2016 at 00:18, Michael Kuperstein <mkuper at google.com> wrote:
> > > I agree that we can get *more* improvement with better cost modeling, but
> > > I'd expect to be able to get *some* improvement the way things are right
> > > now.
> >
> > Elena said she saw "some" improvements. :)
>
> I didn't mean "some improvements, some regressions", I meant "some of the
> improvement we'd expect from the full solution". :-)
>
> > > That's why I'm curious about where we saw regressions - I'm wondering
> > > whether there's really a significant cost modeling issue I'm missing, or
> > > it's something that's easy to fix so that we can make forward progress,
> > > while Ashutosh is working on the longer-term solution.
> >
> > Sounds like a task to try a few patterns and fiddle with the cost model.
> >
> > Arnold did a lot of those during the first months of the vectorizer,
> > so it might be just a matter of finding the right heuristics, at least
> > for the low hanging fruits.
> >
> > Of course, that'd also involve benchmarking everything else, to make
> > sure the new heuristics doesn't introduce regressions on
> > non-interleaved vectorisation.
>
> I don't disagree with you.
>
> All I'm saying is that before fiddling with the heuristics, it'd be good
> to understand what exactly breaks if we simply flip the flag. If the answer
> happens to be "nothing" - well, problem solved. Unfortunately, according to
> Elena, that's not the answer.
>
> I'm going to play with it with our internal benchmarks, but it's my
> understanding that Elena/Ayal already have some idea of what the problems
> are.

Michael Kuperstein via llvm-dev

2016-Aug-10 23:32 UTC

So, unfortunately, it turns out I don't have access to DENBench.

Do you happen to have a reduced example that gets pessimized by this?


Michael Kuperstein via llvm-dev

2016-Aug-16 21:51 UTC

Hi Ayal, Elena,

I'd really like to enable this by default.

As I wrote above, I didn't see any regressions in internal benchmarks, and
there doesn't seem to be anything in SPEC2006 either. I do see a
performance improvement in an internal benchmark (that is, a real
workload).

Would you be able to provide an example that gets pessimized? I have no
doubt you've seen regressions related to this, but the fact they exist
doesn't help me analyze them as long as I can't see them. :-) I'd really
rather look at regressions before making the change - and either try to
make the necessary improvements to the cost model, or abandon this as
unfeasible for now (pending Ashutosh's work).

If you can't, an alternative is to turn this on, and then, if regressions
show up on anyone's radar (where we can actually get a reproducer), turn it
off again and go back to analysis. But I'd strongly prefer to "prefetch"
the problem.

Thanks,
  Michael
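
[Editor's sketch] The only concrete lead given in the thread for a reproducer is the RGB conversions in DENBench. The loop below is not DENBench code and is not known to regress; it is a hypothetical kernel with the same stride-3 access shape, which might serve as a starting point for experiments. The function name and the fixed-point weights are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical RGB-to-gray conversion over packed 8-bit RGB pixels.
 * The loads rgb[3*i], rgb[3*i + 1] and rgb[3*i + 2] form a complete
 * stride-3 interleaved group (no gaps, so no member is absent). */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        unsigned r = rgb[3 * i];
        unsigned g = rgb[3 * i + 1];
        unsigned b = rgb[3 * i + 2];
        /* Fixed-point luma approximation; the weights 77, 150 and 29
         * sum to 256, so the result always fits in 8 bits. */
        gray[i] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
    }
}
```

Building such a loop for a 32-bit x86 target with and without interleaved-access vectorization would show whether this shape is affected. That assumes the switch under discussion is LoopVectorize's hidden -enable-interleaved-mem-accesses option, passed to clang through -mllvm.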




