thr3ads.net - llvm dev - [llvm-dev] Loop Distribution pass [Sep 2018]

Jonas Paulsson via llvm-dev

2018-Sep-13 08:21 UTC

[llvm-dev] Loop Distribution pass

Hi,

I found with the help of the optimization remarks a loop that could not 
be vectorized, but if loop distribution was enabled this may happen, 
which it in fact did with a very significant benchmark improvement (~25%).

I tried (on SystemZ) to enable this pass, and found that it only 
affected a handful of files on SPEC. This means I could enable this 
without worrying about any regressions on SystemZ at least currently.

I wonder if there is something more to know about this. It seems that no 
other target has enabled this due to general mixed results, or? Is this 
triggering much more on other targets, and if so, why?

/Jonas

Renato Golin via llvm-dev

2018-Sep-13 08:47 UTC

On Thu, 13 Sep 2018 at 09:22, Jonas Paulsson
<paulsson at linux.vnet.ibm.com> wrote:
> I found with the help of the optimization remarks a loop that could not
> be vectorized, but if loop distribution was enabled this may happen,
> which it in fact did with a very significant benchmark improvement (~25%).

Hi Jonas,

That's not surprising, given that LD only tries to enable
vectorisation. Performance improvements of course depend on the
target and on the quality of LLVM's lowering and further vectorisation.
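
As a rough illustration of the kind of loop this is about (a sketch only, not the loop from the benchmark): the first statement below carries a dependence from one iteration to the next, which blocks vectorisation of the whole loop, while the second statement is independent. The function and array names are made up, and the distributed version is written by hand to show roughly what the pass aims for, assuming the arrays do not overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original form. S1 reads the value written by the previous iteration, so the
// loop as a whole cannot be vectorised even though S2 would be fine alone.
// Assumes a.size() == c.size() + 1 and that b, d, e have c.size() elements.
void fused(std::vector<int> &a, const std::vector<int> &b,
           std::vector<int> &c, const std::vector<int> &d,
           const std::vector<int> &e) {
  for (std::size_t i = 0; i < c.size(); ++i) {
    a[i + 1] = a[i] + b[i]; // S1: loop-carried dependence through a
    c[i] = d[i] * e[i];     // S2: independent across iterations
  }
}

// Hand-distributed form: same results as long as the arrays do not overlap.
void distributed(std::vector<int> &a, const std::vector<int> &b,
                 std::vector<int> &c, const std::vector<int> &d,
                 const std::vector<int> &e) {
  for (std::size_t i = 0; i < c.size(); ++i)
    a[i + 1] = a[i] + b[i]; // stays scalar
  for (std::size_t i = 0; i < c.size(); ++i)
    c[i] = d[i] * e[i];     // now a candidate for vectorisation
}

// Small self-check that both forms compute the same values.
void check_distribution_example() {
  const std::vector<int> b{1, 2, 3, 4}, d{2, 3, 4, 5}, e{5, 6, 7, 8};
  std::vector<int> a1{1, 0, 0, 0, 0}, c1(4), a2 = a1, c2(4);
  fused(a1, b, c1, d, e);
  distributed(a2, b, c2, d, e);
  assert(a1 == a2 && c1 == c2);
}
```

After distribution only the second loop is a vectorisation candidate; the first one stays scalar, so the gain depends on how much of the work sits in the second loop.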

> I tried (on SystemZ) to enable this pass, and found that it only
> affected a handful of files on SPEC. This means I could enable this
> without worrying about any regressions on SystemZ at least currently.

IIUC, it's all about compile time. Loop distribution analysis is not
terribly complex, but does have a cost (see [1]).

I don't think it will have many regressions because it's *very*
conservative (see [2]), perhaps too much. Shouldn't be too much of a
problem for SystemZ, but I'd wait for others closer to the LD pass to
chime in, before taking any decision. :)

> I wonder if there is something more to know about this. It seems that no
> other target has enabled this due to general mixed results, or? Is this
> triggering much more on other targets, and if so, why?

I think it's mostly about the success rate, given it's too
conservative. But in the past 2 years, improvements in (and around)
the LV have been slowed down a bit due to the move to VPlan.

Actually, I imagine LD would be a great candidate to be a
VPlan-to-VPlan pass, so that it can be combined with others in the
cost analysis, given that it's mostly meant to enable loop
vectorisation.

Adding some VPlan folks in CC.

-- 
cheers,
--renato

[1] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109188.html
[2] http://lists.llvm.org/pipermail/llvm-dev/2016-October/105766.html

Saito, Hideki via llvm-dev

2018-Sep-13 17:02 UTC

Jonas/Renato,

> I think it's mostly about the success rate, given it's too
> conservative. But in the past 2 years, improvements in (and around)
> the LV have been slowed down a bit due to the move to VPlan.

It wasn't our intention to slow down LV improvements, but if the project
ended up causing other developers to take a wait-and-see stance, that's an
inevitable side effect of any infrastructure-level work. We welcome others to
work with us to move things faster. I hope everyone will see that the end
result is well worth the pain it has caused.

> Actually, I imagine LD would be a great candidate to be a
> VPlan-to-VPlan pass, so that it can be combined with others in the
> cost analysis, given that it's mostly meant to enable loop
> vectorisation.

There are other reasons why LD is good on its own, but I certainly agree that LD
shines more when it enables vectorization. From my perspective, however, there is
value in the standalone LD, and in many cases vectorization-oriented LD can
still happen there. Performing LD in VPlan-to-VPlan would improve the precision
of the cost modeling, but given that the vectorizer's cost model is
"ballpark"-based to begin with (we have a lot of optimizers running
downstream!), the extra precision would be worth only so much. I have been
thinking about moving the vectorizer's analysis part (all the way to the cost
model) into Analysis. When extra precision is desired, we can utilize such a
(heavier-weight) Analysis.

In short, my preference is to make the vectorizer's analysis more usable by
other xforms rather than making more and more loop xforms happen inside LV.

In the meantime, if those who are working on LD need our input in tuning the LD
cost model, I'm more than happy to pitch in. We can also discuss which parts
of the vectorizer analysis are helpful in LD at the same time.

Thanks,
Hideki

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org] 
Sent: Thursday, September 13, 2018 1:48 AM
To: Jonas Paulsson <paulsson at linux.vnet.ibm.com>
Cc: LLVM Dev <llvm-dev at lists.llvm.org>; Adam Nemet <anemet at apple.com>;
Sanjay Patel <spatel at rotateright.com>; Ulrich Weigand <ulrich.weigand at de.ibm.com>;
Saito, Hideki <hideki.saito at intel.com>; Zaks, Ayal <ayal.zaks at intel.com>;
Caballero, Diego <diego.caballero at intel.com>; Florian Hahn <florian.hahn at arm.com>
Subject: Re: Loop Distribution pass

Adam Nemet via llvm-dev

2018-Sep-19 17:26 UTC

> On Sep 13, 2018, at 1:21 AM, Jonas Paulsson <paulsson at linux.vnet.ibm.com> wrote:
>
> Hi,
>
> I found with the help of the optimization remarks a loop that could not be
> vectorized, but if loop distribution was enabled this may happen, which it in
> fact did with a very significant benchmark improvement (~25%).
>
> I tried (on SystemZ) to enable this pass, and found that it only affected a
> handful of files on SPEC. This means I could enable this without worrying about
> any regressions on SystemZ at least currently.
>
> I wonder if there is something more to know about this. It seems that no
> other target has enabled this due to general mixed results, or? Is this
> triggering much more on other targets, and if so, why?

The main thing that is missing from the pass right now is a serious analysis of
profitability as it affects instruction- and memory-level parallelism. The
easiest way to see this is that LD is the reverse transformation of loop fusion,
so where LF helps, LD may regress. MLP is the big one in my opinion; a
regression there would totally reverse any gains from vectorization.
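
To make the fusion analogy a bit more concrete, here is a toy sketch (names and the read counter are invented for illustration; nothing here comes from the pass itself). In the fused form the shared input `x` is read once per iteration and feeds both statements; after distribution it is traversed twice. The counter is only a crude proxy for memory traffic and does not model ILP, MLP or the prefetcher, which would need a real machine model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fused form: x[i] is read once per iteration and feeds both statements.
// Returns the number of reads of x. Assumes acc.size() == x.size() + 1.
std::size_t fused_reads(const std::vector<int> &x, const std::vector<int> &q,
                        std::vector<int> &acc, std::vector<int> &z) {
  std::size_t reads = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const int xi = x[i];
    ++reads;
    acc[i + 1] = acc[i] + xi; // recurrence: blocks vectorisation
    z[i] = xi * q[i];         // independent across iterations
  }
  return reads;
}

// Distributed form: the second loop can be vectorised, but x is read again.
std::size_t distributed_reads(const std::vector<int> &x,
                              const std::vector<int> &q,
                              std::vector<int> &acc, std::vector<int> &z) {
  std::size_t reads = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    acc[i + 1] = acc[i] + x[i];
    ++reads;
  }
  for (std::size_t i = 0; i < x.size(); ++i) {
    z[i] = x[i] * q[i];
    ++reads;
  }
  return reads;
}

// Same results either way, but twice as many reads of x after distribution.
void check_reads_example() {
  const std::vector<int> x{1, 2, 3, 4}, q{2, 2, 2, 2};
  std::vector<int> acc1(5, 0), z1(4), acc2(5, 0), z2(4);
  const std::size_t f = fused_reads(x, q, acc1, z1);
  const std::size_t d = distributed_reads(x, q, acc2, z2);
  assert(acc1 == acc2 && z1 == z2);
  assert(f == 4 && d == 8);
}
```

Whether the extra pass over `x` costs more than vectorising the second loop gains is exactly the profitability question; the sketch only shows that the trade-off exists.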

We would probably have to do something similar to what the SW prefetch insertion
pass does in order to analyze accesses that are likely to be skipped by the HW
prefetcher. Needless to say, this is a very micro-architecture-specific
analysis/cost model. If we can establish that ILP/MLP is unaffected, even in
the simplest cases, and vectorization is enabled, we could enable the
transformation by default (in addition to the pragma-driven approach we have now).
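
For reference, the pragma-driven approach looks roughly like this in source. `#pragma clang loop distribute(enable)` is the Clang loop hint; the function name and its parameters are made up for illustration, and the `__clang__` guard is only there so that other compilers skip the hint quietly. As far as I understand, the pass still applies its own legality checks, so the hint is a request rather than a guarantee.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel. a must have room for n + 1 elements and the arrays
// are assumed not to overlap.
void prefix_and_scale(int *a, const int *b, int *c, const int *d,
                      const int *e, std::size_t n) {
#if defined(__clang__)
#pragma clang loop distribute(enable)
#endif
  for (std::size_t i = 0; i < n; ++i) {
    a[i + 1] = a[i] + b[i]; // loop-carried dependence
    c[i] = d[i] * e[i];     // independent, vectorisable once split off
  }
}

// The hint must not change results; check a small case.
void check_pragma_example() {
  int a[5] = {1, 0, 0, 0, 0};
  const int b[4] = {1, 2, 3, 4}, d[4] = {2, 3, 4, 5}, e[4] = {5, 6, 7, 8};
  int c[4] = {0, 0, 0, 0};
  prefix_and_scale(a, b, c, d, e, 4);
  assert(a[4] == 11);
  assert(c[0] == 10 && c[3] == 40);
}
```

As far as I know, the pass can also be switched on for a whole translation unit with `-mllvm -enable-loop-distribute`, and `-Rpass=loop-distribute` reports the loops where it fires.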

Adam

Jonas Paulsson via llvm-dev

2018-Sep-20 16:11 UTC

Hi Adam,


On 2018-09-19 19:26, Adam Nemet wrote:
>> On Sep 13, 2018, at 1:21 AM, Jonas Paulsson <paulsson at linux.vnet.ibm.com> wrote:
>>
>> Hi,
>>
>> I found with the help of the optimization remarks a loop that could not
>> be vectorized, but if loop distribution was enabled this may happen, which
>> it in fact did with a very significant benchmark improvement (~25%).
>>
>> I tried (on SystemZ) to enable this pass, and found that it only
>> affected a handful of files on SPEC. This means I could enable this without
>> worrying about any regressions on SystemZ at least currently.
>>
>> I wonder if there is something more to know about this. It seems that
>> no other target has enabled this due to general mixed results, or? Is this
>> triggering much more on other targets, and if so, why?
>
> The main thing that is missing from the pass right now is a serious
> analysis of profitability as it affects instruction- and memory-level
> parallelism. The easiest way to see this is that LD is the reverse
> transformation of loop fusion, so where LF helps, LD may regress. MLP is the
> big one in my opinion; a regression there would totally reverse any gains
> from vectorization.
>
> We would probably have to do something similar to what the SW prefetch
> insertion pass does in order to analyze accesses that are likely to be skipped
> by the HW prefetcher. Needless to say, this is a very
> micro-architecture-specific analysis/cost model. If we can establish that
> ILP/MLP is unaffected, even in the simplest cases, and vectorization is
> enabled, we could enable the transformation by default (in addition to the
> pragma-driven approach we have now).

Thanks for the reply.

Since this is today extremely conservative and nearly never triggers, at 
least on SystemZ, while still being very beneficial when it does happen, 
it seems that this could be used as-is now on SystemZ with a new TTI 
hook to enable it selectively per target.
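
A sketch of what such a hook could look like. To be clear, this is not the existing TargetTransformInfo interface: the class names, `enableLoopDistribution()` and `shouldDistribute()` are all invented here, and the stand-in classes only mimic the usual pattern of a conservative default that a target overrides.

```cpp
#include <cassert>

// Stand-in for a target-information interface. Hypothetical: this is NOT the
// real llvm::TargetTransformInfo, only the same "default plus override" shape.
struct TargetInfoSketch {
  virtual ~TargetInfoSketch() = default;
  // Invented hook name. Conservative default: targets opt in explicitly.
  virtual bool enableLoopDistribution() const { return false; }
};

// A target that has measured the pass and decided to opt in.
struct SystemZInfoSketch : TargetInfoSketch {
  bool enableLoopDistribution() const override { return true; }
};

// What the user asked for on this loop, e.g. through the pragma.
enum class UserHint { None, Enable, Disable };

// Invented gate: an explicit user hint wins, otherwise the target decides.
bool shouldDistribute(const TargetInfoSketch &TI, UserHint Hint) {
  if (Hint == UserHint::Enable)
    return true;
  if (Hint == UserHint::Disable)
    return false;
  return TI.enableLoopDistribution();
}

// Expected behaviour of the gate for a generic target and for SystemZ.
void check_hook_example() {
  TargetInfoSketch Generic;
  SystemZInfoSketch SystemZ;
  assert(!shouldDistribute(Generic, UserHint::None));
  assert(shouldDistribute(SystemZ, UserHint::None));
  assert(shouldDistribute(Generic, UserHint::Enable));
  assert(!shouldDistribute(SystemZ, UserHint::Disable));
}
```

The point of routing the decision through one gate is that other targets keep today's behaviour unless they override the hook.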

The question now is whether this is a wise idea. Do you think the Loop
Distribution pass will change significantly, so that it triggers much
more often, which could then cause regressions on SystemZ? If so, perhaps
the idea for now is that nobody enables it by default until some
reasonable initial cost modeling has been done?

/Jonas
