Daniel Berlin
2015-Jan-12 06:35 UTC
[LLVMdev] Separating loop nests based on profile information?
> Looking at what LLVM does, the failing on the PRE side is that our
> PRE/GVN models are not strong enough to handle this. I'm not sure at
> all why we think anything else is necessary.

By this I mean we don't actually perform real PRE for loads. It's a
known deficiency, and one that does not require any duplicate heuristics
to solve.

Right now we perform some availability-based removal that happens to be
willing to do an insertion here or there to try to move a load in some
cases (it will only do a single insertion). It's also not based on real
availability, but on finding the nearest dependencies of loads and
squinting at them (which is not even really the same as finding where a
load stops being available), funnily enough.

It's also not real PRE, just some code that is willing to insert one
load to move a load.

If you want this case to work, what you need is a real load PRE
implementation, not one based on simple memdep-based load moving.

Trying to hack this one up with profile-guided code duplication, etc.,
to get it to kinda sorta work seems to me to be a bad idea.

> It's certainly not requiring special code duplication heuristics, etc.
>
> So either you are thinking of a case that is different from the above,
> or I am seriously confused :)
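To make the "insert one load to move a load" behavior concrete, it looks roughly like the C sketch below. This is a hand-written illustration under assumed names (struct S, other()), not LLVM's actual output.

/* Before: the load of p->x after the merge is partially redundant.
 * It is already available on the 'then' path, but not on the 'else'
 * path, where other(p) may clobber it. */
struct S { int x; };
void other(struct S *p);

int before(struct S *p, int c) {
    int a = 0;
    if (c)
        a = p->x;         /* p->x loaded here        */
    else
        other(p);         /* ...but not on this path */
    return a + p->x;      /* partially redundant load */
}

/* After: one load is inserted on the path that did not already have
 * the value, and the later, now fully redundant load is replaced by
 * the merged value. */
int after(struct S *p, int c) {
    int a = 0, t;
    if (c) {
        t = p->x;
        a = t;
    } else {
        other(p);
        t = p->x;         /* the single inserted load */
    }
    return a + t;         /* original load removed    */
}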
Philip Reames
2015-Jan-12 18:08 UTC
[LLVMdev] Separating loop nests based on profile information?
On 01/11/2015 10:35 PM, Daniel Berlin wrote:
>> Looking at what LLVM does, the failing on the PRE side is that our
>> PRE/GVN models are not strong enough to handle this. I'm not sure at
>> all why we think anything else is necessary.
>
> By this I mean we don't actually perform real PRE for loads. It's a
> known deficiency, and one that does not require any duplicate
> heuristics to solve.
>
> Right now we perform some availability-based removal that happens to
> be willing to do an insertion here or there to try to move a load in
> some cases (it will only do a single insertion). It's also not based
> on real availability, but on finding the nearest dependencies of loads
> and squinting at them (which is not even really the same as finding
> where a load stops being available), funnily enough.
>
> It's also not real PRE, just some code that is willing to insert one
> load to move a load.

Er, I think we have a terminology problem here. From what I can gather,
our current implementation *is* an implementation of classic PRE. It's
more or less directly organized around the classic "available" and
"anticipated" concepts from the literature. The *implementation* has
some limitations, but the conceptual framework is there*. One of the
key invariants in classic PRE is that you do not change program
behaviour on any path. As a result, moving (or at most duplicating)
loads is pretty much all that happens.

* It took me a few hours of squinting at the code to recognize that
weird backwards walk as being an implementation of "anticipated", but
it is. The "available" part comes from MemoryDependenceAnalysis and the
forward dataflow algorithm based on those inputs.

If you want to start inserting loads on paths which didn't previously
have them - for example the loop preheader in your example in the
previous email - this is not classic PRE as I'm familiar with it. It
can be very profitable and worthwhile, but it is not classic PRE. I'm
not sure what this variant is known as in the literature.

(If I'm being overly pedantic here, or just flat out wrong, please say
so. I'm still exploring related material and am trying to categorize
things in my own head.)

I think we're actually in reasonable agreement that we do need to move
beyond 'classic PRE' as I called it above. If you have specific
suggestions, positive or negative examples, or other thoughts to share,
I'd very much value the input.

> If you want this case to work, what you need is a real load PRE
> implementation, not one based on simple memdep-based load moving.
>
> Trying to hack this one up with profile-guided code duplication, etc.,
> to get it to kinda sorta work seems to me to be a bad idea.

Not sure I really agree here, but I'm open to pushing as far as we can
with static heuristics first. I suspect that's likely to be more stable
w.r.t. optimization effect and applicable to more cases.

> It's certainly not requiring special code duplication heuristics, etc.
>
> So either you are thinking of a case that is different from the above,
> or I am seriously confused :)
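For reference, the "available" and "anticipated" sets referred to above are, in their textbook (Morel-Renvoise style) formulation, the solutions of the two bit-vector dataflow problems below. This is a sketch of the standard equations from the literature, not a description of GVN's actual code; GEN, ANTGEN, and KILL are the usual per-block local sets.

\begin{align*}
\mathrm{AVAIL}_{\mathrm{in}}(b)  &= \bigcap_{p \in \mathrm{pred}(b)} \mathrm{AVAIL}_{\mathrm{out}}(p)
  && \text{(forward; } \mathrm{AVAIL}_{\mathrm{in}}(\mathrm{entry}) = \emptyset\text{)} \\
\mathrm{AVAIL}_{\mathrm{out}}(b) &= \mathrm{GEN}(b) \cup \bigl(\mathrm{AVAIL}_{\mathrm{in}}(b) \setminus \mathrm{KILL}(b)\bigr) \\
\mathrm{ANTIC}_{\mathrm{out}}(b) &= \bigcap_{s \in \mathrm{succ}(b)} \mathrm{ANTIC}_{\mathrm{in}}(s)
  && \text{(backward; } \mathrm{ANTIC}_{\mathrm{out}}(\mathrm{exit}) = \emptyset\text{)} \\
\mathrm{ANTIC}_{\mathrm{in}}(b)  &= \mathrm{ANTGEN}(b) \cup \bigl(\mathrm{ANTIC}_{\mathrm{out}}(b) \setminus \mathrm{KILL}(b)\bigr)
\end{align*}

Classic PRE only inserts an expression at points where it is anticipated, i.e. where every path leaving that point would compute it anyway, which is what gives the "no path acquires a new computation" invariant mentioned above.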
Mehdi Amini
2015-Jan-12 18:35 UTC
[LLVMdev] Separating loop nests based on profile information?
> On Jan 12, 2015, at 10:08 AM, Philip Reames <listmail at philipreames.com> wrote:
>
> [...]
>
> * It took me a few hours of squinting at the code to recognize that
> weird backwards walk as being an implementation of "anticipated", but
> it is.

Did you add some comments? It may save a few hours of squinting to
someone else in the future :)

Mehdi
Daniel Berlin
2015-Jan-12 21:11 UTC
[LLVMdev] Separating loop nests based on profile information?
On Mon, Jan 12, 2015 at 1:08 PM, Philip Reames <listmail at philipreames.com> wrote:
> On 01/11/2015 10:35 PM, Daniel Berlin wrote:
>> [...]
>
> Er, I think we have a terminology problem here. From what I can
> gather, our current implementation *is* an implementation of classic
> PRE.

Scalar, yes; load PRE, no.

What load PRE does is not the same as availability. It finds the
nearest dependencies (be they load/load, store/load, or clobber) for a
load walking backwards. That is not availability. Availability is a
forward problem, not a backwards one :)

It tries to transform the data it gets from memdep into something like
availability info, but you *will not get the same results* in some
cases.

> It's more or less directly organized around the classic "available"
> and "anticipated" concepts from the literature. The *implementation*
> has some limitations, but the conceptual framework is there*.

I agree the anticipation calculation is somewhat there, but even that
is not a standard PRE one.

> One of the key invariants in classic PRE is that you do not change
> program behaviour on any path.

Well, it's more "do not increase the number of computations on any
path", but even that gets violated in most standard PREs because they
do LICM out of loops.

> As a result, moving (or at most duplicating) loads is pretty much all
> that happens.

Sure.

> * It took me a few hours of squinting at the code to recognize that
> weird backwards walk as being an implementation of "anticipated", but
> it is. The "available" part comes from MemoryDependenceAnalysis and
> the forward dataflow algorithm based on those inputs.

MemoryDependenceAnalysis is a backwards walker; it does not compute
availability of a load. It computes the nearest dependencies. We then
try to take this and turn it into the result of a forward availability
problem.

> If you want to start inserting loads on paths which didn't previously
> have them - for example the loop preheader in your example in the
> previous email - this is not classic PRE as I'm familiar with it. It
> can be very profitable and worthwhile, but it is not classic PRE. I'm
> not sure what this variant is known as in the literature.

This is definitely a standard SSA-based PRE algorithm. Every single one
I am familiar with will move code out of loops like this, be it scalar
or loads, regardless of the fact that it does, in fact, increase the
number of possible computations by one if the loop does not execute.

> (If I'm being overly pedantic here, or just flat out wrong, please say
> so. I'm still exploring related material and am trying to categorize
> things in my own head.)

I do not believe you will find an SSA-based PRE that will not do the
above optimization by default.

> I think we're actually in reasonable agreement that we do need to move
> beyond 'classic PRE' as I called it above. If you have specific
> suggestions, positive or negative examples, or other thoughts to
> share, I'd very much value the input.
>
>> If you want this case to work, what you need is a real load PRE
>> implementation, not one based on simple memdep-based load moving.
>>
>> Trying to hack this one up with profile-guided code duplication,
>> etc., to get it to kinda sorta work seems to me to be a bad idea.
>
> Not sure I really agree here, but I'm open to pushing as far as we can
> with static heuristics first. I suspect that's likely to be more
> stable w.r.t. optimization effect and applicable to more cases.

So, LLVM has been through three PRE implementations. It had an
implementation of e-path PRE, an implementation of GVNPRE (that only
worked on scalar code, I believe) a long, long time ago, and then the
current one. The current one is used because it's fast and catches a
lot of cases. The others were eliminated because they were essentially
experimental, research-quality implementations, and were slow :)

Because the current PRE is not value based, and it really has no idea
about memory, you will run into limitations on loads pretty quickly
(particularly loads indexed by things like induction variables). It
already tries to work around some of them by doing some actual value
analysis of the load (and doing forwarding), but this only works in
basic cases.

GVNPRE can be implemented and made a lot faster than it was (in GCC,
it's actually one of the fastest high-level optimization passes) with
very careful engineering, and would take care of all of these cases.
I'm also fairly positive you could take any good anticipation-based PRE
algorithm you can find, and I could make it value based and still fast.

But in the end, pushing the current one seems like a bad idea to me
because it wasn't designed for what you are trying to do - get good PRE
results and do good load PRE - and we know of better ways to do that.
All pushing it will do is push it out of the sweet spot it was designed
for.

If you want to get into profile-guided placement or speculative PRE, it
involves solving min-cut problems on very large graphs in most cases.
The only implementation I'm aware of that is relatively sane is the one
described in
http://webdocs.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-horspool.pdf
and
https://dspace.library.uvic.ca/bitstream/handle/1828/292/Pereira_0336403_Thesis-Final_21Dec2007_.pdf?sequence=1
(this one does not require solving min-cut problems).

In the end though, you are doing the work, so you get to decide what
you want to do :)
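The loop-preheader disagreement is easiest to see in a small example. Below is a hand-written C sketch (names made up, not compiler output) of the hoist being discussed.

/* Before: the global g is loaded on every iteration, and not at all
 * if the loop body never runs. */
extern int g;

int sum_before(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * g;        /* g loaded each iteration */
    return s;
}

/* After a PRE/LICM-style hoist: the load of g moves to the preheader.
 * If n <= 0, the original code never loaded g, so this inserts a load
 * on a path that previously had none -- the "increase the number of
 * computations by one if the loop does not execute" case. */
int sum_after(const int *a, int n) {
    int s = 0;
    int g_hoisted = g;        /* hoisted load */
    for (int i = 0; i < n; i++)
        s += a[i] * g_hoisted;
    return s;
}

A strict "no path gains a computation" formulation would refuse the hoist because of the n <= 0 path, whereas LICM and the SSA-based PRE/GVNPRE algorithms mentioned above perform it anyway (for loads, subject to the access being safe to speculate).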
Xinliang David Li
2015-Jan-13 05:44 UTC
[LLVMdev] Separating loop nests based on profile information?
On Mon, Jan 12, 2015 at 10:08 AM, Philip Reames <listmail at philipreames.com> wrote:
> [...]
>
> Not sure I really agree here, but I'm open to pushing as far as we can
> with static heuristics first. I suspect that's likely to be more
> stable w.r.t. optimization effect and applicable to more cases.

I agree with Danny. Profile data is useful for code placement, and for
the profitability computation in speculative PRE. I don't see how your
proposed code movement can be an enabler for more PRE.

David
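As a rough illustration of the kind of profitability computation profile data enables in speculative PRE (all names here are hypothetical; this is a sketch, not an existing LLVM API):

#include <stddef.h>
#include <stdint.h>

/* Sketch: decide whether speculatively inserting a computation (e.g. a
 * load) into one block is worthwhile, by comparing the dynamic
 * executions it adds against the dynamic executions of the redundant
 * occurrences it would let us delete.  Block frequencies would come
 * from profile data. */
int speculative_pre_is_profitable(uint64_t insert_block_freq,
                                  const uint64_t *redundant_block_freqs,
                                  size_t num_redundant_blocks) {
    uint64_t removed = 0;
    for (size_t i = 0; i < num_redundant_blocks; i++)
        removed += redundant_block_freqs[i];
    /* Profitable only if we delete more dynamic computations than we add. */
    return removed > insert_block_freq;
}

The min-cut formulations Danny links generalize this idea: execution frequencies become edge capacities, and the cheapest set of insertion points is chosen as a minimum cut rather than checked one block at a time.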