Chad Rosier via llvm-dev
2017-Aug-04 18:41 UTC
[llvm-dev] [RFC][InlineCost] Modeling JumpThreading (or similar) in inline cost model
On 8/4/2017 2:06 PM, Daniel Berlin wrote:
> A few notes:
> I'm a bit surprised IPO copy/constant propagation doesn't get this
> case, but i didn't look if the lattice supports variables.
> In particular, in your example, given no other call sites, it should
> eliminate the dead code.
> (In a real program, it may require cloning).

In the actual program (SPEC2017/gcc, ironically), there are multiple calls
to fn2 and only one of them has the property that the 1st and 2nd argument
are the same (as is shown in my pseudo code). Internally, we have another
developer, Matt Simpson, working on a function specialization patch that
might be of value here. Specifically, you could clone fn2 based on the fact
that a_ptr == dst_ptr and then simplify a great deal of the function.
However, that patch is still a WIP.

> GCC will do IPA-CP/const-prop with cloning, and i'm wildly curious if
> new GCC's catch this case for you at higher optimization levels?

GCC does inline fn2 into fn1 in this particular case, but I'm not exactly
sure how GCC accomplishes this. I'm guessing GCC is just more aggressive
with its inlining (fn2 is also marked with the inline keyword, which I
assume GCC uses as a hint). I'm speculating here and I've never worked on
GCC, so unfortunately I have little to go on.

> If so, it may be worth not looking at this as an inlining problem, but
> as an area where we need IPO infrastructure improvement.

Because of the multiple call sites with varying characteristics, I'm not
sure this can be solved in this way.

> Otherwise, a couple things:
> Approximate dominators (for example, semi-dominators) can be computed
> fast (a DFS walk of the CFG with no real additional computation).
> Except in strange CFGs that jump around a lot, they are the dominators.
>
> More importantly, the dominator is either the sdom or a proper
> ancestor of the sdom.
>
> The practical impact of this is that if you use them as if they were
> dominators, the set of conditions you discover will not be "too
> wrong". Occasionally wrong, but mostly not.
>
> My guess is the cost of doing approximate dominators is ~50-60% of the
> cost of doing dominators. Nowadays, about half the time is in the DFS
> walk, the other half in the computation. At best, it would be 2-3x
> faster.
> I've no idea if this changes whether we'd want dominators, approximate
> dominators, or stick with nothing.

Right, this is kinda one of the bigger questions I'm trying to figure out.
My proposed solution doesn't use the dominator tree in order to minimize
the impact on compile-time. However, I'd guess the ROI is going to be much
smaller because of the limited scope. On the other end of the spectrum,
I'd fear the full dominator tree would be too computationally expensive
(but of course some of that could be mitigated by the ability to do
incremental updates to the dominator tree).

> If you have some sort of dominator-ish tree, you could then just use
> EarlyCSE's method of dominating condition finding:
> Process in "dom tree" top-down order, push the equivalences you see,
> pop the relevant ones when you exit the relevant dom tree scope.
>
> In practice, you'd only check comparisons against the hash table.

Hmm, I'll have to think about it for a bit. I'm thinking this might be a
good compromise for my needs.
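For concreteness, here is a rough, self-contained sketch of the scheme
described above: walk a dominator(-ish) tree top-down, push the facts each
branch establishes for the blocks it dominates, and answer later null
tests with a table lookup. Everything below (Node, Fact, the null-only
"lattice") is invented for illustration; it is not the EarlyCSE code, and
a real implementation would use a scoped hash table that handles shadowed
entries.

  #include <cstdio>
  #include <map>
  #include <string>
  #include <vector>

  struct Fact { std::string Ptr; bool NonNull; };   // e.g. "a_ptr is non-null"

  struct Node {                                     // one dom-tree(-ish) node
    std::string Name;
    std::vector<Fact> Establishes;                  // facts its branch proves for its children
    std::vector<std::string> Tests;                 // pointers this block tests against null
    std::vector<Node *> Children;
  };

  // Top-down walk: push facts on entry, pop them when the subtree is done.
  static void walk(const Node *N, std::map<std::string, bool> &Known,
                   unsigned &FoldableTests) {
    for (const std::string &P : N->Tests)           // per-comparison work is just a lookup
      if (Known.count(P))
        ++FoldableTests;                            // outcome pinned by a dominating branch

    std::vector<std::string> Pushed;
    for (const Fact &F : N->Establishes)
      if (Known.emplace(F.Ptr, F.NonNull).second)   // (shadowing ignored in this sketch)
        Pushed.push_back(F.Ptr);

    for (const Node *C : N->Children)
      walk(C, Known, FoldableTests);

    for (const std::string &P : Pushed)             // leave the scope
      Known.erase(P);
  }

  int main() {
    // Shape of fn2 at the fn1 call site: the outer branch establishes that
    // a_ptr (== dst_ptr here) is non-null, so the inner "if (!dst_ptr)" folds.
    Node Inner{"if (!dst_ptr)", {}, {"dst_ptr"}, {}};
    Node Outer{"if (a_ptr && ...)", {{"a_ptr", true}, {"dst_ptr", true}}, {}, {&Inner}};

    std::map<std::string, bool> Known;
    unsigned FoldableTests = 0;
    walk(&Outer, Known, FoldableTests);
    std::printf("foldable null tests: %u\n", FoldableTests);   // prints 1
    return 0;
  }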
> The other option is PredicateInfo, but it requires dominators and
> modifies the IR.
>
> My guess is this is undesired/too heavyweight for inline cost
> analysis, however the basic principle on how it renames things could
> also be applied without IR changing for this specific case. Unlike
> the EarlyCSE method, which is O(all instructions), PredicateInfo is
> O(branches + number of uses of variables affected by conditions).
> Without going into further details, if all you care about is "for each
> condition, give me the set of possibly affected variables" (so you can
> see if they may simplify), we could do that very very quickly (as fast
> as we can sort a vector). But it does require dominators.

For my particular problem, I think PredicateInfo would be sufficient,
IIUYC. But as you suggest, I'm thinking people aren't going to be fond of
using the full dominators.

Lots of great feedback. Thanks, Danny.

> On Fri, Aug 4, 2017 at 8:56 AM, Chad Rosier <mcrosier at codeaurora.org> wrote:
>
> > All,
> >
> > I'm working on an improvement to the inline cost model, but I'm
> > unsure how to proceed. Let me begin by first describing the
> > problem I'm trying to solve. Consider the following pseudo C code:
> >
> >   typedef struct element {
> >     struct element *next;
> >     unsigned idx;
> >   } element_t;
> >
> >   static inline
> >   unsigned char fn2 (element_t *dst_ptr, const element_t *a_ptr,
> >                      const element_t *b_ptr, unsigned char changed) {
> >     if (a_ptr && b_ptr && a_ptr->idx == b_ptr->idx) {
> >       if (!changed && dst_ptr && dst_ptr->idx == a_ptr->idx) {
> >         /* Do something */
> >       } else {
> >         changed = 1;
> >         if (!dst_ptr)
> >           dst_ptr = fn3();
> >         else
> >           dst_ptr->idx = a_ptr->idx;
> >         /* Do something. */
> >       }
> >     } else {
> >       changed = fn4();
> >     }
> >     return changed;
> >   }
> >
> >   unsigned char fn1 (element_t *a_ptr, element_t *b_ptr) {
> >     unsigned char changed = 0;
> >     while (b_ptr) {
> >       if (!a_ptr || a_ptr->idx == b_ptr->idx) {
> >         changed = fn2 (a_ptr, a_ptr, b_ptr, changed);
> >         b_ptr = b_ptr->next;
> >       }
> >     }
> >     return changed;
> >   }
> >
> > When the inline cost model computes the inline cost of fn2 it ends
> > up being much higher than the inline threshold. A fair amount of
> > the cost is due to the inlining of fn3 into fn2. However, if fn2
> > had been inlined into fn1, the code from fn3 would have been
> > removed as dead/unreachable.
> >
> > At the fn2 call site, notice the first two arguments are equal.
> > Thus, in the context of fn2, dst_ptr and a_ptr are equal. The call
> > site of fn3 is predicated on dst_ptr being null (i.e., if
> > (!dst_ptr) dst_ptr = fn3()), but that code is predicated on a_ptr
> > being non-null. Therefore, we know the condition !dst_ptr is
> > false (because a_ptr == dst_ptr and a_ptr is non-null) and the
> > call to fn3 is dead. I suspect one of JumpThreading, EarlyCSE, or
> > GVN does the elimination after inlining, so that's what I'd like
> > to try and model in the inline cost model. (Note fn2 has multiple
> > call sites and the property that the first and second arguments
> > are equal isn't true for each call site, so something like IPSCCP
> > doesn't actually help, AFAICT.)
> >
> > My first attempt at solving this problem did something similar to
> > what is done in JumpThreadingPass::ProcessImpliedCondition().
> > Specifically, it tried to prove that dst_ptr was non-null based on
> > a dominating condition. The only tricky parts were to deal with
> > hammocks/diamonds when walking up the CFG (See:
> > https://reviews.llvm.org/D36287 as a concrete example of how I
> > proposed to get an immediate dominator without the domtree) and to
> > account for the fact that dst_ptr and a_ptr are equal.
> > I'm pretty sure I can get this approach to work, however, I'm not
> > convinced it's really extensible or general. Should we consider
> > using the full dominator tree in the inline cost model to capture
> > this?
> >
> > If you have any thoughts on how to tackle the problem, I would
> > love to hear your feedback!
> >
> > Chad
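To make the dead-code argument in the quoted RFC concrete, here is roughly
what the loop body of fn1 reduces to once fn2 is inlined at this call site
and the a_ptr == dst_ptr equality plus the dominating checks are taken into
account. This is a hand-worked simplification for illustration only, not
actual compiler output; element_t, fn3, and fn4 are the ones declared in
the example above.

  unsigned char fn1 (element_t *a_ptr, element_t *b_ptr) {
    unsigned char changed = 0;
    while (b_ptr) {
      if (!a_ptr || a_ptr->idx == b_ptr->idx) {
        /* --- inlined fn2 (a_ptr, a_ptr, b_ptr, changed), hand-simplified --- */
        if (a_ptr && a_ptr->idx == b_ptr->idx) {   /* b_ptr is non-null from the loop test */
          /* Here dst_ptr == a_ptr and a_ptr is non-null, so dst_ptr is non-null:
           * "if (!dst_ptr) dst_ptr = fn3();" folds away and the fn3 call is dead. */
          if (!changed) {                          /* dst_ptr->idx == a_ptr->idx is trivially true */
            /* Do something */
          } else {
            changed = 1;
            /* dst_ptr->idx = a_ptr->idx becomes a self-assignment here */
            /* Do something. */
          }
        } else {
          changed = fn4();
        }
        /* --- end of inlined fn2 --- */
        b_ptr = b_ptr->next;
      }
    }
    return changed;
  }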
Daniel Berlin via llvm-dev
2017-Aug-04 18:55 UTC
[llvm-dev] [RFC][InlineCost] Modeling JumpThreading (or similar) in inline cost model
On Fri, Aug 4, 2017 at 11:41 AM, Chad Rosier <mcrosier at codeaurora.org> wrote:
> On 8/4/2017 2:06 PM, Daniel Berlin wrote:
> > A few notes:
> > I'm a bit surprised IPO copy/constant propagation doesn't get this
> > case, but i didn't look if the lattice supports variables.
> > In particular, in your example, given no other call sites, it should
> > eliminate the dead code.
> > (In a real program, it may require cloning).
>
> In the actual program (SPEC2017/gcc, ironically), there are multiple
> calls to fn2 and only one of them has the property that the 1st and 2nd
> argument are the same (as is shown in my pseudo code). Internally, we
> have another developer, Matt Simpson, working on a function
> specialization patch that might be of value here. Specifically, you
> could clone fn2 based on the fact that a_ptr == dst_ptr and then
> simplify a great deal of the function. However, that patch is still a
> WIP.

FWIW: You almost certainly want to integrate that with IPA based constant
propagation, as it is the thing you should be using to tell you what will
happen in the call. It should actually not be difficult at all (I can give
you references to papers, but it's just a couple hundred lines of code on
top of our current propagation engine).

(It can also later be used to do type-based devirt.)

GCC will do partial specialization (IE contextually decide to clone
callsites where it believes the constantness/etc will cause elimination).

> > GCC will do IPA-CP/const-prop with cloning, and i'm wildly curious if
> > new GCC's catch this case for you at higher optimization levels?
>
> GCC does inline fn2 into fn1 in this particular case, but I'm not
> exactly sure how GCC accomplishes this. I'm guessing GCC is just more
> aggressive with its inlining (fn2 is also marked with the inline
> keyword, which I assume GCC uses as a hint). I'm speculating here and
> I've never worked on GCC, so unfortunately I have little to go on.

I meant if you turn off inlining :)

I can take a gander though, given the info you've given.

> > If so, it may be worth not looking at this as an inlining problem,
> > but as an area we need IPO infrastructure improvement.
>
> Because of the multiple callsites with varying characteristics I'm not
> sure this can be solved in this way.

FWIW: It definitely can. Whether we want to, ....

That said, this is the whole purpose of IPA const/copy prop. LLVM stops at
"propagate things that are always constant in every case" whereas most
compilers do "if worth it, clone this function callsite where i can prove
it will be constant".

You probably do not want to rely on inlining for all possible IPO effects.
Especially in this case, where IPO can actually do the job.

IMHO, if you can, you want to reserve inlining-as-IPO for the cases where
the IPO algorithms are difficult/expensive/etc.

Or, at the very least, drive inlining like this by the results of those
IPO algorithms.
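As a source-level illustration of the cloning described above (and of what
a function specialization pass might produce for this case), the call site
where the first two arguments are the same could be redirected to a
specialized copy of fn2 in which dst_ptr == a_ptr is baked in and the fn3
path is gone. The fn2_dst_is_a name is made up for the sketch; this is not
the output of any existing LLVM pass.

  /* Specialized clone of fn2 for call sites where dst_ptr == a_ptr.
   * With that equality known, "if (!dst_ptr) dst_ptr = fn3();" can never
   * run on the path where a_ptr is non-null, so fn3 drops out entirely. */
  static unsigned char fn2_dst_is_a (element_t *a_ptr, const element_t *b_ptr,
                                     unsigned char changed) {
    if (a_ptr && b_ptr && a_ptr->idx == b_ptr->idx) {
      if (!changed) {
        /* Do something */
      } else {
        changed = 1;
        /* dst_ptr->idx = a_ptr->idx is a self-assignment once dst_ptr == a_ptr */
        /* Do something. */
      }
    } else {
      changed = fn4();
    }
    return changed;
  }

  /* The loop in fn1 would then call the clone instead of the general fn2:
   *   changed = fn2_dst_is_a (a_ptr, b_ptr, changed);                     */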
Chad Rosier via llvm-dev
2017-Aug-04 19:15 UTC
[llvm-dev] [RFC][InlineCost] Modeling JumpThreading (or similar) in inline cost model
On 8/4/2017 2:55 PM, Daniel Berlin wrote:
> On Fri, Aug 4, 2017 at 11:41 AM, Chad Rosier <mcrosier at codeaurora.org> wrote:
>
>> On 8/4/2017 2:06 PM, Daniel Berlin wrote:
>>> A few notes:
>>> I'm a bit surprised IPO copy/constant propagation doesn't get
>>> this case, but i didn't look if the lattice supports variables.
>>> In particular, in your example, given no other call sites, it
>>> should eliminate the dead code.
>>> (In a real program, it may require cloning).
>>
>> In the actual program (SPEC2017/gcc, ironically), there are
>> multiple calls to fn2 and only one of them has the property that
>> the 1st and 2nd argument are the same (as is shown in my pseudo
>> code). Internally, we have another developer, Matt Simpson,
>> working on a function specialization patch that might be of value
>> here. Specifically, you could clone fn2 based on the fact that
>> a_ptr == dst_ptr and then simplify a great deal of the function.
>> However, that patch is still a WIP.
>
> FWIW: You almost certainly want to integrate that with IPA based
> constant propagation, as it is the thing you should be using to tell
> you what will happen in the call. It should actually not be difficult
> at all (I can give you references to papers, but it's just a couple
> hundred lines of code on top of our current propagation engine).

If I'm not mistaken, this is exactly what Matt is doing. :D

> (It can also later be used to do type-based devirt.)
>
> GCC will do partial specialization (IE contextually decide to clone
> callsites where it believes the constantness/etc will cause elimination).
>
>>> GCC will do IPA-CP/const-prop with cloning, and i'm wildly
>>> curious if new GCC's catch this case for you at higher
>>> optimization levels?
>>
>> GCC does inline fn2 into fn1 in this particular case, but I'm not
>> exactly sure how GCC accomplishes this. I'm guessing GCC is just
>> more aggressive with its inlining (fn2 is also marked with the
>> inline keyword, which I assume GCC uses as a hint). I'm
>> speculating here and I've never worked on GCC, so unfortunately I
>> have little to go on.
>
> I meant if you turn off inlining :)

Doh. Right.

> I can take a gander though, given the info you've given.
>
>>> If so, it may be worth not looking at this as an inlining
>>> problem, but as an area we need IPO infrastructure improvement.
>>
>> Because of the multiple callsites with varying characteristics I'm
>> not sure this can be solved in this way.
>
> FWIW: It definitely can. Whether we want to, ....

Yes, you're right.

> That said, this is the whole purpose of IPA const/copy prop.
> LLVM stops at "propagate things that are always constant in every
> case" whereas most compilers do "if worth it, clone this function
> callsite where i can prove it will be constant".
>
> You probably do not want to rely on inlining for all possible IPO
> effects. Especially in this case, where IPO can actually do the job.

Yes, that makes sense.

> IMHO, if you can, you want to reserve inlining-as-IPO for the cases
> where the IPO algorithms are difficult/expensive/etc.
>
> Or, at the very least, drive inlining like this by the results of
> those IPO algorithms.

Sounds reasonable. I'll throw this idea around with Matt and Geoff.
Sean Silva via llvm-dev
2017-Aug-04 22:07 UTC
[llvm-dev] [RFC][InlineCost] Modeling JumpThreading (or similar) in inline cost model
On Fri, Aug 4, 2017 at 11:41 AM, Chad Rosier via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> Right, this is kinda one of the bigger questions I'm trying to figure
> out. My proposed solution doesn't use the dominator tree in order to
> minimize the impact on compile-time.

In the new PM, it's possible for a CGSCC pass to get a cached function
analysis from functions that have been visited previously. This requires
the CGSCC iteration on SCC's we visited earlier to keep an up to date
domtree, and I don't know if that's the case (or how much work it would be
to make it the case).

-- Sean Silva
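For reference, the kind of query described above looks roughly like the
following in a new-PM CGSCC pass: ask the function-analysis proxy for its
manager and use getCachedResult so nothing is recomputed if the domtree
isn't already there. This is a sketch of the general pattern from memory,
not a tested patch to the inliner or the cost analysis.

  #include "llvm/Analysis/CGSCCPassManager.h"
  #include "llvm/Analysis/LazyCallGraph.h"
  #include "llvm/IR/Dominators.h"
  #include "llvm/IR/PassManager.h"

  using namespace llvm;

  struct DomAwareCostPass : PassInfoMixin<DomAwareCostPass> {
    PreservedAnalyses run(LazyCallGraph::SCC &C, CGSCCAnalysisManager &AM,
                          LazyCallGraph &CG, CGSCCUpdateResult &UR) {
      // The proxy hands back the FunctionAnalysisManager for this SCC.
      FunctionAnalysisManager &FAM =
          AM.getResult<FunctionAnalysisManagerCGSCCProxy>(C, CG).getManager();

      for (LazyCallGraph::Node &N : C) {
        Function &F = N.getFunction();
        // Only use a domtree if an earlier pass already computed and
        // preserved it; getCachedResult never triggers a recomputation.
        if (DominatorTree *DT = FAM.getCachedResult<DominatorTreeAnalysis>(F)) {
          (void)DT; // ...feed DT into the dominating-condition analysis here.
        }
      }
      return PreservedAnalyses::all();
    }
  };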
Sander De Smalen via llvm-dev
2017-Aug-07 14:56 UTC
[llvm-dev] [RFC][InlineCost] Modeling JumpThreading (or similar) in inline cost model
Hi,

Coincidentally, I've been working to optimize this same case last week. I
was struggling a bit to determine where to put this functionality and
eventually went for the pragmatic approach of creating an experimental
pass. Probably not the eventual solution, but it may provide some useful
input to the discussion here.

Basically, I experimented with a 'pre-inlining-transform' pass that clones
the call site if it can prove that one of the arguments to the call can be
replaced by a constant, based on dominating conditions, e.g.:

  if (!ptr || (ptr && ptr->val))
    foo(ptr, ...)

=>

  if (!ptr)
    foo(nullptr, ...)
  else if (ptr && ptr->val)
    foo(ptr /* KnownNonNull */, ...)

Here the first argument becomes constant for the first call site, and a
further analysis pass sets the KnownNonNull attribute on the first argument
in the second call site. The inline cost algorithm can then determine it is
cheap enough to inline both cases, because it knows the callee (foo)
distinguishes between the two cases. In the callee, the check for
'(ptrA == ptrB)' in function foo becomes constant for the first call site
because it knows ptrA is nullptr.

To keep compile time down, it doesn't use the dominator tree, as it seemed
sufficient to look at the direct predecessors of the block containing the
call (and their single predecessors, for 'and'ed conditions). To keep the
cost of duplicating code down, it only clones the block up to the call if
the number of instructions stays below some threshold, which in practice
reduces the cases to simple call sites with directly dominating conditions.
The reasoning behind this is that duplicating the call site is probably
'cheap' if it can eliminate an argument with a constant, since this is
often a pointer and is likely to be checked for 'nullptr' in the callee,
which will improve the inline cost.

We're still collecting numbers to check there are no regressions from
cloning the call site, and the pass is still a bit of a work in progress,
but I should be able to share this work if anyone is interested (if only
just for reference).

Sander
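Applied to the call site from the original example, that kind of call-site
cloning would look roughly like the following. This is a hand-written
illustration of the idea, not output of the experimental pass; the names
come from the fn1/fn2 example earlier in the thread.

  /* Before: one call site, a_ptr may or may not be null. */
  while (b_ptr) {
    if (!a_ptr || a_ptr->idx == b_ptr->idx) {
      changed = fn2 (a_ptr, a_ptr, b_ptr, changed);
      b_ptr = b_ptr->next;
    }
  }

  /* After cloning on the dominating condition: each clone hands the callee
   * more information (a literal NULL, or a pointer known to be non-null),
   * which is what lets the inline cost of fn2 drop at each call site. */
  while (b_ptr) {
    if (!a_ptr) {
      changed = fn2 (NULL, NULL, b_ptr, changed);    /* fn2 reduces to changed = fn4() */
      b_ptr = b_ptr->next;
    } else if (a_ptr->idx == b_ptr->idx) {
      changed = fn2 (a_ptr, a_ptr, b_ptr, changed);  /* a_ptr (== dst_ptr) known non-null: fn3 path dead */
      b_ptr = b_ptr->next;
    }
  }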