thr3ads.net - llvm dev - [llvm-dev] RFC: We should stop merging allocas in the inliner [Aug 2016]

If this information is useful, please help other people find it:
Share via:

Chandler Carruth via llvm-dev

2016-Aug-02 21:32 UTC

[llvm-dev] RFC: We should stop merging allocas in the inliner

Sorry I missed these comments in my first read through David.

On Mon, Aug 1, 2016 at 1:06 AM Xinliang David Li via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On Sun, Jul 31, 2016 at 9:47 PM, Chandler Carruth via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Thoughts? The code changes are easy and mechanical. My plan would be:
>>
>
> There is one caveat: stop doing stack merging in inliner will change the
> inliner cost analysis which may have affect inlining decisions and
> performance in an unexpected way.  For instance, the allocatedSize estimate
> won't be as accurate due to double counting.
>
There is the possibility of this, but I think it is relatively unlikely for
a few reasons.

The first is that I don't think the alloca merging is helping that much
here. Specifically, we only merge certain kinds of allocas and we only
merge them in fairly limited cases. So if the thresholds were very
sensitive to this, I would expect us to have seen these thresholds blocking
things more often than we have.

Also, as you mention, there are two places we might hit this. One is due to
the increased number of alloca instructions causing skew in the inline
cost. I think we should "fix" this by stopping counting *static*
alloca
instructions for the purpose of inline cost estimation. I think this will
be a net win. The alloca instructions are completely synthetic when static.
All of them will be folded into a single stack adjustment that is even
shared with register spills etc. I think we should model them as free.
(Clearly, *dynamic* alloca instructions are different here!) But I don't
think this is likely enough to be urgent to fix. I think we could do both
changes in parallel.

The other limit is the allocated size limit, and I think that is the more
likely issue (it sounds like it was the one you were worried about as
well). This one might be an issue, but I also think we can adjust it pretty
freely. The goal of this limit, from my recollection of when it went in, is
more to avoid run-away stack bloating, and the value of the limit is
somewhat arbitrary. And it only impacts inlining into recursive functions.
So I'm less worried about this in general. If it proves to be a problem in
practice, we can do something about it, and if that happens it would
probably be good to know because we shouldn't be relying on merging to
model this anyways.

>
>> 1) Add a flag to turn this off.
>> 2) Run benchmarks with flag to make sure things are actually working.
>> 3) Change default for the flag, make sure chaos doesn't ensue.
>> 4) Delete the (quite substantial) complexity associated with this and
the
>> flag.
>>
>
> The code handling the merging seems to be in an isolated place with <100
> lines of code. While I agree this change may be something good to do in
> principle -- are there any practical reason for doing this?
>
Yes, its complexity which is better addressed by another layer. Every piece
of complexity, especially in something like the inliner, makes it harder to
change and reason about. I'm trying to minimize and isolate the complex
parts of the iniliner in preparation to porting it to the new pass manager.
This seemed like low-hanging fruit.

> you mentioned information loss (aliasing, sanitizer etc) -- are there any
> real issues caused by the merging (with tracking bugs)?
>
I don't have any, but I also don't think we should wait for bugs to come
up
to make principled changes to the compiler, especially those that are
simplifying!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160802/d963c6c4/attachment.html>

Xinliang David Li via llvm-dev

2016-Aug-02 22:01 UTC

head link

[llvm-dev] RFC: We should stop merging allocas in the inliner

On Tue, Aug 2, 2016 at 2:32 PM, Chandler Carruth <chandlerc at gmail.com>
wrote:
> Sorry I missed these comments in my first read through David.
>
> On Mon, Aug 1, 2016 at 1:06 AM Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> On Sun, Jul 31, 2016 at 9:47 PM, Chandler Carruth via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Thoughts? The code changes are easy and mechanical. My plan would
be:
>>>
>>
>> There is one caveat: stop doing stack merging in inliner will change
the
>> inliner cost analysis which may have affect inlining decisions and
>> performance in an unexpected way.  For instance, the allocatedSize
estimate
>> won't be as accurate due to double counting.
>>
>
> There is the possibility of this, but I think it is relatively unlikely
> for a few reasons.
>
> The first is that I don't think the alloca merging is helping that much
> here. Specifically, we only merge certain kinds of allocas and we only
> merge them in fairly limited cases. So if the thresholds were very
> sensitive to this, I would expect us to have seen these thresholds blocking
> things more often than we have.
>
> Also, as you mention, there are two places we might hit this. One is due
> to the increased number of alloca instructions causing skew in the inline
> cost. I think we should "fix" this by stopping counting *static*
alloca
> instructions for the purpose of inline cost estimation. I think this will
> be a net win. The alloca instructions are completely synthetic when static.
> All of them will be folded into a single stack adjustment that is even
> shared with register spills etc. I think we should model them as free.
> (Clearly, *dynamic* alloca instructions are different here!) But I
don't
> think this is likely enough to be urgent to fix. I think we could do both
> changes in parallel.
>
Yes, I agree with this. In fact, not only should we not model static alloca
as 'cost', if we have not already model them as part of the call
overhead
(stack pointer adjustment in prologue), we should probably also model all
static allocas collectively in the callee as potential savings after
inlining.    If you don't have time to get to it (at least the first part),
Easwaran can probably help with this adjustment.


>
> The other limit is the allocated size limit, and I think that is the more
> likely issue (it sounds like it was the one you were worried about as
> well). This one might be an issue, but I also think we can adjust it pretty
> freely. The goal of this limit, from my recollection of when it went in, is
> more to avoid run-away stack bloating, and the value of the limit is
> somewhat arbitrary. And it only impacts inlining into recursive functions.
> So I'm less worried about this in general. If it proves to be a problem
in
> practice, we can do something about it, and if that happens it would
> probably be good to know because we shouldn't be relying on merging to
> model this anyways.
>
>The current limit (for recursive caller) is indeed arbitrary, but I am more
worried about a more general case which applies to any deep call chains: we
may want to implement heuristic to throttle inlining in general due to
excessive stack frame growth. For this reason, the tracking of stack size
need to match closely to actual stack usage -- otherwise we end up in
either missing inlinings or runtime errors (due to out of stack problem).

One way to solve this without requiring merging is to track the current
node's frame size by combining the caller's original stack size plus the
max of its callee's frame size (not sum).

>
>
>>
>>> 1) Add a flag to turn this off.
>>> 2) Run benchmarks with flag to make sure things are actually
working.
>>> 3) Change default for the flag, make sure chaos doesn't ensue.
>>> 4) Delete the (quite substantial) complexity associated with this
and
>>> the flag.
>>>
>>
>> The code handling the merging seems to be in an isolated place with
<100
>> lines of code. While I agree this change may be something good to do in
>> principle -- are there any practical reason for doing this?
>>
>
> Yes, its complexity which is better addressed by another layer. Every
> piece of complexity, especially in something like the inliner, makes it
> harder to change and reason about. I'm trying to minimize and isolate
the
> complex parts of the iniliner in preparation to porting it to the new pass
> manager. This seemed like low-hanging fruit.
>
>
>> you mentioned information loss (aliasing, sanitizer etc) -- are there
any
>> real issues caused by the merging (with tracking bugs)?
>>
>
> I don't have any, but I also don't think we should wait for bugs to
come
> up to make principled changes to the compiler, especially those that are
> simplifying!
>
thanks. I just wanted to understand the motivation for the change, which
you provided.

David
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160802/1bcfd738/attachment.html>

Chandler Carruth via llvm-dev

2016-Aug-02 22:08 UTC

head link

[llvm-dev] RFC: We should stop merging allocas in the inliner

On Tue, Aug 2, 2016 at 3:01 PM Xinliang David Li via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On Tue, Aug 2, 2016 at 2:32 PM, Chandler Carruth <chandlerc at
gmail.com>
> wrote:
>
>> Sorry I missed these comments in my first read through David.
>>
>> On Mon, Aug 1, 2016 at 1:06 AM Xinliang David Li via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> On Sun, Jul 31, 2016 at 9:47 PM, Chandler Carruth via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Thoughts? The code changes are easy and mechanical. My plan
would be:
>>>>
>>>
>>> There is one caveat: stop doing stack merging in inliner will
change the
>>> inliner cost analysis which may have affect inlining decisions and
>>> performance in an unexpected way.  For instance, the allocatedSize
estimate
>>> won't be as accurate due to double counting.
>>>
>>
>> There is the possibility of this, but I think it is relatively unlikely
>> for a few reasons.
>>
>> The first is that I don't think the alloca merging is helping that
much
>> here. Specifically, we only merge certain kinds of allocas and we only
>> merge them in fairly limited cases. So if the thresholds were very
>> sensitive to this, I would expect us to have seen these thresholds
blocking
>> things more often than we have.
>>
>> Also, as you mention, there are two places we might hit this. One is
due
>> to the increased number of alloca instructions causing skew in the
inline
>> cost. I think we should "fix" this by stopping counting
*static* alloca
>> instructions for the purpose of inline cost estimation. I think this
will
>> be a net win. The alloca instructions are completely synthetic when
static.
>> All of them will be folded into a single stack adjustment that is even
>> shared with register spills etc. I think we should model them as free.
>> (Clearly, *dynamic* alloca instructions are different here!) But I
don't
>> think this is likely enough to be urgent to fix. I think we could do
both
>> changes in parallel.
>>
>
> Yes, I agree with this. In fact, not only should we not model static
> alloca as 'cost', if we have not already model them as part of the
call
> overhead (stack pointer adjustment in prologue), we should probably also
> model all static allocas collectively in the callee as potential savings
> after inlining.    If you don't have time to get to it (at least the
first
> part), Easwaran can probably help with this adjustment.
>
Sure, if Easwaran can look at that independently, it'd be great.

>
>> The other limit is the allocated size limit, and I think that is the
more
>> likely issue (it sounds like it was the one you were worried about as
>> well). This one might be an issue, but I also think we can adjust it
pretty
>> freely. The goal of this limit, from my recollection of when it went
in, is
>> more to avoid run-away stack bloating, and the value of the limit is
>> somewhat arbitrary. And it only impacts inlining into recursive
functions.
>> So I'm less worried about this in general. If it proves to be a
problem in
>> practice, we can do something about it, and if that happens it would
>> probably be good to know because we shouldn't be relying on merging
to
>> model this anyways.
>>
>>
> The current limit (for recursive caller) is indeed arbitrary, but I am
> more worried about a more general case which applies to any deep call
> chains: we may want to implement heuristic to throttle inlining in general
> due to excessive stack frame growth. For this reason, the tracking of stack
> size need to match closely to actual stack usage -- otherwise we end up in
> either missing inlinings or runtime errors (due to out of stack problem).
>
> One way to solve this without requiring merging is to track the current
> node's frame size by combining the caller's original stack size
plus the
> max of its callee's frame size (not sum).
>
So there are two ways I can see the threshold being used.

One is to prevent completely nuts run-away stack growth. That use case can
be served with a fairly arbitrary approximation of stack growth and an
arbitrary threshold.

But the other use case is, as you say, to be reasonably careful about
inlining "too much" and hurting stack size. There, I agree with you
that we
need a fairly precise way to model this and I don't think LLVM has one at
the moment.

My belief is that at best we are currently addressing the first of these
goals, and that the difference between merging or not merging could be
easily accomodated by changes to the threshold.

I'm happy for folks to look at going after the second goal, but the merging
thing isn't going to be nearly enough so I don't think that should
really
hold up this move.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160802/9df4828b/attachment.html>

llvm dev - Aug 2016 - RFC: We should stop merging allocas in the inliner

[llvm-dev] RFC: We should stop merging allocas in the inliner

[llvm-dev] RFC: We should stop merging allocas in the inliner

[llvm-dev] RFC: We should stop merging allocas in the inliner