Hi all,

We've seen several examples recently of performance opportunities on POWER if we can improve the placement of save/restore code for callee-saved registers. Nemanja and I have discussed this with several people, and it seems there are two possibilities for improving this:

1. Extend shrink wrapping to make the analysis of callee-saved registers more precise.
2. Focus on enabling and (possibly) improving SplitCSR.

I would like opinions from people on the preferred way to proceed.

I am leaning toward improving shrink wrapping, at least as a short-term solution. However, I fully admit that this is because I am familiar with the shrink-wrapping code and completely naive about SplitCSR and what work would be necessary to get it working well.

My proposal is to implement the flow-sensitive analysis described by Fred Chow (PLDI '88) and make the necessary extensions in shrink wrapping to handle multiple save/restore points. At that point we can do an evaluation to understand the improvements it provides and its impact on compile time. Once we have these results, we can look at the best way to enable it (e.g., an option, target opt-in, higher optimization levels, etc.).

Thoughts?

Kit
Francis Visoiu Mistrih via llvm-dev
2017-May-03 17:50 UTC
[llvm-dev] RFC: Shrink wrapping vs SplitCSR
Hi Kit,

> On May 2, 2017, at 7:54 PM, Kit Barton via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...]

Back in 2009, there was an implementation of Fred Chow's algorithm, but it was removed in r193749 because it was unused and untested.

I have been working on a new implementation of Fred Chow's algorithm for a while now. It seems this algorithm has been avoided because its compile-time impact was not considered worth the performance improvement. The main reason is loops in the CFG. In my implementation, we decided that we never want to save/restore inside a loop, so we treat each loop as a single block. We use scc_iterator instead of MachineLoopInfo in order to also handle irreducible loops, which the current shrink-wrapping implementation does not handle. That way, we can compute the anticipation/availability attributes in linear time.

In terms of correctness, there is one test on X86 that still fails, and two others on AArch64, because of the way compact unwinding encodes register saves.

In terms of compile time, there are no regressions on CTMark.

In terms of code size, we get a 0.8% increase on X86_64 and AArch64, mostly because we can no longer use push/pop.

For now, we have only worked on shrink-wrapping CSRs; the stack setup stays in the entry and return blocks, which can give worse results in some cases compared to the current shrink-wrapping pass. I am currently working on fixing this.

For execution time, not many improvements showed up, due to the stack setup and the transformation of push/pop into mov $reg, (mem) / mov (mem), $reg; this can be partially solved.

In terms of what the algorithm can do, and how it can outperform the current one, I gathered some stats on where we save/restore, along with block frequency (with PGO), and we see a theoretical 8% improvement.

I put an early review here a while ago: https://reviews.llvm.org/D30808, and I will update it soon. As you can see, all the PrologEpilogInserter and (X86|AArch64)TargetFrameLowering code looks horribly hacky, because so many things assume the stack setup and the callee-saved saves will stick together. Fixing all those assumptions is going to be the trickiest part of shrink-wrapping, not the algorithm itself.

There are some cases where the current shrink wrapping is better, such as the one in Fig. 3 of Chow's paper; that can probably be solved with something similar to what is described in "Post Register Allocation Spill Code Optimization" by Christopher Lupo and Kent D. Wilken, which uses the PST to optimize save/restore placement along with a cost model.

Cheers,

--
Francis Visoiu Mistrih

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Francis Visoiu Mistrih <fvisoiumistrih at apple.com> writes:

I'm resending this, as I forgot to CC llvm-dev on my reply last night...

Hi Francis,

Thanks for the detailed reply! I took a quick look at the patch, and I'll try to take a closer look at it tomorrow. Could you please subscribe me to future patches?

Either I or someone from my team can help with testing and/or integration on PPC if you'd like. This is something we've been struggling with for a while, and we are anxious to make some progress on it :)

Kit

> [...]