thr3ads.net - llvm dev - [llvm-dev] [RFC][AArch64] Homogeneous Prolog and Epilog for Size Optimization [Mar 2020]

Kyungwoo Lee via llvm-dev

2020-Mar-24 21:04 UTC

[llvm-dev] [RFC][AArch64] Homogeneous Prolog and Epilog for Size Optimization

Hi Vedant,

Thanks for your interest and comment.
Size optimization reduces page faults and improves start-up time for a large
application, and enabling this feature followed the same trend.
Even though I didn't see a large regression or complaint in CPU-bound cases,
which are not typical of mobile workloads, I wanted to be cautious about
enabling it by default.
However, as with the default outlining case, I don't mind enabling this under
-Oz (for minimizing code size) with an opt-out option.

Regards,
Kyungwoo

On Tue, Mar 24, 2020 at 12:01 PM Vedant Kumar <vedant_kumar at apple.com>
wrote:
> This looks really interesting. In the slides, it’s mentioned that the
> combination of tuning the MachineOutliner for ThinLTO and of optimizing
> function prolog/epilogs improved measured run-time performance.
>
> What kind of performance impact do you see from simply homogenizing
> prolog/epilogs? (If, say, across LNT/aarch64/-Oz the performance impact is
> not large, it may make sense to have homogenization enabled by default.)
>
> best,
> vedant
>
> On Mar 23, 2020, at 11:32 PM, Kyungwoo Lee via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Hello,
>
> I'd like to upstream work we have done over time that the community would
> benefit from.
> This is part of the effort toward minimizing code size presented here
> <https://llvm.org/devmtg/2020-02-23/slides/Kyungwoo-GlobalMachineOutlinerForThinLTO.pdf>.
> In particular, this RFC is about optimizing prolog and epilog for size.
>
> *Homogeneous Prolog and Epilog for Size Optimization, D76570
> <https://reviews.llvm.org/D76570>:*
>
> Prologs and epilogs that handle callee-saved registers tend to be irregular,
> with different immediate offsets, so they are not often outlined (by the
> machine outliner) when optimizing for size. Combining stack operations in
> D18619 increased this irregularity further.
> This patch tries to emit homogeneous stores and loads with the same offset
> for the prolog and epilog respectively. We have observed that homogeneous
> prologs and epilogs significantly increase the chance of outlining,
> resulting in a code size reduction. However, a great deal of outlining
> opportunities were still left because the current outliner has to
> conservatively handle instructions that involve the return register, x30.
> So, rather than relying on the outliner alone, this patch also forms a
> custom-outlined helper function on demand for the prolog and epilog when
> lowering the frame code.
>
> - Injects HOM_Prolog and HOM_Epilog pseudo instructions in the Prolog and
> Epilog Injection Pass.
> - Lowers and optimizes them in the AArch64LowerHomogeneousPrologEpilog Pass.
> - Outlined helpers are created on demand. Identical helpers are merged by
> the linker.
> - An opt-in flag is introduced to enable this feature. A threshold flag is
> also introduced to control the aggressiveness of outlining according to the
> application's needs.
>
> This reduced code size by an average of 4% for LLVM-TestSuite/CTMark
> targeting arm64/-Oz. In a large mobile application, the size benefit was
> even larger, and it reduced page faults as well.
>
> *Design Alternatives:*
>
> 1. Expand helpers eagerly by permuting all cases in an earlier module
> pass. Even though this is rather simple and less invasive, it creates many
> redundant helpers that need to be elided by the linker.
> 2. Turn Prolog-Epilog-Injection into a module pass. This needs the module
> to be plumbed through the pass and the architecture-specific frame lowering.
> I am not sure how other architectures would interact with this module pass.
> 3. Provide all helpers in the runtime/compiler-rt. There are many
> combinations of helpers, and this approach is certainly not flexible.
>
> Regards,
> Kyungwoo
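
To make the quoted RFC's argument concrete, here is a minimal sketch of why same-offset stores repeat more often across functions, and of how a helper created on demand can be shared between functions that save the same registers. It is a toy model, not the patch: the instruction strings, the `HYPOTHETICAL_PROLOG_` naming scheme, and the three example functions are all assumptions made for illustration, and the handling of x30 around the helper call is ignored entirely.

```python
# Illustrative model only: instruction strings and helper names are
# assumptions made for this sketch, not the output of D76570.

# Callee-saved register pairs that each hypothetical function spills.
FUNCS = {
    "f": [("x29", "x30"), ("x19", "x20")],
    "g": [("x29", "x30"), ("x19", "x20"), ("x21", "x22")],
    "h": [("x29", "x30"), ("x19", "x20"), ("x21", "x22")],
}


def irregular_prolog(pairs):
    """One combined stack adjustment, then stores at distinct offsets."""
    order = list(reversed(pairs))
    size = 16 * len(order)
    first_a, first_b = order[0]
    insts = [f"stp {first_a}, {first_b}, [sp, #-{size}]!"]
    for i, (a, b) in enumerate(order[1:], start=1):
        insts.append(f"stp {a}, {b}, [sp, #{16 * i}]")
    return insts


def homogeneous_prolog(pairs):
    """Every pair is pushed with the same 16-byte pre-decrement."""
    return [f"stp {a}, {b}, [sp, #-16]!" for a, b in pairs]


def distinct_instructions(emit):
    """Count distinct instruction strings across all modelled functions."""
    return len({inst for pairs in FUNCS.values() for inst in emit(pairs)})


def lower_with_helpers(funcs):
    """Create one helper per register list, on first use only."""
    helpers, calls = {}, {}
    for name, pairs in funcs.items():
        key = "".join(reg for pair in pairs for reg in pair)
        helper = f"HYPOTHETICAL_PROLOG_{key}"
        helpers.setdefault(helper, homogeneous_prolog(pairs))
        calls[name] = helper
    return helpers, calls


if __name__ == "__main__":
    print("irregular:", distinct_instructions(irregular_prolog))
    print("homogeneous:", distinct_instructions(homogeneous_prolog))
    helpers, calls = lower_with_helpers(FUNCS)
    print(len(helpers), "helpers for", len(calls), "functions")
```

In this model the irregular form produces five distinct store instructions across the three functions while the homogeneous form produces three, and `g` and `h` end up calling the same helper. The single dictionary stands in for both steps the RFC describes separately: on-demand creation within a module and merging of identical helpers by the linker.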

Vedant Kumar via llvm-dev

2020-Mar-25 18:17 UTC

I see. I think it’d help with the upstreaming effort to have some more concrete
details about performance measurements, so that potential adopters can get a
rough understanding of the expected impact. In particular, if you could share:

- a run-time performance comparison over a representative subset of benchmarks
from LNT (aarch64/-Oz), taken from a stabilized device
- some explanation for any performance differences seen in that comparison
- ditto for a code size comparison over LNT
- some brief explanation of the methodology used to measure app startup time and
the number of page faults before app startup completes

That would be very valuable.

best,
vedant
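
A code size comparison of the kind requested above could be summarized along these lines. This is only a sketch under stated assumptions: the benchmark names and byte counts are placeholders rather than measurements from this thread, `size_report` is a hypothetical helper, and the geometric mean is one common choice of summary (the thread does not say how the 4% average was computed).

```python
from statistics import geometric_mean


def size_report(baseline, candidate):
    """Return per-benchmark relative size change and the geomean change."""
    per_bench = {
        name: candidate[name] / baseline[name] - 1.0 for name in sorted(baseline)
    }
    ratios = [candidate[name] / baseline[name] for name in baseline]
    return per_bench, geometric_mean(ratios) - 1.0


if __name__ == "__main__":
    # Placeholder byte counts, not measurements from the thread.
    baseline = {"bench_a": 1000, "bench_b": 2000}
    candidate = {"bench_a": 900, "bench_b": 1800}
    per_bench, overall = size_report(baseline, candidate)
    for name, delta in per_bench.items():
        print(f"{name}: {delta:+.1%}")
    print(f"geomean: {overall:+.1%}")
```

Reporting the per-benchmark deltas next to the summary makes it easier to spot and explain outliers, which is the second item in the list above.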

Kyungwoo Lee via llvm-dev

2020-Mar-26 00:46 UTC

I understand it would be interesting to see the performance impact on a set of
benchmarks, even under -Oz.
However, I'm not familiar with LNT and its process. I assume it does not
need to run tests on local (arm64) devices, right? If it does need local
devices, I do not have the resources or means to measure locally. The large
benchmark and rough performance implications I mentioned come from internal
automated tests that I simply submitted to, and unfortunately I cannot share
the details.
If running LNT does not require a local device, can you share a pointer on
how I can submit to or access such infrastructure to test a new compiler?

Regards,
Kyungwoo


