Vedant Kumar via llvm-dev
2019-Jan-26 00:29 UTC
[llvm-dev] Status update on the hot/cold splitting pass
Hello,

I’d like to give a status update to the community about the recently-added hot/cold splitting pass. I'll provide some motivation for the pass, describe its implementation, summarize recent/ongoing work, and share early results.

# Motivation

We (at Apple) have found that memory pressure from resident pages of code is significant on embedded devices. In particular, this pressure spikes during app launches. We’ve been looking into ways to reduce memory pressure, and hot/cold splitting is one part of a solution.

# What does hot/cold splitting do?

The hot/cold splitting pass identifies cold basic blocks and moves them into separate functions. The linker must then order the newly-created cold functions away from the rest of the program (say, into a cold section). The idea is to have these cold pages faulted in relatively infrequently (if at all), and to improve the memory locality of the code outside the cold area.

The pass considers profile data, traps, uses of the `cold` attribute, and exception-handling code to identify cold blocks. If the pass identifies a cold region that's profitable to extract, it uses LLVM's CodeExtractor utility to split the region out of its original function. Newly-created cold functions are marked `minsize` (-Oz). The splitting process may occur multiple times per function.

The choice to perform splitting at the IR level gave us a lot of flexibility. It allowed us to quickly target different architectures and evaluate new phase orderings. It also made it easier to split out highly complex subgraphs of CFGs (with both live-ins and live-outs). One disadvantage is that we cannot easily split out EH pads (llvm.org/PR39545). However, our experiments show that splitting out EH pads would only increase the total amount of split code by 2% across the entire iOS shared cache.
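To make this concrete, here is a minimal sketch of the kind of source pattern the pass targets. The function names are invented for illustration, and the commands in the comments are one plausible way to run the pass on the resulting IR (flag spellings may vary across LLVM versions):

```c
// Minimal sketch of a splitting candidate. Function names are illustrative.
// One plausible way to run the pass on the IR by hand (flag spellings may
// vary by LLVM version):
//   clang -O2 -S -emit-llvm example.c -o example.ll
//   opt -hotcoldsplit -S example.ll -o example.split.ll
#include <stdio.h>
#include <stdlib.h>

// The `cold` and `noreturn` attributes mark this error path as rarely taken.
__attribute__((cold, noreturn))
static void die(const char *msg) {
  fprintf(stderr, "fatal: %s\n", msg);
  abort();
}

int parse_header(const unsigned char *buf, size_t len) {
  if (len < 4)
    die("truncated header"); // cold block: a candidate for extraction into a
                             // separate, minsize cold function
  // The hot path stays in the original function.
  return (int)((unsigned)buf[0] | ((unsigned)buf[1] << 8) |
               ((unsigned)buf[2] << 16) | ((unsigned)buf[3] << 24));
}
```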
# Recent/ongoing work

Aditya and Sebastian contributed the hot/cold splitting pass in September 2018 (r341669). Since then, work on the pass has continued steadily. It gained the ability to extract larger cold regions (r345209), compile-time improvements (r351892, r351894), and a more effective cost model (r352228). With some experimentation, we found that scheduling splitting before inlining gives better code size results without regressing memory locality (r352080). Along the way, CodeExtractor got better at handling debug info (r344545, r346255), and a few other issues in this utility were fixed (r348205, r350420).

At this point, we're able to build & run our software stack with hot/cold splitting enabled. We’d like to introduce a CC1 option to safely toggle splitting on/off (https://reviews.llvm.org/D57265). That would make it easier to experiment with and/or deploy the pass.

# Early results

On internal memory benchmarks, we consistently saw that code page faults were more concentrated with splitting enabled. With splitting, the set of the most-frequently-accessed 95% (99%) of code pages was 10% (resp. 3.6%) smaller. To collect this data, we used a facility in the xnu VM to periodically force pages to be faulted, along with ktrace. We settled on this approach because the alternatives (e.g. directly sampling the RSS of various processes) gave unstable results, even when measures were taken to stabilize the device (e.g. disabling dynamic frequency switching, SMP, and various other features).

On arm64, the performance impact of enabling splitting in the LLVM test suite appears to be in the noise. We think this is because split code amounts to just 0.1% of all the code in the test suite. Across the iOS shared cache, 0.9% of code is split, with higher percentages in key frameworks (e.g. 7% in libdispatch). For three internal benchmarks, we see geomean score improvements of 1.58%, 0.56%, and 0.27% respectively.

We think these results are promising. I’d like to encourage others to evaluate the pass and share results.

Thanks!

vedant
Aditya K via llvm-dev
2019-Jan-28 18:51 UTC
[llvm-dev] Status update on the hot/cold splitting pass
Very happy to see good results. On our side, we are still struggling to obtain a good profile that would enable aggressive hot/cold splitting. The static profile isn't helping much for our use cases. I'd be curious to know whether anyone has seen good improvements with static profile analysis alone.

-Aditya
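For readers facing the same profile problem, here is a hedged sketch of the usual ways to supply hotness information. The file and function names are invented; the instrumentation-PGO commands in the comments are the standard clang/llvm-profdata ones:

```c
// Two ways to give the splitting pass better hotness data.
//
// (1) Instrumentation-based PGO, so blocks carry real branch weights:
//   clang -O2 -fprofile-instr-generate server.c -o server.inst
//   ./server.inst <representative workload>      # writes default.profraw
//   llvm-profdata merge -output=server.profdata default.profraw
//   clang -O2 -fprofile-instr-use=server.profdata server.c -o server
//
// (2) When no runtime profile is available, source-level hints can help the
//     static estimator treat paths as cold:
#include <stdio.h>

__attribute__((cold)) void log_rare_event(const char *what) {
  fprintf(stderr, "rare event: %s\n", what);
}

long accumulate(const long *v, long n) {
  long sum = 0;
  for (long i = 0; i < n; ++i) {
    // __builtin_expect marks this branch as unlikely, which makes the block
    // a better candidate for extraction.
    if (__builtin_expect(v[i] < 0, 0)) {
      log_rare_event("negative input");
      continue;
    }
    sum += v[i];
  }
  return sum;
}
```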
Vedant Kumar via llvm-dev
2019-Jan-28 19:00 UTC
[llvm-dev] Status update on the hot/cold splitting pass
The splitting pass currently doesn’t move cold symbols into a separate section. Is that affecting your results? On Darwin, we plan on using a symbol attribute to provide an ordering hint to the linker (see r352227, N_COLD_FUNC).

vedant
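For illustration of what a dedicated cold section would mean at the source level: the sketch below is not something the splitting pass currently emits, and the section name is invented, but it shows the kind of separation a cold section or linker ordering hint aims for on Mach-O:

```c
// Illustration only: the splitting pass does not currently do this for its
// extracted functions. On Mach-O, the section attribute takes a
// "segment,section" pair; "__coldtext" is an invented section name.
__attribute__((section("__TEXT,__coldtext"), cold, noinline))
void rarely_used_cleanup(void) {
  /* slow-path work that we would like laid out away from hot code */
}
```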