thr3ads.net - search: "hfsort"

2017 Jul 31

1

[RFC] Profile guided section layout

Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:
> I've recently implemented profile guided section layout in llvm + lld using the Call-Chain Clustering (C³) heuristic from https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf . In the programs I've tested it on I've gotten from 0% to 5% performance improvement over standard PGO with zero cases of slowdowns and up to 15% reduction in ITLB misses.
>
> There are three parts to this implementation.
>
> The first is a new ll...

[RFC] Profile guided section layout

2017 Jun 15

7

I've recently implemented profile guided section layout in llvm + lld using the Call-Chain Clustering (C³) heuristic from https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf . In the programs I've tested it on I've gotten from 0% to 5% performance improvement over standard PGO with zero cases of slowdowns and up to 15% reduction in ITLB misses. There are three parts to this implementation. The first is a new llvm pass which uses branch frequency i...

[RFC] Propeller: A frame work for Post Link Optimizations

2019 Oct 17

2

...now on these binaries as all the files are provided. We have also provided the raw perf data files in case you want to independently convert.

$ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ -reorder-functions=hfsort -relocs=1 --update-debug-sections

For version 2, this is the number:

Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40
Maximum resident set size (kbytes): 18742688

That is 125 seconds and ~18G of RAM. For version 1, this hangs and we stopped it after several minutes and the maximum RSS size...
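The "125 seconds and ~18G" figures quoted above follow directly from the two `/usr/bin/time -v` lines. A minimal sketch of that conversion, assuming the "kbytes" reported by GNU time are units of 1024 bytes; the helper names `parse_elapsed` and `kbytes_to_gib` are hypothetical and not part of any tool mentioned in the thread:

```python
def parse_elapsed(text):
    """Convert an 'h:mm:ss' or 'm:ss' string (fractional seconds allowed) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        # Each field to the left is worth 60 times the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds


def kbytes_to_gib(kbytes):
    """Convert a resident set size in kbytes to GiB (assumes 1 kbyte = 1024 bytes)."""
    return kbytes * 1024 / 2**30


if __name__ == "__main__":
    print(f"{parse_elapsed('2:05.40'):.1f} s")      # elapsed wall-clock time
    print(f"{kbytes_to_gib(18742688):.2f} GiB")     # maximum resident set size
```

Under that assumption the maximum resident set size comes out a little under 18 GiB, which matches the rounded figure in the message.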

[RFC] Propeller: A frame work for Post Link Optimizations

2019 Oct 18

3

...> are provided. We have also provided the raw perf data files in case you > want to independently convert. > > > > $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b > pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ > -reorder-functions=hfsort -relocs=1 --update-debug-sections > > > > For version 2, this is the number: > > > > Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40 > > Maximum resident set size (kbytes): 18742688 > > > > That is 125 seconds and ~18G of RAM. > > > > For v...

Regarding basic block layout/code placement optimizations of profile guided optimization (PGO)

2018 Aug 07

3

Hi, I would like to learn the details regarding what exactly PGO does for basic block layout/code placement optimizations in llvm. Could you please point me to some descriptions? Is it close to this paper (Karl Pettis and Robert C. Hansen. 1990. Profile guided code positioning. PLDI'90) http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Articles/papers15.pdf? Whether it is purely

[RFC] Propeller: A frame work for Post Link Optimizations

2019 Oct 22

2

...> are provided. We have also provided the raw perf data files in case you > want to independently convert. > > > > $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b > pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+ > -reorder-functions=hfsort -relocs=1 --update-debug-sections > > > > For version 2, this is the number: > > > > Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40 > > Maximum resident set size (kbytes): 18742688 > > > > That is 125 seconds and ~18G of RAM. > > > > For v...

[RFC] Profile guided section layout

2017 Jul 31

3

Hi Rafael, On 07/31/2017 04:20 PM, Rafael Avila de Espindola via llvm-dev wrote: > However, do we need to start with instrumentation? The original paper > uses sampling with good results and current intel cpus can record every > branch in a program. > > I would propose starting with just an lld patch that reads the call > graph from a file. The format would be very similar to

[RFC] Propeller: A frame work for Post Link Optimizations

2019 Oct 14

2

Hello, I wanted to consolidate all the discussions and our final thoughts on the concerns raised. I have attached a document consolidating it. BOLT’s performance gains inspired this work and we believe BOLT is a great piece of engineering. However, there are build environments where scalability is critical and memory limits per process are tight: * Debug Fission,

[RFC] Profile guided section layout

2017 Jul 31

2

...}); }
+// Sort sections by the profile data provided in the .note.llvm.callgraph
+// sections.
+//
+// This algorithm is based on Call-Chain Clustering from:
+// Optimizing Function Placement for Large-Scale Data-Center Applications
+// https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+//
+// This first builds a call graph based on the profile data then iteratively
+// merges the hottest call edges as long as it would not create a cluster larger
+// than the page size. All clusters are then sorted by a density metric to
+// further improve locality.
+template <clas...
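The comment in this patch describes the clustering in three steps: build a call graph, merge along the hottest edges while a cluster still fits in a page, then sort clusters by density. The following is only a simplified sketch of those steps, not the code from the patch or from hfsort. The function name `c3_layout`, the 4096-byte page size, the input format, and the use of incoming call counts as the hotness measure are all assumptions made for illustration.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size in bytes


@dataclass
class Cluster:
    functions: list  # function names in layout order
    size: int        # total code size in bytes (assumed positive)
    weight: int      # total incoming call count of the members

    @property
    def density(self):
        return self.weight / self.size


def c3_layout(sizes, edges, page_size=PAGE_SIZE):
    """sizes: {function: bytes}; edges: {(caller, callee): call_count}."""
    # Start with one cluster per function.
    cluster_of = {name: Cluster([name], size, 0) for name, size in sizes.items()}
    for (_, callee), count in edges.items():
        cluster_of[callee].weight += count

    # Visit call edges from hottest to coldest and append the callee's
    # cluster to the caller's cluster when the result fits in one page.
    for (caller, callee), _ in sorted(edges.items(), key=lambda e: -e[1]):
        a, b = cluster_of[caller], cluster_of[callee]
        if a is b or a.size + b.size > page_size:
            continue
        a.functions.extend(b.functions)
        a.size += b.size
        a.weight += b.weight
        for name in b.functions:
            cluster_of[name] = a

    # Deduplicate the surviving clusters and order them by density.
    clusters = {id(c): c for c in cluster_of.values()}.values()
    ordered = sorted(clusters, key=lambda c: -c.density)
    return [name for c in ordered for name in c.functions]


if __name__ == "__main__":
    sizes = {"main": 100, "hot": 200, "helper": 300, "cold": 4000}
    edges = {("main", "hot"): 90, ("hot", "helper"): 80, ("main", "cold"): 1}
    print(c3_layout(sizes, edges))
```

In this toy input the two hot edges are merged into one cluster, while `cold` stays separate because adding it would exceed the assumed page size, so it is placed last.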

[RFC] Propeller: A frame work for Post Link Optimizations

2019 Sep 24

9

Greetings, We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer framework, on large Google benchmarks and noticed that it improves key performance metrics of these benchmarks by 2% to 6%, which is pretty impressive as this is over and above a baseline binary already heavily optimized with ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of binaries

[RFC] Profile guided section layout

2017 Aug 01

2

...}); }
+// Sort sections by the profile data provided in the .note.llvm.callgraph
+// sections.
+//
+// This algorithm is based on Call-Chain Clustering from:
+// Optimizing Function Placement for Large-Scale Data-Center Applications
+// https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+//
+// This first builds a call graph based on the profile data then iteratively
+// merges the hottest call edges as long as it would not create a cluster larger
+// than the page size. All clusters are then sorted by a density metric to
+// further improve locality.
+template <clas...
