Displaying 11 results from an estimated 11 matches for "hfsort".
2017 Jul 31
1
[RFC] Profile guided section layout
Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:
> I've recently implemented profile guided section layout in llvm + lld using
> the Call-Chain Clustering (C³) heuristic from
> https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
> . In the programs I've tested it on I've gotten from 0% to 5% performance
> improvement over standard PGO with zero cases of slowdowns and up to 15%
> reduction in ITLB misses.
>
>
> There are three parts to this implementation.
>
> The first is a new ll...
2017 Jun 15
7
[RFC] Profile guided section layout
I've recently implemented profile guided section layout in llvm + lld using
the Call-Chain Clustering (C³) heuristic from
https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
. In the programs I've tested it on I've gotten from 0% to 5% performance
improvement over standard PGO with zero cases of slowdowns and up to 15%
reduction in ITLB misses.
There are three parts to this implementation.
The first is a new llvm pass which uses branch frequency i...
2019 Oct 17
2
[RFC] Propeller: A frame work for Post Link Optimizations
...now on these binaries as all the files
are provided. We have also provided the raw perf data files in case you
want to independently convert.
$ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b
pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+
-reorder-functions=hfsort -relocs=1 --update-debug-sections
For version 2, this is the number:
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40
Maximum resident set size (kbytes): 18742688
That is 125 seconds and ~18G of RAM.
For version 1, this hangs and we stopped it after several minutes and the
maximum RSS size...
2019 Oct 18
3
[RFC] Propeller: A frame work for Post Link Optimizations
...> are provided. We have also provided the raw perf data files in case you
> want to independently convert.
>
> $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b
> pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+
> -reorder-functions=hfsort -relocs=1 --update-debug-sections
>
> For version 2, this is the number:
>
> Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40
> Maximum resident set size (kbytes): 18742688
>
> That is 125 seconds and ~18G of RAM.
>
> For v...
2018 Aug 07
3
Regarding basic block layout/code placement optimizations of profile guided optimization (PGO)
Hi,
I would like to learn the details regarding what exactly PGO does for basic
block layout/code placement optimizations in llvm. Could you please point
me to some descriptions? Is it close to this paper (Karl Pettis and Robert
C. Hansen. 1990. Profile guided code positioning. PLDI'90)
http://perso.ensta-paristech.fr/~bmonsuez/Cours/B6-4/Articles/papers15.pdf?
Whether it is purely
2019 Oct 22
2
[RFC] Propeller: A frame work for Post Link Optimizations
...> are provided. We have also provided the raw perf data files in case you
> want to independently convert.
>
> $ /usr/bin/time -v /llvm-bolt clang-10 -o pgo_relocs-bolt-compiler -b
> pgo_relocs-compiler.yaml -split-functions=3 -reorder-blocks=cache+
> -reorder-functions=hfsort -relocs=1 --update-debug-sections
>
> For version 2, this is the number:
>
> Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.40
> Maximum resident set size (kbytes): 18742688
>
> That is 125 seconds and ~18G of RAM.
>
> For v...
2017 Jul 31
3
[RFC] Profile guided section layout
Hi Rafael,
On 07/31/2017 04:20 PM, Rafael Avila de Espindola via llvm-dev wrote:
> However, do we need to start with instrumentation? The original paper
> uses sampling with good results and current intel cpus can record every
> branch in a program.
>
> I would propose starting with just an lld patch that reads the call
> graph from a file. The format would be very similar to
2019 Oct 14
2
[RFC] Propeller: A frame work for Post Link Optimizations
Hello,
I wanted to consolidate all the discussions and our final thoughts on the
concerns raised. I have attached a document consolidating it.
BOLT’s performance gains inspired this work and we believe BOLT
is a great piece of engineering. However, there are build environments
where scalability is critical and memory limits per process are tight:
* Debug Fission,
2017 Jul 31
2
[RFC] Profile guided section layout
...});
}
+// Sort sections by the profile data provided in the .note.llvm.callgraph
+// sections.
+//
+// This algorithm is based on Call-Chain Clustering from:
+// Optimizing Function Placement for Large-Scale Data-Center Applications
+// https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+//
+// This first builds a call graph based on the profile data then iteratively
+// merges the hottest call edges as long as it would not create a cluster larger
+// than the page size. All clusters are then sorted by a density metric to
+// further improve locality.
+template <clas...
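
The comment in the patch above describes the clustering in three steps: build a call graph, merge along the hottest edges while a cluster stays within a page, then sort clusters by density. The following Python sketch illustrates those steps only; it is not the code from the patch. The function name `layout_sections`, the 4096-byte page size, the use of incoming call counts as a section's weight, and the definition of density as weight divided by size are all assumptions made for illustration.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size in bytes


@dataclass
class Cluster:
    sections: list  # section names in layout order
    size: int       # total size in bytes
    weight: int     # total incoming call count (assumed hotness measure)

    @property
    def density(self):
        # Assumed metric: hotness per byte.
        return self.weight / self.size if self.size else 0.0


def layout_sections(sizes, edges, page_size=PAGE_SIZE):
    """Return section names in a clustered order.

    sizes: {section_name: size_in_bytes}
    edges: {(caller, callee): call_count}
    """
    # Step 1: build the call graph; a section's weight is the sum of the
    # counts on its incoming edges.
    weight = {name: 0 for name in sizes}
    for (_, callee), count in edges.items():
        weight[callee] += count
    cluster_of = {
        name: Cluster([name], sizes[name], weight[name]) for name in sizes
    }

    # Step 2: visit edges hottest first (ties broken by name) and merge the
    # callee's cluster into the caller's unless the result exceeds a page.
    for (caller, callee), _ in sorted(
        edges.items(), key=lambda e: (-e[1], e[0])
    ):
        a, b = cluster_of[caller], cluster_of[callee]
        if a is b or a.size + b.size > page_size:
            continue
        a.sections += b.sections
        a.size += b.size
        a.weight += b.weight
        for name in b.sections:
            cluster_of[name] = a

    # Step 3: collect the distinct clusters and sort them by density,
    # densest first (the sort is stable, so ties keep input order).
    clusters, seen = [], set()
    for name in sizes:
        c = cluster_of[name]
        if id(c) not in seen:
            seen.add(id(c))
            clusters.append(c)
    clusters.sort(key=lambda c: -c.density)
    return [name for c in clusters for name in c.sections]


if __name__ == "__main__":
    sizes = {"main": 100, "hot": 200, "warm": 300, "cold": 4000}
    edges = {("main", "hot"): 1000, ("hot", "warm"): 500, ("main", "cold"): 10}
    print(layout_sections(sizes, edges))
```

In this toy input, `main`, `hot` and `warm` fit together in one page and end up adjacent, while `cold` is too large to join them and is placed afterwards because its cluster is far less dense.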
2019 Sep 24
9
[RFC] Propeller: A frame work for Post Link Optimizations
Greetings,
We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer
framework, on large Google benchmarks and noticed that it improves key
performance metrics of these benchmarks by 2% to 6%, which is pretty impressive
as this is over and above a baseline binary already heavily optimized with
ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of
binaries
2017 Aug 01
2
[RFC] Profile guided section layout
...});
}
+// Sort sections by the profile data provided in the .note.llvm.callgraph
+// sections.
+//
+// This algorithm is based on Call-Chain Clustering from:
+// Optimizing Function Placement for Large-Scale Data-Center Applications
+// https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+//
+// This first builds a call graph based on the profile data then iteratively
+// merges the hottest call edges as long as it would not create a cluster larger
+// than the page size. All clusters are then sorted by a density metric to
+// further improve locality.
+template <clas...