similar to: [RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data

Displaying 20 results from an estimated 10000 matches similar to: "[RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data"

2020 Aug 05
3
[RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data
On Tue, Aug 4, 2020 at 10:51 PM aditya kumar <hiraditya at gmail.com> wrote: > Glad to hear that there is an interest in a function splitting pass. There > are advantages to splitting functions at different stages as you've already > noted. > Right -- with slightly different objectives. Machine Function Splitting Pass's main focus is on performance improvement. > -
2020 Aug 10
2
[RFC] Machine Function Splitter - Split out cold blocks from machine functions using profile data
>Exceptions >All eh pads are grouped together regardless of their coldness and are part of the original function. There are outstanding issues with splitting eh pads if they reside in separate sections in the binary. This remains as part of future work. Can you elaborate more on the outstanding issues with splitting eh pads? From my dip into the unwind map in gcc_except_table the
2020 Jun 02
2
Improve hot cold splitting to aggressively outline small blocks
Hello Tobias, Thank you for the suggestion! Aditya also mentioned this. I will look into it. Best regards, Ruijie Ruijie Fang Email: ruijief at princeton.edu On Tue, Jun 2, 2020 at 12:48 PM Tobias Hieta <tobias at plexapp.com> wrote: > Hello Ruijie, > > One other workload that would be interesting to test might be clang > itself. Building clang with PGO information is a
2020 Jun 01
2
Improve hot cold splitting to aggressively outline small blocks
Hello, I am Ruijie Fang, a GSoC student working on "Improve hot cold splitting to aggressively outline small blocks." Over the course of last week, I met with my mentor and co-mentor, Aditya Kumar and Rodrigo Rocha, and we made a preliminary plan on improving the existing hot/cold splitting pass in LLVM through identifying patterns of cold blocks in real-world workloads via block
2020 Jun 02
2
Improve hot cold splitting to aggressively outline small blocks
Hi Teresa, Thank you for your reply! I discussed this with Aditya and Rodrigo today. We will always have PGO turned on for our benchmark (i.e. we assume the profiling information is always available). In terms of the workload we supply to PGO: For postgresql, I suggested we use the "pgbench" benchmark, a TPC-B-based SQL benchmark for postgres, to supply profiling information
2019 Sep 26
2
[RFC] Propeller: A frame work for Post Link Optimizations
On Wed, Sep 25, 2019 at 5:02 PM Eli Friedman via llvm-dev < llvm-dev at lists.llvm.org> wrote: > My biggest question about this architecture is about when propeller runs > basic block reordering within a function. It seems like a lot of the > complexity comes from using the proposed -fbasicblock-sections to generate > mangled ELF, and then re-parsing the mangled ELF as a
2019 Sep 26
2
[RFC] Propeller: A frame work for Post Link Optimizations
On Thu, Sep 26, 2019 at 12:39 PM Eli Friedman <efriedma at quicinc.com> wrote: > > > > From: Xinliang David Li <xinliangli at gmail.com> > Sent: Wednesday, September 25, 2019 5:58 PM > To: Eli Friedman <efriedma at quicinc.com> > Cc: Sriraman Tallam <tmsriram at google.com>; llvm-dev <llvm-dev at lists.llvm.org> > Subject: [EXT] Re: [llvm-dev]
2015 May 04
2
[PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On 04/30/2015 06:39 PM, Jeremy Fitzhardinge wrote: > On 04/30/2015 03:53 AM, Juergen Gross wrote: >> Paravirtualized spinlocks produce some overhead even if the kernel is >> running on bare metal. The main reason is the more complex locking >> and unlocking functions. Especially unlocking is no longer just one >> instruction but so complex that it is no longer inlined.
2015 May 06
2
[PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On 05/05/2015 07:21 PM, Jeremy Fitzhardinge wrote: > On 05/03/2015 10:55 PM, Juergen Gross wrote: >> I did a small measurement of the pure locking functions on bare metal >> without and with my patches. >> >> spin_lock() for the first time (lock and code not in cache) dropped from >> about 600 to 500 cycles. >> >> spin_unlock() for first time dropped
2019 Sep 24
9
[RFC] Propeller: A frame work for Post Link Optimizations
Greetings, We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer framework, on large Google benchmarks and noticed that it improves key performance metrics of these benchmarks by 2% to 6%, which is pretty impressive as this is over and above a baseline binary already heavily optimized with ThinLTO + PGO. Furthermore, BOLT is also able to improve the performance of binaries
2019 Sep 27
5
[RFC] Propeller: A frame work for Post Link Optimizations
On Thu, Sep 26, 2019 at 5:13 PM Eli Friedman <efriedma at quicinc.com> wrote: > > > -----Original Message----- > > From: Sriraman Tallam <tmsriram at google.com> > > Sent: Thursday, September 26, 2019 3:24 PM > > To: Eli Friedman <efriedma at quicinc.com> > > Cc: Xinliang David Li <xinliangli at gmail.com>; llvm-dev <llvm-dev at
2020 Feb 28
5
A Propeller link (similar to a Thin Link as used by ThinLTO)?
I met with the Propeller team today (we work for the same company but it was my first time meeting two members on the team:) ). One thing I have been reassured of: * There is no general disassembly work. General disassembly work would assuredly frighten off developers. (Inherently unreliable, memory usage heavy and difficult to deal with CFI, debug information, etc) Minimal amount of plumbing work
2009 Sep 19
2
Many improvements and a few problems
Hi, I must say I'm very impressed with improvements in the latest versions of the encoder, especially in 2-pass mode. I was trying encodes of videos with sudden changes from no or moderate motion to high motion scenes. Most samples of Theora quality I saw on the net were very slow motion usually. These high-motion videos were especially hard for Theora and it quickly introduced
2019 Oct 11
2
[RFC] Propeller: A frame work for Post Link Optimizations
Is there large value from deferring the block ordering to link time? That is, does the block layout algorithm need to consider global layout issues when deciding which blocks to put together and which to relegate to the far-away part of the code? Or, could the propeller-optimized compile step instead split each function into only 2 pieces -- one containing an "optimally-ordered" set of
2017 Jan 30
4
(RFC) Adjusting default loop fully unroll threshold
Currently, loop fully unroller shares the same default threshold as loop dynamic unroller and partial unroller. This seems conservative because unlike dynamic/partial unrolling, fully unrolling will not affect LSD/ICache performance. In https://reviews.llvm.org/D28368, I proposed to double the threshold for loop fully unroller. This will change the codegen of several SPECCPU benchmarks: Code
2014 Nov 19
5
[LLVMdev] Odd code layout requirements for MCJIT
I'm part of a team working on adding an llvm codegen backend to HHVM (PHP JIT, http://hhvm.com) using MCJIT. We have a code layout problem and I'm looking for opinions on good ways to solve it. The short version is that the memory we emit code into is split into a few different areas, and we'd like a way to control which area each BasicBlock ends up in during codegen. I know this
2019 Oct 18
3
[RFC] Propeller: A frame work for Post Link Optimizations
Hello Maksim, On Fri, Oct 18, 2019 at 10:57 AM Maksim Panchenko <maks at fb.com> wrote: > Cool. The new numbers look good. If you run BOLT with jemalloc library > > preloaded, you will likely get a runtime closer to 1 minute. We’ve noticed > that > > compared to the default malloc, it improves the multithreaded > > performance and brings down memory usage
2017 Jan 30
2
(RFC) Adjusting default loop fully unroll threshold
On Mon, Jan 30, 2017 at 3:51 PM Mehdi Amini via llvm-dev < llvm-dev at lists.llvm.org> wrote: > On Jan 30, 2017, at 10:49 AM, Dehao Chen via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > Currently, loop fully unroller shares the same default threshold as loop > dynamic unroller and partial unroller. This seems conservative because > unlike dynamic/partial