thr3ads.net - llvm dev - [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations [Oct 2019]

James Y Knight via llvm-dev

2019-Oct-02 20:58 UTC

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

I'm a bit confused by this subthread -- doesn't BOLT have the exact same
CFI bloat issue? From my cursory reading of the Propeller doc, the CFI
duplication is _necessary_ to represent discontiguous functions, not
anything particular to the way Propeller happens to generate those
discontiguous functions.

And emitting discontiguous functions is a fundamental goal of this, right?

On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Thanks for clarifying. This means once you move to the next basic block
> (or any other basic block in the function) you have to execute an entirely
> new set of CFI instructions except for the common CIE part. While indeed
> this is not as bad, on average, the overall active memory footprint will
> increase.
>
> Creating one FDE per basic block means that .eh_frame_hdr, an allocatable
> section, will be bloated too. This will increase the FDE lookup time. I
> don’t see .eh_frame_hdr being mentioned in the proposal.
>
> Maksim
>
> On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny" <kpszeniczny at google.com> wrote:
>
>> On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>>> *Pessimization/overhead for stack unwinding used by system-wide
>>> profilers and for exception handling*
>>>
>>> Larger CFI programs put an extra burden on unwinding at runtime as more
>>> CFI (and thus native) instructions have to be executed. This will cause
>>> more overhead for any profiler that records stack traces, and, as you
>>> correctly note in the proposal, for any program that heavily uses
>>> exceptions.
>>
>> The number of CFI instructions that have to be executed when unwinding
>> any given stack stays the same. The CFI instructions for a function have
>> to be duplicated in every basic block section, but when performing
>> unwinding only one such set is executed -- the copy for the current basic
>> block. However, this copy contains precisely the same CFI instructions as
>> the ones that would have to be executed if there were no basic block
>> sections.
>>
>> --
>> Krzysztof Pszeniczny
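To make the .eh_frame_hdr concern above concrete: that section carries a
table sorted by start address, which unwinders binary-search to find the FDE
covering a given PC. The Python sketch below is a toy model of such a lookup,
not the code of any real unwinder. The function count, block sizes, and helper
names (build_table, find_fde, max_probes) are all made up for illustration.

```python
import bisect
import math

def build_table(ranges):
    """Return a sorted list of (start, end, label) entries, one per FDE."""
    return sorted(ranges)

def find_fde(table, pc):
    """Binary-search the sorted table for the entry whose range covers pc."""
    # The key sorts after every entry whose start is <= pc, so the entry
    # just before the insertion point is the only candidate.
    i = bisect.bisect_right(table, (pc, float("inf"), "")) - 1
    if i >= 0 and table[i][0] <= pc < table[i][1]:
        return table[i]
    return None

def max_probes(n_entries):
    """Worst-case comparisons for a binary search over n_entries entries."""
    return math.ceil(math.log2(n_entries + 1)) if n_entries else 0

# Hypothetical layout: 1000 functions, 16 blocks each, 32 bytes per block.
per_function = build_table(
    (f * 512, f * 512 + 512, f"func{f}") for f in range(1000)
)
per_block = build_table(
    (f * 512 + b * 32, f * 512 + b * 32 + 32, f"func{f}.bb{b}")
    for f in range(1000)
    for b in range(16)
)

print(len(per_function), max_probes(len(per_function)))
print(len(per_block), max_probes(len(per_block)))
print(find_fde(per_function, 1000), find_fde(per_block, 1000))
```

In this toy setup the per-block table has 16 times as many entries, while the
worst-case probe count grows from 10 to 14. The table itself lives in an
allocatable section, so its size grows linearly with the number of FDEs even
though the lookup cost grows only logarithmically.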

Rafael Auler via llvm-dev

2019-Oct-02 22:18 UTC

You’re correct, except that, in Propeller, CFI duplication happens for every
basic block as it operates with the conservative assumption that a block can be
put anywhere by the linker. That’s a significant bloat that is not cleaned up
later. So, during link time, if N blocks from the same function are contiguous
in the final layout, as it should happen most of the time for any sane BB order,
we would have several FDEs for a region that only needs one. The bloat goes to
the final binary (a lot more FDEs, specifically, one FDE per basic block).

BOLT will only split a function in two parts, and only if it has a profile. Most
of the time, a function is not split. It also has an option not to split at all.
For internally reordered basic blocks of a given function, it has CFI
deduplication logic (it will interpret and build the CFI states for each block
and rewrite the CFIs in a way that uses the minimum number of instructions to
encode the states for each block).
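
The observation that N contiguous blocks of one function need only one FDE
can be sketched as a range-coalescing pass over the final layout. This is a
minimal illustration, not BOLT's or lld's code: it only counts address ranges,
and it assumes the CFI state across merged blocks can be expressed by a single
CFI program, which is what a CFI-aware rewriter would have to establish. The
layout and the coalesce helper are hypothetical.

```python
def coalesce(blocks):
    """Merge address ranges that are adjacent and belong to the same function.

    blocks: iterable of (start, end, function_name) in the final layout.
    Returns a list of merged (start, end, function_name) ranges.
    """
    merged = []
    for start, end, func in sorted(blocks):
        if merged and merged[-1][2] == func and merged[-1][1] == start:
            # Extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], end, func)
        else:
            merged.append((start, end, func))
    return merged

# Hypothetical final layout: foo's hot blocks stay together, and its cold
# block is placed after bar.
layout = [
    (0x1000, 0x1020, "foo"),
    (0x1020, 0x1048, "foo"),
    (0x1048, 0x1060, "foo"),
    (0x1060, 0x10A0, "bar"),
    (0x10A0, 0x10B0, "foo"),
]
ranges = coalesce(layout)
print(len(layout), "per-block ranges ->", len(ranges), "coalesced ranges")
```

Here five per-block ranges collapse to three: foo still needs two because it
really is discontiguous, which is the part of the duplication that cannot be
avoided.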

Sriraman Tallam via llvm-dev

2019-Oct-03 02:24 UTC

Maks and team, thanks for the detailed feedback; we will address all of your
concerns. Let’s begin with CFI and DebugInfo, since these are already
being discussed.

TL;DR: clang is pathological, and the actual CFI bloat will go down from 7x to
2x.

Let us present the CFI bloat for each benchmark with the default option, which
creates basic block sections only for functions with samples. For clang, it is
7x and *not 17x* (the 17x number is for all sections), and for the large
benchmarks it is less than 30%, or 1.3x. Among the large benchmarks, Storage is
the highest, going from 18M to 23M, ~30%. Clang is almost pathological here
because 10% of all functions in clang have samples (are touched by the profile
information), and all of these functions get full basic block sections. For the
large benchmarks, by contrast, only 0.5% of functions have samples. Moreover,
the 10% of clang functions that have samples also happen to contain 35% of all
basic blocks. This means we are creating sections for 35% of all basic blocks,
and the CFI bloat clearly shows it.
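
As a quick sanity check on the figures quoted above, the short Python snippet
below recomputes the two ratios that can be derived from them. It uses only
the numbers in this message; the variable and function names are illustrative.

```python
def bloat_factor(before, after):
    """Size after divided by size before."""
    return after / before

# Storage benchmark: 18M -> 23M (figures from the message above).
storage = bloat_factor(18, 23)
storage_increase_pct = (storage - 1) * 100

# clang: 10% of functions hold 35% of all basic blocks, so a sampled function
# has this many times the average number of blocks per function.
relative_blocks_per_sampled_function = 35 / 10

print(f"Storage: {storage:.2f}x, +{storage_increase_pct:.1f}%")
print(f"Sampled clang functions: {relative_blocks_per_sampled_function}x average block count")
```

So the Storage growth is a little under 28%, consistent with "less than 30%",
and the sampled clang functions are on average 3.5 times larger in block count
than a typical function, which is why sectioning all of their blocks is
expensive.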

The data for clang also shows that only 7% of the basic blocks have samples.
We are working on restricting basic block sections to only those basic blocks
that have samples. The remaining (cold) basic blocks in a function would share
a single section. With this, we would reduce the CFI bloat from 7x to 2x. This
is not hard to do, and we will follow up with a patch.
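
The two policies described above can be sketched as follows. This is a
simplified model, not the Propeller implementation: the section names, the
assign_sections and fde_count helpers, and the sample counts are made up, and
the model assumes one FDE per distinct section.

```python
def assign_sections(func_name, block_samples, only_sampled_blocks):
    """Map each basic block index to a (hypothetical) section name.

    block_samples: list of sample counts, one per basic block.
    only_sampled_blocks=False models the default described above (every block
    of a sampled function gets its own section); True models the proposed
    restriction (unsampled blocks share one cold section).
    """
    if not any(block_samples):
        # A function without samples stays in a single section.
        return {i: f".text.{func_name}" for i in range(len(block_samples))}
    sections = {}
    for i, count in enumerate(block_samples):
        if count > 0 or not only_sampled_blocks:
            sections[i] = f".text.{func_name}.bb{i}"
        else:
            sections[i] = f".text.{func_name}.cold"
    return sections

def fde_count(sections):
    """In this model, one FDE is needed per distinct section."""
    return len(set(sections.values()))

samples = [120, 0, 95, 0, 0, 0, 3, 0]  # made-up profile for one function
default_policy = assign_sections("foo", samples, only_sampled_blocks=False)
restricted = assign_sections("foo", samples, only_sampled_blocks=True)
print(fde_count(default_policy), "FDEs by default,", fde_count(restricted), "when restricted")
```

For this made-up function the FDE count drops from 8 to 4: three sampled
blocks plus one shared cold section. The actual savings depend on the real
profile, as the 7x to 2x estimate for clang indicates.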

The reasoning is similar for object size bloat with regard to eh_frame:
restricting section creation to only those basic blocks that have profiles will
reduce it much further.

Importantly, if CFI supported discontiguous ranges, we wouldn't have to
duplicate CFI FDEs and the bloat would be near minimal.

BOLT parses CFI and DWARF and generates compact information by rewriting it.
Propeller, by contrast, uses lld, which fixes things up through relocations and
sections but does not rewrite them. This is by design: lld is not DWARF- or
CFI-aware. We designed basic block sections to work just like function
sections. The compiler produces a bunch of sections and relocations; the linker
patches the relocations, and the debug info and CFI come out right. That's it.
For CFI, since there is no support for discontiguous ranges, we have to
duplicate and dedup FDEs, but only for blocks with sections. We are asking that
CFI support discontiguous ranges, which would make this even simpler.
Alternatively, if lld were made DWARF- and CFI-aware, we could rewrite the
information compactly like BOLT does. Either would help with object size and
binary size bloat.

Sriraman Tallam via llvm-dev

2019-Oct-07 18:15 UTC

We would also like to clear up two misconceptions about CFI instructions:

1) Extra CFI FDE entries for basic blocks do not mean more dynamic
instructions are executed. In fact, the number does not increase at all. Krys
talked about this earlier.
2) We deduplicate the common static CFI instructions in the FDEs
and move them to the CIE. Hence, moving to a new basic block does not
mean a completely new set of CFI instructions is executed.
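
Point 2 can be illustrated with a small sketch that hoists the instructions
shared by every FDE into a common prefix, standing in for the CIE's initial
instructions. This is an assumed scheme for illustration only, since the
message does not spell out how Propeller performs the deduplication. The
instruction tuples are simplified stand-ins for DW_CFA_* operations, the
programs are made up, and the sketch assumes none of them relies on
restore-style operations whose meaning depends on the CIE contents.

```python
def hoist_common_prefix(fde_programs):
    """Split FDE instruction lists into a shared prefix and per-FDE remainders.

    The shared prefix models instructions placed once in the CIE. Hoisting
    stops at the first advance_loc because the CIE's initial instructions
    apply at the start of every FDE's range.
    """
    prefix = []
    for column in zip(*fde_programs):
        first = column[0]
        if first[0] == "advance_loc" or any(insn != first for insn in column):
            break
        prefix.append(first)
    rest = [program[len(prefix):] for program in fde_programs]
    return prefix, rest

# Made-up CFI programs for three basic block sections of one function whose
# frame is already set up when each block is entered.
programs = [
    [("def_cfa", "rbp", 16), ("offset", "rbp", -16),
     ("advance_loc", 4), ("offset", "rbx", -24)],
    [("def_cfa", "rbp", 16), ("offset", "rbp", -16)],
    [("def_cfa", "rbp", 16), ("offset", "rbp", -16), ("offset", "rbx", -24)],
]
cie_insns, fde_insns = hoist_common_prefix(programs)
print("shared (CIE):", cie_insns)
for i, insns in enumerate(fde_insns):
    print(f"FDE {i}:", insns)
```

This also ties back to point 1: an unwinder landing in any one block runs the
shared prefix plus that block's remainder only, so the number of instructions
it executes is the same as for the block's original, undeduplicated program.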

James Y Knight via llvm-dev

2019-Oct-11 17:45 UTC

Is there much value in deferring the block ordering to link time? That
is, does the block layout algorithm need to consider global layout issues
when deciding which blocks to put together and which to relegate to the
far-away part of the code?

Or, could the Propeller-optimized compile step instead split each function
into only 2 pieces -- one containing an "optimally-ordered" set of hot
blocks from the function, and another containing the cold blocks? The
linker would have less flexibility in placement, but maybe it doesn't
actually need that flexibility?

Apologies if this is obvious for those who actually know what they're
talking about here. :)
