thr3ads.net - llvm dev - [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations [Oct 2019]

If this information is useful, please help other people find it:
Share via:

Sriraman Tallam via llvm-dev

2019-Oct-07 18:15 UTC

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

We would also like to clarify on the misconceptions around CFI Instructions:

There are two things that need to be clarified here:

1) Extra CFI FDE entries for basic blocks does not mean more dynamic
instructions are executed. In fact, they do not increase at all.  Krys
talked about this earlier.
2) We do deduplication of common static CFI instructions in the FDE
and move it to the CIE .  Hence, moving to a new basic block does not
mean a completely new set of CFI instructions is executed.

On Wed, Oct 2, 2019 at 7:24 PM Sriraman Tallam <tmsriram at google.com>
wrote:>
> Maks and team, thanks for the detailed feedback and we will address all of
your
> concerns.  Let’s begin with CFI and DebugInfo first since this is already
> being discussed.
>
> TLDR; clang is pathological and the actual CFI bloat will go down from 7x
to
> 2x.
>
> Let us present the CFI Bloats for each benchmark with the default option,
which
> is creating basic block sections only for functions with samples.   For
clang,
> it is 7x and *not 17x* (the 17 x number is for all sections), and for the
> large benchmarks it is less than 30% or 1.3x. For large benchmarks, Storage
is
> the highest, going from 18M to 23M, ~30%. Clang is almost pathological
here.
> This is because 10% of all functions in clang have samples (touched by the
> profile information), and all of these functions get full basic block
sections.
> Whereas, for the large benchmarks, only 0.5% of functions have samples. 
Now,
> for clang, 10% of functions that have samples also happen to contain 35% of
all
> basic blocks. This means, we are creating sections for 35% of all basic
blocks
> and the CFI bloats are clearly showing.
>
> Now, the data for clang also shows that only 7% of the basic blocks have
> samples. We are working on restricting basic block sections to only those
basic
> blocks that have samples. The rest of the basic blocks (cold) in that
function
> would share the same section.  With this, we would reduce the bloat of CFI
> from 7x to 2x. This is not hard to do and we will follow up with this
patch.
>
> Also for object size bloats with regards to eh_frame, the reasoning is
similar.
>  Restricting the section creation to only basic blocks that have profiles
will
> reduce this a lot more.
>
> Importantly,  if CFI support were available for discontiguous ranges we
wouldn't
> have to duplicate CFI FDEs and the bloats would be near minimal.
>
> BOLT parses CFI and DWARF and generates compact information by rewriting
it.
> Whereas, Propeller uses lld which uses relocations and sections to fixup
but
> does not rewrite it.  This is by design and that lld is not DWARF and CFI
> aware. We designed basic block sections just like function sections.  The
> compiler produces a bunch of sections and relocations. The linker patches
the
> relocations and the debug info and CFI are right, that's it.  For CFI,
since
> there is no support for discontiguous ranges we have to duplicate and dedup
> FDEs only for blocks with sections. We are asking that CFI support
> discontiguous ranges and this would look even simpler.  Alternately, if lld
> were made DWARF and CFI aware we could rewrite it compactly like BOLT.
> These would help with object size bloats and binary size bloats.
>
>
>
>
>
> On Wed, Oct 2, 2019 at 1:59 PM James Y Knight via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> >
> > I'm a bit confused by this subthread -- doesn't BOLT have the
exact same CFI bloat issue? From my cursory reading of the propellor doc, the
CFI duplication is _necessary_ to represent discontiguous functions, not
anything particular to the way Propellor happens to generate those discontiguous
functions.
> >
> > And emitting discontiguous functions is a fundamental goal of this,
right?
> >
> > On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> >>
> >> Thanks for clarifying. This means once you move to the next basic
block (or any other basic
> >>
> >> block in the function) you have to execute an entirely new set of
CFI instructions
> >>
> >> except for the common CIE part. While indeed this is not as bad,
on average, the overall
> >>
> >> active memory footprint will increase.
> >>
> >>
> >>
> >> Creating one FDE per basic block means that .eh_frame_hdr, an
allocatable section,
> >>
> >> will be bloated too. This will increase the FDE lookup time. I
don’t see .eh_frame_hdr
> >>
> >> being mentioned in the proposal.
> >>
> >>
> >>
> >> Maksim
> >>
> >>
> >>
> >> On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny"
<kpszeniczny at google.com> wrote:
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> >>
> >> *Pessimization/overhead for stack unwinding used by system-wide
profilers and
> >> for exception handling*
> >>
> >> Larger CFI programs put an extra burden on unwinding at runtime as
more CFI
> >> (and thus native) instructions have to be executed. This will
cause more
> >> overhead for any profiler that records stack traces, and, as you
correctly note
> >> in the proposal, for any program that heavily uses exceptions.
> >>
> >>
> >>
> >> The number of CFI instructions that have to be executed when
unwinding any given stack stays the same. The CFI instructions for a function
have to be duplicated in every basic block section, but when performing
unwinding only one such a set is executed -- the copy for the current basic
block. However, this copy contains precisely the same CFI instructions as the
ones that would have to be executed if there were no basic block sections.
> >>
> >>
> >>
> >> --
> >>
> >> Krzysztof Pszeniczny
> >>
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Sriraman Tallam via llvm-dev

2019-Oct-07 18:17 UTC

head link

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

The next concern we want to address is about C++ Exceptions which was
considered a major omission:

TLDR;  This can be implemented in propeller easily. It was considered
low priority because exception handling code is assumed in general not
performance critical .

Currently, we do not create basic block sections for any block that is
involved in exception handling.   There is nothing fundamentally
difficult about exception handling basic blocks. The exception table
uses address offsets which currently does not use relocations, and we
must modify it to use relocations in order to enable sections for
exception handling basic blocks.

We will re-prioritize and send out a patch to handle exception basic blocks.

Thanks
Sri

On Mon, Oct 7, 2019 at 11:15 AM Sriraman Tallam <tmsriram at google.com>
wrote:>
> We would also like to clarify on the misconceptions around CFI
Instructions:
>
> There are two things that need to be clarified here:
>
> 1) Extra CFI FDE entries for basic blocks does not mean more dynamic
> instructions are executed. In fact, they do not increase at all.  Krys
> talked about this earlier.
> 2) We do deduplication of common static CFI instructions in the FDE
> and move it to the CIE .  Hence, moving to a new basic block does not
> mean a completely new set of CFI instructions is executed.
>
> On Wed, Oct 2, 2019 at 7:24 PM Sriraman Tallam <tmsriram at
google.com> wrote:
> >
> > Maks and team, thanks for the detailed feedback and we will address
all of your
> > concerns.  Let’s begin with CFI and DebugInfo first since this is
already
> > being discussed.
> >
> > TLDR; clang is pathological and the actual CFI bloat will go down from
7x to
> > 2x.
> >
> > Let us present the CFI Bloats for each benchmark with the default
option, which
> > is creating basic block sections only for functions with samples.  
For clang,
> > it is 7x and *not 17x* (the 17 x number is for all sections), and for
the
> > large benchmarks it is less than 30% or 1.3x. For large benchmarks,
Storage is
> > the highest, going from 18M to 23M, ~30%. Clang is almost pathological
here.
> > This is because 10% of all functions in clang have samples (touched by
the
> > profile information), and all of these functions get full basic block
sections.
> > Whereas, for the large benchmarks, only 0.5% of functions have
samples.  Now,
> > for clang, 10% of functions that have samples also happen to contain
35% of all
> > basic blocks. This means, we are creating sections for 35% of all
basic blocks
> > and the CFI bloats are clearly showing.
> >
> > Now, the data for clang also shows that only 7% of the basic blocks
have
> > samples. We are working on restricting basic block sections to only
those basic
> > blocks that have samples. The rest of the basic blocks (cold) in that
function
> > would share the same section.  With this, we would reduce the bloat of
CFI
> > from 7x to 2x. This is not hard to do and we will follow up with this
patch.
> >
> > Also for object size bloats with regards to eh_frame, the reasoning is
similar.
> >  Restricting the section creation to only basic blocks that have
profiles will
> > reduce this a lot more.
> >
> > Importantly,  if CFI support were available for discontiguous ranges
we wouldn't
> > have to duplicate CFI FDEs and the bloats would be near minimal.
> >
> > BOLT parses CFI and DWARF and generates compact information by
rewriting it.
> > Whereas, Propeller uses lld which uses relocations and sections to
fixup but
> > does not rewrite it.  This is by design and that lld is not DWARF and
CFI
> > aware. We designed basic block sections just like function sections. 
The
> > compiler produces a bunch of sections and relocations. The linker
patches the
> > relocations and the debug info and CFI are right, that's it.  For
CFI, since
> > there is no support for discontiguous ranges we have to duplicate and
dedup
> > FDEs only for blocks with sections. We are asking that CFI support
> > discontiguous ranges and this would look even simpler.  Alternately,
if lld
> > were made DWARF and CFI aware we could rewrite it compactly like BOLT.
> > These would help with object size bloats and binary size bloats.
> >
> >
> >
> >
> >
> > On Wed, Oct 2, 2019 at 1:59 PM James Y Knight via llvm-dev
> > <llvm-dev at lists.llvm.org> wrote:
> > >
> > > I'm a bit confused by this subthread -- doesn't BOLT have
the exact same CFI bloat issue? From my cursory reading of the propellor doc,
the CFI duplication is _necessary_ to represent discontiguous functions, not
anything particular to the way Propellor happens to generate those discontiguous
functions.
> > >
> > > And emitting discontiguous functions is a fundamental goal of
this, right?
> > >
> > > On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> > >>
> > >> Thanks for clarifying. This means once you move to the next
basic block (or any other basic
> > >>
> > >> block in the function) you have to execute an entirely new
set of CFI instructions
> > >>
> > >> except for the common CIE part. While indeed this is not as
bad, on average, the overall
> > >>
> > >> active memory footprint will increase.
> > >>
> > >>
> > >>
> > >> Creating one FDE per basic block means that .eh_frame_hdr, an
allocatable section,
> > >>
> > >> will be bloated too. This will increase the FDE lookup time.
I don’t see .eh_frame_hdr
> > >>
> > >> being mentioned in the proposal.
> > >>
> > >>
> > >>
> > >> Maksim
> > >>
> > >>
> > >>
> > >> On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny"
<kpszeniczny at google.com> wrote:
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> > >>
> > >> *Pessimization/overhead for stack unwinding used by
system-wide profilers and
> > >> for exception handling*
> > >>
> > >> Larger CFI programs put an extra burden on unwinding at
runtime as more CFI
> > >> (and thus native) instructions have to be executed. This will
cause more
> > >> overhead for any profiler that records stack traces, and, as
you correctly note
> > >> in the proposal, for any program that heavily uses
exceptions.
> > >>
> > >>
> > >>
> > >> The number of CFI instructions that have to be executed when
unwinding any given stack stays the same. The CFI instructions for a function
have to be duplicated in every basic block section, but when performing
unwinding only one such a set is executed -- the copy for the current basic
block. However, this copy contains precisely the same CFI instructions as the
ones that would have to be executed if there were no basic block sections.
> > >>
> > >>
> > >>
> > >> --
> > >>
> > >> Krzysztof Pszeniczny
> > >>
> > >> _______________________________________________
> > >> LLVM Developers mailing list
> > >> llvm-dev at lists.llvm.org
> > >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > >
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > llvm-dev at lists.llvm.org
> > > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Sriraman Tallam via llvm-dev

2019-Oct-08 20:19 UTC

head link

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Some more information about the relaxation pass whose effectiveness
and convergence guarantees were listed as a concern:

TLDR; Our relaxation pass is similar to what LLVM’s MCAssembler does
but with a caveat for efficiency.  Our experimental results show it is
efficient and convergence is guaranteed.

Our relaxation pass is very similar to what MCAssembler does as it
needs to solve the same problem, which is laying out basic blocks and
optimizing branch instruction sizes based on jump offset.  The optimal
algorithm to do this is quadratic and it involves computating of
section offsets multiple times. Our relaxation pass has a caveat for
efficiency. Our relaxation pass recomputes section offsets only after
all the sections have been processed. This is repeated  until
convergence, whereas, the MCAssembler re-computes section offsets
every time a section is modified. Our pass works well in the linker as
it minimizes the section offset recomputation, which is more expensive
to do in the linker (whole program) than doing it in the assembler
(per function).

We shrink jump instructions aggressively and then fix the wrongly
shrunk branches by running a pass that grows them back.  We guarantee
convergence because we always shrink in the first round and always
grow in the second.

Data shows that our relaxation is efficient. For example, for clang,
the aggressive shrinking affects ~995000 branches in 6 iterations and
only 4 branches had to be grown back.

On Mon, Oct 7, 2019 at 11:17 AM Sriraman Tallam <tmsriram at google.com>
wrote:>
> The next concern we want to address is about C++ Exceptions which was
> considered a major omission:
>
> TLDR;  This can be implemented in propeller easily. It was considered
> low priority because exception handling code is assumed in general not
> performance critical .
>
> Currently, we do not create basic block sections for any block that is
> involved in exception handling.   There is nothing fundamentally
> difficult about exception handling basic blocks. The exception table
> uses address offsets which currently does not use relocations, and we
> must modify it to use relocations in order to enable sections for
> exception handling basic blocks.
>
> We will re-prioritize and send out a patch to handle exception basic
blocks.
>
> Thanks
> Sri
>
> On Mon, Oct 7, 2019 at 11:15 AM Sriraman Tallam <tmsriram at
google.com> wrote:
> >
> > We would also like to clarify on the misconceptions around CFI
Instructions:
> >
> > There are two things that need to be clarified here:
> >
> > 1) Extra CFI FDE entries for basic blocks does not mean more dynamic
> > instructions are executed. In fact, they do not increase at all.  Krys
> > talked about this earlier.
> > 2) We do deduplication of common static CFI instructions in the FDE
> > and move it to the CIE .  Hence, moving to a new basic block does not
> > mean a completely new set of CFI instructions is executed.
> >
> > On Wed, Oct 2, 2019 at 7:24 PM Sriraman Tallam <tmsriram at
google.com> wrote:
> > >
> > > Maks and team, thanks for the detailed feedback and we will
address all of your
> > > concerns.  Let’s begin with CFI and DebugInfo first since this is
already
> > > being discussed.
> > >
> > > TLDR; clang is pathological and the actual CFI bloat will go down
from 7x to
> > > 2x.
> > >
> > > Let us present the CFI Bloats for each benchmark with the default
option, which
> > > is creating basic block sections only for functions with samples.
For clang,
> > > it is 7x and *not 17x* (the 17 x number is for all sections), and
for the
> > > large benchmarks it is less than 30% or 1.3x. For large
benchmarks, Storage is
> > > the highest, going from 18M to 23M, ~30%. Clang is almost
pathological here.
> > > This is because 10% of all functions in clang have samples
(touched by the
> > > profile information), and all of these functions get full basic
block sections.
> > > Whereas, for the large benchmarks, only 0.5% of functions have
samples.  Now,
> > > for clang, 10% of functions that have samples also happen to
contain 35% of all
> > > basic blocks. This means, we are creating sections for 35% of all
basic blocks
> > > and the CFI bloats are clearly showing.
> > >
> > > Now, the data for clang also shows that only 7% of the basic
blocks have
> > > samples. We are working on restricting basic block sections to
only those basic
> > > blocks that have samples. The rest of the basic blocks (cold) in
that function
> > > would share the same section.  With this, we would reduce the
bloat of CFI
> > > from 7x to 2x. This is not hard to do and we will follow up with
this patch.
> > >
> > > Also for object size bloats with regards to eh_frame, the
reasoning is similar.
> > >  Restricting the section creation to only basic blocks that have
profiles will
> > > reduce this a lot more.
> > >
> > > Importantly,  if CFI support were available for discontiguous
ranges we wouldn't
> > > have to duplicate CFI FDEs and the bloats would be near minimal.
> > >
> > > BOLT parses CFI and DWARF and generates compact information by
rewriting it.
> > > Whereas, Propeller uses lld which uses relocations and sections
to fixup but
> > > does not rewrite it.  This is by design and that lld is not DWARF
and CFI
> > > aware. We designed basic block sections just like function
sections.  The
> > > compiler produces a bunch of sections and relocations. The linker
patches the
> > > relocations and the debug info and CFI are right, that's it. 
For CFI, since
> > > there is no support for discontiguous ranges we have to duplicate
and dedup
> > > FDEs only for blocks with sections. We are asking that CFI
support
> > > discontiguous ranges and this would look even simpler. 
Alternately, if lld
> > > were made DWARF and CFI aware we could rewrite it compactly like
BOLT.
> > > These would help with object size bloats and binary size bloats.
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Oct 2, 2019 at 1:59 PM James Y Knight via llvm-dev
> > > <llvm-dev at lists.llvm.org> wrote:
> > > >
> > > > I'm a bit confused by this subthread -- doesn't BOLT
have the exact same CFI bloat issue? From my cursory reading of the propellor
doc, the CFI duplication is _necessary_ to represent discontiguous functions,
not anything particular to the way Propellor happens to generate those
discontiguous functions.
> > > >
> > > > And emitting discontiguous functions is a fundamental goal
of this, right?
> > > >
> > > > On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> > > >>
> > > >> Thanks for clarifying. This means once you move to the
next basic block (or any other basic
> > > >>
> > > >> block in the function) you have to execute an entirely
new set of CFI instructions
> > > >>
> > > >> except for the common CIE part. While indeed this is not
as bad, on average, the overall
> > > >>
> > > >> active memory footprint will increase.
> > > >>
> > > >>
> > > >>
> > > >> Creating one FDE per basic block means that
.eh_frame_hdr, an allocatable section,
> > > >>
> > > >> will be bloated too. This will increase the FDE lookup
time. I don’t see .eh_frame_hdr
> > > >>
> > > >> being mentioned in the proposal.
> > > >>
> > > >>
> > > >>
> > > >> Maksim
> > > >>
> > > >>
> > > >>
> > > >> On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny"
<kpszeniczny at google.com> wrote:
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via
llvm-dev <llvm-dev at lists.llvm.org> wrote:
> > > >>
> > > >> *Pessimization/overhead for stack unwinding used by
system-wide profilers and
> > > >> for exception handling*
> > > >>
> > > >> Larger CFI programs put an extra burden on unwinding at
runtime as more CFI
> > > >> (and thus native) instructions have to be executed. This
will cause more
> > > >> overhead for any profiler that records stack traces,
and, as you correctly note
> > > >> in the proposal, for any program that heavily uses
exceptions.
> > > >>
> > > >>
> > > >>
> > > >> The number of CFI instructions that have to be executed
when unwinding any given stack stays the same. The CFI instructions for a
function have to be duplicated in every basic block section, but when performing
unwinding only one such a set is executed -- the copy for the current basic
block. However, this copy contains precisely the same CFI instructions as the
ones that would have to be executed if there were no basic block sections.
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >>
> > > >> Krzysztof Pszeniczny
> > > >>
> > > >> _______________________________________________
> > > >> LLVM Developers mailing list
> > > >> llvm-dev at lists.llvm.org
> > > >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > > >
> > > > _______________________________________________
> > > > LLVM Developers mailing list
> > > > llvm-dev at lists.llvm.org
> > > > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - Oct 2019 - [RFC] Propeller: A frame work for Post Link Optimizations

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Apparently Analagous Threads