River Riddle via llvm-dev
2017-Jul-22 23:05 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Hi Andrey,

Questions and feedback are very much welcome.

- The explanation as to why the improvements can vary between the IR and MIR outliner mainly boils down to the level of abstraction that each works at. The MIR level has very accurate heuristics and is effectively the last post-ISel target-independent codegen pass. The IR outliner, on the other hand, has more estimation in its cost model, and can affect the decisions of function simplification passes, instruction selection, RA, etc. Taking this into account can lead to different results. It's important to remember the differences that being at a higher abstraction level can cause.
- As for the SPEC (it is 2006, sorry I missed that) command-line options, as well as for any other benchmark, I only added "-Oz -mno-red-zone (to keep comparisons fair) -(whatever enables each transform)" to the default flags needed for compilation. I'll try to get the exact command-line options used and add them.
- Debug information (line numbers) is currently only kept if it is the same across all occurrences. This was simply a design choice and can be changed if keeping one instance is the desired behavior.
- The behavior described with the profile data is already implemented; I will update the numbers to include the results after including PGO data.
- The LTO results are very experimental given that there isn't a size pipeline for LTO yet (there should be). The % improvements can be similar to non-LTO, but because the LTO binary is generally much smaller, the actual decrease in size is also much smaller. I'll add more detailed LTO numbers as well.

Thanks,
River Riddle

On Sat, Jul 22, 2017 at 3:23 PM, Andrey Bokhanko <andreybokhanko at gmail.com> wrote:
> Hi River,
>
> Very impressive! -- thanks for working on this.
>
> A few questions, if you don't mind.
>
> First, on results (from goo.gl/5k6wsP). Some of them are quite surprising. In theory, "top improvements" should be quite similar in all three approaches ("Early&Late Outlining", "Late Outlining" and "Machine Outliner"), with E&LO capturing most of the cases. Yet, they are very different:
>
> Test Suite, top improvements:
>
> E&LO:
> - enc-3des: 66.31%
> - StatementReordering-dbl: 51.45%
> - Symbolics-dbl: 51.42%
> - Recurrences-dbl: 51.38%
> - Packing-dbl: 51.33%
>
> LO:
> - enc-3des: 50.7%
> - ecbdes: 46.27%
> - security-rijndael: 45.13%
> - ControlFlow-flt: 25.79%
> - ControlFlow-dbl: 25.74%
>
> MO:
> - ecbdes: 28.22%
> - Expansion-flt: 22.56%
> - Recurrences-flt: 22.19%
> - StatementReordering-flt: 22.15%
> - Searching-flt: 21.96%
>
> SPEC, top improvements:
>
> E&LO:
> - bzip2: 9.15%
> - gcc: 4.03%
> - sphinx3: 3.8%
> - h264ref: 3.24%
> - perlbench: 3%
>
> LO:
> - bzip2: 7.27%
> - sphinx3: 3.65%
> - namd: 3.08%
> - gcc: 3.06%
> - h264ref: 3.05%
>
> MO:
> - namd: 7.8%
> - bzip2: 7.27%
> - libquantum: 2.99%
> - h264ref: 2%
>
> Do you understand why so?
>
> I'm especially interested in cases where MO managed to find redundancies while E&LO didn't. For example, 2.99% on libquantum (or is it simply below the "top 5 results" for E&LO?) -- did you investigate this?
>
> Also, it would be nice to specify the full options list for SPEC (I assume SPEC CPU2006?), similar to how results are reported on spec.org.
>
> And a few questions on the RFC:
>
> On Fri, Jul 21, 2017 at 12:47 AM, River Riddle via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> * Debug Info:
>> Debug information is preserved for the calls to functions which have been outlined, but all debug info from the original outlined portions is removed, making them harder to debug.
>
> Just to check I understand it correctly: you remove *all* debug info in outlined functions, essentially making them undebuggable -- correct? Did you consider copying debug info from one of the outlined fragments instead? -- at least line numbers?
>
>> The execution time results are to be expected given that the outliner, without profile data, will extract from whatever region it deems profitable. Extracting from the hot path can lead to a noticeable performance regression on any platform, which can be somewhat avoided by providing profile data during outlining.
>
> Some of the regressions are quite severe. It would be interesting to implement what you stated above and measure -- both code size reductions and performance degradations -- again.
>
>> * LTO:
>> - LTO doesn’t have a code size pipeline, but % reductions over LTO are comparable to non-LTO.
>
> LTO is known to affect code size significantly (for example, by removing redundant functions), so I'm frankly quite surprised that the results are the same...
>
> Yours,
> Andrey
> --
> Compiler Architect
> NXP
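As background for readers skimming the archive, here is a minimal before/after sketch of the kind of rewrite an IR-level interprocedural outliner performs. This is illustrative, hand-written LLVM IR for this thread, not output of the proposed pass; in a toy this small the instruction counts roughly break even, and the transform only pays off when the shared sequence is longer or repeated more often.

  ; Before: @f and @g both contain the same add/sub/mul sequence.
  define i32 @f(i32 %a, i32 %b) {
    %s = add i32 %a, %b
    %d = sub i32 %a, %b
    %m = mul i32 %s, %d
    %r = add i32 %m, 1
    ret i32 %r
  }

  define i32 @g(i32 %c, i32 %e) {
    %s = add i32 %c, %e
    %d = sub i32 %c, %e
    %m = mul i32 %s, %d
    %r = add i32 %m, 7
    ret i32 %r
  }

  ; After (conceptually): the repeated sequence lives in one internal helper
  ; and each original occurrence becomes a call to it.
  define internal i32 @outlined(i32 %x, i32 %y) {
    %s = add i32 %x, %y
    %d = sub i32 %x, %y
    %m = mul i32 %s, %d
    ret i32 %m
  }

  define i32 @f.outlined(i32 %a, i32 %b) {
    %m = call i32 @outlined(i32 %a, i32 %b)
    %r = add i32 %m, 1
    ret i32 %r
  }

  define i32 @g.outlined(i32 %c, i32 %e) {
    %m = call i32 @outlined(i32 %c, i32 %e)
    %r = add i32 %m, 7
    ret i32 %r
  }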
via llvm-dev
2017-Jul-23 07:43 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
River,

Thanks for the reply!

> On 23 July 2017, at 1:05, River Riddle <riddleriver at gmail.com> wrote:
>
> - The explanation as to why the improvements can vary between the IR and MIR outliner mainly boils down to the level of abstraction that each works at. The MIR level has very accurate heuristics and is effectively the last post-ISel target-independent codegen pass. The IR outliner, on the other hand, has more estimation in its cost model, and can affect the decisions of function simplification passes, instruction selection, RA, etc. Taking this into account can lead to different results.

To clarify, I'm surprised not by the % differences (those are understandable), but by the differences in which benchmarks got improved. It seems odd that MO, working at a lower abstraction level, managed to find redundancies (say, in libquantum) that E&LO missed. But indeed -- perhaps the E&LO cost model just considered these cases to be non-profitable. It would be interesting to know precisely.

Yours,
Andrey
via llvm-dev
2017-Jul-23 18:32 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
River,

...and one more thing that just occurred to me.

Will Outlining put the following two expressions:

  int x = (int)x2 + 1;
  int *p = (int *)p2 + 1;

into the same class of congruency? -- if sizeof(int) == sizeof(int *) on the target hardware?

Even more interesting and relevant are pointers of different types.

I believe MO has no problems capturing both of these.

Yours,
Andrey

Sent from iPad
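An illustrative aside on why the IR level distinguishes these two expressions. Assuming standard Clang lowering and the typed-pointer IR of the time (this is hand-written IR, not output of either outliner), the integer expression becomes an add while the pointer expression becomes a getelementptr, so a congruence relation keyed on opcodes and types puts them in different classes, even though both may select to the same add-immediate at the machine level.

  ; int x = (int)x2 + 1;  -- a plain integer add
  define i32 @int_case(i32 %x2) {
    %x = add nsw i32 %x2, 1
    ret i32 %x
  }

  ; int *p = (int *)p2 + 1;  -- pointer arithmetic is a GEP; the scaling by
  ; sizeof(int) only materializes once this is lowered to machine code
  define i32* @ptr_case(i32* %p2) {
    %p = getelementptr inbounds i32, i32* %p2, i64 1
    ret i32* %p
  }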
River Riddle via llvm-dev
2017-Jul-23 21:57 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
Andrey,

Currently it will not catch this case, because congruence is determined by types/opcodes/etc., but cases like this can be caught by redefining what it means to be congruent. We can redefine congruency for add/sub/icmp/gep/etc. to no longer care about types, or even opcodes, in certain cases, but this may create the need for extra control flow. As for your example, the machine outliner can handle the case where the addition amount and register end up the same after lowering:

  int x = (int)x2 + 4;
  int *p = (int *)p2 + 1;

If we do relax the congruency restrictions, being at the IR level allows us to identify that we could outline more than just the simple case. Assuming sizeof(int) == sizeof(int *) == sizeof(long long *) == sizeof(char *):

  int x = (int)x2 + 1;
  int *p = (int *)p2 + (int)ipidx;
  long long *lp = (long long *)lp2 + 3;
  char *cp = (char *)cp2 + (int)cpidx;

-- could outline to --

  int x = outlinedFn(1, (int)x2, 1);
  int *p = (int *)outlinedFn(4, (int)p2, ipidx);
  long long *lp = (long long *)outlinedFn(8, (int)lp2, 3);
  char *cp = (char *)outlinedFn(1, (int)cp2, cpidx);

  int outlinedFn(int SizePatchup, int Var, int Amount) {
    return Var + (SizePatchup * Amount);
  }

In the above we need some extra patch-up to account for the different sizes of the pointee types. This is just one opportunity that can be caught when we start to redefine equivalence, something that is really powerful at the IR level. We wanted the initial submission to have a compact and efficient implementation. Extensions like these can easily be added, but they are not part of the initial design.

Thanks,
River Riddle

On Sun, Jul 23, 2017 at 11:32 AM, <andreybokhanko at gmail.com> wrote:
> Will Outlining put the following two expressions:
>
>   int x = (int)x2 + 1;
>   int *p = (int *)p2 + 1;
>
> into the same class of congruency? -- if sizeof(int) == sizeof(int *) on the target hardware?
>
> Even more interesting and relevant are pointers of different types.
>
> I believe MO has no problems capturing both of these.
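To make the relaxed-congruence idea concrete at the IR level, here is a rough hand-written sketch (assuming 64-bit pointers and the typed-pointer IR of the time; not the pass's actual output), focusing on the pointer cases above. The differently-typed GEPs can all be funneled through one byte-based helper that takes the element size as a parameter, which is essentially what the SizePatchup argument models.

  ; One outlined body: advance a pointer by (element size * amount) bytes.
  define internal i8* @outlined_advance(i8* %base, i64 %size, i64 %amount) {
    %off = mul i64 %size, %amount
    %p = getelementptr i8, i8* %base, i64 %off
    ret i8* %p
  }

  ; int *p = (int *)p2 + ipidx;   (element size 4)
  define i32* @int_ptr_case(i32* %p2, i64 %ipidx) {
    %b = bitcast i32* %p2 to i8*
    %r = call i8* @outlined_advance(i8* %b, i64 4, i64 %ipidx)
    %p = bitcast i8* %r to i32*
    ret i32* %p
  }

  ; long long *lp = (long long *)lp2 + 3;   (element size 8)
  define i64* @long_ptr_case(i64* %lp2) {
    %b = bitcast i64* %lp2 to i8*
    %r = call i8* @outlined_advance(i8* %b, i64 8, i64 3)
    %p = bitcast i8* %r to i64*
    ret i64* %p
  }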
Sean Silva via llvm-dev
2017-Jul-25 04:54 UTC
[llvm-dev] [RFC] Add IR level interprocedural outliner for code size.
On Sun, Jul 23, 2017 at 12:43 AM, via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> River,
>
> Thanks for the reply!
>
> To clarify, I'm surprised not by the % differences (those are understandable), but by the differences in which benchmarks got improved. It seems odd that MO, working at a lower abstraction level, managed to find redundancies (say, in libquantum) that E&LO missed.

When I looked at this, most of the code size saving from the machine outliner comes from outlining 2-3 instruction sequences (often the same sequence is outlined for each of many permutations of register assignments). Think of sequences like "TEST; SETCC" (and many different register assignments thereof).

-- Sean Silva

> But indeed -- perhaps the E&LO cost model just considered these cases to be non-profitable. It would be interesting to know precisely.
>
> Yours,
> Andrey
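For readers without the lowered code in front of them, a rough illustration of the kind of tiny pattern Sean describes, shown as hand-written IR (with the approximate x86-64 lowering in comments as an assumption, not measured output). At the IR level these two functions differ in operand type and are so short that outlining them is unlikely to look profitable, while after lowering they are near-identical TEST/SETcc sequences, which is the sort of 2-3 instruction pattern the MachineOutliner can harvest when it repeats often enough.

  ; bool p(int a)  { return a != 0; }   -> roughly: test edi, edi; setne al; ret
  define zeroext i1 @p(i32 %a) {
    %c = icmp ne i32 %a, 0
    ret i1 %c
  }

  ; bool q(long b) { return b != 0; }   -> roughly: test rdi, rdi; setne al; ret
  define zeroext i1 @q(i64 %b) {
    %c = icmp ne i64 %b, 0
    ret i1 %c
  }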