Dehao Chen via llvm-dev
2016-Oct-27 20:11 UTC
[llvm-dev] (RFC) Encoding code duplication factor in discriminator
The impact to debug_line is actually not small. I have only implemented part 1 (encoding the duplication factor) for loop unrolling and loop vectorization. The debug_line size overhead for the "-O2 -g1" binaries of the SPEC CPU C/C++ benchmarks:

433.milc 23.59%
444.namd 6.25%
447.dealII 8.43%
450.soplex 2.41%
453.povray 5.40%
470.lbm 0.00%
482.sphinx3 7.10%
400.perlbench 2.77%
401.bzip2 9.62%
403.gcc 2.67%
429.mcf 9.54%
445.gobmk 7.40%
456.hmmer 9.79%
458.sjeng 9.98%
462.libquantum 10.90%
464.h264ref 30.21%
471.omnetpp 0.52%
473.astar 5.67%
483.xalancbmk 1.46%
mean 7.86%

Dehao

On Thu, Oct 27, 2016 at 11:55 AM, Xinliang David Li <davidxl at google.com> wrote:
> Do you have an estimate of the debug_line size increase? I guess it will
> be small.
>
> David
>
> On Thu, Oct 27, 2016 at 11:39 AM, Dehao Chen <dehao at google.com> wrote:
>
>> Motivation:
>> Many optimizations duplicate code. E.g. the loop unroller duplicates the
>> loop body, GVN duplicates computation, etc. The duplicated code shares
>> the same debug info with the original code. For SamplePGO, the debug
>> info is used to represent the profile, so code duplication affects
>> profile accuracy. Take loop unrolling as an example:
>>
>> #1 foo();
>> #2 for (i = 0; i < N; i++) {
>> #3   bar();
>> #4 }
>>
>> If N is 8 at runtime, a reasonable profile will look like:
>>
>> #1: 10
>> #3: 80
>>
>> But if the compiler unrolls the loop by a factor of 4, the callsite to
>> bar() is duplicated 4 times and the profile will look like:
>>
>> #1: 10
>> #3: 20
>>
>> The sample count for #3 is 20 because each of the 4 callsites to bar()
>> is sampled 20 times, and they share the same debug loc (#3), so 20 is
>> attributed to #3. (If one debugloc is mapped to by multiple
>> instructions, the max sample count of those instructions is used as the
>> debugloc's sample count.)
>>
>> When this profile is loaded into the compiler, it will think the loop
>> trip count is 2 instead of 8.
>>
>> Proposal:
>> When the compiler duplicates code, it encodes the duplication info in
>> the debug info. As the duplication is not interesting to the debugger, I
>> propose to encode this as part of the discriminator.
>>
>> There are 2 types of code duplication:
>>
>> 1. Duplicated code that is guaranteed to have the same execution count
>> (e.g. loop unroll and loop vectorize). We record the duplication factor;
>> for the above example, "4" is recorded in the discriminator.
>> 2. Duplicated code that may have different execution counts (e.g. loop
>> peeling and GVN). For the same debugloc, a unique number is assigned to
>> each copy and encoded in the discriminator.
>>
>> Assume the discriminator is a uint32. The traditional discriminator is
>> less than 256, so let's take 8 bits for it. For the duplication factor
>> (type 1 duplication), we assume the maximum unroll_factor *
>> vectorize_factor is less than 256, thus 8 bits for it. For the unique
>> number (type 2 duplication), we assume code is duplicated at most 32
>> times, thus 5 bits for it. Overall, we still have 11 free bits left in
>> the discriminator encoding.
>>
>> Taking the original source as an example, after loop unrolling and
>> peeling the code may look like:
>>
>> for (i = 0; i < (N & ~3); i += 4) {
>>   foo(); // discriminator: 0x40
>>   foo(); // discriminator: 0x40
>>   foo(); // discriminator: 0x40
>>   foo(); // discriminator: 0x40
>> }
>> if (i++ < N) {
>>   foo(); // discriminator: 0x100
>>   if (i++ < N) {
>>     foo(); // discriminator: 0x200
>>     if (i++ < N) {
>>       foo(); // discriminator: 0x300
>>     }
>>   }
>> }
>>
>> The cost of this change would be increased debug_line size because: 1.
>> we are recording more discriminators, and 2. discriminators become
>> larger and take more bytes in ULEB128 encoding.
>>
>> The benefit is that the SamplePGO profile can accurately represent the
>> code execution frequency, and we do not need to introduce new building
>> blocks into the debug info.
>>
>> Comments?
>>
>> Thanks,
>> Dehao

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/75e83e9c/attachment.html>
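[Editor's note: a minimal sketch of one way the proposed fields could be packed into a 32-bit discriminator. The field widths (8/8/5) are from the RFC above, but the exact bit positions and the helper names are my assumptions; the example constants in the RFC (0x40, 0x100) suggest the author may intend a different layout.]

```cpp
#include <cassert>
#include <cstdint>

// One plausible packing of the three fields described in the RFC:
//   bits 0-7  : traditional discriminator (assumed < 256)
//   bits 8-15 : duplication factor, type 1 duplication (assumed < 256)
//   bits 16-20: per-copy unique number, type 2 duplication (assumed < 32)
// Bits 21-31 stay free, matching the "11 free bits" noted in the RFC.
uint32_t packDiscriminator(uint32_t base, uint32_t dupFactor, uint32_t copyId) {
  assert(base < 256 && dupFactor < 256 && copyId < 32);
  return base | (dupFactor << 8) | (copyId << 16);
}

uint32_t baseDiscriminator(uint32_t d) { return d & 0xFF; }
uint32_t duplicationFactor(uint32_t d) { return (d >> 8) & 0xFF; }
uint32_t copyIdentifier(uint32_t d) { return (d >> 16) & 0x1F; }
```

With this layout, a 4x unrolled loop body would carry `packDiscriminator(0, 4, 0)`, and the profile reader would multiply the observed count by `duplicationFactor(d)` to recover the original frequency.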
Xinliang David Li via llvm-dev
2016-Oct-27 20:18 UTC
[llvm-dev] (RFC) Encoding code duplication factor in discriminator
The large percentages are from those tiny benchmarks. If you look at omnetpp (0.52%) and xalancbmk (1.46%), the increase is small. To get a better average increase, you can sum up the total debug_line size before and after and compute the percentage accordingly.

David

On Thu, Oct 27, 2016 at 1:11 PM, Dehao Chen <dehao at google.com> wrote:
> The impact to debug_line is actually not small. I have only implemented
> part 1 (encoding the duplication factor) for loop unrolling and loop
> vectorization.
>
> [benchmark table and earlier quoted text trimmed]
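[Editor's note: the size cost discussed above comes partly from ULEB128 growth. A short sketch (hypothetical helper, not LLVM code) makes the effect concrete: any discriminator below 128 fits in one ULEB128 byte, while a value carrying a duplication factor in higher bits spills into a second or third byte.]

```cpp
#include <cstdint>

// Number of bytes a value occupies in ULEB128 encoding:
// each byte carries 7 payload bits plus a continuation bit.
unsigned ulebSize(uint64_t v) {
  unsigned n = 0;
  do {
    v >>= 7;  // consume 7 payload bits per emitted byte
    ++n;
  } while (v != 0);
  return n;
}
```

For example, a traditional discriminator of 0x40 costs one byte, a value like 0x400 (a factor stored above bit 7) costs two, and anything using bits 16 and up costs at least three, so every duplicated line can add a byte or two to debug_line.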
Dehao Chen via llvm-dev
2016-Oct-27 20:22 UTC
[llvm-dev] (RFC) Encoding code duplication factor in discriminator
If we use the sum instead of the geomean, the diff is 5.59%. Yes, most of the size growth comes from small applications, with one exception: 464.h264ref, which has a significant number of loops to unroll/vectorize.

Dehao

On Thu, Oct 27, 2016 at 1:18 PM, Xinliang David Li <davidxl at google.com> wrote:
> The large percentages are from those tiny benchmarks. If you look at
> omnetpp (0.52%) and xalancbmk (1.46%), the increase is small. To get a
> better average increase, you can sum up the total debug_line size before
> and after and compute the percentage accordingly.
>
> [earlier quoted text trimmed]
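[Editor's note: a sketch of how a profile reader could use the RFC's rules to recover per-line counts. `lineCount` is a hypothetical helper, not an existing LLVM API; the max-over-instructions rule and the multiply-by-duplication-factor step are both taken from the RFC text.]

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// For one source line, samples land on several duplicated instructions
// that share the same debugloc. Per the RFC, the line's count is the max
// over those instructions; multiplying by the recorded duplication
// factor then recovers the pre-duplication execution count.
uint64_t lineCount(const std::vector<uint64_t> &samples, uint64_t dupFactor) {
  uint64_t maxSample = 0;
  for (uint64_t s : samples)
    maxSample = std::max(maxSample, s);
  return maxSample * dupFactor;
}
```

In the RFC's example, the 4 unrolled copies of bar() each collect about 20 samples; with factor 4 recorded, the reader attributes 80 to line #3 and correctly infers a trip count of 8 rather than 2.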