thr3ads.net - llvm dev - [llvm-dev] (RFC) Encoding code duplication factor in discriminator [Oct 2016]

If this information is useful, please help other people find it:
Share via:

Dehao Chen via llvm-dev

2016-Oct-27 18:39 UTC

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Motivation:
Many optimizations duplicate code. E.g. loop unroller duplicates the loop
body, GVN duplicates computation, etc. The duplicated code will share the
same debug info with the original code. For SamplePGO, the debug info is
used to present the profile. Code duplication will affect profile accuracy.
Taking loop unrolling for example:

#1 foo();
#2 for (i = 0; i < N; i++) {
#3   bar();
#4 }

If N is 8 during runtime, a reasonable profile will look like:

#1: 10
#3: 80

But if the compiler unrolls the loop by a factor of 4, the callsite to
bar() is duplicated 4 times and the profile will look like:

#1: 10
#3: 20

The sample count for #3 is 20 because all 4 callsites to bar() are sampled
20 times each, and they shared the same debug loc (#3) so that 20 will be
attributed to #3 (If one debugloc is mapped by multiple instructions, the
max sample count of these instructions is used as debugloc's sample count).

When loading this profile into compiler, it will think the loop trip count
is 2 instead of 8.

Proposal:
When compiler duplicates code, it encodes the duplication info in the debug
info. As the duplication is not interesting to debugger, I propose to
encode this as part of the discriminator.

There are 2 types of code duplication:

1. duplicated code are guaranteed to have the same execution count (e.g.
loop unroll and loop vectorize). We can record the duplication factor, for
the above example "4" is recorded in the discriminator.
2. duplicated code that may have different execution count (e.g. loop peel
and gvn). For a same debugloc, a unique number is assigned to each copy and
encoded in the discriminator.

Assume that the discriminator is uint32. The traditional discriminator is
less than 256, let's take 8 bit for it. For duplication factor (type 1
duplication), we assume the maximum unroll_factor * vectorize_factor is
less than 256, thus 8 bit for it. For unique number(type 2 duplication), we
assume code is at most duplicated 32 times, thus 5 bit for it. Overall, we
still have 11 free bits left in the discriminator encoding.

Let's take the original source as an example, after loop unrolling and
peeling, the code may looks like:

for (i = 0; i < N & 3; i+= 4) {
  foo();  // discriminator: 0x40
  foo();  // discriminator: 0x40
  foo();  // discriminator: 0x40
  foo();  // discriminator: 0x40
}
if (i++ < N) {
  foo();   // discriminator: 0x100
  if (i++ < N) {
    foo(); // discriminator: 0x200
    if (i++ < N) {
      foo();  // discriminator: 0x300
    }
  }
}

The cost of this change would be increased debug_line size because: 1. we
are recording more discriminators 2. discriminators become larger and will
take more ULEB128 encoding.

The benefit is that the sample pgo profile can accurately represent the
code execution frequency. And we do not need to introduce new building
blocks to debug info.

Comments?

Thanks,
Dehao
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/a126a922/attachment.html>

David Blaikie via llvm-dev

2016-Oct-27 18:44 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Is there prior art for this sort of thing (in GCC, for example) - I take it
this isn't the first time this has come up as a problem for profile
accuracy? (so it'd be helpful to know prior solutions to this (& if
we're
not doing whatever was done before, what it is about our situation that's
different, etc), or why it hasn't been a problem, etc)

On Thu, Oct 27, 2016 at 11:39 AM Dehao Chen <dehao at google.com> wrote:
> Motivation:
> Many optimizations duplicate code. E.g. loop unroller duplicates the loop
> body, GVN duplicates computation, etc. The duplicated code will share the
> same debug info with the original code. For SamplePGO, the debug info is
> used to present the profile. Code duplication will affect profile accuracy.
> Taking loop unrolling for example:
>
> #1 foo();
> #2 for (i = 0; i < N; i++) {
> #3   bar();
> #4 }
>
> If N is 8 during runtime, a reasonable profile will look like:
>
> #1: 10
> #3: 80
>
> But if the compiler unrolls the loop by a factor of 4, the callsite to
> bar() is duplicated 4 times and the profile will look like:
>
> #1: 10
> #3: 20
>
> The sample count for #3 is 20 because all 4 callsites to bar() are sampled
> 20 times each, and they shared the same debug loc (#3) so that 20 will be
> attributed to #3 (If one debugloc is mapped by multiple instructions, the
> max sample count of these instructions is used as debugloc's sample
count).
>
> When loading this profile into compiler, it will think the loop trip count
> is 2 instead of 8.
>
> Proposal:
> When compiler duplicates code, it encodes the duplication info in the
> debug info. As the duplication is not interesting to debugger, I propose to
> encode this as part of the discriminator.
>
> There are 2 types of code duplication:
>
> 1. duplicated code are guaranteed to have the same execution count (e.g.
> loop unroll and loop vectorize). We can record the duplication factor, for
> the above example "4" is recorded in the discriminator.
> 2. duplicated code that may have different execution count (e.g. loop peel
> and gvn). For a same debugloc, a unique number is assigned to each copy and
> encoded in the discriminator.
>
> Assume that the discriminator is uint32. The traditional discriminator is
> less than 256, let's take 8 bit for it. For duplication factor (type 1
> duplication), we assume the maximum unroll_factor * vectorize_factor is
> less than 256, thus 8 bit for it. For unique number(type 2 duplication), we
> assume code is at most duplicated 32 times, thus 5 bit for it. Overall, we
> still have 11 free bits left in the discriminator encoding.
>
> Let's take the original source as an example, after loop unrolling and
> peeling, the code may looks like:
>
> for (i = 0; i < N & 3; i+= 4) {
>   foo();  // discriminator: 0x40
>   foo();  // discriminator: 0x40
>   foo();  // discriminator: 0x40
>   foo();  // discriminator: 0x40
> }
> if (i++ < N) {
>   foo();   // discriminator: 0x100
>   if (i++ < N) {
>     foo(); // discriminator: 0x200
>     if (i++ < N) {
>       foo();  // discriminator: 0x300
>     }
>   }
> }
>
> The cost of this change would be increased debug_line size because: 1. we
> are recording more discriminators 2. discriminators become larger and will
> take more ULEB128 encoding.
>
> The benefit is that the sample pgo profile can accurately represent the
> code execution frequency. And we do not need to introduce new building
> blocks to debug info.
>
> Comments?
>
> Thanks,
> Dehao
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/558f732e/attachment.html>

Xinliang David Li via llvm-dev

2016-Oct-27 18:55 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Do you have an estimate of the debug_line size increase? I guess it will be
small.

David

On Thu, Oct 27, 2016 at 11:39 AM, Dehao Chen <dehao at google.com> wrote:
> Motivation:
> Many optimizations duplicate code. E.g. loop unroller duplicates the loop
> body, GVN duplicates computation, etc. The duplicated code will share the
> same debug info with the original code. For SamplePGO, the debug info is
> used to present the profile. Code duplication will affect profile accuracy.
> Taking loop unrolling for example:
>
> #1 foo();
> #2 for (i = 0; i < N; i++) {
> #3   bar();
> #4 }
>
> If N is 8 during runtime, a reasonable profile will look like:
>
> #1: 10
> #3: 80
>
> But if the compiler unrolls the loop by a factor of 4, the callsite to
> bar() is duplicated 4 times and the profile will look like:
>
> #1: 10
> #3: 20
>
> The sample count for #3 is 20 because all 4 callsites to bar() are sampled
> 20 times each, and they shared the same debug loc (#3) so that 20 will be
> attributed to #3 (If one debugloc is mapped by multiple instructions, the
> max sample count of these instructions is used as debugloc's sample
count).
>
> When loading this profile into compiler, it will think the loop trip count
> is 2 instead of 8.
>
> Proposal:
> When compiler duplicates code, it encodes the duplication info in the
> debug info. As the duplication is not interesting to debugger, I propose to
> encode this as part of the discriminator.
>
> There are 2 types of code duplication:
>
> 1. duplicated code are guaranteed to have the same execution count (e.g.
> loop unroll and loop vectorize). We can record the duplication factor, for
> the above example "4" is recorded in the discriminator.
> 2. duplicated code that may have different execution count (e.g. loop peel
> and gvn). For a same debugloc, a unique number is assigned to each copy and
> encoded in the discriminator.
>
> Assume that the discriminator is uint32. The traditional discriminator is
> less than 256, let's take 8 bit for it. For duplication factor (type 1
> duplication), we assume the maximum unroll_factor * vectorize_factor is
> less than 256, thus 8 bit for it. For unique number(type 2 duplication), we
> assume code is at most duplicated 32 times, thus 5 bit for it. Overall, we
> still have 11 free bits left in the discriminator encoding.
>
> Let's take the original source as an example, after loop unrolling and
> peeling, the code may looks like:
>
> for (i = 0; i < N & 3; i+= 4) {
>   foo();  // discriminator: 0x40
>   foo();  // discriminator: 0x40
>   foo();  // discriminator: 0x40
>   foo();  // discriminator: 0x40
> }
> if (i++ < N) {
>   foo();   // discriminator: 0x100
>   if (i++ < N) {
>     foo(); // discriminator: 0x200
>     if (i++ < N) {
>       foo();  // discriminator: 0x300
>     }
>   }
> }
>
> The cost of this change would be increased debug_line size because: 1. we
> are recording more discriminators 2. discriminators become larger and will
> take more ULEB128 encoding.
>
> The benefit is that the sample pgo profile can accurately represent the
> code execution frequency. And we do not need to introduce new building
> blocks to debug info.
>
> Comments?
>
> Thanks,
> Dehao
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/8dac13eb/attachment.html>

Dehao Chen via llvm-dev

2016-Oct-27 20:09 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

You are right, this problem also exists in GCC. And it was not solved
there, we just left it open. As now we are tuning aggressively for LLVM, I
think it's a good time to raise this issue and have a thorough solution so
that we have a good chance of gaining more better performance than GCC afdo.

Dehao

On Thu, Oct 27, 2016 at 11:44 AM, David Blaikie <dblaikie at gmail.com>
wrote:
> Is there prior art for this sort of thing (in GCC, for example) - I take
> it this isn't the first time this has come up as a problem for profile
> accuracy? (so it'd be helpful to know prior solutions to this (& if
we're
> not doing whatever was done before, what it is about our situation
that's
> different, etc), or why it hasn't been a problem, etc)
>
> On Thu, Oct 27, 2016 at 11:39 AM Dehao Chen <dehao at google.com>
wrote:
>
>> Motivation:
>> Many optimizations duplicate code. E.g. loop unroller duplicates the
loop
>> body, GVN duplicates computation, etc. The duplicated code will share
the
>> same debug info with the original code. For SamplePGO, the debug info
is
>> used to present the profile. Code duplication will affect profile
accuracy.
>> Taking loop unrolling for example:
>>
>> #1 foo();
>> #2 for (i = 0; i < N; i++) {
>> #3   bar();
>> #4 }
>>
>> If N is 8 during runtime, a reasonable profile will look like:
>>
>> #1: 10
>> #3: 80
>>
>> But if the compiler unrolls the loop by a factor of 4, the callsite to
>> bar() is duplicated 4 times and the profile will look like:
>>
>> #1: 10
>> #3: 20
>>
>> The sample count for #3 is 20 because all 4 callsites to bar() are
>> sampled 20 times each, and they shared the same debug loc (#3) so that
20
>> will be attributed to #3 (If one debugloc is mapped by multiple
>> instructions, the max sample count of these instructions is used as
>> debugloc's sample count).
>>
>> When loading this profile into compiler, it will think the loop trip
>> count is 2 instead of 8.
>>
>> Proposal:
>> When compiler duplicates code, it encodes the duplication info in the
>> debug info. As the duplication is not interesting to debugger, I
propose to
>> encode this as part of the discriminator.
>>
>> There are 2 types of code duplication:
>>
>> 1. duplicated code are guaranteed to have the same execution count
(e.g.
>> loop unroll and loop vectorize). We can record the duplication factor,
for
>> the above example "4" is recorded in the discriminator.
>> 2. duplicated code that may have different execution count (e.g. loop
>> peel and gvn). For a same debugloc, a unique number is assigned to each
>> copy and encoded in the discriminator.
>>
>> Assume that the discriminator is uint32. The traditional discriminator
is
>> less than 256, let's take 8 bit for it. For duplication factor
(type 1
>> duplication), we assume the maximum unroll_factor * vectorize_factor is
>> less than 256, thus 8 bit for it. For unique number(type 2
duplication), we
>> assume code is at most duplicated 32 times, thus 5 bit for it. Overall,
we
>> still have 11 free bits left in the discriminator encoding.
>>
>> Let's take the original source as an example, after loop unrolling
and
>> peeling, the code may looks like:
>>
>> for (i = 0; i < N & 3; i+= 4) {
>>   foo();  // discriminator: 0x40
>>   foo();  // discriminator: 0x40
>>   foo();  // discriminator: 0x40
>>   foo();  // discriminator: 0x40
>> }
>> if (i++ < N) {
>>   foo();   // discriminator: 0x100
>>   if (i++ < N) {
>>     foo(); // discriminator: 0x200
>>     if (i++ < N) {
>>       foo();  // discriminator: 0x300
>>     }
>>   }
>> }
>>
>> The cost of this change would be increased debug_line size because: 1.
we
>> are recording more discriminators 2. discriminators become larger and
will
>> take more ULEB128 encoding.
>>
>> The benefit is that the sample pgo profile can accurately represent the
>> code execution frequency. And we do not need to introduce new building
>> blocks to debug info.
>>
>> Comments?
>>
>> Thanks,
>> Dehao
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/70afb11a/attachment.html>

Dehao Chen via llvm-dev

2016-Oct-27 20:11 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

The impact to debug_line is actually not small. I only implemented the part
1 (encoding duplication factor) for loop unrolling and loop vectorization.
The debug_line size overhead for "-O2 -g1" binary of speccpu C/C++
benchmarks:

433.milc 23.59%
444.namd 6.25%
447.dealII 8.43%
450.soplex 2.41%
453.povray 5.40%
470.lbm 0.00%
482.sphinx3 7.10%
400.perlbench 2.77%
401.bzip2 9.62%
403.gcc 2.67%
429.mcf 9.54%
445.gobmk 7.40%
456.hmmer 9.79%
458.sjeng 9.98%
462.libquantum 10.90%
464.h264ref 30.21%
471.omnetpp 0.52%
473.astar 5.67%
483.xalancbmk 1.46%
mean 7.86%
Dehao

On Thu, Oct 27, 2016 at 11:55 AM, Xinliang David Li <davidxl at
google.com>
wrote:
> Do you have an estimate of the debug_line size increase? I guess it will
> be small.
>
> David
>
> On Thu, Oct 27, 2016 at 11:39 AM, Dehao Chen <dehao at google.com>
wrote:
>
>> Motivation:
>> Many optimizations duplicate code. E.g. loop unroller duplicates the
loop
>> body, GVN duplicates computation, etc. The duplicated code will share
the
>> same debug info with the original code. For SamplePGO, the debug info
is
>> used to present the profile. Code duplication will affect profile
accuracy.
>> Taking loop unrolling for example:
>>
>> #1 foo();
>> #2 for (i = 0; i < N; i++) {
>> #3   bar();
>> #4 }
>>
>> If N is 8 during runtime, a reasonable profile will look like:
>>
>> #1: 10
>> #3: 80
>>
>> But if the compiler unrolls the loop by a factor of 4, the callsite to
>> bar() is duplicated 4 times and the profile will look like:
>>
>> #1: 10
>> #3: 20
>>
>> The sample count for #3 is 20 because all 4 callsites to bar() are
>> sampled 20 times each, and they shared the same debug loc (#3) so that
20
>> will be attributed to #3 (If one debugloc is mapped by multiple
>> instructions, the max sample count of these instructions is used as
>> debugloc's sample count).
>>
>> When loading this profile into compiler, it will think the loop trip
>> count is 2 instead of 8.
>>
>> Proposal:
>> When compiler duplicates code, it encodes the duplication info in the
>> debug info. As the duplication is not interesting to debugger, I
propose to
>> encode this as part of the discriminator.
>>
>> There are 2 types of code duplication:
>>
>> 1. duplicated code are guaranteed to have the same execution count
(e.g.
>> loop unroll and loop vectorize). We can record the duplication factor,
for
>> the above example "4" is recorded in the discriminator.
>> 2. duplicated code that may have different execution count (e.g. loop
>> peel and gvn). For a same debugloc, a unique number is assigned to each
>> copy and encoded in the discriminator.
>>
>> Assume that the discriminator is uint32. The traditional discriminator
is
>> less than 256, let's take 8 bit for it. For duplication factor
(type 1
>> duplication), we assume the maximum unroll_factor * vectorize_factor is
>> less than 256, thus 8 bit for it. For unique number(type 2
duplication), we
>> assume code is at most duplicated 32 times, thus 5 bit for it. Overall,
we
>> still have 11 free bits left in the discriminator encoding.
>>
>> Let's take the original source as an example, after loop unrolling
and
>> peeling, the code may looks like:
>>
>> for (i = 0; i < N & 3; i+= 4) {
>>   foo();  // discriminator: 0x40
>>   foo();  // discriminator: 0x40
>>   foo();  // discriminator: 0x40
>>   foo();  // discriminator: 0x40
>> }
>> if (i++ < N) {
>>   foo();   // discriminator: 0x100
>>   if (i++ < N) {
>>     foo(); // discriminator: 0x200
>>     if (i++ < N) {
>>       foo();  // discriminator: 0x300
>>     }
>>   }
>> }
>>
>> The cost of this change would be increased debug_line size because: 1.
we
>> are recording more discriminators 2. discriminators become larger and
will
>> take more ULEB128 encoding.
>>
>> The benefit is that the sample pgo profile can accurately represent the
>> code execution frequency. And we do not need to introduce new building
>> blocks to debug info.
>>
>> Comments?
>>
>> Thanks,
>> Dehao
>>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/75e83e9c/attachment.html>

Robinson, Paul via llvm-dev

2016-Oct-27 21:53 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

It looks like the example doesn't use the encoding described in the text?

Assume that the discriminator is uint32. The traditional discriminator is less
than 256, let's take 8 bit for it. For duplication factor (type 1
duplication), we assume the maximum unroll_factor * vectorize_factor is less
than 256, thus 8 bit for it. For unique number(type 2 duplication), we assume
code is at most duplicated 32 times, thus 5 bit for it. Overall, we still have
11 free bits left in the discriminator encoding.

Let's take the original source as an example, after loop unrolling and
peeling, the code may looks like:

for (i = 0; i < N & 3; i+= 4) {
  foo();  // discriminator: 0x40
  foo();  // discriminator: 0x40
  foo();  // discriminator: 0x40
  foo();  // discriminator: 0x40
}
if (i++ < N) {
  foo();   // discriminator: 0x100
  if (i++ < N) {
    foo(); // discriminator: 0x200
    if (i++ < N) {
      foo();  // discriminator: 0x300
    }
  }
}

If we allocate 8 bits to "traditional" discriminators, then 0x40 falls
into that range, so I'd think the calls to foo() inside the loop should be
using 0x400 to encode the unroll factor.  Note this requires 2 bytes for ULEB128
instead of 1.
And if we allocate another 8 bits to the unroll factor, then the trailing calls
should use 0x10000, 0x20000, 0x30000.  These will require 3 bytes for ULEB128
instead of 2.

I think it would be fine to allocate only 4 bits to "traditional"
discriminators (as you need them to be unique across the same source location,
but not across different source locations, and 16 independent basic blocks for
the same source location seems like plenty to me; but I haven't looked at a
lot of cases with discriminators).  This would allow you to use 0x40 to encode
the unroll factor in this example.  If you still want to allocate 8 bits for
unroll factor, then the trailing calls would use 0x1000, 0x2000, 0x3000 so as
long as you had no more than 3 trailing calls you can still encode the
discriminator in a 2-byte ULEB128.
--paulr
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/b6581a81/attachment.html>

Hal Finkel via llvm-dev

2016-Oct-28 22:07 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Hi Dehao,

This is definitely an important problem, thanks for writing this up!

There is a related problem that I think we can address at the same time: When we
multiversion code, for example when we use runtime checks to enable the creation
of a vectorized loop while retaining the scalar loop, and then we collect
profiling data, we should be able to recover the relative running time of the
different versions of the loop. In the future, I suspect we'll end up with
transformations that produce even more versions of loops (Intel's compiler,
for example, has a useful pragma that allows the user to request loop
specializations for a specified set of trip counts).

I'd like to have a scheme where the source location + discriminator can be
mapped to information about the relevant loop version so that our profiling
tools can display that information usefully to the user, and so that our
optimizations can make useful choices (i.e. don't bother vectorizing a loop
when the scalar loop is run a lot but the vectorized version almost never runs).

In short, I think that differentiating these different regions using the
descriminator seems like a natural choice, but "And we do not need to
introduce new building blocks to debug info" might not save us all that
much in the long run. To keep information on what regions correspond to what
optimizations, we may need to do that. That's not a bad thing, and I'd
rather we solve this in a way that is extensible. Plus, this might make it
easier to use fewer bits, thus helping the overall impact on the size of the
debug sections.

Thanks again,
Hal

----- Original Message -----> From: "Dehao Chen via llvm-dev" <llvm-dev at
lists.llvm.org>
> To: llvm-dev at lists.llvm.org
> Cc: "Xinliang David Li" <davidxl at google.com>
> Sent: Thursday, October 27, 2016 1:39:15 PM
> Subject: [llvm-dev] (RFC) Encoding code duplication factor in discriminator
> 
> Motivation:
> Many optimizations duplicate code. E.g. loop unroller duplicates the
> loop body, GVN duplicates computation, etc. The duplicated code will
> share the same debug info with the original code. For SamplePGO, the
> debug info is used to present the profile. Code duplication will
> affect profile accuracy. Taking loop unrolling for example:
> 
> 
> #1 foo();
> #2 for (i = 0; i < N; i++) {
> 
> #3 bar();
> #4 }
> 
> 
> If N is 8 during runtime, a reasonable profile will look like:
> 
> 
> #1: 10
> #3: 80
> 
> 
> 
> But if the compiler unrolls the loop by a factor of 4, the callsite
> to bar() is duplicated 4 times and the profile will look like:
> 
> 
> #1: 10
> #3: 20
> 
> 
> The sample count for #3 is 20 because all 4 callsites to bar() are
> sampled 20 times each, and they shared the same debug loc (#3) so
> that 20 will be attributed to #3 (If one debugloc is mapped by
> multiple instructions, the max sample count of these instructions is
> used as debugloc's sample count).
> 
> 
> When loading this profile into compiler, it will think the loop trip
> count is 2 instead of 8.
> 
> 
> Proposal:
> When compiler duplicates code, it encodes the duplication info in the
> debug info. As the duplication is not interesting to debugger, I
> propose to encode this as part of the discriminator.
> 
> 
> There are 2 types of code duplication:
> 
> 
> 1. duplicated code are guaranteed to have the same execution count
> (e.g. loop unroll and loop vectorize). We can record the duplication
> factor, for the above example "4" is recorded in the
discriminator.
> 2. duplicated code that may have different execution count (e.g. loop
> peel and gvn). For a same debugloc, a unique number is assigned to
> each copy and encoded in the discriminator.
> 
> 
> Assume that the discriminator is uint32. The traditional
> discriminator is less than 256, let's take 8 bit for it. For
> duplication factor (type 1 duplication), we assume the maximum
> unroll_factor * vectorize_factor is less than 256, thus 8 bit for
> it. For unique number(type 2 duplication), we assume code is at most
> duplicated 32 times, thus 5 bit for it. Overall, we still have 11
> free bits left in the discriminator encoding.
> 
> 
> Let's take the original source as an example, after loop unrolling
> and peeling, the code may looks like:
> 
> 
> for (i = 0; i < N & 3; i+= 4) {
> foo(); // discriminator: 0x40
> foo(); // discriminator: 0x40
> foo(); // discriminator: 0x40
> foo(); // discriminator: 0x40
> }
> if (i++ < N) {
> foo(); // discriminator: 0x100
> if (i++ < N) {
> foo(); // discriminator: 0x200
> if (i++ < N) {
> foo(); // discriminator: 0x300
> }
> }
> }
> 
> 
> The cost of this change would be increased debug_line size because:
> 1. we are recording more discriminators 2. discriminators become
> larger and will take more ULEB128 encoding.
> 
> 
> The benefit is that the sample pgo profile can accurately represent
> the code execution frequency. And we do not need to introduce new
> building blocks to debug info.
> 
> 
> Comments?
> 
> 
> Thanks,
> Dehao
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Dehao Chen via llvm-dev

2016-Nov-01 16:33 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

On Thu, Oct 27, 2016 at 2:53 PM, Robinson, Paul <paul.robinson at
sony.com>
wrote:
> It looks like the example doesn't use the encoding described in the
text?
>
>
>
> Assume that the discriminator is uint32. The traditional discriminator is
> less than 256, let's take 8 bit for it. For duplication factor (type 1
> duplication), we assume the maximum unroll_factor * vectorize_factor is
> less than 256, thus 8 bit for it. For unique number(type 2 duplication), we
> assume code is at most duplicated 32 times, thus 5 bit for it. Overall, we
> still have 11 free bits left in the discriminator encoding.
>
>
>
> Let's take the original source as an example, after loop unrolling and
> peeling, the code may looks like:
>
>
>
> for (i = 0; i < N & 3; i+= 4) {
>
>   foo();  // discriminator: 0x40
>
>   foo();  // discriminator: 0x40
>
>   foo();  // discriminator: 0x40
>
>   foo();  // discriminator: 0x40
>
> }
>
> if (i++ < N) {
>
>   foo();   // discriminator: 0x100
>
>   if (i++ < N) {
>
>     foo(); // discriminator: 0x200
>
>     if (i++ < N) {
>
>       foo();  // discriminator: 0x300
>
>     }
>
>   }
>
> }
>
>
>
> If we allocate 8 bits to "traditional" discriminators, then 0x40
falls
> into that range, so I'd think the calls to foo() inside the loop should
be
> using 0x400 to encode the unroll factor.
>
You are right, thanks for pointing this out.

> Note this requires 2 bytes for ULEB128 instead of 1.
>
> And if we allocate another 8 bits to the unroll factor, then the trailing
> calls should use 0x10000, 0x20000, 0x30000.  These will require 3 bytes for
> ULEB128 instead of 2.
>
>
>
> I think it would be fine to allocate only 4 bits to "traditional"
> discriminators (as you need them to be unique across the same source
> location, but not across different source locations, and 16 independent
> basic blocks for the same source location seems like plenty to me; but I
> haven't looked at a lot of cases with discriminators).  This would
allow
> you to use 0x40 to encode the unroll factor in this example.  If you still
> want to allocate 8 bits for unroll factor, then the trailing calls would
> use 0x1000, 0x2000, 0x3000 so as long as you had no more than 3 trailing
> calls you can still encode the discriminator in a 2-byte ULEB128.
>
I did some evaluation on speccpu2006 c/c++ benchmarks. Among all source
built, there are 268K non-zero discriminators emitted. The largest
discriminator is 779 (coming from 471.omnetpp, which has significant amount
of EH code.) The 99% percentile is 18. The 85% percentile is 3. For the
duplication factor, the largest is 320, the 99% percentile is 40, the 85%
percentile is 4.

If we want to encode this into the discriminator, I would propose encode
something like:

high bits   ---------->  low bits
EEEEEEEECCCCCFFDDD CCCFFFFFFDDDDD

D: normal discriminator
F: duplication factor
C: code cloning
E: empty bits

So the lower 14 bits should be able to cover 99% percentile

Or something like:
high bits   ---------->  low bits
EEEEEEEECCCCCFFDDD CFFFDDD CCFFFDD

So the lower 7 bits should be able to cover 85% percentile and the lower 14
bits should be able to cover 99% percentile.

Dehao


> --paulr
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/151d1ad3/attachment.html>

Dehao Chen via llvm-dev

2016-Nov-01 16:43 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

On Fri, Oct 28, 2016 at 3:07 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> Hi Dehao,
>
> This is definitely an important problem, thanks for writing this up!
>
> There is a related problem that I think we can address at the same time:
> When we multiversion code, for example when we use runtime checks to enable
> the creation of a vectorized loop while retaining the scalar loop, and then
> we collect profiling data, we should be able to recover the relative
> running time of the different versions of the loop. In the future, I
> suspect we'll end up with transformations that produce even more
versions
> of loops (Intel's compiler, for example, has a useful pragma that
allows
> the user to request loop specializations for a specified set of trip
> counts).
>
> I'd like to have a scheme where the source location + discriminator can
be
> mapped to information about the relevant loop version so that our profiling
> tools can display that information usefully to the user, and so that our
> optimizations can make useful choices (i.e. don't bother vectorizing a
loop
> when the scalar loop is run a lot but the vectorized version almost never
> runs).
>
That's definitely a valid and important use case, and it is important to
sample pgo too. That's why I proposed to have "duplicated code that may
have different execution count" being recorded. Will that suffice to get
the info you want? (i.e. for every version of the multi-versioned loop, you
will have a disincentive discriminator associated with all the code it
expands.

>
> In short, I think that differentiating these different regions using the
> descriminator seems like a natural choice, but "And we do not need to
> introduce new building blocks to debug info" might not save us all
that
> much in the long run. To keep information on what regions correspond to
> what optimizations, we may need to do that. That's not a bad thing, and
I'd
> rather we solve this in a way that is extensible. Plus, this might make it
> easier to use fewer bits, thus helping the overall impact on the size of
> the debug sections.
>
I agree that if we want to extend this in the future, we need to have
separate dwarf bits other than discriminator. For current use case,
discriminator seem to be good enough. And if we encode efficiently, it will
be better than introducing a new field. e.g., we can encode all info in a
1-byte ULEB128 85%~90% of the time, but for a new field, we will at least
need 2 bytes if both discriminator and cloning info exists for an
instruction.

Dehao

>
> Thanks again,
> Hal
>
> ----- Original Message -----
> > From: "Dehao Chen via llvm-dev" <llvm-dev at
lists.llvm.org>
> > To: llvm-dev at lists.llvm.org
> > Cc: "Xinliang David Li" <davidxl at google.com>
> > Sent: Thursday, October 27, 2016 1:39:15 PM
> > Subject: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
> >
> > Motivation:
> > Many optimizations duplicate code. E.g. loop unroller duplicates the
> > loop body, GVN duplicates computation, etc. The duplicated code will
> > share the same debug info with the original code. For SamplePGO, the
> > debug info is used to present the profile. Code duplication will
> > affect profile accuracy. Taking loop unrolling for example:
> >
> >
> > #1 foo();
> > #2 for (i = 0; i < N; i++) {
> >
> > #3 bar();
> > #4 }
> >
> >
> > If N is 8 during runtime, a reasonable profile will look like:
> >
> >
> > #1: 10
> > #3: 80
> >
> >
> >
> > But if the compiler unrolls the loop by a factor of 4, the callsite
> > to bar() is duplicated 4 times and the profile will look like:
> >
> >
> > #1: 10
> > #3: 20
> >
> >
> > The sample count for #3 is 20 because all 4 callsites to bar() are
> > sampled 20 times each, and they shared the same debug loc (#3) so
> > that 20 will be attributed to #3 (If one debugloc is mapped by
> > multiple instructions, the max sample count of these instructions is
> > used as debugloc's sample count).
> >
> >
> > When loading this profile into compiler, it will think the loop trip
> > count is 2 instead of 8.
> >
> >
> > Proposal:
> > When compiler duplicates code, it encodes the duplication info in the
> > debug info. As the duplication is not interesting to debugger, I
> > propose to encode this as part of the discriminator.
> >
> >
> > There are 2 types of code duplication:
> >
> >
> > 1. duplicated code are guaranteed to have the same execution count
> > (e.g. loop unroll and loop vectorize). We can record the duplication
> > factor, for the above example "4" is recorded in the
discriminator.
> > 2. duplicated code that may have different execution count (e.g. loop
> > peel and gvn). For a same debugloc, a unique number is assigned to
> > each copy and encoded in the discriminator.
> >
> >
> > Assume that the discriminator is uint32. The traditional
> > discriminator is less than 256, let's take 8 bit for it. For
> > duplication factor (type 1 duplication), we assume the maximum
> > unroll_factor * vectorize_factor is less than 256, thus 8 bit for
> > it. For unique number(type 2 duplication), we assume code is at most
> > duplicated 32 times, thus 5 bit for it. Overall, we still have 11
> > free bits left in the discriminator encoding.
> >
> >
> > Let's take the original source as an example, after loop unrolling
> > and peeling, the code may looks like:
> >
> >
> > for (i = 0; i < N & 3; i+= 4) {
> > foo(); // discriminator: 0x40
> > foo(); // discriminator: 0x40
> > foo(); // discriminator: 0x40
> > foo(); // discriminator: 0x40
> > }
> > if (i++ < N) {
> > foo(); // discriminator: 0x100
> > if (i++ < N) {
> > foo(); // discriminator: 0x200
> > if (i++ < N) {
> > foo(); // discriminator: 0x300
> > }
> > }
> > }
> >
> >
> > The cost of this change would be increased debug_line size because:
> > 1. we are recording more discriminators 2. discriminators become
> > larger and will take more ULEB128 encoding.
> >
> >
> > The benefit is that the sample pgo profile can accurately represent
> > the code execution frequency. And we do not need to introduce new
> > building blocks to debug info.
> >
> >
> > Comments?
> >
> >
> > Thanks,
> > Dehao
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/78e6d8b9/attachment-0001.html>

Possibly Parallel Threads

Search for more apparently analagous threads

llvm dev - Oct 2016 - (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Possibly Parallel Threads