thr3ads.net - llvm dev - [llvm-dev] (RFC) Encoding code duplication factor in discriminator [Nov 2016]

If this information is useful, please help other people find it:
Share via:

Hal Finkel via llvm-dev

2016-Nov-01 17:50 UTC

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

----- Original Message -----
> From: "Dehao Chen" <dehao at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Xinliang David Li" <davidxl at google.com>,
"llvm-dev"
> <llvm-dev at lists.llvm.org>
> Sent: Tuesday, November 1, 2016 11:43:41 AM
> Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
> On Fri, Oct 28, 2016 at 3:07 PM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> > Hi Dehao,
> 
> > This is definitely an important problem, thanks for writing this
> > up!
> 
> > There is a related problem that I think we can address at the same
> > time: When we multiversion code, for example when we use runtime
> > checks to enable the creation of a vectorized loop while retaining
> > the scalar loop, and then we collect profiling data, we should be
> > able to recover the relative running time of the different versions
> > of the loop. In the future, I suspect we'll end up with
> > transformations that produce even more versions of loops (Intel's
> > compiler, for example, has a useful pragma that allows the user to
> > request loop specializations for a specified set of trip counts).
> 
> > I'd like to have a scheme where the source location +
discriminator
> > can be mapped to information about the relevant loop version so
> > that
> > our profiling tools can display that information usefully to the
> > user, and so that our optimizations can make useful choices (i.e.
> > don't bother vectorizing a loop when the scalar loop is run a lot
> > but the vectorized version almost never runs).
> 
> That's definitely a valid and important use case, and it is important
> to sample pgo too. That's why I proposed to have " duplicated code
> that may have different execution count" being recorded. Will that
> suffice to get the info you want? (i.e. for every version of the
> multi-versioned loop, you will have a disincentive discriminator
> associated with all the code it expands.I don't know. Can you explain how the process will work? By the time the
code/metadata arrives at, say, the loop vectorizer, how can we tell whether the
vectorized version we might now create will be executed (based on having
profiling data from a run where the compiler might have previously made a
similar choice)?
> > In short, I think that differentiating these different regions
> > using
> > the descriminator seems like a natural choice, but "And we do not
> > need to introduce new building blocks to debug info" might not
save
> > us all that much in the long run. To keep information on what
> > regions correspond to what optimizations, we may need to do that.
> > That's not a bad thing, and I'd rather we solve this in a way
that
> > is extensible. Plus, this might make it easier to use fewer bits,
> > thus helping the overall impact on the size of the debug sections.
> 
> I agree that if we want to extend this in the future, we need to have
> separate dwarf bits other than discriminator. For current use case,
> discriminator seem to be good enough. And if we encode efficiently,
> it will be better than introducing a new field. e.g., we can encode
> all info in a 1-byte ULEB128 85%~90% of the time, but for a new
> field, we will at least need 2 bytes if both discriminator and
> cloning info exists for an instruction.Is this because you need at least one more byte for the debug-info field? 

In general, I don't really care where we stuff the bits so long as we can
get the necessary information back out. For a vectorized loop, for example, we
should be able to figure out which counts go to the vectorized loop vs. the
scalar loop. I don't, however, want to end up with something that is
super-non-extensible (e.g. we have only a few bits, so the vectorizer can get
one and the unroller can get one, but loop distribution is out of luck). Maybe
we need an 'extension' bit saying that there is more information encoded
elsewhere?

Thanks again, 
Hal 
> Dehao
> > Thanks again,
> 
> > Hal
> 
> > ----- Original Message -----
> 
> > > From: "Dehao Chen via llvm-dev" < llvm-dev at
lists.llvm.org >
> 
> > > To: llvm-dev at lists.llvm.org
> 
> > > Cc: "Xinliang David Li" < davidxl at google.com >
> 
> > > Sent: Thursday, October 27, 2016 1:39:15 PM
> 
> > > Subject: [llvm-dev] (RFC) Encoding code duplication factor in
> > > discriminator
> 
> > >
> 
> > > Motivation:
> 
> > > Many optimizations duplicate code. E.g. loop unroller duplicates
> > > the
> 
> > > loop body, GVN duplicates computation, etc. The duplicated code
> > > will
> 
> > > share the same debug info with the original code. For SamplePGO,
> > > the
> 
> > > debug info is used to present the profile. Code duplication will
> 
> > > affect profile accuracy. Taking loop unrolling for example:
> 
> > >
> 
> > >
> 
> > > #1 foo();
> 
> > > #2 for (i = 0; i < N; i++) {
> 
> > >
> 
> > > #3 bar();
> 
> > > #4 }
> 
> > >
> 
> > >
> 
> > > If N is 8 during runtime, a reasonable profile will look like:
> 
> > >
> 
> > >
> 
> > > #1: 10
> 
> > > #3: 80
> 
> > >
> 
> > >
> 
> > >
> 
> > > But if the compiler unrolls the loop by a factor of 4, the
> > > callsite
> 
> > > to bar() is duplicated 4 times and the profile will look like:
> 
> > >
> 
> > >
> 
> > > #1: 10
> 
> > > #3: 20
> 
> > >
> 
> > >
> 
> > > The sample count for #3 is 20 because all 4 callsites to bar()
> > > are
> 
> > > sampled 20 times each, and they shared the same debug loc (#3) so
> 
> > > that 20 will be attributed to #3 (If one debugloc is mapped by
> 
> > > multiple instructions, the max sample count of these instructions
> > > is
> 
> > > used as debugloc's sample count).
> 
> > >
> 
> > >
> 
> > > When loading this profile into compiler, it will think the loop
> > > trip
> 
> > > count is 2 instead of 8.
> 
> > >
> 
> > >
> 
> > > Proposal:
> 
> > > When compiler duplicates code, it encodes the duplication info in
> > > the
> 
> > > debug info. As the duplication is not interesting to debugger, I
> 
> > > propose to encode this as part of the discriminator.
> 
> > >
> 
> > >
> 
> > > There are 2 types of code duplication:
> 
> > >
> 
> > >
> 
> > > 1. duplicated code are guaranteed to have the same execution
> > > count
> 
> > > (e.g. loop unroll and loop vectorize). We can record the
> > > duplication
> 
> > > factor, for the above example "4" is recorded in the
> > > discriminator.
> 
> > > 2. duplicated code that may have different execution count (e.g.
> > > loop
> 
> > > peel and gvn). For a same debugloc, a unique number is assigned
> > > to
> 
> > > each copy and encoded in the discriminator.
> 
> > >
> 
> > >
> 
> > > Assume that the discriminator is uint32. The traditional
> 
> > > discriminator is less than 256, let's take 8 bit for it. For
> 
> > > duplication factor (type 1 duplication), we assume the maximum
> 
> > > unroll_factor * vectorize_factor is less than 256, thus 8 bit for
> 
> > > it. For unique number(type 2 duplication), we assume code is at
> > > most
> 
> > > duplicated 32 times, thus 5 bit for it. Overall, we still have 11
> 
> > > free bits left in the discriminator encoding.
> 
> > >
> 
> > >
> 
> > > Let's take the original source as an example, after loop
> > > unrolling
> 
> > > and peeling, the code may looks like:
> 
> > >
> 
> > >
> 
> > > for (i = 0; i < N & 3; i+= 4) {
> 
> > > foo(); // discriminator: 0x40
> 
> > > foo(); // discriminator: 0x40
> 
> > > foo(); // discriminator: 0x40
> 
> > > foo(); // discriminator: 0x40
> 
> > > }
> 
> > > if (i++ < N) {
> 
> > > foo(); // discriminator: 0x100
> 
> > > if (i++ < N) {
> 
> > > foo(); // discriminator: 0x200
> 
> > > if (i++ < N) {
> 
> > > foo(); // discriminator: 0x300
> 
> > > }
> 
> > > }
> 
> > > }
> 
> > >
> 
> > >
> 
> > > The cost of this change would be increased debug_line size
> > > because:
> 
> > > 1. we are recording more discriminators 2. discriminators become
> 
> > > larger and will take more ULEB128 encoding.
> 
> > >
> 
> > >
> 
> > > The benefit is that the sample pgo profile can accurately
> > > represent
> 
> > > the code execution frequency. And we do not need to introduce new
> 
> > > building blocks to debug info.
> 
> > >
> 
> > >
> 
> > > Comments?
> 
> > >
> 
> > >
> 
> > > Thanks,
> 
> > > Dehao
> 
> > > _______________________________________________
> 
> > > LLVM Developers mailing list
> 
> > > llvm-dev at lists.llvm.org
> 
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
> > >
> 
> > --
> 
> > Hal Finkel
> 
> > Lead, Compiler Technology and Programming Languages
> 
> > Leadership Computing Facility
> 
> > Argonne National Laboratory
> 
-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/62214f69/attachment.html>

Dehao Chen via llvm-dev

2016-Nov-01 18:24 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

On Tue, Nov 1, 2016 at 10:50 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"Xinliang David Li" <davidxl at google.com>,
"llvm-dev" <
> llvm-dev at lists.llvm.org>
> *Sent: *Tuesday, November 1, 2016 11:43:41 AM
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
>
> On Fri, Oct 28, 2016 at 3:07 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>> Hi Dehao,
>>
>> This is definitely an important problem, thanks for writing this up!
>>
>> There is a related problem that I think we can address at the same
time:
>> When we multiversion code, for example when we use runtime checks to
enable
>> the creation of a vectorized loop while retaining the scalar loop, and
then
>> we collect profiling data, we should be able to recover the relative
>> running time of the different versions of the loop. In the future, I
>> suspect we'll end up with transformations that produce even more
versions
>> of loops (Intel's compiler, for example, has a useful pragma that
allows
>> the user to request loop specializations for a specified set of trip
>> counts).
>>
>> I'd like to have a scheme where the source location + discriminator
can
>> be mapped to information about the relevant loop version so that our
>> profiling tools can display that information usefully to the user, and
so
>> that our optimizations can make useful choices (i.e. don't bother
>> vectorizing a loop when the scalar loop is run a lot but the vectorized
>> version almost never runs).
>>
>
> That's definitely a valid and important use case, and it is important
to
> sample pgo too. That's why I proposed to have "duplicated code
that may
> have different execution count" being recorded. Will that suffice to
get
> the info you want? (i.e. for every version of the multi-versioned loop, you
> will have a disincentive discriminator associated with all the code it
> expands.
>
> I don't know. Can you explain how the process will work? By the time
the
> code/metadata arrives at, say, the loop vectorizer, how can we tell whether
> the vectorized version we might now create will be executed (based on
> having profiling data from a run where the compiler might have previously
> made a similar choice)?
>
For example, for the following loop:

for (i = 0; i < N; i++)
  stmt;

After many loop optimizations it could become something like:

if (vectorized1)
  for (i = 0;.....)
     stmt;    //uid: 1
else if (vectorized2)
  for (i = 0;.....)
    stmt;    //uid: 2
else if (unrolled)
  for (i = 0; i < N & (~0x3) i += 4)
    stmt; stmt; stmt; stmt; //uid: 3
  for (; i < N; i++)
    stmt;  //uid: 4
else
  for (; i < N; i++)
    stmt;  //uid: 5

So basically, each clone of the loop will have a uid, which you can use to
identify which version of the loop the instruction was coming from.

>
>
>
>>
>> In short, I think that differentiating these different regions using
the
>> descriminator seems like a natural choice, but "And we do not need
to
>> introduce new building blocks to debug info" might not save us all
that
>> much in the long run. To keep information on what regions correspond to
>> what optimizations, we may need to do that. That's not a bad thing,
and I'd
>> rather we solve this in a way that is extensible. Plus, this might make
it
>> easier to use fewer bits, thus helping the overall impact on the size
of
>> the debug sections.
>>
>
> I agree that if we want to extend this in the future, we need to have
> separate dwarf bits other than discriminator. For current use case,
> discriminator seem to be good enough. And if we encode efficiently, it will
> be better than introducing a new field. e.g., we can encode all info in a
> 1-byte ULEB128 85%~90% of the time, but for a new field, we will at least
> need 2 bytes if both discriminator and cloning info exists for an
> instruction.
>
> Is this because you need at least one more byte for the debug-info field?
>
> In general, I don't really care where we stuff the bits so long as we
can
> get the necessary information back out. For a vectorized loop, for example,
> we should be able to figure out which counts go to the vectorized loop vs.
> the scalar loop. I don't, however, want to end up with something that
is
> super-non-extensible (e.g. we have only a few bits, so the vectorizer can
> get one and the unroller can get one, but loop distribution is out of
> luck). Maybe we need an 'extension' bit saying that there is more
> information encoded elsewhere?
>
As illustrated in the above example, it is not like "vectorization has a
distinct bit". All different optimizations make clones of code which will
be labeled by UIDs represented by N (e.g. 8) bits. In this way, the space
will be capped by the number of clones all optimizations have made, instead
of # of optimizations that has applied. And it will be capped at 2^N-1. The
cons of using uid is that you will not know if a clone is coming from
vectorization or unroll or loop distribution.

Thanks,
Dehao

>
> Thanks again,
> Hal
>
>
> Dehao
>
>
>>
>> Thanks again,
>> Hal
>>
>> ------------------------------
>>
>> > From: "Dehao Chen via llvm-dev" <llvm-dev at
lists.llvm.org>
>> > To: llvm-dev at lists.llvm.org
>> > Cc: "Xinliang David Li" <davidxl at google.com>
>> > Sent: Thursday, October 27, 2016 1:39:15 PM
>> > Subject: [llvm-dev] (RFC) Encoding code duplication factor in
>> discriminator
>> >
>> > Motivation:
>> > Many optimizations duplicate code. E.g. loop unroller duplicates
the
>> > loop body, GVN duplicates computation, etc. The duplicated code
will
>> > share the same debug info with the original code. For SamplePGO,
the
>> > debug info is used to present the profile. Code duplication will
>> > affect profile accuracy. Taking loop unrolling for example:
>> >
>> >
>> > #1 foo();
>> > #2 for (i = 0; i < N; i++) {
>> >
>> > #3 bar();
>> > #4 }
>> >
>> >
>> > If N is 8 during runtime, a reasonable profile will look like:
>> >
>> >
>> > #1: 10
>> > #3: 80
>> >
>> >
>> >
>> > But if the compiler unrolls the loop by a factor of 4, the
callsite
>> > to bar() is duplicated 4 times and the profile will look like:
>> >
>> >
>> > #1: 10
>> > #3: 20
>> >
>> >
>> > The sample count for #3 is 20 because all 4 callsites to bar() are
>> > sampled 20 times each, and they shared the same debug loc (#3) so
>> > that 20 will be attributed to #3 (If one debugloc is mapped by
>> > multiple instructions, the max sample count of these instructions
is
>> > used as debugloc's sample count).
>> >
>> >
>> > When loading this profile into compiler, it will think the loop
trip
>> > count is 2 instead of 8.
>> >
>> >
>> > Proposal:
>> > When compiler duplicates code, it encodes the duplication info in
the
>> > debug info. As the duplication is not interesting to debugger, I
>> > propose to encode this as part of the discriminator.
>> >
>> >
>> > There are 2 types of code duplication:
>> >
>> >
>> > 1. duplicated code are guaranteed to have the same execution count
>> > (e.g. loop unroll and loop vectorize). We can record the
duplication
>> > factor, for the above example "4" is recorded in the
discriminator.
>> > 2. duplicated code that may have different execution count (e.g.
loop
>> > peel and gvn). For a same debugloc, a unique number is assigned to
>> > each copy and encoded in the discriminator.
>> >
>> >
>> > Assume that the discriminator is uint32. The traditional
>> > discriminator is less than 256, let's take 8 bit for it. For
>> > duplication factor (type 1 duplication), we assume the maximum
>> > unroll_factor * vectorize_factor is less than 256, thus 8 bit for
>> > it. For unique number(type 2 duplication), we assume code is at
most
>> > duplicated 32 times, thus 5 bit for it. Overall, we still have 11
>> > free bits left in the discriminator encoding.
>> >
>> >
>> > Let's take the original source as an example, after loop
unrolling
>> > and peeling, the code may looks like:
>> >
>> >
>> > for (i = 0; i < N & 3; i+= 4) {
>> > foo(); // discriminator: 0x40
>> > foo(); // discriminator: 0x40
>> > foo(); // discriminator: 0x40
>> > foo(); // discriminator: 0x40
>> > }
>> > if (i++ < N) {
>> > foo(); // discriminator: 0x100
>> > if (i++ < N) {
>> > foo(); // discriminator: 0x200
>> > if (i++ < N) {
>> > foo(); // discriminator: 0x300
>> > }
>> > }
>> > }
>> >
>> >
>> > The cost of this change would be increased debug_line size
because:
>> > 1. we are recording more discriminators 2. discriminators become
>> > larger and will take more ULEB128 encoding.
>> >
>> >
>> > The benefit is that the sample pgo profile can accurately
represent
>> > the code execution frequency. And we do not need to introduce new
>> > building blocks to debug info.
>> >
>> >
>> > Comments?
>> >
>> >
>> > Thanks,
>> > Dehao
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/5e4785fe/attachment-0001.html>

Hal Finkel via llvm-dev

2016-Nov-01 18:27 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

----- Original Message -----
> From: "Dehao Chen" <dehao at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Xinliang David Li" <davidxl at google.com>,
"llvm-dev"
> <llvm-dev at lists.llvm.org>
> Sent: Tuesday, November 1, 2016 1:24:01 PM
> Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
> On Tue, Nov 1, 2016 at 10:50 AM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> > > From: "Dehao Chen" < dehao at google.com >
> > 
> 
> > > To: "Hal Finkel" < hfinkel at anl.gov >
> > 
> 
> > > Cc: "Xinliang David Li" < davidxl at google.com
>, "llvm-dev" <
> > > llvm-dev at lists.llvm.org >
> > 
> 
> > > Sent: Tuesday, November 1, 2016 11:43:41 AM
> > 
> 
> > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> > > discriminator
> > 
> 
> > > On Fri, Oct 28, 2016 at 3:07 PM, Hal Finkel < hfinkel at
anl.gov >
> > > wrote:
> > 
> 
> > > > Hi Dehao,
> > > 
> > 
> 
> > > > This is definitely an important problem, thanks for writing
> > > > this
> > > > up!
> > > 
> > 
> 
> > > > There is a related problem that I think we can address at
the
> > > > same
> > > > time: When we multiversion code, for example when we use
> > > > runtime
> > > > checks to enable the creation of a vectorized loop while
> > > > retaining
> > > > the scalar loop, and then we collect profiling data, we
should
> > > > be
> > > > able to recover the relative running time of the different
> > > > versions
> > > > of the loop. In the future, I suspect we'll end up with
> > > > transformations that produce even more versions of loops
> > > > (Intel's
> > > > compiler, for example, has a useful pragma that allows the
user
> > > > to
> > > > request loop specializations for a specified set of trip
> > > > counts).
> > > 
> > 
> 
> > > > I'd like to have a scheme where the source location +
> > > > discriminator
> > > > can be mapped to information about the relevant loop version
so
> > > > that
> > > > our profiling tools can display that information usefully to
> > > > the
> > > > user, and so that our optimizations can make useful choices
> > > > (i.e.
> > > > don't bother vectorizing a loop when the scalar loop is
run a
> > > > lot
> > > > but the vectorized version almost never runs).
> > > 
> > 
> 
> > > That's definitely a valid and important use case, and it is
> > > important
> > > to sample pgo too. That's why I proposed to have "
duplicated
> > > code
> > > that may have different execution count" being recorded.
Will
> > > that
> > > suffice to get the info you want? (i.e. for every version of the
> > > multi-versioned loop, you will have a disincentive discriminator
> > > associated with all the code it expands.
> > 
> 
> > I don't know. Can you explain how the process will work? By the
> > time
> > the code/metadata arrives at, say, the loop vectorizer, how can we
> > tell whether the vectorized version we might now create will be
> > executed (based on having profiling data from a run where the
> > compiler might have previously made a similar choice)?
> 
> For example, for the following loop:
> for (i = 0; i < N; i++)
> stmt;
> After many loop optimizations it could become something like:
> if (vectorized1)
> for (i = 0;.....)
> stmt; //uid: 1
> else if (vectorized2)
> for (i = 0;.....)
> stmt; //uid: 2
> else if (unrolled)
> for (i = 0; i < N & (~0x3) i += 4)
> stmt; stmt; stmt; stmt; //uid: 3
> for (; i < N; i++)
> stmt; //uid: 4
> else
> for (; i < N; i++)
> stmt; //uid: 5
> So basically, each clone of the loop will have a uid, which you can
> use to identify which version of the loop the instruction was coming
> from.
> > > > In short, I think that differentiating these different
regions
> > > > using
> > > > the descriminator seems like a natural choice, but "And
we do
> > > > not
> > > > need to introduce new building blocks to debug info"
might not
> > > > save
> > > > us all that much in the long run. To keep information on
what
> > > > regions correspond to what optimizations, we may need to do
> > > > that.
> > > > That's not a bad thing, and I'd rather we solve this
in a way
> > > > that
> > > > is extensible. Plus, this might make it easier to use fewer
> > > > bits,
> > > > thus helping the overall impact on the size of the debug
> > > > sections.
> > > 
> > 
> 
> > > I agree that if we want to extend this in the future, we need to
> > > have
> > > separate dwarf bits other than discriminator. For current use
> > > case,
> > > discriminator seem to be good enough. And if we encode
> > > efficiently,
> > > it will be better than introducing a new field. e.g., we can
> > > encode
> > > all info in a 1-byte ULEB128 85%~90% of the time, but for a new
> > > field, we will at least need 2 bytes if both discriminator and
> > > cloning info exists for an instruction.
> > 
> 
> > Is this because you need at least one more byte for the debug-info
> > field?
> 
> > In general, I don't really care where we stuff the bits so long as
> > we
> > can get the necessary information back out. For a vectorized loop,
> > for example, we should be able to figure out which counts go to the
> > vectorized loop vs. the scalar loop. I don't, however, want to end
> > up with something that is super-non-extensible (e.g. we have only a
> > few bits, so the vectorizer can get one and the unroller can get
> > one, but loop distribution is out of luck). Maybe we need an
> > 'extension' bit saying that there is more information encoded
> > elsewhere?
> 
> As illustrated in the above example, it is not like "vectorization
> has a distinct bit". All different optimizations make clones of code
> which will be labeled by UIDs represented by N (e.g. 8) bits. In
> this way, the space will be capped by the number of clones all
> optimizations have made, instead of # of optimizations that has
> applied. And it will be capped at 2^N-1. The cons of using uid is
> that you will not know if a clone is coming from vectorization or
> unroll or loop distribution.Okay, but that kind of semantic mapping is important. How should we
encode/recover that information? To be clear, I'm not saying that we need to
implement that up front, but there needs to be a clear path to an
implementation, because I don't want to have two disjoint schemes.

Thanks again, 
Hal 
> Thanks,
> Dehao
> > Thanks again,
> 
> > Hal
> 
> > > Dehao
> > 
> 
> > > > Thanks again,
> > > 
> > 
> 
> > > > Hal
> > > 
> > 
> 
> > > > > From: "Dehao Chen via llvm-dev" < llvm-dev
at lists.llvm.org >
> > > 
> > 
> 
> > > > > To: llvm-dev at lists.llvm.org
> > > 
> > 
> 
> > > > > Cc: "Xinliang David Li" < davidxl at
google.com >
> > > 
> > 
> 
> > > > > Sent: Thursday, October 27, 2016 1:39:15 PM
> > > 
> > 
> 
> > > > > Subject: [llvm-dev] (RFC) Encoding code duplication
factor in
> > > > > discriminator
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Motivation:
> > > 
> > 
> 
> > > > > Many optimizations duplicate code. E.g. loop unroller
> > > > > duplicates
> > > > > the
> > > 
> > 
> 
> > > > > loop body, GVN duplicates computation, etc. The
duplicated
> > > > > code
> > > > > will
> > > 
> > 
> 
> > > > > share the same debug info with the original code. For
> > > > > SamplePGO,
> > > > > the
> > > 
> > 
> 
> > > > > debug info is used to present the profile. Code
duplication
> > > > > will
> > > 
> > 
> 
> > > > > affect profile accuracy. Taking loop unrolling for
example:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > #1 foo();
> > > 
> > 
> 
> > > > > #2 for (i = 0; i < N; i++) {
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > #3 bar();
> > > 
> > 
> 
> > > > > #4 }
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > If N is 8 during runtime, a reasonable profile will
look
> > > > > like:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > #1: 10
> > > 
> > 
> 
> > > > > #3: 80
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > But if the compiler unrolls the loop by a factor of 4,
the
> > > > > callsite
> > > 
> > 
> 
> > > > > to bar() is duplicated 4 times and the profile will
look
> > > > > like:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > #1: 10
> > > 
> > 
> 
> > > > > #3: 20
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > The sample count for #3 is 20 because all 4 callsites
to
> > > > > bar()
> > > > > are
> > > 
> > 
> 
> > > > > sampled 20 times each, and they shared the same debug
loc
> > > > > (#3)
> > > > > so
> > > 
> > 
> 
> > > > > that 20 will be attributed to #3 (If one debugloc is
mapped
> > > > > by
> > > 
> > 
> 
> > > > > multiple instructions, the max sample count of these
> > > > > instructions
> > > > > is
> > > 
> > 
> 
> > > > > used as debugloc's sample count).
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > When loading this profile into compiler, it will think
the
> > > > > loop
> > > > > trip
> > > 
> > 
> 
> > > > > count is 2 instead of 8.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Proposal:
> > > 
> > 
> 
> > > > > When compiler duplicates code, it encodes the
duplication
> > > > > info
> > > > > in
> > > > > the
> > > 
> > 
> 
> > > > > debug info. As the duplication is not interesting to
> > > > > debugger,
> > > > > I
> > > 
> > 
> 
> > > > > propose to encode this as part of the discriminator.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > There are 2 types of code duplication:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > 1. duplicated code are guaranteed to have the same
execution
> > > > > count
> > > 
> > 
> 
> > > > > (e.g. loop unroll and loop vectorize). We can record
the
> > > > > duplication
> > > 
> > 
> 
> > > > > factor, for the above example "4" is recorded
in the
> > > > > discriminator.
> > > 
> > 
> 
> > > > > 2. duplicated code that may have different execution
count
> > > > > (e.g.
> > > > > loop
> > > 
> > 
> 
> > > > > peel and gvn). For a same debugloc, a unique number is
> > > > > assigned
> > > > > to
> > > 
> > 
> 
> > > > > each copy and encoded in the discriminator.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Assume that the discriminator is uint32. The
traditional
> > > 
> > 
> 
> > > > > discriminator is less than 256, let's take 8 bit
for it. For
> > > 
> > 
> 
> > > > > duplication factor (type 1 duplication), we assume the
> > > > > maximum
> > > 
> > 
> 
> > > > > unroll_factor * vectorize_factor is less than 256, thus
8 bit
> > > > > for
> > > 
> > 
> 
> > > > > it. For unique number(type 2 duplication), we assume
code is
> > > > > at
> > > > > most
> > > 
> > 
> 
> > > > > duplicated 32 times, thus 5 bit for it. Overall, we
still
> > > > > have
> > > > > 11
> > > 
> > 
> 
> > > > > free bits left in the discriminator encoding.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Let's take the original source as an example, after
loop
> > > > > unrolling
> > > 
> > 
> 
> > > > > and peeling, the code may looks like:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > for (i = 0; i < N & 3; i+= 4) {
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x40
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x40
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x40
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x40
> > > 
> > 
> 
> > > > > }
> > > 
> > 
> 
> > > > > if (i++ < N) {
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x100
> > > 
> > 
> 
> > > > > if (i++ < N) {
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x200
> > > 
> > 
> 
> > > > > if (i++ < N) {
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x300
> > > 
> > 
> 
> > > > > }
> > > 
> > 
> 
> > > > > }
> > > 
> > 
> 
> > > > > }
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > The cost of this change would be increased debug_line
size
> > > > > because:
> > > 
> > 
> 
> > > > > 1. we are recording more discriminators 2.
discriminators
> > > > > become
> > > 
> > 
> 
> > > > > larger and will take more ULEB128 encoding.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > The benefit is that the sample pgo profile can
accurately
> > > > > represent
> > > 
> > 
> 
> > > > > the code execution frequency. And we do not need to
introduce
> > > > > new
> > > 
> > 
> 
> > > > > building blocks to debug info.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Comments?
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Thanks,
> > > 
> > 
> 
> > > > > Dehao
> > > 
> > 
> 
> > > > > _______________________________________________
> > > 
> > 
> 
> > > > > LLVM Developers mailing list
> > > 
> > 
> 
> > > > > llvm-dev at lists.llvm.org
> > > 
> > 
> 
> > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > --
> > > 
> > 
> 
> > > > Hal Finkel
> > > 
> > 
> 
> > > > Lead, Compiler Technology and Programming Languages
> > > 
> > 
> 
> > > > Leadership Computing Facility
> > > 
> > 
> 
> > > > Argonne National Laboratory
> > > 
> > 
> 
> > --
> 
> > Hal Finkel
> 
> > Lead, Compiler Technology and Programming Languages
> 
> > Leadership Computing Facility
> 
> > Argonne National Laboratory
> 
-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/340b759a/attachment.html>

Seemingly Similar Threads

Search for more apparently analagous threads

llvm dev - Nov 2016 - (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Seemingly Similar Threads