thr3ads.net - llvm dev - [llvm-dev] (RFC) Encoding code duplication factor in discriminator [Nov 2016]

If this information is useful, please help other people find it:
Share via:

Robinson, Paul via llvm-dev

2016-Nov-21 23:32 UTC

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

In many cases, the line-table fussing to improve autoFDO/sample-PGO would also
likely help the debugging experience for optimized code, certainly in cases
where line attribution is inherently ambiguous.  In those cases, I have no
problem with Just Doing It.

Something likely to pad the line table to benefit profiling without similarly
benefiting debugging… that's probably worth inventing a –gprofile or some
such.  I suspect Dehao's more elaborate ideas about discriminators fall into
that category (although IIRC discriminators were invented mainly to benefit
profiling in the first place, so maybe having discriminators at all would be
something to put under a –gprofile kind of option).
--paulr

From: David Blaikie [mailto:dblaikie at gmail.com]
Sent: Monday, November 21, 2016 3:04 PM
To: Dehao Chen; Hal Finkel
Cc: llvm-dev; Xinliang David Li; Adrian Prantl; Robinson, Paul; Diego Novillo
Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in discriminator

(+Adrian & Paul since they're mentioned here, +Diego since he might have
opinions as he implemented the discriminator stuff originally)

my 2c (that was mentioned here & that I idly discussed with Diego a few
weeks ago in no great detail) is also that maybe a separate mode for PGO might
be what we want/need, because it's not what sanitizers or simple backtrace
tools need & we're slowly growing the size of debug info for profiling
needs that we don't need elsewhere. (might be worth doing a full measurement
here as we add new features - what's the growth of all hte profile related
features (from (including) discriminators up) and see if it's still a
reasonable thing to keep with the rest of the debug info use cases or needs a
separate flag)

But I'm not too fussed/don't feel terribly strongly, really.
On Fri, Nov 4, 2016 at 3:44 PM Dehao Chen via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
Discussed with Hal, Adrain and Paul offline at the llvm dev meeting today.

* trip count is not enough for vectorization, there is runtime check that might
go false, which can be reflected in profile that we may want to preserve.
* simply recording these context-profile may cause problems to
iterative-sample-pgo. i.e. when you find a loop's vectorized version no
executed (due to runtime check), you will choose not to vectorize (which is
optimal). But when you collect profile from this binary (optimized with
sample-pgo, not vectorize the loop), as the loop is not vectorized, we do not
have the context to demonstrate "the loop should *not* vectorize" any
more. So it will end up being vectorized again, introducing perf instability.

To summarize, more context info may improve performance for one iteration, but
the perf improvement may not be stable across iterations. If we aim at
performance stability (which is one of the major goals of this RFC), profile
should only reflect the attribute of source, not compiler transformations.

But there are sample pgo users who does *not* care about iterative sample pgo
performance, for them, as Hal suggested, we should invent a more extensible way
to preserve profile context. Apparently discriminator is not an extensible
choice. So how about we just use discriminator to store the attribute of the
source (i.e trip count), and later design new extensible ways in dwarf to
represent more context info?

Adrian also suggested that we may need to consider have a flag or a separate
debugging mode to indicate if we want to emit discriminator to prevent
debug_line size increase.

Dehao

On Tue, Nov 1, 2016 at 9:34 PM, Dehao Chen <dehao at
google.com<mailto:dehao at google.com>> wrote:


On Tue, Nov 1, 2016 at 7:35 PM, Hal Finkel <hfinkel at
anl.gov<mailto:hfinkel at anl.gov>> wrote:

________________________________
From: "Dehao Chen" <dehao at google.com<mailto:dehao at
google.com>>
To: "Hal Finkel" <hfinkel at anl.gov<mailto:hfinkel at
anl.gov>>
Cc: "llvm-dev" <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>, "Xinliang David Li" <davidxl at
google.com<mailto:davidxl at google.com>>
Sent: Tuesday, November 1, 2016 8:24:30 PM

Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in discriminator


On Tue, Nov 1, 2016 at 5:56 PM, Hal Finkel <hfinkel at
anl.gov<mailto:hfinkel at anl.gov>> wrote:

________________________________
From: "Dehao Chen" <dehao at google.com<mailto:dehao at
google.com>>
To: "Hal Finkel" <hfinkel at anl.gov<mailto:hfinkel at
anl.gov>>
Cc: "llvm-dev" <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>, "Xinliang David Li" <davidxl at
google.com<mailto:davidxl at google.com>>
Sent: Tuesday, November 1, 2016 6:41:29 PM

Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in discriminator


On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel <hfinkel at
anl.gov<mailto:hfinkel at anl.gov>> wrote:

________________________________
From: "Hal Finkel via llvm-dev" <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
To: "Dehao Chen" <dehao at google.com<mailto:dehao at
google.com>>
Cc: "llvm-dev" <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>, "Xinliang David Li" <davidxl at
google.com<mailto:davidxl at google.com>>
Sent: Tuesday, November 1, 2016 4:26:17 PM
Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in discriminator

________________________________
From: "Dehao Chen" <dehao at google.com<mailto:dehao at
google.com>>
To: "Hal Finkel" <hfinkel at anl.gov<mailto:hfinkel at
anl.gov>>
Cc: "Paul Robinson" <paul.robinson at
sony.com<mailto:paul.robinson at sony.com>>, "Xinliang David
Li" <davidxl at google.com<mailto:davidxl at google.com>>,
"llvm-dev" <llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>>
Sent: Tuesday, November 1, 2016 4:14:43 PM
Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in discriminator
damn... my english is not readable at all when I try to write fast... trying to
make some clarification below, hopefully can make it more readable...

On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen <dehao at
google.com<mailto:dehao at google.com>> wrote:
Oops... pressed the wrong button and sent out early...

On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen <dehao at
google.com<mailto:dehao at google.com>> wrote:
If Hal's proposal is for SamplePGO purpose, let me clarify some design
principles of SamplePGO.

The profile for sample pgo uses source location as the key to map the execution
count back to IR. This design is based on the principle that we do not want the
profile to be tightly couple with compiler IR. Instead, profile is simple an
attribute of the source code. We have been benefited  a lot from this design
that the profile can easily be reused across different source versions and
compiler versions, or even compilers.

That being said, the design to encode more info into discriminator does not mean
that we will change the profile. The encoded info in discriminator will be
handled by the create_llvm_prof tool, which combines counts from different
clones of the same source code and generate the combined profile data. The
output profile will not have any cloning/dupliaction bits at all. So for the
initial example profile I provided, the output profile will be:

#1: 10
#3: 80

Not:

#1: 10
#3.0x400: 70
#3.0x10400: 5
#3.0x20400: 3
#3.0x30400: 2

Also, how does this work for vectorization? For vectorization, you're going
to multiply by the duplication factor first? The issue obviously is that each
vector instruction counts for VF times as many scalar instructions, and so to
get equivalent scalar counts, you need to multiply by VF. I had assumed this is
what you were saying when I read the initial e-mail, but if you're also
using the same duplication scheme for unrolling, then I think we need some way
to differentiate.

In my original proposal, I only record VF*UF if the clone is both unrolled and
vectorized. I did not distinguish between unroll and vectorization because I the
unroll/vectorize decision only depends on the trip count, which can be derived
from the duplication factor. If more profile info can be used for better
unroll/vectorize decision making, we can definitely add it.
I'm still missing something here. In your proposed system, does the tool
collecting the profiling information *interpret* the duplication factor encoded
in the descriminators at all?

Yes it does.
Okay. I'm still missing something. Can you please write out an example using
your notation below for a vectorized loop?

Sure. Original code:

for (int i = 0; i < N; i++)
  a[i] = b[i];

Transformed code:

for (int i = 0; i < N / 4; i++)
  vectorized_assign_4(&a[i * 4], &b[i * 4]); // discriminator 0x400,
sample count 20
for (int i = N & ~3; i < N; i++)
  a[i] = b[i];       // discriminator 0x10000, sample count 10

The derived sample count for original a[i] = b[i] is: 20 * 4 + 10 = 90



If it does, it seems like this should be specific to vectorization. For
unrolling, you still have the same number of instructions, and so you just need
to make the different instructions from the different loop iterations carry
different discriminator values (so they they'll sum together, instead of
using the maximum, as you explained in your original proposal).

Let's take a look at the original loop unroll example. My proposed
discriminator assignment is:
for (i = 0; i < N & 3; i+= 4) {
  foo();  // discriminator: 0x400
  foo();  // discriminator: 0x400
  foo();  // discriminator: 0x400
  foo();  // discriminator: 0x400
}
if (i++ < N) {
  foo();   // discriminator: 0x10000
  if (i++ < N) {
    foo(); // discriminator: 0x20000
    if (i++ < N) {
      foo();  // discriminator: 0x30000
    }
  }
}

IIRC, your proposed discriminator assignment is:
To be clear, I don't think that I've proposed any assignment scheme at
all. Regardless, I was asking this question purely in the context of your
proposal.

Oh I used this example to show my understanding of your reply "For
unrolling, you still have the same number of instructions, and so you just need
to make the different instructions from the different loop iterations carry
different discriminator values (so they they'll sum together, instead of
using the maximum, as you explained in your original proposal)." Please let
me know if my understanding is incorrect.

Thanks,
Dehao


Thanks again,
Hal


for (i = 0; i < N & 3; i+= 4) {
  foo();  // discriminator: 0x10000
  foo();  // discriminator: 0x20000
  foo();  // discriminator: 0x30000
  foo();  // discriminator: 0x40000
}
if (i++ < N) {
  foo();   // discriminator: 0x50000
  if (i++ < N) {
    foo(); // discriminator: 0x60000
    if (i++ < N) {
      foo();  // discriminator: 0x70000
    }
  }
}

I think my method can save discriminator space because all clones share the same
discriminator. Consider that if a loop is unrolled 100 times, then with your
algorithm, we may end up having 100 different discriminators, which seem not
good for debug info size.




 -Hal

The goal of the proposed change, is to make profile more accurately represent
the attribute of the source.
The non-goal of the proposed change, is to provide more context in the profile
to present the behavior of program in the context of different context.

The non-goal of the proposed change, is to provide more context in the profile
to present the behavior of program in different contexts.'
Okay, but why? The information will be present in the profiles, why not encode
it in the metadata and let the optimizer decide when to sum it up? We should
even provide some utility functions that make this transparent for passes that
don't care about the distinction.
I see, you are proposing encoding these profile in the metadata. I agree that
this can work, but from implementation's perspective, it could could lead to
great complexity.

Sample PGO pass first read in the source->count map, use it to get
basic_block->count map, and then use heuristics to propagate and get
branch->probability count. So we don't have Metadata to store the raw
count, but only the branch probability is stored after sample pgo pass.
I think that, as you indicate below, we'd need to change this to {source,
disc}->count map --> {BB, disc}->count map --> {branch,
disc}->prob map. The idea being that the total probability for the branch is
the sum of the probabilities associated with it for each discriminator value.
Most users will just sum them up, but some passes will want the breakdowns.

Agree. Note that this would add some complexity to BFI/BPI



If we are pursuing more context-sensitive profile, that would be a very
different design. In this design, we will need to read profile multiple times,
and have the profile tightly coupled with the compiler IR/pass manager. That is
doable, but I don't think that probably better suits instrumentation based
PGO's domain. Comments?

If we are pursuing more context-sensitive profile, that would be a very
different design in which we need to read in profiles multiple times, and have
the profile tightly coupled with the compiler IR/pass manager. That is doable,
but that probably better suits instrumentation based PGO's domain. Comments?
I don't understand why you say we'd need to read the profile multiple
times. Can you please explain? I also don't think it needs to be that
tightly coupled; we just need each pass that generates multiple copies of things
to have some way to generate a unique id for what it's doing. It's all
per source location, so I don't even think we need any kind of complicated
hashing scheme.
As explained above, we only have branch probability. Yes, we can record more
instruction/basicblock count context info in additional metadata. But when you
use it in optimizer, I suppose you will need to convert it to branch probability
too?

Let me take an example to demonstrate my understanding of your use of
context-sensitive profile.

original code:
for ()
  stmt1;

optimized code:
if (cond1)
  for()
    stmt1;  //100 samples
else if (cond2)
  for()
    stmt1;  // 0 samples
else
  for()
    stmt1; // 50 samples

The loop was multi-versioned by cond1 and cond2. With my proposal, the profile
for stmt1 will be combined as 150 samples. If we use this profile to optimize
the orginal source, after annotation, the code becomes:

for()
  stmt1 // 150 samples

Later when it comes to the mulit-version optimization, it will transform the
code to:

if (cond1)
  for()
    stmt1;  //50 samples
else if (cond2)
  for()
    stmt1;  // 50 samples
else
  for()
    stmt1; // 50 samples

The 150 samples are evenly distributed to 3 clones as we do not have context
info.

With your proposal, after annotation, the original code becomes:

for()
  stmt1 // 150 samples (M1:100, M2:0, M3:50)

Then during mutli-version optimization, the optimizer reads the metadata and
find M2 is never taken, so it will transform the code to:

if (cond1)
  for()
    stmt1;  // 100 samples
else
  for()
    stmt1; // 50 samples

There are two major parts that can benefit from the extra metadata recorded for
each copy.

1. It can demonstrate that cond2 is always false and there is no need to create
that version.
2. It can correctly attribute samples to cloned copy (i.e. 100, 50 respectively
in this case)

I think the most important benefit is #1. For #2, my guess is scaled count is
already good enough.
By what are you scaling the count?

It's random. Or most likely, each branch will be 50% taken.



Here comes the problem: cond1 and cond2 are both generated by compiler. Then how
do we map the metadata back to them? Using uid or special encoding seems fragile
because when upstream optimization changes, compiler may choose to put cond2 in
front of cond1 so the uid will change accordingly. How would we handle cases
like this?
Exactly for this reason, I don't think we can rely on the ordering to
generate the identifiers; they need to have semantic meaning. I agree that this
is tricky in general. However, for the cases I have in mind (unrolling,
vectorization, etc.) these passes don't generate arbitrary conditionals, but
rather, specific conditionals related to runtime overlap checks, trip-count
thresholds, and I think that it should be easy to encode a discriminator value
which can be unambiguously inverted.

I'm still not quite clear how to do this. Needs to think more thoroughly.
But one thing I want to note is that, we want to encode the context info only
when it will be useful for the optimization. For vectorization/unroll, I think
accurate trip count is already good enough. So do we want to record its context?

Thanks,
Dehao


Thanks again,
Hal


Dehao


Thanks again,
Hal

Thanks,
Dehao



On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <hfinkel at
anl.gov<mailto:hfinkel at anl.gov>> wrote:

________________________________
From: "Paul Robinson" <paul.robinson at
sony.com<mailto:paul.robinson at sony.com>>
To: "Dehao Chen" <dehao at google.com<mailto:dehao at
google.com>>, "Hal Finkel" <hfinkel at
anl.gov<mailto:hfinkel at anl.gov>>
Cc: "Xinliang David Li" <davidxl at google.com<mailto:davidxl at
google.com>>, llvm-dev at lists.llvm.org<mailto:llvm-dev at
lists.llvm.org>
Sent: Tuesday, November 1, 2016 2:15:38 PM
Subject: RE: [llvm-dev] (RFC) Encoding code duplication factor in       
discriminator
As illustrated in the above example, it is not like "vectorization has a
distinct bit". All different optimizations make clones of code which will
be labeled by UIDs represented by N (e.g. 8) bits. In this way, the space will
be capped by the number of clones all optimizations have made, instead of # of
optimizations that has applied. And it will be capped at 2^N-1. The cons of
using uid is that you will not know if a clone is coming from vectorization or
unroll or loop distribution.
Okay, but that kind of semantic mapping is important. How should we
encode/recover that information? To be clear, I'm not saying that we need to
implement that up front, but there needs to be a clear path to an
implementation, because I don't want to have two disjoint schemes.

You mean that you want to know which optimization created the clone? How would
you use that info? Looks to me this will expose compiler implementation detail
in debug info.

This is still doable, assume we have 15 interesting optimizations to track, we
can use 4 bits to encode the optimization type that created the clone. But this
becomes nasty if the a clone is created by more than one optimizations. In that
way, discriminator may not be fit for this purpose.

My understanding was that the encoding scheme would allow the profiling analysis
to correctly map execution data back to the original source construct, while
preserving the property that each distinct basic block would have its own
discriminator value.  That is, the execution data would be attributed back to
the original source construct, not whatever each individual optimization had
done to it, and the data for the original source construct would correctly
reflect the execution (e.g. profiling says you got 82 hits on the original loop,
rather than reporting 20 hits on the unrolled-by-4 loop plus 1 each on 2 of the
trailing copies).

It sounds like Hal is thinking that the per-discriminator execution info would
be preserved down to the point where an individual optimization could look at
the profile for each piece, and make decisions on that basis.

I'm not clear how that would be possible, as the optimization would have to
first do the transform (or predict how it would do the transform) in order to
see which individual-discriminator counts mapped to which actual blocks, and
then make some kind of decision about whether to do the transform differently
based on that information.  Then, if the optimization did choose to do the
transform differently, then that leaves the IR in a state where the individual
discriminators *cannot* map back to it.  (Say you unroll by 2 instead of 4; then
you have only 1 trailing copy, not 3, and a discriminator that maps to the
second trailing copy now maps to nothing.  The individual-discriminator data
becomes useless.)

Am I expressing this well enough to show that what Hal is looking for is not
feasible?
Yes, it will need to predict how the transformation would affect the blocks
produced. That does not seem problematic (at least at a coarse level). Yes, if
transformations made earlier in the pipeline make different decisions, then that
will invalidate later fine-grained data (at least potentially). I don't see
how any of this makes this infeasible. We just need a way for the profiling
counts, per descriminator, to remain available, and for the transformations
themselves to know which discriminators (loop ids, or whatever) to consider.

 -Hal
--paulr




--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory






--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev



--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory




--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory




--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161121/58a1c4fb/attachment-0001.html>

David Blaikie via llvm-dev

2016-Nov-21 23:43 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

On Mon, Nov 21, 2016 at 3:33 PM Robinson, Paul <paul.robinson at sony.com>
wrote:

In many cases, the line-table fussing to improve autoFDO/sample-PGO would
also likely help the debugging experience for optimized code, certainly in
cases where line attribution is inherently ambiguous.  In those cases, I
have no problem with Just Doing It.


Agreed, certainly.




that's probably worth inventing a –gprofile or some such.  I suspect
Dehao's more elaborate ideas about discriminators fall into that category
(although IIRC discriminators were invented mainly to benefit profiling in
the first place, so maybe having discriminators at all would be something
to put under a –gprofile kind of option).


Yep

- Dave


--paulr



*From:* David Blaikie [mailto:dblaikie at gmail.com]
*Sent:* Monday, November 21, 2016 3:04 PM
*To:* Dehao Chen; Hal Finkel
*Cc:* llvm-dev; Xinliang David Li; Adrian Prantl; Robinson, Paul; Diego
Novillo


*Subject:* Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator



(+Adrian & Paul since they're mentioned here, +Diego since he might have
opinions as he implemented the discriminator stuff originally)

my 2c (that was mentioned here & that I idly discussed with Diego a few
weeks ago in no great detail) is also that maybe a separate mode for PGO
might be what we want/need, because it's not what sanitizers or simple
backtrace tools need & we're slowly growing the size of debug info for
profiling needs that we don't need elsewhere. (might be worth doing a full
measurement here as we add new features - what's the growth of all hte
profile related features (from (including) discriminators up) and see if
it's still a reasonable thing to keep with the rest of the debug info use
cases or needs a separate flag)

But I'm not too fussed/don't feel terribly strongly, really.

On Fri, Nov 4, 2016 at 3:44 PM Dehao Chen via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

Discussed with Hal, Adrain and Paul offline at the llvm dev meeting today.



* trip count is not enough for vectorization, there is runtime check that
might go false, which can be reflected in profile that we may want to
preserve.

* simply recording these context-profile may cause problems to
iterative-sample-pgo. i.e. when you find a loop's vectorized version no
executed (due to runtime check), you will choose not to vectorize (which is
optimal). But when you collect profile from this binary (optimized with
sample-pgo, not vectorize the loop), as the loop is not vectorized, we do
not have the context to demonstrate "the loop should *not* vectorize"
any
more. So it will end up being vectorized again, introducing perf
instability.



To summarize, more context info may improve performance for one iteration,
but the perf improvement may not be stable across iterations. If we aim at
performance stability (which is one of the major goals of this RFC),
profile should only reflect the attribute of source, not compiler
transformations.



But there are sample pgo users who does *not* care about iterative sample
pgo performance, for them, as Hal suggested, we should invent a more
extensible way to preserve profile context. Apparently discriminator is not
an extensible choice. So how about we just use discriminator to store the
attribute of the source (i.e trip count), and later design new extensible
ways in dwarf to represent more context info?



Adrian also suggested that we may need to consider have a flag or a
separate debugging mode to indicate if we want to emit discriminator to
prevent debug_line size increase.



Dehao



On Tue, Nov 1, 2016 at 9:34 PM, Dehao Chen <dehao at google.com> wrote:





On Tue, Nov 1, 2016 at 7:35 PM, Hal Finkel <hfinkel at anl.gov> wrote:


------------------------------

*From: *"Dehao Chen" <dehao at google.com>
*To: *"Hal Finkel" <hfinkel at anl.gov>
*Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang
David Li" <
davidxl at google.com>
*Sent: *Tuesday, November 1, 2016 8:24:30 PM


*Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator





On Tue, Nov 1, 2016 at 5:56 PM, Hal Finkel <hfinkel at anl.gov> wrote:


------------------------------

*From: *"Dehao Chen" <dehao at google.com>
*To: *"Hal Finkel" <hfinkel at anl.gov>
*Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang
David Li" <
davidxl at google.com>
*Sent: *Tuesday, November 1, 2016 6:41:29 PM


*Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator





On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel <hfinkel at anl.gov> wrote:


------------------------------

*From: *"Hal Finkel via llvm-dev" <llvm-dev at lists.llvm.org>
*To: *"Dehao Chen" <dehao at google.com>
*Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang
David Li" <
davidxl at google.com>
*Sent: *Tuesday, November 1, 2016 4:26:17 PM
*Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator


------------------------------

*From: *"Dehao Chen" <dehao at google.com>
*To: *"Hal Finkel" <hfinkel at anl.gov>
*Cc: *"Paul Robinson" <paul.robinson at sony.com>,
"Xinliang David Li" <
davidxl at google.com>, "llvm-dev" <llvm-dev at
lists.llvm.org>
*Sent: *Tuesday, November 1, 2016 4:14:43 PM
*Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
discriminator

damn... my english is not readable at all when I try to write fast...
trying to make some clarification below, hopefully can make it more
readable...



On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen <dehao at google.com> wrote:

Oops... pressed the wrong button and sent out early...



On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen <dehao at google.com> wrote:

If Hal's proposal is for SamplePGO purpose, let me clarify some design
principles of SamplePGO.



The profile for sample pgo uses source location as the key to map the
execution count back to IR. This design is based on the principle that we
do not want the profile to be tightly couple with compiler IR. Instead,
profile is simple an attribute of the source code. We have been benefited
 a lot from this design that the profile can easily be reused across
different source versions and compiler versions, or even compilers.



That being said, the design to encode more info into discriminator does not
mean that we will change the profile. The encoded info in discriminator
will be handled by the create_llvm_prof tool, which combines counts from
different clones of the same source code and generate the combined profile
data. The output profile will not have any cloning/dupliaction bits at all.
So for the initial example profile I provided, the output profile will be:



#1: 10

#3: 80



Not:



#1: 10

#3.0x400: 70

#3.0x10400: 5

#3.0x20400: 3

#3.0x30400: 2



Also, how does this work for vectorization? For vectorization, you're going
to multiply by the duplication factor first? The issue obviously is that
each vector instruction counts for VF times as many scalar instructions,
and so to get equivalent scalar counts, you need to multiply by VF. I had
assumed this is what you were saying when I read the initial e-mail, but if
you're also using the same duplication scheme for unrolling, then I think
we need some way to differentiate.



In my original proposal, I only record VF*UF if the clone is both unrolled
and vectorized. I did not distinguish between unroll and vectorization
because I the unroll/vectorize decision only depends on the trip count,
which can be derived from the duplication factor. If more profile info can
be used for better unroll/vectorize decision making, we can definitely add
it.

I'm still missing something here. In your proposed system, does the tool
collecting the profiling information *interpret* the duplication factor
encoded in the descriminators at all?



Yes it does.

Okay. I'm still missing something. Can you please write out an example
using your notation below for a vectorized loop?



Sure. Original code:



for (int i = 0; i < N; i++)

  a[i] = b[i];



Transformed code:



for (int i = 0; i < N / 4; i++)

  vectorized_assign_4(&a[i * 4], &b[i * 4]); // discriminator 0x400,
sample
count 20

for (int i = N & ~3; i < N; i++)

  a[i] = b[i];       // discriminator 0x10000, sample count 10



The derived sample count for original a[i] = b[i] is: 20 * 4 + 10 = 90







If it does, it seems like this should be specific to vectorization. For
unrolling, you still have the same number of instructions, and so you just
need to make the different instructions from the different loop iterations
carry different discriminator values (so they they'll sum together, instead
of using the maximum, as you explained in your original proposal).



Let's take a look at the original loop unroll example. My proposed
discriminator assignment is:

for (i = 0; i < N & 3; i+= 4) {

  foo();  // discriminator: 0x400

  foo();  // discriminator: 0x400

  foo();  // discriminator: 0x400

  foo();  // discriminator: 0x400

}

if (i++ < N) {

  foo();   // discriminator: 0x10000

  if (i++ < N) {

    foo(); // discriminator: 0x20000

    if (i++ < N) {

      foo();  // discriminator: 0x30000

    }

  }

}



IIRC, your proposed discriminator assignment is:

To be clear, I don't think that I've proposed any assignment scheme at
all.
Regardless, I was asking this question purely in the context of your
proposal.



Oh I used this example to show my understanding of your reply "For
unrolling, you still have the same number of instructions, and so you just
need to make the different instructions from the different loop iterations
carry different discriminator values (so they they'll sum together, instead
of using the maximum, as you explained in your original proposal)." Please
let me know if my understanding is incorrect.



Thanks,

Dehao




Thanks again,
Hal





for (i = 0; i < N & 3; i+= 4) {

  foo();  // discriminator: 0x10000

  foo();  // discriminator: 0x20000

  foo();  // discriminator: 0x30000

  foo();  // discriminator: 0x40000

}

if (i++ < N) {

  foo();   // discriminator: 0x50000

  if (i++ < N) {

    foo(); // discriminator: 0x60000

    if (i++ < N) {

      foo();  // discriminator: 0x70000

    }

  }

}



I think my method can save discriminator space because all clones share the
same discriminator. Consider that if a loop is unrolled 100 times, then
with your algorithm, we may end up having 100 different discriminators,
which seem not good for debug info size.







 -Hal



The goal of the proposed change, is to make profile more accurately
represent the attribute of the source.

The non-goal of the proposed change, is to provide more context in the
profile to present the behavior of program in the context of different
context.



The non-goal of the proposed change, is to provide more context in the
profile to present the behavior of program in different contexts.'

Okay, but why? The information will be present in the profiles, why not
encode it in the metadata and let the optimizer decide when to sum it up?
We should even provide some utility functions that make this transparent
for passes that don't care about the distinction.

I see, you are proposing encoding these profile in the metadata. I agree
that this can work, but from implementation's perspective, it could could
lead to great complexity.



Sample PGO pass first read in the source->count map, use it to get
basic_block->count map, and then use heuristics to propagate and get
branch->probability count. So we don't have Metadata to store the raw
count, but only the branch probability is stored after sample pgo pass.

I think that, as you indicate below, we'd need to change this to {source,
disc}->count map --> {BB, disc}->count map --> {branch,
disc}->prob map.
The idea being that the total probability for the branch is the sum of the
probabilities associated with it for each discriminator value. Most users
will just sum them up, but some passes will want the breakdowns.



Agree. Note that this would add some complexity to BFI/BPI







If we are pursuing more context-sensitive profile, that would be a very
different design. In this design, we will need to read profile multiple
times, and have the profile tightly coupled with the compiler IR/pass
manager. That is doable, but I don't think that probably better suits
instrumentation based PGO's domain. Comments?



If we are pursuing more context-sensitive profile, that would be a very
different design in which we need to read in profiles multiple times, and
have the profile tightly coupled with the compiler IR/pass manager. That is
doable, but that probably better suits instrumentation based PGO's domain.
Comments?

I don't understand why you say we'd need to read the profile multiple
times. Can you please explain? I also don't think it needs to be that
tightly coupled; we just need each pass that generates multiple copies of
things to have some way to generate a unique id for what it's doing.
It's
all per source location, so I don't even think we need any kind of
complicated hashing scheme.

As explained above, we only have branch probability. Yes, we can record
more instruction/basicblock count context info in additional metadata. But
when you use it in optimizer, I suppose you will need to convert it to
branch probability too?



Let me take an example to demonstrate my understanding of your use of
context-sensitive profile.



original code:

for ()

  stmt1;



optimized code:

if (cond1)

  for()

    stmt1;  //100 samples

else if (cond2)

  for()

    stmt1;  // 0 samples

else

  for()

    stmt1; // 50 samples



The loop was multi-versioned by cond1 and cond2. With my proposal, the
profile for stmt1 will be combined as 150 samples. If we use this profile
to optimize the orginal source, after annotation, the code becomes:



for()

  stmt1 // 150 samples



Later when it comes to the mulit-version optimization, it will transform
the code to:



if (cond1)

  for()

    stmt1;  //50 samples

else if (cond2)

  for()

    stmt1;  // 50 samples

else

  for()

    stmt1; // 50 samples



The 150 samples are evenly distributed to 3 clones as we do not have
context info.



With your proposal, after annotation, the original code becomes:



for()

  stmt1 // 150 samples (M1:100, M2:0, M3:50)



Then during mutli-version optimization, the optimizer reads the metadata
and find M2 is never taken, so it will transform the code to:



if (cond1)

  for()

    stmt1;  // 100 samples

else

  for()

    stmt1; // 50 samples



There are two major parts that can benefit from the extra metadata recorded
for each copy.



1. It can demonstrate that cond2 is always false and there is no need to
create that version.

2. It can correctly attribute samples to cloned copy (i.e. 100, 50
respectively in this case)



I think the most important benefit is #1. For #2, my guess is scaled count
is already good enough.

By what are you scaling the count?



It's random. Or most likely, each branch will be 50% taken.







Here comes the problem: cond1 and cond2 are both generated by compiler.
Then how do we map the metadata back to them? Using uid or special encoding
seems fragile because when upstream optimization changes, compiler may
choose to put cond2 in front of cond1 so the uid will change accordingly.
How would we handle cases like this?

Exactly for this reason, I don't think we can rely on the ordering to
generate the identifiers; they need to have semantic meaning. I agree that
this is tricky in general. However, for the cases I have in mind
(unrolling, vectorization, etc.) these passes don't generate arbitrary
conditionals, but rather, specific conditionals related to runtime overlap
checks, trip-count thresholds, and I think that it should be easy to encode
a discriminator value which can be unambiguously inverted.



I'm still not quite clear how to do this. Needs to think more thoroughly.
But one thing I want to note is that, we want to encode the context info
only when it will be useful for the optimization. For vectorization/unroll,
I think accurate trip count is already good enough. So do we want to record
its context?



Thanks,

Dehao




Thanks again,
Hal





Dehao




Thanks again,
Hal



Thanks,

Dehao







On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:


------------------------------

*From: *"Paul Robinson" <paul.robinson at sony.com>
*To: *"Dehao Chen" <dehao at google.com>, "Hal Finkel"
<hfinkel at anl.gov>
*Cc: *"Xinliang David Li" <davidxl at google.com>, llvm-dev at
lists.llvm.org
*Sent: *Tuesday, November 1, 2016 2:15:38 PM
*Subject: *RE: [llvm-dev] (RFC) Encoding code duplication factor
in        discriminator

As illustrated in the above example, it is not like "vectorization has a
distinct bit". All different optimizations make clones of code which will
be labeled by UIDs represented by N (e.g. 8) bits. In this way, the space
will be capped by the number of clones all optimizations have made, instead
of # of optimizations that has applied. And it will be capped at 2^N-1. The
cons of using uid is that you will not know if a clone is coming from
vectorization or unroll or loop distribution.

Okay, but that kind of semantic mapping is important. How should we
encode/recover that information? To be clear, I'm not saying that we need
to implement that up front, but there needs to be a clear path to an
implementation, because I don't want to have two disjoint schemes.



You mean that you want to know which optimization created the clone? How
would you use that info? Looks to me this will expose compiler
implementation detail in debug info.



This is still doable, assume we have 15 interesting optimizations to track,
we can use 4 bits to encode the optimization type that created the clone.
But this becomes nasty if the a clone is created by more than one
optimizations. In that way, discriminator may not be fit for this purpose.



My understanding was that the encoding scheme would allow the profiling
analysis to correctly map execution data back to the original source
construct, while preserving the property that each distinct basic block
would have its own discriminator value.  That is, the execution data would
be attributed back to the original source construct, not whatever each
individual optimization had done to it, and the data for the original
source construct would correctly reflect the execution (e.g. profiling says
you got 82 hits on the original loop, rather than reporting 20 hits on the
unrolled-by-4 loop plus 1 each on 2 of the trailing copies).



It sounds like Hal is thinking that the per-discriminator execution info
would be preserved down to the point where an individual optimization could
look at the profile for each piece, and make decisions on that basis.



I'm not clear how that would be possible, as the optimization would have to
first do the transform (or predict how it would do the transform) in order
to see which individual-discriminator counts mapped to which actual blocks,
and then make some kind of decision about whether to do the transform
differently based on that information.  Then, if the optimization did
choose to do the transform differently, then that leaves the IR in a state
where the individual discriminators *cannot* map back to it.  (Say you
unroll by 2 instead of 4; then you have only 1 trailing copy, not 3, and a
discriminator that maps to the second trailing copy now maps to nothing.
The individual-discriminator data becomes useless.)



Am I expressing this well enough to show that what Hal is looking for is
not feasible?

Yes, it will need to predict how the transformation would affect the blocks
produced. That does not seem problematic (at least at a coarse level). Yes,
if transformations made earlier in the pipeline make different decisions,
then that will invalidate later fine-grained data (at least potentially). I
don't see how any of this makes this infeasible. We just need a way for the
profiling counts, per descriminator, to remain available, and for the
transformations themselves to know which discriminators (loop ids, or
whatever) to consider.

 -Hal

--paulr






-- 

Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory










-- 

Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory



_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161121/0564dbea/attachment-0001.html>

Dehao Chen via llvm-dev

2016-Nov-22 23:16 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

-gprofile should have:

1. emit linkage name in all subprograms
2. emit discriminator with dup-factor copy-factor encoded
3. use-unknown-locations=true

Here comes the problem: is it an intermediate -g mode between -g1 and -g2?
If yes, this means that -g2 will also need to include all the 3 items
above. If not, then it should not be called -gprofile, but rather a flag
like -femit-debug-for-profiling", so that we can use it with "-g2
-femit-debug-for-profiling", or "-gmlt
-femit-debug-for-profiling".

Dehao

On Mon, Nov 21, 2016 at 3:43 PM, David Blaikie <dblaikie at gmail.com>
wrote:
> On Mon, Nov 21, 2016 at 3:33 PM Robinson, Paul <paul.robinson at
sony.com>
> wrote:
>
> In many cases, the line-table fussing to improve autoFDO/sample-PGO would
> also likely help the debugging experience for optimized code, certainly in
> cases where line attribution is inherently ambiguous.  In those cases, I
> have no problem with Just Doing It.
>
>
> Agreed, certainly.
>
>
>
>
> that's probably worth inventing a –gprofile or some such.  I suspect
> Dehao's more elaborate ideas about discriminators fall into that
category
> (although IIRC discriminators were invented mainly to benefit profiling in
> the first place, so maybe having discriminators at all would be something
> to put under a –gprofile kind of option).
>
>
> Yep
>
> - Dave
>
>
> --paulr
>
>
>
> *From:* David Blaikie [mailto:dblaikie at gmail.com]
> *Sent:* Monday, November 21, 2016 3:04 PM
> *To:* Dehao Chen; Hal Finkel
> *Cc:* llvm-dev; Xinliang David Li; Adrian Prantl; Robinson, Paul; Diego
> Novillo
>
>
> *Subject:* Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
>
> (+Adrian & Paul since they're mentioned here, +Diego since he might
have
> opinions as he implemented the discriminator stuff originally)
>
> my 2c (that was mentioned here & that I idly discussed with Diego a few
> weeks ago in no great detail) is also that maybe a separate mode for PGO
> might be what we want/need, because it's not what sanitizers or simple
> backtrace tools need & we're slowly growing the size of debug info
for
> profiling needs that we don't need elsewhere. (might be worth doing a
full
> measurement here as we add new features - what's the growth of all hte
> profile related features (from (including) discriminators up) and see if
> it's still a reasonable thing to keep with the rest of the debug info
use
> cases or needs a separate flag)
>
> But I'm not too fussed/don't feel terribly strongly, really.
>
> On Fri, Nov 4, 2016 at 3:44 PM Dehao Chen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Discussed with Hal, Adrain and Paul offline at the llvm dev meeting today.
>
>
>
> * trip count is not enough for vectorization, there is runtime check that
> might go false, which can be reflected in profile that we may want to
> preserve.
>
> * simply recording these context-profile may cause problems to
> iterative-sample-pgo. i.e. when you find a loop's vectorized version no
> executed (due to runtime check), you will choose not to vectorize (which is
> optimal). But when you collect profile from this binary (optimized with
> sample-pgo, not vectorize the loop), as the loop is not vectorized, we do
> not have the context to demonstrate "the loop should *not*
vectorize" any
> more. So it will end up being vectorized again, introducing perf
> instability.
>
>
>
> To summarize, more context info may improve performance for one iteration,
> but the perf improvement may not be stable across iterations. If we aim at
> performance stability (which is one of the major goals of this RFC),
> profile should only reflect the attribute of source, not compiler
> transformations.
>
>
>
> But there are sample pgo users who does *not* care about iterative sample
> pgo performance, for them, as Hal suggested, we should invent a more
> extensible way to preserve profile context. Apparently discriminator is not
> an extensible choice. So how about we just use discriminator to store the
> attribute of the source (i.e trip count), and later design new extensible
> ways in dwarf to represent more context info?
>
>
>
> Adrian also suggested that we may need to consider have a flag or a
> separate debugging mode to indicate if we want to emit discriminator to
> prevent debug_line size increase.
>
>
>
> Dehao
>
>
>
> On Tue, Nov 1, 2016 at 9:34 PM, Dehao Chen <dehao at google.com>
wrote:
>
>
>
>
>
> On Tue, Nov 1, 2016 at 7:35 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>,
"Xinliang David Li" <
> davidxl at google.com>
> *Sent: *Tuesday, November 1, 2016 8:24:30 PM
>
>
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
>
>
>
> On Tue, Nov 1, 2016 at 5:56 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>,
"Xinliang David Li" <
> davidxl at google.com>
> *Sent: *Tuesday, November 1, 2016 6:41:29 PM
>
>
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
>
>
>
> On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>
> ------------------------------
>
> *From: *"Hal Finkel via llvm-dev" <llvm-dev at
lists.llvm.org>
> *To: *"Dehao Chen" <dehao at google.com>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>,
"Xinliang David Li" <
> davidxl at google.com>
> *Sent: *Tuesday, November 1, 2016 4:26:17 PM
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"Paul Robinson" <paul.robinson at sony.com>,
"Xinliang David Li" <
> davidxl at google.com>, "llvm-dev" <llvm-dev at
lists.llvm.org>
> *Sent: *Tuesday, November 1, 2016 4:14:43 PM
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
> damn... my english is not readable at all when I try to write fast...
> trying to make some clarification below, hopefully can make it more
> readable...
>
>
>
> On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen <dehao at google.com>
wrote:
>
> Oops... pressed the wrong button and sent out early...
>
>
>
> On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen <dehao at google.com>
wrote:
>
> If Hal's proposal is for SamplePGO purpose, let me clarify some design
> principles of SamplePGO.
>
>
>
> The profile for sample pgo uses source location as the key to map the
> execution count back to IR. This design is based on the principle that we
> do not want the profile to be tightly couple with compiler IR. Instead,
> profile is simple an attribute of the source code. We have been benefited
>  a lot from this design that the profile can easily be reused across
> different source versions and compiler versions, or even compilers.
>
>
>
> That being said, the design to encode more info into discriminator does
> not mean that we will change the profile. The encoded info in discriminator
> will be handled by the create_llvm_prof tool, which combines counts from
> different clones of the same source code and generate the combined profile
> data. The output profile will not have any cloning/dupliaction bits at all.
> So for the initial example profile I provided, the output profile will be:
>
>
>
> #1: 10
>
> #3: 80
>
>
>
> Not:
>
>
>
> #1: 10
>
> #3.0x400: 70
>
> #3.0x10400: 5
>
> #3.0x20400: 3
>
> #3.0x30400: 2
>
>
>
> Also, how does this work for vectorization? For vectorization, you're
> going to multiply by the duplication factor first? The issue obviously is
> that each vector instruction counts for VF times as many scalar
> instructions, and so to get equivalent scalar counts, you need to multiply
> by VF. I had assumed this is what you were saying when I read the initial
> e-mail, but if you're also using the same duplication scheme for
unrolling,
> then I think we need some way to differentiate.
>
>
>
> In my original proposal, I only record VF*UF if the clone is both unrolled
> and vectorized. I did not distinguish between unroll and vectorization
> because I the unroll/vectorize decision only depends on the trip count,
> which can be derived from the duplication factor. If more profile info can
> be used for better unroll/vectorize decision making, we can definitely add
> it.
>
> I'm still missing something here. In your proposed system, does the
tool
> collecting the profiling information *interpret* the duplication factor
> encoded in the descriminators at all?
>
>
>
> Yes it does.
>
> Okay. I'm still missing something. Can you please write out an example
> using your notation below for a vectorized loop?
>
>
>
> Sure. Original code:
>
>
>
> for (int i = 0; i < N; i++)
>
>   a[i] = b[i];
>
>
>
> Transformed code:
>
>
>
> for (int i = 0; i < N / 4; i++)
>
>   vectorized_assign_4(&a[i * 4], &b[i * 4]); // discriminator
0x400,
> sample count 20
>
> for (int i = N & ~3; i < N; i++)
>
>   a[i] = b[i];       // discriminator 0x10000, sample count 10
>
>
>
> The derived sample count for original a[i] = b[i] is: 20 * 4 + 10 = 90
>
>
>
>
>
>
>
> If it does, it seems like this should be specific to vectorization. For
> unrolling, you still have the same number of instructions, and so you just
> need to make the different instructions from the different loop iterations
> carry different discriminator values (so they they'll sum together,
instead
> of using the maximum, as you explained in your original proposal).
>
>
>
> Let's take a look at the original loop unroll example. My proposed
> discriminator assignment is:
>
> for (i = 0; i < N & 3; i+= 4) {
>
>   foo();  // discriminator: 0x400
>
>   foo();  // discriminator: 0x400
>
>   foo();  // discriminator: 0x400
>
>   foo();  // discriminator: 0x400
>
> }
>
> if (i++ < N) {
>
>   foo();   // discriminator: 0x10000
>
>   if (i++ < N) {
>
>     foo(); // discriminator: 0x20000
>
>     if (i++ < N) {
>
>       foo();  // discriminator: 0x30000
>
>     }
>
>   }
>
> }
>
>
>
> IIRC, your proposed discriminator assignment is:
>
> To be clear, I don't think that I've proposed any assignment scheme
at
> all. Regardless, I was asking this question purely in the context of your
> proposal.
>
>
>
> Oh I used this example to show my understanding of your reply "For
> unrolling, you still have the same number of instructions, and so you just
> need to make the different instructions from the different loop iterations
> carry different discriminator values (so they they'll sum together,
instead
> of using the maximum, as you explained in your original proposal)."
Please
> let me know if my understanding is incorrect.
>
>
>
> Thanks,
>
> Dehao
>
>
>
>
> Thanks again,
> Hal
>
>
>
>
>
> for (i = 0; i < N & 3; i+= 4) {
>
>   foo();  // discriminator: 0x10000
>
>   foo();  // discriminator: 0x20000
>
>   foo();  // discriminator: 0x30000
>
>   foo();  // discriminator: 0x40000
>
> }
>
> if (i++ < N) {
>
>   foo();   // discriminator: 0x50000
>
>   if (i++ < N) {
>
>     foo(); // discriminator: 0x60000
>
>     if (i++ < N) {
>
>       foo();  // discriminator: 0x70000
>
>     }
>
>   }
>
> }
>
>
>
> I think my method can save discriminator space because all clones share
> the same discriminator. Consider that if a loop is unrolled 100 times, then
> with your algorithm, we may end up having 100 different discriminators,
> which seem not good for debug info size.
>
>
>
>
>
>
>
>  -Hal
>
>
>
> The goal of the proposed change, is to make profile more accurately
> represent the attribute of the source.
>
> The non-goal of the proposed change, is to provide more context in the
> profile to present the behavior of program in the context of different
> context.
>
>
>
> The non-goal of the proposed change, is to provide more context in the
> profile to present the behavior of program in different contexts.'
>
> Okay, but why? The information will be present in the profiles, why not
> encode it in the metadata and let the optimizer decide when to sum it up?
> We should even provide some utility functions that make this transparent
> for passes that don't care about the distinction.
>
> I see, you are proposing encoding these profile in the metadata. I agree
> that this can work, but from implementation's perspective, it could
could
> lead to great complexity.
>
>
>
> Sample PGO pass first read in the source->count map, use it to get
> basic_block->count map, and then use heuristics to propagate and get
> branch->probability count. So we don't have Metadata to store the
raw
> count, but only the branch probability is stored after sample pgo pass.
>
> I think that, as you indicate below, we'd need to change this to
{source,
> disc}->count map --> {BB, disc}->count map --> {branch,
disc}->prob map.
> The idea being that the total probability for the branch is the sum of the
> probabilities associated with it for each discriminator value. Most users
> will just sum them up, but some passes will want the breakdowns.
>
>
>
> Agree. Note that this would add some complexity to BFI/BPI
>
>
>
>
>
>
>
> If we are pursuing more context-sensitive profile, that would be a very
> different design. In this design, we will need to read profile multiple
> times, and have the profile tightly coupled with the compiler IR/pass
> manager. That is doable, but I don't think that probably better suits
> instrumentation based PGO's domain. Comments?
>
>
>
> If we are pursuing more context-sensitive profile, that would be a very
> different design in which we need to read in profiles multiple times, and
> have the profile tightly coupled with the compiler IR/pass manager. That is
> doable, but that probably better suits instrumentation based PGO's
domain.
> Comments?
>
> I don't understand why you say we'd need to read the profile
multiple
> times. Can you please explain? I also don't think it needs to be that
> tightly coupled; we just need each pass that generates multiple copies of
> things to have some way to generate a unique id for what it's doing.
It's
> all per source location, so I don't even think we need any kind of
> complicated hashing scheme.
>
> As explained above, we only have branch probability. Yes, we can record
> more instruction/basicblock count context info in additional metadata. But
> when you use it in optimizer, I suppose you will need to convert it to
> branch probability too?
>
>
>
> Let me take an example to demonstrate my understanding of your use of
> context-sensitive profile.
>
>
>
> original code:
>
> for ()
>
>   stmt1;
>
>
>
> optimized code:
>
> if (cond1)
>
>   for()
>
>     stmt1;  //100 samples
>
> else if (cond2)
>
>   for()
>
>     stmt1;  // 0 samples
>
> else
>
>   for()
>
>     stmt1; // 50 samples
>
>
>
> The loop was multi-versioned by cond1 and cond2. With my proposal, the
> profile for stmt1 will be combined as 150 samples. If we use this profile
> to optimize the orginal source, after annotation, the code becomes:
>
>
>
> for()
>
>   stmt1 // 150 samples
>
>
>
> Later when it comes to the mulit-version optimization, it will transform
> the code to:
>
>
>
> if (cond1)
>
>   for()
>
>     stmt1;  //50 samples
>
> else if (cond2)
>
>   for()
>
>     stmt1;  // 50 samples
>
> else
>
>   for()
>
>     stmt1; // 50 samples
>
>
>
> The 150 samples are evenly distributed to 3 clones as we do not have
> context info.
>
>
>
> With your proposal, after annotation, the original code becomes:
>
>
>
> for()
>
>   stmt1 // 150 samples (M1:100, M2:0, M3:50)
>
>
>
> Then during mutli-version optimization, the optimizer reads the metadata
> and find M2 is never taken, so it will transform the code to:
>
>
>
> if (cond1)
>
>   for()
>
>     stmt1;  // 100 samples
>
> else
>
>   for()
>
>     stmt1; // 50 samples
>
>
>
> There are two major parts that can benefit from the extra metadata
> recorded for each copy.
>
>
>
> 1. It can demonstrate that cond2 is always false and there is no need to
> create that version.
>
> 2. It can correctly attribute samples to cloned copy (i.e. 100, 50
> respectively in this case)
>
>
>
> I think the most important benefit is #1. For #2, my guess is scaled count
> is already good enough.
>
> By what are you scaling the count?
>
>
>
> It's random. Or most likely, each branch will be 50% taken.
>
>
>
>
>
>
>
> Here comes the problem: cond1 and cond2 are both generated by compiler.
> Then how do we map the metadata back to them? Using uid or special encoding
> seems fragile because when upstream optimization changes, compiler may
> choose to put cond2 in front of cond1 so the uid will change accordingly.
> How would we handle cases like this?
>
> Exactly for this reason, I don't think we can rely on the ordering to
> generate the identifiers; they need to have semantic meaning. I agree that
> this is tricky in general. However, for the cases I have in mind
> (unrolling, vectorization, etc.) these passes don't generate arbitrary
> conditionals, but rather, specific conditionals related to runtime overlap
> checks, trip-count thresholds, and I think that it should be easy to encode
> a discriminator value which can be unambiguously inverted.
>
>
>
> I'm still not quite clear how to do this. Needs to think more
thoroughly.
> But one thing I want to note is that, we want to encode the context info
> only when it will be useful for the optimization. For vectorization/unroll,
> I think accurate trip count is already good enough. So do we want to record
> its context?
>
>
>
> Thanks,
>
> Dehao
>
>
>
>
> Thanks again,
> Hal
>
>
>
>
>
> Dehao
>
>
>
>
> Thanks again,
> Hal
>
>
>
> Thanks,
>
> Dehao
>
>
>
>
>
>
>
> On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>
> ------------------------------
>
> *From: *"Paul Robinson" <paul.robinson at sony.com>
> *To: *"Dehao Chen" <dehao at google.com>, "Hal
Finkel" <hfinkel at anl.gov>
> *Cc: *"Xinliang David Li" <davidxl at google.com>, llvm-dev
at lists.llvm.org
> *Sent: *Tuesday, November 1, 2016 2:15:38 PM
> *Subject: *RE: [llvm-dev] (RFC) Encoding code duplication factor
> in        discriminator
>
> As illustrated in the above example, it is not like "vectorization has
a
> distinct bit". All different optimizations make clones of code which
will
> be labeled by UIDs represented by N (e.g. 8) bits. In this way, the space
> will be capped by the number of clones all optimizations have made, instead
> of # of optimizations that has applied. And it will be capped at 2^N-1. The
> cons of using uid is that you will not know if a clone is coming from
> vectorization or unroll or loop distribution.
>
> Okay, but that kind of semantic mapping is important. How should we
> encode/recover that information? To be clear, I'm not saying that we
need
> to implement that up front, but there needs to be a clear path to an
> implementation, because I don't want to have two disjoint schemes.
>
>
>
> You mean that you want to know which optimization created the clone? How
> would you use that info? Looks to me this will expose compiler
> implementation detail in debug info.
>
>
>
> This is still doable, assume we have 15 interesting optimizations to
> track, we can use 4 bits to encode the optimization type that created the
> clone. But this becomes nasty if the a clone is created by more than one
> optimizations. In that way, discriminator may not be fit for this purpose.
>
>
>
> My understanding was that the encoding scheme would allow the profiling
> analysis to correctly map execution data back to the original source
> construct, while preserving the property that each distinct basic block
> would have its own discriminator value.  That is, the execution data would
> be attributed back to the original source construct, not whatever each
> individual optimization had done to it, and the data for the original
> source construct would correctly reflect the execution (e.g. profiling says
> you got 82 hits on the original loop, rather than reporting 20 hits on the
> unrolled-by-4 loop plus 1 each on 2 of the trailing copies).
>
>
>
> It sounds like Hal is thinking that the per-discriminator execution info
> would be preserved down to the point where an individual optimization could
> look at the profile for each piece, and make decisions on that basis.
>
>
>
> I'm not clear how that would be possible, as the optimization would
have
> to first do the transform (or predict how it would do the transform) in
> order to see which individual-discriminator counts mapped to which actual
> blocks, and then make some kind of decision about whether to do the
> transform differently based on that information.  Then, if the optimization
> did choose to do the transform differently, then that leaves the IR in a
> state where the individual discriminators *cannot* map back to it.  (Say
> you unroll by 2 instead of 4; then you have only 1 trailing copy, not 3,
> and a discriminator that maps to the second trailing copy now maps to
> nothing.  The individual-discriminator data becomes useless.)
>
>
>
> Am I expressing this well enough to show that what Hal is looking for is
> not feasible?
>
> Yes, it will need to predict how the transformation would affect the
> blocks produced. That does not seem problematic (at least at a coarse
> level). Yes, if transformations made earlier in the pipeline make different
> decisions, then that will invalidate later fine-grained data (at least
> potentially). I don't see how any of this makes this infeasible. We
just
> need a way for the profiling counts, per descriminator, to remain
> available, and for the transformations themselves to know which
> discriminators (loop ids, or whatever) to consider.
>
>  -Hal
>
> --paulr
>
>
>
>
>
>
> --
>
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>
>
>
>
>
>
>
> --
>
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161122/4f690c16/attachment-0001.html>

David Blaikie via llvm-dev

2016-Nov-23 21:21 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

yep, +1, seems like a -f flag to me. (you'd want to be able to combine it
with any other level of debugging - no need to have to choose between
profiling or debugging)

On Tue, Nov 22, 2016 at 3:46 PM Greg Bedwell <gregbedwell at gmail.com>
wrote:
> For our users we fully expect their profile/debug builds to be the same
> thing (that is, almost all debugging is on fully optimized code, where
> fully optimized also means part of an iterative profiling workflow).
> Having extra debug information that only benefits the profiling part seems
> reasonable.  What would be unacceptable is if we ever had to ask our users
> to make a trade off between debug info that was best for profiling at the
> expense of debugging quality or vice versa.  The
"-femit-debug-for-profiling"
> flag seems to fit this model of it being more a strict superset rather than
> a tuning option more clearly.
>
>
> -Greg
>
> On 22 November 2016 at 23:16, Dehao Chen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> -gprofile should have:
>
> 1. emit linkage name in all subprograms
> 2. emit discriminator with dup-factor copy-factor encoded
> 3. use-unknown-locations=true
>
> Here comes the problem: is it an intermediate -g mode between -g1 and -g2?
> If yes, this means that -g2 will also need to include all the 3 items
> above. If not, then it should not be called -gprofile, but rather a flag
> like -femit-debug-for-profiling", so that we can use it with "-g2
> -femit-debug-for-profiling", or "-gmlt
-femit-debug-for-profiling".
>
>Yep, sounds about right (should have a -fno- option too, should it imply -g
of some kind if you specify it alone? I'd prefer to avoid implication
(implies flags get weird/hard to handle imho) if possible. Just have a
reasonable recommendation of -gmlt -fprof-debug or whatever we call it (I'm
bad with names))

- Dave
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161123/a5d9c324/attachment.html>

Eric Christopher via llvm-dev

2016-Nov-23 21:39 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

(Agreed for the record).

-eric

On Wed, Nov 23, 2016 at 1:21 PM David Blaikie via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> yep, +1, seems like a -f flag to me. (you'd want to be able to combine
it
> with any other level of debugging - no need to have to choose between
> profiling or debugging)
>
> On Tue, Nov 22, 2016 at 3:46 PM Greg Bedwell <gregbedwell at
gmail.com>
> wrote:
>
> For our users we fully expect their profile/debug builds to be the same
> thing (that is, almost all debugging is on fully optimized code, where
> fully optimized also means part of an iterative profiling workflow).
> Having extra debug information that only benefits the profiling part seems
> reasonable.  What would be unacceptable is if we ever had to ask our users
> to make a trade off between debug info that was best for profiling at the
> expense of debugging quality or vice versa.  The
"-femit-debug-for-profiling"
> flag seems to fit this model of it being more a strict superset rather than
> a tuning option more clearly.
>
>
> -Greg
>
>
> On 22 November 2016 at 23:16, Dehao Chen via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> -gprofile should have:
>
> 1. emit linkage name in all subprograms
> 2. emit discriminator with dup-factor copy-factor encoded
> 3. use-unknown-locations=true
>
> Here comes the problem: is it an intermediate -g mode between -g1 and -g2?
> If yes, this means that -g2 will also need to include all the 3 items
> above. If not, then it should not be called -gprofile, but rather a flag
> like -femit-debug-for-profiling", so that we can use it with "-g2
> -femit-debug-for-profiling", or "-gmlt
-femit-debug-for-profiling".
>
>
> Yep, sounds about right (should have a -fno- option too, should it imply
> -g of some kind if you specify it alone? I'd prefer to avoid
implication
> (implies flags get weird/hard to handle imho) if possible. Just have a
> reasonable recommendation of -gmlt -fprof-debug or whatever we call it
(I'm
> bad with names))
>
> - Dave
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161123/c0adfe6b/attachment.html>

llvm dev - Nov 2016 - (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator