thr3ads.net - llvm dev - [llvm-dev] (RFC) Encoding code duplication factor in discriminator [Nov 2016]

If this information is useful, please help other people find it:
Share via:

Hal Finkel via llvm-dev

2016-Nov-02 00:56 UTC

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

----- Original Message -----
> From: "Dehao Chen" <dehao at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang
David Li"
> <davidxl at google.com>
> Sent: Tuesday, November 1, 2016 6:41:29 PM
> Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
> On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel < hfinkel at anl.gov >
wrote:
> > > From: "Hal Finkel via llvm-dev" < llvm-dev at
lists.llvm.org >
> > 
> 
> > > To: "Dehao Chen" < dehao at google.com >
> > 
> 
> > > Cc: "llvm-dev" < llvm-dev at lists.llvm.org >,
"Xinliang David Li" <
> > > davidxl at google.com >
> > 
> 
> > > Sent: Tuesday, November 1, 2016 4:26:17 PM
> > 
> 
> > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> > > discriminator
> > 
> 
> > > > From: "Dehao Chen" < dehao at google.com >
> > > 
> > 
> 
> > > > To: "Hal Finkel" < hfinkel at anl.gov >
> > > 
> > 
> 
> > > > Cc: "Paul Robinson" < paul.robinson at sony.com
>, "Xinliang David
> > > > Li"
> > > > <
> > > > davidxl at google.com >, "llvm-dev" <
llvm-dev at lists.llvm.org >
> > > 
> > 
> 
> > > > Sent: Tuesday, November 1, 2016 4:14:43 PM
> > > 
> > 
> 
> > > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication
factor
> > > > in
> > > > discriminator
> > > 
> > 
> 
> > > > damn... my english is not readable at all when I try to
write
> > > > fast...
> > > > trying to make some clarification below, hopefully can make
it
> > > > more
> > > > readable...
> > > 
> > 
> 
> > > > On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen < dehao at
google.com >
> > > > wrote:
> > > 
> > 
> 
> > > > > Oops... pressed the wrong button and sent out early...
> > > > 
> > > 
> > 
> 
> > > > > On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen < dehao
at google.com
> > > > > >
> > > > > wrote:
> > > > 
> > > 
> > 
> 
> > > > > > If Hal's proposal is for SamplePGO purpose,
let me clarify
> > > > > > some
> > > > > > design principles of SamplePGO.
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > The profile for sample pgo uses source location as
the key
> > > > > > to
> > > > > > map
> > > > > > the
> > > > > > execution count back to IR. This design is based
on the
> > > > > > principle
> > > > > > that we do not want the profile to be tightly
couple with
> > > > > > compiler
> > > > > > IR. Instead, profile is simple an attribute of the
source
> > > > > > code.
> > > > > > We
> > > > > > have been benefited a lot from this design that
the profile
> > > > > > can
> > > > > > easily be reused across different source versions
and
> > > > > > compiler
> > > > > > versions, or even compilers.
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > That being said, the design to encode more info
into
> > > > > > discriminator
> > > > > > does not mean that we will change the profile. The
encoded
> > > > > > info
> > > > > > in
> > > > > > discriminator will be handled by the
create_llvm_prof tool,
> > > > > > which
> > > > > > combines counts from different clones of the same
source
> > > > > > code
> > > > > > and
> > > > > > generate the combined profile data. The output
profile will
> > > > > > not
> > > > > > have
> > > > > > any cloning/dupliaction bits at all. So for the
initial
> > > > > > example
> > > > > > profile I provided, the output profile will be:
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > #1: 10
> > > > 
> > > 
> > 
> 
> > > > > #3: 80
> > > > 
> > > 
> > 
> 
> > > > > Not:
> > > > 
> > > 
> > 
> 
> > > > > #1: 10
> > > > 
> > > 
> > 
> 
> > > > > #3.0x400: 70
> > > > 
> > > 
> > 
> 
> > > > > #3.0x10400: 5
> > > > 
> > > 
> > 
> 
> > > > > #3.0x20400: 3
> > > > 
> > > 
> > 
> 
> > > > > #3.0x30400: 2
> > > > 
> > > 
> > 
> 
> > Also, how does this work for vectorization? For vectorization,
> > you're
> > going to multiply by the duplication factor first? The issue
> > obviously is that each vector instruction counts for VF times as
> > many scalar instructions, and so to get equivalent scalar counts,
> > you need to multiply by VF. I had assumed this is what you were
> > saying when I read the initial e-mail, but if you're also using
the
> > same duplication scheme for unrolling, then I think we need some
> > way
> > to differentiate.
> 
> In my original proposal, I only record VF*UF if the clone is both
> unrolled and vectorized. I did not distinguish between unroll and
> vectorization because I the unroll/vectorize decision only depends
> on the trip count, which can be derived from the duplication factor.
> If more profile info can be used for better unroll/vectorize
> decision making, we can definitely add it.I'm still missing something here. In your proposed system, does the tool
collecting the profiling information *interpret* the duplication factor encoded
in the descriminators at all? If it does, it seems like this should be specific
to vectorization. For unrolling, you still have the same number of instructions,
and so you just need to make the different instructions from the different loop
iterations carry different discriminator values (so they they'll sum
together, instead of using the maximum, as you explained in your original
proposal).
> > -Hal
> 
> > > > > The goal of the proposed change, is to make profile
more
> > > > > accurately
> > > > > represent the attribute of the source.
> > > > 
> > > 
> > 
> 
> > > > > The non-goal of the proposed change, is to provide more
> > > > > context
> > > > > in
> > > > > the profile to present the behavior of program in the
context
> > > > > of
> > > > > different context.
> > > > 
> > > 
> > 
> 
> > > > The non-goal of the proposed change, is to provide more
context
> > > > in
> > > > the profile to present the behavior of program in different
> > > > contexts.'
> > > 
> > 
> 
> > > Okay, but why? The information will be present in the profiles,
> > > why
> > > not encode it in the metadata and let the optimizer decide when
> > > to
> > > sum it up? We should even provide some utility functions that
> > > make
> > > this transparent for passes that don't care about the
> > > distinction.
> > 
> 
> I see, you are proposing encoding these profile in the metadata. I
> agree that this can work, but from implementation's perspective, it
> could could lead to great complexity.
> Sample PGO pass first read in the source->count map, use it to get
> basic_block->count map, and then use heuristics to propagate and get
> branch->probability count. So we don't have Metadata to store the
> raw count, but only the branch probability is stored after sample
> pgo pass.
I think that, as you indicate below, we'd need to change this to {source,
disc}->count map --> {BB, disc}->count map --> {branch,
disc}->prob map. The idea being that the total probability for the branch is
the sum of the probabilities associated with it for each discriminator value.
Most users will just sum them up, but some passes will want the breakdowns.
> > > > > If we are pursuing more context-sensitive profile, that
would
> > > > > be
> > > > > a
> > > > > very different design. In this design, we will need to
read
> > > > > profile
> > > > > multiple times, and have the profile tightly coupled
with the
> > > > > compiler IR/pass manager. That is doable, but I
don't think
> > > > > that
> > > > > probably better suits instrumentation based PGO's
domain.
> > > > > Comments?
> > > > 
> > > 
> > 
> 
> > > > If we are pursuing more context-sensitive profile, that
would
> > > > be
> > > > a
> > > > very different design in which we need to read in profiles
> > > > multiple
> > > > times, and have the profile tightly coupled with the
compiler
> > > > IR/pass manager. That is doable, but that probably better
suits
> > > > instrumentation based PGO's domain. Comments?
> > > 
> > 
> 
> > > I don't understand why you say we'd need to read the
profile
> > > multiple
> > > times. Can you please explain? I also don't think it needs to
be
> > > that tightly coupled; we just need each pass that generates
> > > multiple
> > > copies of things to have some way to generate a unique id for
> > > what
> > > it's doing. It's all per source location, so I don't
even think
> > > we
> > > need any kind of complicated hashing scheme.
> > 
> 
> As explained above, we only have branch probability. Yes, we can
> record more instruction/basicblock count context info in additional
> metadata. But when you use it in optimizer, I suppose you will need
> to convert it to branch probability too?
> Let me take an example to demonstrate my understanding of your use of
> context-sensitive profile.
> original code:
> for ()
> stmt1;
> optimized code:
> if (cond1)
> for()
> stmt1; //100 samples
> else if (cond2)
> for()
> stmt1; // 0 samples
> else
> for()
> stmt1; // 50 samples
> The loop was multi-versioned by cond1 and cond2. With my proposal,
> the profile for stmt1 will be combined as 150 samples. If we use
> this profile to optimize the orginal source, after annotation, the
> code becomes:
> for()
> stmt1 // 150 samples
> Later when it comes to the mulit-version optimization, it will
> transform the code to:
> if (cond1)
> for()
> stmt1; //50 samples
> else if (cond2)
> for()
> stmt1; // 50 samples
> else
> for()
> stmt1; // 50 samples
> The 150 samples are evenly distributed to 3 clones as we do not have
> context info.
> With your proposal, after annotation, the original code becomes:
> for()
> stmt1 // 150 samples (M1:100, M2:0, M3:50)
> Then during mutli-version optimization, the optimizer reads the
> metadata and find M2 is never taken, so it will transform the code
> to:
> if (cond1)
> for()
> stmt1; // 100 samples
> else
> for()
> stmt1; // 50 samples
> There are two major parts that can benefit from the extra metadata
> recorded for each copy.
> 1. It can demonstrate that cond2 is always false and there is no need
> to create that version.
> 2. It can correctly attribute samples to cloned copy (i.e. 100, 50
> respectively in this case)
> I think the most important benefit is #1. For #2, my guess is scaled
> count is already good enough.By what are you scaling the count? 
> Here comes the problem: cond1 and cond2 are both generated by
> compiler. Then how do we map the metadata back to them? Using uid or
> special encoding seems fragile because when upstream optimization
> changes, compiler may choose to put cond2 in front of cond1 so the
> uid will change accordingly. How would we handle cases like this?Exactly for this reason, I don't think we can rely on the ordering to
generate the identifiers; they need to have semantic meaning. I agree that this
is tricky in general. However, for the cases I have in mind (unrolling,
vectorization, etc.) these passes don't generate arbitrary conditionals, but
rather, specific conditionals related to runtime overlap checks, trip-count
thresholds, and I think that it should be easy to encode a discriminator value
which can be unambiguously inverted.

Thanks again, 
Hal 
> Dehao
> > > Thanks again,
> > 
> 
> > > Hal
> > 
> 
> > > > > Thanks,
> > > > 
> > > 
> > 
> 
> > > > > Dehao
> > > > 
> > > 
> > 
> 
> > > > > > On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <
> > > > > > hfinkel at anl.gov
> > > > > > >
> > > > > > wrote:
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > From: "Paul Robinson" <
paul.robinson at sony.com >
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > To: "Dehao Chen" < dehao at
google.com >, "Hal Finkel" <
> > > > > > > > hfinkel at anl.gov
> > > > > > > > >
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > Cc: "Xinliang David Li" <
davidxl at google.com >,
> > > > > > > > llvm-dev at lists.llvm.org
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > Sent: Tuesday, November 1, 2016 2:15:38
PM
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > Subject: RE: [llvm-dev] (RFC) Encoding
code duplication
> > > > > > > > factor
> > > > > > > > in
> > > > > > > > discriminator
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > As illustrated in the above example, it
is not like
> > > > > > > > "vectorization
> > > > > > > > has a distinct bit". All different
optimizations make
> > > > > > > > clones
> > > > > > > > of
> > > > > > > > code
> > > > > > > > which will be labeled by UIDs
represented by N (e.g. 8)
> > > > > > > > bits.
> > > > > > > > In
> > > > > > > > this way, the space will be capped by
the number of
> > > > > > > > clones
> > > > > > > > all
> > > > > > > > optimizations have made, instead of # of
optimizations
> > > > > > > > that
> > > > > > > > has
> > > > > > > > applied. And it will be capped at 2^N-1.
The cons of
> > > > > > > > using
> > > > > > > > uid
> > > > > > > > is
> > > > > > > > that you will not know if a clone is
coming from
> > > > > > > > vectorization
> > > > > > > > or
> > > > > > > > unroll or loop distribution.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > Okay, but that kind of semantic mapping
is important.
> > > > > > > > How
> > > > > > > > should
> > > > > > > > we
> > > > > > > > encode/recover that information? To be
clear, I'm not
> > > > > > > > saying
> > > > > > > > that
> > > > > > > > we
> > > > > > > > need to implement that up front, but
there needs to be
> > > > > > > > a
> > > > > > > > clear
> > > > > > > > path
> > > > > > > > to an implementation, because I
don't want to have two
> > > > > > > > disjoint
> > > > > > > > schemes.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > You mean that you want to know which
optimization
> > > > > > > > created
> > > > > > > > the
> > > > > > > > clone?
> > > > > > > > How would you use that info? Looks to me
this will
> > > > > > > > expose
> > > > > > > > compiler
> > > > > > > > implementation detail in debug info.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > This is still doable, assume we have 15
interesting
> > > > > > > > optimizations
> > > > > > > > to
> > > > > > > > track, we can use 4 bits to encode the
optimization
> > > > > > > > type
> > > > > > > > that
> > > > > > > > created the clone. But this becomes
nasty if the a
> > > > > > > > clone
> > > > > > > > is
> > > > > > > > created
> > > > > > > > by more than one optimizations. In that
way,
> > > > > > > > discriminator
> > > > > > > > may
> > > > > > > > not
> > > > > > > > be fit for this purpose.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > My understanding was that the encoding
scheme would
> > > > > > > > allow
> > > > > > > > the
> > > > > > > > profiling analysis to correctly map
execution data back
> > > > > > > > to
> > > > > > > > the
> > > > > > > > original source construct, while
preserving the
> > > > > > > > property
> > > > > > > > that
> > > > > > > > each
> > > > > > > > distinct basic block would have its own
discriminator
> > > > > > > > value.
> > > > > > > > That
> > > > > > > > is, the execution data would be
attributed back to the
> > > > > > > > original
> > > > > > > > source construct, not whatever each
individual
> > > > > > > > optimization
> > > > > > > > had
> > > > > > > > done
> > > > > > > > to it, and the data for the original
source construct
> > > > > > > > would
> > > > > > > > correctly reflect the execution (e.g.
profiling says
> > > > > > > > you
> > > > > > > > got
> > > > > > > > 82
> > > > > > > > hits
> > > > > > > > on the original loop, rather than
reporting 20 hits on
> > > > > > > > the
> > > > > > > > unrolled-by-4 loop plus 1 each on 2 of
the trailing
> > > > > > > > copies).
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > It sounds like Hal is thinking that the
> > > > > > > > per-discriminator
> > > > > > > > execution
> > > > > > > > info would be preserved down to the
point where an
> > > > > > > > individual
> > > > > > > > optimization could look at the profile
for each piece,
> > > > > > > > and
> > > > > > > > make
> > > > > > > > decisions on that basis.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > I'm not clear how that would be
possible, as the
> > > > > > > > optimization
> > > > > > > > would
> > > > > > > > have to first do the transform (or
predict how it would
> > > > > > > > do
> > > > > > > > the
> > > > > > > > transform) in order to see which
> > > > > > > > individual-discriminator
> > > > > > > > counts
> > > > > > > > mapped to which actual blocks, and then
make some kind
> > > > > > > > of
> > > > > > > > decision
> > > > > > > > about whether to do the transform
differently based on
> > > > > > > > that
> > > > > > > > information. Then, if the optimization
did choose to do
> > > > > > > > the
> > > > > > > > transform differently, then that leaves
the IR in a
> > > > > > > > state
> > > > > > > > where
> > > > > > > > the
> > > > > > > > individual discriminators *cannot* map
back to it. (Say
> > > > > > > > you
> > > > > > > > unroll
> > > > > > > > by 2 instead of 4; then you have only 1
trailing copy,
> > > > > > > > not
> > > > > > > > 3,
> > > > > > > > and
> > > > > > > > a
> > > > > > > > discriminator that maps to the second
trailing copy now
> > > > > > > > maps
> > > > > > > > to
> > > > > > > > nothing. The individual-discriminator
data becomes
> > > > > > > > useless.)
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > Am I expressing this well enough to show
that what Hal
> > > > > > > > is
> > > > > > > > looking
> > > > > > > > for
> > > > > > > > is not feasible?
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Yes, it will need to predict how the
transformation would
> > > > > > > affect
> > > > > > > the
> > > > > > > blocks produced. That does not seem
problematic (at least
> > > > > > > at
> > > > > > > a
> > > > > > > coarse level). Yes, if transformations made
earlier in
> > > > > > > the
> > > > > > > pipeline
> > > > > > > make different decisions, then that will
invalidate later
> > > > > > > fine-grained data (at least potentially). I
don't see how
> > > > > > > any
> > > > > > > of
> > > > > > > this makes this infeasible. We just need a
way for the
> > > > > > > profiling
> > > > > > > counts, per descriminator, to remain
available, and for
> > > > > > > the
> > > > > > > transformations themselves to know which
discriminators
> > > > > > > (loop
> > > > > > > ids,
> > > > > > > or whatever) to consider.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > -Hal
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > --paulr
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > --
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Hal Finkel
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Lead, Compiler Technology and Programming
Languages
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Leadership Computing Facility
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Argonne National Laboratory
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > --
> > 
> 
> > > Hal Finkel
> > 
> 
> > > Lead, Compiler Technology and Programming Languages
> > 
> 
> > > Leadership Computing Facility
> > 
> 
> > > Argonne National Laboratory
> > 
> 
> > > _______________________________________________
> > 
> 
> > > LLVM Developers mailing list
> > 
> 
> > > llvm-dev at lists.llvm.org
> > 
> 
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > 
> 
> > --
> 
> > Hal Finkel
> 
> > Lead, Compiler Technology and Programming Languages
> 
> > Leadership Computing Facility
> 
> > Argonne National Laboratory
> 
-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/2534c9e7/attachment-0001.html>

Dehao Chen via llvm-dev

2016-Nov-02 01:01 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

On Tue, Nov 1, 2016 at 5:56 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>,
"Xinliang David Li" <
> davidxl at google.com>
> *Sent: *Tuesday, November 1, 2016 6:41:29 PM
>
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
>
> On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>>
>> ------------------------------
>>
>> *From: *"Hal Finkel via llvm-dev" <llvm-dev at
lists.llvm.org>
>> *To: *"Dehao Chen" <dehao at google.com>
>> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>,
"Xinliang David Li" <
>> davidxl at google.com>
>> *Sent: *Tuesday, November 1, 2016 4:26:17 PM
>> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
>> discriminator
>>
>>
>> ------------------------------
>>
>> *From: *"Dehao Chen" <dehao at google.com>
>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>> *Cc: *"Paul Robinson" <paul.robinson at sony.com>,
"Xinliang David Li" <
>> davidxl at google.com>, "llvm-dev" <llvm-dev at
lists.llvm.org>
>> *Sent: *Tuesday, November 1, 2016 4:14:43 PM
>> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
>> discriminator
>>
>> damn... my english is not readable at all when I try to write fast...
>> trying to make some clarification below, hopefully can make it more
>> readable...
>>
>> On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen <dehao at google.com>
wrote:
>>
>>> Oops... pressed the wrong button and sent out early...
>>>
>>> On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen <dehao at
google.com> wrote:
>>>
>>>> If Hal's proposal is for SamplePGO purpose, let me clarify
some design
>>>> principles of SamplePGO.
>>>>
>>>> The profile for sample pgo uses source location as the key to
map the
>>>> execution count back to IR. This design is based on the
principle that we
>>>> do not want the profile to be tightly couple with compiler IR.
Instead,
>>>> profile is simple an attribute of the source code. We have been
benefited
>>>>  a lot from this design that the profile can easily be reused
across
>>>> different source versions and compiler versions, or even
compilers.
>>>>
>>>> That being said, the design to encode more info into
discriminator does
>>>> not mean that we will change the profile. The encoded info in
discriminator
>>>> will be handled by the create_llvm_prof tool, which combines
counts from
>>>> different clones of the same source code and generate the
combined profile
>>>> data. The output profile will not have any cloning/dupliaction
bits at all.
>>>> So for the initial example profile I provided, the output
profile will be:
>>>>
>>>
>>> #1: 10
>>> #3: 80
>>>
>>> Not:
>>>
>>> #1: 10
>>> #3.0x400: 70
>>> #3.0x10400: 5
>>> #3.0x20400: 3
>>> #3.0x30400: 2
>>>
>>
>> Also, how does this work for vectorization? For vectorization,
you're
>> going to multiply by the duplication factor first? The issue obviously
is
>> that each vector instruction counts for VF times as many scalar
>> instructions, and so to get equivalent scalar counts, you need to
multiply
>> by VF. I had assumed this is what you were saying when I read the
initial
>> e-mail, but if you're also using the same duplication scheme for
unrolling,
>> then I think we need some way to differentiate.
>>
>
> In my original proposal, I only record VF*UF if the clone is both unrolled
> and vectorized. I did not distinguish between unroll and vectorization
> because I the unroll/vectorize decision only depends on the trip count,
> which can be derived from the duplication factor. If more profile info can
> be used for better unroll/vectorize decision making, we can definitely add
> it.
>
> I'm still missing something here. In your proposed system, does the
tool
> collecting the profiling information *interpret* the duplication factor
> encoded in the descriminators at all? If it does, it seems like this should
> be specific to vectorization. For unrolling, you still have the same number
> of instructions, and so you just need to make the different instructions
> from the different loop iterations carry different discriminator values (so
> they they'll sum together, instead of using the maximum, as you
explained
> in your original proposal).
>
Let's take a look at the example, my proposed discriminator assignment is:


>
>
>
>>
>>
>>  -Hal
>>
>>
>>> The goal of the proposed change, is to make profile more accurately
>>> represent the attribute of the source.
>>> The non-goal of the proposed change, is to provide more context in
the
>>> profile to present the behavior of program in the context of
different
>>> context.
>>>
>>
>> The non-goal of the proposed change, is to provide more context in the
>> profile to present the behavior of program in different contexts.'
>>
>> Okay, but why? The information will be present in the profiles, why not
>> encode it in the metadata and let the optimizer decide when to sum it
up?
>> We should even provide some utility functions that make this
transparent
>> for passes that don't care about the distinction.
>>
>> I see, you are proposing encoding these profile in the metadata. I
agree
> that this can work, but from implementation's perspective, it could
could
> lead to great complexity.
>
> Sample PGO pass first read in the source->count map, use it to get
> basic_block->count map, and then use heuristics to propagate and get
> branch->probability count. So we don't have Metadata to store the
raw
> count, but only the branch probability is stored after sample pgo pass.
>
> I think that, as you indicate below, we'd need to change this to
{source,
> disc}->count map --> {BB, disc}->count map --> {branch,
disc}->prob map.
> The idea being that the total probability for the branch is the sum of the
> probabilities associated with it for each discriminator value. Most users
> will just sum them up, but some passes will want the breakdowns.
>
>
>>
>>>
>>> If we are pursuing more context-sensitive profile, that would be a
very
>>> different design. In this design, we will need to read profile
multiple
>>> times, and have the profile tightly coupled with the compiler
IR/pass
>>> manager. That is doable, but I don't think that probably better
suits
>>> instrumentation based PGO's domain. Comments?
>>>
>>
>> If we are pursuing more context-sensitive profile, that would be a very
>> different design in which we need to read in profiles multiple times,
and
>> have the profile tightly coupled with the compiler IR/pass manager.
That is
>> doable, but that probably better suits instrumentation based PGO's
domain.
>> Comments?
>>
>> I don't understand why you say we'd need to read the profile
multiple
>> times. Can you please explain? I also don't think it needs to be
that
>> tightly coupled; we just need each pass that generates multiple copies
of
>> things to have some way to generate a unique id for what it's
doing. It's
>> all per source location, so I don't even think we need any kind of
>> complicated hashing scheme.
>>
>> As explained above, we only have branch probability. Yes, we can record
> more instruction/basicblock count context info in additional metadata. But
> when you use it in optimizer, I suppose you will need to convert it to
> branch probability too?
>
> Let me take an example to demonstrate my understanding of your use of
> context-sensitive profile.
>
> original code:
> for ()
>   stmt1;
>
> optimized code:
> if (cond1)
>   for()
>     stmt1;  //100 samples
> else if (cond2)
>   for()
>     stmt1;  // 0 samples
> else
>   for()
>     stmt1; // 50 samples
>
> The loop was multi-versioned by cond1 and cond2. With my proposal, the
> profile for stmt1 will be combined as 150 samples. If we use this profile
> to optimize the orginal source, after annotation, the code becomes:
>
> for()
>   stmt1 // 150 samples
>
> Later when it comes to the mulit-version optimization, it will transform
> the code to:
>
> if (cond1)
>   for()
>     stmt1;  //50 samples
> else if (cond2)
>   for()
>     stmt1;  // 50 samples
> else
>   for()
>     stmt1; // 50 samples
>
> The 150 samples are evenly distributed to 3 clones as we do not have
> context info.
>
> With your proposal, after annotation, the original code becomes:
>
> for()
>   stmt1 // 150 samples (M1:100, M2:0, M3:50)
>
> Then during mutli-version optimization, the optimizer reads the metadata
> and find M2 is never taken, so it will transform the code to:
>
> if (cond1)
>   for()
>     stmt1;  // 100 samples
> else
>   for()
>     stmt1; // 50 samples
>
> There are two major parts that can benefit from the extra metadata
> recorded for each copy.
>
> 1. It can demonstrate that cond2 is always false and there is no need to
> create that version.
> 2. It can correctly attribute samples to cloned copy (i.e. 100, 50
> respectively in this case)
>
> I think the most important benefit is #1. For #2, my guess is scaled count
> is already good enough.
>
> By what are you scaling the count?
>
>
> Here comes the problem: cond1 and cond2 are both generated by compiler.
> Then how do we map the metadata back to them? Using uid or special encoding
> seems fragile because when upstream optimization changes, compiler may
> choose to put cond2 in front of cond1 so the uid will change accordingly.
> How would we handle cases like this?
>
> Exactly for this reason, I don't think we can rely on the ordering to
> generate the identifiers; they need to have semantic meaning. I agree that
> this is tricky in general. However, for the cases I have in mind
> (unrolling, vectorization, etc.) these passes don't generate arbitrary
> conditionals, but rather, specific conditionals related to runtime overlap
> checks, trip-count thresholds, and I think that it should be easy to encode
> a discriminator value which can be unambiguously inverted.
>
> Thanks again,
> Hal
>
>
> Dehao
>
>
>> Thanks again,
>> Hal
>>
>>
>>> Thanks,
>>> Dehao
>>>
>>>
>>>>
>>>> On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <hfinkel at
anl.gov> wrote:
>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> *From: *"Paul Robinson" <paul.robinson at
sony.com>
>>>>> *To: *"Dehao Chen" <dehao at google.com>,
"Hal Finkel" <hfinkel at anl.gov>
>>>>> *Cc: *"Xinliang David Li" <davidxl at
google.com>,
>>>>> llvm-dev at lists.llvm.org
>>>>> *Sent: *Tuesday, November 1, 2016 2:15:38 PM
>>>>> *Subject: *RE: [llvm-dev] (RFC) Encoding code duplication
factor
>>>>> in        discriminator
>>>>>
>>>>> As illustrated in the above example, it is not like
"vectorization has
>>>>> a distinct bit". All different optimizations make
clones of code which will
>>>>> be labeled by UIDs represented by N (e.g. 8) bits. In this
way, the space
>>>>> will be capped by the number of clones all optimizations
have made, instead
>>>>> of # of optimizations that has applied. And it will be
capped at 2^N-1. The
>>>>> cons of using uid is that you will not know if a clone is
coming from
>>>>> vectorization or unroll or loop distribution.
>>>>>
>>>>> Okay, but that kind of semantic mapping is important. How
should we
>>>>> encode/recover that information? To be clear, I'm not
saying that we need
>>>>> to implement that up front, but there needs to be a clear
path to an
>>>>> implementation, because I don't want to have two
disjoint schemes.
>>>>>
>>>>>
>>>>>
>>>>> You mean that you want to know which optimization created
the clone?
>>>>> How would you use that info? Looks to me this will expose
compiler
>>>>> implementation detail in debug info.
>>>>>
>>>>>
>>>>>
>>>>> This is still doable, assume we have 15 interesting
optimizations to
>>>>> track, we can use 4 bits to encode the optimization type
that created the
>>>>> clone. But this becomes nasty if the a clone is created by
more than one
>>>>> optimizations. In that way, discriminator may not be fit
for this purpose.
>>>>>
>>>>>
>>>>>
>>>>> My understanding was that the encoding scheme would allow
the
>>>>> profiling analysis to correctly map execution data back to
the original
>>>>> source construct, while preserving the property that each
distinct basic
>>>>> block would have its own discriminator value.  That is, the
execution data
>>>>> would be attributed back to the original source construct,
not whatever
>>>>> each individual optimization had done to it, and the data
for the original
>>>>> source construct would correctly reflect the execution
(e.g. profiling says
>>>>> you got 82 hits on the original loop, rather than reporting
20 hits on the
>>>>> unrolled-by-4 loop plus 1 each on 2 of the trailing
copies).
>>>>>
>>>>>
>>>>>
>>>>> It sounds like Hal is thinking that the per-discriminator
execution
>>>>> info would be preserved down to the point where an
individual optimization
>>>>> could look at the profile for each piece, and make
decisions on that basis.
>>>>>
>>>>>
>>>>>
>>>>> I'm not clear how that would be possible, as the
optimization would
>>>>> have to first do the transform (or predict how it would do
the transform)
>>>>> in order to see which individual-discriminator counts
mapped to which
>>>>> actual blocks, and then make some kind of decision about
whether to do the
>>>>> transform differently based on that information.  Then, if
the optimization
>>>>> did choose to do the transform differently, then that
leaves the IR in a
>>>>> state where the individual discriminators *cannot* map back
to it.  (Say
>>>>> you unroll by 2 instead of 4; then you have only 1 trailing
copy, not 3,
>>>>> and a discriminator that maps to the second trailing copy
now maps to
>>>>> nothing.  The individual-discriminator data becomes
useless.)
>>>>>
>>>>>
>>>>>
>>>>> Am I expressing this well enough to show that what Hal is
looking for
>>>>> is not feasible?
>>>>>
>>>>> Yes, it will need to predict how the transformation would
affect the
>>>>> blocks produced. That does not seem problematic (at least
at a coarse
>>>>> level). Yes, if transformations made earlier in the
pipeline make different
>>>>> decisions, then that will invalidate later fine-grained
data (at least
>>>>> potentially). I don't see how any of this makes this
infeasible. We just
>>>>> need a way for the profiling counts, per descriminator, to
remain
>>>>> available, and for the transformations themselves to know
which
>>>>> discriminators (loop ids, or whatever) to consider.
>>>>>
>>>>>  -Hal
>>>>>
>>>>> --paulr
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Lead, Compiler Technology and Programming Languages
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/8e54467c/attachment.html>

Dehao Chen via llvm-dev

2016-Nov-02 01:24 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

On Tue, Nov 1, 2016 at 5:56 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>,
"Xinliang David Li" <
> davidxl at google.com>
> *Sent: *Tuesday, November 1, 2016 6:41:29 PM
>
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
>
> On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>>
>> ------------------------------
>>
>> *From: *"Hal Finkel via llvm-dev" <llvm-dev at
lists.llvm.org>
>> *To: *"Dehao Chen" <dehao at google.com>
>> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>,
"Xinliang David Li" <
>> davidxl at google.com>
>> *Sent: *Tuesday, November 1, 2016 4:26:17 PM
>> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
>> discriminator
>>
>>
>> ------------------------------
>>
>> *From: *"Dehao Chen" <dehao at google.com>
>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>> *Cc: *"Paul Robinson" <paul.robinson at sony.com>,
"Xinliang David Li" <
>> davidxl at google.com>, "llvm-dev" <llvm-dev at
lists.llvm.org>
>> *Sent: *Tuesday, November 1, 2016 4:14:43 PM
>> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
>> discriminator
>>
>> damn... my english is not readable at all when I try to write fast...
>> trying to make some clarification below, hopefully can make it more
>> readable...
>>
>> On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen <dehao at google.com>
wrote:
>>
>>> Oops... pressed the wrong button and sent out early...
>>>
>>> On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen <dehao at
google.com> wrote:
>>>
>>>> If Hal's proposal is for SamplePGO purpose, let me clarify
some design
>>>> principles of SamplePGO.
>>>>
>>>> The profile for sample pgo uses source location as the key to
map the
>>>> execution count back to IR. This design is based on the
principle that we
>>>> do not want the profile to be tightly couple with compiler IR.
Instead,
>>>> profile is simple an attribute of the source code. We have been
benefited
>>>>  a lot from this design that the profile can easily be reused
across
>>>> different source versions and compiler versions, or even
compilers.
>>>>
>>>> That being said, the design to encode more info into
discriminator does
>>>> not mean that we will change the profile. The encoded info in
discriminator
>>>> will be handled by the create_llvm_prof tool, which combines
counts from
>>>> different clones of the same source code and generate the
combined profile
>>>> data. The output profile will not have any cloning/dupliaction
bits at all.
>>>> So for the initial example profile I provided, the output
profile will be:
>>>>
>>>
>>> #1: 10
>>> #3: 80
>>>
>>> Not:
>>>
>>> #1: 10
>>> #3.0x400: 70
>>> #3.0x10400: 5
>>> #3.0x20400: 3
>>> #3.0x30400: 2
>>>
>>
>> Also, how does this work for vectorization? For vectorization,
you're
>> going to multiply by the duplication factor first? The issue obviously
is
>> that each vector instruction counts for VF times as many scalar
>> instructions, and so to get equivalent scalar counts, you need to
multiply
>> by VF. I had assumed this is what you were saying when I read the
initial
>> e-mail, but if you're also using the same duplication scheme for
unrolling,
>> then I think we need some way to differentiate.
>>
>
> In my original proposal, I only record VF*UF if the clone is both unrolled
> and vectorized. I did not distinguish between unroll and vectorization
> because I the unroll/vectorize decision only depends on the trip count,
> which can be derived from the duplication factor. If more profile info can
> be used for better unroll/vectorize decision making, we can definitely add
> it.
>
> I'm still missing something here. In your proposed system, does the
tool
> collecting the profiling information *interpret* the duplication factor
> encoded in the descriminators at all?
>
Yes it does.

> If it does, it seems like this should be specific to vectorization. For
> unrolling, you still have the same number of instructions, and so you just
> need to make the different instructions from the different loop iterations
> carry different discriminator values (so they they'll sum together,
instead
> of using the maximum, as you explained in your original proposal).
>
Let's take a look at the original loop unroll example. My proposed
discriminator assignment is:
for (i = 0; i < N & 3; i+= 4) {
  foo();  // discriminator: 0x400
  foo();  // discriminator: 0x400
  foo();  // discriminator: 0x400
  foo();  // discriminator: 0x400
}
if (i++ < N) {
  foo();   // discriminator: 0x10000
  if (i++ < N) {
    foo(); // discriminator: 0x20000
    if (i++ < N) {
      foo();  // discriminator: 0x30000
    }
  }
}

IIRC, your proposed discriminator assignment is:

for (i = 0; i < N & 3; i+= 4) {
  foo();  // discriminator: 0x10000
  foo();  // discriminator: 0x20000
  foo();  // discriminator: 0x30000
  foo();  // discriminator: 0x40000
}
if (i++ < N) {
  foo();   // discriminator: 0x50000
  if (i++ < N) {
    foo(); // discriminator: 0x60000
    if (i++ < N) {
      foo();  // discriminator: 0x70000
    }
  }
}

I think my method can save discriminator space because all clones share the
same discriminator. Consider that if a loop is unrolled 100 times, then
with your algorithm, we may end up having 100 different discriminators,
which seem not good for debug info size.
>
>
>
>>
>>
>>  -Hal
>>
>>
>>> The goal of the proposed change, is to make profile more accurately
>>> represent the attribute of the source.
>>> The non-goal of the proposed change, is to provide more context in
the
>>> profile to present the behavior of program in the context of
different
>>> context.
>>>
>>
>> The non-goal of the proposed change, is to provide more context in the
>> profile to present the behavior of program in different contexts.'
>>
>> Okay, but why? The information will be present in the profiles, why not
>> encode it in the metadata and let the optimizer decide when to sum it
up?
>> We should even provide some utility functions that make this
transparent
>> for passes that don't care about the distinction.
>>
>> I see, you are proposing encoding these profile in the metadata. I
agree
> that this can work, but from implementation's perspective, it could
could
> lead to great complexity.
>
> Sample PGO pass first read in the source->count map, use it to get
> basic_block->count map, and then use heuristics to propagate and get
> branch->probability count. So we don't have Metadata to store the
raw
> count, but only the branch probability is stored after sample pgo pass.
>
> I think that, as you indicate below, we'd need to change this to
{source,
> disc}->count map --> {BB, disc}->count map --> {branch,
disc}->prob map.
> The idea being that the total probability for the branch is the sum of the
> probabilities associated with it for each discriminator value. Most users
> will just sum them up, but some passes will want the breakdowns.
>
Agree. Note that this would add some complexity to BFI/BPI
>
>
>>
>>>
>>> If we are pursuing more context-sensitive profile, that would be a
very
>>> different design. In this design, we will need to read profile
multiple
>>> times, and have the profile tightly coupled with the compiler
IR/pass
>>> manager. That is doable, but I don't think that probably better
suits
>>> instrumentation based PGO's domain. Comments?
>>>
>>
>> If we are pursuing more context-sensitive profile, that would be a very
>> different design in which we need to read in profiles multiple times,
and
>> have the profile tightly coupled with the compiler IR/pass manager.
That is
>> doable, but that probably better suits instrumentation based PGO's
domain.
>> Comments?
>>
>> I don't understand why you say we'd need to read the profile
multiple
>> times. Can you please explain? I also don't think it needs to be
that
>> tightly coupled; we just need each pass that generates multiple copies
of
>> things to have some way to generate a unique id for what it's
doing. It's
>> all per source location, so I don't even think we need any kind of
>> complicated hashing scheme.
>>
>> As explained above, we only have branch probability. Yes, we can record
> more instruction/basicblock count context info in additional metadata. But
> when you use it in optimizer, I suppose you will need to convert it to
> branch probability too?
>
> Let me take an example to demonstrate my understanding of your use of
> context-sensitive profile.
>
> original code:
> for ()
>   stmt1;
>
> optimized code:
> if (cond1)
>   for()
>     stmt1;  //100 samples
> else if (cond2)
>   for()
>     stmt1;  // 0 samples
> else
>   for()
>     stmt1; // 50 samples
>
> The loop was multi-versioned by cond1 and cond2. With my proposal, the
> profile for stmt1 will be combined as 150 samples. If we use this profile
> to optimize the orginal source, after annotation, the code becomes:
>
> for()
>   stmt1 // 150 samples
>
> Later when it comes to the mulit-version optimization, it will transform
> the code to:
>
> if (cond1)
>   for()
>     stmt1;  //50 samples
> else if (cond2)
>   for()
>     stmt1;  // 50 samples
> else
>   for()
>     stmt1; // 50 samples
>
> The 150 samples are evenly distributed to 3 clones as we do not have
> context info.
>
> With your proposal, after annotation, the original code becomes:
>
> for()
>   stmt1 // 150 samples (M1:100, M2:0, M3:50)
>
> Then during mutli-version optimization, the optimizer reads the metadata
> and find M2 is never taken, so it will transform the code to:
>
> if (cond1)
>   for()
>     stmt1;  // 100 samples
> else
>   for()
>     stmt1; // 50 samples
>
> There are two major parts that can benefit from the extra metadata
> recorded for each copy.
>
> 1. It can demonstrate that cond2 is always false and there is no need to
> create that version.
> 2. It can correctly attribute samples to cloned copy (i.e. 100, 50
> respectively in this case)
>
> I think the most important benefit is #1. For #2, my guess is scaled count
> is already good enough.
>
> By what are you scaling the count?
>
It's random. Or most likely, each branch will be 50% taken.

>
>
> Here comes the problem: cond1 and cond2 are both generated by compiler.
> Then how do we map the metadata back to them? Using uid or special encoding
> seems fragile because when upstream optimization changes, compiler may
> choose to put cond2 in front of cond1 so the uid will change accordingly.
> How would we handle cases like this?
>
> Exactly for this reason, I don't think we can rely on the ordering to
> generate the identifiers; they need to have semantic meaning. I agree that
> this is tricky in general. However, for the cases I have in mind
> (unrolling, vectorization, etc.) these passes don't generate arbitrary
> conditionals, but rather, specific conditionals related to runtime overlap
> checks, trip-count thresholds, and I think that it should be easy to encode
> a discriminator value which can be unambiguously inverted.
>
I'm still not quite clear how to do this. Needs to think more thoroughly.
But one thing I want to note is that, we want to encode the context info
only when it will be useful for the optimization. For vectorization/unroll,
I think accurate trip count is already good enough. So do we want to record
its context?

Thanks,
Dehao

>
> Thanks again,
> Hal
>
>
> Dehao
>
>
>> Thanks again,
>> Hal
>>
>>
>>> Thanks,
>>> Dehao
>>>
>>>
>>>>
>>>> On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <hfinkel at
anl.gov> wrote:
>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> *From: *"Paul Robinson" <paul.robinson at
sony.com>
>>>>> *To: *"Dehao Chen" <dehao at google.com>,
"Hal Finkel" <hfinkel at anl.gov>
>>>>> *Cc: *"Xinliang David Li" <davidxl at
google.com>,
>>>>> llvm-dev at lists.llvm.org
>>>>> *Sent: *Tuesday, November 1, 2016 2:15:38 PM
>>>>> *Subject: *RE: [llvm-dev] (RFC) Encoding code duplication
factor
>>>>> in        discriminator
>>>>>
>>>>> As illustrated in the above example, it is not like
"vectorization has
>>>>> a distinct bit". All different optimizations make
clones of code which will
>>>>> be labeled by UIDs represented by N (e.g. 8) bits. In this
way, the space
>>>>> will be capped by the number of clones all optimizations
have made, instead
>>>>> of # of optimizations that has applied. And it will be
capped at 2^N-1. The
>>>>> cons of using uid is that you will not know if a clone is
coming from
>>>>> vectorization or unroll or loop distribution.
>>>>>
>>>>> Okay, but that kind of semantic mapping is important. How
should we
>>>>> encode/recover that information? To be clear, I'm not
saying that we need
>>>>> to implement that up front, but there needs to be a clear
path to an
>>>>> implementation, because I don't want to have two
disjoint schemes.
>>>>>
>>>>>
>>>>>
>>>>> You mean that you want to know which optimization created
the clone?
>>>>> How would you use that info? Looks to me this will expose
compiler
>>>>> implementation detail in debug info.
>>>>>
>>>>>
>>>>>
>>>>> This is still doable, assume we have 15 interesting
optimizations to
>>>>> track, we can use 4 bits to encode the optimization type
that created the
>>>>> clone. But this becomes nasty if the a clone is created by
more than one
>>>>> optimizations. In that way, discriminator may not be fit
for this purpose.
>>>>>
>>>>>
>>>>>
>>>>> My understanding was that the encoding scheme would allow
the
>>>>> profiling analysis to correctly map execution data back to
the original
>>>>> source construct, while preserving the property that each
distinct basic
>>>>> block would have its own discriminator value.  That is, the
execution data
>>>>> would be attributed back to the original source construct,
not whatever
>>>>> each individual optimization had done to it, and the data
for the original
>>>>> source construct would correctly reflect the execution
(e.g. profiling says
>>>>> you got 82 hits on the original loop, rather than reporting
20 hits on the
>>>>> unrolled-by-4 loop plus 1 each on 2 of the trailing
copies).
>>>>>
>>>>>
>>>>>
>>>>> It sounds like Hal is thinking that the per-discriminator
execution
>>>>> info would be preserved down to the point where an
individual optimization
>>>>> could look at the profile for each piece, and make
decisions on that basis.
>>>>>
>>>>>
>>>>>
>>>>> I'm not clear how that would be possible, as the
optimization would
>>>>> have to first do the transform (or predict how it would do
the transform)
>>>>> in order to see which individual-discriminator counts
mapped to which
>>>>> actual blocks, and then make some kind of decision about
whether to do the
>>>>> transform differently based on that information.  Then, if
the optimization
>>>>> did choose to do the transform differently, then that
leaves the IR in a
>>>>> state where the individual discriminators *cannot* map back
to it.  (Say
>>>>> you unroll by 2 instead of 4; then you have only 1 trailing
copy, not 3,
>>>>> and a discriminator that maps to the second trailing copy
now maps to
>>>>> nothing.  The individual-discriminator data becomes
useless.)
>>>>>
>>>>>
>>>>>
>>>>> Am I expressing this well enough to show that what Hal is
looking for
>>>>> is not feasible?
>>>>>
>>>>> Yes, it will need to predict how the transformation would
affect the
>>>>> blocks produced. That does not seem problematic (at least
at a coarse
>>>>> level). Yes, if transformations made earlier in the
pipeline make different
>>>>> decisions, then that will invalidate later fine-grained
data (at least
>>>>> potentially). I don't see how any of this makes this
infeasible. We just
>>>>> need a way for the profiling counts, per descriminator, to
remain
>>>>> available, and for the transformations themselves to know
which
>>>>> discriminators (loop ids, or whatever) to consider.
>>>>>
>>>>>  -Hal
>>>>>
>>>>> --paulr
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Lead, Compiler Technology and Programming Languages
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/b30be627/attachment-0001.html>

Hal Finkel via llvm-dev

2016-Nov-02 02:35 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

----- Original Message -----
> From: "Dehao Chen" <dehao at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang
David Li"
> <davidxl at google.com>
> Sent: Tuesday, November 1, 2016 8:24:30 PM
> Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
> On Tue, Nov 1, 2016 at 5:56 PM, Hal Finkel < hfinkel at anl.gov >
wrote:
> > > From: "Dehao Chen" < dehao at google.com >
> > 
> 
> > > To: "Hal Finkel" < hfinkel at anl.gov >
> > 
> 
> > > Cc: "llvm-dev" < llvm-dev at lists.llvm.org >,
"Xinliang David Li" <
> > > davidxl at google.com >
> > 
> 
> > > Sent: Tuesday, November 1, 2016 6:41:29 PM
> > 
> 
> > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> > > discriminator
> > 
> 
> > > On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel < hfinkel at
anl.gov >
> > > wrote:
> > 
> 
> > > > > From: "Hal Finkel via llvm-dev" < llvm-dev
at lists.llvm.org >
> > > > 
> > > 
> > 
> 
> > > > > To: "Dehao Chen" < dehao at google.com
>
> > > > 
> > > 
> > 
> 
> > > > > Cc: "llvm-dev" < llvm-dev at
lists.llvm.org >, "Xinliang David
> > > > > Li"
> > > > > <
> > > > > davidxl at google.com >
> > > > 
> > > 
> > 
> 
> > > > > Sent: Tuesday, November 1, 2016 4:26:17 PM
> > > > 
> > > 
> > 
> 
> > > > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication
> > > > > factor
> > > > > in
> > > > > discriminator
> > > > 
> > > 
> > 
> 
> > > > > > From: "Dehao Chen" < dehao at
google.com >
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > To: "Hal Finkel" < hfinkel at anl.gov
>
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > Cc: "Paul Robinson" < paul.robinson
at sony.com >, "Xinliang
> > > > > > David
> > > > > > Li"
> > > > > > <
> > > > > > davidxl at google.com >, "llvm-dev"
< llvm-dev at lists.llvm.org
> > > > > > >
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > Sent: Tuesday, November 1, 2016 4:14:43 PM
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > Subject: Re: [llvm-dev] (RFC) Encoding code
duplication
> > > > > > factor
> > > > > > in
> > > > > > discriminator
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > damn... my english is not readable at all when I
try to
> > > > > > write
> > > > > > fast...
> > > > > > trying to make some clarification below, hopefully
can make
> > > > > > it
> > > > > > more
> > > > > > readable...
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen <
> > > > > > dehao at google.com
> > > > > > >
> > > > > > wrote:
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Oops... pressed the wrong button and sent out
early...
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen
<
> > > > > > > dehao at google.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > If Hal's proposal is for SamplePGO
purpose, let me
> > > > > > > > clarify
> > > > > > > > some
> > > > > > > > design principles of SamplePGO.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > The profile for sample pgo uses source
location as the
> > > > > > > > key
> > > > > > > > to
> > > > > > > > map
> > > > > > > > the
> > > > > > > > execution count back to IR. This design
is based on the
> > > > > > > > principle
> > > > > > > > that we do not want the profile to be
tightly couple
> > > > > > > > with
> > > > > > > > compiler
> > > > > > > > IR. Instead, profile is simple an
attribute of the
> > > > > > > > source
> > > > > > > > code.
> > > > > > > > We
> > > > > > > > have been benefited a lot from this
design that the
> > > > > > > > profile
> > > > > > > > can
> > > > > > > > easily be reused across different source
versions and
> > > > > > > > compiler
> > > > > > > > versions, or even compilers.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > That being said, the design to encode
more info into
> > > > > > > > discriminator
> > > > > > > > does not mean that we will change the
profile. The
> > > > > > > > encoded
> > > > > > > > info
> > > > > > > > in
> > > > > > > > discriminator will be handled by the
create_llvm_prof
> > > > > > > > tool,
> > > > > > > > which
> > > > > > > > combines counts from different clones of
the same
> > > > > > > > source
> > > > > > > > code
> > > > > > > > and
> > > > > > > > generate the combined profile data. The
output profile
> > > > > > > > will
> > > > > > > > not
> > > > > > > > have
> > > > > > > > any cloning/dupliaction bits at all. So
for the initial
> > > > > > > > example
> > > > > > > > profile I provided, the output profile
will be:
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > #1: 10
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > #3: 80
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Not:
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > #1: 10
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > #3.0x400: 70
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > #3.0x10400: 5
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > #3.0x20400: 3
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > #3.0x30400: 2
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > Also, how does this work for vectorization? For
vectorization,
> > > > you're
> > > > going to multiply by the duplication factor first? The issue
> > > > obviously is that each vector instruction counts for VF
times
> > > > as
> > > > many scalar instructions, and so to get equivalent scalar
> > > > counts,
> > > > you need to multiply by VF. I had assumed this is what you
were
> > > > saying when I read the initial e-mail, but if you're
also using
> > > > the
> > > > same duplication scheme for unrolling, then I think we need
> > > > some
> > > > way
> > > > to differentiate.
> > > 
> > 
> 
> > > In my original proposal, I only record VF*UF if the clone is both
> > > unrolled and vectorized. I did not distinguish between unroll and
> > > vectorization because I the unroll/vectorize decision only
> > > depends
> > > on the trip count, which can be derived from the duplication
> > > factor.
> > > If more profile info can be used for better unroll/vectorize
> > > decision making, we can definitely add it.
> > 
> 
> > I'm still missing something here. In your proposed system, does
the
> > tool collecting the profiling information *interpret* the
> > duplication factor encoded in the descriminators at all?
> 
> Yes it does.Okay. I'm still missing something. Can you please write out an example using
your notation below for a vectorized loop?
> > If it does, it seems like this should be specific to vectorization.
> > For unrolling, you still have the same number of instructions, and
> > so you just need to make the different instructions from the
> > different loop iterations carry different discriminator values (so
> > they they'll sum together, instead of using the maximum, as you
> > explained in your original proposal).
> 
> Let's take a look at the original loop unroll example. My proposed
> discriminator assignment is:
> for (i = 0; i < N & 3; i+= 4) {
> foo(); // discriminator: 0x400
> foo(); // discriminator: 0x400
> foo(); // discriminator: 0x400
> foo(); // discriminator: 0x400
> }
> if (i++ < N) {
> foo(); // discriminator: 0x10000
> if (i++ < N) {
> foo(); // discriminator: 0x20000
> if (i++ < N) {
> foo(); // discriminator: 0x30000
> }
> }
> }
> IIRC, your proposed discriminator assignment is:To be clear, I don't think that I've proposed any assignment scheme at
all. Regardless, I was asking this question purely in the context of your
proposal.

Thanks again, 
Hal 
> for (i = 0; i < N & 3; i+= 4) {
> foo(); // discriminator: 0x10000
> foo(); // discriminator: 0x20000
> foo(); // discriminator: 0x30000
> foo(); // discriminator: 0x40000
> }
> if (i++ < N) {
> foo(); // discriminator: 0x50000
> if (i++ < N) {
> foo(); // discriminator: 0x60000
> if (i++ < N) {
> foo(); // discriminator: 0x70000
> }
> }
> }
> I think my method can save discriminator space because all clones
> share the same discriminator. Consider that if a loop is unrolled
> 100 times, then with your algorithm, we may end up having 100
> different discriminators, which seem not good for debug info size.
> > > > -Hal
> > > 
> > 
> 
> > > > > > > The goal of the proposed change, is to make
profile more
> > > > > > > accurately
> > > > > > > represent the attribute of the source.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > The non-goal of the proposed change, is to
provide more
> > > > > > > context
> > > > > > > in
> > > > > > > the profile to present the behavior of
program in the
> > > > > > > context
> > > > > > > of
> > > > > > > different context.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > The non-goal of the proposed change, is to provide
more
> > > > > > context
> > > > > > in
> > > > > > the profile to present the behavior of program in
different
> > > > > > contexts.'
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > Okay, but why? The information will be present in the
> > > > > profiles,
> > > > > why
> > > > > not encode it in the metadata and let the optimizer
decide
> > > > > when
> > > > > to
> > > > > sum it up? We should even provide some utility
functions that
> > > > > make
> > > > > this transparent for passes that don't care about
the
> > > > > distinction.
> > > > 
> > > 
> > 
> 
> > > I see, you are proposing encoding these profile in the metadata.
> > > I
> > > agree that this can work, but from implementation's
perspective,
> > > it
> > > could could lead to great complexity.
> > 
> 
> > > Sample PGO pass first read in the source->count map, use it to
> > > get
> > > basic_block->count map, and then use heuristics to propagate
and
> > > get
> > > branch->probability count. So we don't have Metadata to
store the
> > > raw count, but only the branch probability is stored after sample
> > > pgo pass.
> > 
> 
> > I think that, as you indicate below, we'd need to change this to
> > {source, disc}->count map --> {BB, disc}->count map -->
{branch,
> > disc}->prob map. The idea being that the total probability for the
> > branch is the sum of the probabilities associated with it for each
> > discriminator value. Most users will just sum them up, but some
> > passes will want the breakdowns.
> 
> Agree. Note that this would add some complexity to BFI/BPI
> > > > > > > If we are pursuing more context-sensitive
profile, that
> > > > > > > would
> > > > > > > be
> > > > > > > a
> > > > > > > very different design. In this design, we
will need to
> > > > > > > read
> > > > > > > profile
> > > > > > > multiple times, and have the profile tightly
coupled with
> > > > > > > the
> > > > > > > compiler IR/pass manager. That is doable, but
I don't
> > > > > > > think
> > > > > > > that
> > > > > > > probably better suits instrumentation based
PGO's domain.
> > > > > > > Comments?
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > If we are pursuing more context-sensitive profile,
that
> > > > > > would
> > > > > > be
> > > > > > a
> > > > > > very different design in which we need to read in
profiles
> > > > > > multiple
> > > > > > times, and have the profile tightly coupled with
the
> > > > > > compiler
> > > > > > IR/pass manager. That is doable, but that probably
better
> > > > > > suits
> > > > > > instrumentation based PGO's domain. Comments?
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > I don't understand why you say we'd need to
read the profile
> > > > > multiple
> > > > > times. Can you please explain? I also don't think
it needs to
> > > > > be
> > > > > that tightly coupled; we just need each pass that
generates
> > > > > multiple
> > > > > copies of things to have some way to generate a unique
id for
> > > > > what
> > > > > it's doing. It's all per source location, so I
don't even
> > > > > think
> > > > > we
> > > > > need any kind of complicated hashing scheme.
> > > > 
> > > 
> > 
> 
> > > As explained above, we only have branch probability. Yes, we can
> > > record more instruction/basicblock count context info in
> > > additional
> > > metadata. But when you use it in optimizer, I suppose you will
> > > need
> > > to convert it to branch probability too?
> > 
> 
> > > Let me take an example to demonstrate my understanding of your
> > > use
> > > of
> > > context-sensitive profile.
> > 
> 
> > > original code:
> > 
> 
> > > for ()
> > 
> 
> > > stmt1;
> > 
> 
> > > optimized code:
> > 
> 
> > > if (cond1)
> > 
> 
> > > for()
> > 
> 
> > > stmt1; //100 samples
> > 
> 
> > > else if (cond2)
> > 
> 
> > > for()
> > 
> 
> > > stmt1; // 0 samples
> > 
> 
> > > else
> > 
> 
> > > for()
> > 
> 
> > > stmt1; // 50 samples
> > 
> 
> > > The loop was multi-versioned by cond1 and cond2. With my
> > > proposal,
> > > the profile for stmt1 will be combined as 150 samples. If we use
> > > this profile to optimize the orginal source, after annotation,
> > > the
> > > code becomes:
> > 
> 
> > > for()
> > 
> 
> > > stmt1 // 150 samples
> > 
> 
> > > Later when it comes to the mulit-version optimization, it will
> > > transform the code to:
> > 
> 
> > > if (cond1)
> > 
> 
> > > for()
> > 
> 
> > > stmt1; //50 samples
> > 
> 
> > > else if (cond2)
> > 
> 
> > > for()
> > 
> 
> > > stmt1; // 50 samples
> > 
> 
> > > else
> > 
> 
> > > for()
> > 
> 
> > > stmt1; // 50 samples
> > 
> 
> > > The 150 samples are evenly distributed to 3 clones as we do not
> > > have
> > > context info.
> > 
> 
> > > With your proposal, after annotation, the original code becomes:
> > 
> 
> > > for()
> > 
> 
> > > stmt1 // 150 samples (M1:100, M2:0, M3:50)
> > 
> 
> > > Then during mutli-version optimization, the optimizer reads the
> > > metadata and find M2 is never taken, so it will transform the
> > > code
> > > to:
> > 
> 
> > > if (cond1)
> > 
> 
> > > for()
> > 
> 
> > > stmt1; // 100 samples
> > 
> 
> > > else
> > 
> 
> > > for()
> > 
> 
> > > stmt1; // 50 samples
> > 
> 
> > > There are two major parts that can benefit from the extra
> > > metadata
> > > recorded for each copy.
> > 
> 
> > > 1. It can demonstrate that cond2 is always false and there is no
> > > need
> > > to create that version.
> > 
> 
> > > 2. It can correctly attribute samples to cloned copy (i.e. 100,
> > > 50
> > > respectively in this case)
> > 
> 
> > > I think the most important benefit is #1. For #2, my guess is
> > > scaled
> > > count is already good enough.
> > 
> 
> > By what are you scaling the count?
> 
> It's random. Or most likely, each branch will be 50% taken.
> > > Here comes the problem: cond1 and cond2 are both generated by
> > > compiler. Then how do we map the metadata back to them? Using uid
> > > or
> > > special encoding seems fragile because when upstream optimization
> > > changes, compiler may choose to put cond2 in front of cond1 so
> > > the
> > > uid will change accordingly. How would we handle cases like this?
> > 
> 
> > Exactly for this reason, I don't think we can rely on the ordering
> > to
> > generate the identifiers; they need to have semantic meaning. I
> > agree that this is tricky in general. However, for the cases I have
> > in mind (unrolling, vectorization, etc.) these passes don't
> > generate
> > arbitrary conditionals, but rather, specific conditionals related
> > to
> > runtime overlap checks, trip-count thresholds, and I think that it
> > should be easy to encode a discriminator value which can be
> > unambiguously inverted.
> 
> I'm still not quite clear how to do this. Needs to think more
> thoroughly. But one thing I want to note is that, we want to encode
> the context info only when it will be useful for the optimization.
> For vectorization/unroll, I think accurate trip count is already
> good enough. So do we want to record its context?
> Thanks,
> Dehao
> > Thanks again,
> 
> > Hal
> 
> > > Dehao
> > 
> 
> > > > > Thanks again,
> > > > 
> > > 
> > 
> 
> > > > > Hal
> > > > 
> > > 
> > 
> 
> > > > > > > Thanks,
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Dehao
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > On Tue, Nov 1, 2016 at 1:04 PM, Hal
Finkel <
> > > > > > > > hfinkel at anl.gov
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > From: "Paul
Robinson" < paul.robinson at sony.com >
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > To: "Dehao Chen"
< dehao at google.com >, "Hal Finkel"
> > > > > > > > > > <
> > > > > > > > > > hfinkel at anl.gov
> > > > > > > > > > >
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > Cc: "Xinliang David
Li" < davidxl at google.com >,
> > > > > > > > > > llvm-dev at lists.llvm.org
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > Sent: Tuesday, November 1,
2016 2:15:38 PM
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > Subject: RE: [llvm-dev] (RFC)
Encoding code
> > > > > > > > > > duplication
> > > > > > > > > > factor
> > > > > > > > > > in
> > > > > > > > > > discriminator
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > As illustrated in the above
example, it is not like
> > > > > > > > > > "vectorization
> > > > > > > > > > has a distinct bit". All
different optimizations
> > > > > > > > > > make
> > > > > > > > > > clones
> > > > > > > > > > of
> > > > > > > > > > code
> > > > > > > > > > which will be labeled by UIDs
represented by N
> > > > > > > > > > (e.g.
> > > > > > > > > > 8)
> > > > > > > > > > bits.
> > > > > > > > > > In
> > > > > > > > > > this way, the space will be
capped by the number of
> > > > > > > > > > clones
> > > > > > > > > > all
> > > > > > > > > > optimizations have made,
instead of # of
> > > > > > > > > > optimizations
> > > > > > > > > > that
> > > > > > > > > > has
> > > > > > > > > > applied. And it will be capped
at 2^N-1. The cons
> > > > > > > > > > of
> > > > > > > > > > using
> > > > > > > > > > uid
> > > > > > > > > > is
> > > > > > > > > > that you will not know if a
clone is coming from
> > > > > > > > > > vectorization
> > > > > > > > > > or
> > > > > > > > > > unroll or loop distribution.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > Okay, but that kind of
semantic mapping is
> > > > > > > > > > important.
> > > > > > > > > > How
> > > > > > > > > > should
> > > > > > > > > > we
> > > > > > > > > > encode/recover that
information? To be clear, I'm
> > > > > > > > > > not
> > > > > > > > > > saying
> > > > > > > > > > that
> > > > > > > > > > we
> > > > > > > > > > need to implement that up
front, but there needs to
> > > > > > > > > > be
> > > > > > > > > > a
> > > > > > > > > > clear
> > > > > > > > > > path
> > > > > > > > > > to an implementation, because
I don't want to have
> > > > > > > > > > two
> > > > > > > > > > disjoint
> > > > > > > > > > schemes.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > You mean that you want to know
which optimization
> > > > > > > > > > created
> > > > > > > > > > the
> > > > > > > > > > clone?
> > > > > > > > > > How would you use that info?
Looks to me this will
> > > > > > > > > > expose
> > > > > > > > > > compiler
> > > > > > > > > > implementation detail in debug
info.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > This is still doable, assume
we have 15 interesting
> > > > > > > > > > optimizations
> > > > > > > > > > to
> > > > > > > > > > track, we can use 4 bits to
encode the optimization
> > > > > > > > > > type
> > > > > > > > > > that
> > > > > > > > > > created the clone. But this
becomes nasty if the a
> > > > > > > > > > clone
> > > > > > > > > > is
> > > > > > > > > > created
> > > > > > > > > > by more than one
optimizations. In that way,
> > > > > > > > > > discriminator
> > > > > > > > > > may
> > > > > > > > > > not
> > > > > > > > > > be fit for this purpose.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > My understanding was that the
encoding scheme would
> > > > > > > > > > allow
> > > > > > > > > > the
> > > > > > > > > > profiling analysis to
correctly map execution data
> > > > > > > > > > back
> > > > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > original source construct,
while preserving the
> > > > > > > > > > property
> > > > > > > > > > that
> > > > > > > > > > each
> > > > > > > > > > distinct basic block would
have its own
> > > > > > > > > > discriminator
> > > > > > > > > > value.
> > > > > > > > > > That
> > > > > > > > > > is, the execution data would
be attributed back to
> > > > > > > > > > the
> > > > > > > > > > original
> > > > > > > > > > source construct, not whatever
each individual
> > > > > > > > > > optimization
> > > > > > > > > > had
> > > > > > > > > > done
> > > > > > > > > > to it, and the data for the
original source
> > > > > > > > > > construct
> > > > > > > > > > would
> > > > > > > > > > correctly reflect the
execution (e.g. profiling
> > > > > > > > > > says
> > > > > > > > > > you
> > > > > > > > > > got
> > > > > > > > > > 82
> > > > > > > > > > hits
> > > > > > > > > > on the original loop, rather
than reporting 20 hits
> > > > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > unrolled-by-4 loop plus 1 each
on 2 of the trailing
> > > > > > > > > > copies).
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > It sounds like Hal is thinking
that the
> > > > > > > > > > per-discriminator
> > > > > > > > > > execution
> > > > > > > > > > info would be preserved down
to the point where an
> > > > > > > > > > individual
> > > > > > > > > > optimization could look at the
profile for each
> > > > > > > > > > piece,
> > > > > > > > > > and
> > > > > > > > > > make
> > > > > > > > > > decisions on that basis.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > I'm not clear how that
would be possible, as the
> > > > > > > > > > optimization
> > > > > > > > > > would
> > > > > > > > > > have to first do the transform
(or predict how it
> > > > > > > > > > would
> > > > > > > > > > do
> > > > > > > > > > the
> > > > > > > > > > transform) in order to see
which
> > > > > > > > > > individual-discriminator
> > > > > > > > > > counts
> > > > > > > > > > mapped to which actual blocks,
and then make some
> > > > > > > > > > kind
> > > > > > > > > > of
> > > > > > > > > > decision
> > > > > > > > > > about whether to do the
transform differently based
> > > > > > > > > > on
> > > > > > > > > > that
> > > > > > > > > > information. Then, if the
optimization did choose
> > > > > > > > > > to
> > > > > > > > > > do
> > > > > > > > > > the
> > > > > > > > > > transform differently, then
that leaves the IR in a
> > > > > > > > > > state
> > > > > > > > > > where
> > > > > > > > > > the
> > > > > > > > > > individual discriminators
*cannot* map back to it.
> > > > > > > > > > (Say
> > > > > > > > > > you
> > > > > > > > > > unroll
> > > > > > > > > > by 2 instead of 4; then you
have only 1 trailing
> > > > > > > > > > copy,
> > > > > > > > > > not
> > > > > > > > > > 3,
> > > > > > > > > > and
> > > > > > > > > > a
> > > > > > > > > > discriminator that maps to the
second trailing copy
> > > > > > > > > > now
> > > > > > > > > > maps
> > > > > > > > > > to
> > > > > > > > > > nothing. The
individual-discriminator data becomes
> > > > > > > > > > useless.)
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > Am I expressing this well
enough to show that what
> > > > > > > > > > Hal
> > > > > > > > > > is
> > > > > > > > > > looking
> > > > > > > > > > for
> > > > > > > > > > is not feasible?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > Yes, it will need to predict how
the transformation
> > > > > > > > > would
> > > > > > > > > affect
> > > > > > > > > the
> > > > > > > > > blocks produced. That does not seem
problematic (at
> > > > > > > > > least
> > > > > > > > > at
> > > > > > > > > a
> > > > > > > > > coarse level). Yes, if
transformations made earlier
> > > > > > > > > in
> > > > > > > > > the
> > > > > > > > > pipeline
> > > > > > > > > make different decisions, then that
will invalidate
> > > > > > > > > later
> > > > > > > > > fine-grained data (at least
potentially). I don't see
> > > > > > > > > how
> > > > > > > > > any
> > > > > > > > > of
> > > > > > > > > this makes this infeasible. We just
need a way for
> > > > > > > > > the
> > > > > > > > > profiling
> > > > > > > > > counts, per descriminator, to
remain available, and
> > > > > > > > > for
> > > > > > > > > the
> > > > > > > > > transformations themselves to know
which
> > > > > > > > > discriminators
> > > > > > > > > (loop
> > > > > > > > > ids,
> > > > > > > > > or whatever) to consider.
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > -Hal
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > > --paulr
> > > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > --
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > Hal Finkel
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > Lead, Compiler Technology and
Programming Languages
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > Leadership Computing Facility
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > > Argonne National Laboratory
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > --
> > > > 
> > > 
> > 
> 
> > > > > Hal Finkel
> > > > 
> > > 
> > 
> 
> > > > > Lead, Compiler Technology and Programming Languages
> > > > 
> > > 
> > 
> 
> > > > > Leadership Computing Facility
> > > > 
> > > 
> > 
> 
> > > > > Argonne National Laboratory
> > > > 
> > > 
> > 
> 
> > > > > _______________________________________________
> > > > 
> > > 
> > 
> 
> > > > > LLVM Developers mailing list
> > > > 
> > > 
> > 
> 
> > > > > llvm-dev at lists.llvm.org
> > > > 
> > > 
> > 
> 
> > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > > > 
> > > 
> > 
> 
> > > > --
> > > 
> > 
> 
> > > > Hal Finkel
> > > 
> > 
> 
> > > > Lead, Compiler Technology and Programming Languages
> > > 
> > 
> 
> > > > Leadership Computing Facility
> > > 
> > 
> 
> > > > Argonne National Laboratory
> > > 
> > 
> 
> > --
> 
> > Hal Finkel
> 
> > Lead, Compiler Technology and Programming Languages
> 
> > Leadership Computing Facility
> 
> > Argonne National Laboratory
> 
-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/cf9c2d18/attachment.html>

Reasonably Related Threads

Search for more apparently analagous threads

llvm dev - Nov 2016 - (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Reasonably Related Threads