thr3ads.net - llvm dev - [llvm-dev] (RFC) Encoding code duplication factor in discriminator [Nov 2016]

If this information is useful, please help other people find it:
Share via:

Robinson, Paul via llvm-dev

2016-Nov-01 19:15 UTC

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

As illustrated in the above example, it is not like "vectorization has a
distinct bit". All different optimizations make clones of code which will
be labeled by UIDs represented by N (e.g. 8) bits. In this way, the space will
be capped by the number of clones all optimizations have made, instead of # of
optimizations that has applied. And it will be capped at 2^N-1. The cons of
using uid is that you will not know if a clone is coming from vectorization or
unroll or loop distribution.
Okay, but that kind of semantic mapping is important. How should we
encode/recover that information? To be clear, I'm not saying that we need to
implement that up front, but there needs to be a clear path to an
implementation, because I don't want to have two disjoint schemes.

You mean that you want to know which optimization created the clone? How would
you use that info? Looks to me this will expose compiler implementation detail
in debug info.

This is still doable, assume we have 15 interesting optimizations to track, we
can use 4 bits to encode the optimization type that created the clone. But this
becomes nasty if the a clone is created by more than one optimizations. In that
way, discriminator may not be fit for this purpose.

My understanding was that the encoding scheme would allow the profiling analysis
to correctly map execution data back to the original source construct, while
preserving the property that each distinct basic block would have its own
discriminator value.  That is, the execution data would be attributed back to
the original source construct, not whatever each individual optimization had
done to it, and the data for the original source construct would correctly
reflect the execution (e.g. profiling says you got 82 hits on the original loop,
rather than reporting 20 hits on the unrolled-by-4 loop plus 1 each on 2 of the
trailing copies).

It sounds like Hal is thinking that the per-discriminator execution info would
be preserved down to the point where an individual optimization could look at
the profile for each piece, and make decisions on that basis.

I'm not clear how that would be possible, as the optimization would have to
first do the transform (or predict how it would do the transform) in order to
see which individual-discriminator counts mapped to which actual blocks, and
then make some kind of decision about whether to do the transform differently
based on that information.  Then, if the optimization did choose to do the
transform differently, then that leaves the IR in a state where the individual
discriminators *cannot* map back to it.  (Say you unroll by 2 instead of 4; then
you have only 1 trailing copy, not 3, and a discriminator that maps to the
second trailing copy now maps to nothing.  The individual-discriminator data
becomes useless.)

Am I expressing this well enough to show that what Hal is looking for is not
feasible?
--paulr

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/c90917b2/attachment.html>

Hal Finkel via llvm-dev

2016-Nov-01 20:04 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

----- Original Message -----
> From: "Paul Robinson" <paul.robinson at sony.com>
> To: "Dehao Chen" <dehao at google.com>, "Hal
Finkel" <hfinkel at anl.gov>
> Cc: "Xinliang David Li" <davidxl at google.com>, llvm-dev
at lists.llvm.org
> Sent: Tuesday, November 1, 2016 2:15:38 PM
> Subject: RE: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
> As illustrated in the above example, it is not like "vectorization
> has a distinct bit". All different optimizations make clones of code
> which will be labeled by UIDs represented by N (e.g. 8) bits. In
> this way, the space will be capped by the number of clones all
> optimizations have made, instead of # of optimizations that has
> applied. And it will be capped at 2^N-1. The cons of using uid is
> that you will not know if a clone is coming from vectorization or
> unroll or loop distribution.
> Okay, but that kind of semantic mapping is important. How should we
> encode/recover that information? To be clear, I'm not saying that we
> need to implement that up front, but there needs to be a clear path
> to an implementation, because I don't want to have two disjoint
> schemes.
> You mean that you want to know which optimization created the clone?
> How would you use that info? Looks to me this will expose compiler
> implementation detail in debug info.
> This is still doable, assume we have 15 interesting optimizations to
> track, we can use 4 bits to encode the optimization type that
> created the clone. But this becomes nasty if the a clone is created
> by more than one optimizations. In that way, discriminator may not
> be fit for this purpose.
> My understanding was that the encoding scheme would allow the
> profiling analysis to correctly map execution data back to the
> original source construct, while preserving the property that each
> distinct basic block would have its own discriminator value. That
> is, the execution data would be attributed back to the original
> source construct, not whatever each individual optimization had done
> to it, and the data for the original source construct would
> correctly reflect the execution (e.g. profiling says you got 82 hits
> on the original loop, rather than reporting 20 hits on the
> unrolled-by-4 loop plus 1 each on 2 of the trailing copies).
> It sounds like Hal is thinking that the per-discriminator execution
> info would be preserved down to the point where an individual
> optimization could look at the profile for each piece, and make
> decisions on that basis.
> I'm not clear how that would be possible, as the optimization would
> have to first do the transform (or predict how it would do the
> transform) in order to see which individual-discriminator counts
> mapped to which actual blocks, and then make some kind of decision
> about whether to do the transform differently based on that
> information. Then, if the optimization did choose to do the
> transform differently, then that leaves the IR in a state where the
> individual discriminators *cannot* map back to it. (Say you unroll
> by 2 instead of 4; then you have only 1 trailing copy, not 3, and a
> discriminator that maps to the second trailing copy now maps to
> nothing. The individual-discriminator data becomes useless.)
> Am I expressing this well enough to show that what Hal is looking for
> is not feasible?Yes, it will need to predict how the transformation would affect the blocks
produced. That does not seem problematic (at least at a coarse level). Yes, if
transformations made earlier in the pipeline make different decisions, then that
will invalidate later fine-grained data (at least potentially). I don't see
how any of this makes this infeasible. We just need a way for the profiling
counts, per descriminator, to remain available, and for the transformations
themselves to know which discriminators (loop ids, or whatever) to consider.

-Hal 
> --paulr
-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/fdf9a8af/attachment.html>

Dehao Chen via llvm-dev

2016-Nov-01 21:01 UTC

head link

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

If Hal's proposal is for SamplePGO purpose, let me clarify some design
principles of SamplePGO.

The profile for sample pgo uses source location as the key to map the
execution count back to IR. This design is based on the principle that we
do not want the profile to be tightly couple with compiler IR. Instead,
profile is simple an attribute of the source code. We have been benefited
 a lot from this design that the profile can easily be reused across
different source versions and compiler versions, or even compilers.

That being said, the design to encode more info into discriminator does not
mean that we will change the profile. The encoded info in discriminator
will be handled by the create_llvm_prof tool, which combines counts from
different clones of the same source code and generate the combined profile
data. The output profile will not have any cloning/dupliaction bits at all.
So for the initial example profile I provided, the output profile will be:


On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> ------------------------------
>
> *From: *"Paul Robinson" <paul.robinson at sony.com>
> *To: *"Dehao Chen" <dehao at google.com>, "Hal
Finkel" <hfinkel at anl.gov>
> *Cc: *"Xinliang David Li" <davidxl at google.com>, llvm-dev
at lists.llvm.org
> *Sent: *Tuesday, November 1, 2016 2:15:38 PM
> *Subject: *RE: [llvm-dev] (RFC) Encoding code duplication factor
> in        discriminator
>
> As illustrated in the above example, it is not like "vectorization has
a
> distinct bit". All different optimizations make clones of code which
will
> be labeled by UIDs represented by N (e.g. 8) bits. In this way, the space
> will be capped by the number of clones all optimizations have made, instead
> of # of optimizations that has applied. And it will be capped at 2^N-1. The
> cons of using uid is that you will not know if a clone is coming from
> vectorization or unroll or loop distribution.
>
> Okay, but that kind of semantic mapping is important. How should we
> encode/recover that information? To be clear, I'm not saying that we
need
> to implement that up front, but there needs to be a clear path to an
> implementation, because I don't want to have two disjoint schemes.
>
>
>
> You mean that you want to know which optimization created the clone? How
> would you use that info? Looks to me this will expose compiler
> implementation detail in debug info.
>
>
>
> This is still doable, assume we have 15 interesting optimizations to
> track, we can use 4 bits to encode the optimization type that created the
> clone. But this becomes nasty if the a clone is created by more than one
> optimizations. In that way, discriminator may not be fit for this purpose.
>
>
>
> My understanding was that the encoding scheme would allow the profiling
> analysis to correctly map execution data back to the original source
> construct, while preserving the property that each distinct basic block
> would have its own discriminator value.  That is, the execution data would
> be attributed back to the original source construct, not whatever each
> individual optimization had done to it, and the data for the original
> source construct would correctly reflect the execution (e.g. profiling says
> you got 82 hits on the original loop, rather than reporting 20 hits on the
> unrolled-by-4 loop plus 1 each on 2 of the trailing copies).
>
>
>
> It sounds like Hal is thinking that the per-discriminator execution info
> would be preserved down to the point where an individual optimization could
> look at the profile for each piece, and make decisions on that basis.
>
>
>
> I'm not clear how that would be possible, as the optimization would
have
> to first do the transform (or predict how it would do the transform) in
> order to see which individual-discriminator counts mapped to which actual
> blocks, and then make some kind of decision about whether to do the
> transform differently based on that information.  Then, if the optimization
> did choose to do the transform differently, then that leaves the IR in a
> state where the individual discriminators *cannot* map back to it.  (Say
> you unroll by 2 instead of 4; then you have only 1 trailing copy, not 3,
> and a discriminator that maps to the second trailing copy now maps to
> nothing.  The individual-discriminator data becomes useless.)
>
>
>
> Am I expressing this well enough to show that what Hal is looking for is
> not feasible?
>
> Yes, it will need to predict how the transformation would affect the
> blocks produced. That does not seem problematic (at least at a coarse
> level). Yes, if transformations made earlier in the pipeline make different
> decisions, then that will invalidate later fine-grained data (at least
> potentially). I don't see how any of this makes this infeasible. We
just
> need a way for the profiling counts, per descriminator, to remain
> available, and for the transformations themselves to know which
> discriminators (loop ids, or whatever) to consider.
>
>  -Hal
>
> --paulr
>
>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/18c39057/attachment.html>

Possibly Parallel Threads

Search for more possibly parallel threads

llvm dev - Nov 2016 - (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Possibly Parallel Threads