Rong Xu via llvm-dev
2020-Nov-17 17:54 UTC
[llvm-dev] [RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
Hi all,
Here I include an RFC for control flow sensitive AutoFDO (FS-AFDO). This is
a joint work with David Li. Questions and feedback are welcome.
Thanks,
Rong
============
[RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
1. Motivation
AFDO profile is derived from PMU samples from running an earlier build
binary. PMU samples are indexed by the IP addresses. An offline tool uses
the debug line number to map PMU samples to the source line. The accuracy
of debug information is critical for AFDO performance. Thanks to the
continuous efforts from the community, we have fixed many debug info
related issues that affect AFDO.
One of the remaining areas that affects the profile accuracy is the
handling of duplicated code. Aggressive compiler optimizations can change
functions’ control flow significantly. The transformation can duplicate
branches in multiple places and new branches that do not exist in the
original source code might also be introduced. All these branches, even
though originated from the same source line, may have very different
behavior -- which is called (control) flow sensitivity.
Current AFDO handles the duplication in a pretty coarse grain way. When
AFDO sees samples with different IP addresses that map to the same source
line, it uses a max function to get the maximum number of samples and
attach to that line. From this prospective, current AFDO is not control
flow sensitive. Consider the following example:
// Original code:
for()
Stmt1; // line 10
// After a versioning optimization
for() {
if (cond1)
stmt1; // line 10, 100 samples, address: C1
else if (cond2)
stmt1; // line 10, 80 samples, address: C2
else
stmt1; // line 10, 20 samples, address: C3
}
In the final binary, stmt1 is cloned 3 times by the versioning
optimization. The address C1, gets 100 samples, the address C2 gets 80
samples, and the address C3 gets 20 samples. When converting Perf samples
to AFDO profile, a max function is used for these samples as they have the
same source line. In the AFDO profile, we will have a sample entry of <10,
100>, which means we have a sample value of 100 for line 10 of this
function. We will not have branch information for cond1, cond2, as they do
not exist in the sample loading phase. The profile count for stmt1 is also
off the actual value (200 = 100 +80 + 20) by 50%.
In this work, we propose a method that maintains more precise AFDO profile
counts for the duplicated control flow.
2. FS-AFDO Designs
In this work, we don't aim to keep track of the transformations and make
the flow sensitivity available across the compilation pipelines. Rather, we
identify a set of target optimizations that can potentially benefit most
from a more precise profile. For each of such optimizations, we use a
helper pass to recompute the flow sensitive profile just before this pass.
When loading this profile, we will recompute the branch probability and
hopefully improve the optimization quality for that target pass.
The flow sensitivity in the profile is through a hierarchical discriminator
scheme that can better preserve flow information in the Perf profile.
2.1. Current Use of Discriminators
The AFDO profile contains a discriminator field whose goal is to
differentiate statements with the same line number in the source code
(largely from MACRO expansions).
Discriminators were later expanded to get more accurate profile counts. It
is expanded to 3 components which can partially address the issue. The main
objective of that work was to get the correct profile count for the
unrolled (or vectorized) BBs. The discriminator is divided into 3
components:
* Base discriminator, assigned by AddDiscriminators
* Duplication factor: used in loop unroll and loop vectorization (when
cloning the loop body). This factor will be multiplied by sample counts in
create_llvm_prof tool to get the count value before duplication.
* Copy Identifier: reserved by not currently used.
Duplication factor was explicitly assigned in loop unroll and loop
vectorization. Instruction cloning will copy the discriminator. All the
cloned instruction instances, other than inlining, loop unroll and
vectorization, have the same discriminator. In the create_llvm_prof tool,
when assigning the sample count value for <offset_to_func_head,
discriminator>, the max sample value among all the clones will be the
assigned value.
2.2 Hierarchical discriminators
In FS-AFDO, discriminator assigning happens multiple times -- conceptually,
one assignment for each optimization that can benefit from more precise FS
profile data (let’s call it a target optimization pass). The discriminator
is divided into multiple sections. Earlier passes have the lower bits and
each instance of assigning access the bits in their sections only.
For example, there could be 3 rounds of discriminator assigning: the base
discriminator which corresponds to existing discriminator assigning,
discriminator for pass N, and discriminator for pass M. Let’s assume each
round uses 4 bits. If the instruction Inst1 is from BB0 and it’s copied to
BB1, it will have a discriminator 0x1. If BB1:Inst1 is copied to BB2:intr2
before Pass N, this instruction will have a new discriminator of 0x11
associated with pass N. Similarly, each copy of Inst1 before Pass M will
have its own discriminators using bit 8 to bit 12.
When converting Perf samples to the AFDO profile, we will have the sample
count for each discriminator. The AFDO profile loader reads all the
samples and masks out the bit used by later rounds. For example, when
loading the profile for the base discriminator, it only checks the first 4
bits. All the samples cloned from the same base discriminator will be
aggregated together. Profile loader for pass N will only check the first 8
bits of the discriminator and sum up all the samples with the same
discriminator bit 0 to bit 7.
2.3 Profile Mismatch
The above describes the ideal situation for loading the profile -- all the
discriminators in the Perf binary match the discriminator in optimized
build. In reality, the compilation for the optimized binary, with a new
profile, might perform different transformations (from the compilation of
the Perf binary). It can alter the program control flow for the downstream
passes, which leads to a discriminator mismatch.
If the mismatch happens, the profiling information will not be updated by
the SampleLoader to the affected BBs. They will inherit the profile
information (like branch probabilities) from the previous round of
SampleLoader.
The FS component in the discriminators basically uses a sequential number,
just like in the base discriminator. But to handle potential mismatch, we
also encode the total number of new discriminators. This prevents the
misplacement of profile counts to a wrong code clone. There are limited
bits for each FS discriminator component. We choose to use a hash value as
the FS discriminator. The hash function takes the following as the input:
* the sequential number of the current clone,
* the total number of the current clone.
A second measure to deal with mismatch is to use the profile coverage. We
summarize the samples used by SampleLoad for each function, and compare the
total samples in the profile. A warning is printed out if the coverage
(ratio of used samples and total sample) is lower than a threshold value.
In which case, the user should rebuild the Perf binary and collect the Perf
profile again.
2.4 Example of pass pipeline
Here is the one sample pass pipeline for AFDO + ThinLTO compilation.
// Pass Pipeline:
InsertBaseDiscrimator
SampleProfileLoader in FE
...
CM inline
...
SampleProfileLoader in BE
...
AddFSDiscriminator(kind_mbp)
FSProfileLoader(kind_mbp)
MachineBlockPlacement
AddFSDiscriminator(kind_branchfolding)
FSProfileLoader(kind_branchfolding)
BranchFoldingPass
…
AddFSDiscriminator(kind_final)
In the above pass pipeline. MachineBlockPlacement and BranchFoldingPass are
the two target optimization passes. MachineBlockPlacement can obviously be
benefited from more precise profile information. The tail merge in
BranchFoldPass can undo many unneeded tail duplications in
MachineBlockPlacment.
Note that we have a FS discriminator assigning pass at the final of
compilation pipeline. The reason is to assign a different discriminator for
each cloned instruction in the final binary. By doing this, we can get a
more precise profile for earlier passes by summing the counter values from
different discriminators.
3. A running example
The test program is shown as the following. We pay most attention to the
branch in line 5 and line 7 in function foo(). Inner loop at line 3 will be
fully unrolled after SampleProfileLoader. Outer loop at line 2 is intact by
the optimizer.
__attribute__((noinline)) int bar(int i){
volatile int j;
j = i;
return j;
}
__attribute__((noinline)) void work(int i){
...
}
__attribute__((noinline)) void foo(){ // line 17 of unroll.c
int i, j;
for (j = 0; j < 48; j++)
for (i = 0; i < 4; i++) {
int ii = bar(i + j * 48);
if (ii % 2) // line 22 of unroll.c
work(ii * 2);
if (ii % 4) // line 24 of unroll.c
work(ii * 3);
}
}
Current AFDO will produce the following profile:
foo:30387110:1606 void foo() {
0: 1606 int i, j;
2.1: 83748 for (j = 0; j < 48; j++) {
4: 351836 bar:84932 int ii = bar(i + j * 48);
5: 351836 if (ii % 2)
6: 351836 work:79741 work(ii * 2)
7: 350420 if (ii % 4)
8: 338340 work:75287 work(ii * 3)
10: 1302 }
Function foo() entry sample count is 1606. Line 2.1 is the outer loop body
with a sample count of 83748. Line 4 to line 8 is inner loop body count.
It’s scaled by an unrolled multiplication discriminator of 4. This resulted
in a higher value of sample count: it uses the Max of 4 clones and
multiplies by 4.
Here is the FS AFDO profile. Basically it has a separated profile count for
each unrolled instructions.
foo:8722583:1597 void foo() {
0: 1597 int i, j;
2.1: 83694 for (j = 0; j < 48; j++)
4: 87205 bar:85542 int ii = bar(i + j * 48);
4.8448: 82042 bar:82155 int ii = bar(i + j * 48);
4.13312: 84795 bar:84802 int ii = bar(i + j * 48);
4.13568: 87820 bar:87718 int ii = bar(i + j * 48);
5: 87205 if (ii % 2)
5.8448: 82042 if (ii % 2)
5.13312: 79604 if (ii %2)
5.13568: 87820 if (ii %2)
6: 0 work(ii * 2);
6.8448: 0 work(ii * 2);
6.13312: 79604 work:79652 work(ii * 2);
6.13568: 87820 work:90500 work(ii * 2);
7: 87493 if (ii %4)
7.8448: 84624 if (ii % 4)
7.13312: 71942 if (ii % 4)
7.13568: 83552 if (ii % 4)
8: 0 work(ii * 3);
8.8448: 84624 work:87655 work(ii * 3);
8.13312: 71942 work:75137 work(ii *3);
8.13568: 83552 work:91223 work(ii *3);
10: 1304 }
Let’s look at line 6 and line 8. Each of which has 4 records. When we use
these records in SampleLoader, these 4 records will be combined into one
after masking out the bits: line 6 will have one record of
6: 167424 work: 170152
Line 8 will have one record:
8: 240118 work: 254015
Comparing the sample count in line 5 and 7, we can easily see that line 5
has ~50% taken and line 7 has ~75% taken. This is in contrast to the
current AFDO profile, where both lines have a 100% taken branch.
When we load the above profile before the MachineBlockPlacement pass, each
record will maintain its own instance. We can see that at line 5, two of
the branches have ~0% taken probability, while the rest two have ~100%
taken probability. For line 7, one branch has 0% taken probability, while
the reset three has ~100% taken probability. We will reset the branch
probabilities as shown in the following messages:
Set branch fs prob: MBB (1 -> 3): unroll.c:22:11 W=87820 0x00005f85 /
0x80000000 = 0.00% --> 0x80000000 / 0x80000000 = 100.00%
Set branch fs prob: MBB (1 -> 2): unroll.c:22:11 W=87820 0x7fffa07b /
0x80000000 = 100.00% --> 0x00000000 / 0x80000000 = 0.00%
Set branch fs prob: MBB (3 -> 5): unroll.c:24:11 W=87820 0x04a8dc00 /
0x80000000 = 3.64% --> 0x80000000 / 0x80000000 = 100.00%
Set branch fs prob: MBB (3 -> 4): unroll.c:24:11 W=87820 0x7b572400 /
0x80000000 = 96.36% --> 0x00000000 / 0x80000000 = 0.00%
Set branch fs prob: MBB (9 -> 11): unroll.c:22:11 W=87820 0x00005f85 /
0x80000000 = 0.00% --> 0x80000000 / 0x80000000 = 100.00%
Set branch fs prob: MBB (9 -> 10): unroll.c:22:11 W=87820 0x7fffa07b /
0x80000000 = 100.00% --> 0x00000000 / 0x80000000 = 0.00%
Set branch fs prob: MBB (15 -> 17): unroll.c:24:11 W=87820 0x04a8dc00 /
0x80000000 = 3.64% --> 0x17248215 / 0x80000000 = 18.08%
Set branch fs prob: MBB (15 -> 16): unroll.c:24:11 W=87820 0x7b572400 /
0x80000000 = 96.36% --> 0x68db7deb / 0x80000000 = 81.92%
4. Implementation in LLVM
4.1 AFDO profile data format
The AFDO profile format remains largely the same. Samples could be splitted
into multiple parts with different discriminator copy components.
SampleLoader will aggregate the samples dynamically when reading the AFDO
profile based on the discriminator masks. Current discriminator encoding
will be removed and related code in the compiler will be removed.
The hierarchical discriminator in FS-AFDO captures most of the cases of
multiplication discriminator in currenting encoding scheme. As a matter of
fact, the resulting count values are usually more precise as each
individual discriminator may have different count values. There is one
exception: when the unrolling body has a single BB. FS-AFDO will have the
correct sample count after unrolling as all the instructions residing in
the same BB. In our prototype implementation, we did not find this issue to
have a visible performance impact.
4.2 Compiler changes
4.2.1 Add two new CodeGen passes:
AddFSDiscriminatorsPass: adds flow sensitive discriminators.
FSProfileLoaderPass: read FS profile and set branch probabilities.
FSProfileLoadPass will reuse most of the IR level SampleProfileLoader
algorithm and heuristics, but we need to reimplement at Machine level. We
will not need to update branch metadata. We will change the branch
probabilities directly.
4.2.2 Enhance SampleProfileReader to handle FS discriminators.
Add a new field of MaskedBitFrom to SampleProfileReader. This is the lower
mask bit being used for this pass instance. For each profile loader, we
dynamically compute samples based on bit masks -- this is done in
readImpl() method where the masked discriminators are used to aggregate the
samples.
4.3 AutoFDO tool changes
The position type used in PositionCountMap and Callsite type is of type
uint32. It is divided into high 16 bits as file offset and lower 16 bits as
discriminator. This is probably OK for the current discriminators as they
are numbered sequentially. But this would create too many clashes in FS
discriminators. We need to extend to uint64 for the position: a full 32-bit
for line number and full 32-bit for discriminators.
5. Experimental performance
A prototype was implemented and a performance test was done on critical
applications in Google.
Test binary:
(1) build the first round binary with an existing AFDO profile with
AddFSDiscriminator enabled.
(2) collect the PERF samples and convert them to the FS AFDO profile.
(3) build an optimized binary using the FSAFDO profile from step (2).
Base binary:
(1) build the first round binary with an existing AFDO profile
(2) collect the PERF samples and convert them to the AFDO profile.
(3) build an optimized binary using profile from step (2).
Run time performance:
The test binary shows 0.9% to 1.3% performance improvement over the base
binary.
Profile size:
Profile size is modestly increased. Using the Google benchmark as the
example, the profile size for the text form increased by 4% to 7% while the
binary form profile size increased by 5% to 8%.
Compiler time:
Prototype implementation shows about 10% increase of build time (measued by
the elapsed time which includes two extra FSAddDiscrimniator passes and one
extra FSProfileSampleLoader pass). Profile readings are not shared b/w
FSProfileSampleLoader and current ProfileSampleLoader. We expect a smaller
compiler time increase with better data structures that share a profile
loader.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20201117/f48f887e/attachment-0001.html>
Wenlei He via llvm-dev
2020-Nov-19 00:34 UTC
[llvm-dev] [RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
Hi Rong,
This is a very interesting proposal. We've also observed profile quality
degradation from CFG destructive pass like loop rotate, and I can see how this
proposal would help improve quality of profile that drives later optimization
passes in the pipeline. I have a few questions.
* How does this affect today's AutoFDO? Specifically, can users upgrade
compiler with FS-AutoFDO support but without refreshing their profile? I think
it's important to make new improvement like this as opt-in, so other users
of AutoFDO can choose if and when they want to make the switch from AutoFDO to
FS-AutoFDO. With the proposed changes to discriminator encoding, sounds like we
are going to eliminate duplication factor etc. altogether. In that case,
multiple FS-AutoFDO sample profile loading would be required in order to not
regress from today's AutoFDO due to lack of duplication factor. Is that
correct? If so, this change as is can break backward compatibility with
today's AutoFDO profile. It’d be great to handle discriminator in a way that
compatible with today’s AutoFDO.
* Profile loading usually happens early in the pipeline in order to avoid
mismatch between profiling and optimization. In the case of FS-AutoFDO, some
mismatch may be inevitable for the later FS profile loading. In practice, how
significant is the mismatch in later profile loading? Have you tried to see by
how much using multiple iterations (like CSPGO) can further help profile quality
and improve performance? I understand this is not practical for production use,
but could give us data points as to what's left on the table due to mismatch
of later FS profile loading.
* In your performance experiment, where is the extra FSProfileSampleLoader
in the pipeline, right before machine block placement or somewhere else? Have
you tried adding more than one FSProfileSampleLoader and is there extra perf
gain for more than one FSProfileSampleLoader?
* For the final discriminator assignment, is this because we take MAX of
samples on addresses from the same location? So if there's any late code
duplication after last discriminator assignment, we need a final discriminator
assignment to turn MAX into SUM for earlier FS profile?
* While changing discriminator encoding, have you considered encode block Id
into discriminator so AutoFDO can be CFG aware? This is something Wei brought up
earlier in the context of CSSPGO, and we didn't go that route partly because
it's hard to do it without breaking compatibility.
This would be complementary to the context-sensitive SPGO we proposed earlier -
would be nice to make PGO/FDO both context-sensitive and flow-sensitive. I think
the flow-sensitive part could also integrate with pseudo-probe, essentially we
can append FS discriminator later multiple times the same way as you proposed
here, except that the "line" part would be "probeId".
+Hongtao as well.
Thanks,
Wenlei
From: llvm-dev <llvm-dev-bounces at lists.llvm.org>
Date: Tuesday, November 17, 2020 at 9:55 AM
To: llvm-dev <llvm-dev at lists.llvm.org>
Cc: David Li <davidxl at google.com>
Subject: [llvm-dev] [RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
Hi all,
Here I include an RFC for control flow sensitive AutoFDO (FS-AFDO). This is a
joint work with David Li. Questions and feedback are welcome.
Thanks,
Rong
============
[RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
1. Motivation
AFDO profile is derived from PMU samples from running an earlier build binary.
PMU samples are indexed by the IP addresses. An offline tool uses the debug line
number to map PMU samples to the source line. The accuracy of debug information
is critical for AFDO performance. Thanks to the continuous efforts from the
community, we have fixed many debug info related issues that affect AFDO.
One of the remaining areas that affects the profile accuracy is the handling of
duplicated code. Aggressive compiler optimizations can change functions’ control
flow significantly. The transformation can duplicate branches in multiple places
and new branches that do not exist in the original source code might also be
introduced. All these branches, even though originated from the same source
line, may have very different behavior -- which is called (control) flow
sensitivity.
Current AFDO handles the duplication in a pretty coarse grain way. When AFDO
sees samples with different IP addresses that map to the same source line, it
uses a max function to get the maximum number of samples and attach to that
line. From this prospective, current AFDO is not control flow sensitive.
Consider the following example:
// Original code:
for()
Stmt1; // line 10
// After a versioning optimization
for() {
if (cond1)
stmt1; // line 10, 100 samples, address: C1
else if (cond2)
stmt1; // line 10, 80 samples, address: C2
else
stmt1; // line 10, 20 samples, address: C3
}
In the final binary, stmt1 is cloned 3 times by the versioning optimization. The
address C1, gets 100 samples, the address C2 gets 80 samples, and the address C3
gets 20 samples. When converting Perf samples to AFDO profile, a max function is
used for these samples as they have the same source line. In the AFDO profile,
we will have a sample entry of <10, 100>, which means we have a sample
value of 100 for line 10 of this function. We will not have branch information
for cond1, cond2, as they do not exist in the sample loading phase. The profile
count for stmt1 is also off the actual value (200 = 100 +80 + 20) by 50%.
In this work, we propose a method that maintains more precise AFDO profile
counts for the duplicated control flow.
2. FS-AFDO Designs
In this work, we don't aim to keep track of the transformations and make the
flow sensitivity available across the compilation pipelines. Rather, we identify
a set of target optimizations that can potentially benefit most from a more
precise profile. For each of such optimizations, we use a helper pass to
recompute the flow sensitive profile just before this pass. When loading this
profile, we will recompute the branch probability and hopefully improve the
optimization quality for that target pass.
The flow sensitivity in the profile is through a hierarchical discriminator
scheme that can better preserve flow information in the Perf profile.
2.1. Current Use of Discriminators
The AFDO profile contains a discriminator field whose goal is to differentiate
statements with the same line number in the source code (largely from MACRO
expansions).
Discriminators were later expanded to get more accurate profile counts. It is
expanded to 3 components which can partially address the issue. The main
objective of that work was to get the correct profile count for the unrolled (or
vectorized) BBs. The discriminator is divided into 3 components:
* Base discriminator, assigned by AddDiscriminators
* Duplication factor: used in loop unroll and loop vectorization (when cloning
the loop body). This factor will be multiplied by sample counts in
create_llvm_prof tool to get the count value before duplication.
* Copy Identifier: reserved by not currently used.
Duplication factor was explicitly assigned in loop unroll and loop
vectorization. Instruction cloning will copy the discriminator. All the cloned
instruction instances, other than inlining, loop unroll and vectorization, have
the same discriminator. In the create_llvm_prof tool, when assigning the sample
count value for <offset_to_func_head, discriminator>, the max sample value
among all the clones will be the assigned value.
2.2 Hierarchical discriminators
In FS-AFDO, discriminator assigning happens multiple times -- conceptually, one
assignment for each optimization that can benefit from more precise FS profile
data (let’s call it a target optimization pass). The discriminator is divided
into multiple sections. Earlier passes have the lower bits and each instance of
assigning access the bits in their sections only.
For example, there could be 3 rounds of discriminator assigning: the base
discriminator which corresponds to existing discriminator assigning,
discriminator for pass N, and discriminator for pass M. Let’s assume each round
uses 4 bits. If the instruction Inst1 is from BB0 and it’s copied to BB1, it
will have a discriminator 0x1. If BB1:Inst1 is copied to BB2:intr2 before Pass
N, this instruction will have a new discriminator of 0x11 associated with pass
N. Similarly, each copy of Inst1 before Pass M will have its own discriminators
using bit 8 to bit 12.
When converting Perf samples to the AFDO profile, we will have the sample count
for each discriminator. The AFDO profile loader reads all the samples and masks
out the bit used by later rounds. For example, when loading the profile for the
base discriminator, it only checks the first 4 bits. All the samples cloned from
the same base discriminator will be aggregated together. Profile loader for pass
N will only check the first 8 bits of the discriminator and sum up all the
samples with the same discriminator bit 0 to bit 7.
2.3 Profile Mismatch
The above describes the ideal situation for loading the profile -- all the
discriminators in the Perf binary match the discriminator in optimized build. In
reality, the compilation for the optimized binary, with a new profile, might
perform different transformations (from the compilation of the Perf binary). It
can alter the program control flow for the downstream passes, which leads to a
discriminator mismatch.
If the mismatch happens, the profiling information will not be updated by the
SampleLoader to the affected BBs. They will inherit the profile information
(like branch probabilities) from the previous round of SampleLoader.
The FS component in the discriminators basically uses a sequential number, just
like in the base discriminator. But to handle potential mismatch, we also
encode the total number of new discriminators. This prevents the misplacement of
profile counts to a wrong code clone. There are limited bits for each FS
discriminator component. We choose to use a hash value as the FS discriminator.
The hash function takes the following as the input:
* the sequential number of the current clone,
* the total number of the current clone.
A second measure to deal with mismatch is to use the profile coverage. We
summarize the samples used by SampleLoad for each function, and compare the
total samples in the profile. A warning is printed out if the coverage (ratio of
used samples and total sample) is lower than a threshold value. In which case,
the user should rebuild the Perf binary and collect the Perf profile again.
2.4 Example of pass pipeline
Here is the one sample pass pipeline for AFDO + ThinLTO compilation.
// Pass Pipeline:
InsertBaseDiscrimator
SampleProfileLoader in FE
...
CM inline
...
SampleProfileLoader in BE
...
AddFSDiscriminator(kind_mbp)
FSProfileLoader(kind_mbp)
MachineBlockPlacement
AddFSDiscriminator(kind_branchfolding)
FSProfileLoader(kind_branchfolding)
BranchFoldingPass
…
AddFSDiscriminator(kind_final)
In the above pass pipeline. MachineBlockPlacement and BranchFoldingPass are the
two target optimization passes. MachineBlockPlacement can obviously be benefited
from more precise profile information. The tail merge in BranchFoldPass can undo
many unneeded tail duplications in MachineBlockPlacment.
Note that we have a FS discriminator assigning pass at the final of compilation
pipeline. The reason is to assign a different discriminator for each cloned
instruction in the final binary. By doing this, we can get a more precise
profile for earlier passes by summing the counter values from different
discriminators.
3. A running example
The test program is shown as the following. We pay most attention to the branch
in line 5 and line 7 in function foo(). Inner loop at line 3 will be fully
unrolled after SampleProfileLoader. Outer loop at line 2 is intact by the
optimizer.
__attribute__((noinline)) int bar(int i){
volatile int j;
j = i;
return j;
}
__attribute__((noinline)) void work(int i){
...
}
__attribute__((noinline)) void foo(){ // line 17 of unroll.c
int i, j;
for (j = 0; j < 48; j++)
for (i = 0; i < 4; i++) {
int ii = bar(i + j * 48);
if (ii % 2) // line 22 of unroll.c
work(ii * 2);
if (ii % 4) // line 24 of unroll.c
work(ii * 3);
}
}
Current AFDO will produce the following profile:
foo:30387110:1606 void foo() {
0: 1606 int i, j;
2.1: 83748 for (j = 0; j < 48; j++) {
4: 351836 bar:84932 int ii = bar(i + j * 48);
5: 351836 if (ii % 2)
6: 351836 work:79741 work(ii * 2)
7: 350420 if (ii % 4)
8: 338340 work:75287 work(ii * 3)
10: 1302 }
Function foo() entry sample count is 1606. Line 2.1 is the outer loop body with
a sample count of 83748. Line 4 to line 8 is inner loop body count. It’s scaled
by an unrolled multiplication discriminator of 4. This resulted in a higher
value of sample count: it uses the Max of 4 clones and multiplies by 4.
Here is the FS AFDO profile. Basically it has a separated profile count for each
unrolled instructions.
foo:8722583:1597 void foo() {
0: 1597 int i, j;
2.1: 83694 for (j = 0; j < 48; j++)
4: 87205 bar:85542 int ii = bar(i + j * 48);
4.8448: 82042 bar:82155 int ii = bar(i + j * 48);
4.13312: 84795 bar:84802 int ii = bar(i + j * 48);
4.13568: 87820 bar:87718 int ii = bar(i + j * 48);
5: 87205 if (ii % 2)
5.8448: 82042 if (ii % 2)
5.13312: 79604 if (ii %2)
5.13568: 87820 if (ii %2)
6: 0 work(ii * 2);
6.8448: 0 work(ii * 2);
6.13312: 79604 work:79652 work(ii * 2);
6.13568: 87820 work:90500 work(ii * 2);
7: 87493 if (ii %4)
7.8448: 84624 if (ii % 4)
7.13312: 71942 if (ii % 4)
7.13568: 83552 if (ii % 4)
8: 0 work(ii * 3);
8.8448: 84624 work:87655 work(ii * 3);
8.13312: 71942 work:75137 work(ii *3);
8.13568: 83552 work:91223 work(ii *3);
10: 1304 }
Let’s look at line 6 and line 8. Each of which has 4 records. When we use these
records in SampleLoader, these 4 records will be combined into one after masking
out the bits: line 6 will have one record of
6: 167424 work: 170152
Line 8 will have one record:
8: 240118 work: 254015
Comparing the sample count in line 5 and 7, we can easily see that line 5 has
~50% taken and line 7 has ~75% taken. This is in contrast to the current AFDO
profile, where both lines have a 100% taken branch.
When we load the above profile before the MachineBlockPlacement pass, each
record will maintain its own instance. We can see that at line 5, two of the
branches have ~0% taken probability, while the rest two have ~100% taken
probability. For line 7, one branch has 0% taken probability, while the reset
three has ~100% taken probability. We will reset the branch probabilities as
shown in the following messages:
Set branch fs prob: MBB (1 -> 3): unroll.c:22:11 W=87820 0x00005f85 /
0x80000000 = 0.00% --> 0x80000000 / 0x80000000 = 100.00%
Set branch fs prob: MBB (1 -> 2): unroll.c:22:11 W=87820 0x7fffa07b /
0x80000000 = 100.00% --> 0x00000000 / 0x80000000 = 0.00%
Set branch fs prob: MBB (3 -> 5): unroll.c:24:11 W=87820 0x04a8dc00 /
0x80000000 = 3.64% --> 0x80000000 / 0x80000000 = 100.00%
Set branch fs prob: MBB (3 -> 4): unroll.c:24:11 W=87820 0x7b572400 /
0x80000000 = 96.36% --> 0x00000000 / 0x80000000 = 0.00%
Set branch fs prob: MBB (9 -> 11): unroll.c:22:11 W=87820 0x00005f85 /
0x80000000 = 0.00% --> 0x80000000 / 0x80000000 = 100.00%
Set branch fs prob: MBB (9 -> 10): unroll.c:22:11 W=87820 0x7fffa07b /
0x80000000 = 100.00% --> 0x00000000 / 0x80000000 = 0.00%
Set branch fs prob: MBB (15 -> 17): unroll.c:24:11 W=87820 0x04a8dc00 /
0x80000000 = 3.64% --> 0x17248215 / 0x80000000 = 18.08%
Set branch fs prob: MBB (15 -> 16): unroll.c:24:11 W=87820 0x7b572400 /
0x80000000 = 96.36% --> 0x68db7deb / 0x80000000 = 81.92%
4. Implementation in LLVM
4.1 AFDO profile data format
The AFDO profile format remains largely the same. Samples could be splitted into
multiple parts with different discriminator copy components. SampleLoader will
aggregate the samples dynamically when reading the AFDO profile based on the
discriminator masks. Current discriminator encoding will be removed and related
code in the compiler will be removed.
The hierarchical discriminator in FS-AFDO captures most of the cases of
multiplication discriminator in currenting encoding scheme. As a matter of fact,
the resulting count values are usually more precise as each individual
discriminator may have different count values. There is one exception: when the
unrolling body has a single BB. FS-AFDO will have the correct sample count after
unrolling as all the instructions residing in the same BB. In our prototype
implementation, we did not find this issue to have a visible performance impact.
4.2 Compiler changes
4.2.1 Add two new CodeGen passes:
AddFSDiscriminatorsPass: adds flow sensitive discriminators.
FSProfileLoaderPass: read FS profile and set branch probabilities.
FSProfileLoadPass will reuse most of the IR level SampleProfileLoader algorithm
and heuristics, but we need to reimplement at Machine level. We will not need to
update branch metadata. We will change the branch probabilities directly.
4.2.2 Enhance SampleProfileReader to handle FS discriminators.
Add a new field of MaskedBitFrom to SampleProfileReader. This is the lower mask
bit being used for this pass instance. For each profile loader, we dynamically
compute samples based on bit masks -- this is done in readImpl() method where
the masked discriminators are used to aggregate the samples.
4.3 AutoFDO tool changes
The position type used in PositionCountMap and Callsite type is of type uint32.
It is divided into high 16 bits as file offset and lower 16 bits as
discriminator. This is probably OK for the current discriminators as they are
numbered sequentially. But this would create too many clashes in FS
discriminators. We need to extend to uint64 for the position: a full 32-bit for
line number and full 32-bit for discriminators.
5. Experimental performance
A prototype was implemented and a performance test was done on critical
applications in Google.
Test binary:
(1) build the first round binary with an existing AFDO profile with
AddFSDiscriminator enabled.
(2) collect the PERF samples and convert them to the FS AFDO profile.
(3) build an optimized binary using the FSAFDO profile from step (2).
Base binary:
(1) build the first round binary with an existing AFDO profile
(2) collect the PERF samples and convert them to the AFDO profile.
(3) build an optimized binary using profile from step (2).
Run time performance:
The test binary shows 0.9% to 1.3% performance improvement over the base binary.
Profile size:
Profile size is modestly increased. Using the Google benchmark as the example,
the profile size for the text form increased by 4% to 7% while the binary form
profile size increased by 5% to 8%.
Compiler time:
Prototype implementation shows about 10% increase of build time (measued by the
elapsed time which includes two extra FSAddDiscrimniator passes and one extra
FSProfileSampleLoader pass). Profile readings are not shared b/w
FSProfileSampleLoader and current ProfileSampleLoader. We expect a smaller
compiler time increase with better data structures that share a profile loader.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20201119/adddf22d/attachment.html>
Hongtao Yu via llvm-dev
2020-Nov-19 08:15 UTC
[llvm-dev] [RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
Hi Rong,
Nice to see this proposal which is quite interesting and useful. Actually I
believe it can help address some of the performance-critical issues we have
encountered with our AutoFDO workloads. As mentioned previously by Wenlei, we
have been dealing with profile counts degradation caused by various
transformations, such as loop rotate, simplifyCFG, etc. Relying on BPI/BFI to
infer a reasonable counts distribution for duplicated code is sometimes
challenging. A per-pass profile will make it easier.
One thing I don't quite get is how a flow-sensitive profile applied to a
specific optimization pass. If I understand correctly, the records of duplicated
instructions collected for an optimization pass from the training build are
mapped to the corresponding instructions resulted by the same pass in the
optimized build. Can you explain a bit more about how the mapping works and how
code duplication is tracked? Could a slightly different IR duplication cause
incorrect mapping?
It looks like the ID of a pass is considered when adding discriminators for that
pass. I guess we are not aiming at supporting every pass with a FS profile,
given the size limitation of a 32-bit Dwarf discriminator. What are the
first-class passes being supported? Will it be considered to extend the
discriminator size for new optimizations?
Thanks,
Hongtao
From: Wenlei He <wenlei at fb.com>
Date: Wednesday, November 18, 2020 at 4:34 PM
To: Rong Xu <xur at google.com>, llvm-dev <llvm-dev at
lists.llvm.org>
Cc: David Li <davidxl at google.com>, Hongtao Yu <hoy at fb.com>
Subject: Re: [llvm-dev] [RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
Hi Rong,
This is a very interesting proposal. We've also observed profile quality
degradation from CFG destructive pass like loop rotate, and I can see how this
proposal would help improve quality of profile that drives later optimization
passes in the pipeline. I have a few questions.
- How does this affect today's AutoFDO? Specifically, can users
upgrade compiler with FS-AutoFDO support but without refreshing their profile? I
think it's important to make new improvement like this as opt-in, so other
users of AutoFDO can choose if and when they want to make the switch from
AutoFDO to FS-AutoFDO. With the proposed changes to discriminator encoding,
sounds like we are going to eliminate duplication factor etc. altogether. In
that case, multiple FS-AutoFDO sample profile loading would be required in order
to not regress from today's AutoFDO due to lack of duplication factor. Is
that correct? If so, this change as is can break backward compatibility with
today's AutoFDO profile. It’d be great to handle discriminator in a way that
compatible with today’s AutoFDO.
- Profile loading usually happens early in the pipeline in order to
avoid mismatch between profiling and optimization. In the case of FS-AutoFDO,
some mismatch may be inevitable for the later FS profile loading. In practice,
how significant is the mismatch in later profile loading? Have you tried to see
by how much using multiple iterations (like CSPGO) can further help profile
quality and improve performance? I understand this is not practical for
production use, but could give us data points as to what's left on the table
due to mismatch of later FS profile loading.
- In your performance experiment, where is the extra
FSProfileSampleLoader in the pipeline, right before machine block placement or
somewhere else? Have you tried adding more than one FSProfileSampleLoader and is
there extra perf gain for more than one FSProfileSampleLoader?
- For the final discriminator assignment, is this because we take MAX
of samples on addresses from the same location? So if there's any late code
duplication after last discriminator assignment, we need a final discriminator
assignment to turn MAX into SUM for earlier FS profile?
- While changing discriminator encoding, have you considered encode
block Id into discriminator so AutoFDO can be CFG aware? This is something Wei
brought up earlier in the context of CSSPGO, and we didn't go that route
partly because it's hard to do it without breaking compatibility.
This would be complementary to the context-sensitive SPGO we proposed earlier -
would be nice to make PGO/FDO both context-sensitive and flow-sensitive. I think
the flow-sensitive part could also integrate with pseudo-probe, essentially we
can append FS discriminator later multiple times the same way as you proposed
here, except that the "line" part would be "probeId".
+Hongtao as well.
Thanks,
Wenlei
From: llvm-dev <llvm-dev-bounces at lists.llvm.org>
Date: Tuesday, November 17, 2020 at 9:55 AM
To: llvm-dev <llvm-dev at lists.llvm.org>
Cc: David Li <davidxl at google.com>
Subject: [llvm-dev] [RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
Hi all,
Here I include an RFC for control flow sensitive AutoFDO (FS-AFDO). This is a
joint work with David Li. Questions and feedback are welcome.
Thanks,
Rong
============
[RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
1. Motivation
AFDO profile is derived from PMU samples from running an earlier build binary.
PMU samples are indexed by the IP addresses. An offline tool uses the debug line
number to map PMU samples to the source line. The accuracy of debug information
is critical for AFDO performance. Thanks to the continuous efforts from the
community, we have fixed many debug info related issues that affect AFDO.
One of the remaining areas that affects the profile accuracy is the handling of
duplicated code. Aggressive compiler optimizations can change functions’ control
flow significantly. The transformation can duplicate branches in multiple places
and new branches that do not exist in the original source code might also be
introduced. All these branches, even though originated from the same source
line, may have very different behavior -- which is called (control) flow
sensitivity.
Current AFDO handles the duplication in a pretty coarse grain way. When AFDO
sees samples with different IP addresses that map to the same source line, it
uses a max function to get the maximum number of samples and attach to that
line. From this prospective, current AFDO is not control flow sensitive.
Consider the following example:
// Original code:
for()
Stmt1; // line 10
// After a versioning optimization
for() {
if (cond1)
stmt1; // line 10, 100 samples, address: C1
else if (cond2)
stmt1; // line 10, 80 samples, address: C2
else
stmt1; // line 10, 20 samples, address: C3
}
In the final binary, stmt1 is cloned 3 times by the versioning optimization. The
address C1, gets 100 samples, the address C2 gets 80 samples, and the address C3
gets 20 samples. When converting Perf samples to AFDO profile, a max function is
used for these samples as they have the same source line. In the AFDO profile,
we will have a sample entry of <10, 100>, which means we have a sample
value of 100 for line 10 of this function. We will not have branch information
for cond1, cond2, as they do not exist in the sample loading phase. The profile
count for stmt1 is also off the actual value (200 = 100 +80 + 20) by 50%.
In this work, we propose a method that maintains more precise AFDO profile
counts for the duplicated control flow.
2. FS-AFDO Designs
In this work, we don't aim to keep track of the transformations and make the
flow sensitivity available across the compilation pipelines. Rather, we identify
a set of target optimizations that can potentially benefit most from a more
precise profile. For each of such optimizations, we use a helper pass to
recompute the flow sensitive profile just before this pass. When loading this
profile, we will recompute the branch probability and hopefully improve the
optimization quality for that target pass.
The flow sensitivity in the profile is through a hierarchical discriminator
scheme that can better preserve flow information in the Perf profile.
2.1. Current Use of Discriminators
The AFDO profile contains a discriminator field whose goal is to differentiate
statements with the same line number in the source code (largely from MACRO
expansions).
Discriminators were later expanded to get more accurate profile counts. It is
expanded to 3 components which can partially address the issue. The main
objective of that work was to get the correct profile count for the unrolled (or
vectorized) BBs. The discriminator is divided into 3 components:
* Base discriminator, assigned by AddDiscriminators
* Duplication factor: used in loop unroll and loop vectorization (when cloning
the loop body). This factor will be multiplied by sample counts in
create_llvm_prof tool to get the count value before duplication.
* Copy Identifier: reserved by not currently used.
Duplication factor was explicitly assigned in loop unroll and loop
vectorization. Instruction cloning will copy the discriminator. All the cloned
instruction instances, other than inlining, loop unroll and vectorization, have
the same discriminator. In the create_llvm_prof tool, when assigning the sample
count value for <offset_to_func_head, discriminator>, the max sample value
among all the clones will be the assigned value.
2.2 Hierarchical discriminators
In FS-AFDO, discriminator assigning happens multiple times -- conceptually, one
assignment for each optimization that can benefit from more precise FS profile
data (let’s call it a target optimization pass). The discriminator is divided
into multiple sections. Earlier passes have the lower bits and each instance of
assigning access the bits in their sections only.
For example, there could be 3 rounds of discriminator assigning: the base
discriminator which corresponds to existing discriminator assigning,
discriminator for pass N, and discriminator for pass M. Let’s assume each round
uses 4 bits. If the instruction Inst1 is from BB0 and it’s copied to BB1, it
will have a discriminator 0x1. If BB1:Inst1 is copied to BB2:intr2 before Pass
N, this instruction will have a new discriminator of 0x11 associated with pass
N. Similarly, each copy of Inst1 before Pass M will have its own discriminators
using bit 8 to bit 12.
When converting Perf samples to the AFDO profile, we will have the sample count
for each discriminator. The AFDO profile loader reads all the samples and masks
out the bit used by later rounds. For example, when loading the profile for the
base discriminator, it only checks the first 4 bits. All the samples cloned from
the same base discriminator will be aggregated together. Profile loader for pass
N will only check the first 8 bits of the discriminator and sum up all the
samples with the same discriminator bit 0 to bit 7.
2.3 Profile Mismatch
The above describes the ideal situation for loading the profile -- all the
discriminators in the Perf binary match the discriminator in optimized build. In
reality, the compilation for the optimized binary, with a new profile, might
perform different transformations (from the compilation of the Perf binary). It
can alter the program control flow for the downstream passes, which leads to a
discriminator mismatch.
If the mismatch happens, the profiling information will not be updated by the
SampleLoader to the affected BBs. They will inherit the profile information
(like branch probabilities) from the previous round of SampleLoader.
The FS component in the discriminators basically uses a sequential number, just
like in the base discriminator. But to handle potential mismatch, we also
encode the total number of new discriminators. This prevents the misplacement of
profile counts to a wrong code clone. There are limited bits for each FS
discriminator component. We choose to use a hash value as the FS discriminator.
The hash function takes the following as the input:
* the sequential number of the current clone,
* the total number of the current clone.
A second measure to deal with mismatch is to use the profile coverage. We
summarize the samples used by SampleLoad for each function, and compare the
total samples in the profile. A warning is printed out if the coverage (ratio of
used samples and total sample) is lower than a threshold value. In which case,
the user should rebuild the Perf binary and collect the Perf profile again.
2.4 Example of pass pipeline
Here is the one sample pass pipeline for AFDO + ThinLTO compilation.
// Pass Pipeline:
InsertBaseDiscrimator
SampleProfileLoader in FE
...
CM inline
...
SampleProfileLoader in BE
...
AddFSDiscriminator(kind_mbp)
FSProfileLoader(kind_mbp)
MachineBlockPlacement
AddFSDiscriminator(kind_branchfolding)
FSProfileLoader(kind_branchfolding)
BranchFoldingPass
…
AddFSDiscriminator(kind_final)
In the above pass pipeline. MachineBlockPlacement and BranchFoldingPass are the
two target optimization passes. MachineBlockPlacement can obviously be benefited
from more precise profile information. The tail merge in BranchFoldPass can undo
many unneeded tail duplications in MachineBlockPlacment.
Note that we have a FS discriminator assigning pass at the final of compilation
pipeline. The reason is to assign a different discriminator for each cloned
instruction in the final binary. By doing this, we can get a more precise
profile for earlier passes by summing the counter values from different
discriminators.
3. A running example
The test program is shown as the following. We pay most attention to the branch
in line 5 and line 7 in function foo(). Inner loop at line 3 will be fully
unrolled after SampleProfileLoader. Outer loop at line 2 is intact by the
optimizer.
__attribute__((noinline)) int bar(int i){
volatile int j;
j = i;
return j;
}
__attribute__((noinline)) void work(int i){
...
}
__attribute__((noinline)) void foo(){ // line 17 of unroll.c
int i, j;
for (j = 0; j < 48; j++)
for (i = 0; i < 4; i++) {
int ii = bar(i + j * 48);
if (ii % 2) // line 22 of unroll.c
work(ii * 2);
if (ii % 4) // line 24 of unroll.c
work(ii * 3);
}
}
Current AFDO will produce the following profile:
foo:30387110:1606 void foo() {
0: 1606 int i, j;
2.1: 83748 for (j = 0; j < 48; j++) {
4: 351836 bar:84932 int ii = bar(i + j * 48);
5: 351836 if (ii % 2)
6: 351836 work:79741 work(ii * 2)
7: 350420 if (ii % 4)
8: 338340 work:75287 work(ii * 3)
10: 1302 }
Function foo() entry sample count is 1606. Line 2.1 is the outer loop body with
a sample count of 83748. Line 4 to line 8 is inner loop body count. It’s scaled
by an unrolled multiplication discriminator of 4. This resulted in a higher
value of sample count: it uses the Max of 4 clones and multiplies by 4.
Here is the FS AFDO profile. Basically it has a separated profile count for each
unrolled instructions.
foo:8722583:1597 void foo() {
0: 1597 int i, j;
2.1: 83694 for (j = 0; j < 48; j++)
4: 87205 bar:85542 int ii = bar(i + j * 48);
4.8448: 82042 bar:82155 int ii = bar(i + j * 48);
4.13312: 84795 bar:84802 int ii = bar(i + j * 48);
4.13568: 87820 bar:87718 int ii = bar(i + j * 48);
5: 87205 if (ii % 2)
5.8448: 82042 if (ii % 2)
5.13312: 79604 if (ii %2)
5.13568: 87820 if (ii %2)
6: 0 work(ii * 2);
6.8448: 0 work(ii * 2);
6.13312: 79604 work:79652 work(ii * 2);
6.13568: 87820 work:90500 work(ii * 2);
7: 87493 if (ii %4)
7.8448: 84624 if (ii % 4)
7.13312: 71942 if (ii % 4)
7.13568: 83552 if (ii % 4)
8: 0 work(ii * 3);
8.8448: 84624 work:87655 work(ii * 3);
8.13312: 71942 work:75137 work(ii *3);
8.13568: 83552 work:91223 work(ii *3);
10: 1304 }
Let’s look at line 6 and line 8. Each of which has 4 records. When we use these
records in SampleLoader, these 4 records will be combined into one after masking
out the bits: line 6 will have one record of
6: 167424 work: 170152
Line 8 will have one record:
8: 240118 work: 254015
Comparing the sample count in line 5 and 7, we can easily see that line 5 has
~50% taken and line 7 has ~75% taken. This is in contrast to the current AFDO
profile, where both lines have a 100% taken branch.
When we load the above profile before the MachineBlockPlacement pass, each
record will maintain its own instance. We can see that at line 5, two of the
branches have ~0% taken probability, while the rest two have ~100% taken
probability. For line 7, one branch has 0% taken probability, while the reset
three has ~100% taken probability. We will reset the branch probabilities as
shown in the following messages:
Set branch fs prob: MBB (1 -> 3): unroll.c:22:11 W=87820 0x00005f85 /
0x80000000 = 0.00% --> 0x80000000 / 0x80000000 = 100.00%
Set branch fs prob: MBB (1 -> 2): unroll.c:22:11 W=87820 0x7fffa07b /
0x80000000 = 100.00% --> 0x00000000 / 0x80000000 = 0.00%
Set branch fs prob: MBB (3 -> 5): unroll.c:24:11 W=87820 0x04a8dc00 /
0x80000000 = 3.64% --> 0x80000000 / 0x80000000 = 100.00%
Set branch fs prob: MBB (3 -> 4): unroll.c:24:11 W=87820 0x7b572400 /
0x80000000 = 96.36% --> 0x00000000 / 0x80000000 = 0.00%
Set branch fs prob: MBB (9 -> 11): unroll.c:22:11 W=87820 0x00005f85 /
0x80000000 = 0.00% --> 0x80000000 / 0x80000000 = 100.00%
Set branch fs prob: MBB (9 -> 10): unroll.c:22:11 W=87820 0x7fffa07b /
0x80000000 = 100.00% --> 0x00000000 / 0x80000000 = 0.00%
Set branch fs prob: MBB (15 -> 17): unroll.c:24:11 W=87820 0x04a8dc00 /
0x80000000 = 3.64% --> 0x17248215 / 0x80000000 = 18.08%
Set branch fs prob: MBB (15 -> 16): unroll.c:24:11 W=87820 0x7b572400 /
0x80000000 = 96.36% --> 0x68db7deb / 0x80000000 = 81.92%
4. Implementation in LLVM
4.1 AFDO profile data format
The AFDO profile format remains largely the same. Samples could be splitted into
multiple parts with different discriminator copy components. SampleLoader will
aggregate the samples dynamically when reading the AFDO profile based on the
discriminator masks. Current discriminator encoding will be removed and related
code in the compiler will be removed.
The hierarchical discriminator in FS-AFDO captures most of the cases of
multiplication discriminator in currenting encoding scheme. As a matter of fact,
the resulting count values are usually more precise as each individual
discriminator may have different count values. There is one exception: when the
unrolling body has a single BB. FS-AFDO will have the correct sample count after
unrolling as all the instructions residing in the same BB. In our prototype
implementation, we did not find this issue to have a visible performance impact.
4.2 Compiler changes
4.2.1 Add two new CodeGen passes:
AddFSDiscriminatorsPass: adds flow sensitive discriminators.
FSProfileLoaderPass: read FS profile and set branch probabilities.
FSProfileLoadPass will reuse most of the IR level SampleProfileLoader algorithm
and heuristics, but we need to reimplement at Machine level. We will not need to
update branch metadata. We will change the branch probabilities directly.
4.2.2 Enhance SampleProfileReader to handle FS discriminators.
Add a new field of MaskedBitFrom to SampleProfileReader. This is the lower mask
bit being used for this pass instance. For each profile loader, we dynamically
compute samples based on bit masks -- this is done in readImpl() method where
the masked discriminators are used to aggregate the samples.
4.3 AutoFDO tool changes
The position type used in PositionCountMap and Callsite type is of type uint32.
It is divided into high 16 bits as file offset and lower 16 bits as
discriminator. This is probably OK for the current discriminators as they are
numbered sequentially. But this would create too many clashes in FS
discriminators. We need to extend to uint64 for the position: a full 32-bit for
line number and full 32-bit for discriminators.
5. Experimental performance
A prototype was implemented and a performance test was done on critical
applications in Google.
Test binary:
(1) build the first round binary with an existing AFDO profile with
AddFSDiscriminator enabled.
(2) collect the PERF samples and convert them to the FS AFDO profile.
(3) build an optimized binary using the FSAFDO profile from step (2).
Base binary:
(1) build the first round binary with an existing AFDO profile
(2) collect the PERF samples and convert them to the AFDO profile.
(3) build an optimized binary using profile from step (2).
Run time performance:
The test binary shows 0.9% to 1.3% performance improvement over the base binary.
Profile size:
Profile size is modestly increased. Using the Google benchmark as the example,
the profile size for the text form increased by 4% to 7% while the binary form
profile size increased by 5% to 8%.
Compiler time:
Prototype implementation shows about 10% increase of build time (measued by the
elapsed time which includes two extra FSAddDiscrimniator passes and one extra
FSProfileSampleLoader pass). Profile readings are not shared b/w
FSProfileSampleLoader and current ProfileSampleLoader. We expect a smaller
compiler time increase with better data structures that share a profile loader.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20201119/1c5baedc/attachment-0001.html>
Rong Xu via llvm-dev
2020-Nov-20 07:29 UTC
[llvm-dev] [RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
Hi Wenlei, thanks for comments and questions! Please see my comments inline. On Wed, Nov 18, 2020 at 4:34 PM Wenlei He <wenlei at fb.com> wrote:> Hi Rong, > > > > This is a very interesting proposal. We've also observed profile quality > degradation from CFG destructive pass like loop rotate, and I can see how > this proposal would help improve quality of profile that drives later > optimization passes in the pipeline. I have a few questions. > > > > - How does this affect today's AutoFDO? Specifically, can users > upgrade compiler with FS-AutoFDO support but without refreshing their > profile? I think it's important to make new improvement like this as > opt-in, so other users of AutoFDO can choose if and when they want to make > the switch from AutoFDO to FS-AutoFDO. With the proposed changes to > discriminator encoding, sounds like we are going to eliminate duplication > factor etc. altogether. In that case, multiple FS-AutoFDO sample profile > loading would be required in order to not regress from today's AutoFDO due > to lack of duplication factor. Is that correct? If so, this change as is > can break backward compatibility with today's AutoFDO profile. It’d be > great to handle discriminator in a way that compatible with today’s > AutoFDO. > > FS-AFDO has a new discriminator encoding scheme. You are right that it'snot backward compatible with current AFDO profiles. The conflict is mainly from the duplication factors -- as it will be completely replaced. The only possible regression I can think of is using current AFDO profile as an FS-AFDO profile. Due to the difference in encoding, some discriminators might be handled incorrectly. I think we can add a flag to the AFDO header to let the compiler know if this is the new FS-AFDO profile. We did have some discussions on the deployment at Google -- it will require two stages: the first stage enables the new discriminator scheme but still reads current AFDO profile. In the second stage, we will use the FS-AFDO profile and enable FS-AFDO passes.> > - Profile loading usually happens early in the pipeline in order to > avoid mismatch between profiling and optimization. In the case of > FS-AutoFDO, some mismatch may be inevitable for the later FS profile > loading. In practice, how significant is the mismatch in later profile > loading? Have you tried to see by how much using multiple iterations (like > CSPGO) can further help profile quality and improve performance? I > understand this is not practical for production use, but could give us data > points as to what's left on the table due to mismatch of later FS profile > loading. > > I did measure the coverage (The same mechanism used in SampleLoader). Forour non-triel programs, we can use about 74% of the samples (for the FS-AFDO loader before machine block placement). I also tried the iterative FS-AFDO. Unfortunately iterative AFDO alone shows a non-trival regression which is larger than the gain for FS-AFDO. I did that experiment a while ago. I might need to revisit it later.> > > - In your performance experiment, where is the extra > FSProfileSampleLoader in the pipeline, right before machine block placement > or somewhere else? Have you tried adding more than one > FSProfileSampleLoader and is there extra perf gain for more than one > FSProfileSampleLoader? > > The main performance is from the FSProfileSampleLoader before machineblock placement. I did add one more loader after block placement (and a new BranchFolder pass to do the tail merge). But that did not give me visible performance. That said, I think there is still lots of tuning work to do for better performance. I will definitely try more experimental loader passes.> > - For the final discriminator assignment, is this because we take MAX > of samples on addresses from the same location? So if there's any late code > duplication after last discriminator assignment, we need a final > discriminator assignment to turn MAX into SUM for earlier FS profile? > > Yes. That's the main consideration. create_llvm_profile still use MAX forthe same locations. The final discriminator assignment is to assign different discrimnatiro for each duplicated code so their sample can be summed.> > > - While changing discriminator encoding, have you considered encode > block Id into discriminator so AutoFDO can be CFG aware? This is something > Wei brought up earlier in the context of CSSPGO, and we didn't go that > route partly because it's hard to do it without breaking compatibility. > > Discriminator only has 32 bits. It's hard to inject more information. Inthe early stage of this work, we did consider using BB labels to encode more the flow information. This would use Sri's work on BasicBlock Sections. But that required significant change than the discriminator based method.> > > This would be complementary to the context-sensitive SPGO we proposed > earlier - would be nice to make PGO/FDO both context-sensitive and > flow-sensitive. I think the flow-sensitive part could also integrate with > pseudo-probe, essentially we can append FS discriminator later multiple > times the same way as you proposed here, except that the "line" part would > be "probeId". >Yes. I think this idea could be used in your pseudo-probe work. The good thing about pseudo-probe is that it does not have a real instruction and the compiler can afford multiple round probe insertions.> > > +Hongtao as well. > > > > Thanks, > > Wenlei > > > > *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> > *Date: *Tuesday, November 17, 2020 at 9:55 AM > *To: *llvm-dev <llvm-dev at lists.llvm.org> > *Cc: *David Li <davidxl at google.com> > *Subject: *[llvm-dev] [RFC] Control Flow Sensitive AutoFDO (FS-AFDO) > > Hi all, > > Here I include an RFC for control flow sensitive AutoFDO (FS-AFDO). This > is a joint work with David Li. Questions and feedback are welcome. > > > > Thanks, > > > > Rong > > > > ============> > [RFC] Control Flow Sensitive AutoFDO (FS-AFDO) > > > > 1. Motivation > > > > AFDO profile is derived from PMU samples from running an earlier build > binary. PMU samples are indexed by the IP addresses. An offline tool uses > the debug line number to map PMU samples to the source line. The accuracy > of debug information is critical for AFDO performance. Thanks to the > continuous efforts from the community, we have fixed many debug info > related issues that affect AFDO. > > > > One of the remaining areas that affects the profile accuracy is the > handling of duplicated code. Aggressive compiler optimizations can change > functions’ control flow significantly. The transformation can duplicate > branches in multiple places and new branches that do not exist in the > original source code might also be introduced. All these branches, even > though originated from the same source line, may have very different > behavior -- which is called (control) flow sensitivity. > > > > Current AFDO handles the duplication in a pretty coarse grain way. When > AFDO sees samples with different IP addresses that map to the same source > line, it uses a max function to get the maximum number of samples and > attach to that line. From this prospective, current AFDO is not control > flow sensitive. Consider the following example: > > // Original code: > > for() > > Stmt1; // line 10 > > > > // After a versioning optimization > > for() { > > if (cond1) > > stmt1; // line 10, 100 samples, address: C1 > > else if (cond2) > > stmt1; // line 10, 80 samples, address: C2 > > else > > stmt1; // line 10, 20 samples, address: C3 > > } > > In the final binary, stmt1 is cloned 3 times by the versioning > optimization. The address C1, gets 100 samples, the address C2 gets 80 > samples, and the address C3 gets 20 samples. When converting Perf samples > to AFDO profile, a max function is used for these samples as they have the > same source line. In the AFDO profile, we will have a sample entry of <10, > 100>, which means we have a sample value of 100 for line 10 of this > function. We will not have branch information for cond1, cond2, as they do > not exist in the sample loading phase. The profile count for stmt1 is also > off the actual value (200 = 100 +80 + 20) by 50%. > > > > In this work, we propose a method that maintains more precise AFDO profile > counts for the duplicated control flow. > > > > 2. FS-AFDO Designs > > > > In this work, we don't aim to keep track of the transformations and make > the flow sensitivity available across the compilation pipelines. Rather, we > identify a set of target optimizations that can potentially benefit most > from a more precise profile. For each of such optimizations, we use a > helper pass to recompute the flow sensitive profile just before this pass. > When loading this profile, we will recompute the branch probability and > hopefully improve the optimization quality for that target pass. > > > > The flow sensitivity in the profile is through a hierarchical > discriminator scheme that can better preserve flow information in the Perf > profile. > > > > 2.1. Current Use of Discriminators > > The AFDO profile contains a discriminator field whose goal is to > differentiate statements with the same line number in the source code > (largely from MACRO expansions). > > > > Discriminators were later expanded to get more accurate profile counts. It > is expanded to 3 components which can partially address the issue. The main > objective of that work was to get the correct profile count for the > unrolled (or vectorized) BBs. The discriminator is divided into 3 > components: > > * Base discriminator, assigned by AddDiscriminators > > * Duplication factor: used in loop unroll and loop vectorization (when > cloning the loop body). This factor will be multiplied by sample counts in > create_llvm_prof tool to get the count value before duplication. > > * Copy Identifier: reserved by not currently used. > > > > Duplication factor was explicitly assigned in loop unroll and loop > vectorization. Instruction cloning will copy the discriminator. All the > cloned instruction instances, other than inlining, loop unroll and > vectorization, have the same discriminator. In the create_llvm_prof tool, > when assigning the sample count value for <offset_to_func_head, > discriminator>, the max sample value among all the clones will be the > assigned value. > > > > 2.2 Hierarchical discriminators > > In FS-AFDO, discriminator assigning happens multiple times -- > conceptually, one assignment for each optimization that can benefit from > more precise FS profile data (let’s call it a target optimization pass). > The discriminator is divided into multiple sections. Earlier passes have > the lower bits and each instance of assigning access the bits in their > sections only. > > > > For example, there could be 3 rounds of discriminator assigning: the base > discriminator which corresponds to existing discriminator assigning, > discriminator for pass N, and discriminator for pass M. Let’s assume each > round uses 4 bits. If the instruction Inst1 is from BB0 and it’s copied to > BB1, it will have a discriminator 0x1. If BB1:Inst1 is copied to BB2:intr2 > before Pass N, this instruction will have a new discriminator of 0x11 > associated with pass N. Similarly, each copy of Inst1 before Pass M will > have its own discriminators using bit 8 to bit 12. > > > > When converting Perf samples to the AFDO profile, we will have the sample > count for each discriminator. The AFDO profile loader reads all the > samples and masks out the bit used by later rounds. For example, when > loading the profile for the base discriminator, it only checks the first 4 > bits. All the samples cloned from the same base discriminator will be > aggregated together. Profile loader for pass N will only check the first 8 > bits of the discriminator and sum up all the samples with the same > discriminator bit 0 to bit 7. > > > > 2.3 Profile Mismatch > > > > The above describes the ideal situation for loading the profile -- all the > discriminators in the Perf binary match the discriminator in optimized > build. In reality, the compilation for the optimized binary, with a new > profile, might perform different transformations (from the compilation of > the Perf binary). It can alter the program control flow for the downstream > passes, which leads to a discriminator mismatch. > > > > If the mismatch happens, the profiling information will not be updated by > the SampleLoader to the affected BBs. They will inherit the profile > information (like branch probabilities) from the previous round of > SampleLoader. > > > > The FS component in the discriminators basically uses a sequential number, > just like in the base discriminator. But to handle potential mismatch, we > also encode the total number of new discriminators. This prevents the > misplacement of profile counts to a wrong code clone. There are limited > bits for each FS discriminator component. We choose to use a hash value as > the FS discriminator. The hash function takes the following as the input: > > * the sequential number of the current clone, > > * the total number of the current clone. > > > > A second measure to deal with mismatch is to use the profile coverage. We > summarize the samples used by SampleLoad for each function, and compare the > total samples in the profile. A warning is printed out if the coverage > (ratio of used samples and total sample) is lower than a threshold value. > In which case, the user should rebuild the Perf binary and collect the Perf > profile again. > > > > 2.4 Example of pass pipeline > > Here is the one sample pass pipeline for AFDO + ThinLTO compilation. > > > > // Pass Pipeline: > > InsertBaseDiscrimator > > SampleProfileLoader in FE > > ... > > CM inline > > ... > > SampleProfileLoader in BE > > ... > > AddFSDiscriminator(kind_mbp) > > FSProfileLoader(kind_mbp) > > MachineBlockPlacement > > AddFSDiscriminator(kind_branchfolding) > > FSProfileLoader(kind_branchfolding) > > BranchFoldingPass > > … > > AddFSDiscriminator(kind_final) > > > > In the above pass pipeline. MachineBlockPlacement and BranchFoldingPass > are the two target optimization passes. MachineBlockPlacement can obviously > be benefited from more precise profile information. The tail merge in > BranchFoldPass can undo many unneeded tail duplications in > MachineBlockPlacment. > > > > Note that we have a FS discriminator assigning pass at the final of > compilation pipeline. The reason is to assign a different discriminator for > each cloned instruction in the final binary. By doing this, we can get a > more precise profile for earlier passes by summing the counter values from > different discriminators. > > > > 3. A running example > > > > The test program is shown as the following. We pay most attention to the > branch in line 5 and line 7 in function foo(). Inner loop at line 3 will be > fully unrolled after SampleProfileLoader. Outer loop at line 2 is intact by > the optimizer. > > __attribute__((noinline)) int bar(int i){ > > volatile int j; > > j = i; > > return j; > > } > > > > __attribute__((noinline)) void work(int i){ > > ... > > } > > > > __attribute__((noinline)) void foo(){ // line 17 of unroll.c > > int i, j; > > for (j = 0; j < 48; j++) > > for (i = 0; i < 4; i++) { > > int ii = bar(i + j * 48); > > if (ii % 2) // line 22 of unroll.c > > work(ii * 2); > > if (ii % 4) // line 24 of unroll.c > > work(ii * 3); > > } > > } > > > > Current AFDO will produce the following profile: > > foo:30387110:1606 void foo() { > > 0: 1606 int i, j; > > 2.1: 83748 for (j = 0; j < 48; j++) { > > 4: 351836 bar:84932 int ii = bar(i + j * 48); > > 5: 351836 if (ii % 2) > > 6: 351836 work:79741 work(ii * 2) > > 7: 350420 if (ii % 4) > > 8: 338340 work:75287 work(ii * 3) > > 10: 1302 } > > Function foo() entry sample count is 1606. Line 2.1 is the outer loop body > with a sample count of 83748. Line 4 to line 8 is inner loop body count. > It’s scaled by an unrolled multiplication discriminator of 4. This resulted > in a higher value of sample count: it uses the Max of 4 clones and > multiplies by 4. > > > > Here is the FS AFDO profile. Basically it has a separated profile count > for each unrolled instructions. > > > > foo:8722583:1597 void foo() { > > 0: 1597 int i, j; > > 2.1: 83694 for (j = 0; j < 48; j++) > > 4: 87205 bar:85542 int ii = bar(i + j * 48); > > 4.8448: 82042 bar:82155 int ii = bar(i + j * 48); > > 4.13312: 84795 bar:84802 int ii = bar(i + j * 48); > > 4.13568: 87820 bar:87718 int ii = bar(i + j * 48); > > 5: 87205 if (ii % 2) > > 5.8448: 82042 if (ii % 2) > > 5.13312: 79604 if (ii %2) > > 5.13568: 87820 if (ii %2) > > 6: 0 work(ii * 2); > > 6.8448: 0 work(ii * 2); > > 6.13312: 79604 work:79652 work(ii * 2); > > 6.13568: 87820 work:90500 work(ii * 2); > > 7: 87493 if (ii %4) > > 7.8448: 84624 if (ii % 4) > > 7.13312: 71942 if (ii % 4) > > 7.13568: 83552 if (ii % 4) > > 8: 0 work(ii * 3); > > 8.8448: 84624 work:87655 work(ii * 3); > > 8.13312: 71942 work:75137 work(ii *3); > > 8.13568: 83552 work:91223 work(ii *3); > > 10: 1304 } > > > > Let’s look at line 6 and line 8. Each of which has 4 records. When we use > these records in SampleLoader, these 4 records will be combined into one > after masking out the bits: line 6 will have one record of > > 6: 167424 work: 170152 > > Line 8 will have one record: > > 8: 240118 work: 254015 > > Comparing the sample count in line 5 and 7, we can easily see that line 5 > has ~50% taken and line 7 has ~75% taken. This is in contrast to the > current AFDO profile, where both lines have a 100% taken branch. > > > > When we load the above profile before the MachineBlockPlacement pass, each > record will maintain its own instance. We can see that at line 5, two of > the branches have ~0% taken probability, while the rest two have ~100% > taken probability. For line 7, one branch has 0% taken probability, while > the reset three has ~100% taken probability. We will reset the branch > probabilities as shown in the following messages: > > Set branch fs prob: MBB (1 -> 3): unroll.c:22:11 W=87820 0x00005f85 / > 0x80000000 = 0.00% --> 0x80000000 / 0x80000000 = 100.00% > > Set branch fs prob: MBB (1 -> 2): unroll.c:22:11 W=87820 0x7fffa07b / > 0x80000000 = 100.00% --> 0x00000000 / 0x80000000 = 0.00% > > Set branch fs prob: MBB (3 -> 5): unroll.c:24:11 W=87820 0x04a8dc00 / > 0x80000000 = 3.64% --> 0x80000000 / 0x80000000 = 100.00% > > Set branch fs prob: MBB (3 -> 4): unroll.c:24:11 W=87820 0x7b572400 / > 0x80000000 = 96.36% --> 0x00000000 / 0x80000000 = 0.00% > > Set branch fs prob: MBB (9 -> 11): unroll.c:22:11 W=87820 0x00005f85 / > 0x80000000 = 0.00% --> 0x80000000 / 0x80000000 = 100.00% > > Set branch fs prob: MBB (9 -> 10): unroll.c:22:11 W=87820 0x7fffa07b / > 0x80000000 = 100.00% --> 0x00000000 / 0x80000000 = 0.00% > > Set branch fs prob: MBB (15 -> 17): unroll.c:24:11 W=87820 0x04a8dc00 / > 0x80000000 = 3.64% --> 0x17248215 / 0x80000000 = 18.08% > > Set branch fs prob: MBB (15 -> 16): unroll.c:24:11 W=87820 0x7b572400 / > 0x80000000 = 96.36% --> 0x68db7deb / 0x80000000 = 81.92% > > > > 4. Implementation in LLVM > > > > 4.1 AFDO profile data format > > The AFDO profile format remains largely the same. Samples could be > splitted into multiple parts with different discriminator copy components. > SampleLoader will aggregate the samples dynamically when reading the AFDO > profile based on the discriminator masks. Current discriminator encoding > will be removed and related code in the compiler will be removed. > > > > The hierarchical discriminator in FS-AFDO captures most of the cases of > multiplication discriminator in currenting encoding scheme. As a matter of > fact, the resulting count values are usually more precise as each > individual discriminator may have different count values. There is one > exception: when the unrolling body has a single BB. FS-AFDO will have the > correct sample count after unrolling as all the instructions residing in > the same BB. In our prototype implementation, we did not find this issue to > have a visible performance impact. > > > > 4.2 Compiler changes > > > > 4.2.1 Add two new CodeGen passes: > > AddFSDiscriminatorsPass: adds flow sensitive discriminators. > > FSProfileLoaderPass: read FS profile and set branch probabilities. > > FSProfileLoadPass will reuse most of the IR level SampleProfileLoader > algorithm and heuristics, but we need to reimplement at Machine level. We > will not need to update branch metadata. We will change the branch > probabilities directly. > > > > 4.2.2 Enhance SampleProfileReader to handle FS discriminators. > > Add a new field of MaskedBitFrom to SampleProfileReader. This is the lower > mask bit being used for this pass instance. For each profile loader, we > dynamically compute samples based on bit masks -- this is done in > readImpl() method where the masked discriminators are used to aggregate the > samples. > > > > 4.3 AutoFDO tool changes > > The position type used in PositionCountMap and Callsite type is of type > uint32. It is divided into high 16 bits as file offset and lower 16 bits as > discriminator. This is probably OK for the current discriminators as they > are numbered sequentially. But this would create too many clashes in FS > discriminators. We need to extend to uint64 for the position: a full 32-bit > for line number and full 32-bit for discriminators. > > > > 5. Experimental performance > > A prototype was implemented and a performance test was done on critical > applications in Google. > > Test binary: > > (1) build the first round binary with an existing AFDO profile with > AddFSDiscriminator enabled. > > (2) collect the PERF samples and convert them to the FS AFDO profile. > > (3) build an optimized binary using the FSAFDO profile from step (2). > > > > Base binary: > > (1) build the first round binary with an existing AFDO profile > > (2) collect the PERF samples and convert them to the AFDO profile. > > (3) build an optimized binary using profile from step (2). > > > > Run time performance: > > The test binary shows 0.9% to 1.3% performance improvement over the base > binary. > > > > Profile size: > > Profile size is modestly increased. Using the Google benchmark as the > example, the profile size for the text form increased by 4% to 7% while the > binary form profile size increased by 5% to 8%. > > > > Compiler time: > > Prototype implementation shows about 10% increase of build time (measued > by the elapsed time which includes two extra FSAddDiscrimniator passes and > one extra FSProfileSampleLoader pass). Profile readings are not shared b/w > FSProfileSampleLoader and current ProfileSampleLoader. We expect a smaller > compiler time increase with better data structures that share a profile > loader. > > > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20201119/efb96b5e/attachment-0001.html>
Apparently Analagous Threads
- [RFC] Control Flow Sensitive AutoFDO (FS-AFDO)
- (RFC) Encoding code duplication factor in discriminator
- (RFC) Encoding code duplication factor in discriminator
- (RFC) Encoding code duplication factor in discriminator
- (RFC) Encoding code duplication factor in discriminator