Dehao Chen via llvm-dev
2016-Oct-27  20:11 UTC
[llvm-dev] (RFC) Encoding code duplication factor in discriminator
The impact to debug_line is actually not small. I only implemented the part 1 (encoding duplication factor) for loop unrolling and loop vectorization. The debug_line size overhead for "-O2 -g1" binary of speccpu C/C++ benchmarks: 433.milc 23.59% 444.namd 6.25% 447.dealII 8.43% 450.soplex 2.41% 453.povray 5.40% 470.lbm 0.00% 482.sphinx3 7.10% 400.perlbench 2.77% 401.bzip2 9.62% 403.gcc 2.67% 429.mcf 9.54% 445.gobmk 7.40% 456.hmmer 9.79% 458.sjeng 9.98% 462.libquantum 10.90% 464.h264ref 30.21% 471.omnetpp 0.52% 473.astar 5.67% 483.xalancbmk 1.46% mean 7.86% Dehao On Thu, Oct 27, 2016 at 11:55 AM, Xinliang David Li <davidxl at google.com> wrote:> Do you have an estimate of the debug_line size increase? I guess it will > be small. > > David > > On Thu, Oct 27, 2016 at 11:39 AM, Dehao Chen <dehao at google.com> wrote: > >> Motivation: >> Many optimizations duplicate code. E.g. loop unroller duplicates the loop >> body, GVN duplicates computation, etc. The duplicated code will share the >> same debug info with the original code. For SamplePGO, the debug info is >> used to present the profile. Code duplication will affect profile accuracy. >> Taking loop unrolling for example: >> >> #1 foo(); >> #2 for (i = 0; i < N; i++) { >> #3 bar(); >> #4 } >> >> If N is 8 during runtime, a reasonable profile will look like: >> >> #1: 10 >> #3: 80 >> >> But if the compiler unrolls the loop by a factor of 4, the callsite to >> bar() is duplicated 4 times and the profile will look like: >> >> #1: 10 >> #3: 20 >> >> The sample count for #3 is 20 because all 4 callsites to bar() are >> sampled 20 times each, and they shared the same debug loc (#3) so that 20 >> will be attributed to #3 (If one debugloc is mapped by multiple >> instructions, the max sample count of these instructions is used as >> debugloc's sample count). >> >> When loading this profile into compiler, it will think the loop trip >> count is 2 instead of 8. >> >> Proposal: >> When compiler duplicates code, it encodes the duplication info in the >> debug info. As the duplication is not interesting to debugger, I propose to >> encode this as part of the discriminator. >> >> There are 2 types of code duplication: >> >> 1. duplicated code are guaranteed to have the same execution count (e.g. >> loop unroll and loop vectorize). We can record the duplication factor, for >> the above example "4" is recorded in the discriminator. >> 2. duplicated code that may have different execution count (e.g. loop >> peel and gvn). For a same debugloc, a unique number is assigned to each >> copy and encoded in the discriminator. >> >> Assume that the discriminator is uint32. The traditional discriminator is >> less than 256, let's take 8 bit for it. For duplication factor (type 1 >> duplication), we assume the maximum unroll_factor * vectorize_factor is >> less than 256, thus 8 bit for it. For unique number(type 2 duplication), we >> assume code is at most duplicated 32 times, thus 5 bit for it. Overall, we >> still have 11 free bits left in the discriminator encoding. >> >> Let's take the original source as an example, after loop unrolling and >> peeling, the code may looks like: >> >> for (i = 0; i < N & 3; i+= 4) { >> foo(); // discriminator: 0x40 >> foo(); // discriminator: 0x40 >> foo(); // discriminator: 0x40 >> foo(); // discriminator: 0x40 >> } >> if (i++ < N) { >> foo(); // discriminator: 0x100 >> if (i++ < N) { >> foo(); // discriminator: 0x200 >> if (i++ < N) { >> foo(); // discriminator: 0x300 >> } >> } >> } >> >> The cost of this change would be increased debug_line size because: 1. we >> are recording more discriminators 2. discriminators become larger and will >> take more ULEB128 encoding. >> >> The benefit is that the sample pgo profile can accurately represent the >> code execution frequency. And we do not need to introduce new building >> blocks to debug info. >> >> Comments? >> >> Thanks, >> Dehao >> > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/75e83e9c/attachment.html>
Xinliang David Li via llvm-dev
2016-Oct-27  20:18 UTC
[llvm-dev] (RFC) Encoding code duplication factor in discriminator
The large percentages are from those tiny benchmarks. If you look at omnetpp (0.52%), and xalanc (1.46%), the increase is small. To get a better average increase, you can sum up total debug_line size before and after and compute percentage accordingly. David On Thu, Oct 27, 2016 at 1:11 PM, Dehao Chen <dehao at google.com> wrote:> The impact to debug_line is actually not small. I only implemented the > part 1 (encoding duplication factor) for loop unrolling and loop > vectorization. The debug_line size overhead for "-O2 -g1" binary of speccpu > C/C++ benchmarks: > > 433.milc 23.59% > 444.namd 6.25% > 447.dealII 8.43% > 450.soplex 2.41% > 453.povray 5.40% > 470.lbm 0.00% > 482.sphinx3 7.10% > 400.perlbench 2.77% > 401.bzip2 9.62% > 403.gcc 2.67% > 429.mcf 9.54% > 445.gobmk 7.40% > 456.hmmer 9.79% > 458.sjeng 9.98% > 462.libquantum 10.90% > 464.h264ref 30.21% > 471.omnetpp 0.52% > 473.astar 5.67% > 483.xalancbmk 1.46% > mean 7.86% > Dehao > > On Thu, Oct 27, 2016 at 11:55 AM, Xinliang David Li <davidxl at google.com> > wrote: > >> Do you have an estimate of the debug_line size increase? I guess it will >> be small. >> >> David >> >> On Thu, Oct 27, 2016 at 11:39 AM, Dehao Chen <dehao at google.com> wrote: >> >>> Motivation: >>> Many optimizations duplicate code. E.g. loop unroller duplicates the >>> loop body, GVN duplicates computation, etc. The duplicated code will share >>> the same debug info with the original code. For SamplePGO, the debug info >>> is used to present the profile. Code duplication will affect profile >>> accuracy. Taking loop unrolling for example: >>> >>> #1 foo(); >>> #2 for (i = 0; i < N; i++) { >>> #3 bar(); >>> #4 } >>> >>> If N is 8 during runtime, a reasonable profile will look like: >>> >>> #1: 10 >>> #3: 80 >>> >>> But if the compiler unrolls the loop by a factor of 4, the callsite to >>> bar() is duplicated 4 times and the profile will look like: >>> >>> #1: 10 >>> #3: 20 >>> >>> The sample count for #3 is 20 because all 4 callsites to bar() are >>> sampled 20 times each, and they shared the same debug loc (#3) so that 20 >>> will be attributed to #3 (If one debugloc is mapped by multiple >>> instructions, the max sample count of these instructions is used as >>> debugloc's sample count). >>> >>> When loading this profile into compiler, it will think the loop trip >>> count is 2 instead of 8. >>> >>> Proposal: >>> When compiler duplicates code, it encodes the duplication info in the >>> debug info. As the duplication is not interesting to debugger, I propose to >>> encode this as part of the discriminator. >>> >>> There are 2 types of code duplication: >>> >>> 1. duplicated code are guaranteed to have the same execution count (e.g. >>> loop unroll and loop vectorize). We can record the duplication factor, for >>> the above example "4" is recorded in the discriminator. >>> 2. duplicated code that may have different execution count (e.g. loop >>> peel and gvn). For a same debugloc, a unique number is assigned to each >>> copy and encoded in the discriminator. >>> >>> Assume that the discriminator is uint32. The traditional discriminator >>> is less than 256, let's take 8 bit for it. For duplication factor (type 1 >>> duplication), we assume the maximum unroll_factor * vectorize_factor is >>> less than 256, thus 8 bit for it. For unique number(type 2 duplication), we >>> assume code is at most duplicated 32 times, thus 5 bit for it. Overall, we >>> still have 11 free bits left in the discriminator encoding. >>> >>> Let's take the original source as an example, after loop unrolling and >>> peeling, the code may looks like: >>> >>> for (i = 0; i < N & 3; i+= 4) { >>> foo(); // discriminator: 0x40 >>> foo(); // discriminator: 0x40 >>> foo(); // discriminator: 0x40 >>> foo(); // discriminator: 0x40 >>> } >>> if (i++ < N) { >>> foo(); // discriminator: 0x100 >>> if (i++ < N) { >>> foo(); // discriminator: 0x200 >>> if (i++ < N) { >>> foo(); // discriminator: 0x300 >>> } >>> } >>> } >>> >>> The cost of this change would be increased debug_line size because: 1. >>> we are recording more discriminators 2. discriminators become larger and >>> will take more ULEB128 encoding. >>> >>> The benefit is that the sample pgo profile can accurately represent the >>> code execution frequency. And we do not need to introduce new building >>> blocks to debug info. >>> >>> Comments? >>> >>> Thanks, >>> Dehao >>> >> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/68990d62/attachment.html>
Dehao Chen via llvm-dev
2016-Oct-27  20:22 UTC
[llvm-dev] (RFC) Encoding code duplication factor in discriminator
If we use sum instead of geomean, then the diff will be 5.59% Yes, most of the size growth are from small applications, with one exception: h264, which has significant amount of loops to unroll/vectorize. Dehao On Thu, Oct 27, 2016 at 1:18 PM, Xinliang David Li <davidxl at google.com> wrote:> The large percentages are from those tiny benchmarks. If you look at > omnetpp (0.52%), and xalanc (1.46%), the increase is small. To get a better > average increase, you can sum up total debug_line size before and after and > compute percentage accordingly. > > David > > On Thu, Oct 27, 2016 at 1:11 PM, Dehao Chen <dehao at google.com> wrote: > >> The impact to debug_line is actually not small. I only implemented the >> part 1 (encoding duplication factor) for loop unrolling and loop >> vectorization. The debug_line size overhead for "-O2 -g1" binary of speccpu >> C/C++ benchmarks: >> >> 433.milc 23.59% >> 444.namd 6.25% >> 447.dealII 8.43% >> 450.soplex 2.41% >> 453.povray 5.40% >> 470.lbm 0.00% >> 482.sphinx3 7.10% >> 400.perlbench 2.77% >> 401.bzip2 9.62% >> 403.gcc 2.67% >> 429.mcf 9.54% >> 445.gobmk 7.40% >> 456.hmmer 9.79% >> 458.sjeng 9.98% >> 462.libquantum 10.90% >> 464.h264ref 30.21% >> 471.omnetpp 0.52% >> 473.astar 5.67% >> 483.xalancbmk 1.46% >> mean 7.86% >> Dehao >> >> On Thu, Oct 27, 2016 at 11:55 AM, Xinliang David Li <davidxl at google.com> >> wrote: >> >>> Do you have an estimate of the debug_line size increase? I guess it will >>> be small. >>> >>> David >>> >>> On Thu, Oct 27, 2016 at 11:39 AM, Dehao Chen <dehao at google.com> wrote: >>> >>>> Motivation: >>>> Many optimizations duplicate code. E.g. loop unroller duplicates the >>>> loop body, GVN duplicates computation, etc. The duplicated code will share >>>> the same debug info with the original code. For SamplePGO, the debug info >>>> is used to present the profile. Code duplication will affect profile >>>> accuracy. Taking loop unrolling for example: >>>> >>>> #1 foo(); >>>> #2 for (i = 0; i < N; i++) { >>>> #3 bar(); >>>> #4 } >>>> >>>> If N is 8 during runtime, a reasonable profile will look like: >>>> >>>> #1: 10 >>>> #3: 80 >>>> >>>> But if the compiler unrolls the loop by a factor of 4, the callsite to >>>> bar() is duplicated 4 times and the profile will look like: >>>> >>>> #1: 10 >>>> #3: 20 >>>> >>>> The sample count for #3 is 20 because all 4 callsites to bar() are >>>> sampled 20 times each, and they shared the same debug loc (#3) so that 20 >>>> will be attributed to #3 (If one debugloc is mapped by multiple >>>> instructions, the max sample count of these instructions is used as >>>> debugloc's sample count). >>>> >>>> When loading this profile into compiler, it will think the loop trip >>>> count is 2 instead of 8. >>>> >>>> Proposal: >>>> When compiler duplicates code, it encodes the duplication info in the >>>> debug info. As the duplication is not interesting to debugger, I propose to >>>> encode this as part of the discriminator. >>>> >>>> There are 2 types of code duplication: >>>> >>>> 1. duplicated code are guaranteed to have the same execution count >>>> (e.g. loop unroll and loop vectorize). We can record the duplication >>>> factor, for the above example "4" is recorded in the discriminator. >>>> 2. duplicated code that may have different execution count (e.g. loop >>>> peel and gvn). For a same debugloc, a unique number is assigned to each >>>> copy and encoded in the discriminator. >>>> >>>> Assume that the discriminator is uint32. The traditional discriminator >>>> is less than 256, let's take 8 bit for it. For duplication factor (type 1 >>>> duplication), we assume the maximum unroll_factor * vectorize_factor is >>>> less than 256, thus 8 bit for it. For unique number(type 2 duplication), we >>>> assume code is at most duplicated 32 times, thus 5 bit for it. Overall, we >>>> still have 11 free bits left in the discriminator encoding. >>>> >>>> Let's take the original source as an example, after loop unrolling and >>>> peeling, the code may looks like: >>>> >>>> for (i = 0; i < N & 3; i+= 4) { >>>> foo(); // discriminator: 0x40 >>>> foo(); // discriminator: 0x40 >>>> foo(); // discriminator: 0x40 >>>> foo(); // discriminator: 0x40 >>>> } >>>> if (i++ < N) { >>>> foo(); // discriminator: 0x100 >>>> if (i++ < N) { >>>> foo(); // discriminator: 0x200 >>>> if (i++ < N) { >>>> foo(); // discriminator: 0x300 >>>> } >>>> } >>>> } >>>> >>>> The cost of this change would be increased debug_line size because: 1. >>>> we are recording more discriminators 2. discriminators become larger and >>>> will take more ULEB128 encoding. >>>> >>>> The benefit is that the sample pgo profile can accurately represent the >>>> code execution frequency. And we do not need to introduce new building >>>> blocks to debug info. >>>> >>>> Comments? >>>> >>>> Thanks, >>>> Dehao >>>> >>> >>> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/41fde2c3/attachment.html>
Apparently Analagous Threads
- (RFC) Encoding code duplication factor in discriminator
- (RFC) Encoding code duplication factor in discriminator
- (RFC) Encoding code duplication factor in discriminator
- (RFC) Encoding code duplication factor in discriminator
- (RFC) Encoding code duplication factor in discriminator