Xinliang David Li via llvm-dev
2016-Mar-11 05:42 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
On Thu, Mar 10, 2016 at 8:33 PM, Sean Silva via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> On Thu, Mar 10, 2016 at 7:21 PM, Vedant Kumar via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> Hi,
>>
>> I'd like to add a new pass to LLVM which removes redundant profile counter
>> updates. The goal is to speed up code coverage testing and profile generation
>> for PGO.
>
> We may want to have a focused discussion about this goal, rather than a
> particular suggestion. There are a lot of things we can do. Off the top of
> my head, some are:
>
> 1. add some sort of alias annotation (such as an independent TBAA root
> node) to all the counter increment memory instructions to tell the
> optimizer they don't interfere with the usual loads and stores.
>
> 2. communicate to the optimizer that counters can be registerized. In a
> loop like:
>
>     for (int i = 0; i < N; i++) {
>       if (foo())
>         bar();
>       else
>         baz();
>     }
>
> we perform O(N) counter increments (i.e. load, increment, store) last I
> checked. However, if the counters are in registers, then we only perform
> O(1) memory operations. This can dramatically reduce the pressure on the
> CPU's load/store units and also relieve cross-core cache line ping-pong
> when two cores are executing the same code. Note that the latter benefit is
> attained even if we ultimately end up spilling the counters due to
> increased register pressure.
>
> I actually don't know what is preventing the usual optimization pipeline
> from getting 2 right.

Call Mod-ref. We need to teach the optimizer that the counter owned by the
current function (if the function is proved to be non-recursive in some way)
cannot be modified by any other calls.

David

>> I'm sending this email out to describe my approach, share some early results,
>> and gather feedback.
>>
>>
>> Problem Overview
>> ================
>>
>> A profile counter is redundant if it's incremented in exactly the same basic
>> blocks as some other profile counter.
>> Consider the following module:
>>
>>     local void f1() {
>>       instrprof_increment(profc_f1);
>>     }
>>
>>     void f2() {
>>       instrprof_increment(profc_f2);
>>       f1();
>>     }
>>
>> Once the inliner runs and deletes f1, we're left with:
>>
>>     void f2() {
>>       instrprof_increment(profc_f2);
>>       instrprof_increment(profc_f1);
>>     }
>>
>> Now we can say profc_f1 is redundant (or, an alias for profc_f2).
>>
>> I've noticed that frontend-based instrumentation can generate many redundant
>> profile counters. This degrades performance and increases code size. We can
>> address the problem by getting rid of redundant counter updates. The trick is
>> to make sure we get back the same profiles.
>>
>>
>> Proposed Solution
>> =================
>>
>> I propose a pruning pass which takes the following steps:
>>
>>   1. Delete functions with local linkage and only one use, if that use is in
>>      a profile data record.
>>
>>      These functions are left around by the inliner (it doesn't know that
>>      they're safe to delete). Deleting them reduces code size and simplifies
>>      subsequent analysis of profile counters.
>>
>>   2. Determine which profile counters are essential.
>>
>>   3. Erase all updates to redundant profile counters.
>>
>>   4. Emit the aliases into a new section in the binary.
>>
>>      Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
>>      in compiler-rt are required to walk the alias section and fill in the
>>      correct execution counts at program exit time.
>>
>> This pass needs to be run after the inliner in order to be effective.
>>
>> The complexity of this pass is O(N*M), where N is the number of profile
>> counters, and M is the average number of updates per counter. In practice it is
>> a bit faster, since we can skip the analysis of counters which are discovered to
>> be redundant early on in the process.
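
The alias fix-up described in step 4 can be pictured roughly as below. This is
only a sketch under stated assumptions: `CounterAlias` and
`applyCounterAliases` are hypothetical names (not compiler-rt API), the table
is passed as a plain pointer range rather than read from a binary section, and
every Src is assumed to be an essential counter that was never itself pruned.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the { Dst: i64*, Src: i64* } tuples from the RFC.
struct CounterAlias {
  uint64_t *Dst;       // pruned (redundant) counter; never updated at run time
  const uint64_t *Src; // essential counter that is still incremented
};

// Hypothetical fix-up a profile runtime could run once, before the profile is
// written out: copy each essential count into the counters aliased to it.
inline void applyCounterAliases(const CounterAlias *Begin,
                                const CounterAlias *End) {
  for (const CounterAlias *A = Begin; A != End; ++A) {
    assert(A->Dst && A->Src && "alias entries must point at real counters");
    *A->Dst = *A->Src;
  }
}
```

With the f1/f2 module above, the table would hold the single entry
{ &profc_f1, &profc_f2 }, so profc_f1 receives profc_f2's count at exit even
though its own increment was erased.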
>
> I think a conceptually simpler design is something like:
>
>     for each CFG edge:
>         record which FE counters have ended up associated with it
>     remove FE counters
>     run IR instrumentation pass
>     emit a side table mapping IR instr counters to FE counters (more
>     generally: how to reconstruct FE counters from the IR counters)
>
> The problem is simply reduced to the IR instrumentation pass.
>
>> Early Results
>> =============
>>
>> The pruning pass results in 25% speed improvement in the example program above
>> (where f2 is called in a loop 10^8 times).
>>
>> Here is a slightly less contrived example:
>>
>>     #include <vector>
>>     #include <algorithm>
>>     #include <cstdlib>
>>
>>     static void escape(void *p) {
>>       asm volatile("" : : "g"(p) : "memory");
>>     }
>>
>>     int main(int argc, char **argv) {
>>       std::vector<int> V(atoi(argv[1]));
>>       escape(reinterpret_cast<void *>(V.data()));
>>       std::sort(V.begin(), V.end());
>>       return V[0];
>>     }
>>
>> I get the following results on my desktop (10^8 elements, 5 runs each):
>>
>>     O3:                           0.262s
>>     O3 + PGOInstr:                0.663s
>>     O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)
>>     O3 + CoverageInstr:           0.690s
>>     O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)
>>
>>
>> Next Steps?
>> ===========
>>
>> Is the performance of instrumented code something we think we need to fix?
>
> With frontend instrumentation it definitely is.
>
>> What's an acceptable compile-time overhead for running this pruning pass?
>
> Instrumented builds are "special" anyway so a fair slowdown is probably
> acceptable. I.e. this doesn't affect regular developer compilation times.
> As a baseline, maybe compare compile time of `-fprofile-instr-generate -O2`
> vs. `-O2`.
> If `-fprofile-instr-generate -O2` is 10% slower than the
> control, then that indicates that a 5% extra slowdown is probably
> reasonable for a substantial reduction in profiling overhead (which can
> result in a qualitative improvement to the actual usability of the
> instrumented program).
>
> -- Sean Silva
>
>> Is the general approach a non-starter for anybody?
>>
>> I'd like to get some feedback and gauge interest before pursuing this further.
>> Possible next steps include benchmarking instrumented versions of clang and
>> swift on the relevant test suites, running performance tests from lnt, running
>> compile-time tests, and measuring any code size differences.
>>
>>
>> thanks
>> vedant
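
For readers checking the quoted timings: the "performance win" percentages
appear to be measured against the unpruned instrumented run, not against plain
O3. The helper names below (`percentWin`, `slowdownFactor`) are made up for
this sketch; only the timings come from the message.

```cpp
#include <cassert>

// Relative run-time reduction of AfterSec versus BeforeSec, in percent.
inline double percentWin(double BeforeSec, double AfterSec) {
  assert(BeforeSec > 0.0);
  return (BeforeSec - AfterSec) / BeforeSec * 100.0;
}

// How many times slower an instrumented run is than the plain O3 run.
inline double slowdownFactor(double BaselineSec, double InstrSec) {
  assert(BaselineSec > 0.0);
  return InstrSec / BaselineSec;
}
```

percentWin(0.663, 0.606) and percentWin(0.690, 0.610) come out at about 8.6 and
11.6, matching the table. By the same timings, slowdownFactor(0.262, 0.606) is
roughly 2.3, so the pruned PGO build still runs at a bit over twice the O3
time on this benchmark.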
Sean Silva via llvm-dev
2016-Mar-11 06:13 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
On Thu, Mar 10, 2016 at 9:42 PM, Xinliang David Li <xinliangli at gmail.com> wrote:

>> I actually don't know what is preventing the usual optimization pipeline
>> from getting 2 right.
>
> Call Mod-ref. We need to teach the optimizer that the counter owned by
> the current function (if the function is proved to be non-recursive in some
> way) can not be modified by any other calls.

I don't think that's a sufficient explanation. Consider the following example:

Sean:~/tmp % cat testprofile.cpp
int foo(int n) {
  unsigned Ret = 42;
  for (int i = 0; i < n; i++) {
    if (i % 100) {
      Ret += 789;
    } else {
      Ret *= (283 + Ret);
    }
  }
  return Ret;
}

Sean:~/tmp % ~/pg/release/bin/clang++ -o - -fprofile-instr-generate testprofile.cpp -S -emit-llvm -O2 >foo.ll
Sean:~/tmp % ~/pg/release/bin/opt -view-cfg foo.ll

[image: Inline image 2]

-- Sean Silva

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2016-03-10 at 10.12.52 PM.png
Type: image/png
Size: 98967 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160310/f2c960ae/attachment-0001.png>
Xinliang David Li via llvm-dev
2016-Mar-11 06:34 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
I saw your example has a function call in the loop. For this example, I think
the reason is that we don't yet do speculative PRE, which usually requires
profile information. Note that the update is conditionally done in a branch.
Note that the counter update of the block %8 is fully optimized away.

David

On Thu, Mar 10, 2016 at 10:13 PM, Sean Silva <chisophugis at gmail.com> wrote:

> I don't think that's a sufficient explanation. Consider the following
> example:
>
> Sean:~/tmp % cat testprofile.cpp
Xinliang David Li via llvm-dev
2016-Mar-11 06:36 UTC
[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
On Thu, Mar 10, 2016 at 10:13 PM, Sean Silva <chisophugis at gmail.com> wrote:

> I don't think that's a sufficient explanation. Consider the following
> example:
>
> Sean:~/tmp % cat testprofile.cpp
> int foo(int n) {
>   unsigned Ret = 42;
>   for (int i = 0; i < n; i++) {
>     if (i % 100) {
>       Ret += 789;
>     } else {
>       Ret *= (283 + Ret);
>     }
>   }
>   return Ret;
> }
>
> Sean:~/tmp % ~/pg/release/bin/clang++ -o - -fprofile-instr-generate testprofile.cpp -S -emit-llvm -O2 >foo.ll
> Sean:~/tmp % ~/pg/release/bin/opt -view-cfg foo.ll

(resent)

I saw your example has a function call in the loop, thus the comment about
Modref. For this example, the reason seems to be that we don't yet do
speculative PRE, which usually requires profile information. Note that the
update is conditionally done in a branch. Also note that the counter update of
the block %8 is fully optimized away.

David
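
One way to picture the point about the conditional update is to write the
counters out at source level. The sketch below is an illustration, not actual
compiler output: the `Counters` array and its three-slot layout are
assumptions, and both functions return `unsigned` rather than `int` to keep
the sketch simple. `fooInstrumented` does a load/increment/store on a global
per update, with one update guarded by the branch; `fooPromoted` is the
hand-written form in which the loads are hoisted and the stores sunk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter layout: [0] function entry, [1] loop body,
// [2] then-branch of `if (i % 100)`. The real layout is up to the frontend.
static uint64_t Counters[3];

static void resetCounters() { Counters[0] = Counters[1] = Counters[2] = 0; }

// Every counter update touches memory; Counters[2] is updated conditionally.
unsigned fooInstrumented(int n) {
  ++Counters[0];
  unsigned Ret = 42;
  for (int i = 0; i < n; i++) {
    ++Counters[1];
    if (i % 100) {
      ++Counters[2];
      Ret += 789;
    } else {
      Ret *= (283 + Ret);
    }
  }
  return Ret;
}

// Counts live in locals during the loop and are written back once.
unsigned fooPromoted(int n) {
  ++Counters[0];
  uint64_t Body = Counters[1], Then = Counters[2]; // hoisted loads
  unsigned Ret = 42;
  for (int i = 0; i < n; i++) {
    ++Body;
    if (i % 100) {
      ++Then;
      Ret += 789;
    } else {
      Ret *= (283 + Ret);
    }
  }
  // Sunk stores: Counters[2] is written even if the then-branch never ran,
  // which is what makes this transformation speculative.
  Counters[1] = Body;
  Counters[2] = Then;
  return Ret;
}

// Runs both forms on the same input and checks results and counts agree.
void checkEquivalent(int n) {
  resetCounters();
  unsigned R1 = fooInstrumented(n);
  uint64_t Body = Counters[1], Then = Counters[2];
  resetCounters();
  unsigned R2 = fooPromoted(n);
  assert(R1 == R2);
  assert(Counters[1] == Body && Counters[2] == Then);
}
```

The promoted form is only equivalent here because nothing else reads or writes
the counters while the loop runs. With calls in the loop body, as in the
earlier foo()/bar()/baz() example, the optimizer would also need the mod-ref
information mentioned earlier in the thread.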