Snehasish Kumar via llvm-dev
2016-Mar-15 16:53 UTC
[llvm-dev] GSoC Proposal : Path Profiling Support
This proposal adds support for path profiling [Ball96] to LLVM. Path profiling compactly represents acyclic paths in a directed acyclic graph representation of the control flow graph of a routine. Instrumentation can be added to uniquely identify paths executed at runtime. Path profiles enable precise enumeration of the sequence of basic blocks executed in order for a particular path. Using path profiles has been demonstrated by [Young98] to perform better global scheduling than edge profiles. Superblocks constructed from path profiles offer more opportunities for speculative code motion thereby improving program performance. About Me : I am a PhD candidate at Simon Fraser University, Canada. My research primarily deals with computer architecture and workload analysis. Over the last year, I have built a toolchain which uses path profiling as its basis. With support from GSoC, I would like to integrate our path profiling pass into trunk. Our passes are currently built against LLVM 3.6.2. I am willing to maintain said passes as they form the core of our research infrastructure. Prior Work : Path profiling was implemented by Adam Preuss for LLVM [Preuss10]. However, this was before the unified profiling infrastructure was in place. It seems that the code was removed from trunk post LLVM 3.3. Current Implementation : + HashTable based + Static overflow checks (64 bit counters) at instrumentation time with fallback to APInt counters + Implements efficient event counting [Ball94] to reduce instrumentation overheads + Context sensitive counters if no 64-bitoverflow (i.e. supports path profiling across recursion) + Optimizations for use cases with very large number of executed paths + Used to analyse C/C++ benchmarks from SPECCPU2006 Proposed Integration : + Support library code added to compiler-rt + llvm/Analysis gets PathEncoding, PathDecoding and PathProfileInfo module passes + llvm/Transforms/Instrumentation gets PathProfileInstrumentation module pass + PathProfileInfo offers const only public methods to query info Open Issues : + Update PathProfileInfo on CFG transformations ? + Verify with PGOEdge info ? + Handle setjmp, longjmp, early program termination, noreturn calls Please let me know your comments on this proposal as a GSoC project. Any comments on how to refine this proposal are welcome. I can also elaborate on specific details of our implementation if there is interest on the mailing list. Thanks, Snehasish Kumar School of Computing Science Simon Fraser University References : [Ball94] Ball, Thomas. "Efficiently counting program events with support for on-line queries." ACM Transactions on Programming Languages and Systems (TOPLAS) 16.5 (1994): 1399-1410. [Ball96] Ball, Thomas, and James R. Larus. "Efficient path profiling." Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture. IEEE Computer Society, 1996. [Young98] Young, Cliff, and Michael D. Smith. "Better global scheduling using path profiles." Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture. IEEE Computer Society Press, 1998. [Preuss10] http://lists.llvm.org/pipermail/llvm-dev/2010-December/036790.html -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160315/d4b866c9/attachment.html>
Vedant Kumar via llvm-dev
2016-Mar-15 18:51 UTC
[llvm-dev] GSoC Proposal : Path Profiling Support
Hi, This sounds like a very interesting proposal!> This proposal adds support for path profiling [Ball96] to LLVM. Path profiling compactly represents acyclic paths in a directed acyclic graph representation of the control flow graph of a routine. Instrumentation can be added to uniquely identify paths executed at runtime.In [Ball96], the authors mention that the worst case number of acyclic paths in a program is exponential in its size. How do you manage this? Is the amount of instrumentation necessary to handle this scenario also exponential in the size of the program?> Path profiles enable precise enumeration of the sequence of basic blocks executed in order for a particular path. Using path profiles has been demonstrated by [Young98] to perform better global scheduling than edge profiles.I haven't had time to digest [Young98]. Could you explain in some more detail which optimizations stand to benefit from path profile information?> Over the last year, I have built a toolchain which uses path profiling as its basis.Awesome! Do you have any numbers about the size of path profiles, compared to the size of raw instrprof-based profiles? In general I'm curious about how we can support users of instrprof-style profiles with path profiles. Do you have plans to support code coverage using path profiles? Can path profiles from dylibs be concatenated together to yield valid profiles?> + Used to analyse C/C++ benchmarks from SPECCPU2006I'd be interested to see any numbers you have about this. In particular, what's the compile-time overhead of path profiling instrumentation? What are the sizes of instrumented binaries, and what's the slowdown factor?> Open Issues : > + Update PathProfileInfo on CFG transformations ?Could you clarify what this means?> + Verify with PGOEdge info ?Ditto.> + Handle setjmp, longjmp, early program termination, noreturn callsHow do you handle indirect calls? best, vedant
Snehasish Kumar via llvm-dev
2016-Mar-16 19:44 UTC
[llvm-dev] GSoC Proposal : Path Profiling Support
Hi Vedant, I would like to clarify that the proposal does *intra-procedural* path profiling as described in [Ball96].> > This proposal adds support for path profiling [Ball96] to LLVM. Path profiling compactly represents acyclic paths in a directed acyclic graph representation of the control flow graph of a routine. Instrumentation can be added to uniquely identify paths executed at runtime.> In [Ball96], the authors mention that the worst case number of acyclic > paths in a program is exponential in its size. How do you manage this? Is > the amount of instrumentation necessary to handle this scenario also > exponential in the size of the program?Paths are numbered from 0 to NumPaths-1. This can be exponential in size. In practice we find that they fit in 64-bit wide counters are enough for numbering paths within a routine. The amount of instrumentation required is not exponential. For example, every if-condition in a routine will increase the number of paths by X2, but the instrumentation points required to disambiguate between each side of the branch is only 1. Furthermore, we reduce the amount of instrumentation using [Ball94] which is proven to be optimal.> > Path profiles enable precise enumeration of the sequence of basic blocks executed in order for a particular path. Using path profiles has been demonstrated by [Young98] to perform better global scheduling than edge profiles.> I haven't had time to digest [Young98]. Could you explain in some more > detail which optimizations stand to benefit from path profile information?[Young98] target superblock enlargement optimizations. Having precise information about the paths enables additional optimizations using value numbering and subsequent dead code elimination. In our experience, isolating paths with reduced control flow allows aggressive dead-store-elimination, dead-code-elimination and constant propagation.> > Over the last year, I have built a toolchain which uses path profiling as its basis.> Do you have any numbers about the size of path profiles, compared to the > size of raw instrprof-based profiles?No. We use a plain text representation internally which can be further optimized.> In general I'm curious about how we can support users of instrprof-style > profiles with path profiles. Do you have plans to support code coverage > using path profiles?Perfect instrprof data can be derived from path profiles. Code coverage can be supported using path profiles though we do not do so as part of our implementation.> Can path profiles from dylibs be concatenated together > to yield valid profiles?Yes. As each routine is independently instrumented, the profiles can be concatenated. However, it is difficult to reason about which paths in the caller invoke which paths in the callee. Melinski and Reps describe how to do so in Interprocedural Path Profiling (CC'99). This is beyond the scope of this proposal.> > + Used to analyse C/C++ benchmarks from SPECCPU2006> I'd be interested to see any numbers you have about this. In particular, > what's the compile-time overhead of path profiling instrumentation? What > are the sizes of instrumented binaries, and what's the slowdown factor?You can find similar numbers reported by [Ball96] in Table 1 of the paper. Numbers from our selected workloads are shown below. These numbers use 256 bit APInt counters (primary cause for slowdown) along with a DenseMap<APInt, APInt> for Id => Counter mapping. The runtime is implemented as a shared library. The benchmarks are selected from SPEC2006 and Parsec 3.0. Each was compiled to a single bitcode module and a particular function was picked for instrumentation. The function was determined to be (one of) the primary work functions. The bitcodes were compiled with -O2 optimization post instrumentation. Note that the frequency of paths executed is not shown in this table and is a factor for slowdown. Each path is executed multiple times and each path has varying number of instrumentation points. This data shows the *worst possible case* where 256-bit APInt counters are used to calculate path ids as well as maintain runtime frequency counts in the shared library. Note that all benchmarks are well within the 64-bit integer range for path-ids. Thus the data in the [Ball96] paper shows the best case overheads, while the data I have included herein shows the worst case overheads for instrumentation+runtime. + Key numpaths = Number of possible paths epp+compile = Time taken to compute encoding, insert instrumentation and compile to executable compile = Time taken to compile to executable execpaths = Number of paths dynamically executed epp-exec-time = Execution time with instrumentation exec-time = Normal execution time epp-bin-size = Size of instrumented binary in bytes bin-size = Size of binary ** size of shared library in bytes = 598042 +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | benchmark | numpaths | epp+compile | compile | execpaths | epp-exec-time | exec-time | epp-bin-size | bin-size | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | blks | 2 | 0m1.036s | 0m1.008s | 2 | 0m3.643s | 0m3.205s | 155931 | 155459 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | bodytrack | 29 | 0m4.907s | 0m4.881s | 5 | 0m14.786s | 0m1.943s | 2125256 | 2124224 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | bzip2 | 60 | 0m1.274s | 0m1.268s | 3 | 0m9.441s | 0m9.624s | 259125 | 258477 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | ferret | 360921 | 0m26.208s | 0m26.102s | 40 | 0m10.342s | 0m6.224s | 8342571 | 8338588 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | fluidanimate | 384117 | 0m0.895s | 0m0.869s | 88 | 0m56.631s | 0m1.294s | 202702 | 197878 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | freqmine | 45 | 0m1.220s | 0m1.214s | 18 | 0m22.150s | 0m5.515s | 278615 | 277656 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | gcc | 6026 | 0m31.941s | 0m31.327s | 125 | 1m30.139s | 0m36.601s | 6991413 | 6991245 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | hmmer | 1882 | 0m3.193s | 0m3.232s | 65 | 0m58.911s | 0m2.474s | 744510 | 742806 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | mcf | 230 | 0m0.838s | 0m0.830s | 10 | 0m11.097s | 0m3.074s | 162680 | 161736 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | mcf2000 | 1155 | 0m0.859s | 0m0.853s | 26 | 0m24.169s | 0m4.625s | 166092 | 165213 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | povray | 17 | 0m8.543s | 0m8.552s | 4 | 9m24.562s | 5m39.295s | 2388152 | 2387960 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | sjeng | 158740 | 0m1.648s | 0m1.637s | 280 | 0m20.786s | 0m5.229s | 368841 | 368009 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | soplex | 30 | 0m4.849s | 0m4.848s | 24 | 7m28.151s | 4m10.813s | 1244775 | 1242063 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | sphinx | 26 | 0m2.212s | 0m2.198s | 5 | 1m36.291s | 0m13.811s | 543534 | 543358 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | streamcluster | 21121728 | 0m0.947s | 0m0.908s | 33 | 0m50.212s | 0m5.986s | 191981 | 185438 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | swaptions | 20655 | 0m0.965s | 0m0.950s | 13 | 0m0.263s | 0m0.178s | 193841 | 184274 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | h264ref | 24130 | 0m4.278s | 0m4.272s | 76 | 3m26.701s | 3m4.461s | 816660 | 812396 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | lbm | 8 | 0m0.824s | 0m0.815s | 5 | 6m29.685s | 1m39.180s | 150871 | 150327 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+ | namd | 59598954 | 0m4.124s | 0m4.139s | 43 | 18m36.447s | 6m50.288s | 925863 | 925271 | +---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+> > Open Issues : > > + Update PathProfileInfo on CFG transformations ?> Could you clarify what this means?Changing the control flow graph of a routine may invalidate collected path profiles. For example, splitting a block with an unconditional branch does not change the profile, but introducing a conditional branch invalidates the profile. The issue I would like to address is which transformations should we allow as safe transformations and how should we update the internal path profile data structures if we allow this at all.> > + Verify with PGOEdge info ?> Ditto.Verification with PGOEdge info implies that the edge frequencies derived from path profiles and via instrprof should be equal.> > + Handle setjmp, longjmp, early program termination, noreturn calls> How do you handle indirect calls?No special handling of indirect calls as path profiles are intra-procedural and control returns to same basic block after call in the general case. For the above mentioned cases, control may not return. Regards, Snehasish