TL;DR: We plan to build a suite of compiler-based dynamic instrumentation tools for analyzing targeted performance problems. These tools will all live under a new "EfficiencySanitizer" (or "esan") sanitizer umbrella, as they will share significant portions of their implementations.

===================
Motivation
===================

Our goal is to build a suite of dynamic instrumentation tools for analyzing particular performance problems that are difficult to evaluate using other profiling methods. Modern hardware performance counters provide insight into where time is spent and when micro-architectural events such as cache misses are occurring, but they are of limited effectiveness for contextual analysis: it is not easy to answer *why* a cache miss occurred.

Examples of tools that we have planned include: identifying wasted or redundant computation, identifying cache fragmentation, and measuring working sets. See more details on these below.

===================
Approach
===================

We believe that tools with overhead beyond about 5x are simply too heavyweight to easily apply to large, industrial-sized applications running real-world workloads. Our goal is for our tools to gather useful information with overhead less than 5x, and ideally closer to 3x, to facilitate deployment. We would prefer to trade off accuracy and build a less-accurate tool below our overhead ceiling rather than build a high-accuracy but slow tool. We hope to hit a sweet spot of tools that gather trace-based contextual information not feasible with pure sampling yet are still practical to deploy.

In a similar vein, we would prefer a targeted tool that analyzes one particular aspect of performance with low overhead over a more general tool that can answer more questions but has high overhead.

Dynamic binary instrumentation is one option for these types of tools, but typically compiler-based instrumentation provides better performance, and we intend to focus only on analyzing applications for which source code is available. Studying instruction cache behavior with compiler instrumentation can be challenging, however, so we plan to at least initially focus on data performance.

Many of our planned tools target specific performance issues with data accesses. They employ the technique of *shadow memory* to store metadata about application data references, using the compiler to instrument loads and stores with code to update the shadow memory. A companion runtime library intercepts libc calls if necessary to update shadow memory on non-application data references. The runtime library also intercepts heap allocations and other key events in order to perform its analyses. This is all very similar to how existing sanitizers such as AddressSanitizer, ThreadSanitizer, MemorySanitizer, etc. operate today.

===================
Example Tools
===================

We have several initial tools that we plan to build. These are not necessarily novel ideas on their own: some of these have already been explored in academia. The idea is to create practical, low-overhead, robust, and publicly available versions of these tools.

*Cache fragmentation*: this tool gathers data structure field hotness information, looking for data layout optimization opportunities by grouping hot fields together to avoid data cache fragmentation. Future enhancements may add field affinity information if it can be computed with low enough overhead.
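To make the field hotness idea a bit more concrete, here is a rough sketch of the kind of per-field bookkeeping the instrumentation could perform; the example struct, the counter array, and the helper name are purely illustrative assumptions, not the planned implementation:

    // Illustrative only: conceptual view of compiler-inserted field counting.
    #include <cstdint>

    struct Packet {            // hypothetical application struct
      uint64_t hot_id;         // touched on every packet
      char     cold_note[48];  // rarely touched
      uint64_t hot_len;        // touched on every packet
    };

    // Hypothetical per-field access counters, one slot per field.
    static uint64_t g_packet_field_counts[3];

    // What an instrumented load of Packet::hot_len would conceptually do:
    inline uint64_t instrumented_load_hot_len(const Packet *p) {
      ++g_packet_field_counts[2];  // bump hotness metadata for field #2
      return p->hot_len;           // the original application load
    }

A report could then flag structs whose frequently accessed fields (hot_id and hot_len here) are separated by cold bytes, suggesting that a layout change would pack the hot data into fewer cache lines.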
*Working set measurement*: this tool measures the data working set size of an application at each snapshot during execution. It can help to understand phased behavior as well as provide basic direction for further effort by the developer: e.g., knowing whether the working set is close to fitting in current L3 caches or is many times larger can help determine where to spend effort.

*Dead store detection*: this tool identifies dead stores (write-after-write patterns with no intervening read) as well as redundant stores (writes of the same value already in memory); a rough sketch of one possible shadow-based detection scheme appears at the end of this message. Xref the Deadspy paper from CGO 2012.

*Single-reference*: this tool identifies data cache lines brought in but only read once. These could be candidates for non-temporal loads.

===================
EfficiencySanitizer
===================

We are proposing the name EfficiencySanitizer, or "esan" for short, to refer to this suite of dynamic instrumentation tools for improving program efficiency. As we have a number of different tools that share quite a bit of their implementation, we plan to consider them sub-tools under the EfficiencySanitizer umbrella, rather than adding a whole bunch of separate instrumentation and runtime library components.

While these tools are not addressing correctness issues like other sanitizers, they will be sharing a lot of the existing sanitizer runtime library support code. Furthermore, users are already familiar with the sanitizer brand, and it seems better to extend that concept rather than add some new term.
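As a rough appendix to the dead store tool described above, here is a hedged sketch of one way a shadow state machine could flag write-after-write with no intervening read; the hash-map shadow, the state encoding, and the function names are assumptions for illustration only (a real tool would use a direct-mapped shadow region, and redundant-store detection would additionally compare the stored value):

    // Illustrative only: per-address "last access" state for dead store detection.
    #include <cstdint>
    #include <unordered_map>

    enum class LastAccess : uint8_t { Unknown, Read, Write };

    struct ShadowCell {
      LastAccess last = LastAccess::Unknown;
      uint64_t   dead_stores = 0;  // writes overwritten before ever being read
    };

    static std::unordered_map<uintptr_t, ShadowCell> g_shadow;  // sketch only

    inline void on_read(uintptr_t addr) { g_shadow[addr].last = LastAccess::Read; }

    inline void on_write(uintptr_t addr) {
      ShadowCell &cell = g_shadow[addr];
      if (cell.last == LastAccess::Write)  // prior store was never read: dead store
        ++cell.dead_stores;
      cell.last = LastAccess::Write;
    }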
This sounds interesting. I've got a couple of questions about the cache fragmentation tool and the working set measurement tool.

On 4/17/2016 4:46 PM, Derek Bruening via llvm-dev wrote:
> *Cache fragmentation*: this tool gathers data structure field hotness information, looking for data layout optimization opportunities by grouping hot fields together to avoid data cache fragmentation. Future enhancements may add field affinity information if it can be computed with low enough overhead.

I can vaguely imagine how this data would be acquired, but I'm more interested in what analysis is provided by the tool, and how this information would be presented to a user. Would it be a flat list of classes, sorted by number of accesses, with each field annotated by number of accesses? Or is there some other kind of presentation planned? Maybe some kind of weighting for classes with frequent cache misses?

> *Working set measurement*: this tool measures the data working set size of an application at each snapshot during execution. It can help to understand phased behavior as well as provide basic direction for further effort by the developer: e.g., knowing whether the working set is close to fitting in current L3 caches or is many times larger can help determine where to spend effort.

I think my questions here are basically the reverse of my prior questions. I can imagine the presentation (a graph with time on the X axis, working set measurement on the Y axis, with some markers highlighting key execution points). I'm not sure how the data collection works though, or even really what is being measured. Are you planning on counting the number of data bytes / data cache lines used during each time period? For the purposes of this tool, when is data brought into the working set and when is data evicted from the working set?

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
On Mon, Apr 18, 2016 at 1:36 PM, Craig, Ben via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> I can vaguely imagine how this data would be acquired, but I'm more interested in what analysis is provided by the tool, and how this information would be presented to a user. Would it be a flat list of classes, sorted by number of accesses, with each field annotated by number of accesses? Or is there some other kind of presentation planned? Maybe some kind of weighting for classes with frequent cache misses?

The sorting/filtering metric will include the disparity between fields: hot fields interleaved with cold fields are what it's looking for, with a total access count high enough to matter. Yes, it would present to the user the field layout with annotations for access count, as you suggest.

> I think my questions here are basically the reverse of my prior questions. I can imagine the presentation (a graph with time on the X axis, working set measurement on the Y axis, with some markers highlighting key execution points). I'm not sure how the data collection works though, or even really what is being measured. Are you planning on counting the number of data bytes / data cache lines used during each time period? For the purposes of this tool, when is data brought into the working set and when is data evicted from the working set?

The tool records which data cache lines were touched at least once during a snapshot (basically just setting a shadow memory bit for each load/store). The metadata is cleared after each snapshot is recorded so that the next snapshot starts with a blank slate. Snapshots can be combined via logical OR as the execution time grows, to adaptively handle varying total execution time.
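For illustration, here is a hedged sketch of that snapshot scheme; the bitmap size, the 64-byte line granularity, and the function names are assumptions made for this example rather than the actual esan design:

    // Illustrative only: one shadow bit per 64-byte cache line in a fixed range.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    constexpr uintptr_t kLineShift = 6;                    // 64-byte cache lines
    static std::vector<uint8_t> g_line_touched(1 << 20);   // example-sized bitmap

    inline void on_memory_access(uintptr_t addr) {
      uintptr_t line = addr >> kLineShift;
      size_t byte_index = (line >> 3) % g_line_touched.size();  // wrap for the sketch
      g_line_touched[byte_index] |= 1u << (line & 7);           // mark line touched
    }

    // At each snapshot: count touched lines, then clear for the next interval.
    // (Adjacent snapshots could instead be OR-ed together to grow the interval.)
    uint64_t take_snapshot() {
      uint64_t lines = 0;
      for (uint8_t byte : g_line_touched)
        lines += __builtin_popcount(byte);
      std::fill(g_line_touched.begin(), g_line_touched.end(), 0);
      return lines;  // working set for this interval, in cache lines
    }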
Hi Derek,

Thanks for proposing this. It seems like it might be an interesting tool for us too. But this proposal seems a bit hand-wavy, and I think it's missing some crucial info before we start heading this way.

On Sun, Apr 17, 2016 at 10:46 PM, Derek Bruening via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> TL;DR: We plan to build a suite of compiler-based dynamic instrumentation tools for analyzing targeted performance problems. These tools will all live under a new "EfficiencySanitizer" (or "esan") sanitizer umbrella, as they will share significant portions of their implementations.

I will bike-shed the name as much as anyone else, but I'll stay away for now :-)

> We believe that tools with overhead beyond about 5x are simply too heavyweight to easily apply to large, industrial-sized applications running real-world workloads. Our goal is for our tools to gather useful information with overhead less than 5x, and ideally closer to 3x, to facilitate deployment. [...]

This is also very important for us, and there are sanitizers which are "harder to sell" because of the overhead.

> We have several initial tools that we plan to build. These are not necessarily novel ideas on their own: some of these have already been explored in academia. The idea is to create practical, low-overhead, robust, and publicly available versions of these tools.
> [...]

Do you have any estimates on the memory overhead (both memory usage (+shadow) and code size) you expect? As well as estimates of the possible slowdown?

At least for the tools you are currently starting to implement, it would be nice to have some estimates and plans on what is going to happen.

I would actually like to see a small RFC about each tool and what the plan (overhead/slowdown, pseudo-code for instrumentation, UX, ...) is before starting to commit. I don't expect the plan to be very detailed, nor for everything to be pinned down, of course. This seems to be a bit at a research stage, and I'm totally ok with it. But I would rather it not be as opaque as it is now. The way the tools will report problems/data/etc. is important, for example.

> We are proposing the name EfficiencySanitizer, or "esan" for short, to refer to this suite of dynamic instrumentation tools for improving program efficiency. As we have a number of different tools that share quite a bit of their implementation we plan to consider them sub-tools under the EfficiencySanitizer umbrella, rather than adding a whole bunch of separate instrumentation and runtime library components.

How much code is expected to be shared? Most? Similar to what the sanitizers already share?

Do we expect the shadow memory mapping to be (mostly) the same among all esan's tools?

Do you already have an idea of a few tools + the type of code that would be shared (for example: "read/write instrumentation is mostly the same among these tools", or "generating reports for these tools is mostly the same", or something similar)?

> While these tools are not addressing correctness issues like other sanitizers, they will be sharing a lot of the existing sanitizer runtime library support code. Furthermore, users are already familiar with the sanitizer brand, and it seems better to extend that concept rather than add some new term.

Not the email to bike-shed the name, but I don't like "Efficiency" that much here :-)

Thank you for working on this,

 Filipe
On Tue, Apr 19, 2016 at 1:18 PM, Filipe Cabecinhas <filcab at gmail.com> wrote:
> Thanks for proposing this. It seems like it might be an interesting tool for us too. But this proposal seems a bit hand-wavy, and I think it's missing some crucial info before we start heading this way.
>
> At least for the tools you are currently starting to implement, it would be nice to have some estimates and plans on what is going to happen.
>
> I would actually like to see a small RFC about each tool and what the plan (overhead/slowdown, pseudo-code for instrumentation, UX, ...) is before starting to commit. I don't expect the plan to be very detailed, nor for everything to be pinned down, of course. This seems to be a bit at a research stage, and I'm totally ok with it. But I would rather it not be as opaque as it is now. The way the tools will report problems/data/etc. is important, for example.

The main response here is that these tools are in an exploratory stage and that what we are proposing is not a fixed list of tools but a framework within which we can build prototypes and refine them. We do not know which ideas will work out and which ones will be abandoned. In some cases, answers will not all be known until a prototype is in place to see how it behaves on large applications: we might need to add additional shadow metadata or additional instrumentation to produce the right amount of actionable output. What we want to avoid is having to build prototypes out-of-tree until they are proven and polished and only then trying to commit large patches upstream.

> Do you have any estimates on the memory overhead (both memory usage (+shadow) and code size) you expect? As well as estimates of the possible slowdown?

The shadow memory scaledowns for the tools we are considering include 64:1, 4:1, and 1:1. This is all within the ranges of existing sanitizers. Each tool will have inlined instrumentation to update shadow memory on either every memory access or, in some cases, just heap accesses; I do not have a number for code size expansion. The slowdown ranges were in the original email: 2x-5x.

> How much code is expected to be shared? Most? Similar to what the sanitizers already share?

More than what the existing sanitizers share.

> Do we expect the shadow memory mapping to be (mostly) the same among all esan's tools?

Most will use a single mapping and single algorithm, but the scaledown will vary as mentioned (see the sketch at the end of this message).

> Do you already have an idea of a few tools + the type of code that would be shared (for example: "read/write instrumentation is mostly the same among these tools", or "generating reports for these tools is mostly the same", or something similar)?

Instrumentation will be very similar (minor differences in shadow scaling and shadow metadata format): identical intrinsic handling, identical slowpath entries, identical libc interception, and identical shadow management. Reports are not fully explored.

> Not the email to bike-shed the name, but I don't like "Efficiency" that much here :-)

There were name discussions prior to the RFC and the conclusion was that there is just not going to be a single first-place choice for everyone.
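To illustrate what a single direct-mapped shadow translation with a per-tool scaledown (the 64:1, 4:1, and 1:1 ratios mentioned above) could look like, here is a hedged sketch; the shadow offset constant is a placeholder, not the actual esan memory layout:

    // Illustrative only: direct-mapped application-to-shadow address translation.
    #include <cstdint>

    // Placeholder offset; real sanitizers choose this per platform so the
    // shadow region does not collide with the application's address space.
    constexpr uintptr_t kShadowOffset = 0x0000100000000000ULL;

    // scale_shift = 0 gives a 1:1 shadow, 2 gives 4:1, and 6 gives 64:1.
    inline uintptr_t app_to_shadow(uintptr_t app_addr, unsigned scale_shift) {
      return (app_addr >> scale_shift) + kShadowOffset;
    }

    // Example: a 64:1 tool keeps one shadow byte per 64 application bytes
    // (one cache line), e.g. a "touched" bit plus a few spare bits.
    inline uint8_t *shadow_byte_for_line(uintptr_t app_addr) {
      return reinterpret_cast<uint8_t *>(app_to_shadow(app_addr, /*scale_shift=*/6));
    }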
On Tue, Apr 19, 2016 at 10:18 AM Filipe Cabecinhas via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> I will bike-shed the name as much as anyone else, but I'll stay away for now :-)
>
> Not the email to bike-shed the name, but I don't like "Efficiency" that much here :-)

Heh, so I really do like "efficiency", much more than performance or quickness. But I really do think that we shouldn't try to bikeshed the name much. IMO, while it's great to throw out ideas around better names, and certainly to point out any serious *problems* with a name, past that I'll be happy with whatever the folks actually working on this pick.

I think we should all focus on the relevant technical discussion, float any name ideas that we have, and let Derek and Qin make the final call on the name.

-Chandler
Interesting idea! I understand how the bookkeeping in the tool is similar to some of the sanitizers, but I am wondering whether that is really the best developer's work-flow for such a tool.

I could imagine that some of the opportunities discovered by the tool could be optimized automatically by the compiler (e.g. temporal loads, sw prefetching, partitioning the heap), so feeding this information back to the compiler could be highly useful. I am wondering whether the PGO model is closer to what we want at the end. The problem can also be thought of as a natural extension of PGO. Besides instrumenting branches and indirect calls, it adds instrumentation for loads and stores.

We have internally been discussing ways to use PGO for optimization diagnostics (a continuation of Tyler's work, see http://blog.llvm.org/2014/11/loop-vectorization-diagnostics-and.html). The idea is to help the developer focus in on opportunities in hot code. It seems that the diagnostics provided by your tools could be emitted directly by the analyses in LLVM during the profile-use phase.

Adam

> On Apr 17, 2016, at 2:46 PM, Derek Bruening via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...]
Some of this data might be interesting for profile guidance. Are there any plans there?

-- Sean Silva

On Sun, Apr 17, 2016 at 2:46 PM, Derek Bruening via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...]
Hi Derek,

I'm not an expert in any of these topics, but I'm excited that you guys are doing it. It seems like a missing piece that needs to be filled.

Some comments inline...

On 17 April 2016 at 22:46, Derek Bruening via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> We would prefer to trade off accuracy and build a less-accurate tool below our overhead ceiling rather than build a high-accuracy but slow tool.

I agree with this strategy.

As a first approach, make it the fastest you can, then later introduce more probes, maybe via some slider flag (like -ON), to consciously trade speed for accuracy.

> Studying instruction cache behavior with compiler instrumentation can be challenging, however, so we plan to at least initially focus on data performance.

I'm interested in how you're going to do this without kernel profiling probes, like perf. Or is the point here introducing syscalls in the right places instead of random profiling? Wouldn't that bias your results?

> Many of our planned tools target specific performance issues with data accesses. They employ the technique of *shadow memory* to store metadata about application data references, using the compiler to instrument loads and stores with code to update the shadow memory.

Is it just counting the number of reads/writes? Or are you going to add how many of those accesses were hit by a cache miss?

> *Cache fragmentation*: this tool gathers data structure field hotness information, looking for data layout optimization opportunities by grouping hot fields together to avoid data cache fragmentation. Future enhancements may add field affinity information if it can be computed with low enough overhead.

It would also be good to have temporal information, so that you can correlate data accesses that occur, for example, inside the same loop / basic block, or in sequence in the common CFG flow. This could lead to changes in allocation patterns (heap, BSS).

> *Working set measurement*: this tool measures the data working set size of an application at each snapshot during execution. It can help to understand phased behavior as well as provide basic direction for further effort by the developer: e.g., knowing whether the working set is close to fitting in current L3 caches or is many times larger can help determine where to spend effort.

This is interesting, but most useful when your dataset changes size over different runs. This is similar to running the program under perf for different workloads, and I'm not sure how you're going to get that in a single run. It also comes with the additional problem that cache sizes are not always advertised, so you might need an additional tool to guess the sizes based on increasing the size of data blocks and finding steps on the data access graph.

> *Dead store detection*: this tool identifies dead stores (write-after-write patterns with no intervening read) as well as redundant stores (writes of the same value already in memory). Xref the Deadspy paper from CGO 2012.

This should probably be spotted by the compiler, so I guess it's a tool for compiler developers to spot missed optimisation opportunities in the back-end.

> *Single-reference*: this tool identifies data cache lines brought in but only read once. These could be candidates for non-temporal loads.

That's nice and should be simple enough to get a report in the end. This also seems to be a hint to compiler developers rather than users.

I think you guys have a nice set of tools to develop and I'm looking forward to working with them.

cheers,
--renato
On 04/20/2016 02:58 PM, Renato Golin via llvm-dev wrote:
>> *Dead store detection*: this tool identifies dead stores (write-after-write patterns with no intervening read) as well as redundant stores (writes of the same value already in memory). Xref the Deadspy paper from CGO 2012.
>
> This should probably be spotted by the compiler, so I guess it's a tool for compiler developers to spot missed optimisation opportunities in the back-end.

Not when the dead store happens in an external DSO where the compiler can't detect it (the same applies for single references).
On Wed, Apr 20, 2016 at 2:14 AM, Adam Nemet <anemet at apple.com> wrote:
> Interesting idea! I understand how the bookkeeping in the tool is similar to some of the sanitizers, but I am wondering whether that is really the best developer's work-flow for such a tool.
>
> I could imagine that some of the opportunities discovered by the tool could be optimized automatically by the compiler (e.g. temporal loads, sw prefetching, partitioning the heap), so feeding this information back to the compiler could be highly useful. I am wondering whether the PGO model is closer to what we want at the end. The problem can also be thought of as a natural extension of PGO. Besides instrumenting branches and indirect calls, it adds instrumentation for loads and stores.

It would be great to automatically apply the results of the tools, but we do not think that this is straightforward for enough cases up front. For the cache fragmentation tool, automatically applying data structure field reordering (or splitting or peeling) generally requires whole-program compilation, which is not always available and currently does not scale up to the size of applications we would like to target. The working set tool is not a candidate for automated action. Acting on dead stores typically requires programmer analysis to confirm that there is not some non-executed path where the store is not actually dead.

We would like to start with a standalone sanitizer usage model providing developer feedback. How about if we start with that, and in the future we can revisit whether some subset of the results can be acted on in an automated fashion?
On Tue, Apr 19, 2016 at 11:19 PM, Sean Silva via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Some of this data might be interesting for profile guidance. Are there any plans there?

Esan instrumentation is geared toward application-level tuning by developers -- the data collected here are not quite 'actionable' by the compiler directly. For instance, struct field reordering needs whole-program analysis and can be very tricky to do for C++ code with complex inheritance (e.g., the best base class field order may depend on the context of the inheritance). Fancy struct layout changes such as peeling, splitting/outlining, field inlining, etc. also require very good address escape analysis.

Dead store detection can be used indirectly by the compiler -- the compiler certainly can not use the information to prove statically that the stores are dead, but the compiler developer can use this tool to find the cases and figure out missing optimizations in the compiler.

Working set size data is also not quite usable by the compiler.

Esan's design tradeoffs are also not the same as PGO's. The former allows more overhead and is less restricted -- it can go deeper.

David
On Tue, Apr 19, 2016 at 11:14 PM, Adam Nemet via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Interesting idea! I understand how the bookkeeping in the tool is similar > to some of the sanitizers but I am wondering whether that is really the > best developer’s work-flow for such a tool. > > I could imagine that some of the opportunities discovered by the tool > could be optimized automatically by the compiler (e.g. temporal loads, sw > prefetching, partitioning the heap) so feeding this information back to the > compiler could be highly useful. I am wondering whether the PGO model is > closer to what we want at the end. The problem can also be thought of as a > natural extension of PGO. Besides instrumenting branches and indirect > calls, it adds instrumentation for loads and stores. > > We have internally been discussing ways to use PGO for optimization > diagnostics (a continuation of Tyler’s work, see > http://blog.llvm.org/2014/11/loop-vectorization-diagnostics-and.html). > The idea is to help the developer to focus in on opportunities in hot code. > It seems that the diagnostics provided by your tools could be emitted > directly by the analyses in LLVM during the profile-use phase. >You are right that some of the information are already available with PGO + static analysis -- one example is field affinity data. Instruction working set data is also 'roughly' available. However I think Esan can potentially do a better job, easier to use and be a centralized place for getting such information. David> > Adam > > On Apr 17, 2016, at 2:46 PM, Derek Bruening via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > > TL;DR: We plan to build a suite of compiler-based dynamic instrumentation > tools for analyzing targeted performance problems. These tools will all > live under a new "EfficiencySanitizer" (or "esan") sanitizer umbrella, as > they will share significant portions of their implementations. > > ===================> Motivation > ===================> > Our goal is to build a suite of dynamic instrumentation tools for > analyzing particular performance problems that are difficult to evaluate > using other profiling methods. Modern hardware performance counters > provide insight into where time is spent and when micro-architectural > events such as cache misses are occurring, but they are of limited > effectiveness for contextual analysis: it is not easy to answer *why* a > cache miss occurred. > > Examples of tools that we have planned include: identifying wasted or > redundant computation, identifying cache fragmentation, and measuring > working sets. See more details on these below. > > ===================> Approach > ===================> > We believe that tools with overhead beyond about 5x are simply too > heavyweight to easily apply to large, industrial-sized applications running > real-world workloads. Our goal is for our tools to gather useful > information with overhead less than 5x, and ideally closer to 3x, to > facilitate deployment. We would prefer to trade off accuracy and build a > less-accurate tool below our overhead ceiling than to build a high-accuracy > but slow tool. We hope to hit a sweet spot of tools that gather > trace-based contextual information not feasible with pure sampling yet are > still practical to deploy. > > In a similar vein, we would prefer a targeted tool that analyzes one > particular aspect of performance with low overhead than a more general tool > that can answer more questions but has high overhead. 
On Wed, Apr 20, 2016 at 7:58 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 17 April 2016 at 22:46, Derek Bruening via llvm-dev wrote:
> > Studying instruction cache behavior with compiler instrumentation can be
> > challenging, however, so we plan to at least initially focus on data
> > performance.
>
> I'm interested in how you're going to do this without kernel profiling
> probes, like perf.
>
> Or is the point here introducing syscalls in the right places instead
> of randomly profiled?  Wouldn't that bias your results?

I'm not sure I understand the question: are you asking whether not gathering
data on time spent in the kernel is an issue?  Or are you asking how to
measure aspects of performance without using sampling or hardware performance
counters?

> > Many of our planned tools target specific performance issues with data
> > accesses.  They employ the technique of *shadow memory* to store metadata
> > about application data references, using the compiler to instrument loads
> > and stores with code to update the shadow memory.
>
> Is it just counting the number of reads/writes?  Or are you going to
> add how many of those accesses were hit by a cache miss?

It varies by tool.  The brief descriptions in the original email hopefully
shed some light; we are also sending separate RFCs for each tool (the working
set RFC was already sent).  The cache fragmentation tool is basically just
counting, yes (a rough illustrative sketch of this kind of counting appears
after this message).  There is no cache miss information here: we are not
using hardware performance counters, nor are we running a software cache
simulation.  We are measuring particular aspects of application behavior that
tend to affect performance, often abstracted away from the precise
microarchitecture you are running on.

> > *Cache fragmentation*: this tool gather data structure field hotness
> > information, looking for data layout optimization opportunities by
> > grouping hot fields together to avoid data cache fragmentation.  Future
> > enhancements may add field affinity information if it can be computed
> > with low enough overhead.
>
> Would be also good to have temporal information, so that you can
> correlate data access that occurs, for example, inside the same loop /
> basic block, or in sequence in the common CFG flow.  This could lead to
> change in allocation patterns (heap, BSS).

Agreed, we have thought about adding temporal information, though it would
cost more and we have not fleshed out the details.

> > *Working set measurement*: this tool measures the data working set size
> > of an application at each snapshot during execution.  It can help to
> > understand phased behavior as well as providing basic direction for
> > further effort by the developer: e.g., knowing whether the working set is
> > close to fitting in current L3 caches or is many times larger can help
> > determine where to spend effort.
>
> This is interesting, but most useful when your dataset changes size
> over different runs.  This is similar to running the program under perf
> for different workloads, and I'm not sure how you're going to get that
> in a single run.  It also comes with the additional problem that cache
> sizes are not always advertised, so you might have an additional tool
> to guess the sizes based on increasing the size of data blocks and
> finding steps on the data access graph.

This tool is relatively agnostic of the precise details of the caches beyond
having its granularity based on the cache line size it assumes (64 bytes,
which can be parameterized).
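To illustrate the per-cache-line granularity Derek describes, here is a rough
sketch of the working-set idea under the stated 64-byte-line assumption.  The
names, and the use of a hash set in place of a real shadow mapping, are purely
illustrative rather than the actual tool:

  // Rough sketch of the working-set idea: remember each 64-byte line the
  // application touches, then at each snapshot report and reset.  The names
  // and the use of a hash set in place of a real shadow mapping are purely
  // illustrative.
  #include <cstdint>
  #include <cstdio>
  #include <unordered_set>

  static const unsigned kLineBits = 6;  // 64-byte lines; could be parameterized
  static std::unordered_set<uintptr_t> touched_lines;

  static inline void on_access(const void *addr) {
    touched_lines.insert(reinterpret_cast<uintptr_t>(addr) >> kLineBits);
  }

  // Report the working set observed since the previous snapshot.
  static void snapshot(const char *label) {
    std::printf("%s: %zu lines (~%zu KB)\n", label, touched_lines.size(),
                touched_lines.size() * 64 / 1024);
    touched_lines.clear();
  }

  int main() {
    static int big[1 << 16];               // 256 KB of data
    for (int i = 0; i < (1 << 16); ++i) {  // phase 1: touch all of it
      on_access(&big[i]);
      big[i] = i;
    }
    snapshot("phase 1");
    for (int i = 0; i < 1024; ++i) {       // phase 2: touch a small subset
      on_access(&big[i]);
      big[i] += 1;
    }
    snapshot("phase 2");
    return 0;
  }

At each snapshot the tool would report how many distinct lines were touched
since the previous snapshot (here roughly 256 KB for the first phase and 4 KB
for the second), which gives the kind of phased-behavior signal discussed
above.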
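Earlier in the same reply, Derek notes that the cache fragmentation tool is
"basically just counting".  The following hypothetical sketch shows one way
per-field access counting could look at source level; the struct, the field
ids, and the counter names are all invented for illustration, and a real tool
would record this via compiler instrumentation and shadow memory rather than
hand-inserted calls:

  // Hypothetical per-field hotness counting for the cache-fragmentation idea:
  // count accesses to each field so that cold fields interleaved with hot
  // ones show up as wasted cache-line space.  The struct, field ids, and
  // counters are invented; a real tool would instrument field accesses in IR
  // and keep the counts in shadow memory.
  #include <cstdint>
  #include <cstdio>

  struct Node {
    int hot_key;         // touched every iteration
    char cold_blob[48];  // rarely touched, but shares cache lines with hot_key
    int hot_value;       // touched every iteration
  };

  enum FieldId { kHotKey, kColdBlob, kHotValue, kNumFields };
  static uint64_t field_counter[kNumFields];

  static inline void count_field(FieldId f) { field_counter[f]++; }

  int main() {
    static Node nodes[1024] = {};
    long sum = 0;
    for (int i = 0; i < 1024; ++i) {
      count_field(kHotKey);    // what instrumentation would record per access
      count_field(kHotValue);
      sum += nodes[i].hot_key + nodes[i].hot_value;
    }
    count_field(kColdBlob);    // the cold field is touched only once
    sum += nodes[0].cold_blob[0];
    std::printf("hot_key=%llu cold_blob=%llu hot_value=%llu (sum=%ld)\n",
                (unsigned long long)field_counter[kHotKey],
                (unsigned long long)field_counter[kColdBlob],
                (unsigned long long)field_counter[kHotValue], sum);
    return 0;
  }

A large imbalance between counters for fields that share cache lines (hot_key
and hot_value versus cold_blob here) is the kind of layout opportunity such a
tool is meant to surface.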
----- Original Message -----
> From: "Derek Bruening via llvm-dev" <llvm-dev at lists.llvm.org>
> To: llvm-dev at lists.llvm.org
> Cc: efficiency-sanitizer at google.com
> Sent: Sunday, April 17, 2016 4:46:29 PM
> Subject: [llvm-dev] RFC: EfficiencySanitizer
Will this technology allow us to pinpoint specific accesses that generally
have high latency (i.e., generally are cache misses)?  This information is
useful for programmers, and is also useful as an input to loop unrolling,
instruction scheduling, and the like on out-of-order cores.

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
> Will this technology allow us to pinpoint specific accesses that generally
> have high latency (i.e., generally are cache misses)?  This information is
> useful for programmers, and is also useful as an input to loop unrolling,
> instruction scheduling, and the like on out-of-order cores.

Won't hardware performance counters tell you which accesses are delinquent
accesses?  The tools here are trying to provide more information and,
hopefully, some insight into the performance problem.  For example, if we can
tell that the working set of an application is slightly bigger than the L3
cache, developers would be able to take the right action to improve
performance.  We welcome any suggestions about how the information can be
used, or what information should be collected.

Qin