Qin Zhao via llvm-dev
2016-Apr-20 15:14 UTC
[llvm-dev] RFC: EfficiencySanitizer working set tool
On Wed, Apr 20, 2016 at 2:48 AM, Sean Silva <chisophugis at gmail.com> wrote:

> On Tue, Apr 19, 2016 at 10:44 PM, Derek Bruening via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>> Please reference the prior RFC on EfficiencySanitizer. This is one of
>> the performance analysis tools we would like to build under the
>> EfficiencySanitizer umbrella.
>>
>> ==================
>> Motivation
>> ==================
>>
>> Knowing the working set size at periodic points during a given
>> application's execution helps to understand its cache behavior, how its
>> behavior changes over time, how its performance might vary on different
>> hardware configurations, and how best to direct performance improvement
>> efforts. For example, knowing whether the working set is close to
>> fitting in current L3 caches or is many times larger can help determine
>> which efforts are most likely to bear fruit.
>>
>> The typical approach to acquiring working set information is to collect
>> a memory reference profile and feed it to a cache simulator, looking for
>> the point at which the miss rate drops to zero. However, this incurs
>> prohibitively high runtime overhead. Our goal is to build a fast
>> shadow-memory-based working set measurement tool.
>>
>> ==================
>> Approach
>> ==================
>>
>> We focus on the data working set size, partly because it is typically
>> more important and partly because measuring the instruction working set
>> with compiler instrumentation is much more difficult to implement,
>> requiring insertion at a later and more challenging point in the
>> compilation pipeline.
>>
>> Because we want to know how the program's working set affects the cache
>> behavior, we define the working set as the set of cache lines that are
>> referenced by the program during a certain period of time.
>>
>> Using the definition above, we are able to use a single shadow bit to
>> represent whether a cache line is referenced or not. If the cache line
>> size is 64 bytes, we will use one bit to represent a 64-byte cache line
>> via a 512-byte-to-1-byte shadow mapping.
>>
>> For every memory reference, we insert code to set the shadow bits
>> corresponding to the cache line(s) touched by that reference. The
>> resulting shadow memory bitmap describes the program's working set.
>>
>> There are two optimizations that can be added:
>>
>> 1) Add a shadow bit check before writing each bit, to reduce the number
>> of memory stores. In our experience this is always a win with shadow
>> memory tools.
>>
>> 2) Analyze the memory references within a basic block to coalesce
>> multiple memory references into a single shadow update.
>
> Will this fire often? Very few memory references are known to the
> compiler to be >= cache-line aligned, so I'm not seeing how the compiler
> will prove that two memory references definitely land on the same cache
> line.
>
> -- Sean Silva

I am not sure what you mean by "fire often"? This is an optimization
during instrumentation, so it will be performed during the
instrumentation pass.

If you are asking how many opportunities there are, we do not know,
because we have not yet implemented it, and we probably won't implement
it until we have the basic instrumentation in place and have evaluated
its performance and results. But I think there are opportunities:

Example 1: if two references access addresses Addr and Addr+64 (the
cache line size), then we only need one shadow address calculation, for
the first one; the second one's shadow address can be obtained by
adding 1.

Example 2: if three references access Addr, Addr+16, and Addr+32, we can
skip the instrumentation for Addr+16 without proving which cache line it
belongs to.

-- Qin
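[Editor's note: for concreteness, here is a minimal sketch of the
per-reference instrumentation the RFC describes, under its
512-byte-to-1-byte mapping and with optimization 1's check-before-write.
The function names and the kShadowOffset value are hypothetical
placeholders, not EfficiencySanitizer's actual scheme.]

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: one shadow bit per 64-byte cache line, so each
// shadow byte covers 512 bytes of application memory. kShadowOffset is a
// placeholder displacement; a real tool would reserve and map the shadow
// region at startup.
static const uintptr_t kShadowOffset = 0x100000000000ULL;

// Mark the cache line containing Addr as referenced.
static inline void markCacheLine(uintptr_t Addr) {
  uint8_t *ShadowByte =
      reinterpret_cast<uint8_t *>((Addr >> 9) + kShadowOffset);
  uint8_t Bit = 1 << ((Addr >> 6) & 7); // which of this byte's 8 lines
  // Optimization 1 from the RFC: check before writing, so that repeated
  // references to an already-marked line cost a load but no store.
  if (!(*ShadowByte & Bit))
    *ShadowByte |= Bit;
}

// An access of Size bytes may straddle a cache line boundary, touching
// two lines; this is the "cache line(s)" case mentioned in the RFC.
static inline void markAccess(uintptr_t Addr, size_t Size) {
  markCacheLine(Addr);
  if ((Addr & 63) + Size > 64)
    markCacheLine(Addr + Size - 1);
}
```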
>> We will divide the program execution into regular intervals and record
>> a snapshot of the working set at each interval boundary. An adaptive
>> strategy can keep the number of snapshots bounded across varying total
>> execution time by combining adjacent snapshots via logical OR. When a
>> snapshot is recorded, the shadow memory is cleared so that the next
>> snapshot starts with a blank slate. If we use a 64-byte-to-1-byte
>> shadow mapping, we can use the upper bits to store up to 8 consecutive
>> snapshots in the shadow memory itself, by shifting rather than clearing
>> the shadow memory on a snapshot.
>>
>> Recording the snapshot will use optimizations to avoid storing the
>> entire address space: only mapped regions will be saved.
>>
>> The report from the tool to the user at the end of execution would
>> essentially be a histogram with time on the x axis and the cache lines
>> touched on the y axis. The units of time could be selected by the user:
>> CPU time, wall clock time, memory access count, etc. The unit
>> determines the method of snapshot sampling, whether performance
>> counters for the memory access count (or instrumentation to increment a
>> global counter) or an itimer. We can record callstacks with each
>> snapshot as well, to help give an indication of what the program is
>> doing at that point in time and to help the user understand the program
>> phases.
>>
>> We expect the time overhead of the tool to be well under the 5x
>> EfficiencySanitizer ceiling; presumably it should be under 3x.
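[Editor's note: to illustrate the snapshot scheme quoted above, here is a
sketch assuming the 64-byte-to-1-byte variant, in which bit 0 of each
shadow byte is the current interval's "referenced" flag and bits 1-7 hold
up to 7 earlier snapshots; the restriction of the walk to mapped regions
is elided, and the function names are hypothetical.]

```cpp
#include <cstddef>
#include <cstdint>

// Number of cache lines referenced during the interval that just ended,
// i.e. one point of the working-set histogram.
size_t countReferencedLines(const uint8_t *Shadow, size_t NumBytes) {
  size_t Count = 0;
  for (size_t I = 0; I < NumBytes; ++I)
    Count += Shadow[I] & 1;
  return Count;
}

// Take a snapshot: shifting each byte retires the current interval's bit
// into the history bits and leaves bit 0 zeroed for the next interval, so
// no separate clearing pass over shadow memory is needed. After 8
// snapshots the oldest history bit falls off the top, which is where the
// RFC's OR-combining of adjacent snapshots would come in.
void takeSnapshot(uint8_t *Shadow, size_t NumBytes) {
  for (size_t I = 0; I < NumBytes; ++I)
    Shadow[I] <<= 1;
}
```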
Sean Silva via llvm-dev
2016-Apr-21 00:00 UTC
[llvm-dev] RFC: EfficiencySanitizer working set tool
On Wed, Apr 20, 2016 at 8:14 AM, Qin Zhao <zhaoqin at google.com> wrote:

> On Wed, Apr 20, 2016 at 2:48 AM, Sean Silva <chisophugis at gmail.com> wrote:
>
> [snip]
>
>>> 2) Analyze the memory references within a basic block to coalesce
>>> multiple memory references into a single shadow update.
>>
>> Will this fire often? Very few memory references are known to the
>> compiler to be >= cache-line aligned, so I'm not seeing how the
>> compiler will prove that two memory references definitely land on the
>> same cache line.
>>
>> -- Sean Silva
>
> I am not sure what you mean by "fire often"? This is an optimization
> during instrumentation, so it will be performed during the
> instrumentation pass.

I mean, I don't expect many opportunities during instrumentation where
you can "coalesce multiple memory references into a single shadow
update".
The examples you give below are good, but I would have described them
differently ("example 1" does not involve coalescing of the store to
shadow, and "example 2" involves multiple shadow updates: which store is
coalesced into which "single shadow update"?).

> If you are asking how many opportunities there are, we do not know,
> because we have not yet implemented it, and we probably won't implement
> it until we have the basic instrumentation in place and have evaluated
> its performance and results. But I think there are opportunities:
>
> Example 1: if two references access addresses Addr and Addr+64 (the
> cache line size), then we only need one shadow address calculation, for
> the first one; the second one's shadow address can be obtained by
> adding 1.
>
> Example 2: if three references access Addr, Addr+16, and Addr+32, we
> can skip the instrumentation for Addr+16 without proving which cache
> line it belongs to.

Thanks, these are good examples, although I wonder what it would take
for the regular passes in the optimizer to optimize these opportunities
appropriately (if they don't already).

-- Sean Silva
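[Editor's note: Sean's concern can be made concrete with a hypothetical
post-instrumentation fragment; merging the two shadow updates below
requires proving that both references fall into the same shadow granule,
which depends on the runtime value and alignment of Addr. The code is an
illustrative assumption, not actual instrumentation output.]

```cpp
#include <cstdint>

// Hypothetical inlined marking code for two accesses 16 bytes apart,
// using the RFC's 512-byte-to-1-byte bitmap. To fold these into one
// update, a generic pass must prove (Addr >> 9) == ((Addr + 16) >> 9),
// which is false whenever Addr + 16 crosses a 512-byte boundary; without
// alignment facts, the updates cannot be merged unconditionally.
void instrumentedPair(uint8_t *ShadowBase, uintptr_t Addr) {
  ShadowBase[Addr >> 9] |= 1 << ((Addr >> 6) & 7);
  ShadowBase[(Addr + 16) >> 9] |= 1 << (((Addr + 16) >> 6) & 7);
}
```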
Qin Zhao via llvm-dev
2016-Apr-21 00:51 UTC
[llvm-dev] RFC: EfficiencySanitizer working set tool
On Apr 20, 2016 8:00 PM, "Sean Silva" <chisophugis at gmail.com> wrote:

> On Wed, Apr 20, 2016 at 8:14 AM, Qin Zhao <zhaoqin at google.com> wrote:
>
> [snip]
>
>>>> 2) Analyze the memory references within a basic block to coalesce
>>>> multiple memory references into a single shadow update.
>>>
>>> Will this fire often? Very few memory references are known to the
>>> compiler to be >= cache-line aligned, so I'm not seeing how the
>>> compiler will prove that two memory references definitely land on the
>>> same cache line.
>>>
>>> -- Sean Silva
>>
>> I am not sure what you mean by "fire often"? This is an optimization
>> during instrumentation, so it will be performed during the
>> instrumentation pass.
>
> I mean, I don't expect many opportunities during instrumentation where
> you can "coalesce multiple memory references into a single shadow
> update".
> The examples you give below are good, but I would have described them
> differently ("example 1" does not involve coalescing of the store to
> shadow, and "example 2" involves multiple shadow updates: which store is
> coalesced into which "single shadow update"?).

Example 1 allows us to coalesce up to 4 consecutive cache line
operations into one, since we operate at the byte level, though there
might be few such opportunities. You are right that example 2 is more
about skipping than coalescing.

>> If you are asking how many opportunities there are, we do not know,
>> because we have not yet implemented it, and we probably won't
>> implement it until we have the basic instrumentation in place and have
>> evaluated its performance and results. But I think there are
>> opportunities:
>>
>> Example 1: if two references access addresses Addr and Addr+64 (the
>> cache line size), then we only need one shadow address calculation,
>> for the first one; the second one's shadow address can be obtained by
>> adding 1.
>>
>> Example 2: if three references access Addr, Addr+16, and Addr+32, we
>> can skip the instrumentation for Addr+16 without proving which cache
>> line it belongs to.
>
> Thanks, these are good examples, although I wonder what it would take
> for the regular passes in the optimizer to optimize these opportunities
> appropriately (if they don't already).

My experience is more with dynamic instrumentation, so when I think
about instrumentation I think less about later optimization passes. The
check-and-write optimization we added may split the original basic block
into multiple basic blocks, which may make it harder for an optimizer to
transform the instrumented code. For example 1, I think an optimizer
might not know how to coalesce those shadow updates. It is possible that
I am wrong; we need to implement these and see.

> -- Sean Silva

-- Qin
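[Editor's note: as a hypothetical illustration of the byte-level
coalescing Qin describes (example 1 generalized to four consecutive cache
lines under a 64-byte-to-1-byte mapping), a single 4-byte update can mark
all four lines; kShadowOffset and the function name are placeholders.]

```cpp
#include <cstdint>
#include <cstring>

static const uintptr_t kShadowOffset = 0x100000000000ULL; // placeholder

// Under a 64-byte-to-1-byte mapping, the shadow bytes for the cache lines
// containing Addr, Addr+64, Addr+128, and Addr+192 are consecutive, so
// one 4-byte read-modify-write marks all four. (If an access is unaligned
// and spills into a fifth line, that line needs a separate update.)
void markFourConsecutiveLines(uintptr_t Addr) {
  uint8_t *Shadow = reinterpret_cast<uint8_t *>((Addr >> 6) + kShadowOffset);
  uint32_t Bytes;
  std::memcpy(&Bytes, Shadow, sizeof(Bytes)); // memcpy avoids misaligned access
  Bytes |= 0x01010101u;                       // bit 0 of each of the 4 bytes
  std::memcpy(Shadow, &Bytes, sizeof(Bytes));
}
```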