Hal Finkel via llvm-dev
2016-Apr-23 00:18 UTC
[llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool
----- Original Message -----> From: "Hal Finkel via llvm-dev" <llvm-dev at lists.llvm.org> > To: "Qin Zhao" <zhaoqin at google.com> > Cc: "llvm-dev" <llvm-dev at lists.llvm.org> > Sent: Friday, April 22, 2016 7:13:38 PM > Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation > tool> ----- Original Message -----> > From: "Qin Zhao via llvm-dev" <llvm-dev at lists.llvm.org> > > > To: "llvm-dev" <llvm-dev at lists.llvm.org> > > > Sent: Friday, April 22, 2016 4:59:00 PM > > > Subject: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation > > tool >> > Please reference the prior RFC on EfficiencySanitizer. This is one > > of > > the performance analysis tools we would like to build under the > > EfficiencySanitizer umbrella. >> > ===================> > > Motivation > > > ===================>> > An application is running sub-optimally if only part of the data > > brought into the cache is used, which we call cache fragmentation. > > Knowing the cache fragmentation information during a given > > application's execution helps developers to understand the > > application’s cache behavior and how best to direct performance > > optimization efforts. For example, developers may reorder the > > struct > > fields based on the cache fragmentation information and hopefully > > improve cache hit ration and performance. >> > ===================> > > Approach > > > ===================>> > We focus on two ways to get cache fragmentation information: >> > 1. > > > Struct field access patterns. > > > 2. > > > Heap/global object access patterns. >> > ====================> > > Struct field access patterns > > > ====================>> > 1. > > > Get all the struct type information (e.g., via > > getIdentifiedStructTypes()), and create a counter for each field of > > each struct. > > > 2. > > > Instrument each GEP (GetElementPtr) instruction if it operates on a > > struct type to update the corresponding field counter. > > > 3. > > > At the program exit, filter and sort the struct field reference > > counters, and print the struct field hotness information for those > > structs deemed most likely to affect performance. The > > sorting/filtering metric could include disparity between fields: > > hot > > fields interleaved with cold fields, with a total access count high > > enough to matter. >> > There are a few potential problems with this simple approach: >> > * > > > Overcount: a GEP instruction does not necessarily mean a field > > access. > > > * > > > Undercount: a GEP instruction may lead to multiple field accesses, > > especially if the address is passed to another function as an > > argument. >You can ignore my message; I misread the distinction here. -Hal> > * >> I can't tell from your description, but it sounds like you might also > undercount accesses from escaped addresses (because these are later > indistinguishable from heap accesses)?> > * >> > * > > > Racy update by multiple threads. >> > We want to keep the instrumentation simple in our initial > > implementation for both robustness and performance reasons, so we > > will defer any analysis (e.g., data flow analysis) to later stages. > > Any suggestions on how to improve the accuracy are welcome. > > I don't understand why you're using a separate mechanism here from > what is being used for heap accesses. Why don't you use the same > shadow-memory scheme here as you do for heap accesses (especially > considering that escaped stack addresses will be counted this way > anyway), and then upon function exit, collect the counts from the > local stack? I think the necessary region is:> [@llvm.frameaddress(0), @llvm.frameaddress(0) + > @llvm.get.dynamic.area.offset())> or you can call @llvm.stacksave() upon entry and use that as the base > offset.> -Hal> > There is one simple improvement we may want to explore: the > > temporal > > locality of struct field accesses. > > > Two struct fields being hot (i.e., frequently accessed) does not > > necessarily mean they are accessed together. We want to know the > > affinity among those struct fields, which could be determined via a > > sampling approach: track which fields are accessed together during > > the last period at each sample, and update an affinity table for > > the > > final report. >> > We expect the time overhead of the tool to be well under the 5x > > EfficiencySanitizer ceiling; presumably it should be under 2x. >> > ==========================> > > Heap/global object access patterns > > > ==========================>> > We plan to use shadow memory and sampling to keep track of > > heap/global object accesses. > > > Shadow memory: > > > We use a 4byte-to-1byte shadow mapping. Each application word is > > mapped to a >> > * > > > shadow byte, and so a 64-byte cache line is mapped to a 16-byte > > shadow memory. In each shadow byte, the highest bit is used for > > indicating whether the corresponding application word is accessed, > > and the other 7 bits are used as a counter for the hotness of the > > application word. > > > * > > > Instrumentation: On every memory reference, the instrumented code > > simply checks if the highest bit is set. If not, the code sets it > > using an OR operation. We will live with races in updating shadow > > memory bits. > > > * > > > Sampling: On each sample we scan the shadow memory. If the highest > > bit of a shadow byte is set, we increment the 7-bit counter (to the > > maximum of 127; if this is found to be too small we could use > > separate storage for an additional counter for hot fields). > > > * > > > Memory allocation wrapping: When a heap object is freed, we acquire > > the callstack and its access pattern in the shadow memory. We may > > coalesce them based on the allocation/free callstack. >> > The report from the tool to the user at the end of execution would > > essentially be a list of objects that have some significant > > fragmented access pattern. We expect the time overhead of the tool > > to be well under the 5x EfficiencySanitizer ceiling; presumably it > > should be under 3x. >> > We plan to implement both struct field access tracking and shadow > > based heap/global object access tracking. In our initial > > implementation, we plan to provide both results to developers > > simultaneously. >> > There are a number of alternative approaches that could be > > explored, > > including a 4byte:4byte 32-bit hotness counter per 4-byte field, or > > a 1byte:1bit bitmap for field byte granularity with sampling. > > Extensions to the proposals above could also be explored in the > > future, such as combining the struct and shadow modes for better > > results. Additionally, we may use the 7 shadow bits differently to > > track temporal locality information instead. Any suggestions are > > also welcome. >> > -- Qin >> > _______________________________________________ > > > LLVM Developers mailing list > > > llvm-dev at lists.llvm.org > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> --> Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory> _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160422/d57f4278/attachment-0001.html>
Hal Finkel via llvm-dev
2016-Apr-23 00:38 UTC
[llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool
----- Original Message -----> From: "Hal Finkel" <hfinkel at anl.gov> > To: "Hal Finkel" <hfinkel at anl.gov> > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Qin Zhao" > <zhaoqin at google.com> > Sent: Friday, April 22, 2016 7:18:03 PM > Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation > tool> ----- Original Message -----> > From: "Hal Finkel via llvm-dev" <llvm-dev at lists.llvm.org> > > > To: "Qin Zhao" <zhaoqin at google.com> > > > Cc: "llvm-dev" <llvm-dev at lists.llvm.org> > > > Sent: Friday, April 22, 2016 7:13:38 PM > > > Subject: Re: [llvm-dev] RFC: EfficiencySanitizer Cache > > Fragmentation > > tool >> > ----- Original Message ----- >> > > From: "Qin Zhao via llvm-dev" <llvm-dev at lists.llvm.org> > > > > > > To: "llvm-dev" <llvm-dev at lists.llvm.org> > > > > > > Sent: Friday, April 22, 2016 4:59:00 PM > > > > > > Subject: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation > > > tool > > >> > > Please reference the prior RFC on EfficiencySanitizer. This is > > > one > > > of > > > the performance analysis tools we would like to build under the > > > EfficiencySanitizer umbrella. > > >> > > ===================> > > > > > Motivation > > > > > > ===================> > >> > > An application is running sub-optimally if only part of the data > > > brought into the cache is used, which we call cache > > > fragmentation. > > > Knowing the cache fragmentation information during a given > > > application's execution helps developers to understand the > > > application’s cache behavior and how best to direct performance > > > optimization efforts. For example, developers may reorder the > > > struct > > > fields based on the cache fragmentation information and hopefully > > > improve cache hit ration and performance. > > >> > > ===================> > > > > > Approach > > > > > > ===================> > >> > > We focus on two ways to get cache fragmentation information: > > >> > > 1. > > > > > > Struct field access patterns. > > > > > > 2. > > > > > > Heap/global object access patterns. > > >> > > ====================> > > > > > Struct field access patterns > > > > > > ====================> > >> > > 1. > > > > > > Get all the struct type information (e.g., via > > > getIdentifiedStructTypes()), and create a counter for each field > > > of > > > each struct. > > > > > > 2. > > > > > > Instrument each GEP (GetElementPtr) instruction if it operates on > > > a > > > struct type to update the corresponding field counter. > > > > > > 3. > > > > > > At the program exit, filter and sort the struct field reference > > > counters, and print the struct field hotness information for > > > those > > > structs deemed most likely to affect performance. The > > > sorting/filtering metric could include disparity between fields: > > > hot > > > fields interleaved with cold fields, with a total access count > > > high > > > enough to matter. > > >> > > There are a few potential problems with this simple approach: > > >> > > * > > > > > > Overcount: a GEP instruction does not necessarily mean a field > > > access. > > > > > > * > > > > > > Undercount: a GEP instruction may lead to multiple field > > > accesses, > > > especially if the address is passed to another function as an > > > argument. > > >I changed my mind; don't ignore what I said. But let me propose a slightly-different design: 1. Just as you've described, collect counts in shadow memory for memory accesses. 2. For each relevant type, emit a function that knows how to collect counts from an array of that type into the struct-field counters . 3. When memory is freed, call the appropriate function to perform this accumulation. 4. If desired, also collect counts from stack memory (as I indicated below, although we'll need to call all of the relevant summation functions, and this could get expensive if they're not inlined). In this way, we'll have no over-counting / under-counting problems; plus this is easily extensible to collecting per-allocation struct-field access stats (which is often desired). We can get everything at the same time, and might actually be less expensive when running on multiple cores (because they'll be less cache-line contention for the struct-field counters). Thanks again, Hal> You can ignore my message; I misread the distinction here.> -Hal> > > * > > >> > I can't tell from your description, but it sounds like you might > > also > > undercount accesses from escaped addresses (because these are later > > indistinguishable from heap accesses)? >> > > * > > >> > > * > > > > > > Racy update by multiple threads. > > >> > > We want to keep the instrumentation simple in our initial > > > implementation for both robustness and performance reasons, so we > > > will defer any analysis (e.g., data flow analysis) to later > > > stages. > > > Any suggestions on how to improve the accuracy are welcome. > > > > > I don't understand why you're using a separate mechanism here from > > what is being used for heap accesses. Why don't you use the same > > shadow-memory scheme here as you do for heap accesses (especially > > considering that escaped stack addresses will be counted this way > > anyway), and then upon function exit, collect the counts from the > > local stack? I think the necessary region is: >> > [@llvm.frameaddress(0), @llvm.frameaddress(0) + > > @llvm.get.dynamic.area.offset()) >> > or you can call @llvm.stacksave() upon entry and use that as the > > base > > offset. >> > -Hal >> > > There is one simple improvement we may want to explore: the > > > temporal > > > locality of struct field accesses. > > > > > > Two struct fields being hot (i.e., frequently accessed) does not > > > necessarily mean they are accessed together. We want to know the > > > affinity among those struct fields, which could be determined via > > > a > > > sampling approach: track which fields are accessed together > > > during > > > the last period at each sample, and update an affinity table for > > > the > > > final report. > > >> > > We expect the time overhead of the tool to be well under the 5x > > > EfficiencySanitizer ceiling; presumably it should be under 2x. > > >> > > ==========================> > > > > > Heap/global object access patterns > > > > > > ==========================> > >> > > We plan to use shadow memory and sampling to keep track of > > > heap/global object accesses. > > > > > > Shadow memory: > > > > > > We use a 4byte-to-1byte shadow mapping. Each application word is > > > mapped to a > > >> > > * > > > > > > shadow byte, and so a 64-byte cache line is mapped to a 16-byte > > > shadow memory. In each shadow byte, the highest bit is used for > > > indicating whether the corresponding application word is > > > accessed, > > > and the other 7 bits are used as a counter for the hotness of the > > > application word. > > > > > > * > > > > > > Instrumentation: On every memory reference, the instrumented code > > > simply checks if the highest bit is set. If not, the code sets it > > > using an OR operation. We will live with races in updating shadow > > > memory bits. > > > > > > * > > > > > > Sampling: On each sample we scan the shadow memory. If the > > > highest > > > bit of a shadow byte is set, we increment the 7-bit counter (to > > > the > > > maximum of 127; if this is found to be too small we could use > > > separate storage for an additional counter for hot fields). > > > > > > * > > > > > > Memory allocation wrapping: When a heap object is freed, we > > > acquire > > > the callstack and its access pattern in the shadow memory. We may > > > coalesce them based on the allocation/free callstack. > > >> > > The report from the tool to the user at the end of execution > > > would > > > essentially be a list of objects that have some significant > > > fragmented access pattern. We expect the time overhead of the > > > tool > > > to be well under the 5x EfficiencySanitizer ceiling; presumably > > > it > > > should be under 3x. > > >> > > We plan to implement both struct field access tracking and shadow > > > based heap/global object access tracking. In our initial > > > implementation, we plan to provide both results to developers > > > simultaneously. > > >> > > There are a number of alternative approaches that could be > > > explored, > > > including a 4byte:4byte 32-bit hotness counter per 4-byte field, > > > or > > > a 1byte:1bit bitmap for field byte granularity with sampling. > > > Extensions to the proposals above could also be explored in the > > > future, such as combining the struct and shadow modes for better > > > results. Additionally, we may use the 7 shadow bits differently > > > to > > > track temporal locality information instead. Any suggestions are > > > also welcome. > > >> > > -- Qin > > >> > > _______________________________________________ > > > > > > LLVM Developers mailing list > > > > > > llvm-dev at lists.llvm.org > > > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > >> > -- >> > Hal Finkel > > > Assistant Computational Scientist > > > Leadership Computing Facility > > > Argonne National Laboratory >> > _______________________________________________ > > > LLVM Developers mailing list > > > llvm-dev at lists.llvm.org > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> --> Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160422/063f8b34/attachment.html>
Qin Zhao via llvm-dev
2016-Apr-23 03:26 UTC
[llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool
Thanks for the comment and suggestions. That's a great idea! We actually thought about using each heap object with its type information for more accurate data, and it is definitely in our future plan. However, there are challenges to do so. For example, getting correct type information for each heap object might not be easy, especially for C programs. An application can even use a custom allocator and make things even harder. Please let me know if there is a way to get correct type information for heap object. There are also other issues like how to count union of two struct types. I want to collect good enough information with minimized overhead, so I need evaluate overhead and quality of the profile for each approach. Maybe the simple field counting approach is good enough. We need implement and evaluate. -- Qin On Fri, Apr 22, 2016 at 8:38 PM, Hal Finkel <hfinkel at anl.gov> wrote:> > > ------------------------------ > > *From: *"Hal Finkel" <hfinkel at anl.gov> > *To: *"Hal Finkel" <hfinkel at anl.gov> > *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Qin Zhao" <zhaoqin at google.com > > > *Sent: *Friday, April 22, 2016 7:18:03 PM > > *Subject: *Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation > tool > > > > ------------------------------ > > *From: *"Hal Finkel via llvm-dev" <llvm-dev at lists.llvm.org> > *To: *"Qin Zhao" <zhaoqin at google.com> > *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org> > *Sent: *Friday, April 22, 2016 7:13:38 PM > *Subject: *Re: [llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation > tool > > > ------------------------------ > > *From: *"Qin Zhao via llvm-dev" <llvm-dev at lists.llvm.org> > *To: *"llvm-dev" <llvm-dev at lists.llvm.org> > *Sent: *Friday, April 22, 2016 4:59:00 PM > *Subject: *[llvm-dev] RFC: EfficiencySanitizer Cache Fragmentation tool > > Please reference the prior RFC on EfficiencySanitizer. This is one of the > performance analysis tools we would like to build under the > EfficiencySanitizer umbrella. > > ===================> > Motivation > > ===================> > An application is running sub-optimally if only part of the data brought > into the cache is used, which we call cache fragmentation. Knowing the > cache fragmentation information during a given application's execution > helps developers to understand the application’s cache behavior and how > best to direct performance optimization efforts. For example, developers > may reorder the struct fields based on the cache fragmentation information > and hopefully improve cache hit ration and performance. > > ===================> > Approach > > ===================> > We focus on two ways to get cache fragmentation information: > > 1. > > Struct field access patterns. > 2. > > Heap/global object access patterns. > > > ====================> > Struct field access patterns > > ====================> > > 1. > > Get all the struct type information (e.g., via > getIdentifiedStructTypes()), and create a counter for each field of each > struct. > 2. > > Instrument each GEP (GetElementPtr) instruction if it operates on a > struct type to update the corresponding field counter. > 3. > > At the program exit, filter and sort the struct field reference > counters, and print the struct field hotness information for those structs > deemed most likely to affect performance. The sorting/filtering metric > could include disparity between fields: hot fields interleaved with cold > fields, with a total access count high enough to matter. > > > There are a few potential problems with this simple approach: > > - > > Overcount: a GEP instruction does not necessarily mean a field access. > - > > Undercount: a GEP instruction may lead to multiple field accesses, > especially if the address is passed to another function as an argument. > > > > I changed my mind; don't ignore what I said. But let me propose a > slightly-different design: > > 1. Just as you've described, collect counts in shadow memory for memory > accesses. > 2. For each relevant type, emit a function that knows how to collect > counts from an array of that type into the struct-field counters. > 3. When memory is freed, call the appropriate function to perform this > accumulation. > 4. If desired, also collect counts from stack memory (as I indicated > below, although we'll need to call all of the relevant summation functions, > and this could get expensive if they're not inlined). > > In this way, we'll have no over-counting / under-counting problems; plus > this is easily extensible to collecting per-allocation struct-field access > stats (which is often desired). We can get everything at the same time, and > might actually be less expensive when running on multiple cores (because > they'll be less cache-line contention for the struct-field counters). > > Thanks again, > Hal > > You can ignore my message; I misread the distinction here. > > -Hal > > > - > > > > I can't tell from your description, but it sounds like you might also > undercount accesses from escaped addresses (because these are later > indistinguishable from heap accesses)? > > > - > > > - > > Racy update by multiple threads. > > We want to keep the instrumentation simple in our initial implementation > for both robustness and performance reasons, so we will defer any analysis > (e.g., data flow analysis) to later stages. Any suggestions on how to > improve the accuracy are welcome. > > I don't understand why you're using a separate mechanism here from what is > being used for heap accesses. Why don't you use the same shadow-memory > scheme here as you do for heap accesses (especially considering that > escaped stack addresses will be counted this way anyway), and then upon > function exit, collect the counts from the local stack? I think the > necessary region is: > > [@llvm.frameaddress(0), @llvm.frameaddress(0) + > @llvm.get.dynamic.area.offset()) > > or you can call @llvm.stacksave() upon entry and use that as the base > offset. > > -Hal > > > There is one simple improvement we may want to explore: the temporal > locality of struct field accesses. > > Two struct fields being hot (i.e., frequently accessed) does not > necessarily mean they are accessed together. We want to know the affinity > among those struct fields, which could be determined via a sampling > approach: track which fields are accessed together during the last period > at each sample, and update an affinity table for the final report. > > We expect the time overhead of the tool to be well under the 5x > EfficiencySanitizer ceiling; presumably it should be under 2x. > > ==========================> > Heap/global object access patterns > > ==========================> > We plan to use shadow memory and sampling to keep track of heap/global > object accesses. > > Shadow memory: > We use a 4byte-to-1byte shadow mapping. Each application word is mapped > to a > > - > > shadow byte, and so a 64-byte cache line is mapped to a 16-byte shadow > memory. In each shadow byte, the highest bit is used for indicating whether > the corresponding application word is accessed, and the other 7 bits are > used as a counter for the hotness of the application word. > > - > > Instrumentation: > On every memory reference, the instrumented code simply checks if the > highest bit is set. If not, the code sets it using an OR operation. We will > live with races in updating shadow memory bits. > - > > Sampling: > On each sample we scan the shadow memory. If the highest bit of a > shadow byte is set, we increment the 7-bit counter (to the maximum of 127; > if this is found to be too small we could use separate storage for an > additional counter for hot fields). > - > > Memory allocation wrapping: > When a heap object is freed, we acquire the callstack and its access > pattern in the shadow memory. We may coalesce them based on the > allocation/free callstack. > > > The report from the tool to the user at the end of execution would > essentially be a list of objects that have some significant fragmented > access pattern. We expect the time overhead of the tool to be well under > the 5x EfficiencySanitizer ceiling; presumably it should be under 3x. > > We plan to implement both struct field access tracking and shadow based > heap/global object access tracking. In our initial implementation, we plan > to provide both results to developers simultaneously. > > There are a number of alternative approaches that could be explored, > including a 4byte:4byte 32-bit hotness counter per 4-byte field, or a > 1byte:1bit bitmap for field byte granularity with sampling. Extensions to > the proposals above could also be explored in the future, such as combining > the struct and shadow modes for better results. Additionally, we may use > the 7 shadow bits differently to track temporal locality information > instead. Any suggestions are also welcome. > > > -- Qin > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory > > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160422/da920393/attachment.html>