Vikram TV via llvm-dev
2016-Jun-08 07:58 UTC
[llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
Hi,

This is a proposal about implementing an analysis that calculates loop cost based on cache data. The primary motivation for implementing this is to write profitability measures for cache related optimizations like interchange, fusion, fission, pre-fetching and others.

I have implemented a prototypical version at http://reviews.llvm.org/D21124.

The patch basically creates groups of references that would lie in the same cache line. Each group is then analysed with respect to innermost loops considering cache lines. Penalty for the reference is:
a. 1, if the reference is invariant with the innermost loop,
b. TripCount for non-unit stride access,
c. TripCount / CacheLineSize for a unit-stride access.
Loop Cost is then calculated as the sum of the reference penalties times the product of the loop bounds of the outer loops. This loop cost can then be used as a profitability measure for cache reuse related optimizations. This is just a brief description; please refer to [1] for more details.

A primary drawback in the above patch is the static use of Cache Line Size. I wish to get this data from tblgen/TTI and I am happy to submit patches on it.

Finally, if the community is interested in the above work, I have the following work-plan:
a. Update loop cost patch to a consistent, commit-able state.
b. Add cache data information to tblgen/TTI.
c. Rewrite Loop Interchange profitability to start using Loop Cost.

Please share your valuable thoughts.

Thank you.

References:
[1] [Carr-McKinley-Tseng] http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf

--
Good time...
Vikram TV
CompilerTree Technologies
Mysore, Karnataka, INDIA
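[A minimal sketch of the cost model described above. The names RefGroup, refGroupPenalty and loopCost are illustrative assumptions, not the interface in D21124.]

  // Sketch only: per-reference-group penalty and loop cost as described
  // in the proposal; not the actual patch code.
  #include <cstdint>
  #include <vector>

  struct RefGroup {
    bool InvariantInInnerLoop; // reference does not vary with the innermost loop
    bool UnitStride;           // consecutive iterations touch adjacent elements
  };

  // Penalty of one reference group with respect to the innermost loop.
  static uint64_t refGroupPenalty(const RefGroup &RG, uint64_t InnerTripCount,
                                  uint64_t CacheLineSize) {
    if (RG.InvariantInInnerLoop)
      return 1;                               // rule (a): one line for the whole loop
    if (RG.UnitStride)
      return InnerTripCount / CacheLineSize;  // rule (c): one miss per cache line
    return InnerTripCount;                    // rule (b): assume a miss per iteration
  }

  // LoopCost = sum(reference penalties) * product(outer-loop trip counts).
  static uint64_t loopCost(const std::vector<RefGroup> &Groups,
                           uint64_t InnerTripCount,
                           const std::vector<uint64_t> &OuterTripCounts,
                           uint64_t CacheLineSize) {
    uint64_t PenaltySum = 0;
    for (const RefGroup &RG : Groups)
      PenaltySum += refGroupPenalty(RG, InnerTripCount, CacheLineSize);

    uint64_t OuterProduct = 1;
    for (uint64_t TC : OuterTripCounts)
      OuterProduct *= TC;

    return PenaltySum * OuterProduct;
  }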
Hal Finkel via llvm-dev
2016-Jun-08 16:50 UTC
[llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
----- Original Message -----
> From: "Vikram TV via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "DEV" <llvm-dev at lists.llvm.org>
> Sent: Wednesday, June 8, 2016 2:58:17 AM
> Subject: [llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
>
> Hi,
>
> This is a proposal about implementing an analysis that calculates
> loop cost based on cache data. The primary motivation for
> implementing this is to write profitability measures for cache
> related optimizations like interchange, fusion, fission,
> pre-fetching and others.
>
> I have implemented a prototypical version at
> http://reviews.llvm.org/D21124.

A quick comment on your pass: you'll also want to add the ability to get the loop trip count from profiling information from BFI when available (by dividing the loop header frequency by the loop predecessor-block frequencies).

> The patch basically creates groups of references that would lie in
> the same cache line. Each group is then analysed with respect to
> innermost loops considering cache lines. Penalty for the reference is:
> a. 1, if the reference is invariant with the innermost loop,
> b. TripCount for non-unit stride access,
> c. TripCount / CacheLineSize for a unit-stride access.
> Loop Cost is then calculated as the sum of the reference penalties
> times the product of the loop bounds of the outer loops. This loop
> cost can then be used as a profitability measure for cache reuse
> related optimizations. This is just a brief description; please
> refer to [1] for more details.

This sounds like the right approach, and is certainly something we need. One question is how the various transformations you name above (loop fusion, interchange, etc.) will use this analysis to evaluate the effect of the transformation they are considering performing. Specifically, they likely need to do this without actually rewriting the IR (speculatively).

> A primary drawback in the above patch is the static use of Cache Line
> Size. I wish to get this data from tblgen/TTI and I am happy to
> submit patches on it.

Yes, this sounds like the right direction. The targets obviously need to provide this information.

> Finally, if the community is interested in the above work, I have the
> following work-plan:
> a. Update loop cost patch to a consistent, commit-able state.
> b. Add cache data information to tblgen/TTI.
> c. Rewrite Loop Interchange profitability to start using Loop Cost.

Great!

This does not affect loop interchange directly, but one thing we also need to consider is the number of prefetching streams supported by the hardware prefetcher on various platforms. This is generally, per thread, some number around 5 to 10. Splitting loops requiring more streams than supported, and preventing fusion from requiring more streams than supported, is also an important cost metric. Of course, so are ILP and other factors.

 -Hal

> Please share your valuable thoughts.
>
> Thank you.
>
> References:
> [1] [Carr-McKinley-Tseng]
> http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf
>
> --
> Good time...
> Vikram TV
> CompilerTree Technologies
> Mysore, Karnataka, INDIA

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
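[A minimal sketch of the BFI-based trip-count estimate Hal suggests, using the usual BlockFrequencyInfo/LoopInfo interfaces; estimateTripCountFromProfile is a hypothetical helper, not an existing LLVM API or the D21124 code.]

  // Sketch only: estimate a loop's trip count from profile data by dividing
  // the header frequency by the total frequency of the blocks entering the
  // loop from outside.
  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/CFG.h"
  #include <cstdint>

  using namespace llvm;

  static uint64_t estimateTripCountFromProfile(const Loop *L,
                                               const BlockFrequencyInfo &BFI) {
    const BasicBlock *Header = L->getHeader();
    uint64_t HeaderFreq = BFI.getBlockFreq(Header).getFrequency();

    // Sum the frequencies of the predecessors that enter the loop from outside.
    uint64_t EnteringFreq = 0;
    for (const BasicBlock *Pred : predecessors(Header))
      if (!L->contains(Pred))
        EnteringFreq += BFI.getBlockFreq(Pred).getFrequency();

    if (EnteringFreq == 0)
      return 0; // no usable profile information

    // On average the header executes HeaderFreq / EnteringFreq times per entry.
    return HeaderFreq / EnteringFreq;
  }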
Ehsan Amiri via llvm-dev
2016-Jun-08 17:26 UTC
[llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
> a. 1, if the reference is invariant with the innermost loop,
> b. TripCount for non-unit stride access,
> c. TripCount / CacheLineSize for a unit-stride access.
> Loop Cost is then calculated as the sum of the reference penalties times
> the product of the loop bounds of the outer loops. This loop cost can then
> be used as a profitability measure for cache reuse related optimizations.
> This is just a brief description; please refer to [1] for more details.

I think you need finer granularity here. At least you need to distinguish between stride-c access (for some reasonable constant, say c = 2) and non-strided access, as in indirect indexing (a[b[x]]) or an otherwise unpredictable index.
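[A hedged sketch of the finer-grained classification being suggested; the AccessKind categories and the exact penalty values are illustrative assumptions, not part of the patch.]

  // Illustrative only: one possible refinement of the penalty function that
  // separates small constant strides from truly unpredictable accesses.
  #include <cstdint>

  enum class AccessKind {
    LoopInvariant,        // does not vary with the innermost loop
    UnitStride,           // stride of one element
    SmallConstStride,     // constant stride with StrideBytes <= CacheLineSize
    LargeOrUnknownStride, // large constant stride, or non-affine index
    Indirect              // a[b[x]] or otherwise unpredictable index
  };

  static uint64_t penalty(AccessKind Kind, uint64_t TripCount,
                          uint64_t CacheLineSize, uint64_t StrideBytes) {
    switch (Kind) {
    case AccessKind::LoopInvariant:
      return 1;                          // rule (a) above
    case AccessKind::UnitStride:
      return TripCount / CacheLineSize;  // rule (c) above
    case AccessKind::SmallConstStride:
      // Still reuses each cache line, just fewer times per line.
      return TripCount * StrideBytes / CacheLineSize;
    case AccessKind::LargeOrUnknownStride:
    case AccessKind::Indirect:
      return TripCount;                  // pessimistic: a miss per access
    }
    return TripCount;
  }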
Vikram TV via llvm-dev
2016-Jun-09 16:21 UTC
[llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
On Wed, Jun 8, 2016 at 10:20 PM, Hal Finkel <hfinkel at anl.gov> wrote:

>> This is a proposal about implementing an analysis that calculates loop
>> cost based on cache data. The primary motivation for implementing this is
>> to write profitability measures for cache related optimizations like
>> interchange, fusion, fission, pre-fetching and others.
>>
>> I have implemented a prototypical version at
>> http://reviews.llvm.org/D21124.
>
> A quick comment on your pass: you'll also want to add the ability to get
> the loop trip count from profiling information from BFI when available (by
> dividing the loop header frequency by the loop predecessor-block
> frequencies).

Sure. I will add it.

>> The patch basically creates groups of references that would lie in the
>> same cache line. Each group is then analysed with respect to innermost
>> loops considering cache lines. Penalty for the reference is:
>> a. 1, if the reference is invariant with the innermost loop,
>> b. TripCount for non-unit stride access,
>> c. TripCount / CacheLineSize for a unit-stride access.
>> Loop Cost is then calculated as the sum of the reference penalties times
>> the product of the loop bounds of the outer loops.
>
> This sounds like the right approach, and is certainly something we need.
> One question is how the various transformations you name above (loop
> fusion, interchange, etc.) will use this analysis to evaluate the effect of
> the transformation they are considering performing. Specifically, they
> likely need to do this without actually rewriting the IR (speculatively).

For interchange: loop cost can be used to build a required ordering, and interchange can try to match the nearest possible ordering. A few internal kernels, including matmul, are working fine with this approach.

For fission/fusion: loop cost can provide the cost before and after fusion. The loop cost for a fused loop can be obtained by building reference groups as if the loops were already fused; the case for fission is similar, but with partitions. These require no prior IR changes.

>> A primary drawback in the above patch is the static use of Cache Line
>> Size. I wish to get this data from tblgen/TTI and I am happy to submit
>> patches on it.
>
> Yes, this sounds like the right direction. The targets obviously need to
> provide this information.
>
>> Finally, if the community is interested in the above work, I have the
>> following work-plan:
>> a. Update loop cost patch to a consistent, commit-able state.
>> b. Add cache data information to tblgen/TTI.
>> c. Rewrite Loop Interchange profitability to start using Loop Cost.
>
> Great!
>
> This does not affect loop interchange directly, but one thing we also need
> to consider is the number of prefetching streams supported by the hardware
> prefetcher on various platforms. This is generally, per thread, some number
> around 5 to 10. Splitting loops requiring more streams than supported, and
> preventing fusion from requiring more streams than supported, is also an
> important cost metric. Of course, so are ILP and other factors.

Cache-based cost computation is one of the metrics and yes, more metrics need to be added.

> -Hal

--
Good time...
Vikram TV
CompilerTree Technologies
Mysore, Karnataka, INDIA
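[A small sketch of how a client such as interchange or fusion could consume the analysis without rewriting IR; pickBestOrder and the CostFn callback are hypothetical, standing in for whatever interface the patch ends up exposing.]

  // Sketch only: rank candidate loop orderings purely by their computed loop
  // cost, with no speculative IR rewriting.
  #include <algorithm>
  #include <cstdint>
  #include <functional>
  #include <vector>

  // A candidate ordering, expressed as a permutation of loop depths,
  // outermost first.
  using LoopOrder = std::vector<unsigned>;

  // Candidates must be non-empty; CostFn models "cost of the nest if the
  // loops were permuted into this order".
  static LoopOrder
  pickBestOrder(const std::vector<LoopOrder> &Candidates,
                const std::function<uint64_t(const LoopOrder &)> &CostFn) {
    return *std::min_element(Candidates.begin(), Candidates.end(),
                             [&](const LoopOrder &A, const LoopOrder &B) {
                               return CostFn(A) < CostFn(B);
                             });
  }

  // For fusion the comparison is even simpler: fuse only when
  //   CostFn(FusedShape) < CostFn(ShapeA) + CostFn(ShapeB).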
Vikram TV via llvm-dev
2016-Jun-09 16:33 UTC
[llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
On Wed, Jun 8, 2016 at 10:56 PM, Ehsan Amiri <ehsanamiri at gmail.com> wrote:

>> a. 1, if the reference is invariant with the innermost loop,
>> b. TripCount for non-unit stride access,
>> c. TripCount / CacheLineSize for a unit-stride access.
>> Loop Cost is then calculated as the sum of the reference penalties times
>> the product of the loop bounds of the outer loops. This loop cost can then
>> be used as a profitability measure for cache reuse related optimizations.
>> This is just a brief description; please refer to [1] for more details.
>
> I think you need finer granularity here. At least you need to distinguish
> between stride-c access (for some reasonable constant, say c = 2) and
> non-strided access, as in indirect indexing (a[b[x]]) or an otherwise
> unpredictable index.

I think 'c' should be <= CacheLineSize. I will keep this as a to-do, since only canonical loops are supported for now.

--
Good time...
Vikram TV
CompilerTree Technologies
Mysore, Karnataka, INDIA
JF Bastien via llvm-dev
2016-Jun-13 23:03 UTC
[llvm-dev] [Proposal][RFC] Cache aware Loop Cost Analysis
> > > A primary drawback in the above patch is the static use of Cache Line > Size. I wish to get this data from tblgen/TTI and I am happy to submit > patches on it. > > Yes, this sounds like the right direction. The targets obviously need to > provide this information. >I'd like to help review this as it'll be necessary to implement http://wg21.link/p0154r1 which is (likely) in C++17. It adds two values, constexpr std::hardware_{constructive,destructive}_interference_size, because in some configurations you don't have a precise cacheline size but rather a range based on what multiple architectures have implemented for the same ISA. I think you'll want something similar. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160613/e8d525b5/attachment.html>