Xinliang David Li via llvm-dev
2015-Dec-17 19:17 UTC
[llvm-dev] RFC: Hotness thresholds in profile header
On Thu, Dec 17, 2015 at 9:21 AM, Andy Ayers <andya at microsoft.com> wrote:> While your bb count distribution is extremely likely to be some kind of power-law like distribution, it's not guaranteed. > > Also you might think about operations that can amplify (rerolling) or appear to amplify (TRE) or diminish BB counts, and how you'd go about reclassifying block hotness.yes -- this is more of a problem for Sample based FDO because the samples are collected after such transformations. We have a paper discussing that (to appear cgo2016). thanks, Dvaid> > > -----Original Message----- > From: Xinliang David Li [mailto:davidxl at google.com] > Sent: Wednesday, December 16, 2015 9:47 PM > To: Andy Ayers <andya at microsoft.com> > Cc: Easwaran Raman <eraman at google.com>; llvm-dev at lists.llvm.org > Subject: Re: [llvm-dev] RFC: Hotness thresholds in profile header > > On Wed, Dec 16, 2015 at 6:00 PM, Andy Ayers <andya at microsoft.com> wrote: >> I’ve done similar rankings with profile data and found it worked out >> pretty well. >> >> >> >> In my case it was ranking function hotness but the idea was similar: >> sort by weight, compute various percentile cutoff points, and use those to classify. >> I put in some compensation for truly degenerate cases. Consider one >> super-hot function that is 100x the weight of all remaining functions >> – I’d just call that function hot and subtract it out of the >> distribution and so spread a bit of the hotness around elsewhere. >> > > This is probably more important if we only consider function entry count (to measure hotness). With Easwaran's proposal, the cut-off computation is based on sum of BB counts (sorted in descending order) > -- the BB count sum approximate the execution time spent in those BBs. > For instance, if 99.9% of execution time is spent in top 10 BBs, the rest BBs are pretty much irrelevant in terms of performance. > > thanks, > David > > >> >> >> Also, you should think about how to handle ties, so your sorting & >> ranking is “stable” somehow. >> >> >> >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of >> Easwaran Raman via llvm-dev >> Sent: Wednesday, December 16, 2015 2:43 PM >> To: llvm-dev at lists.llvm.org >> Cc: David Li <davidxl at google.com> >> Subject: [llvm-dev] RFC: Hotness thresholds in profile header >> >> >> >> Hi, >> >> The problem we're trying to address: >> >> PGO transforms that are based on block counts need some way to answer >> isHotBlock() query. A simple comparison of the form block_count > N, >> for some constant N is not useful since even for the same program >> different profile inputs can cause different answers for isHotBlock() >> on the same block. This can be addressed by comparing against a >> reference value that depends on the program and its input. For >> instance, the indexed profile header contains MaxFunctionCount and >> isHotBlock() could check if the block count exceeds a certain fraction of MaxFunctionCount. >> >> >> >> The drawback of this approach is that it is difficult to come up with >> a single fraction that works both for programs with a few hot-spots >> and programs with a long tail of block execution counts. Consider the >> following two profiles both with a MaxFunctionCount of 1000. In the >> first case, there are many blocks with count 100 and those (along with >> the function with count >> 1000) account for 99% of execution time. In the second case, there are >> many blocks with count 1000 that together account for 99% of execution >> time, but there are also a non-trivial number of blocks with count >> 100. In the first case, we would like the hotness threshold to be 100 >> (or 10% of MaxFunctionCount), but in the second case, we want it to be >> 1000 (100% of MaxFunctionCount). >> >> >> Our proposal: >> >> We want to define hot blocks as those that account for a large >> percentage(P) of execution time of the program. We then make a >> simplistic assumption that all blocks take the same amount of time to >> execute. With this assumption, the execution time of the program can >> be approximated by the sum of execution counts of the blocks in the >> program. We can then compute a maximum threshold T(P) such that >> execution counts of blocks whose count is >= T(P) account for P percentage of total block count. >> >> To illustrate, let {B1...B5} be the set of blocks whose counts are {1, >> 0, 2, 2, 5}. For this example: >> >> T(50) = 5 since {B5} accounts for 50% of total block count. >> T(90) = 2 since {B3, B4, B5} has a cumulative count of 9 which is 90% >> of total block count. >> T(95) = 1 since all blocks with non-zero block counts are needed to >> account for >=95% of total block count. >> >> We propose to record a vector of triples {P, T(P), B(P)}, where B(P) >> is the number of blocks with count >= T(P), for a set of pre-defined >> number of values of P into the header of the indexed profile. A >> possible set of values of P are 90, 95, 99, 99.9, 99.99, but this >> should be configurable through profdata tool. The B(P) above could be >> useful to optimization passes in making size/performance tradeoffs. >> For example, if S is the average block size and S*B(95) fits well >> within cache, but S*B(99) well exceeds the cache size, an optimization >> like loop unroller could choose to use T(95) as the threshold to avoid heavy cache misses. >> >> The {P, T(P), B(P)} triples have to be attached to the module. A >> common API would be added to retrieve the set of available Ps and >> {T(P), B(P)} pair for a given P for all flavors (front-end, IR level and sample) of profiles. >> >> Thoughts? >> >> >> >> Thanks, >> >> Easwaran (with inputs from Teresa Johnson and David Li)
Easwaran Raman via llvm-dev
2016-Jan-08 19:59 UTC
[llvm-dev] RFC: Hotness thresholds in profile header
I have a patch at http://reviews.llvm.org/D16005 that implements this feature as part of 'show' command of the llvm-profdata. I think it'll be useful to implement this as part of show so that it can be tried on profiles of programs that people care about and refine it before getting this information into the header of the indexed profile. On Thu, Dec 17, 2015 at 11:17 AM, Xinliang David Li <davidxl at google.com> wrote:> On Thu, Dec 17, 2015 at 9:21 AM, Andy Ayers <andya at microsoft.com> wrote: > > While your bb count distribution is extremely likely to be some kind of > power-law like distribution, it's not guaranteed. > > > > Also you might think about operations that can amplify (rerolling) or > appear to amplify (TRE) or diminish BB counts, and how you'd go about > reclassifying block hotness. > > yes -- this is more of a problem for Sample based FDO because the > samples are collected after such transformations. We have a paper > discussing that (to appear cgo2016). > > thanks, > > Dvaid > > > > > > > -----Original Message----- > > From: Xinliang David Li [mailto:davidxl at google.com] > > Sent: Wednesday, December 16, 2015 9:47 PM > > To: Andy Ayers <andya at microsoft.com> > > Cc: Easwaran Raman <eraman at google.com>; llvm-dev at lists.llvm.org > > Subject: Re: [llvm-dev] RFC: Hotness thresholds in profile header > > > > On Wed, Dec 16, 2015 at 6:00 PM, Andy Ayers <andya at microsoft.com> wrote: > >> I’ve done similar rankings with profile data and found it worked out > >> pretty well. > >> > >> > >> > >> In my case it was ranking function hotness but the idea was similar: > >> sort by weight, compute various percentile cutoff points, and use those > to classify. > >> I put in some compensation for truly degenerate cases. Consider one > >> super-hot function that is 100x the weight of all remaining functions > >> – I’d just call that function hot and subtract it out of the > >> distribution and so spread a bit of the hotness around elsewhere. > >> > > > > This is probably more important if we only consider function entry count > (to measure hotness). With Easwaran's proposal, the cut-off computation is > based on sum of BB counts (sorted in descending order) > > -- the BB count sum approximate the execution time spent in those BBs. > > For instance, if 99.9% of execution time is spent in top 10 BBs, the > rest BBs are pretty much irrelevant in terms of performance. > > > > thanks, > > David > > > > > >> > >> > >> Also, you should think about how to handle ties, so your sorting & > >> ranking is “stable” somehow. > >> > >> > >> > >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of > >> Easwaran Raman via llvm-dev > >> Sent: Wednesday, December 16, 2015 2:43 PM > >> To: llvm-dev at lists.llvm.org > >> Cc: David Li <davidxl at google.com> > >> Subject: [llvm-dev] RFC: Hotness thresholds in profile header > >> > >> > >> > >> Hi, > >> > >> The problem we're trying to address: > >> > >> PGO transforms that are based on block counts need some way to answer > >> isHotBlock() query. A simple comparison of the form block_count > N, > >> for some constant N is not useful since even for the same program > >> different profile inputs can cause different answers for isHotBlock() > >> on the same block. This can be addressed by comparing against a > >> reference value that depends on the program and its input. For > >> instance, the indexed profile header contains MaxFunctionCount and > >> isHotBlock() could check if the block count exceeds a certain fraction > of MaxFunctionCount. > >> > >> > >> > >> The drawback of this approach is that it is difficult to come up with > >> a single fraction that works both for programs with a few hot-spots > >> and programs with a long tail of block execution counts. Consider the > >> following two profiles both with a MaxFunctionCount of 1000. In the > >> first case, there are many blocks with count 100 and those (along with > >> the function with count > >> 1000) account for 99% of execution time. In the second case, there are > >> many blocks with count 1000 that together account for 99% of execution > >> time, but there are also a non-trivial number of blocks with count > >> 100. In the first case, we would like the hotness threshold to be 100 > >> (or 10% of MaxFunctionCount), but in the second case, we want it to be > >> 1000 (100% of MaxFunctionCount). > >> > >> > >> Our proposal: > >> > >> We want to define hot blocks as those that account for a large > >> percentage(P) of execution time of the program. We then make a > >> simplistic assumption that all blocks take the same amount of time to > >> execute. With this assumption, the execution time of the program can > >> be approximated by the sum of execution counts of the blocks in the > >> program. We can then compute a maximum threshold T(P) such that > >> execution counts of blocks whose count is >= T(P) account for P > percentage of total block count. > >> > >> To illustrate, let {B1...B5} be the set of blocks whose counts are {1, > >> 0, 2, 2, 5}. For this example: > >> > >> T(50) = 5 since {B5} accounts for 50% of total block count. > >> T(90) = 2 since {B3, B4, B5} has a cumulative count of 9 which is 90% > >> of total block count. > >> T(95) = 1 since all blocks with non-zero block counts are needed to > >> account for >=95% of total block count. > >> > >> We propose to record a vector of triples {P, T(P), B(P)}, where B(P) > >> is the number of blocks with count >= T(P), for a set of pre-defined > >> number of values of P into the header of the indexed profile. A > >> possible set of values of P are 90, 95, 99, 99.9, 99.99, but this > >> should be configurable through profdata tool. The B(P) above could be > >> useful to optimization passes in making size/performance tradeoffs. > >> For example, if S is the average block size and S*B(95) fits well > >> within cache, but S*B(99) well exceeds the cache size, an optimization > >> like loop unroller could choose to use T(95) as the threshold to avoid > heavy cache misses. > >> > >> The {P, T(P), B(P)} triples have to be attached to the module. A > >> common API would be added to retrieve the set of available Ps and > >> {T(P), B(P)} pair for a given P for all flavors (front-end, IR level > and sample) of profiles. > >> > >> Thoughts? > >> > >> > >> > >> Thanks, > >> > >> Easwaran (with inputs from Teresa Johnson and David Li) >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160108/3a72ddf0/attachment.html>
Xinliang David Li via llvm-dev
2016-Jan-08 20:01 UTC
[llvm-dev] RFC: Hotness thresholds in profile header
Can you submit the patch for review? David On Fri, Jan 8, 2016 at 11:59 AM, Easwaran Raman <eraman at google.com> wrote:> I have a patch at http://reviews.llvm.org/D16005 that implements this > feature as part of 'show' command of the llvm-profdata. I think it'll be > useful to implement this as part of show so that it can be tried on profiles > of programs that people care about and refine it before getting this > information into the header of the indexed profile. > > On Thu, Dec 17, 2015 at 11:17 AM, Xinliang David Li <davidxl at google.com> > wrote: >> >> On Thu, Dec 17, 2015 at 9:21 AM, Andy Ayers <andya at microsoft.com> wrote: >> > While your bb count distribution is extremely likely to be some kind of >> > power-law like distribution, it's not guaranteed. >> > >> > Also you might think about operations that can amplify (rerolling) or >> > appear to amplify (TRE) or diminish BB counts, and how you'd go about >> > reclassifying block hotness. >> >> yes -- this is more of a problem for Sample based FDO because the >> samples are collected after such transformations. We have a paper >> discussing that (to appear cgo2016). >> >> thanks, >> >> Dvaid >> >> > >> > >> > -----Original Message----- >> > From: Xinliang David Li [mailto:davidxl at google.com] >> > Sent: Wednesday, December 16, 2015 9:47 PM >> > To: Andy Ayers <andya at microsoft.com> >> > Cc: Easwaran Raman <eraman at google.com>; llvm-dev at lists.llvm.org >> > Subject: Re: [llvm-dev] RFC: Hotness thresholds in profile header >> > >> > On Wed, Dec 16, 2015 at 6:00 PM, Andy Ayers <andya at microsoft.com> wrote: >> >> I’ve done similar rankings with profile data and found it worked out >> >> pretty well. >> >> >> >> >> >> >> >> In my case it was ranking function hotness but the idea was similar: >> >> sort by weight, compute various percentile cutoff points, and use those >> >> to classify. >> >> I put in some compensation for truly degenerate cases. Consider one >> >> super-hot function that is 100x the weight of all remaining functions >> >> – I’d just call that function hot and subtract it out of the >> >> distribution and so spread a bit of the hotness around elsewhere. >> >> >> > >> > This is probably more important if we only consider function entry count >> > (to measure hotness). With Easwaran's proposal, the cut-off computation is >> > based on sum of BB counts (sorted in descending order) >> > -- the BB count sum approximate the execution time spent in those BBs. >> > For instance, if 99.9% of execution time is spent in top 10 BBs, the >> > rest BBs are pretty much irrelevant in terms of performance. >> > >> > thanks, >> > David >> > >> > >> >> >> >> >> >> Also, you should think about how to handle ties, so your sorting & >> >> ranking is “stable” somehow. >> >> >> >> >> >> >> >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of >> >> Easwaran Raman via llvm-dev >> >> Sent: Wednesday, December 16, 2015 2:43 PM >> >> To: llvm-dev at lists.llvm.org >> >> Cc: David Li <davidxl at google.com> >> >> Subject: [llvm-dev] RFC: Hotness thresholds in profile header >> >> >> >> >> >> >> >> Hi, >> >> >> >> The problem we're trying to address: >> >> >> >> PGO transforms that are based on block counts need some way to answer >> >> isHotBlock() query. A simple comparison of the form block_count > N, >> >> for some constant N is not useful since even for the same program >> >> different profile inputs can cause different answers for isHotBlock() >> >> on the same block. This can be addressed by comparing against a >> >> reference value that depends on the program and its input. For >> >> instance, the indexed profile header contains MaxFunctionCount and >> >> isHotBlock() could check if the block count exceeds a certain fraction >> >> of MaxFunctionCount. >> >> >> >> >> >> >> >> The drawback of this approach is that it is difficult to come up with >> >> a single fraction that works both for programs with a few hot-spots >> >> and programs with a long tail of block execution counts. Consider the >> >> following two profiles both with a MaxFunctionCount of 1000. In the >> >> first case, there are many blocks with count 100 and those (along with >> >> the function with count >> >> 1000) account for 99% of execution time. In the second case, there are >> >> many blocks with count 1000 that together account for 99% of execution >> >> time, but there are also a non-trivial number of blocks with count >> >> 100. In the first case, we would like the hotness threshold to be 100 >> >> (or 10% of MaxFunctionCount), but in the second case, we want it to be >> >> 1000 (100% of MaxFunctionCount). >> >> >> >> >> >> Our proposal: >> >> >> >> We want to define hot blocks as those that account for a large >> >> percentage(P) of execution time of the program. We then make a >> >> simplistic assumption that all blocks take the same amount of time to >> >> execute. With this assumption, the execution time of the program can >> >> be approximated by the sum of execution counts of the blocks in the >> >> program. We can then compute a maximum threshold T(P) such that >> >> execution counts of blocks whose count is >= T(P) account for P >> >> percentage of total block count. >> >> >> >> To illustrate, let {B1...B5} be the set of blocks whose counts are {1, >> >> 0, 2, 2, 5}. For this example: >> >> >> >> T(50) = 5 since {B5} accounts for 50% of total block count. >> >> T(90) = 2 since {B3, B4, B5} has a cumulative count of 9 which is 90% >> >> of total block count. >> >> T(95) = 1 since all blocks with non-zero block counts are needed to >> >> account for >=95% of total block count. >> >> >> >> We propose to record a vector of triples {P, T(P), B(P)}, where B(P) >> >> is the number of blocks with count >= T(P), for a set of pre-defined >> >> number of values of P into the header of the indexed profile. A >> >> possible set of values of P are 90, 95, 99, 99.9, 99.99, but this >> >> should be configurable through profdata tool. The B(P) above could be >> >> useful to optimization passes in making size/performance tradeoffs. >> >> For example, if S is the average block size and S*B(95) fits well >> >> within cache, but S*B(99) well exceeds the cache size, an optimization >> >> like loop unroller could choose to use T(95) as the threshold to avoid >> >> heavy cache misses. >> >> >> >> The {P, T(P), B(P)} triples have to be attached to the module. A >> >> common API would be added to retrieve the set of available Ps and >> >> {T(P), B(P)} pair for a given P for all flavors (front-end, IR level >> >> and sample) of profiles. >> >> >> >> Thoughts? >> >> >> >> >> >> >> >> Thanks, >> >> >> >> Easwaran (with inputs from Teresa Johnson and David Li) > >