(and sys::cas_flag that STATISTIC uses is a uint32 ...) On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at dberlin.org> wrote:> Okay, dumb question: > Are you really getting negative numbers in the second column? > > 526,766 -136 mem2reg # PHI nodes inserted > > http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8cpp_source.html > (Search for NumPHIInsert). > > I don't see how it could be negative unless this wrapped around? > > > On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at fb.com> wrote: > >> I did gathered aggregate statistics reported by “-stats” over the ~400 >> test files. >> >> The following table summarizes the impact. The first column is the >> >> sum where the new analysis is enabled, the second column is the >> >> delta from baseline where no CFL alias analysis is performed. I am not >> >> experienced enough to know which of these are “good” or “bad” indicators. >> >> —david >> >> >> >> 72,250 685 SLP # vector instructions generated >> >> 1,256,401 566 adce # instructions removed >> >> 67,020,774 13,835,126 basicaa # times a GEP is decomposed >> >> 11,154 26 basicaa # times the limit to decompose GEPs >> is reached >> >> 153,613 324 bdce # instructions removed (unused) >> >> 198,495 2 bdce # instructions trivialized (dead >> bits) >> >> 298,621 0 cfl-od-aa Maximum compressed graph >> >> 58,462,719 0 cfl-od-aa Number Search Steps >> >> 48,401 0 cfl-od-aa # NoAlias results absed on address >> roots >> >> 61,936 0 cfl-od-aa # NoAlias results on compressed >> search path >> >> 3,768,131 0 cfl-od-aa # NoAlias results on fast path >> >> 47,016,909 0 cfl-od-aa # calls to query() >> >> 43,172,261 0 cfl-od-aa # instructions analyzed >> >> 10,515,257 0 cfl-od-aa # times there was no graph node for >> a value >> >> 9,895,755 0 cfl-od-aa Total size of compressed graphs >> (edges) >> >> 2,797 2 correlated-value-propagation # comparisons >> propagated >> >> 66,515 -126 correlated-value-propagation # phis propagated >> >> 912 3 correlated-value-propagation # sdiv converted to >> udiv >> >> 13,527 501 dse # other instrs removed >> >> 40,973 416 dse # stores deleted >> >> 126 -2 early-cse # compare instructions CVP'd >> >> 1,824,703 -138 early-cse # instructions CSE'd >> >> 1,875,417 87 early-cse # instructions simplified or DCE'd >> >> 62,505 1 functionattrs # arguments marked nocapture >> >> 29,979 1 functionattrs # arguments marked readonly >> >> 42,648 37 globaldce # functions removed >> >> 40,498 10 globaldce # global variables removed >> >> 4,368 35 gvn # blocks merged >> >> 21,961 26 gvn # equalities propagated >> >> 29,434 45 gvn # instructions PRE'd >> >> 631,597 3,307 gvn # instructions deleted >> >> 217,618 494 gvn # instructions simplified >> >> 51,089 634 gvn # loads PRE'd >> >> 135,568 1,526 gvn # loads deleted >> >> 2,197 4 indvars # IV comparisons eliminated >> >> 826 8 indvars # congruent IVs eliminated >> >> 2,538 4 indvars # exit values replaced >> >> 1,856 1 indvars # loop exit tests replaced >> >> 5,740,738 8 inline # caller-callers analyzed >> >> 1,629,169 3 inline # functions deleted because all >> callers found >> >> 3,563,497 2 inline # functions inlined >> >> 10,879,125 86 inline-cost # call sites analyzed >> >> 34,766 5 instcombine # constant folds >> >> 3,979,078 2,004 instcombine # dead inst eliminated >> >> 6,323 2 instcombine # dead stores eliminated >> >> 1,522 4 instcombine # factorizations >> >> 254,146 66 instcombine # instructions sunk >> >> 10,427,131 1,749 instcombine # insts combined >> >> 57,943 -205 instcombine # reassociations >> >> 1,072 1 instsimplify # expansions >> >> 135,129 1 instsimplify # reassociations >> >> 121,777 246 instsimplify # redundant instructions removed >> >> 27,612 -12 jump-threading # branch blocks duplicated to >> eliminate phi >> >> 76,000 197 jump-threading # jumps threaded >> >> 4,991 8 jump-threading # terminators folded >> >> 869,838 1,370 lcssa # live out of a loop variables >> >> 345,329 433 licm # instructions hoisted out of loop >> >> 702 -27 licm # instructions sunk out of loop >> >> 19,520 192 licm # load insts hoisted or sunk >> >> 202 37 licm # memory locations promoted to >> registers >> >> 467,244 246 local # unreachable basic blocks removed >> >> 1,586 34 loop-delete # loops deleted >> >> 84 27 loop-idiom # memcpy's formed from loop >> load+stores >> >> 752 7 loop-idiom # memset's formed from loop stores >> >> 63,364 -8 loop-rotate # loops rotated >> >> 4,602 1 loop-simplify # nested loops split out >> >> 1,244,741 472 loop-simplify # pre-header or exit blocks inserted >> >> 2,847 2 loop-unroll # loops completely unrolled >> >> 9,668 -29 loop-unroll # loops unrolled (completely or >> otherwise) >> >> 5,799 -35 loop-unroll # loops unrolled with run-time trip >> counts >> >> 3,863 25 loop-unswitch # branches unswitched >> >> 1,054,060 1,482 loop-unswitch Total number of instructions >> analyzed >> >> 109,279 -3 loop-vectorize # loops analyzed for vectorization >> >> 526,766 -136 mem2reg # PHI nodes inserted >> >> 4,150,078 -3 mem2reg # alloca's promoted with a single >> store >> >> 4,567 6 memcpyopt # memcpy instructions deleted >> >> 96 1 memcpyopt # memcpys converted to memset >> >> 1,074 173 memcpyopt # memmoves converted to memcpy >> >> 39,584 6 memcpyopt # memsets inferred >> >> 179,629 2,475 memdep # block queries that were completely >> cached >> >> 1,020 -3 memdep # cached, but dirty, non-local ptr >> responses >> >> 9,108,504 146,792 memdep # fully cached non-local ptr >> responses >> >> 11,678,674 92,225 memdep # uncached non-local ptr responses >> >> 399,802 1,832 memory-builtins # arguments with unsolved size >> and offset >> >> 10,844 -24,169 memory-builtins # load instructions with unsolved >> size and offset >> >> 188,181 54 reassociate # insts reassociated >> >> 87,009 -82 scalar-evolution # loops with predictable loop >> counts >> >> 402,724 71 scalar-evolution # loops without predictable loop >> counts >> >> 133,310 72 sccp # basic blocks unreachable >> >> 275,949 263 sccp # instructions removed >> >> 2,056,414 723 simplifycfg # blocks simplified >> >> 5,292 -36 simplifycfg # common instructions sunk down to >> the end block >> >> 15,110 1 simplifycfg # speculative executed instructions >> >> 43,068 -2 sroa Maximum number of uses of a partition >> >> 11,754,901 -180 sroa # alloca partition uses rewritten >> >> 4,623,115 -11 sroa # alloca partitions formed >> >> 5,927,727 -11 sroa # allocas analyzed for replacement >> >> 4,576,406 -5 sroa # allocas promoted to SSA values >> >> 13,770,636 -227 sroa # instructions deleted >> >> 3,797 -1 strip-dead-prototypes # dead prototypes removed >> >> >> >> >> >> >> >> From: Daniel Berlin <dberlin at dberlin.org> >> Date: Thursday, August 25, 2016 at 9:06 AM >> To: David Callahan <dcallahan at fb.com> >> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev Mailing >> list <llvm-dev at lists.llvm.org> >> Subject: Re: [llvm-dev] CFLAA >> >> Hey David, >> I'll take a look at the patch :) >> Sounds like fun work. >> >> As George says, improving AA significantly will almost always cause >> significant performance regressions at first, in almost any compiler. >> >> Compilers knobs, passes, usually get tuned for x amount of freedom, and >> if you give them 10x, they start moving things too far, vectorizing too >> much, spilling, etc. >> >> This was definitely the case for GCC, where adding a precise >> interprocedural field-sensitive analysis initially regressed performance by >> a few percent on average. >> >> I know it was also the case for XLC at IBM, etc. >> >> Like anything else, just gotta figure out what passes are going nuts, and >> rework them to have better heuristics/etc. >> The end result is performance improvements, but the path takes a bit of >> time. >> >> If you need a way to see whether your analysis has actually done an okay >> job in the meantime, usually a good way to see if you are doing well or not >> is to see how many loads/stores get eliminated or moved by various passes >> before and after. >> >> If the number is significantly higher, great. >> If the number is significantly lower, something has likely gone wrong :) >> >> >> On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev < >> llvm-dev at lists.llvm.org> wrote: >> >>> (Adding “LLVM Dev”) >>> >>> My variant is up as https://reviews.llvm.org/D23876 >>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=> >>> —david >>> >>> >>> From: George Burgess IV <george.burgess.iv at gmail.com> >>> Date: Wednesday, August 24, 2016 at 3:17 PM >>> To: David Callahan <dcallahan at fb.com> >>> Subject: Re: CFLAA >>> >>> Hi! >>> >>> > I see there is on going work with alias analysis and it appears the >>> prior CFLAA has been abandoned. >>> >>> There was quite a bit of refactoring done, yeah. The original CFLAA is >>> now called CFLSteens, and graph construction was moved to its own bit. We >>> also have CFLAnders, which is based more heavily on the paper by Zheng and >>> Rugina (e.g. no stratifiedsets magic). >>> >>> > I have a variant of it where I reworked how compression was done to be >>> less conservative, reworked the interprocedural to do simulated but bounded >>> inlining, and added code to do on-demand testing of CFL paths on both >>> compressed and full graphs. >>> >>> Awesome! >>> >>> > Happy to share the patch with you if you are interested as well as >>> some data collected >>> >>> Yes, please. Would you mind if I CC'ed llvm-dev on this thread (and a >>> few people specifically, who also might find this interesting)? >>> >>> > However I was not able to see any performance improvements in the >>> code. In fact on a various benchmarks there were noticeable regressions in >>> measured performance of the generated code. Have you noticed any similar >>> problems? >>> >>> I know that a number of people people in the community expressed >>> concerns about how other passes will perform with better AA results (e.g. >>> If LICM becomes more aggressive, register pressure may increase, which may >>> cause us to spill when we haven't before, etc). So, such a problem isn't >>> unthinkable. :) >>> >>> On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan at fb.com> >>> wrote: >>> >>>> Hi Greg, >>>> >>>> >>>> >>>> I see there is on going work with alias analysis and it appears the >>>> prior CFLAA has been abandoned. >>>> >>>> >>>> >>>> I have a variant of it where I reworked how compression was done to be >>>> less conservative, reworked the interprocedural to do simulated but bounded >>>> inlining, and added code to do on-demand testing of CFL paths on both >>>> compressed and full graphs. >>>> >>>> >>>> >>>> I reached a point where the ahead-of-time compression was linear but >>>> still very accurate compared to on-demand path search and there were >>>> noticeable improvements in the alias analysis results and impacted >>>> transformations. Happy to share the patch with you if you are interested >>>> as well as some data collected. >>>> >>>> >>>> >>>> However I was not able to see any performance improvements in the code. >>>> In fact on a various benchmarks there were noticeable regressions in >>>> measured performance of the generated code. Have you noticed any similar >>>> problems? >>>> >>>> >>>> >>>> --david >>>> >>> >>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=> >>> >>> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/25e19284/attachment-0001.html>
The second column is the difference between the new analysis and a baseline without the analysis. From: Daniel Berlin <dberlin at dberlin.org<mailto:dberlin at dberlin.org>> Date: Thursday, August 25, 2016 at 9:55 AM To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>> Cc: George Burgess IV <george.burgess.iv at gmail.com<mailto:george.burgess.iv at gmail.com>>, LLVM Dev Mailing list <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> Subject: Re: [llvm-dev] CFLAA (and sys::cas_flag that STATISTIC uses is a uint32 ...) On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at dberlin.org<mailto:dberlin at dberlin.org>> wrote: Okay, dumb question: Are you really getting negative numbers in the second column? 526,766 -136 mem2reg # PHI nodes inserted http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8cpp_source.html<https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_docs_doxygen_html_PromoteMemoryToRegister-5F8cpp-5Fsource.html&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=dU2DwQMtmgh6cAhA6xnKC8SHBBR9WVaODVF8fDARIjk&s=2ZH2FuAaE0z-bdSkGBtIDBeN1Qw-28jivYLXVy158sU&e=> (Search for NumPHIInsert). I don't see how it could be negative unless this wrapped around? On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>> wrote: I did gathered aggregate statistics reported by “-stats” over the ~400 test files. The following table summarizes the impact. The first column is the sum where the new analysis is enabled, the second column is the delta from baseline where no CFL alias analysis is performed. I am not experienced enough to know which of these are “good” or “bad” indicators. —david 72,250 685 SLP # vector instructions generated 1,256,401 566 adce # instructions removed 67,020,774 13,835,126 basicaa # times a GEP is decomposed 11,154 26 basicaa # times the limit to decompose GEPs is reached 153,613 324 bdce # instructions removed (unused) 198,495 2 bdce # instructions trivialized (dead bits) 298,621 0 cfl-od-aa Maximum compressed graph 58,462,719 0 cfl-od-aa Number Search Steps 48,401 0 cfl-od-aa # NoAlias results absed on address roots 61,936 0 cfl-od-aa # NoAlias results on compressed search path 3,768,131 0 cfl-od-aa # NoAlias results on fast path 47,016,909 0 cfl-od-aa # calls to query() 43,172,261 0 cfl-od-aa # instructions analyzed 10,515,257 0 cfl-od-aa # times there was no graph node for a value 9,895,755 0 cfl-od-aa Total size of compressed graphs (edges) 2,797 2 correlated-value-propagation # comparisons propagated 66,515 -126 correlated-value-propagation # phis propagated 912 3 correlated-value-propagation # sdiv converted to udiv 13,527 501 dse # other instrs removed 40,973 416 dse # stores deleted 126 -2 early-cse # compare instructions CVP'd 1,824,703 -138 early-cse # instructions CSE'd 1,875,417 87 early-cse # instructions simplified or DCE'd 62,505 1 functionattrs # arguments marked nocapture 29,979 1 functionattrs # arguments marked readonly 42,648 37 globaldce # functions removed 40,498 10 globaldce # global variables removed 4,368 35 gvn # blocks merged 21,961 26 gvn # equalities propagated 29,434 45 gvn # instructions PRE'd 631,597 3,307 gvn # instructions deleted 217,618 494 gvn # instructions simplified 51,089 634 gvn # loads PRE'd 135,568 1,526 gvn # loads deleted 2,197 4 indvars # IV comparisons eliminated 826 8 indvars # congruent IVs eliminated 2,538 4 indvars # exit values replaced 1,856 1 indvars # loop exit tests replaced 5,740,738 8 inline # caller-callers analyzed 1,629,169 3 inline # functions deleted because all callers found 3,563,497 2 inline # functions inlined 10,879,125 86 inline-cost # call sites analyzed 34,766 5 instcombine # constant folds 3,979,078 2,004 instcombine # dead inst eliminated 6,323 2 instcombine # dead stores eliminated 1,522 4 instcombine # factorizations 254,146 66 instcombine # instructions sunk 10,427,131 1,749 instcombine # insts combined 57,943 -205 instcombine # reassociations 1,072 1 instsimplify # expansions 135,129 1 instsimplify # reassociations 121,777 246 instsimplify # redundant instructions removed 27,612 -12 jump-threading # branch blocks duplicated to eliminate phi 76,000 197 jump-threading # jumps threaded 4,991 8 jump-threading # terminators folded 869,838 1,370 lcssa # live out of a loop variables 345,329 433 licm # instructions hoisted out of loop 702 -27 licm # instructions sunk out of loop 19,520 192 licm # load insts hoisted or sunk 202 37 licm # memory locations promoted to registers 467,244 246 local # unreachable basic blocks removed 1,586 34 loop-delete # loops deleted 84 27 loop-idiom # memcpy's formed from loop load+stores 752 7 loop-idiom # memset's formed from loop stores 63,364 -8 loop-rotate # loops rotated 4,602 1 loop-simplify # nested loops split out 1,244,741 472 loop-simplify # pre-header or exit blocks inserted 2,847 2 loop-unroll # loops completely unrolled 9,668 -29 loop-unroll # loops unrolled (completely or otherwise) 5,799 -35 loop-unroll # loops unrolled with run-time trip counts 3,863 25 loop-unswitch # branches unswitched 1,054,060 1,482 loop-unswitch Total number of instructions analyzed 109,279 -3 loop-vectorize # loops analyzed for vectorization 526,766 -136 mem2reg # PHI nodes inserted 4,150,078 -3 mem2reg # alloca's promoted with a single store 4,567 6 memcpyopt # memcpy instructions deleted 96 1 memcpyopt # memcpys converted to memset 1,074 173 memcpyopt # memmoves converted to memcpy 39,584 6 memcpyopt # memsets inferred 179,629 2,475 memdep # block queries that were completely cached 1,020 -3 memdep # cached, but dirty, non-local ptr responses 9,108,504 146,792 memdep # fully cached non-local ptr responses 11,678,674 92,225 memdep # uncached non-local ptr responses 399,802 1,832 memory-builtins # arguments with unsolved size and offset 10,844 -24,169 memory-builtins # load instructions with unsolved size and offset 188,181 54 reassociate # insts reassociated 87,009 -82 scalar-evolution # loops with predictable loop counts 402,724 71 scalar-evolution # loops without predictable loop counts 133,310 72 sccp # basic blocks unreachable 275,949 263 sccp # instructions removed 2,056,414 723 simplifycfg # blocks simplified 5,292 -36 simplifycfg # common instructions sunk down to the end block 15,110 1 simplifycfg # speculative executed instructions 43,068 -2 sroa Maximum number of uses of a partition 11,754,901 -180 sroa # alloca partition uses rewritten 4,623,115 -11 sroa # alloca partitions formed 5,927,727 -11 sroa # allocas analyzed for replacement 4,576,406 -5 sroa # allocas promoted to SSA values 13,770,636 -227 sroa # instructions deleted 3,797 -1 strip-dead-prototypes # dead prototypes removed From: Daniel Berlin <dberlin at dberlin.org<mailto:dberlin at dberlin.org>> Date: Thursday, August 25, 2016 at 9:06 AM To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>> Cc: George Burgess IV <george.burgess.iv at gmail.com<mailto:george.burgess.iv at gmail.com>>, LLVM Dev Mailing list <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> Subject: Re: [llvm-dev] CFLAA Hey David, I'll take a look at the patch :) Sounds like fun work. As George says, improving AA significantly will almost always cause significant performance regressions at first, in almost any compiler. Compilers knobs, passes, usually get tuned for x amount of freedom, and if you give them 10x, they start moving things too far, vectorizing too much, spilling, etc. This was definitely the case for GCC, where adding a precise interprocedural field-sensitive analysis initially regressed performance by a few percent on average. I know it was also the case for XLC at IBM, etc. Like anything else, just gotta figure out what passes are going nuts, and rework them to have better heuristics/etc. The end result is performance improvements, but the path takes a bit of time. If you need a way to see whether your analysis has actually done an okay job in the meantime, usually a good way to see if you are doing well or not is to see how many loads/stores get eliminated or moved by various passes before and after. If the number is significantly higher, great. If the number is significantly lower, something has likely gone wrong :) On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: (Adding “LLVM Dev”) My variant is up as https://reviews.llvm.org/D23876<https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=> —david From: George Burgess IV <george.burgess.iv at gmail.com<mailto:george.burgess.iv at gmail.com>> Date: Wednesday, August 24, 2016 at 3:17 PM To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>> Subject: Re: CFLAA Hi!> I see there is on going work with alias analysis and it appears the prior CFLAA has been abandoned.There was quite a bit of refactoring done, yeah. The original CFLAA is now called CFLSteens, and graph construction was moved to its own bit. We also have CFLAnders, which is based more heavily on the paper by Zheng and Rugina (e.g. no stratifiedsets magic).> I have a variant of it where I reworked how compression was done to be less conservative, reworked the interprocedural to do simulated but bounded inlining, and added code to do on-demand testing of CFL paths on both compressed and full graphs.Awesome!> Happy to share the patch with you if you are interested as well as some data collectedYes, please. Would you mind if I CC'ed llvm-dev on this thread (and a few people specifically, who also might find this interesting)?> However I was not able to see any performance improvements in the code. In fact on a various benchmarks there were noticeable regressions in measured performance of the generated code. Have you noticed any similar problems?I know that a number of people people in the community expressed concerns about how other passes will perform with better AA results (e.g. If LICM becomes more aggressive, register pressure may increase, which may cause us to spill when we haven't before, etc). So, such a problem isn't unthinkable. :) On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>> wrote: Hi Greg, I see there is on going work with alias analysis and it appears the prior CFLAA has been abandoned. I have a variant of it where I reworked how compression was done to be less conservative, reworked the interprocedural to do simulated but bounded inlining, and added code to do on-demand testing of CFL paths on both compressed and full graphs. I reached a point where the ahead-of-time compression was linear but still very accurate compared to on-demand path search and there were noticeable improvements in the alias analysis results and impacted transformations. Happy to share the patch with you if you are interested as well as some data collected. However I was not able to see any performance improvements in the code. In fact on a various benchmarks there were noticeable regressions in measured performance of the generated code. Have you noticed any similar problems? --david _______________________________________________ LLVM Developers mailing list llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=> -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/2a010a23/attachment.html>
Here is a summary of the experiment behind this patch https://www.facebook.com/notes/david-callahan/llvm-cfl-alias-analysis/10150648030284971 From: Daniel Berlin <dberlin at dberlin.org<mailto:dberlin at dberlin.org>> Date: Thursday, August 25, 2016 at 9:55 AM To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>> Cc: George Burgess IV <george.burgess.iv at gmail.com<mailto:george.burgess.iv at gmail.com>>, LLVM Dev Mailing list <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> Subject: Re: [llvm-dev] CFLAA (and sys::cas_flag that STATISTIC uses is a uint32 ...) On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at dberlin.org<mailto:dberlin at dberlin.org>> wrote: Okay, dumb question: Are you really getting negative numbers in the second column? 526,766 -136 mem2reg # PHI nodes inserted http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8cpp_source.html<https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_docs_doxygen_html_PromoteMemoryToRegister-5F8cpp-5Fsource.html&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=dU2DwQMtmgh6cAhA6xnKC8SHBBR9WVaODVF8fDARIjk&s=2ZH2FuAaE0z-bdSkGBtIDBeN1Qw-28jivYLXVy158sU&e=> (Search for NumPHIInsert). I don't see how it could be negative unless this wrapped around? On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>> wrote: I did gathered aggregate statistics reported by “-stats” over the ~400 test files. The following table summarizes the impact. The first column is the sum where the new analysis is enabled, the second column is the delta from baseline where no CFL alias analysis is performed. I am not experienced enough to know which of these are “good” or “bad” indicators. —david 72,250 685 SLP # vector instructions generated 1,256,401 566 adce # instructions removed 67,020,774 13,835,126 basicaa # times a GEP is decomposed 11,154 26 basicaa # times the limit to decompose GEPs is reached 153,613 324 bdce # instructions removed (unused) 198,495 2 bdce # instructions trivialized (dead bits) 298,621 0 cfl-od-aa Maximum compressed graph 58,462,719 0 cfl-od-aa Number Search Steps 48,401 0 cfl-od-aa # NoAlias results absed on address roots 61,936 0 cfl-od-aa # NoAlias results on compressed search path 3,768,131 0 cfl-od-aa # NoAlias results on fast path 47,016,909 0 cfl-od-aa # calls to query() 43,172,261 0 cfl-od-aa # instructions analyzed 10,515,257 0 cfl-od-aa # times there was no graph node for a value 9,895,755 0 cfl-od-aa Total size of compressed graphs (edges) 2,797 2 correlated-value-propagation # comparisons propagated 66,515 -126 correlated-value-propagation # phis propagated 912 3 correlated-value-propagation # sdiv converted to udiv 13,527 501 dse # other instrs removed 40,973 416 dse # stores deleted 126 -2 early-cse # compare instructions CVP'd 1,824,703 -138 early-cse # instructions CSE'd 1,875,417 87 early-cse # instructions simplified or DCE'd 62,505 1 functionattrs # arguments marked nocapture 29,979 1 functionattrs # arguments marked readonly 42,648 37 globaldce # functions removed 40,498 10 globaldce # global variables removed 4,368 35 gvn # blocks merged 21,961 26 gvn # equalities propagated 29,434 45 gvn # instructions PRE'd 631,597 3,307 gvn # instructions deleted 217,618 494 gvn # instructions simplified 51,089 634 gvn # loads PRE'd 135,568 1,526 gvn # loads deleted 2,197 4 indvars # IV comparisons eliminated 826 8 indvars # congruent IVs eliminated 2,538 4 indvars # exit values replaced 1,856 1 indvars # loop exit tests replaced 5,740,738 8 inline # caller-callers analyzed 1,629,169 3 inline # functions deleted because all callers found 3,563,497 2 inline # functions inlined 10,879,125 86 inline-cost # call sites analyzed 34,766 5 instcombine # constant folds 3,979,078 2,004 instcombine # dead inst eliminated 6,323 2 instcombine # dead stores eliminated 1,522 4 instcombine # factorizations 254,146 66 instcombine # instructions sunk 10,427,131 1,749 instcombine # insts combined 57,943 -205 instcombine # reassociations 1,072 1 instsimplify # expansions 135,129 1 instsimplify # reassociations 121,777 246 instsimplify # redundant instructions removed 27,612 -12 jump-threading # branch blocks duplicated to eliminate phi 76,000 197 jump-threading # jumps threaded 4,991 8 jump-threading # terminators folded 869,838 1,370 lcssa # live out of a loop variables 345,329 433 licm # instructions hoisted out of loop 702 -27 licm # instructions sunk out of loop 19,520 192 licm # load insts hoisted or sunk 202 37 licm # memory locations promoted to registers 467,244 246 local # unreachable basic blocks removed 1,586 34 loop-delete # loops deleted 84 27 loop-idiom # memcpy's formed from loop load+stores 752 7 loop-idiom # memset's formed from loop stores 63,364 -8 loop-rotate # loops rotated 4,602 1 loop-simplify # nested loops split out 1,244,741 472 loop-simplify # pre-header or exit blocks inserted 2,847 2 loop-unroll # loops completely unrolled 9,668 -29 loop-unroll # loops unrolled (completely or otherwise) 5,799 -35 loop-unroll # loops unrolled with run-time trip counts 3,863 25 loop-unswitch # branches unswitched 1,054,060 1,482 loop-unswitch Total number of instructions analyzed 109,279 -3 loop-vectorize # loops analyzed for vectorization 526,766 -136 mem2reg # PHI nodes inserted 4,150,078 -3 mem2reg # alloca's promoted with a single store 4,567 6 memcpyopt # memcpy instructions deleted 96 1 memcpyopt # memcpys converted to memset 1,074 173 memcpyopt # memmoves converted to memcpy 39,584 6 memcpyopt # memsets inferred 179,629 2,475 memdep # block queries that were completely cached 1,020 -3 memdep # cached, but dirty, non-local ptr responses 9,108,504 146,792 memdep # fully cached non-local ptr responses 11,678,674 92,225 memdep # uncached non-local ptr responses 399,802 1,832 memory-builtins # arguments with unsolved size and offset 10,844 -24,169 memory-builtins # load instructions with unsolved size and offset 188,181 54 reassociate # insts reassociated 87,009 -82 scalar-evolution # loops with predictable loop counts 402,724 71 scalar-evolution # loops without predictable loop counts 133,310 72 sccp # basic blocks unreachable 275,949 263 sccp # instructions removed 2,056,414 723 simplifycfg # blocks simplified 5,292 -36 simplifycfg # common instructions sunk down to the end block 15,110 1 simplifycfg # speculative executed instructions 43,068 -2 sroa Maximum number of uses of a partition 11,754,901 -180 sroa # alloca partition uses rewritten 4,623,115 -11 sroa # alloca partitions formed 5,927,727 -11 sroa # allocas analyzed for replacement 4,576,406 -5 sroa # allocas promoted to SSA values 13,770,636 -227 sroa # instructions deleted 3,797 -1 strip-dead-prototypes # dead prototypes removed From: Daniel Berlin <dberlin at dberlin.org<mailto:dberlin at dberlin.org>> Date: Thursday, August 25, 2016 at 9:06 AM To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>> Cc: George Burgess IV <george.burgess.iv at gmail.com<mailto:george.burgess.iv at gmail.com>>, LLVM Dev Mailing list <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> Subject: Re: [llvm-dev] CFLAA Hey David, I'll take a look at the patch :) Sounds like fun work. As George says, improving AA significantly will almost always cause significant performance regressions at first, in almost any compiler. Compilers knobs, passes, usually get tuned for x amount of freedom, and if you give them 10x, they start moving things too far, vectorizing too much, spilling, etc. This was definitely the case for GCC, where adding a precise interprocedural field-sensitive analysis initially regressed performance by a few percent on average. I know it was also the case for XLC at IBM, etc. Like anything else, just gotta figure out what passes are going nuts, and rework them to have better heuristics/etc. The end result is performance improvements, but the path takes a bit of time. If you need a way to see whether your analysis has actually done an okay job in the meantime, usually a good way to see if you are doing well or not is to see how many loads/stores get eliminated or moved by various passes before and after. If the number is significantly higher, great. If the number is significantly lower, something has likely gone wrong :) On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: (Adding “LLVM Dev”) My variant is up as https://reviews.llvm.org/D23876<https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=> —david From: George Burgess IV <george.burgess.iv at gmail.com<mailto:george.burgess.iv at gmail.com>> Date: Wednesday, August 24, 2016 at 3:17 PM To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>> Subject: Re: CFLAA Hi!> I see there is on going work with alias analysis and it appears the prior CFLAA has been abandoned.There was quite a bit of refactoring done, yeah. The original CFLAA is now called CFLSteens, and graph construction was moved to its own bit. We also have CFLAnders, which is based more heavily on the paper by Zheng and Rugina (e.g. no stratifiedsets magic).> I have a variant of it where I reworked how compression was done to be less conservative, reworked the interprocedural to do simulated but bounded inlining, and added code to do on-demand testing of CFL paths on both compressed and full graphs.Awesome!> Happy to share the patch with you if you are interested as well as some data collectedYes, please. Would you mind if I CC'ed llvm-dev on this thread (and a few people specifically, who also might find this interesting)?> However I was not able to see any performance improvements in the code. In fact on a various benchmarks there were noticeable regressions in measured performance of the generated code. Have you noticed any similar problems?I know that a number of people people in the community expressed concerns about how other passes will perform with better AA results (e.g. If LICM becomes more aggressive, register pressure may increase, which may cause us to spill when we haven't before, etc). So, such a problem isn't unthinkable. :) On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>> wrote: Hi Greg, I see there is on going work with alias analysis and it appears the prior CFLAA has been abandoned. I have a variant of it where I reworked how compression was done to be less conservative, reworked the interprocedural to do simulated but bounded inlining, and added code to do on-demand testing of CFL paths on both compressed and full graphs. I reached a point where the ahead-of-time compression was linear but still very accurate compared to on-demand path search and there were noticeable improvements in the alias analysis results and impacted transformations. Happy to share the patch with you if you are interested as well as some data collected. However I was not able to see any performance improvements in the code. In fact on a various benchmarks there were noticeable regressions in measured performance of the generated code. Have you noticed any similar problems? --david _______________________________________________ LLVM Developers mailing list llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=> -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/bb637835/attachment-0001.html>
One of the things throwing me off is the cfl numbers being 0 :) Outside of one number, they could be deltas in either direction. IE baseline is column A - column B, or baseline is column B - column A However, given: 10,844 -24,169 I presume this means the baseline is 34k? :) If so, it looks like you are promoting, hoisting, deleting, vectorizing etc a bunch more instructions. About 1-5, most things being 1%, a few being 5% SROA seems to do very slightly worse, but it's not clear if that's because the instructions got deleted before it did anything. On Thu, Aug 25, 2016 at 9:56 AM, David Callahan <dcallahan at fb.com> wrote:> The second column is the difference between the new analysis and a > baseline without the analysis. > > From: Daniel Berlin <dberlin at dberlin.org> > Date: Thursday, August 25, 2016 at 9:55 AM > > To: David Callahan <dcallahan at fb.com> > Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev Mailing > list <llvm-dev at lists.llvm.org> > Subject: Re: [llvm-dev] CFLAA > > (and sys::cas_flag that STATISTIC uses is a uint32 ...) > > On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at dberlin.org> > wrote: > >> Okay, dumb question: >> Are you really getting negative numbers in the second column? >> >> 526,766 -136 mem2reg # PHI nodes inserted >> >> http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8c >> pp_source.html >> <https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_docs_doxygen_html_PromoteMemoryToRegister-5F8cpp-5Fsource.html&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=dU2DwQMtmgh6cAhA6xnKC8SHBBR9WVaODVF8fDARIjk&s=2ZH2FuAaE0z-bdSkGBtIDBeN1Qw-28jivYLXVy158sU&e=> >> (Search for NumPHIInsert). >> >> I don't see how it could be negative unless this wrapped around? >> >> >> On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at fb.com> wrote: >> >>> I did gathered aggregate statistics reported by “-stats” over the ~400 >>> test files. >>> >>> The following table summarizes the impact. The first column is the >>> >>> sum where the new analysis is enabled, the second column is the >>> >>> delta from baseline where no CFL alias analysis is performed. I am not >>> >>> experienced enough to know which of these are “good” or “bad” indicators >>> . >>> >>> —david >>> >>> >>> >>> 72,250 685 SLP # vector instructions generated >>> >>> 1,256,401 566 adce # instructions removed >>> >>> 67,020,774 13,835,126 basicaa # times a GEP is decomposed >>> >>> 11,154 26 basicaa # times the limit to decompose GEPs >>> is reached >>> >>> 153,613 324 bdce # instructions removed (unused) >>> >>> 198,495 2 bdce # instructions trivialized (dead >>> bits) >>> >>> 298,621 0 cfl-od-aa Maximum compressed graph >>> >>> 58,462,719 0 cfl-od-aa Number Search Steps >>> >>> 48,401 0 cfl-od-aa # NoAlias results absed on address >>> roots >>> >>> 61,936 0 cfl-od-aa # NoAlias results on compressed >>> search path >>> >>> 3,768,131 0 cfl-od-aa # NoAlias results on fast path >>> >>> 47,016,909 0 cfl-od-aa # calls to query() >>> >>> 43,172,261 0 cfl-od-aa # instructions analyzed >>> >>> 10,515,257 0 cfl-od-aa # times there was no graph node for >>> a value >>> >>> 9,895,755 0 cfl-od-aa Total size of compressed graphs >>> (edges) >>> >>> 2,797 2 correlated-value-propagation # comparisons >>> propagated >>> >>> 66,515 -126 correlated-value-propagation # phis propagated >>> >>> 912 3 correlated-value-propagation # sdiv converted to >>> udiv >>> >>> 13,527 501 dse # other instrs removed >>> >>> 40,973 416 dse # stores deleted >>> >>> 126 -2 early-cse # compare instructions CVP'd >>> >>> 1,824,703 -138 early-cse # instructions CSE'd >>> >>> 1,875,417 87 early-cse # instructions simplified or DCE'd >>> >>> 62,505 1 functionattrs # arguments marked nocapture >>> >>> 29,979 1 functionattrs # arguments marked readonly >>> >>> 42,648 37 globaldce # functions removed >>> >>> 40,498 10 globaldce # global variables removed >>> >>> 4,368 35 gvn # blocks merged >>> >>> 21,961 26 gvn # equalities propagated >>> >>> 29,434 45 gvn # instructions PRE'd >>> >>> 631,597 3,307 gvn # instructions deleted >>> >>> 217,618 494 gvn # instructions simplified >>> >>> 51,089 634 gvn # loads PRE'd >>> >>> 135,568 1,526 gvn # loads deleted >>> >>> 2,197 4 indvars # IV comparisons eliminated >>> >>> 826 8 indvars # congruent IVs eliminated >>> >>> 2,538 4 indvars # exit values replaced >>> >>> 1,856 1 indvars # loop exit tests replaced >>> >>> 5,740,738 8 inline # caller-callers analyzed >>> >>> 1,629,169 3 inline # functions deleted because all >>> callers found >>> >>> 3,563,497 2 inline # functions inlined >>> >>> 10,879,125 86 inline-cost # call sites analyzed >>> >>> 34,766 5 instcombine # constant folds >>> >>> 3,979,078 2,004 instcombine # dead inst eliminated >>> >>> 6,323 2 instcombine # dead stores eliminated >>> >>> 1,522 4 instcombine # factorizations >>> >>> 254,146 66 instcombine # instructions sunk >>> >>> 10,427,131 1,749 instcombine # insts combined >>> >>> 57,943 -205 instcombine # reassociations >>> >>> 1,072 1 instsimplify # expansions >>> >>> 135,129 1 instsimplify # reassociations >>> >>> 121,777 246 instsimplify # redundant instructions removed >>> >>> 27,612 -12 jump-threading # branch blocks duplicated to >>> eliminate phi >>> >>> 76,000 197 jump-threading # jumps threaded >>> >>> 4,991 8 jump-threading # terminators folded >>> >>> 869,838 1,370 lcssa # live out of a loop variables >>> >>> 345,329 433 licm # instructions hoisted out of loop >>> >>> 702 -27 licm # instructions sunk out of loop >>> >>> 19,520 192 licm # load insts hoisted or sunk >>> >>> 202 37 licm # memory locations promoted to >>> registers >>> >>> 467,244 246 local # unreachable basic blocks removed >>> >>> 1,586 34 loop-delete # loops deleted >>> >>> 84 27 loop-idiom # memcpy's formed from loop >>> load+stores >>> >>> 752 7 loop-idiom # memset's formed from loop stores >>> >>> 63,364 -8 loop-rotate # loops rotated >>> >>> 4,602 1 loop-simplify # nested loops split out >>> >>> 1,244,741 472 loop-simplify # pre-header or exit blocks >>> inserted >>> >>> 2,847 2 loop-unroll # loops completely unrolled >>> >>> 9,668 -29 loop-unroll # loops unrolled (completely or >>> otherwise) >>> >>> 5,799 -35 loop-unroll # loops unrolled with run-time trip >>> counts >>> >>> 3,863 25 loop-unswitch # branches unswitched >>> >>> 1,054,060 1,482 loop-unswitch Total number of instructions >>> analyzed >>> >>> 109,279 -3 loop-vectorize # loops analyzed for vectorization >>> >>> 526,766 -136 mem2reg # PHI nodes inserted >>> >>> 4,150,078 -3 mem2reg # alloca's promoted with a single >>> store >>> >>> 4,567 6 memcpyopt # memcpy instructions deleted >>> >>> 96 1 memcpyopt # memcpys converted to memset >>> >>> 1,074 173 memcpyopt # memmoves converted to memcpy >>> >>> 39,584 6 memcpyopt # memsets inferred >>> >>> 179,629 2,475 memdep # block queries that were >>> completely cached >>> >>> 1,020 -3 memdep # cached, but dirty, non-local ptr >>> responses >>> >>> 9,108,504 146,792 memdep # fully cached non-local ptr >>> responses >>> >>> 11,678,674 92,225 memdep # uncached non-local ptr responses >>> >>> 399,802 1,832 memory-builtins # arguments with unsolved size >>> and offset >>> >>> 10,844 -24,169 memory-builtins # load instructions with >>> unsolved size and offset >>> >>> 188,181 54 reassociate # insts reassociated >>> >>> 87,009 -82 scalar-evolution # loops with predictable loop >>> counts >>> >>> 402,724 71 scalar-evolution # loops without predictable >>> loop counts >>> >>> 133,310 72 sccp # basic blocks unreachable >>> >>> 275,949 263 sccp # instructions removed >>> >>> 2,056,414 723 simplifycfg # blocks simplified >>> >>> 5,292 -36 simplifycfg # common instructions sunk down to >>> the end block >>> >>> 15,110 1 simplifycfg # speculative executed instructions >>> >>> 43,068 -2 sroa Maximum number of uses of a >>> partition >>> >>> 11,754,901 -180 sroa # alloca partition uses rewritten >>> >>> 4,623,115 -11 sroa # alloca partitions formed >>> >>> 5,927,727 -11 sroa # allocas analyzed for replacement >>> >>> 4,576,406 -5 sroa # allocas promoted to SSA values >>> >>> 13,770,636 -227 sroa # instructions deleted >>> >>> 3,797 -1 strip-dead-prototypes # dead prototypes removed >>> >>> >>> >>> >>> >>> >>> >>> From: Daniel Berlin <dberlin at dberlin.org> >>> Date: Thursday, August 25, 2016 at 9:06 AM >>> To: David Callahan <dcallahan at fb.com> >>> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev Mailing >>> list <llvm-dev at lists.llvm.org> >>> Subject: Re: [llvm-dev] CFLAA >>> >>> Hey David, >>> I'll take a look at the patch :) >>> Sounds like fun work. >>> >>> As George says, improving AA significantly will almost always cause >>> significant performance regressions at first, in almost any compiler. >>> >>> Compilers knobs, passes, usually get tuned for x amount of freedom, and >>> if you give them 10x, they start moving things too far, vectorizing too >>> much, spilling, etc. >>> >>> This was definitely the case for GCC, where adding a precise >>> interprocedural field-sensitive analysis initially regressed performance by >>> a few percent on average. >>> >>> I know it was also the case for XLC at IBM, etc. >>> >>> Like anything else, just gotta figure out what passes are going nuts, >>> and rework them to have better heuristics/etc. >>> The end result is performance improvements, but the path takes a bit of >>> time. >>> >>> If you need a way to see whether your analysis has actually done an okay >>> job in the meantime, usually a good way to see if you are doing well or not >>> is to see how many loads/stores get eliminated or moved by various passes >>> before and after. >>> >>> If the number is significantly higher, great. >>> If the number is significantly lower, something has likely gone wrong :) >>> >>> >>> On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev < >>> llvm-dev at lists.llvm.org> wrote: >>> >>>> (Adding “LLVM Dev”) >>>> >>>> My variant is up as https://reviews.llvm.org/D23876 >>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=> >>>> —david >>>> >>>> >>>> From: George Burgess IV <george.burgess.iv at gmail.com> >>>> Date: Wednesday, August 24, 2016 at 3:17 PM >>>> To: David Callahan <dcallahan at fb.com> >>>> Subject: Re: CFLAA >>>> >>>> Hi! >>>> >>>> > I see there is on going work with alias analysis and it appears the >>>> prior CFLAA has been abandoned. >>>> >>>> There was quite a bit of refactoring done, yeah. The original CFLAA is >>>> now called CFLSteens, and graph construction was moved to its own bit. We >>>> also have CFLAnders, which is based more heavily on the paper by Zheng and >>>> Rugina (e.g. no stratifiedsets magic). >>>> >>>> > I have a variant of it where I reworked how compression was done to >>>> be less conservative, reworked the interprocedural to do simulated but >>>> bounded inlining, and added code to do on-demand testing of CFL paths on >>>> both compressed and full graphs. >>>> >>>> Awesome! >>>> >>>> > Happy to share the patch with you if you are interested as well as >>>> some data collected >>>> >>>> Yes, please. Would you mind if I CC'ed llvm-dev on this thread (and a >>>> few people specifically, who also might find this interesting)? >>>> >>>> > However I was not able to see any performance improvements in the >>>> code. In fact on a various benchmarks there were noticeable regressions in >>>> measured performance of the generated code. Have you noticed any similar >>>> problems? >>>> >>>> I know that a number of people people in the community expressed >>>> concerns about how other passes will perform with better AA results (e.g. >>>> If LICM becomes more aggressive, register pressure may increase, which may >>>> cause us to spill when we haven't before, etc). So, such a problem isn't >>>> unthinkable. :) >>>> >>>> On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan at fb.com> >>>> wrote: >>>> >>>>> Hi Greg, >>>>> >>>>> >>>>> >>>>> I see there is on going work with alias analysis and it appears the >>>>> prior CFLAA has been abandoned. >>>>> >>>>> >>>>> >>>>> I have a variant of it where I reworked how compression was done to be >>>>> less conservative, reworked the interprocedural to do simulated but bounded >>>>> inlining, and added code to do on-demand testing of CFL paths on both >>>>> compressed and full graphs. >>>>> >>>>> >>>>> >>>>> I reached a point where the ahead-of-time compression was linear but >>>>> still very accurate compared to on-demand path search and there were >>>>> noticeable improvements in the alias analysis results and impacted >>>>> transformations. Happy to share the patch with you if you are interested >>>>> as well as some data collected. >>>>> >>>>> >>>>> >>>>> However I was not able to see any performance improvements in the >>>>> code. In fact on a various benchmarks there were noticeable regressions in >>>>> measured performance of the generated code. Have you noticed any similar >>>>> problems? >>>>> >>>>> >>>>> >>>>> --david >>>>> >>>> >>>> >>>> _______________________________________________ >>>> LLVM Developers mailing list >>>> llvm-dev at lists.llvm.org >>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=> >>>> >>>> >>> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/fcfc8e9e/attachment-0001.html>
Note that this kind of "graph CSE" is the basis of HVN/HU/HRU in https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwjEupa6t93OAhUE1mMKHbsCDU4QFggeMAE&url=https%3A%2F%2Fwww.cs.ucsb.edu%2F~benh%2Fresearch%2Fpapers%2Fhardekopf07exploiting.pdf&usg=AFQjCNGNzQ6vgsfxWRMW3a5aA4fGGLADOQ&sig2=3mC64txCKq5daxQqMK7UIA&bvm=bv.130731782,d.cGc You could do the same analysis on this graph. On Thu, Aug 25, 2016 at 10:17 AM, David Callahan <dcallahan at fb.com> wrote:> Here is a summary of the experiment behind this patch > https://www.facebook.com/notes/david-callahan/llvm-cfl-alias-analysis/ > 10150648030284971 > > From: Daniel Berlin <dberlin at dberlin.org> > Date: Thursday, August 25, 2016 at 9:55 AM > > To: David Callahan <dcallahan at fb.com> > Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev Mailing > list <llvm-dev at lists.llvm.org> > Subject: Re: [llvm-dev] CFLAA > > (and sys::cas_flag that STATISTIC uses is a uint32 ...) > > On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at dberlin.org> > wrote: > >> Okay, dumb question: >> Are you really getting negative numbers in the second column? >> >> 526,766 -136 mem2reg # PHI nodes inserted >> >> http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8c >> pp_source.html >> <https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_docs_doxygen_html_PromoteMemoryToRegister-5F8cpp-5Fsource.html&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=dU2DwQMtmgh6cAhA6xnKC8SHBBR9WVaODVF8fDARIjk&s=2ZH2FuAaE0z-bdSkGBtIDBeN1Qw-28jivYLXVy158sU&e=> >> (Search for NumPHIInsert). >> >> I don't see how it could be negative unless this wrapped around? >> >> >> On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at fb.com> wrote: >> >>> I did gathered aggregate statistics reported by “-stats” over the ~400 >>> test files. >>> >>> The following table summarizes the impact. The first column is the >>> >>> sum where the new analysis is enabled, the second column is the >>> >>> delta from baseline where no CFL alias analysis is performed. I am not >>> >>> experienced enough to know which of these are “good” or “bad” indicators >>> . >>> >>> —david >>> >>> >>> >>> 72,250 685 SLP # vector instructions generated >>> >>> 1,256,401 566 adce # instructions removed >>> >>> 67,020,774 13,835,126 basicaa # times a GEP is decomposed >>> >>> 11,154 26 basicaa # times the limit to decompose GEPs >>> is reached >>> >>> 153,613 324 bdce # instructions removed (unused) >>> >>> 198,495 2 bdce # instructions trivialized (dead >>> bits) >>> >>> 298,621 0 cfl-od-aa Maximum compressed graph >>> >>> 58,462,719 0 cfl-od-aa Number Search Steps >>> >>> 48,401 0 cfl-od-aa # NoAlias results absed on address >>> roots >>> >>> 61,936 0 cfl-od-aa # NoAlias results on compressed >>> search path >>> >>> 3,768,131 0 cfl-od-aa # NoAlias results on fast path >>> >>> 47,016,909 0 cfl-od-aa # calls to query() >>> >>> 43,172,261 0 cfl-od-aa # instructions analyzed >>> >>> 10,515,257 0 cfl-od-aa # times there was no graph node for >>> a value >>> >>> 9,895,755 0 cfl-od-aa Total size of compressed graphs >>> (edges) >>> >>> 2,797 2 correlated-value-propagation # comparisons >>> propagated >>> >>> 66,515 -126 correlated-value-propagation # phis propagated >>> >>> 912 3 correlated-value-propagation # sdiv converted to >>> udiv >>> >>> 13,527 501 dse # other instrs removed >>> >>> 40,973 416 dse # stores deleted >>> >>> 126 -2 early-cse # compare instructions CVP'd >>> >>> 1,824,703 -138 early-cse # instructions CSE'd >>> >>> 1,875,417 87 early-cse # instructions simplified or DCE'd >>> >>> 62,505 1 functionattrs # arguments marked nocapture >>> >>> 29,979 1 functionattrs # arguments marked readonly >>> >>> 42,648 37 globaldce # functions removed >>> >>> 40,498 10 globaldce # global variables removed >>> >>> 4,368 35 gvn # blocks merged >>> >>> 21,961 26 gvn # equalities propagated >>> >>> 29,434 45 gvn # instructions PRE'd >>> >>> 631,597 3,307 gvn # instructions deleted >>> >>> 217,618 494 gvn # instructions simplified >>> >>> 51,089 634 gvn # loads PRE'd >>> >>> 135,568 1,526 gvn # loads deleted >>> >>> 2,197 4 indvars # IV comparisons eliminated >>> >>> 826 8 indvars # congruent IVs eliminated >>> >>> 2,538 4 indvars # exit values replaced >>> >>> 1,856 1 indvars # loop exit tests replaced >>> >>> 5,740,738 8 inline # caller-callers analyzed >>> >>> 1,629,169 3 inline # functions deleted because all >>> callers found >>> >>> 3,563,497 2 inline # functions inlined >>> >>> 10,879,125 86 inline-cost # call sites analyzed >>> >>> 34,766 5 instcombine # constant folds >>> >>> 3,979,078 2,004 instcombine # dead inst eliminated >>> >>> 6,323 2 instcombine # dead stores eliminated >>> >>> 1,522 4 instcombine # factorizations >>> >>> 254,146 66 instcombine # instructions sunk >>> >>> 10,427,131 1,749 instcombine # insts combined >>> >>> 57,943 -205 instcombine # reassociations >>> >>> 1,072 1 instsimplify # expansions >>> >>> 135,129 1 instsimplify # reassociations >>> >>> 121,777 246 instsimplify # redundant instructions removed >>> >>> 27,612 -12 jump-threading # branch blocks duplicated to >>> eliminate phi >>> >>> 76,000 197 jump-threading # jumps threaded >>> >>> 4,991 8 jump-threading # terminators folded >>> >>> 869,838 1,370 lcssa # live out of a loop variables >>> >>> 345,329 433 licm # instructions hoisted out of loop >>> >>> 702 -27 licm # instructions sunk out of loop >>> >>> 19,520 192 licm # load insts hoisted or sunk >>> >>> 202 37 licm # memory locations promoted to >>> registers >>> >>> 467,244 246 local # unreachable basic blocks removed >>> >>> 1,586 34 loop-delete # loops deleted >>> >>> 84 27 loop-idiom # memcpy's formed from loop >>> load+stores >>> >>> 752 7 loop-idiom # memset's formed from loop stores >>> >>> 63,364 -8 loop-rotate # loops rotated >>> >>> 4,602 1 loop-simplify # nested loops split out >>> >>> 1,244,741 472 loop-simplify # pre-header or exit blocks >>> inserted >>> >>> 2,847 2 loop-unroll # loops completely unrolled >>> >>> 9,668 -29 loop-unroll # loops unrolled (completely or >>> otherwise) >>> >>> 5,799 -35 loop-unroll # loops unrolled with run-time trip >>> counts >>> >>> 3,863 25 loop-unswitch # branches unswitched >>> >>> 1,054,060 1,482 loop-unswitch Total number of instructions >>> analyzed >>> >>> 109,279 -3 loop-vectorize # loops analyzed for vectorization >>> >>> 526,766 -136 mem2reg # PHI nodes inserted >>> >>> 4,150,078 -3 mem2reg # alloca's promoted with a single >>> store >>> >>> 4,567 6 memcpyopt # memcpy instructions deleted >>> >>> 96 1 memcpyopt # memcpys converted to memset >>> >>> 1,074 173 memcpyopt # memmoves converted to memcpy >>> >>> 39,584 6 memcpyopt # memsets inferred >>> >>> 179,629 2,475 memdep # block queries that were >>> completely cached >>> >>> 1,020 -3 memdep # cached, but dirty, non-local ptr >>> responses >>> >>> 9,108,504 146,792 memdep # fully cached non-local ptr >>> responses >>> >>> 11,678,674 92,225 memdep # uncached non-local ptr responses >>> >>> 399,802 1,832 memory-builtins # arguments with unsolved size >>> and offset >>> >>> 10,844 -24,169 memory-builtins # load instructions with >>> unsolved size and offset >>> >>> 188,181 54 reassociate # insts reassociated >>> >>> 87,009 -82 scalar-evolution # loops with predictable loop >>> counts >>> >>> 402,724 71 scalar-evolution # loops without predictable >>> loop counts >>> >>> 133,310 72 sccp # basic blocks unreachable >>> >>> 275,949 263 sccp # instructions removed >>> >>> 2,056,414 723 simplifycfg # blocks simplified >>> >>> 5,292 -36 simplifycfg # common instructions sunk down to >>> the end block >>> >>> 15,110 1 simplifycfg # speculative executed instructions >>> >>> 43,068 -2 sroa Maximum number of uses of a >>> partition >>> >>> 11,754,901 -180 sroa # alloca partition uses rewritten >>> >>> 4,623,115 -11 sroa # alloca partitions formed >>> >>> 5,927,727 -11 sroa # allocas analyzed for replacement >>> >>> 4,576,406 -5 sroa # allocas promoted to SSA values >>> >>> 13,770,636 -227 sroa # instructions deleted >>> >>> 3,797 -1 strip-dead-prototypes # dead prototypes removed >>> >>> >>> >>> >>> >>> >>> >>> From: Daniel Berlin <dberlin at dberlin.org> >>> Date: Thursday, August 25, 2016 at 9:06 AM >>> To: David Callahan <dcallahan at fb.com> >>> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev Mailing >>> list <llvm-dev at lists.llvm.org> >>> Subject: Re: [llvm-dev] CFLAA >>> >>> Hey David, >>> I'll take a look at the patch :) >>> Sounds like fun work. >>> >>> As George says, improving AA significantly will almost always cause >>> significant performance regressions at first, in almost any compiler. >>> >>> Compilers knobs, passes, usually get tuned for x amount of freedom, and >>> if you give them 10x, they start moving things too far, vectorizing too >>> much, spilling, etc. >>> >>> This was definitely the case for GCC, where adding a precise >>> interprocedural field-sensitive analysis initially regressed performance by >>> a few percent on average. >>> >>> I know it was also the case for XLC at IBM, etc. >>> >>> Like anything else, just gotta figure out what passes are going nuts, >>> and rework them to have better heuristics/etc. >>> The end result is performance improvements, but the path takes a bit of >>> time. >>> >>> If you need a way to see whether your analysis has actually done an okay >>> job in the meantime, usually a good way to see if you are doing well or not >>> is to see how many loads/stores get eliminated or moved by various passes >>> before and after. >>> >>> If the number is significantly higher, great. >>> If the number is significantly lower, something has likely gone wrong :) >>> >>> >>> On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev < >>> llvm-dev at lists.llvm.org> wrote: >>> >>>> (Adding “LLVM Dev”) >>>> >>>> My variant is up as https://reviews.llvm.org/D23876 >>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=> >>>> —david >>>> >>>> >>>> From: George Burgess IV <george.burgess.iv at gmail.com> >>>> Date: Wednesday, August 24, 2016 at 3:17 PM >>>> To: David Callahan <dcallahan at fb.com> >>>> Subject: Re: CFLAA >>>> >>>> Hi! >>>> >>>> > I see there is on going work with alias analysis and it appears the >>>> prior CFLAA has been abandoned. >>>> >>>> There was quite a bit of refactoring done, yeah. The original CFLAA is >>>> now called CFLSteens, and graph construction was moved to its own bit. We >>>> also have CFLAnders, which is based more heavily on the paper by Zheng and >>>> Rugina (e.g. no stratifiedsets magic). >>>> >>>> > I have a variant of it where I reworked how compression was done to >>>> be less conservative, reworked the interprocedural to do simulated but >>>> bounded inlining, and added code to do on-demand testing of CFL paths on >>>> both compressed and full graphs. >>>> >>>> Awesome! >>>> >>>> > Happy to share the patch with you if you are interested as well as >>>> some data collected >>>> >>>> Yes, please. Would you mind if I CC'ed llvm-dev on this thread (and a >>>> few people specifically, who also might find this interesting)? >>>> >>>> > However I was not able to see any performance improvements in the >>>> code. In fact on a various benchmarks there were noticeable regressions in >>>> measured performance of the generated code. Have you noticed any similar >>>> problems? >>>> >>>> I know that a number of people people in the community expressed >>>> concerns about how other passes will perform with better AA results (e.g. >>>> If LICM becomes more aggressive, register pressure may increase, which may >>>> cause us to spill when we haven't before, etc). So, such a problem isn't >>>> unthinkable. :) >>>> >>>> On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan at fb.com> >>>> wrote: >>>> >>>>> Hi Greg, >>>>> >>>>> >>>>> >>>>> I see there is on going work with alias analysis and it appears the >>>>> prior CFLAA has been abandoned. >>>>> >>>>> >>>>> >>>>> I have a variant of it where I reworked how compression was done to be >>>>> less conservative, reworked the interprocedural to do simulated but bounded >>>>> inlining, and added code to do on-demand testing of CFL paths on both >>>>> compressed and full graphs. >>>>> >>>>> >>>>> >>>>> I reached a point where the ahead-of-time compression was linear but >>>>> still very accurate compared to on-demand path search and there were >>>>> noticeable improvements in the alias analysis results and impacted >>>>> transformations. Happy to share the patch with you if you are interested >>>>> as well as some data collected. >>>>> >>>>> >>>>> >>>>> However I was not able to see any performance improvements in the >>>>> code. In fact on a various benchmarks there were noticeable regressions in >>>>> measured performance of the generated code. Have you noticed any similar >>>>> problems? >>>>> >>>>> >>>>> >>>>> --david >>>>> >>>> >>>> >>>> _______________________________________________ >>>> LLVM Developers mailing list >>>> llvm-dev at lists.llvm.org >>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=> >>>> >>>> >>> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/598d29a6/attachment.html>