thr3ads.net - llvm dev - [llvm-dev] CFLAA [Aug 2016]

If this information is useful, please help other people find it:
Share via:

Daniel Berlin via llvm-dev

2016-Aug-25 16:55 UTC

[llvm-dev] CFLAA

(and sys::cas_flag that STATISTIC uses is a uint32 ...)

On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at dberlin.org>
wrote:
> Okay, dumb question:
> Are you really getting negative numbers in the second column?
>
>     526,766      -136  mem2reg       # PHI nodes inserted
>
> http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8cpp_source.html
> (Search for NumPHIInsert).
>
> I don't see how it could be negative unless this wrapped around?
>
>
> On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at fb.com>
wrote:
>
>> I did gathered aggregate statistics reported by “-stats” over the ~400
>> test files.
>>
>> The following table summarizes the impact.  The first column is the
>>
>> sum where the new analysis is enabled, the second column is the
>>
>> delta from baseline where no CFL alias analysis is performed. I am not
>>
>> experienced enough to know which of these are “good” or “bad”
indicators.
>>
>> —david
>>
>>
>>
>>      72,250       685  SLP           # vector instructions generated
>>
>>   1,256,401       566  adce          # instructions removed
>>
>>  67,020,774  13,835,126  basicaa       # times a GEP is decomposed
>>
>>      11,154        26  basicaa       # times the limit to decompose
GEPs
>> is reached
>>
>>     153,613       324  bdce          # instructions removed (unused)
>>
>>     198,495         2  bdce          # instructions trivialized (dead
>> bits)
>>
>>     298,621         0  cfl-od-aa     Maximum compressed graph
>>
>>  58,462,719         0  cfl-od-aa     Number Search Steps
>>
>>      48,401         0  cfl-od-aa     # NoAlias results absed on address
>> roots
>>
>>      61,936         0  cfl-od-aa     # NoAlias results on compressed
>> search path
>>
>>   3,768,131         0  cfl-od-aa     # NoAlias results on fast path
>>
>>  47,016,909         0  cfl-od-aa     # calls to query()
>>
>>  43,172,261         0  cfl-od-aa     # instructions analyzed
>>
>>  10,515,257         0  cfl-od-aa     # times there was no graph node
for
>> a value
>>
>>   9,895,755         0  cfl-od-aa     Total size of compressed graphs
>> (edges)
>>
>>       2,797         2  correlated-value-propagation  # comparisons
>> propagated
>>
>>      66,515      -126  correlated-value-propagation  # phis propagated
>>
>>         912         3  correlated-value-propagation  # sdiv converted
to
>> udiv
>>
>>      13,527       501  dse           # other instrs removed
>>
>>      40,973       416  dse           # stores deleted
>>
>>         126        -2  early-cse     # compare instructions CVP'd
>>
>>   1,824,703      -138  early-cse     # instructions CSE'd
>>
>>   1,875,417        87  early-cse     # instructions simplified or
DCE'd
>>
>>      62,505         1  functionattrs  # arguments marked nocapture
>>
>>      29,979         1  functionattrs  # arguments marked readonly
>>
>>      42,648        37  globaldce     # functions removed
>>
>>      40,498        10  globaldce     # global variables removed
>>
>>       4,368        35  gvn           # blocks merged
>>
>>      21,961        26  gvn           # equalities propagated
>>
>>      29,434        45  gvn           # instructions PRE'd
>>
>>     631,597     3,307  gvn           # instructions deleted
>>
>>     217,618       494  gvn           # instructions simplified
>>
>>      51,089       634  gvn           # loads PRE'd
>>
>>     135,568     1,526  gvn           # loads deleted
>>
>>       2,197         4  indvars       # IV comparisons eliminated
>>
>>         826         8  indvars       # congruent IVs eliminated
>>
>>       2,538         4  indvars       # exit values replaced
>>
>>       1,856         1  indvars       # loop exit tests replaced
>>
>>   5,740,738         8  inline        # caller-callers analyzed
>>
>>   1,629,169         3  inline        # functions deleted because all
>> callers found
>>
>>   3,563,497         2  inline        # functions inlined
>>
>>  10,879,125        86  inline-cost   # call sites analyzed
>>
>>      34,766         5  instcombine   # constant folds
>>
>>   3,979,078     2,004  instcombine   # dead inst eliminated
>>
>>       6,323         2  instcombine   # dead stores eliminated
>>
>>       1,522         4  instcombine   # factorizations
>>
>>     254,146        66  instcombine   # instructions sunk
>>
>>  10,427,131     1,749  instcombine   # insts combined
>>
>>      57,943      -205  instcombine   # reassociations
>>
>>       1,072         1  instsimplify  # expansions
>>
>>     135,129         1  instsimplify  # reassociations
>>
>>     121,777       246  instsimplify  # redundant instructions removed
>>
>>      27,612       -12  jump-threading  # branch blocks duplicated to
>> eliminate phi
>>
>>      76,000       197  jump-threading  # jumps threaded
>>
>>       4,991         8  jump-threading  # terminators folded
>>
>>     869,838     1,370  lcssa         # live out of a loop variables
>>
>>     345,329       433  licm          # instructions hoisted out of loop
>>
>>         702       -27  licm          # instructions sunk out of loop
>>
>>      19,520       192  licm          # load insts hoisted or sunk
>>
>>         202        37  licm          # memory locations promoted to
>> registers
>>
>>     467,244       246  local         # unreachable basic blocks removed
>>
>>       1,586        34  loop-delete   # loops deleted
>>
>>          84        27  loop-idiom    # memcpy's formed from loop
>> load+stores
>>
>>         752         7  loop-idiom    # memset's formed from loop
stores
>>
>>      63,364        -8  loop-rotate   # loops rotated
>>
>>       4,602         1  loop-simplify  # nested loops split out
>>
>>   1,244,741       472  loop-simplify  # pre-header or exit blocks
inserted
>>
>>       2,847         2  loop-unroll   # loops completely unrolled
>>
>>       9,668       -29  loop-unroll   # loops unrolled (completely or
>> otherwise)
>>
>>       5,799       -35  loop-unroll   # loops unrolled with run-time
trip
>> counts
>>
>>       3,863        25  loop-unswitch  # branches unswitched
>>
>>   1,054,060     1,482  loop-unswitch  Total number of instructions
>> analyzed
>>
>>     109,279        -3  loop-vectorize  # loops analyzed for
vectorization
>>
>>     526,766      -136  mem2reg       # PHI nodes inserted
>>
>>   4,150,078        -3  mem2reg       # alloca's promoted with a
single
>> store
>>
>>       4,567         6  memcpyopt     # memcpy instructions deleted
>>
>>          96         1  memcpyopt     # memcpys converted to memset
>>
>>       1,074       173  memcpyopt     # memmoves converted to memcpy
>>
>>      39,584         6  memcpyopt     # memsets inferred
>>
>>     179,629     2,475  memdep        # block queries that were
completely
>> cached
>>
>>       1,020        -3  memdep        # cached, but dirty, non-local ptr
>> responses
>>
>>   9,108,504   146,792  memdep        # fully cached non-local ptr
>> responses
>>
>>  11,678,674    92,225  memdep        # uncached non-local ptr responses
>>
>>     399,802     1,832  memory-builtins  # arguments with unsolved size
>> and offset
>>
>>      10,844   -24,169  memory-builtins  # load instructions with
unsolved
>> size and offset
>>
>>     188,181        54  reassociate   # insts reassociated
>>
>>      87,009       -82  scalar-evolution  # loops with predictable loop
>> counts
>>
>>     402,724        71  scalar-evolution  # loops without predictable
loop
>> counts
>>
>>     133,310        72  sccp          # basic blocks unreachable
>>
>>     275,949       263  sccp          # instructions removed
>>
>>   2,056,414       723  simplifycfg   # blocks simplified
>>
>>       5,292       -36  simplifycfg   # common instructions sunk down to
>> the end block
>>
>>      15,110         1  simplifycfg   # speculative executed
instructions
>>
>>      43,068        -2  sroa          Maximum number of uses of a
partition
>>
>>  11,754,901      -180  sroa          # alloca partition uses rewritten
>>
>>   4,623,115       -11  sroa          # alloca partitions formed
>>
>>   5,927,727       -11  sroa          # allocas analyzed for replacement
>>
>>   4,576,406        -5  sroa          # allocas promoted to SSA values
>>
>>  13,770,636      -227  sroa          # instructions deleted
>>
>>       3,797        -1  strip-dead-prototypes  # dead prototypes removed
>>
>>
>>
>>
>>
>>
>>
>> From: Daniel Berlin <dberlin at dberlin.org>
>> Date: Thursday, August 25, 2016 at 9:06 AM
>> To: David Callahan <dcallahan at fb.com>
>> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev
Mailing
>> list <llvm-dev at lists.llvm.org>
>> Subject: Re: [llvm-dev] CFLAA
>>
>> Hey David,
>> I'll take a look at the patch :)
>> Sounds like fun work.
>>
>> As George says, improving AA significantly will almost always cause
>> significant performance regressions at first, in almost any compiler.
>>
>> Compilers knobs, passes,  usually get tuned for x amount of freedom,
and
>> if you give them 10x, they start moving things too far, vectorizing too
>> much, spilling, etc.
>>
>> This was definitely the case for GCC, where adding a precise
>> interprocedural field-sensitive analysis initially regressed
performance by
>> a few percent on average.
>>
>> I know it was also the case for XLC at IBM, etc.
>>
>> Like anything else, just gotta figure out what passes are going nuts,
and
>> rework them to have better heuristics/etc.
>> The end result is performance improvements, but the path takes a bit of
>> time.
>>
>> If you need a way to see whether your analysis has actually done an
okay
>> job in the meantime, usually a good way to see if you are doing well or
not
>> is to see how many loads/stores get eliminated or moved by various
passes
>> before and after.
>>
>> If the number is significantly higher, great.
>> If the number is significantly lower, something has likely gone wrong
:)
>>
>>
>> On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> (Adding “LLVM Dev”)
>>>
>>> My variant is up as https://reviews.llvm.org/D23876
>>>
<https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=>
>>> —david
>>>
>>>
>>> From: George Burgess IV <george.burgess.iv at gmail.com>
>>> Date: Wednesday, August 24, 2016 at 3:17 PM
>>> To: David Callahan <dcallahan at fb.com>
>>> Subject: Re: CFLAA
>>>
>>> Hi!
>>>
>>> > I see there is on going work with alias analysis and it
appears the
>>> prior CFLAA has been abandoned.
>>>
>>> There was quite a bit of refactoring done, yeah. The original CFLAA
is
>>> now called CFLSteens, and graph construction was moved to its own
bit. We
>>> also have CFLAnders, which is based more heavily on the paper by
Zheng and
>>> Rugina (e.g. no stratifiedsets magic).
>>>
>>> > I have a variant of it where I reworked how compression was
done to be
>>> less conservative, reworked the interprocedural to do simulated but
bounded
>>> inlining, and added code to do on-demand testing of CFL paths on
both
>>> compressed and full graphs.
>>>
>>> Awesome!
>>>
>>> > Happy to share the patch with you if you are interested as
well as
>>> some data collected
>>>
>>> Yes, please. Would you mind if I CC'ed llvm-dev on this thread
(and a
>>> few people specifically, who also might find this interesting)?
>>>
>>> > However I was not able to see any performance improvements in
the
>>> code. In fact on a various benchmarks there were noticeable
regressions in
>>> measured performance of the generated code. Have you noticed any
similar
>>> problems?
>>>
>>> I know that a number of people people in the community expressed
>>> concerns about how other passes will perform with better AA results
(e.g.
>>> If LICM becomes more aggressive, register pressure may increase,
which may
>>> cause us to spill when we haven't before, etc). So, such a
problem isn't
>>> unthinkable. :)
>>>
>>> On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan at
fb.com>
>>> wrote:
>>>
>>>> Hi Greg,
>>>>
>>>>
>>>>
>>>> I see there is on going work with alias analysis and it appears
the
>>>> prior CFLAA has been abandoned.
>>>>
>>>>
>>>>
>>>> I have a variant of it where I reworked how compression was
done to be
>>>> less conservative, reworked the interprocedural to do simulated
but bounded
>>>> inlining, and added code to do on-demand testing of CFL paths
on both
>>>> compressed and full graphs.
>>>>
>>>>
>>>>
>>>> I reached a point where the ahead-of-time compression was
linear but
>>>> still very accurate compared to on-demand path search and there
were
>>>> noticeable improvements in the alias analysis results and
impacted
>>>> transformations.  Happy to share the patch with you if you are
interested
>>>> as well as some data collected.
>>>>
>>>>
>>>>
>>>> However I was not able to see any performance improvements in
the code.
>>>> In fact on a various benchmarks there were noticeable
regressions in
>>>> measured performance of the generated code. Have you noticed
any similar
>>>> problems?
>>>>
>>>>
>>>>
>>>> --david
>>>>
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
<https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=>
>>>
>>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/25e19284/attachment-0001.html>

David Callahan via llvm-dev

2016-Aug-25 16:56 UTC

head link

[llvm-dev] CFLAA

The second column is the difference between the new analysis and a baseline
without the analysis.

From: Daniel Berlin <dberlin at dberlin.org<mailto:dberlin at
dberlin.org>>
Date: Thursday, August 25, 2016 at 9:55 AM
To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>>
Cc: George Burgess IV <george.burgess.iv at
gmail.com<mailto:george.burgess.iv at gmail.com>>, LLVM Dev Mailing
list <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
Subject: Re: [llvm-dev] CFLAA

(and sys::cas_flag that STATISTIC uses is a uint32 ...)

On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at
dberlin.org<mailto:dberlin at dberlin.org>> wrote:
Okay, dumb question:
Are you really getting negative numbers in the second column?

    526,766      -136  mem2reg       # PHI nodes inserted

http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8cpp_source.html<https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_docs_doxygen_html_PromoteMemoryToRegister-5F8cpp-5Fsource.html&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=dU2DwQMtmgh6cAhA6xnKC8SHBBR9WVaODVF8fDARIjk&s=2ZH2FuAaE0z-bdSkGBtIDBeN1Qw-28jivYLXVy158sU&e=>
(Search for NumPHIInsert).

I don't see how it could be negative unless this wrapped around?


On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at
fb.com<mailto:dcallahan at fb.com>> wrote:
I did gathered aggregate statistics reported by “-stats” over the ~400 test
files.
The following table summarizes the impact.  The first column is the
sum where the new analysis is enabled, the second column is the
delta from baseline where no CFL alias analysis is performed. I am not
experienced enough to know which of these are “good” or “bad” indicators.
—david

     72,250       685  SLP           # vector instructions generated
  1,256,401       566  adce          # instructions removed
 67,020,774  13,835,126  basicaa       # times a GEP is decomposed
     11,154        26  basicaa       # times the limit to decompose GEPs is
reached
    153,613       324  bdce          # instructions removed (unused)
    198,495         2  bdce          # instructions trivialized (dead bits)
    298,621         0  cfl-od-aa     Maximum compressed graph
 58,462,719         0  cfl-od-aa     Number Search Steps
     48,401         0  cfl-od-aa     # NoAlias results absed on address roots
     61,936         0  cfl-od-aa     # NoAlias results on compressed search path
  3,768,131         0  cfl-od-aa     # NoAlias results on fast path
 47,016,909         0  cfl-od-aa     # calls to query()
 43,172,261         0  cfl-od-aa     # instructions analyzed
 10,515,257         0  cfl-od-aa     # times there was no graph node for a value
  9,895,755         0  cfl-od-aa     Total size of compressed graphs (edges)
      2,797         2  correlated-value-propagation  # comparisons propagated
     66,515      -126  correlated-value-propagation  # phis propagated
        912         3  correlated-value-propagation  # sdiv converted to udiv
     13,527       501  dse           # other instrs removed
     40,973       416  dse           # stores deleted
        126        -2  early-cse     # compare instructions CVP'd
  1,824,703      -138  early-cse     # instructions CSE'd
  1,875,417        87  early-cse     # instructions simplified or DCE'd
     62,505         1  functionattrs  # arguments marked nocapture
     29,979         1  functionattrs  # arguments marked readonly
     42,648        37  globaldce     # functions removed
     40,498        10  globaldce     # global variables removed
      4,368        35  gvn           # blocks merged
     21,961        26  gvn           # equalities propagated
     29,434        45  gvn           # instructions PRE'd
    631,597     3,307  gvn           # instructions deleted
    217,618       494  gvn           # instructions simplified
     51,089       634  gvn           # loads PRE'd
    135,568     1,526  gvn           # loads deleted
      2,197         4  indvars       # IV comparisons eliminated
        826         8  indvars       # congruent IVs eliminated
      2,538         4  indvars       # exit values replaced
      1,856         1  indvars       # loop exit tests replaced
  5,740,738         8  inline        # caller-callers analyzed
  1,629,169         3  inline        # functions deleted because all callers
found
  3,563,497         2  inline        # functions inlined
 10,879,125        86  inline-cost   # call sites analyzed
     34,766         5  instcombine   # constant folds
  3,979,078     2,004  instcombine   # dead inst eliminated
      6,323         2  instcombine   # dead stores eliminated
      1,522         4  instcombine   # factorizations
    254,146        66  instcombine   # instructions sunk
 10,427,131     1,749  instcombine   # insts combined
     57,943      -205  instcombine   # reassociations
      1,072         1  instsimplify  # expansions
    135,129         1  instsimplify  # reassociations
    121,777       246  instsimplify  # redundant instructions removed
     27,612       -12  jump-threading  # branch blocks duplicated to eliminate
phi
     76,000       197  jump-threading  # jumps threaded
      4,991         8  jump-threading  # terminators folded
    869,838     1,370  lcssa         # live out of a loop variables
    345,329       433  licm          # instructions hoisted out of loop
        702       -27  licm          # instructions sunk out of loop
     19,520       192  licm          # load insts hoisted or sunk
        202        37  licm          # memory locations promoted to registers
    467,244       246  local         # unreachable basic blocks removed
      1,586        34  loop-delete   # loops deleted
         84        27  loop-idiom    # memcpy's formed from loop load+stores
        752         7  loop-idiom    # memset's formed from loop stores
     63,364        -8  loop-rotate   # loops rotated
      4,602         1  loop-simplify  # nested loops split out
  1,244,741       472  loop-simplify  # pre-header or exit blocks inserted
      2,847         2  loop-unroll   # loops completely unrolled
      9,668       -29  loop-unroll   # loops unrolled (completely or otherwise)
      5,799       -35  loop-unroll   # loops unrolled with run-time trip counts
      3,863        25  loop-unswitch  # branches unswitched
  1,054,060     1,482  loop-unswitch  Total number of instructions analyzed
    109,279        -3  loop-vectorize  # loops analyzed for vectorization
    526,766      -136  mem2reg       # PHI nodes inserted
  4,150,078        -3  mem2reg       # alloca's promoted with a single store
      4,567         6  memcpyopt     # memcpy instructions deleted
         96         1  memcpyopt     # memcpys converted to memset
      1,074       173  memcpyopt     # memmoves converted to memcpy
     39,584         6  memcpyopt     # memsets inferred
    179,629     2,475  memdep        # block queries that were completely cached
      1,020        -3  memdep        # cached, but dirty, non-local ptr
responses
  9,108,504   146,792  memdep        # fully cached non-local ptr responses
 11,678,674    92,225  memdep        # uncached non-local ptr responses
    399,802     1,832  memory-builtins  # arguments with unsolved size and
offset
     10,844   -24,169  memory-builtins  # load instructions with unsolved size
and offset
    188,181        54  reassociate   # insts reassociated
     87,009       -82  scalar-evolution  # loops with predictable loop counts
    402,724        71  scalar-evolution  # loops without predictable loop counts
    133,310        72  sccp          # basic blocks unreachable
    275,949       263  sccp          # instructions removed
  2,056,414       723  simplifycfg   # blocks simplified
      5,292       -36  simplifycfg   # common instructions sunk down to the end
block
     15,110         1  simplifycfg   # speculative executed instructions
     43,068        -2  sroa          Maximum number of uses of a partition
 11,754,901      -180  sroa          # alloca partition uses rewritten
  4,623,115       -11  sroa          # alloca partitions formed
  5,927,727       -11  sroa          # allocas analyzed for replacement
  4,576,406        -5  sroa          # allocas promoted to SSA values
 13,770,636      -227  sroa          # instructions deleted
      3,797        -1  strip-dead-prototypes  # dead prototypes removed





From: Daniel Berlin <dberlin at dberlin.org<mailto:dberlin at
dberlin.org>>
Date: Thursday, August 25, 2016 at 9:06 AM
To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>>
Cc: George Burgess IV <george.burgess.iv at
gmail.com<mailto:george.burgess.iv at gmail.com>>, LLVM Dev Mailing
list <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
Subject: Re: [llvm-dev] CFLAA

Hey David,
I'll take a look at the patch :)
Sounds like fun work.

As George says, improving AA significantly will almost always cause significant
performance regressions at first, in almost any compiler.

Compilers knobs, passes,  usually get tuned for x amount of freedom, and if you
give them 10x, they start moving things too far, vectorizing too much, spilling,
etc.

This was definitely the case for GCC, where adding a precise interprocedural
field-sensitive analysis initially regressed performance by a few percent on
average.

I know it was also the case for XLC at IBM, etc.

Like anything else, just gotta figure out what passes are going nuts, and rework
them to have better heuristics/etc.
The end result is performance improvements, but the path takes a bit of time.

If you need a way to see whether your analysis has actually done an okay job in
the meantime, usually a good way to see if you are doing well or not is to see
how many loads/stores get eliminated or moved by various passes before and
after.

If the number is significantly higher, great.
If the number is significantly lower, something has likely gone wrong :)


On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
(Adding “LLVM Dev”)

My variant is up as
https://reviews.llvm.org/D23876<https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=>
—david


From: George Burgess IV <george.burgess.iv at
gmail.com<mailto:george.burgess.iv at gmail.com>>
Date: Wednesday, August 24, 2016 at 3:17 PM
To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>>
Subject: Re: CFLAA

Hi!
> I see there is on going work with alias analysis and it appears the prior
CFLAA has been abandoned.
There was quite a bit of refactoring done, yeah. The original CFLAA is now
called CFLSteens, and graph construction was moved to its own bit. We also have
CFLAnders, which is based more heavily on the paper by Zheng and Rugina (e.g. no
stratifiedsets magic).
> I have a variant of it where I reworked how compression was done to be less
conservative, reworked the interprocedural to do simulated but bounded inlining,
and added code to do on-demand testing of CFL paths on both compressed and full
graphs.
Awesome!
> Happy to share the patch with you if you are interested as well as some
data collected
Yes, please. Would you mind if I CC'ed llvm-dev on this thread (and a few
people specifically, who also might find this interesting)?
> However I was not able to see any performance improvements in the code. In
fact on a various benchmarks there were noticeable regressions in measured
performance of the generated code. Have you noticed any similar problems?
I know that a number of people people in the community expressed concerns about
how other passes will perform with better AA results (e.g. If LICM becomes more
aggressive, register pressure may increase, which may cause us to spill when we
haven't before, etc). So, such a problem isn't unthinkable. :)

On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan at
fb.com<mailto:dcallahan at fb.com>> wrote:
Hi Greg,

I see there is on going work with alias analysis and it appears the prior CFLAA
has been abandoned.

I have a variant of it where I reworked how compression was done to be less
conservative, reworked the interprocedural to do simulated but bounded inlining,
and added code to do on-demand testing of CFL paths on both compressed and full
graphs.

I reached a point where the ahead-of-time compression was linear but still very
accurate compared to on-demand path search and there were noticeable
improvements in the alias analysis results and impacted transformations.  Happy
to share the patch with you if you are interested as well as some data
collected.

However I was not able to see any performance improvements in the code. In fact
on a various benchmarks there were noticeable regressions in measured
performance of the generated code. Have you noticed any similar problems?

--david


_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=>




-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/2a010a23/attachment.html>

David Callahan via llvm-dev

2016-Aug-25 17:17 UTC

head link

[llvm-dev] CFLAA

Here is a summary of the experiment behind this patch
https://www.facebook.com/notes/david-callahan/llvm-cfl-alias-analysis/10150648030284971

From: Daniel Berlin <dberlin at dberlin.org<mailto:dberlin at
dberlin.org>>
Date: Thursday, August 25, 2016 at 9:55 AM
To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>>
Cc: George Burgess IV <george.burgess.iv at
gmail.com<mailto:george.burgess.iv at gmail.com>>, LLVM Dev Mailing
list <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
Subject: Re: [llvm-dev] CFLAA

(and sys::cas_flag that STATISTIC uses is a uint32 ...)

On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at
dberlin.org<mailto:dberlin at dberlin.org>> wrote:
Okay, dumb question:
Are you really getting negative numbers in the second column?

    526,766      -136  mem2reg       # PHI nodes inserted

http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8cpp_source.html<https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_docs_doxygen_html_PromoteMemoryToRegister-5F8cpp-5Fsource.html&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=dU2DwQMtmgh6cAhA6xnKC8SHBBR9WVaODVF8fDARIjk&s=2ZH2FuAaE0z-bdSkGBtIDBeN1Qw-28jivYLXVy158sU&e=>
(Search for NumPHIInsert).

I don't see how it could be negative unless this wrapped around?


On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at
fb.com<mailto:dcallahan at fb.com>> wrote:
I did gathered aggregate statistics reported by “-stats” over the ~400 test
files.
The following table summarizes the impact.  The first column is the
sum where the new analysis is enabled, the second column is the
delta from baseline where no CFL alias analysis is performed. I am not
experienced enough to know which of these are “good” or “bad” indicators.
—david

     72,250       685  SLP           # vector instructions generated
  1,256,401       566  adce          # instructions removed
 67,020,774  13,835,126  basicaa       # times a GEP is decomposed
     11,154        26  basicaa       # times the limit to decompose GEPs is
reached
    153,613       324  bdce          # instructions removed (unused)
    198,495         2  bdce          # instructions trivialized (dead bits)
    298,621         0  cfl-od-aa     Maximum compressed graph
 58,462,719         0  cfl-od-aa     Number Search Steps
     48,401         0  cfl-od-aa     # NoAlias results absed on address roots
     61,936         0  cfl-od-aa     # NoAlias results on compressed search path
  3,768,131         0  cfl-od-aa     # NoAlias results on fast path
 47,016,909         0  cfl-od-aa     # calls to query()
 43,172,261         0  cfl-od-aa     # instructions analyzed
 10,515,257         0  cfl-od-aa     # times there was no graph node for a value
  9,895,755         0  cfl-od-aa     Total size of compressed graphs (edges)
      2,797         2  correlated-value-propagation  # comparisons propagated
     66,515      -126  correlated-value-propagation  # phis propagated
        912         3  correlated-value-propagation  # sdiv converted to udiv
     13,527       501  dse           # other instrs removed
     40,973       416  dse           # stores deleted
        126        -2  early-cse     # compare instructions CVP'd
  1,824,703      -138  early-cse     # instructions CSE'd
  1,875,417        87  early-cse     # instructions simplified or DCE'd
     62,505         1  functionattrs  # arguments marked nocapture
     29,979         1  functionattrs  # arguments marked readonly
     42,648        37  globaldce     # functions removed
     40,498        10  globaldce     # global variables removed
      4,368        35  gvn           # blocks merged
     21,961        26  gvn           # equalities propagated
     29,434        45  gvn           # instructions PRE'd
    631,597     3,307  gvn           # instructions deleted
    217,618       494  gvn           # instructions simplified
     51,089       634  gvn           # loads PRE'd
    135,568     1,526  gvn           # loads deleted
      2,197         4  indvars       # IV comparisons eliminated
        826         8  indvars       # congruent IVs eliminated
      2,538         4  indvars       # exit values replaced
      1,856         1  indvars       # loop exit tests replaced
  5,740,738         8  inline        # caller-callers analyzed
  1,629,169         3  inline        # functions deleted because all callers
found
  3,563,497         2  inline        # functions inlined
 10,879,125        86  inline-cost   # call sites analyzed
     34,766         5  instcombine   # constant folds
  3,979,078     2,004  instcombine   # dead inst eliminated
      6,323         2  instcombine   # dead stores eliminated
      1,522         4  instcombine   # factorizations
    254,146        66  instcombine   # instructions sunk
 10,427,131     1,749  instcombine   # insts combined
     57,943      -205  instcombine   # reassociations
      1,072         1  instsimplify  # expansions
    135,129         1  instsimplify  # reassociations
    121,777       246  instsimplify  # redundant instructions removed
     27,612       -12  jump-threading  # branch blocks duplicated to eliminate
phi
     76,000       197  jump-threading  # jumps threaded
      4,991         8  jump-threading  # terminators folded
    869,838     1,370  lcssa         # live out of a loop variables
    345,329       433  licm          # instructions hoisted out of loop
        702       -27  licm          # instructions sunk out of loop
     19,520       192  licm          # load insts hoisted or sunk
        202        37  licm          # memory locations promoted to registers
    467,244       246  local         # unreachable basic blocks removed
      1,586        34  loop-delete   # loops deleted
         84        27  loop-idiom    # memcpy's formed from loop load+stores
        752         7  loop-idiom    # memset's formed from loop stores
     63,364        -8  loop-rotate   # loops rotated
      4,602         1  loop-simplify  # nested loops split out
  1,244,741       472  loop-simplify  # pre-header or exit blocks inserted
      2,847         2  loop-unroll   # loops completely unrolled
      9,668       -29  loop-unroll   # loops unrolled (completely or otherwise)
      5,799       -35  loop-unroll   # loops unrolled with run-time trip counts
      3,863        25  loop-unswitch  # branches unswitched
  1,054,060     1,482  loop-unswitch  Total number of instructions analyzed
    109,279        -3  loop-vectorize  # loops analyzed for vectorization
    526,766      -136  mem2reg       # PHI nodes inserted
  4,150,078        -3  mem2reg       # alloca's promoted with a single store
      4,567         6  memcpyopt     # memcpy instructions deleted
         96         1  memcpyopt     # memcpys converted to memset
      1,074       173  memcpyopt     # memmoves converted to memcpy
     39,584         6  memcpyopt     # memsets inferred
    179,629     2,475  memdep        # block queries that were completely cached
      1,020        -3  memdep        # cached, but dirty, non-local ptr
responses
  9,108,504   146,792  memdep        # fully cached non-local ptr responses
 11,678,674    92,225  memdep        # uncached non-local ptr responses
    399,802     1,832  memory-builtins  # arguments with unsolved size and
offset
     10,844   -24,169  memory-builtins  # load instructions with unsolved size
and offset
    188,181        54  reassociate   # insts reassociated
     87,009       -82  scalar-evolution  # loops with predictable loop counts
    402,724        71  scalar-evolution  # loops without predictable loop counts
    133,310        72  sccp          # basic blocks unreachable
    275,949       263  sccp          # instructions removed
  2,056,414       723  simplifycfg   # blocks simplified
      5,292       -36  simplifycfg   # common instructions sunk down to the end
block
     15,110         1  simplifycfg   # speculative executed instructions
     43,068        -2  sroa          Maximum number of uses of a partition
 11,754,901      -180  sroa          # alloca partition uses rewritten
  4,623,115       -11  sroa          # alloca partitions formed
  5,927,727       -11  sroa          # allocas analyzed for replacement
  4,576,406        -5  sroa          # allocas promoted to SSA values
 13,770,636      -227  sroa          # instructions deleted
      3,797        -1  strip-dead-prototypes  # dead prototypes removed





From: Daniel Berlin <dberlin at dberlin.org<mailto:dberlin at
dberlin.org>>
Date: Thursday, August 25, 2016 at 9:06 AM
To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>>
Cc: George Burgess IV <george.burgess.iv at
gmail.com<mailto:george.burgess.iv at gmail.com>>, LLVM Dev Mailing
list <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>>
Subject: Re: [llvm-dev] CFLAA

Hey David,
I'll take a look at the patch :)
Sounds like fun work.

As George says, improving AA significantly will almost always cause significant
performance regressions at first, in almost any compiler.

Compilers knobs, passes,  usually get tuned for x amount of freedom, and if you
give them 10x, they start moving things too far, vectorizing too much, spilling,
etc.

This was definitely the case for GCC, where adding a precise interprocedural
field-sensitive analysis initially regressed performance by a few percent on
average.

I know it was also the case for XLC at IBM, etc.

Like anything else, just gotta figure out what passes are going nuts, and rework
them to have better heuristics/etc.
The end result is performance improvements, but the path takes a bit of time.

If you need a way to see whether your analysis has actually done an okay job in
the meantime, usually a good way to see if you are doing well or not is to see
how many loads/stores get eliminated or moved by various passes before and
after.

If the number is significantly higher, great.
If the number is significantly lower, something has likely gone wrong :)


On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
(Adding “LLVM Dev”)

My variant is up as
https://reviews.llvm.org/D23876<https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=>
—david


From: George Burgess IV <george.burgess.iv at
gmail.com<mailto:george.burgess.iv at gmail.com>>
Date: Wednesday, August 24, 2016 at 3:17 PM
To: David Callahan <dcallahan at fb.com<mailto:dcallahan at fb.com>>
Subject: Re: CFLAA

Hi!
> I see there is on going work with alias analysis and it appears the prior
CFLAA has been abandoned.
There was quite a bit of refactoring done, yeah. The original CFLAA is now
called CFLSteens, and graph construction was moved to its own bit. We also have
CFLAnders, which is based more heavily on the paper by Zheng and Rugina (e.g. no
stratifiedsets magic).
> I have a variant of it where I reworked how compression was done to be less
conservative, reworked the interprocedural to do simulated but bounded inlining,
and added code to do on-demand testing of CFL paths on both compressed and full
graphs.
Awesome!
> Happy to share the patch with you if you are interested as well as some
data collected
Yes, please. Would you mind if I CC'ed llvm-dev on this thread (and a few
people specifically, who also might find this interesting)?
> However I was not able to see any performance improvements in the code. In
fact on a various benchmarks there were noticeable regressions in measured
performance of the generated code. Have you noticed any similar problems?
I know that a number of people people in the community expressed concerns about
how other passes will perform with better AA results (e.g. If LICM becomes more
aggressive, register pressure may increase, which may cause us to spill when we
haven't before, etc). So, such a problem isn't unthinkable. :)

On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan at
fb.com<mailto:dcallahan at fb.com>> wrote:
Hi Greg,

I see there is on going work with alias analysis and it appears the prior CFLAA
has been abandoned.

I have a variant of it where I reworked how compression was done to be less
conservative, reworked the interprocedural to do simulated but bounded inlining,
and added code to do on-demand testing of CFL paths on both compressed and full
graphs.

I reached a point where the ahead-of-time compression was linear but still very
accurate compared to on-demand path search and there were noticeable
improvements in the alias analysis results and impacted transformations.  Happy
to share the patch with you if you are interested as well as some data
collected.

However I was not able to see any performance improvements in the code. In fact
on a various benchmarks there were noticeable regressions in measured
performance of the generated code. Have you noticed any similar problems?

--david


_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=>




-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/bb637835/attachment-0001.html>

Daniel Berlin via llvm-dev

2016-Aug-25 20:43 UTC

head link

[llvm-dev] CFLAA

One of the things throwing me off is the cfl numbers being 0 :)

Outside of one number, they could be deltas in either direction.

IE baseline is column A - column B, or baseline is column B - column A

However, given:

10,844   -24,169

I presume this means the baseline is 34k?
:)

If so, it looks like you are promoting, hoisting, deleting, vectorizing etc
a bunch more instructions. About 1-5, most things being 1%, a few being 5%

SROA seems to do very slightly worse, but it's not clear if that's
because
the instructions got deleted before it did anything.


On Thu, Aug 25, 2016 at 9:56 AM, David Callahan <dcallahan at fb.com>
wrote:
> The second column is the difference between the new analysis and a
> baseline without the analysis.
>
> From: Daniel Berlin <dberlin at dberlin.org>
> Date: Thursday, August 25, 2016 at 9:55 AM
>
> To: David Callahan <dcallahan at fb.com>
> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev
Mailing
> list <llvm-dev at lists.llvm.org>
> Subject: Re: [llvm-dev] CFLAA
>
> (and sys::cas_flag that STATISTIC uses is a uint32 ...)
>
> On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at
dberlin.org>
> wrote:
>
>> Okay, dumb question:
>> Are you really getting negative numbers in the second column?
>>
>>     526,766      -136  mem2reg       # PHI nodes inserted
>>
>> http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8c
>> pp_source.html
>>
<https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_docs_doxygen_html_PromoteMemoryToRegister-5F8cpp-5Fsource.html&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=dU2DwQMtmgh6cAhA6xnKC8SHBBR9WVaODVF8fDARIjk&s=2ZH2FuAaE0z-bdSkGBtIDBeN1Qw-28jivYLXVy158sU&e=>
>> (Search for NumPHIInsert).
>>
>> I don't see how it could be negative unless this wrapped around?
>>
>>
>> On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at
fb.com> wrote:
>>
>>> I did gathered aggregate statistics reported by “-stats” over the
~400
>>> test files.
>>>
>>> The following table summarizes the impact.  The first column is the
>>>
>>> sum where the new analysis is enabled, the second column is the
>>>
>>> delta from baseline where no CFL alias analysis is performed. I am
not
>>>
>>> experienced enough to know which of these are “good” or “bad”
indicators
>>> .
>>>
>>> —david
>>>
>>>
>>>
>>>      72,250       685  SLP           # vector instructions
generated
>>>
>>>   1,256,401       566  adce          # instructions removed
>>>
>>>  67,020,774  13,835,126  basicaa       # times a GEP is decomposed
>>>
>>>      11,154        26  basicaa       # times the limit to decompose
GEPs
>>> is reached
>>>
>>>     153,613       324  bdce          # instructions removed
(unused)
>>>
>>>     198,495         2  bdce          # instructions trivialized
(dead
>>> bits)
>>>
>>>     298,621         0  cfl-od-aa     Maximum compressed graph
>>>
>>>  58,462,719         0  cfl-od-aa     Number Search Steps
>>>
>>>      48,401         0  cfl-od-aa     # NoAlias results absed on
address
>>> roots
>>>
>>>      61,936         0  cfl-od-aa     # NoAlias results on
compressed
>>> search path
>>>
>>>   3,768,131         0  cfl-od-aa     # NoAlias results on fast path
>>>
>>>  47,016,909         0  cfl-od-aa     # calls to query()
>>>
>>>  43,172,261         0  cfl-od-aa     # instructions analyzed
>>>
>>>  10,515,257         0  cfl-od-aa     # times there was no graph
node for
>>> a value
>>>
>>>   9,895,755         0  cfl-od-aa     Total size of compressed
graphs
>>> (edges)
>>>
>>>       2,797         2  correlated-value-propagation  # comparisons
>>> propagated
>>>
>>>      66,515      -126  correlated-value-propagation  # phis
propagated
>>>
>>>         912         3  correlated-value-propagation  # sdiv
converted to
>>> udiv
>>>
>>>      13,527       501  dse           # other instrs removed
>>>
>>>      40,973       416  dse           # stores deleted
>>>
>>>         126        -2  early-cse     # compare instructions
CVP'd
>>>
>>>   1,824,703      -138  early-cse     # instructions CSE'd
>>>
>>>   1,875,417        87  early-cse     # instructions simplified or
DCE'd
>>>
>>>      62,505         1  functionattrs  # arguments marked nocapture
>>>
>>>      29,979         1  functionattrs  # arguments marked readonly
>>>
>>>      42,648        37  globaldce     # functions removed
>>>
>>>      40,498        10  globaldce     # global variables removed
>>>
>>>       4,368        35  gvn           # blocks merged
>>>
>>>      21,961        26  gvn           # equalities propagated
>>>
>>>      29,434        45  gvn           # instructions PRE'd
>>>
>>>     631,597     3,307  gvn           # instructions deleted
>>>
>>>     217,618       494  gvn           # instructions simplified
>>>
>>>      51,089       634  gvn           # loads PRE'd
>>>
>>>     135,568     1,526  gvn           # loads deleted
>>>
>>>       2,197         4  indvars       # IV comparisons eliminated
>>>
>>>         826         8  indvars       # congruent IVs eliminated
>>>
>>>       2,538         4  indvars       # exit values replaced
>>>
>>>       1,856         1  indvars       # loop exit tests replaced
>>>
>>>   5,740,738         8  inline        # caller-callers analyzed
>>>
>>>   1,629,169         3  inline        # functions deleted because
all
>>> callers found
>>>
>>>   3,563,497         2  inline        # functions inlined
>>>
>>>  10,879,125        86  inline-cost   # call sites analyzed
>>>
>>>      34,766         5  instcombine   # constant folds
>>>
>>>   3,979,078     2,004  instcombine   # dead inst eliminated
>>>
>>>       6,323         2  instcombine   # dead stores eliminated
>>>
>>>       1,522         4  instcombine   # factorizations
>>>
>>>     254,146        66  instcombine   # instructions sunk
>>>
>>>  10,427,131     1,749  instcombine   # insts combined
>>>
>>>      57,943      -205  instcombine   # reassociations
>>>
>>>       1,072         1  instsimplify  # expansions
>>>
>>>     135,129         1  instsimplify  # reassociations
>>>
>>>     121,777       246  instsimplify  # redundant instructions
removed
>>>
>>>      27,612       -12  jump-threading  # branch blocks duplicated
to
>>> eliminate phi
>>>
>>>      76,000       197  jump-threading  # jumps threaded
>>>
>>>       4,991         8  jump-threading  # terminators folded
>>>
>>>     869,838     1,370  lcssa         # live out of a loop variables
>>>
>>>     345,329       433  licm          # instructions hoisted out of
loop
>>>
>>>         702       -27  licm          # instructions sunk out of
loop
>>>
>>>      19,520       192  licm          # load insts hoisted or sunk
>>>
>>>         202        37  licm          # memory locations promoted to
>>> registers
>>>
>>>     467,244       246  local         # unreachable basic blocks
removed
>>>
>>>       1,586        34  loop-delete   # loops deleted
>>>
>>>          84        27  loop-idiom    # memcpy's formed from
loop
>>> load+stores
>>>
>>>         752         7  loop-idiom    # memset's formed from
loop stores
>>>
>>>      63,364        -8  loop-rotate   # loops rotated
>>>
>>>       4,602         1  loop-simplify  # nested loops split out
>>>
>>>   1,244,741       472  loop-simplify  # pre-header or exit blocks
>>> inserted
>>>
>>>       2,847         2  loop-unroll   # loops completely unrolled
>>>
>>>       9,668       -29  loop-unroll   # loops unrolled (completely
or
>>> otherwise)
>>>
>>>       5,799       -35  loop-unroll   # loops unrolled with run-time
trip
>>> counts
>>>
>>>       3,863        25  loop-unswitch  # branches unswitched
>>>
>>>   1,054,060     1,482  loop-unswitch  Total number of instructions
>>> analyzed
>>>
>>>     109,279        -3  loop-vectorize  # loops analyzed for
vectorization
>>>
>>>     526,766      -136  mem2reg       # PHI nodes inserted
>>>
>>>   4,150,078        -3  mem2reg       # alloca's promoted with a
single
>>> store
>>>
>>>       4,567         6  memcpyopt     # memcpy instructions deleted
>>>
>>>          96         1  memcpyopt     # memcpys converted to memset
>>>
>>>       1,074       173  memcpyopt     # memmoves converted to memcpy
>>>
>>>      39,584         6  memcpyopt     # memsets inferred
>>>
>>>     179,629     2,475  memdep        # block queries that were
>>> completely cached
>>>
>>>       1,020        -3  memdep        # cached, but dirty, non-local
ptr
>>> responses
>>>
>>>   9,108,504   146,792  memdep        # fully cached non-local ptr
>>> responses
>>>
>>>  11,678,674    92,225  memdep        # uncached non-local ptr
responses
>>>
>>>     399,802     1,832  memory-builtins  # arguments with unsolved
size
>>> and offset
>>>
>>>      10,844   -24,169  memory-builtins  # load instructions with
>>> unsolved size and offset
>>>
>>>     188,181        54  reassociate   # insts reassociated
>>>
>>>      87,009       -82  scalar-evolution  # loops with predictable
loop
>>> counts
>>>
>>>     402,724        71  scalar-evolution  # loops without
predictable
>>> loop counts
>>>
>>>     133,310        72  sccp          # basic blocks unreachable
>>>
>>>     275,949       263  sccp          # instructions removed
>>>
>>>   2,056,414       723  simplifycfg   # blocks simplified
>>>
>>>       5,292       -36  simplifycfg   # common instructions sunk
down to
>>> the end block
>>>
>>>      15,110         1  simplifycfg   # speculative executed
instructions
>>>
>>>      43,068        -2  sroa          Maximum number of uses of a
>>> partition
>>>
>>>  11,754,901      -180  sroa          # alloca partition uses
rewritten
>>>
>>>   4,623,115       -11  sroa          # alloca partitions formed
>>>
>>>   5,927,727       -11  sroa          # allocas analyzed for
replacement
>>>
>>>   4,576,406        -5  sroa          # allocas promoted to SSA
values
>>>
>>>  13,770,636      -227  sroa          # instructions deleted
>>>
>>>       3,797        -1  strip-dead-prototypes  # dead prototypes
removed
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: Daniel Berlin <dberlin at dberlin.org>
>>> Date: Thursday, August 25, 2016 at 9:06 AM
>>> To: David Callahan <dcallahan at fb.com>
>>> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM
Dev Mailing
>>> list <llvm-dev at lists.llvm.org>
>>> Subject: Re: [llvm-dev] CFLAA
>>>
>>> Hey David,
>>> I'll take a look at the patch :)
>>> Sounds like fun work.
>>>
>>> As George says, improving AA significantly will almost always cause
>>> significant performance regressions at first, in almost any
compiler.
>>>
>>> Compilers knobs, passes,  usually get tuned for x amount of
freedom, and
>>> if you give them 10x, they start moving things too far, vectorizing
too
>>> much, spilling, etc.
>>>
>>> This was definitely the case for GCC, where adding a precise
>>> interprocedural field-sensitive analysis initially regressed
performance by
>>> a few percent on average.
>>>
>>> I know it was also the case for XLC at IBM, etc.
>>>
>>> Like anything else, just gotta figure out what passes are going
nuts,
>>> and rework them to have better heuristics/etc.
>>> The end result is performance improvements, but the path takes a
bit of
>>> time.
>>>
>>> If you need a way to see whether your analysis has actually done an
okay
>>> job in the meantime, usually a good way to see if you are doing
well or not
>>> is to see how many loads/stores get eliminated or moved by various
passes
>>> before and after.
>>>
>>> If the number is significantly higher, great.
>>> If the number is significantly lower, something has likely gone
wrong :)
>>>
>>>
>>> On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> (Adding “LLVM Dev”)
>>>>
>>>> My variant is up as https://reviews.llvm.org/D23876
>>>>
<https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=>
>>>> —david
>>>>
>>>>
>>>> From: George Burgess IV <george.burgess.iv at gmail.com>
>>>> Date: Wednesday, August 24, 2016 at 3:17 PM
>>>> To: David Callahan <dcallahan at fb.com>
>>>> Subject: Re: CFLAA
>>>>
>>>> Hi!
>>>>
>>>> > I see there is on going work with alias analysis and it
appears the
>>>> prior CFLAA has been abandoned.
>>>>
>>>> There was quite a bit of refactoring done, yeah. The original
CFLAA is
>>>> now called CFLSteens, and graph construction was moved to its
own bit. We
>>>> also have CFLAnders, which is based more heavily on the paper
by Zheng and
>>>> Rugina (e.g. no stratifiedsets magic).
>>>>
>>>> > I have a variant of it where I reworked how compression
was done to
>>>> be less conservative, reworked the interprocedural to do
simulated but
>>>> bounded inlining, and added code to do on-demand testing of CFL
paths on
>>>> both compressed and full graphs.
>>>>
>>>> Awesome!
>>>>
>>>> > Happy to share the patch with you if you are interested as
well as
>>>> some data collected
>>>>
>>>> Yes, please. Would you mind if I CC'ed llvm-dev on this
thread (and a
>>>> few people specifically, who also might find this interesting)?
>>>>
>>>> > However I was not able to see any performance improvements
in the
>>>> code. In fact on a various benchmarks there were noticeable
regressions in
>>>> measured performance of the generated code. Have you noticed
any similar
>>>> problems?
>>>>
>>>> I know that a number of people people in the community
expressed
>>>> concerns about how other passes will perform with better AA
results (e.g.
>>>> If LICM becomes more aggressive, register pressure may
increase, which may
>>>> cause us to spill when we haven't before, etc). So, such a
problem isn't
>>>> unthinkable. :)
>>>>
>>>> On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan
at fb.com>
>>>> wrote:
>>>>
>>>>> Hi Greg,
>>>>>
>>>>>
>>>>>
>>>>> I see there is on going work with alias analysis and it
appears the
>>>>> prior CFLAA has been abandoned.
>>>>>
>>>>>
>>>>>
>>>>> I have a variant of it where I reworked how compression was
done to be
>>>>> less conservative, reworked the interprocedural to do
simulated but bounded
>>>>> inlining, and added code to do on-demand testing of CFL
paths on both
>>>>> compressed and full graphs.
>>>>>
>>>>>
>>>>>
>>>>> I reached a point where the ahead-of-time compression was
linear but
>>>>> still very accurate compared to on-demand path search and
there were
>>>>> noticeable improvements in the alias analysis results and
impacted
>>>>> transformations.  Happy to share the patch with you if you
are interested
>>>>> as well as some data collected.
>>>>>
>>>>>
>>>>>
>>>>> However I was not able to see any performance improvements
in the
>>>>> code. In fact on a various benchmarks there were noticeable
regressions in
>>>>> measured performance of the generated code. Have you
noticed any similar
>>>>> problems?
>>>>>
>>>>>
>>>>>
>>>>> --david
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
<https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=>
>>>>
>>>>
>>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/fcfc8e9e/attachment-0001.html>

Daniel Berlin via llvm-dev

2016-Aug-25 20:45 UTC

head link

[llvm-dev] CFLAA

Note that this kind of "graph CSE" is the basis of  HVN/HU/HRU in
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwjEupa6t93OAhUE1mMKHbsCDU4QFggeMAE&url=https%3A%2F%2Fwww.cs.ucsb.edu%2F~benh%2Fresearch%2Fpapers%2Fhardekopf07exploiting.pdf&usg=AFQjCNGNzQ6vgsfxWRMW3a5aA4fGGLADOQ&sig2=3mC64txCKq5daxQqMK7UIA&bvm=bv.130731782,d.cGc

You could do the same analysis on this graph.



On Thu, Aug 25, 2016 at 10:17 AM, David Callahan <dcallahan at fb.com>
wrote:
> Here is a summary of the experiment behind this patch
> https://www.facebook.com/notes/david-callahan/llvm-cfl-alias-analysis/
> 10150648030284971
>
> From: Daniel Berlin <dberlin at dberlin.org>
> Date: Thursday, August 25, 2016 at 9:55 AM
>
> To: David Callahan <dcallahan at fb.com>
> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev
Mailing
> list <llvm-dev at lists.llvm.org>
> Subject: Re: [llvm-dev] CFLAA
>
> (and sys::cas_flag that STATISTIC uses is a uint32 ...)
>
> On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at
dberlin.org>
> wrote:
>
>> Okay, dumb question:
>> Are you really getting negative numbers in the second column?
>>
>>     526,766      -136  mem2reg       # PHI nodes inserted
>>
>> http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8c
>> pp_source.html
>>
<https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_docs_doxygen_html_PromoteMemoryToRegister-5F8cpp-5Fsource.html&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=dU2DwQMtmgh6cAhA6xnKC8SHBBR9WVaODVF8fDARIjk&s=2ZH2FuAaE0z-bdSkGBtIDBeN1Qw-28jivYLXVy158sU&e=>
>> (Search for NumPHIInsert).
>>
>> I don't see how it could be negative unless this wrapped around?
>>
>>
>> On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at
fb.com> wrote:
>>
>>> I did gathered aggregate statistics reported by “-stats” over the
~400
>>> test files.
>>>
>>> The following table summarizes the impact.  The first column is the
>>>
>>> sum where the new analysis is enabled, the second column is the
>>>
>>> delta from baseline where no CFL alias analysis is performed. I am
not
>>>
>>> experienced enough to know which of these are “good” or “bad”
indicators
>>> .
>>>
>>> —david
>>>
>>>
>>>
>>>      72,250       685  SLP           # vector instructions
generated
>>>
>>>   1,256,401       566  adce          # instructions removed
>>>
>>>  67,020,774  13,835,126  basicaa       # times a GEP is decomposed
>>>
>>>      11,154        26  basicaa       # times the limit to decompose
GEPs
>>> is reached
>>>
>>>     153,613       324  bdce          # instructions removed
(unused)
>>>
>>>     198,495         2  bdce          # instructions trivialized
(dead
>>> bits)
>>>
>>>     298,621         0  cfl-od-aa     Maximum compressed graph
>>>
>>>  58,462,719         0  cfl-od-aa     Number Search Steps
>>>
>>>      48,401         0  cfl-od-aa     # NoAlias results absed on
address
>>> roots
>>>
>>>      61,936         0  cfl-od-aa     # NoAlias results on
compressed
>>> search path
>>>
>>>   3,768,131         0  cfl-od-aa     # NoAlias results on fast path
>>>
>>>  47,016,909         0  cfl-od-aa     # calls to query()
>>>
>>>  43,172,261         0  cfl-od-aa     # instructions analyzed
>>>
>>>  10,515,257         0  cfl-od-aa     # times there was no graph
node for
>>> a value
>>>
>>>   9,895,755         0  cfl-od-aa     Total size of compressed
graphs
>>> (edges)
>>>
>>>       2,797         2  correlated-value-propagation  # comparisons
>>> propagated
>>>
>>>      66,515      -126  correlated-value-propagation  # phis
propagated
>>>
>>>         912         3  correlated-value-propagation  # sdiv
converted to
>>> udiv
>>>
>>>      13,527       501  dse           # other instrs removed
>>>
>>>      40,973       416  dse           # stores deleted
>>>
>>>         126        -2  early-cse     # compare instructions
CVP'd
>>>
>>>   1,824,703      -138  early-cse     # instructions CSE'd
>>>
>>>   1,875,417        87  early-cse     # instructions simplified or
DCE'd
>>>
>>>      62,505         1  functionattrs  # arguments marked nocapture
>>>
>>>      29,979         1  functionattrs  # arguments marked readonly
>>>
>>>      42,648        37  globaldce     # functions removed
>>>
>>>      40,498        10  globaldce     # global variables removed
>>>
>>>       4,368        35  gvn           # blocks merged
>>>
>>>      21,961        26  gvn           # equalities propagated
>>>
>>>      29,434        45  gvn           # instructions PRE'd
>>>
>>>     631,597     3,307  gvn           # instructions deleted
>>>
>>>     217,618       494  gvn           # instructions simplified
>>>
>>>      51,089       634  gvn           # loads PRE'd
>>>
>>>     135,568     1,526  gvn           # loads deleted
>>>
>>>       2,197         4  indvars       # IV comparisons eliminated
>>>
>>>         826         8  indvars       # congruent IVs eliminated
>>>
>>>       2,538         4  indvars       # exit values replaced
>>>
>>>       1,856         1  indvars       # loop exit tests replaced
>>>
>>>   5,740,738         8  inline        # caller-callers analyzed
>>>
>>>   1,629,169         3  inline        # functions deleted because
all
>>> callers found
>>>
>>>   3,563,497         2  inline        # functions inlined
>>>
>>>  10,879,125        86  inline-cost   # call sites analyzed
>>>
>>>      34,766         5  instcombine   # constant folds
>>>
>>>   3,979,078     2,004  instcombine   # dead inst eliminated
>>>
>>>       6,323         2  instcombine   # dead stores eliminated
>>>
>>>       1,522         4  instcombine   # factorizations
>>>
>>>     254,146        66  instcombine   # instructions sunk
>>>
>>>  10,427,131     1,749  instcombine   # insts combined
>>>
>>>      57,943      -205  instcombine   # reassociations
>>>
>>>       1,072         1  instsimplify  # expansions
>>>
>>>     135,129         1  instsimplify  # reassociations
>>>
>>>     121,777       246  instsimplify  # redundant instructions
removed
>>>
>>>      27,612       -12  jump-threading  # branch blocks duplicated
to
>>> eliminate phi
>>>
>>>      76,000       197  jump-threading  # jumps threaded
>>>
>>>       4,991         8  jump-threading  # terminators folded
>>>
>>>     869,838     1,370  lcssa         # live out of a loop variables
>>>
>>>     345,329       433  licm          # instructions hoisted out of
loop
>>>
>>>         702       -27  licm          # instructions sunk out of
loop
>>>
>>>      19,520       192  licm          # load insts hoisted or sunk
>>>
>>>         202        37  licm          # memory locations promoted to
>>> registers
>>>
>>>     467,244       246  local         # unreachable basic blocks
removed
>>>
>>>       1,586        34  loop-delete   # loops deleted
>>>
>>>          84        27  loop-idiom    # memcpy's formed from
loop
>>> load+stores
>>>
>>>         752         7  loop-idiom    # memset's formed from
loop stores
>>>
>>>      63,364        -8  loop-rotate   # loops rotated
>>>
>>>       4,602         1  loop-simplify  # nested loops split out
>>>
>>>   1,244,741       472  loop-simplify  # pre-header or exit blocks
>>> inserted
>>>
>>>       2,847         2  loop-unroll   # loops completely unrolled
>>>
>>>       9,668       -29  loop-unroll   # loops unrolled (completely
or
>>> otherwise)
>>>
>>>       5,799       -35  loop-unroll   # loops unrolled with run-time
trip
>>> counts
>>>
>>>       3,863        25  loop-unswitch  # branches unswitched
>>>
>>>   1,054,060     1,482  loop-unswitch  Total number of instructions
>>> analyzed
>>>
>>>     109,279        -3  loop-vectorize  # loops analyzed for
vectorization
>>>
>>>     526,766      -136  mem2reg       # PHI nodes inserted
>>>
>>>   4,150,078        -3  mem2reg       # alloca's promoted with a
single
>>> store
>>>
>>>       4,567         6  memcpyopt     # memcpy instructions deleted
>>>
>>>          96         1  memcpyopt     # memcpys converted to memset
>>>
>>>       1,074       173  memcpyopt     # memmoves converted to memcpy
>>>
>>>      39,584         6  memcpyopt     # memsets inferred
>>>
>>>     179,629     2,475  memdep        # block queries that were
>>> completely cached
>>>
>>>       1,020        -3  memdep        # cached, but dirty, non-local
ptr
>>> responses
>>>
>>>   9,108,504   146,792  memdep        # fully cached non-local ptr
>>> responses
>>>
>>>  11,678,674    92,225  memdep        # uncached non-local ptr
responses
>>>
>>>     399,802     1,832  memory-builtins  # arguments with unsolved
size
>>> and offset
>>>
>>>      10,844   -24,169  memory-builtins  # load instructions with
>>> unsolved size and offset
>>>
>>>     188,181        54  reassociate   # insts reassociated
>>>
>>>      87,009       -82  scalar-evolution  # loops with predictable
loop
>>> counts
>>>
>>>     402,724        71  scalar-evolution  # loops without
predictable
>>> loop counts
>>>
>>>     133,310        72  sccp          # basic blocks unreachable
>>>
>>>     275,949       263  sccp          # instructions removed
>>>
>>>   2,056,414       723  simplifycfg   # blocks simplified
>>>
>>>       5,292       -36  simplifycfg   # common instructions sunk
down to
>>> the end block
>>>
>>>      15,110         1  simplifycfg   # speculative executed
instructions
>>>
>>>      43,068        -2  sroa          Maximum number of uses of a
>>> partition
>>>
>>>  11,754,901      -180  sroa          # alloca partition uses
rewritten
>>>
>>>   4,623,115       -11  sroa          # alloca partitions formed
>>>
>>>   5,927,727       -11  sroa          # allocas analyzed for
replacement
>>>
>>>   4,576,406        -5  sroa          # allocas promoted to SSA
values
>>>
>>>  13,770,636      -227  sroa          # instructions deleted
>>>
>>>       3,797        -1  strip-dead-prototypes  # dead prototypes
removed
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: Daniel Berlin <dberlin at dberlin.org>
>>> Date: Thursday, August 25, 2016 at 9:06 AM
>>> To: David Callahan <dcallahan at fb.com>
>>> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM
Dev Mailing
>>> list <llvm-dev at lists.llvm.org>
>>> Subject: Re: [llvm-dev] CFLAA
>>>
>>> Hey David,
>>> I'll take a look at the patch :)
>>> Sounds like fun work.
>>>
>>> As George says, improving AA significantly will almost always cause
>>> significant performance regressions at first, in almost any
compiler.
>>>
>>> Compilers knobs, passes,  usually get tuned for x amount of
freedom, and
>>> if you give them 10x, they start moving things too far, vectorizing
too
>>> much, spilling, etc.
>>>
>>> This was definitely the case for GCC, where adding a precise
>>> interprocedural field-sensitive analysis initially regressed
performance by
>>> a few percent on average.
>>>
>>> I know it was also the case for XLC at IBM, etc.
>>>
>>> Like anything else, just gotta figure out what passes are going
nuts,
>>> and rework them to have better heuristics/etc.
>>> The end result is performance improvements, but the path takes a
bit of
>>> time.
>>>
>>> If you need a way to see whether your analysis has actually done an
okay
>>> job in the meantime, usually a good way to see if you are doing
well or not
>>> is to see how many loads/stores get eliminated or moved by various
passes
>>> before and after.
>>>
>>> If the number is significantly higher, great.
>>> If the number is significantly lower, something has likely gone
wrong :)
>>>
>>>
>>> On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> (Adding “LLVM Dev”)
>>>>
>>>> My variant is up as https://reviews.llvm.org/D23876
>>>>
<https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=>
>>>> —david
>>>>
>>>>
>>>> From: George Burgess IV <george.burgess.iv at gmail.com>
>>>> Date: Wednesday, August 24, 2016 at 3:17 PM
>>>> To: David Callahan <dcallahan at fb.com>
>>>> Subject: Re: CFLAA
>>>>
>>>> Hi!
>>>>
>>>> > I see there is on going work with alias analysis and it
appears the
>>>> prior CFLAA has been abandoned.
>>>>
>>>> There was quite a bit of refactoring done, yeah. The original
CFLAA is
>>>> now called CFLSteens, and graph construction was moved to its
own bit. We
>>>> also have CFLAnders, which is based more heavily on the paper
by Zheng and
>>>> Rugina (e.g. no stratifiedsets magic).
>>>>
>>>> > I have a variant of it where I reworked how compression
was done to
>>>> be less conservative, reworked the interprocedural to do
simulated but
>>>> bounded inlining, and added code to do on-demand testing of CFL
paths on
>>>> both compressed and full graphs.
>>>>
>>>> Awesome!
>>>>
>>>> > Happy to share the patch with you if you are interested as
well as
>>>> some data collected
>>>>
>>>> Yes, please. Would you mind if I CC'ed llvm-dev on this
thread (and a
>>>> few people specifically, who also might find this interesting)?
>>>>
>>>> > However I was not able to see any performance improvements
in the
>>>> code. In fact on a various benchmarks there were noticeable
regressions in
>>>> measured performance of the generated code. Have you noticed
any similar
>>>> problems?
>>>>
>>>> I know that a number of people people in the community
expressed
>>>> concerns about how other passes will perform with better AA
results (e.g.
>>>> If LICM becomes more aggressive, register pressure may
increase, which may
>>>> cause us to spill when we haven't before, etc). So, such a
problem isn't
>>>> unthinkable. :)
>>>>
>>>> On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan
at fb.com>
>>>> wrote:
>>>>
>>>>> Hi Greg,
>>>>>
>>>>>
>>>>>
>>>>> I see there is on going work with alias analysis and it
appears the
>>>>> prior CFLAA has been abandoned.
>>>>>
>>>>>
>>>>>
>>>>> I have a variant of it where I reworked how compression was
done to be
>>>>> less conservative, reworked the interprocedural to do
simulated but bounded
>>>>> inlining, and added code to do on-demand testing of CFL
paths on both
>>>>> compressed and full graphs.
>>>>>
>>>>>
>>>>>
>>>>> I reached a point where the ahead-of-time compression was
linear but
>>>>> still very accurate compared to on-demand path search and
there were
>>>>> noticeable improvements in the alias analysis results and
impacted
>>>>> transformations.  Happy to share the patch with you if you
are interested
>>>>> as well as some data collected.
>>>>>
>>>>>
>>>>>
>>>>> However I was not able to see any performance improvements
in the
>>>>> code. In fact on a various benchmarks there were noticeable
regressions in
>>>>> measured performance of the generated code. Have you
noticed any similar
>>>>> problems?
>>>>>
>>>>>
>>>>>
>>>>> --david
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
<https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=>
>>>>
>>>>
>>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/598d29a6/attachment.html>

llvm dev - Aug 2016 - CFLAA

[llvm-dev] CFLAA

[llvm-dev] CFLAA

[llvm-dev] CFLAA

[llvm-dev] CFLAA

[llvm-dev] CFLAA