thr3ads.net - llvm dev - [llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE) [Aug 2020]

If this information is useful, please help other people find it:
Share via:

Florian Hahn via llvm-dev

2020-Aug-18 13:35 UTC

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

Hi,

Over the past six months, a MemorySSA-backed DSE implementation has been added
to LLVM and it now covers almost all cases the existing DSE implementation does,
plus adding a major new capability: eliminating stores across basic blocks.
Thanks everyone involved with reviews, testing & patches!

I think now would be a good time to start working towards switching to use
MemorySSA-backed DSE as default.

TLDR:
MemorySSA DSE can substantially increase the number of stores eliminated in many
cases (e.g. across MultiSource/SPEC2000/SPEC2006 with -O3 -flto, the number of
eliminated stores increases roughly by 45%, with some benchmarks seeing much
larger improvements). But the improvements come at additional compile-time cost
(CTmark geomean -O3 +0.86%, ReleaseThinLTO +1.59%), due to a substantially
increased search space. I’d like to propose to switch to MemorySSA-backed DSE,
as in my opinion the benefits in terms of store reductions outweigh the
compile-time impact).

Details:

First, lets take a look at the benefits:

1. More dead stores eliminated.

The current DSE implementation is mostly limited to eliminating stores in a
single BB. MemorySSA-backed DSE can eliminate stores across different basic
blocks. This results in large improvements in some benchmarks. Below some
numbers for MultiSource/SPEC2000/SPEC2006 on X86 with -O3 -flto:

Legacy MSSA
Number of dead stores removed across all benchmarks: 17919 26120
~ +45.76%
Total number of stores remaining after DSE: 1424860
1409300 ~ -1.1%

Some notable changes (excluding benchmarks with small absolute values of
NumFastStores) below. Note that roughly half of the benchmarks have binary
changes.

Same hash: 114 (filtered out)
Remaining: 123
Metric: dse.NumFastStores

Program legacy
mssa. diff
test-suite...-typeset/consumer-typeset.test 186.00 1815.00
875.8%
test-suite...lications/sqlite3/sqlite3.test 29.00 167.00
475.9%
test-suite...T2006/445.gobmk/445.gobmk.test 19.00 88.00
363.2%
test-suite.../Applications/SPASS/SPASS.test 49.00 155.00
216.3%
test-suite...lications/ClamAV/clamscan.test 72.00 227.00
215.3%
test-suite.../Benchmarks/nbench/nbench.test 30.00 92.00
206.7%
test-suite...T2000/256.bzip2/256.bzip2.test 31.00 85.00
174.2%
test-suite...chmarks/MallocBench/gs/gs.test 57.00 133.00
133.3%
test-suite...000/255.vortex/255.vortex.test 95.00 200.00
110.5%
test-suite...CFP2006/433.milc/433.milc.test 64.00 134.00
109.4%
test-suite...ications/JM/ldecod/ldecod.test 258.00 514.00
99.2%
test-suite.../Trimaran/enc-pc1/enc-pc1.test 87.00 168.00
93.1%
test-suite...ce/Benchmarks/PAQ8p/paq8p.test 35.00 66.00
88.6%
test-suite...INT2000/164.gzip/164.gzip.test 23.00 43.00
87.0%
test-suite...CFP2000/188.ammp/188.ammp.test 85.00 157.00 84.7%
test-suite...pplications/oggenc/oggenc.test 63.00 110.00
74.6%
test-suite...T2006/456.hmmer/456.hmmer.test 24.00 41.00
70.8%
test-suite.../Benchmarks/Bullet/bullet.test 799.00 1351.00
69.1%
test-suite.../CINT2000/176.gcc/176.gcc.test 412.00 668.00
62.1%
test-suite...nsumer-lame/consumer-lame.test 111.00 175.00
57.7%
test-suite...marks/7zip/7zip-benchmark.test 1069.00 1683.00
57.4%
test-suite...006/447.dealII/447.dealII.test 1715.00 2689.00
56.8%

There are a few existing missed optimization issues MemorySSA-backed DSE
addresses, e.g.
https://bugs.llvm.org/show_bug.cgi?id=46847
https://bugs.llvm.org/show_bug.cgi?id=40527

And some that should be relatively straight-forward to address with the new
implementation:
https://bugs.llvm.org/show_bug.cgi?id=16520
https://bugs.llvm.org/show_bug.cgi?id=10292

2. Major new users for MemorySSA means more testing.

MemorySSA-backed DSE makes extensive use of MemorySSA which increases the
test-coverage MemorySSA gets. It already surfaced one issue, that has since been
fixed. More users also likely means more eyes on tuning MemorySSA.

3. Eliminating some long-standing MemDepAnalysis related issue.

MemoryDependenceAnalysis used by the current DSE implementation makes heavy use
of caching and is relatively fragile with respect to cache invalidation. I think
there are at least a few long-standing known issues in that area, e.g.
https://bugs.llvm.org/show_bug.cgi?id=28014

4. Laying ground-work for further optimizations, like cross-bb memcpy opts

The current memcpy optimizers has the same BB-local limitation as the existing
DSE. MemorySSA-backed DSE should add most helpers required to port MemCpyOpt to
work across basic blocks. Ideally we would do both optimizations as part of the
same memorySSA walk, saving some compile-time.

Now, lets take a look at the costs:

1. Increase in compile-time

The current DSE implementation relies heavily on caching of info about memory
dependences, relying on the BB-only nature to keep compile-time in check. The
cross-bb nature of the MemorySSA-backed DSE implementation drastically increase
the search space, thus leading to an increase in compile-time. Here are some
numbers for CTMark. Note that the numbers are from a branch with additional
compile-time tuning.

Geomean changes (see link for full details)

Executed Instrs. Size (text)
O3 +0.86% -0.29%
ReleaseThinLTO +1.59% -0.38%
ReleaseLTO-g: +1.18% -0.35%
ReleaseThinLTO (link-only) +1.68% -0.38%
ReleaseLTO-g (link only): +1.36% -0.35%

http://llvm-compile-time-tracker.com/compare.php?from=4cc20aa74339812c3e4daa4138ed40cb98521bfc&to=504153dc6efd29ad37ffb6ffb2e80d4101b56e5b&stat=instructions

Note that both the current and MemorySSA-backed DSE implementations rely on
various thresholds to avoid excessive compile-times. There are a few knobs to
turn to further reduce compile-time of the MemorySSA-backed DSE implementation,
at the cost of missing to eliminate some stores.The current settings are chose
so the compile-time difference is limited without limiting optimizations too
much.

Another thing to note is that the current DSE implementation has seen
compile-time tuning over quite a few years now, while the amount of tuning done
for the MemorySSA-backed one is relatively small, so I am optimistic that we
will see improvements in compile-time in the future.

I took a look at some of the worst regressions ( tramp3d-v4, 7zip) and
MemorySSA-backed DSE takes roughly 3x more time than the current DSE, taking DSE
from the upper half of passes ordered by time to the top 5, still noticeably
quicker than heavy hitters like GVN, ISEL and roughly the same as the most
expensive instcombine run.

There are also cases where MemorySSA-backed DSE greatly out-performs legacy DSE:
https://bugs.llvm.org/show_bug.cgi?id=37588

Given the above, I think there are a few options

a) Switch to MemorySSA-backed DSE in all configurations
b) Switch to memorySSA-backed DSE, but use more aggressive limits. For some
configurations (e.g. limit optimizations for -O2/-Os)
c) Switch to MemorySSA-backed DSE in some configurations (e.g. -O3/-Oz only)
d) Do not switch.

I’d like to propose opting for a), because I think the additional optimizations
warrant the increase in compile-time and I am optimistic we see further tuning
once it gets used heavily and that it will enable additional optimizations.

I think b) would also be an improvement over the current status, but we’d have
to balance to limits so we still get substantially enough improvements over the
current DSE.

I don’t think c) is great, as we would have to maintain 2 versions of DSE, with
both of them seeing half the testing.

d) is also not great, as we are currently are failing to optimize a substantial
number of dead stores and are missing future optimizations made possible by the
implementation.

If there are no objections to switching, I’ll work on getting the remaining
compile-time related patches in ASAP and flip the switch. That should give us
enough time to address issues in time for the next release and fine tune
compile-time.

I’m looking forward to your thoughts!

Cheers,
Florian

Renato Golin via llvm-dev

2020-Aug-18 14:02 UTC

head link

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

On Tue, 18 Aug 2020 at 14:36, Florian Hahn via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Note that both the current and MemorySSA-backed DSE implementations rely
> on various thresholds to avoid excessive compile-times. There are a few
> knobs to turn to further reduce compile-time of the MemorySSA-backed DSE
> implementation, at the cost of missing to eliminate some stores.The current
> settings are chose so the compile-time difference is limited without
> limiting optimizations too much.
>
Did you compare the new algorithm with the knobs down enough to take
roughly the same compile time as the old one?

b) Switch to memorySSA-backed DSE, but use more aggressive limits. For
some> configurations (e.g. limit optimizations for -O2/-Os)
> c) Switch to MemorySSA-backed DSE in some configurations (e.g. -O3/-Oz
> only)

If we do switch, and I'm not advocating for that yet, I'd do a mix of
those.

Use slightly more aggressive limits at O2, those limits at O3 and perhaps
add an extra argument to push even further for people who really need it
and can pay the extra compile time.

--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200818/69578584/attachment.html>

Michael Kruse via llvm-dev

2020-Aug-18 15:59 UTC

head link

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

Thanks for all the work. The reductions in stores look promising. Do you
also have performance numbers how much this improves the execution time?
Did you observe any regressions where MSSA resulted in fewer removed stores?

Michael


Am Di., 18. Aug. 2020 um 08:36 Uhr schrieb Florian Hahn via llvm-dev <
llvm-dev at lists.llvm.org>:
> Hi,
>
> Over the past six months, a MemorySSA-backed DSE implementation has been
> added to LLVM and it now covers almost all cases the existing DSE
> implementation does, plus adding a major new capability: eliminating stores
> across basic blocks. Thanks everyone involved with reviews, testing &
> patches!
>
> I think now would be a good time to start working towards switching to use
> MemorySSA-backed DSE as default.
>
> TLDR:
> MemorySSA DSE can substantially increase the number of stores eliminated
> in many cases (e.g. across MultiSource/SPEC2000/SPEC2006 with -O3 -flto,
> the number of eliminated stores increases roughly by 45%, with some
> benchmarks seeing much larger improvements). But the improvements come at
> additional compile-time cost (CTmark geomean -O3 +0.86%, ReleaseThinLTO
> +1.59%), due to a substantially increased search space. I’d like to propose
> to switch to MemorySSA-backed DSE, as in my opinion the benefits in terms
> of store reductions outweigh the compile-time impact).
>
>
> Details:
>
> First, lets take a look at the benefits:
>
>
> 1. More dead stores eliminated.
>
> The current DSE implementation is mostly limited to eliminating stores in
> a single BB. MemorySSA-backed DSE can eliminate stores across different
> basic blocks. This results in large improvements in some benchmarks. Below
> some numbers for MultiSource/SPEC2000/SPEC2006 on X86 with -O3 -flto:
>
>
>                      Legacy    MSSA
> Number of dead stores removed across all benchmarks:       17919
>  26120        ~  +45.76%
> Total number of stores remaining after DSE:                       1424860
>   1409300    ~  -1.1%
>
> Some notable changes (excluding benchmarks with small absolute values of
> NumFastStores) below. Note that roughly half of the benchmarks have binary
> changes.
>
> Same hash: 114 (filtered out)
> Remaining: 123
> Metric: dse.NumFastStores
>
> Program
>  legacy     mssa.           diff
>  test-suite...-typeset/consumer-typeset.test          186.00     1815.00
>     875.8%
>  test-suite...lications/sqlite3/sqlite3.test                   29.00
> 167.00      475.9%
>  test-suite...T2006/445.gobmk/445.gobmk.test       19.00        88.00
> 363.2%
>  test-suite.../Applications/SPASS/SPASS.test         49.00      155.00
>   216.3%
>  test-suite...lications/ClamAV/clamscan.test            72.00      227.00
>     215.3%
>  test-suite.../Benchmarks/nbench/nbench.test        30.00        92.00
>   206.7%
>  test-suite...T2000/256.bzip2/256.bzip2.test           31.00        85.00
>     174.2%
>  test-suite...chmarks/MallocBench/gs/gs.test          57.00      133.00
>   133.3%
>  test-suite...000/255.vortex/255.vortex.test             95.00
> 200.00      110.5%
>  test-suite...CFP2006/433.milc/433.milc.test           64.00      134.00
>     109.4%
>  test-suite...ications/JM/ldecod/ldecod.test             258.00
>  514.00       99.2%
>  test-suite.../Trimaran/enc-pc1/enc-pc1.test             87.00     168.00
>      93.1%
>  test-suite...ce/Benchmarks/PAQ8p/paq8p.test       35.00        66.00
>  88.6%
>  test-suite...INT2000/164.gzip/164.gzip.test            23.00
> 43.00       87.0%
>  test-suite...CFP2000/188.ammp/188.ammp.test    85.00      157.00
>  84.7%
>  test-suite...pplications/oggenc/oggenc.test             63.00
> 110.00       74.6%
>  test-suite...T2006/456.hmmer/456.hmmer.test       24.00        41.00
>  70.8%
>  test-suite.../Benchmarks/Bullet/bullet.test             799.00
> 1351.00       69.1%
>  test-suite.../CINT2000/176.gcc/176.gcc.test         412.00      668.00
>    62.1%
>  test-suite...nsumer-lame/consumer-lame.test       111.00      175.00
>  57.7%
>  test-suite...marks/7zip/7zip-benchmark.test        1069.00    1683.00
>    57.4%
>  test-suite...006/447.dealII/447.dealII.test            1715.00
>  2689.00       56.8%
>
> There are a few existing missed optimization issues MemorySSA-backed DSE
> addresses, e.g.
> https://bugs.llvm.org/show_bug.cgi?id=46847
> https://bugs.llvm.org/show_bug.cgi?id=40527
>
> And some that should be relatively straight-forward to address with the
> new implementation:
> https://bugs.llvm.org/show_bug.cgi?id=16520
> https://bugs.llvm.org/show_bug.cgi?id=10292
>
>
> 2. Major new users for MemorySSA means more testing.
>
> MemorySSA-backed DSE makes extensive use of MemorySSA which increases the
> test-coverage MemorySSA gets. It already surfaced one issue, that has since
> been fixed. More users also likely means more eyes on tuning MemorySSA.
>
>
> 3. Eliminating some long-standing MemDepAnalysis related issue.
>
> MemoryDependenceAnalysis used by the current DSE implementation makes
> heavy use of caching and is relatively fragile with respect to cache
> invalidation. I think there are at least a few long-standing known issues
> in that area, e.g. https://bugs.llvm.org/show_bug.cgi?id=28014
>
>
> 4. Laying ground-work for further optimizations, like cross-bb memcpy
> opts
>
> The current memcpy optimizers has the same BB-local limitation as the
> existing DSE. MemorySSA-backed DSE should add most helpers required to port
> MemCpyOpt to work across basic blocks. Ideally  we would do both
> optimizations as part of the same memorySSA walk, saving some compile-time.
>
>
>
> Now, lets take a look at the costs:
>
> 1. Increase in compile-time
>
> The current DSE implementation relies heavily on caching of info about
> memory dependences, relying on the BB-only nature to keep compile-time in
> check. The cross-bb nature of the MemorySSA-backed DSE implementation
> drastically increase the search space, thus leading to an increase in
> compile-time. Here are some numbers for CTMark. Note that the numbers are
> from a branch with additional compile-time tuning.
>
> Geomean changes (see link for full details)
>
>                                       Executed Instrs.   Size (text)
> O3                                         +0.86%         -0.29%
> ReleaseThinLTO                   +1.59%         -0.38%
> ReleaseLTO-g:                      +1.18%         -0.35%
> ReleaseThinLTO (link-only)  +1.68%          -0.38%
> ReleaseLTO-g (link only):     +1.36%          -0.35%
>
>
>
http://llvm-compile-time-tracker.com/compare.php?from=4cc20aa74339812c3e4daa4138ed40cb98521bfc&to=504153dc6efd29ad37ffb6ffb2e80d4101b56e5b&stat=instructions
>
> Note that both the current and MemorySSA-backed DSE implementations rely
> on various thresholds to avoid excessive compile-times. There are a few
> knobs to turn to further reduce compile-time of the MemorySSA-backed DSE
> implementation, at the cost of missing to eliminate some stores.The current
> settings are chose so the compile-time difference is limited without
> limiting optimizations too much.
>
> Another thing to note is that the current DSE implementation has seen
> compile-time tuning over quite a few years now, while the amount of tuning
> done for the MemorySSA-backed one is relatively small, so I am optimistic
> that we will see improvements in compile-time in the future.
>
> I took a look at some of the worst regressions ( tramp3d-v4, 7zip) and
> MemorySSA-backed DSE takes roughly 3x more time than the current DSE,
> taking DSE from the upper half of passes ordered by time to the top 5,
> still noticeably quicker than heavy hitters like GVN, ISEL and roughly the
> same as the most expensive instcombine run.
>
> There are also cases where MemorySSA-backed DSE greatly out-performs
> legacy DSE: https://bugs.llvm.org/show_bug.cgi?id=37588
>
>
> Given the above, I think there are a few options
>
> a) Switch to MemorySSA-backed DSE in all configurations
> b) Switch to memorySSA-backed DSE, but use more aggressive limits. For
> some configurations (e.g. limit optimizations for -O2/-Os)
> c) Switch to MemorySSA-backed DSE in some configurations (e.g. -O3/-Oz
> only)
> d) Do not switch.
>
> I’d like to propose opting for a), because I think the additional
> optimizations warrant the increase in compile-time and I am optimistic we
> see further tuning once it gets used heavily and that it will enable
> additional optimizations.
>
> I think b) would also be an improvement over the current status, but we’d
> have to balance to limits so we still get substantially enough improvements
> over the current DSE.
>
> I don’t think c) is great, as we would have to maintain 2 versions of DSE,
> with both of them seeing half the testing.
>
> d) is also not great, as we are currently are failing to optimize a
> substantial number of dead stores and are missing future optimizations made
> possible by the implementation.
>
> If there are no objections to switching, I’ll work on getting the
> remaining compile-time related patches in ASAP and flip the switch. That
> should give us enough time to address issues in time for the next release
> and fine tune compile-time.
>
> I’m looking forward to your thoughts!
>
> Cheers,
> Florian
>
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200818/e0cfe925/attachment.html>

Florian Hahn via llvm-dev

2020-Aug-18 16:10 UTC

head link

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

> On Aug 18, 2020, at 15:02, Renato Golin <rengolin at gmail.com>
wrote:
> 
> On Tue, 18 Aug 2020 at 14:36, Florian Hahn via llvm-dev <llvm-dev at
lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
> Note that both the current and MemorySSA-backed DSE implementations rely on
various thresholds to avoid excessive compile-times. There are a few knobs to
turn to further reduce compile-time of the MemorySSA-backed DSE implementation,
at the cost of missing to eliminate some stores.The current settings are chose
so the compile-time difference is limited without limiting optimizations too
much.
> 
> Did you compare the new algorithm with the knobs down enough to take
roughly the same compile time as the old one?
Without spending too much time on that, the best I could do while not regressing
on the total number of stores removed was +0.30% (for CTMark) executed
instructions for -O3 with ~3% more stores removed across the large benchmark
set. (Executed instruction +0.28% for ReleaseLTO, +0.58%  for ReleaseThinLTO).

So MemorySSA DSE probably can get quite close to legacy DSE at legacy DSE’s own
game, but intuitively I think the MemorySSA approach is just slightly more
expensive in general,  because it does not just handle cases that can be cached
easily and allows for more flexibility.
> Use slightly more aggressive limits at O2, those limits at O3 and perhaps
add an extra argument to push even further for people who really need it and can
pay the extra compile time.
That is certainly an option, but I think ideally the settings do not get too
fragmented.

Cheers,
Florian
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200818/bd6b3dfb/attachment.html>

Philip Reames via llvm-dev

2020-Aug-18 18:15 UTC

head link

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

I would second the notion to switch to the new implementation. The < 2% 
compile time cost seems well justified by the results, and removing a 
major user of MDA - which has had *major* latent correctness issues - 
seems very worthwhile.  We should continue to tune compile time, but we 
should do that after making the switch.

Philip

On 8/18/20 6:35 AM, Florian Hahn via llvm-dev wrote:> Hi,
>
> Over the past six months, a MemorySSA-backed DSE implementation has been
added to LLVM and it now covers almost all cases the existing DSE implementation
does, plus adding a major new capability: eliminating stores across basic
blocks. Thanks everyone involved with reviews, testing & patches!
>
> I think now would be a good time to start working towards switching to use
MemorySSA-backed DSE as default.
>
> TLDR:
> MemorySSA DSE can substantially increase the number of stores eliminated in
many cases (e.g. across MultiSource/SPEC2000/SPEC2006 with -O3 -flto, the number
of eliminated stores increases roughly by 45%, with some benchmarks seeing much
larger improvements). But the improvements come at additional compile-time cost
(CTmark geomean -O3 +0.86%, ReleaseThinLTO +1.59%), due to a substantially
increased search space. I’d like to propose to switch to MemorySSA-backed DSE,
as in my opinion the benefits in terms of store reductions outweigh the
compile-time impact).
>
>
> Details:
>
> First, lets take a look at the benefits:
>
>
> 1. More dead stores eliminated.
>
> The current DSE implementation is mostly limited to eliminating stores in a
single BB. MemorySSA-backed DSE can eliminate stores across different basic
blocks. This results in large improvements in some benchmarks. Below some
numbers for MultiSource/SPEC2000/SPEC2006 on X86 with -O3 -flto:
>
>                                                                            
Legacy    MSSA
> Number of dead stores removed across all benchmarks:       17919     26120 
~  +45.76%
> Total number of stores remaining after DSE:                       1424860  
1409300    ~  -1.1%
>
> Some notable changes (excluding benchmarks with small absolute values of
NumFastStores) below. Note that roughly half of the benchmarks have binary
changes.
>
> Same hash: 114 (filtered out)
> Remaining: 123
> Metric: dse.NumFastStores
>
> Program                                                              
legacy     mssa.           diff
>   test-suite...-typeset/consumer-typeset.test          186.00     1815.00  
875.8%
>   test-suite...lications/sqlite3/sqlite3.test                   29.00     
167.00      475.9%
>   test-suite...T2006/445.gobmk/445.gobmk.test       19.00        88.00     
363.2%
>   test-suite.../Applications/SPASS/SPASS.test         49.00      155.00    
216.3%
>   test-suite...lications/ClamAV/clamscan.test            72.00      227.00 
215.3%
>   test-suite.../Benchmarks/nbench/nbench.test        30.00        92.00    
206.7%
>   test-suite...T2000/256.bzip2/256.bzip2.test           31.00        85.00 
174.2%
>   test-suite...chmarks/MallocBench/gs/gs.test          57.00      133.00   
133.3%
>   test-suite...000/255.vortex/255.vortex.test             95.00      200.00
110.5%
>   test-suite...CFP2006/433.milc/433.milc.test           64.00      134.00  
109.4%
>   test-suite...ications/JM/ldecod/ldecod.test             258.00     514.00
99.2%
>   test-suite.../Trimaran/enc-pc1/enc-pc1.test             87.00     168.00 
93.1%
>   test-suite...ce/Benchmarks/PAQ8p/paq8p.test       35.00        66.00     
88.6%
>   test-suite...INT2000/164.gzip/164.gzip.test            23.00        43.00
87.0%
>   test-suite...CFP2000/188.ammp/188.ammp.test    85.00      157.00      
84.7%
>   test-suite...pplications/oggenc/oggenc.test             63.00      110.00
74.6%
>   test-suite...T2006/456.hmmer/456.hmmer.test       24.00        41.00     
70.8%
>   test-suite.../Benchmarks/Bullet/bullet.test             799.00    1351.00
69.1%
>   test-suite.../CINT2000/176.gcc/176.gcc.test         412.00      668.00   
62.1%
>   test-suite...nsumer-lame/consumer-lame.test       111.00      175.00     
57.7%
>   test-suite...marks/7zip/7zip-benchmark.test        1069.00    1683.00    
57.4%
>   test-suite...006/447.dealII/447.dealII.test            1715.00    
2689.00       56.8%
>   
> There are a few existing missed optimization issues MemorySSA-backed DSE
addresses, e.g.
> https://bugs.llvm.org/show_bug.cgi?id=46847
> https://bugs.llvm.org/show_bug.cgi?id=40527
>
> And some that should be relatively straight-forward to address with the new
implementation:
> https://bugs.llvm.org/show_bug.cgi?id=16520
> https://bugs.llvm.org/show_bug.cgi?id=10292
>
>
> 2. Major new users for MemorySSA means more testing.
>
> MemorySSA-backed DSE makes extensive use of MemorySSA which increases the
test-coverage MemorySSA gets. It already surfaced one issue, that has since been
fixed. More users also likely means more eyes on tuning MemorySSA.
>
>
> 3. Eliminating some long-standing MemDepAnalysis related issue.
>
> MemoryDependenceAnalysis used by the current DSE implementation makes heavy
use of caching and is relatively fragile with respect to cache invalidation. I
think there are at least a few long-standing known issues in that area, e.g.
https://bugs.llvm.org/show_bug.cgi?id=28014
>
>
> 4. Laying ground-work for further optimizations, like cross-bb memcpy opts	
>
> The current memcpy optimizers has the same BB-local limitation as the
existing DSE. MemorySSA-backed DSE should add most helpers required to port
MemCpyOpt to work across basic blocks. Ideally  we would do both optimizations
as part of the same memorySSA walk, saving some compile-time.
>
>
>
> Now, lets take a look at the costs:
>
> 1. Increase in compile-time
>
> The current DSE implementation relies heavily on caching of info about
memory dependences, relying on the BB-only nature to keep compile-time in check.
The cross-bb nature of the MemorySSA-backed DSE implementation drastically
increase the search space, thus leading to an increase in compile-time. Here are
some numbers for CTMark. Note that the numbers are from a branch with additional
compile-time tuning.
>
> Geomean changes (see link for full details)
>
>                                        Executed Instrs.   Size (text)
> O3                                         +0.86%         -0.29%
> ReleaseThinLTO                   +1.59%         -0.38%
> ReleaseLTO-g:                      +1.18%         -0.35%
> ReleaseThinLTO (link-only)  +1.68%          -0.38%
> ReleaseLTO-g (link only):     +1.36%          -0.35%
>
>
http://llvm-compile-time-tracker.com/compare.php?from=4cc20aa74339812c3e4daa4138ed40cb98521bfc&to=504153dc6efd29ad37ffb6ffb2e80d4101b56e5b&stat=instructions
>
> Note that both the current and MemorySSA-backed DSE implementations rely on
various thresholds to avoid excessive compile-times. There are a few knobs to
turn to further reduce compile-time of the MemorySSA-backed DSE implementation,
at the cost of missing to eliminate some stores.The current settings are chose
so the compile-time difference is limited without limiting optimizations too
much.
>
> Another thing to note is that the current DSE implementation has seen
compile-time tuning over quite a few years now, while the amount of tuning done
for the MemorySSA-backed one is relatively small, so I am optimistic that we
will see improvements in compile-time in the future.
>
> I took a look at some of the worst regressions ( tramp3d-v4, 7zip) and
MemorySSA-backed DSE takes roughly 3x more time than the current DSE, taking DSE
from the upper half of passes ordered by time to the top 5, still noticeably
quicker than heavy hitters like GVN, ISEL and roughly the same as the most
expensive instcombine run.
>
> There are also cases where MemorySSA-backed DSE greatly out-performs legacy
DSE: https://bugs.llvm.org/show_bug.cgi?id=37588
>
>
> Given the above, I think there are a few options
>
> a) Switch to MemorySSA-backed DSE in all configurations
> b) Switch to memorySSA-backed DSE, but use more aggressive limits. For some
configurations (e.g. limit optimizations for -O2/-Os)
> c) Switch to MemorySSA-backed DSE in some configurations (e.g. -O3/-Oz
only)
> d) Do not switch.
>
> I’d like to propose opting for a), because I think the additional
optimizations warrant the increase in compile-time and I am optimistic we see
further tuning once it gets used heavily and that it will enable additional
optimizations.
>
> I think b) would also be an improvement over the current status, but we’d
have to balance to limits so we still get substantially enough improvements over
the current DSE.
>
> I don’t think c) is great, as we would have to maintain 2 versions of DSE,
with both of them seeing half the testing.
>
> d) is also not great, as we are currently are failing to optimize a
substantial number of dead stores and are missing future optimizations made
possible by the implementation.
>
> If there are no objections to switching, I’ll work on getting the remaining
compile-time related patches in ASAP and flip the switch. That should give us
enough time to address issues in time for the next release and fine tune
compile-time.
>
> I’m looking forward to your thoughts!
>
> Cheers,
> Florian
>
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Joerg Sonnenberger via llvm-dev

2020-Aug-18 19:56 UTC

head link

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

On Tue, Aug 18, 2020 at 02:35:56PM +0100, Florian Hahn via llvm-dev
wrote:> MemorySSA DSE can substantially increase the number of stores eliminated
> in many cases (e.g. across MultiSource/SPEC2000/SPEC2006 with -O3 -flto,
> the number of eliminated stores increases roughly by 45%, with some
> benchmarks seeing much larger improvements). But the improvements come
> at additional compile-time cost (CTmark geomean -O3 +0.86%, ReleaseThinLTO
> +1.59%), due to a substantially increased search space. I’d like to
> propose to switch to MemorySSA-backed DSE, as in my opinion the benefits
> in terms of store reductions outweigh the compile-time impact).
Do you also have any numbers on the impact on (peak) RSS this has?

Joerg

Florian Hahn via llvm-dev

2020-Aug-18 21:14 UTC

head link

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

> On Aug 18, 2020, at 16:59, Michael Kruse <llvmdev at meinersbur.de>
wrote:
> 
> Thanks for all the work. The reductions in stores look promising. Do you
also have performance numbers how much this improves the execution time? Did you
observe any regressions where MSSA resulted in fewer removed stores?
I did not gather numbers for execution time yet, but I’ll try to share some
tomorrow.

At the current state, for MultiSource/SPEC2000/SPEC2006, there are the following
regressions

 test-suite...6/471.omnetpp/471.omnetpp.test   321.00      316.00      -1.6%
 test-suite...arks/mafft/pairlocalalign.test    64.00       61.00      -4.7%
 test-suite...ks/Prolangs-C++/city/city.test    23.00       21.00      -8.7%
 test-suite...oxyApps-C/miniGMG/miniGMG.test    69.00       60.00      -13.0%

I suspect those are caused by the few cases that the MemorySSA version does not
yet support. Those will need some more investigating, but I think ideally we
would not block the switch on them, so we can switch early and address the
issues that pop up early.

Cheers,
Florian

Florian Hahn via llvm-dev

2020-Aug-18 21:14 UTC

head link

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

> On Aug 18, 2020, at 20:56, Joerg Sonnenberger via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> On Tue, Aug 18, 2020 at 02:35:56PM +0100, Florian Hahn via llvm-dev wrote:
>> MemorySSA DSE can substantially increase the number of stores
eliminated
>> in many cases (e.g. across MultiSource/SPEC2000/SPEC2006 with -O3
-flto,
>> the number of eliminated stores increases roughly by 45%, with some
>> benchmarks seeing much larger improvements). But the improvements come
>> at additional compile-time cost (CTmark geomean -O3 +0.86%,
ReleaseThinLTO
>> +1.59%), due to a substantially increased search space. I’d like to
>> propose to switch to MemorySSA-backed DSE, as in my opinion the
benefits
>> in terms of store reductions outweigh the compile-time impact).
> 
> Do you also have any numbers on the impact on (peak) RSS this has?
> 
Max RSS looks mostly neutral for CTMark.

-O3 -0.01%
ReleaseThinLTO +0.10%
ReleaseLTO-g +0.0%
O0-g: +0.0%
ReleaseThinLTO (link only): -0.23%
ReleaseLTO-g (link only): -0.09%

http://llvm-compile-time-tracker.com/compare.php?from=4cc20aa74339812c3e4daa4138ed40cb98521bfc&to=504153dc6efd29ad37ffb6ffb2e80d4101b56e5b&stat=max-rss

There appear to be some +- 2% swings for ReleaseThinLTO which might be worth
taking a closer look, but overall it looks not too bad I think.

Cheers,
Florian

llvm dev - Aug 2020 - [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)