thr3ads.net - llvm dev - [llvm-dev] LLVM-IR store-load propagation [Jun 2020]

If this information is useful, please help other people find it:
Share via:

Matteo Favaro via llvm-dev

2020-Jun-19 00:28 UTC

[llvm-dev] LLVM-IR store-load propagation

Hello everyone,

This week I was looking into the following example (
https://godbolt.org/z/uhgQcq) where two constants are written to a local
array and an input argument, masked and shifted, is used to select between
them. The possible values for the CC variable are 0 and 1, so I'm expecting
that at the maximum level of optimizations the two constants are actually
propagated, resulting in the return value being a 'select' controlled by
CC
and returning either one or the other.

Although, I quickly realized that the implementation in the function
'src'
was not going to be optimized any further, resulting in the generation of
two 'store' instructions and one 'load' instruction, apparently
hindering
the constant propagation pass.

I then decided to explicitly access the local buffer with constant indexes
and see if LLVM would have been able to identify that CC could have been
either 0 or 1 (effectively avoiding the 'default' case of the switch and
therefore the '0xdeadc0de' constant). As a result the function
'tgt' is
optimized in the way I would expect it to be.

This also seemed to be a good exercise for Alive2, so I fed it with the
unoptimized 'src' and the optimized 'tgt' functions to prove
their
equivalence, obtaining the result 'Transformation seems to be correct'.
As
a counter-proof I tampered with the logic or modified the constants,
obtaining a valid proof of why the transformation wasn't correct
(effectively showing that the original 'src' and 'tgt' functions
may
actually be semantically equivalent).

To replicate the Alive2 result at https://alive2.llvm.org, the following
input can be used:

define i64 @_Z3srcm(i64 %Flags) {
entry:
  %Memory = alloca [2 x i64], align 16
  %and = lshr i64 %Flags, 6
  %shr = and i64 %and, 1
  %0 = bitcast [2 x i64]* %Memory to i8*
  %arrayidx = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
i64 0
  store i64 5369966919, i64* %arrayidx, align 16
  %arrayidx1 = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
i64 1
  store i64 5369966790, i64* %arrayidx1, align 8
  %arrayidx2 = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
i64 %shr
  %1 = load i64, i64* %arrayidx2, align 8
  ret i64 %1
}

define i64 @_Z3tgtm(i64 %Flags) {
entry:
  %0 = and i64 %Flags, 64
  %trunc = icmp eq i64 %0, 0
  %. = select i1 %trunc, i64 5369966919, i64 5369966790
  ret i64 %.
}

At this point I decided to replicate the 'tgt' function logic and coded
a
quick LLVM pass that:

   1. uses the known/unknown computed bits information to identify a
   non-volatile 'load' instruction that uses an index proved to have
only two
   possible values;
   2.  check if there's at least one 'store' to the accessed buffer
using
   one of the two indexes;
   3. converts the single 'load' instruction into two 'load'
instructions
   using the concrete indexes;
   4. generates a 'select' instruction that returns one of the two
loaded
   values, using as condition a check on the index.

The pass seems to be working fine, but I'm left wondering if LLVM is
purposefully avoiding such an optimization, and if so what is the reason to
do so (e.g. hard to prove that the optimization is actually going to
improve the quality of the code, the logic I'm using is completely off the
rails).

Assuming the logic it's correct and this could be seen as a new custom
optimization pass, what would be the suggested way to implement it in a
solid and generic fashion?

Thanks for any insight,
Matteo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200619/fd551583/attachment.html>

Johannes Doerfert via llvm-dev

2020-Jun-19 02:07 UTC

head link

[llvm-dev] LLVM-IR store-load propagation

Hi Matteo,


I think Roman (CC'ed) had a similar problem recently.

IIRC, the solution that was discussed was to enable memory promotion of 
the alloca into a vector.

Once that happen the existing instcombine logic might already create 
your select.


If the above turns out to be not sufficient for a more complex case, you 
might want to

consider building something on top of https://reviews.llvm.org/D80991, 
or maybe Shinji (CC'ed)

might even ;)


Cheers,

   Johannes


On 6/18/20 7:28 PM, Matteo Favaro via llvm-dev wrote:> Hello everyone,
>
> This week I was looking into the following example (
> https://godbolt.org/z/uhgQcq) where two constants are written to a local
> array and an input argument, masked and shifted, is used to select between
> them. The possible values for the CC variable are 0 and 1, so I'm
expecting
> that at the maximum level of optimizations the two constants are actually
> propagated, resulting in the return value being a 'select'
controlled by CC
> and returning either one or the other.
>
> Although, I quickly realized that the implementation in the function
'src'
> was not going to be optimized any further, resulting in the generation of
> two 'store' instructions and one 'load' instruction,
apparently hindering
> the constant propagation pass.
>
> I then decided to explicitly access the local buffer with constant indexes
> and see if LLVM would have been able to identify that CC could have been
> either 0 or 1 (effectively avoiding the 'default' case of the
switch and
> therefore the '0xdeadc0de' constant). As a result the function
'tgt' is
> optimized in the way I would expect it to be.
>
> This also seemed to be a good exercise for Alive2, so I fed it with the
> unoptimized 'src' and the optimized 'tgt' functions to
prove their
> equivalence, obtaining the result 'Transformation seems to be
correct'. As
> a counter-proof I tampered with the logic or modified the constants,
> obtaining a valid proof of why the transformation wasn't correct
> (effectively showing that the original 'src' and 'tgt'
functions may
> actually be semantically equivalent).
>
> To replicate the Alive2 result at https://alive2.llvm.org, the following
> input can be used:
>
> define i64 @_Z3srcm(i64 %Flags) {
> entry:
>    %Memory = alloca [2 x i64], align 16
>    %and = lshr i64 %Flags, 6
>    %shr = and i64 %and, 1
>    %0 = bitcast [2 x i64]* %Memory to i8*
>    %arrayidx = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
> i64 0
>    store i64 5369966919, i64* %arrayidx, align 16
>    %arrayidx1 = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64
0,
> i64 1
>    store i64 5369966790, i64* %arrayidx1, align 8
>    %arrayidx2 = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64
0,
> i64 %shr
>    %1 = load i64, i64* %arrayidx2, align 8
>    ret i64 %1
> }
>
> define i64 @_Z3tgtm(i64 %Flags) {
> entry:
>    %0 = and i64 %Flags, 64
>    %trunc = icmp eq i64 %0, 0
>    %. = select i1 %trunc, i64 5369966919, i64 5369966790
>    ret i64 %.
> }
>
> At this point I decided to replicate the 'tgt' function logic and
coded a
> quick LLVM pass that:
>
>     1. uses the known/unknown computed bits information to identify a
>     non-volatile 'load' instruction that uses an index proved to
have only two
>     possible values;
>     2.  check if there's at least one 'store' to the accessed
buffer using
>     one of the two indexes;
>     3. converts the single 'load' instruction into two
'load' instructions
>     using the concrete indexes;
>     4. generates a 'select' instruction that returns one of the two
loaded
>     values, using as condition a check on the index.
>
> The pass seems to be working fine, but I'm left wondering if LLVM is
> purposefully avoiding such an optimization, and if so what is the reason to
> do so (e.g. hard to prove that the optimization is actually going to
> improve the quality of the code, the logic I'm using is completely off
the
> rails).
>
> Assuming the logic it's correct and this could be seen as a new custom
> optimization pass, what would be the suggested way to implement it in a
> solid and generic fashion?
>
> Thanks for any insight,
> Matteo
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200618/9ace600e/attachment.html>

Roman Lebedev via llvm-dev

2020-Jun-19 08:32 UTC

head link

[llvm-dev] LLVM-IR store-load propagation

On Fri, Jun 19, 2020 at 5:08 AM Johannes Doerfert
<johannesdoerfert at gmail.com> wrote:>
> Hi Matteo,
>
>
> I think Roman (CC'ed) had a similar problem recently.Similar, yes.
> IIRC, the solution that was discussed was to enable memory promotion of the
alloca into a vector.I'm working on that :)
But i'm not sure if we will be okay doing that unconditionally,
is spilling going to be able to undo all that?
> Once that happen the existing instcombine logic might already create your
select.
>
> If the above turns out to be not sufficient for a more complex case, you
might want to
> consider building something on top of https://reviews.llvm.org/D80991, or
maybe Shinji (CC'ed)
> might even ;)
>
> Cheers,
>   JohannesRoman
> On 6/18/20 7:28 PM, Matteo Favaro via llvm-dev wrote:
>
> Hello everyone,
>
> This week I was looking into the following example (
> https://godbolt.org/z/uhgQcq) where two constants are written to a local
> array and an input argument, masked and shifted, is used to select between
> them. The possible values for the CC variable are 0 and 1, so I'm
expecting
> that at the maximum level of optimizations the two constants are actually
> propagated, resulting in the return value being a 'select'
controlled by CC
> and returning either one or the other.
>
> Although, I quickly realized that the implementation in the function
'src'
> was not going to be optimized any further, resulting in the generation of
> two 'store' instructions and one 'load' instruction,
apparently hindering
> the constant propagation pass.
>
> I then decided to explicitly access the local buffer with constant indexes
> and see if LLVM would have been able to identify that CC could have been
> either 0 or 1 (effectively avoiding the 'default' case of the
switch and
> therefore the '0xdeadc0de' constant). As a result the function
'tgt' is
> optimized in the way I would expect it to be.
>
> This also seemed to be a good exercise for Alive2, so I fed it with the
> unoptimized 'src' and the optimized 'tgt' functions to
prove their
> equivalence, obtaining the result 'Transformation seems to be
correct'. As
> a counter-proof I tampered with the logic or modified the constants,
> obtaining a valid proof of why the transformation wasn't correct
> (effectively showing that the original 'src' and 'tgt'
functions may
> actually be semantically equivalent).
>
> To replicate the Alive2 result at https://alive2.llvm.org, the following
> input can be used:
>
> define i64 @_Z3srcm(i64 %Flags) {
> entry:
>   %Memory = alloca [2 x i64], align 16
>   %and = lshr i64 %Flags, 6
>   %shr = and i64 %and, 1
>   %0 = bitcast [2 x i64]* %Memory to i8*
>   %arrayidx = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
> i64 0
>   store i64 5369966919, i64* %arrayidx, align 16
>   %arrayidx1 = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
> i64 1
>   store i64 5369966790, i64* %arrayidx1, align 8
>   %arrayidx2 = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
> i64 %shr
>   %1 = load i64, i64* %arrayidx2, align 8
>   ret i64 %1
> }
>
> define i64 @_Z3tgtm(i64 %Flags) {
> entry:
>   %0 = and i64 %Flags, 64
>   %trunc = icmp eq i64 %0, 0
>   %. = select i1 %trunc, i64 5369966919, i64 5369966790
>   ret i64 %.
> }
>
> At this point I decided to replicate the 'tgt' function logic and
coded a
> quick LLVM pass that:
>
>    1. uses the known/unknown computed bits information to identify a
>    non-volatile 'load' instruction that uses an index proved to
have only two
>    possible values;
>    2.  check if there's at least one 'store' to the accessed
buffer using
>    one of the two indexes;
>    3. converts the single 'load' instruction into two
'load' instructions
>    using the concrete indexes;
>    4. generates a 'select' instruction that returns one of the two
loaded
>    values, using as condition a check on the index.
>
> The pass seems to be working fine, but I'm left wondering if LLVM is
> purposefully avoiding such an optimization, and if so what is the reason to
> do so (e.g. hard to prove that the optimization is actually going to
> improve the quality of the code, the logic I'm using is completely off
the
> rails).
>
> Assuming the logic it's correct and this could be seen as a new custom
> optimization pass, what would be the suggested way to implement it in a
> solid and generic fashion?
>
> Thanks for any insight,
> Matteo
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Florian Hahn via llvm-dev

2020-Jun-19 09:16 UTC

head link

[llvm-dev] LLVM-IR store-load propagation

Hi,
> On Jun 19, 2020, at 01:28, Matteo Favaro via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Hello everyone,
> 
> This week I was looking into the following example
(https://godbolt.org/z/uhgQcq <https://godbolt.org/z/uhgQcq>) where two
constants are written to a local array and an input argument, masked and
shifted, is used to select between them. The possible values for the CC variable
are 0 and 1, so I'm expecting that at the maximum level of optimizations the
two constants are actually propagated, resulting in the return value being a
'select' controlled by CC and returning either one or the other.
> 
> Although, I quickly realized that the implementation in the function
'src' was not going to be optimized any further, resulting in the
generation of two 'store' instructions and one 'load'
instruction, apparently hindering the constant propagation pass.
> 
> I then decided to explicitly access the local buffer with constant indexes
and see if LLVM would have been able to identify that CC could have been either
0 or 1 (effectively avoiding the 'default' case of the switch and
therefore the '0xdeadc0de' constant). As a result the function
'tgt' is optimized in the way I would expect it to be.
> 
> This also seemed to be a good exercise for Alive2, so I fed it with the
unoptimized 'src' and the optimized 'tgt' functions to prove
their equivalence, obtaining the result 'Transformation seems to be
correct'. As a counter-proof I tampered with the logic or modified the
constants, obtaining a valid proof of why the transformation wasn't correct
(effectively showing that the original 'src' and 'tgt' functions
may actually be semantically equivalent).
> 
> To replicate the Alive2 result at https://alive2.llvm.org
<https://alive2.llvm.org/>, the following input can be used:
> 
> define i64 @_Z3srcm(i64 %Flags) {
> entry:
>   %Memory = alloca [2 x i64], align 16
>   %and = lshr i64 %Flags, 6
>   %shr = and i64 %and, 1
>   %0 = bitcast [2 x i64]* %Memory to i8*
>   %arrayidx = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
i64 0
>   store i64 5369966919, i64* %arrayidx, align 16
>   %arrayidx1 = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
i64 1
>   store i64 5369966790, i64* %arrayidx1, align 8
>   %arrayidx2 = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
i64 %shr
>   %1 = load i64, i64* %arrayidx2, align 8
>   ret i64 %1
> }
> 
> define i64 @_Z3tgtm(i64 %Flags) {
> entry:
>   %0 = and i64 %Flags, 64
>   %trunc = icmp eq i64 %0, 0
>   %. = select i1 %trunc, i64 5369966919, i64 5369966790
>   ret i64 %.
> }
> 
> At this point I decided to replicate the 'tgt' function logic and
coded a quick LLVM pass that:
> uses the known/unknown computed bits information to identify a non-volatile
'load' instruction that uses an index proved to have only two possible
values;
>  check if there's at least one 'store' to the accessed buffer
using one of the two indexes;
> converts the single 'load' instruction into two 'load'
instructions using the concrete indexes;
> generates a 'select' instruction that returns one of the two loaded
values, using as condition a check on the index.
> The pass seems to be working fine, but I'm left wondering if LLVM is
purposefully avoiding such an optimization, and if so what is the reason to do
so (e.g. hard to prove that the optimization is actually going to improve the
quality of the code, the logic I'm using is completely off the rails).
> 
This transform seems fine in principal and I don’t think there is fundamental
reason we do not perform this transform currently. The challenge is to do the
transform in a correct and scalable way. It would be interesting how often this
triggers in practice. Limiting it to indices with 2 possible values might not be
very effective in practice.
> Assuming the logic it's correct and this could be seen as a new custom
optimization pass, what would be the suggested way to implement it in a solid
and generic fashion?

I would recommend looking into extending DeadStoreElimination [1] to handle the
scenario. We already deal with merging stores to overlapping memory locations
there (see the code dealing with InstOverlapIntervals) and it might be possible
to extend that. Note that the legacy DSE mostly works in a single basic block,
but there’s also a version that use MemorySSA and works across multiple basic
blocks.

An alternative to implementing the transform directly might be transforming the
input into something that other passes already handle naturally instead.

For example, if you turn the single  load  '%1 = load i64, i64* %arrayidx2,
align 8’ into separate loads for each known index and replace all users with a
select on the new loads, DSE will already handle it properly [2]. And that
transform should require much less checks. Of course you’d have to make sure to
undo the transform, if the loads couldn’t be eliminated.

Cheers,
Florian

[1]
https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
<https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp>
[2] https://godbolt.org/z/U2NUht <https://godbolt.org/z/U2NUht>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200619/9e586f5a/attachment.html>

Matteo Favaro via llvm-dev

2020-Jun-19 17:45 UTC

head link

[llvm-dev] LLVM-IR store-load propagation

Thanks Florian,

I'm happy to hear that in principle the transformation is correct. I
omitted that the "pattern" is actually obtained after lifting some
obfuscated x86_64 assembly to LLVM-IR, so I don't expect this to be
commonly found in the code generated by a compiler. In my case limiting it
to 2 possible values was fine because I was aware of the trick being
implemented by that pair of 'store' and 'load' instructions, but
as you
pointed out it could be generalized to any amount of unknown bits in the
index used by the 'load'. Although I still think limiting this to just a
few free bits it's going to cover the majority of the practical cases.

I noticed that the explanation of the dummy pass I shared in the first
email wasn't really good, but I'm basically doing the same exact steps
you
are explaining in the example, in fact I can even confirm that the
amount of checks is really limited (assuming a previous analysis pass
provided the known/unknown bits for the index variable) and that the other
optimization passes properly take care of the propagation (as shown in the
original example).

In my case instead of undoing the transformation I'm preemptively checking
if there's at least a 'store' to one of the indexed memory slots, so
in the
worst case we end up with a propagated value and a single 'load'. The
scalability issue here is caused by the checks on the presence of the
previous 'store' instruction, especially because I don't know how to
do it
in case the values are more than 2. Specifically I don't know how to
quickly link the indexes used by the 'store' and 'load'
instruction
starting from the single knowledge of the index used by the 'load'
instruction.

Assuming there's no solid way to track the amount of 'store'
instructions
that would guarantee the optimization: how would it be possible to obtain
the undoing of the transformation without entering an infinite loop (of
transformation back and forth) or ending up with worse code?

Regarding the DeadStoreElimination pass I'll look into its implementation,
but my question was more related to knowing if there are best practices or
some tricks that could be used to implement it in a safe way (e.g. instead
of generating the indexes values from the unknown bits, using an available
pass that already computes them). By no means my code (
https://pastebin.com/fCVs8Gms) would be good enough to be integrated in an
official LLVM pass, hence why I referred to it as a 'custom' pass :)

Cheers,
Matteo

Il giorno ven 19 giu 2020 alle ore 11:16 Florian Hahn <
florian_hahn at apple.com> ha scritto:
> Hi,
>
> On Jun 19, 2020, at 01:28, Matteo Favaro via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Hello everyone,
>
> This week I was looking into the following example (
> https://godbolt.org/z/uhgQcq) where two constants are written to a local
> array and an input argument, masked and shifted, is used to select between
> them. The possible values for the CC variable are 0 and 1, so I'm
expecting
> that at the maximum level of optimizations the two constants are actually
> propagated, resulting in the return value being a 'select'
controlled by CC
> and returning either one or the other.
>
> Although, I quickly realized that the implementation in the function
'src'
> was not going to be optimized any further, resulting in the generation of
> two 'store' instructions and one 'load' instruction,
apparently hindering
> the constant propagation pass.
>
> I then decided to explicitly access the local buffer with constant indexes
> and see if LLVM would have been able to identify that CC could have been
> either 0 or 1 (effectively avoiding the 'default' case of the
switch and
> therefore the '0xdeadc0de' constant). As a result the function
'tgt' is
> optimized in the way I would expect it to be.
>
> This also seemed to be a good exercise for Alive2, so I fed it with the
> unoptimized 'src' and the optimized 'tgt' functions to
prove their
> equivalence, obtaining the result 'Transformation seems to be
correct'. As
> a counter-proof I tampered with the logic or modified the constants,
> obtaining a valid proof of why the transformation wasn't correct
> (effectively showing that the original 'src' and 'tgt'
functions may
> actually be semantically equivalent).
>
> To replicate the Alive2 result at https://alive2.llvm.org, the following
> input can be used:
>
> define i64 @_Z3srcm(i64 %Flags) {
> entry:
>   %Memory = alloca [2 x i64], align 16
>   %and = lshr i64 %Flags, 6
>   %shr = and i64 %and, 1
>   %0 = bitcast [2 x i64]* %Memory to i8*
>   %arrayidx = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64 0,
> i64 0
>   store i64 5369966919, i64* %arrayidx, align 16
>   %arrayidx1 = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64
> 0, i64 1
>   store i64 5369966790, i64* %arrayidx1, align 8
>   %arrayidx2 = getelementptr inbounds [2 x i64], [2 x i64]* %Memory, i64
> 0, i64 %shr
>   %1 = load i64, i64* %arrayidx2, align 8
>   ret i64 %1
> }
>
> define i64 @_Z3tgtm(i64 %Flags) {
> entry:
>   %0 = and i64 %Flags, 64
>   %trunc = icmp eq i64 %0, 0
>   %. = select i1 %trunc, i64 5369966919, i64 5369966790
>   ret i64 %.
> }
>
> At this point I decided to replicate the 'tgt' function logic and
coded a
> quick LLVM pass that:
>
>    1. uses the known/unknown computed bits information to identify a
>    non-volatile 'load' instruction that uses an index proved to
have only two
>    possible values;
>    2.  check if there's at least one 'store' to the accessed
buffer using
>    one of the two indexes;
>    3. converts the single 'load' instruction into two
'load' instructions
>    using the concrete indexes;
>    4. generates a 'select' instruction that returns one of the two
loaded
>    values, using as condition a check on the index.
>
> The pass seems to be working fine, but I'm left wondering if LLVM is
> purposefully avoiding such an optimization, and if so what is the reason to
> do so (e.g. hard to prove that the optimization is actually going to
> improve the quality of the code, the logic I'm using is completely off
the
> rails).
>
>
> This transform seems fine in principal and I don’t think there is
> fundamental reason we do not perform this transform currently. The
> challenge is to do the transform in a correct and scalable way. It would be
> interesting how often this triggers in practice. Limiting it to indices
> with 2 possible values might not be very effective in practice.
>
> Assuming the logic it's correct and this could be seen as a new custom
> optimization pass, what would be the suggested way to implement it in a
> solid and generic fashion?
>
>
> I would recommend looking into extending DeadStoreElimination [1] to
> handle the scenario. We already deal with merging stores to overlapping
> memory locations there (see the code dealing with InstOverlapIntervals)
> and it might be possible to extend that. Note that the legacy DSE mostly
> works in a single basic block, but there’s also a version that use
> MemorySSA and works across multiple basic blocks.
>
> An alternative to implementing the transform directly might be
> transforming the input into something that other passes already handle
> naturally instead.
>
> For example, if you turn the single  load  '%1 = load i64, i64*
> %arrayidx2, align 8’ into separate loads for each known index and replace
> all users with a select on the new loads, DSE will already handle it
> properly [2]. And that transform should require much less checks. Of course
> you’d have to make sure to undo the transform, if the loads couldn’t be
> eliminated.
>
> Cheers,
> Florian
>
> [1]
>
https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
> [2] https://godbolt.org/z/U2NUht
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200619/84bf9c22/attachment.html>

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - Jun 2020 - LLVM-IR store-load propagation

[llvm-dev] LLVM-IR store-load propagation

[llvm-dev] LLVM-IR store-load propagation

[llvm-dev] LLVM-IR store-load propagation

[llvm-dev] LLVM-IR store-load propagation

[llvm-dev] LLVM-IR store-load propagation

Apparently Analagous Threads