thr3ads.net - llvm dev - [llvm-dev] Generalizing load/store promotion in LICM [Sep 2018]

If this information is useful, please help other people find it:
Share via:

Philip Reames via llvm-dev

2018-Sep-12 21:41 UTC

[llvm-dev] Generalizing load/store promotion in LICM

I'm thinking about making some semi radical changes to load store 
promotion works in LICM, and figured it would be a good idea to get buy 
in before I actually started writing code.  :)

TLDR: legality of sinking stores to exits is hard, can we separate load 
handling into a super aggressive form of PRE, and use predicated stores 
to avoid solving legality question?


Background

We've been seeing an interesting class of problems recently that looks 
roughly like this:

for (int = 0; i < N; i++)
   if (a[i] == 0) // some data dependent check
     g_count++; // conditional load and store to shared location

The critical detail in this example is that g_count is a global location 
which may be accessed concurrently* by multiple threads. The challenge 
here is that we don't know whether the store ever executes, and we're 
not allowed to insert a store along any path that didn't originally 
contain them.  Said differently, if all elements in "a" are non-zero, 
we're not allowed to store to g_count.  We do know that the g_count 
location is dereferenceable though.

(*Please, let's avoid the memory model derailment here.  I'm simplifying
and C++ language rules aren't real useful for my Java language frontend 
anyways.  In practice, all the access are atomic, but unordered, but 
we'll leave that out of discussion otherwise.)

I have two approaches I'm considering here.  These are orthogonal, but I 
suspect we'll want to implement both.


Proposal A - Use predicated stores in loop exits

The basic idea is that we don't worry about solving the legality 
question above, and just insert a store which is predicated on a 
condition which is true exactly when the original store ran.  In pseudo 
code, this looks something like:

bool StoreLegal = false;
int LocalCount = g_count;
for (int = 0; i < N; i++)
   if (a[i] == 0) {
     LocalCount++;
     StoreLegal = true;
   }
if (StoreLegal) g_count = LocalCount;

There are two obvious concerns here:

 1. The predicated store might be expensive in practice - true for most
    current architectures.
 2. We''re introducing an extra boolean phi cycle around the loop.

Talking to a couple of folks offline at the socials over the last few 
months, the former seems to be the main objection.  I think we can 
control this by restricting this transform to when the loop is believed 
to have a high trip count and the conditional path is taken with some 
frequency.   Getting feedback on this particular point is one of my main 
reasons for writing this mail.

The second objection can frequently be resolved by finding other 
expressions which evaluate to the same boolean.  (In this case, if 
LocalCount != LocalCountOrig assuming i doesn't overflow.)  We already 
have a framework with SCEV to do these replacements. Though, from some 
quick testing, it would definitely need strengthening.  However, SCEV 
can't remove the extra phi in all cases, so we have to be okay with the 
extra phi cycle in the general case.  This seems worthwhile to me, but 
thoughts?


Proposal B - Separate load and store handling into distinct transforms

(For folks I've discussed this with before, this part is all new.)

Thinking about the problem, it finally occurred to me that we can 
decompose the original example into two steps: getting the loads out of 
the loop, and sinking the stores out of the loop.  If we can accomplish 
the former, but not the later, we've still made meaningful progress.

So, what'd we'd essentially have is a load only transformation which 
produces this:
int LocalCount = g_count;
for (int = 0; i < N; i++)
   if (a[i] == 0) {
     LocalCount++;
     g_count = LocalCount;
   }

At this point, we've reduced memory traffic by half, and opened up the 
possibility that other parts of the optimizer can exploit the simplified 
form.  The last point is particular interesting since we generally try 
to canonicalize loads out of loops, and much of the optimizer is tuned 
for a form with as much as possible being loop invariant.  As one 
example, completely by accident, there's some work going on in the 
LoopVectorizer right now to handle such stores to loop invariant 
addresses during vectorization.  Putting the two pieces together would 
let us fully vectorize this loop without needing to sink the stores at all.

In practice, the existing implementation in LICM would cleanly split 
along these lines with little problem.

One way of looking at this load specific transform is as an extreme form 
of PRE (partial redundancy elimination).  Our current PRE implementation 
doesn't handle cases this complicated.

Thoughts?

Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180912/3600bb9c/attachment.html>

Chandler Carruth via llvm-dev

2018-Sep-13 08:51 UTC

head link

[llvm-dev] Generalizing load/store promotion in LICM

Haven't had time to dig into this, but wanted to add +Alina Sbirlea
<asbirlea at google.com> to the thread as she has been working on
promotion
and other aspects of LICM for a long time here.

On Wed, Sep 12, 2018 at 11:41 PM Philip Reames <listmail at
philipreames.com>
wrote:
> I'm thinking about making some semi radical changes to load store
> promotion works in LICM, and figured it would be a good idea to get buy in
> before I actually started writing code.  :)
>
> TLDR: legality of sinking stores to exits is hard, can we separate load
> handling into a super aggressive form of PRE, and use predicated stores to
> avoid solving legality question?
>
>
> Background
>
> We've been seeing an interesting class of problems recently that looks
> roughly like this:
>
> for (int = 0; i < N; i++)
>   if (a[i] == 0) // some data dependent check
>     g_count++; // conditional load and store to shared location
>
> The critical detail in this example is that g_count is a global location
> which may be accessed concurrently* by multiple threads.  The challenge
> here is that we don't know whether the store ever executes, and
we're not
> allowed to insert a store along any path that didn't originally contain
> them.  Said differently, if all elements in "a" are non-zero,
we're not
> allowed to store to g_count.  We do know that the g_count location is
> dereferenceable though.
>
> (*Please, let's avoid the memory model derailment here.  I'm
simplifying
> and C++ language rules aren't real useful for my Java language frontend
> anyways.  In practice, all the access are atomic, but unordered, but
we'll
> leave that out of discussion otherwise.)
>
> I have two approaches I'm considering here.  These are orthogonal, but
I
> suspect we'll want to implement both.
>
>
> Proposal A - Use predicated stores in loop exits
>
> The basic idea is that we don't worry about solving the legality
question
> above, and just insert a store which is predicated on a condition which is
> true exactly when the original store ran.  In pseudo code, this looks
> something like:
>
> bool StoreLegal = false;
> int LocalCount = g_count;
> for (int = 0; i < N; i++)
>   if (a[i] == 0) {
>     LocalCount++;
>     StoreLegal = true;
>   }
> if (StoreLegal) g_count = LocalCount;
>
> There are two obvious concerns here:
>
>    1. The predicated store might be expensive in practice - true for most
>    current architectures.
>    2. We''re introducing an extra boolean phi cycle around the
loop.
>
> Talking to a couple of folks offline at the socials over the last few
> months, the former seems to be the main objection.  I think we can control
> this by restricting this transform to when the loop is believed to have a
> high trip count and the conditional path is taken with some frequency.
> Getting feedback on this particular point is one of my main reasons for
> writing this mail.
>
> The second objection can frequently be resolved by finding other
> expressions which evaluate to the same boolean.  (In this case, if
> LocalCount != LocalCountOrig assuming i doesn't overflow.)  We already
have
> a framework with SCEV to do these replacements.  Though, from some quick
> testing, it would definitely need strengthening.  However, SCEV can't
> remove the extra phi in all cases, so we have to be okay with the extra phi
> cycle in the general case.  This seems worthwhile to me, but thoughts?
>
>
> Proposal B - Separate load and store handling into distinct transforms
>
> (For folks I've discussed this with before, this part is all new.)
>
> Thinking about the problem, it finally occurred to me that we can
> decompose the original example into two steps: getting the loads out of the
> loop, and sinking the stores out of the loop.  If we can accomplish the
> former, but not the later, we've still made meaningful progress.
>
> So, what'd we'd essentially have is a load only transformation
which
> produces this:
> int LocalCount = g_count;
> for (int = 0; i < N; i++)
>   if (a[i] == 0) {
>     LocalCount++;
>     g_count = LocalCount;
>   }
>
> At this point, we've reduced memory traffic by half, and opened up the
> possibility that other parts of the optimizer can exploit the simplified
> form.  The last point is particular interesting since we generally try to
> canonicalize loads out of loops, and much of the optimizer is tuned for a
> form with as much as possible being loop invariant.  As one example,
> completely by accident, there's some work going on in the
LoopVectorizer
> right now to handle such stores to loop invariant addresses during
> vectorization.  Putting the two pieces together would let us fully
> vectorize this loop without needing to sink the stores at all.
>
> In practice, the existing implementation in LICM would cleanly split along
> these lines with little problem.
>
> One way of looking at this load specific transform is as an extreme form
> of PRE (partial redundancy elimination).  Our current PRE implementation
> doesn't handle cases this complicated.
>
> Thoughts?
>
> Philip
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180913/111e4291/attachment.html>

Philip Reames via llvm-dev

2018-Sep-13 20:42 UTC

head link

[llvm-dev] Generalizing load/store promotion in LICM

(minor inline additions)

On 09/13/2018 01:51 AM, Chandler Carruth wrote:> Haven't had time to dig into this, but wanted to add +Alina Sbirlea 
> <mailto:asbirlea at google.com> to the thread as she has been working
on
> promotion and other aspects of LICM for a long time here.
Thanks!> On Wed, Sep 12, 2018 at 11:41 PM Philip Reames 
> <listmail at philipreames.com <mailto:listmail at
philipreames.com>> wrote:
>
>     I'm thinking about making some semi radical changes to load store
>     promotion works in LICM, and figured it would be a good idea to
>     get buy in before I actually started writing code.  :)
>
>     TLDR: legality of sinking stores to exits is hard, can we separate
>     load handling into a super aggressive form of PRE, and use
>     predicated stores to avoid solving legality question?
>
>
>     Background
>
>     We've been seeing an interesting class of problems recently that
>     looks roughly like this:
>
>     for (int = 0; i < N; i++)
>       if (a[i] == 0) // some data dependent check
>         g_count++; // conditional load and store to shared location
>
>     The critical detail in this example is that g_count is a global
>     location which may be accessed concurrently* by multiple threads. 
>     The challenge here is that we don't know whether the store ever
>     executes, and we're not allowed to insert a store along any path
>     that didn't originally contain them.  Said differently, if all
>     elements in "a" are non-zero, we're not allowed to store
to
>     g_count.  We do know that the g_count location is dereferenceable
>     though.
>
>     (*Please, let's avoid the memory model derailment here. I'm
>     simplifying and C++ language rules aren't real useful for my Java
>     language frontend anyways.  In practice, all the access are
>     atomic, but unordered, but we'll leave that out of discussion
>     otherwise.)
>
>     I have two approaches I'm considering here.  These are orthogonal,
>     but I suspect we'll want to implement both.
>
>
>     Proposal A - Use predicated stores in loop exits
>
>     The basic idea is that we don't worry about solving the legality
>     question above, and just insert a store which is predicated on a
>     condition which is true exactly when the original store ran.  In
>     pseudo code, this looks something like:
>
>     bool StoreLegal = false;
>     int LocalCount = g_count;
>     for (int = 0; i < N; i++)
>       if (a[i] == 0) {
>         LocalCount++;
>         StoreLegal = true;
>       }
>     if (StoreLegal) g_count = LocalCount;
>
>     There are two obvious concerns here:
>
>      1. The predicated store might be expensive in practice - true for
>         most current architectures.
>      2. We''re introducing an extra boolean phi cycle around the
loop.
>
>     Talking to a couple of folks offline at the socials over the last
>     few months, the former seems to be the main objection.  I think we
>     can control this by restricting this transform to when the loop is
>     believed to have a high trip count and the conditional path is
>     taken with some frequency.   Getting feedback on this particular
>     point is one of my main reasons for writing this mail.
>
>     The second objection can frequently be resolved by finding other
>     expressions which evaluate to the same boolean.  (In this case, if
>     LocalCount != LocalCountOrig assuming i doesn't overflow.)  We
>     already have a framework with SCEV to do these replacements. 
>     Though, from some quick testing, it would definitely need
>     strengthening. However, SCEV can't remove the extra phi in all
>     cases, so we have to be okay with the extra phi cycle in the
>     general case.  This seems worthwhile to me, but thoughts?
>
>
>     Proposal B - Separate load and store handling into distinct transforms
>
>     (For folks I've discussed this with before, this part is all new.)
>
>     Thinking about the problem, it finally occurred to me that we can
>     decompose the original example into two steps: getting the loads
>     out of the loop, and sinking the stores out of the loop.  If we
>     can accomplish the former, but not the later, we've still made
>     meaningful progress.
>
>     So, what'd we'd essentially have is a load only transformation
>     which produces this:
>     int LocalCount = g_count;
>     for (int = 0; i < N; i++)
>       if (a[i] == 0) {
>         LocalCount++;
>         g_count = LocalCount;
>       }
>
>     At this point, we've reduced memory traffic by half, and opened up
>     the possibility that other parts of the optimizer can exploit the
>     simplified form.  The last point is particular interesting since
>     we generally try to canonicalize loads out of loops, and much of
>     the optimizer is tuned for a form with as much as possible being
>     loop invariant.  As one example, completely by accident, there's
>     some work going on in the LoopVectorizer right now to handle such
>     stores to loop invariant addresses during vectorization.  Putting
>     the two pieces together would let us fully vectorize this loop
>     without needing to sink the stores at all.
>
>     In practice, the existing implementation in LICM would cleanly
>     split along these lines with little problem.
>
>     One way of looking at this load specific transform is as an
>     extreme form of PRE (partial redundancy elimination). Our current
>     PRE implementation doesn't handle cases this complicated.
>It occurred to my later that simply framing the new transform as a 
separate pass (LoopPRE) and using the same AST + SSA construction 
approach would be straight forward.  So, if folks think that having an 
aggressive form of load PRE in LICM is going a bit too far, it'd be easy 
to represent as an optional separate pass.  I'd still prefer having LICM 
contain the logic though.>
>     Thoughts?
>
>     Philip
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180913/c383000f/attachment.html>

Krzysztof Parzyszek via llvm-dev

2018-Sep-19 13:52 UTC

head link

[llvm-dev] Generalizing load/store promotion in LICM

On 9/12/2018 4:41 PM, Philip Reames via llvm-dev wrote:> 
> Proposal A - Use predicated stores in loop exits
> 
> The basic idea is that we don't worry about solving the legality 
> question above, and just insert a store which is predicated on a 
> condition which is true exactly when the original store ran.  In pseudo 
> code, this looks something like:
> 
> bool StoreLegal = false;
> int LocalCount = g_count;
> for (int = 0; i < N; i++)
>    if (a[i] == 0) {
>      LocalCount++;
>      StoreLegal = true;
>    }
> if (StoreLegal) g_count = LocalCount;
> 
> There are two obvious concerns here:
> 
>  1. The predicated store might be expensive in practice - true for most
>     current architectures.
>  2. We''re introducing an extra boolean phi cycle around the loop.
In many practical scenarios you can write to g_count unconditionally, so 
if that's the case, then the tracking of store's occurrence could be 
omitted. If N is large enough, the amortized cost of that store would be 
very small.
There should be some mechanism to convey this kind of information, since 
it could be difficult to determine it by simply examining the function.

> Proposal B - Separate load and store handling into distinct transforms
> 
> (For folks I've discussed this with before, this part is all new.)
> 
> Thinking about the problem, it finally occurred to me that we can 
> decompose the original example into two steps: getting the loads out of 
> the loop, and sinking the stores out of the loop.  If we can accomplish 
> the former, but not the later, we've still made meaningful progress.
> 
> So, what'd we'd essentially have is a load only transformation
which
> produces this:
> int LocalCount = g_count;
> for (int = 0; i < N; i++)
>    if (a[i] == 0) {
>      LocalCount++;
>      g_count = LocalCount;
>    }
The stores to g_count could be considered "redundant", since the value
in g_count is always the same as the value in LocalCount. In that sense, 
moving the store out of the loop would be a form of PRE. This could also 
introduce predication to the store if necessary.

-Krzysztof


-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation

Maybe Matching Threads

Search for more seemingly similar threads

llvm dev - Sep 2018 - Generalizing load/store promotion in LICM

[llvm-dev] Generalizing load/store promotion in LICM

[llvm-dev] Generalizing load/store promotion in LICM

[llvm-dev] Generalizing load/store promotion in LICM

[llvm-dev] Generalizing load/store promotion in LICM

Maybe Matching Threads