Philip Reames via llvm-dev
2018-Sep-14 23:25 UTC
[llvm-dev] Generalizing load/store promotion in LICM
This is going OT from the original thread, but, what the heck... Alina, can you explain the challenge with implementing promotion over MemorySSA? On the surface, it seems like it should be fairly straight forward to provide an alias set like abstraction over MemorySSA. What am I missing? Here's a sketch of how I'd think to approach this: Visit the MemoryPhi nodes in a loop. Every interesting promotion case (mod in loop) must be involved in at least one MemoryPhi cycle. For each MemoryPhi in the header, create a set of all MemoryDef instructions involved. (I'm saying this in terms of instructions, but sets of MemoryLocations should be equivalent.) For each instruction in loop, identify it's (optimized) memory def. If that def is outside loop, and we're looking at a load/store, then we can trivially hoist/sink (aliasing wise only). If the def is in the loop, it must be involved in one of our cycles, add it to related alias set. When I last looked, we were debating whether MemoryPhis were optimized by default. Did we end up deciding they weren't? If so, I can definitely see the problem there. Even then, constructing an AST for only the instructions involved in a loop MemoryPhi cycle should be pretty cheap. Separately, can you summarize what the overall status of MemorySSA is? I've lost track. It looks like it's enabled by default for EarlyCSE. So, does that mean we believe the bugs have been worked out and we "just" need to plumb it through the pipeline? Philip On 09/13/2018 02:00 PM, Alina Sbirlea wrote:> For context, I've been looking into replacing the use of AST > (AliasSetTracker) with MemorySSA in LICM. This works great for > sinking/hoisting but does not apply well for promotion, and one of the > solutions I considered is precisely ripping out the promotion part and > replacing its functionality with a separate PRE pass + possibly store > sinking. FWIW I think that's the right way to go. > I did not get deep enough into working on the solution but I would > gladly have a detailed discussion to move this forward. > > Reading into detail now. > > Thanks, > Alina > > On Thu, Sep 13, 2018 at 1:43 PM Philip Reames > <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote: > > (minor inline additions) > > On 09/13/2018 01:51 AM, Chandler Carruth wrote: >> Haven't had time to dig into this, but wanted to add +Alina >> Sbirlea <mailto:asbirlea at google.com> to the thread as she has >> been working on promotion and other aspects of LICM for a long >> time here. > Thanks! >> On Wed, Sep 12, 2018 at 11:41 PM Philip Reames >> <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote: >> >> I'm thinking about making some semi radical changes to load >> store promotion works in LICM, and figured it would be a good >> idea to get buy in before I actually started writing code. :) >> >> TLDR: legality of sinking stores to exits is hard, can we >> separate load handling into a super aggressive form of PRE, >> and use predicated stores to avoid solving legality question? >> >> >> Background >> >> We've been seeing an interesting class of problems recently >> that looks roughly like this: >> >> for (int = 0; i < N; i++) >> if (a[i] == 0) // some data dependent check >> g_count++; // conditional load and store to shared location >> >> The critical detail in this example is that g_count is a >> global location which may be accessed concurrently* by >> multiple threads. The challenge here is that we don't know >> whether the store ever executes, and we're not allowed to >> insert a store along any path that didn't originally contain >> them. Said differently, if all elements in "a" are non-zero, >> we're not allowed to store to g_count. We do know that the >> g_count location is dereferenceable though. >> >> (*Please, let's avoid the memory model derailment here. I'm >> simplifying and C++ language rules aren't real useful for my >> Java language frontend anyways. In practice, all the access >> are atomic, but unordered, but we'll leave that out of >> discussion otherwise.) >> >> I have two approaches I'm considering here. These are >> orthogonal, but I suspect we'll want to implement both. >> >> >> Proposal A - Use predicated stores in loop exits >> >> The basic idea is that we don't worry about solving the >> legality question above, and just insert a store which is >> predicated on a condition which is true exactly when the >> original store ran. In pseudo code, this looks something like: >> >> bool StoreLegal = false; >> int LocalCount = g_count; >> for (int = 0; i < N; i++) >> if (a[i] == 0) { >> LocalCount++; >> StoreLegal = true; >> } >> if (StoreLegal) g_count = LocalCount; >> >> There are two obvious concerns here: >> >> 1. The predicated store might be expensive in practice - >> true for most current architectures. >> 2. We''re introducing an extra boolean phi cycle around the >> loop. >> >> Talking to a couple of folks offline at the socials over the >> last few months, the former seems to be the main objection. >> I think we can control this by restricting this transform to >> when the loop is believed to have a high trip count and the >> conditional path is taken with some frequency. Getting >> feedback on this particular point is one of my main reasons >> for writing this mail. >> >> The second objection can frequently be resolved by finding >> other expressions which evaluate to the same boolean. (In >> this case, if LocalCount != LocalCountOrig assuming i doesn't >> overflow.) We already have a framework with SCEV to do these >> replacements. Though, from some quick testing, it would >> definitely need strengthening. However, SCEV can't remove >> the extra phi in all cases, so we have to be okay with the >> extra phi cycle in the general case. This seems worthwhile >> to me, but thoughts? >> >> >> Proposal B - Separate load and store handling into distinct >> transforms >> >> (For folks I've discussed this with before, this part is all >> new.) >> >> Thinking about the problem, it finally occurred to me that we >> can decompose the original example into two steps: getting >> the loads out of the loop, and sinking the stores out of the >> loop. If we can accomplish the former, but not the later, >> we've still made meaningful progress. >> >> So, what'd we'd essentially have is a load only >> transformation which produces this: >> int LocalCount = g_count; >> for (int = 0; i < N; i++) >> if (a[i] == 0) { >> LocalCount++; >> g_count = LocalCount; >> } >> >> At this point, we've reduced memory traffic by half, and >> opened up the possibility that other parts of the optimizer >> can exploit the simplified form. The last point is >> particular interesting since we generally try to canonicalize >> loads out of loops, and much of the optimizer is tuned for a >> form with as much as possible being loop invariant. As one >> example, completely by accident, there's some work going on >> in the LoopVectorizer right now to handle such stores to loop >> invariant addresses during vectorization. Putting the two >> pieces together would let us fully vectorize this loop >> without needing to sink the stores at all. >> >> In practice, the existing implementation in LICM would >> cleanly split along these lines with little problem. >> >> One way of looking at this load specific transform is as an >> extreme form of PRE (partial redundancy elimination). Our >> current PRE implementation doesn't handle cases this >> complicated. >> > It occurred to my later that simply framing the new transform as a > separate pass (LoopPRE) and using the same AST + SSA construction > approach would be straight forward. So, if folks think that having > an aggressive form of load PRE in LICM is going a bit too far, > it'd be easy to represent as an optional separate pass. I'd still > prefer having LICM contain the logic though. >> >> Thoughts? >> >> Philip >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180914/3616b193/attachment-0001.html>
Alina Sbirlea via llvm-dev
2018-Sep-18 21:50 UTC
[llvm-dev] Generalizing load/store promotion in LICM
On Fri, Sep 14, 2018 at 4:25 PM Philip Reames <listmail at philipreames.com> wrote:> This is going OT from the original thread, but, what the heck... >Sorry, not my intention, I was just giving another reason why getting promotion done in LICM differently would be helpful.> Alina, can you explain the challenge with implementing promotion over > MemorySSA? On the surface, it seems like it should be fairly straight > forward to provide an alias set like abstraction over MemorySSA. What am I > missing? >Yes, it is straight forward to have alias sets on top of MemorySSA, but it will likely lose most of MemorySSA's benefits.> Here's a sketch of how I'd think to approach this: > > Visit the MemoryPhi nodes in a loop. Every interesting promotion case > (mod in loop) must be involved in at least one MemoryPhi cycle. > > For each MemoryPhi in the header, create a set of all MemoryDef > instructions involved. (I'm saying this in terms of instructions, but sets > of MemoryLocations should be equivalent.) > > For each instruction in loop, identify it's (optimized) memory def. If > that def is outside loop, and we're looking at a load/store, then we can > trivially hoist/sink (aliasing wise only). If the def is in the loop, it > must be involved in one of our cycles, add it to related alias set. > > When I last looked, we were debating whether MemoryPhis were optimized by > default. Did we end up deciding they weren't? If so, I can definitely see > the problem there. Even then, constructing an AST for only the > instructions involved in a loop MemoryPhi cycle should be pretty cheap. >AFAIK there is no notion of MemoryPhis being optimized. Each block has a *single* MemoryPhi, which has the last Def from each incoming block. MemoryDefs and MemoryUses can be optimized through Phis. MemoryDefs are also not optimized from the start, only MemoryUses are. All MemoryDefs build a single def chain, regardless of whether they alias or not. Doing alias sets means checking alias() at least for all pairs of Defs, since alias relation is not transitive. That is, even if we optimize all MemoryDefs, we may still need additional alias() calls to build those sets. But the cheaper way here is not to optimize all Defs, but instead do alias/modref calls only for the Def we're processing currently to determine if *that* can be moved (+ cache those results in some new form of alias sets if a "leave this in the loop" answer is received). We may also need additional alias() checks for Uses, since, again, due to non-transitivity, a Use's clobbering access may be a Def with which it MayAlias (let's call it D1), but that D1 may be NoAlias with some D2, giving no info on the (Use, D2) alias relation. If NoAlias, D2 can be promoted, if MustAlias, the set (Use, D2) can be promoted, if MayAlias, we cannot promote anything. The AliasSetTracker has a cap after which it gives up and merges all sets, and so will an implementation of sets with MemorySSA, since the cost is the same: lots of alias/modref calls. Granted, MemorySSA has fewer, especially for Uses, but we will still need a cap for pathological cases, while getting benefits for small loops. Trying to get back to the original topic. The reason promotion is so difficult in the current form, is that it promotes a whole AliasSet. And that AliasSet is built based on conservative alias decisions. If we were to split the promotion problem into a PRE for loads, then sink for stores, the problem becomes more manageable as far as the number of alias calls; we'd process one memory-accessing instruction at a time, and, with the right processing order, MemorySSA should be a better fit.> Separately, can you summarize what the overall status of MemorySSA is? > I've lost track. It looks like it's enabled by default for EarlyCSE. So, > does that mean we believe the bugs have been worked out and we "just" need > to plumb it through the pipeline? >Sure! MemorySSA is fairly stable right now (i.e. no known bugs). I've been working on extending the APIs for updating it (e.g. D45299 <https://reviews.llvm.org/D45299>, D48396 <https://reviews.llvm.org/D48396>, D48897 <https://reviews.llvm.org/D48897>), and added the plumbing to update it in the Utils (D48790 <https://reviews.llvm.org/D48790>, D45300 <https://reviews.llvm.org/D45300>) and through the loop pass pipeline ( D50906 <https://reviews.llvm.org/D50906>, D50911 <https://reviews.llvm.org/D50911>, D45301 <https://reviews.llvm.org/D45301>, D47022 <https://reviews.llvm.org/D47022>, D51718 <https://reviews.llvm.org/D51718>). LICM is the loop pass I've picked back up now, and the first that also uses MemorySSA vs just updating it. Patches in Phabricator are fairly old (D40375 <https://reviews.llvm.org/D40375>, D35741 <https://reviews.llvm.org/D35741>). I'm going to upload the rebased and updated versions once they're ready for review. The actual use/update is currently disabled by default (flag to enable: EnableMSSALoopDependency). I also recently became aware of EarlyCSE using MemorySSA. There are no correctness issues with that but there may be missed optimizations in subsequent passes using MemorySSA (D51960 <https://reviews.llvm.org/D51960> for details and to avoid going even more off-topic) . Thanks, Alina> > Philip> > On 09/13/2018 02:00 PM, Alina Sbirlea wrote: > > For context, I've been looking into replacing the use of AST > (AliasSetTracker) with MemorySSA in LICM. This works great for > sinking/hoisting but does not apply well for promotion, and one of the > solutions I considered is precisely ripping out the promotion part and > replacing its functionality with a separate PRE pass + possibly store > sinking. FWIW I think that's the right way to go. > I did not get deep enough into working on the solution but I would gladly > have a detailed discussion to move this forward. > > Reading into detail now. > > Thanks, > Alina > > On Thu, Sep 13, 2018 at 1:43 PM Philip Reames <listmail at philipreames.com> > wrote: > >> (minor inline additions) >> >> On 09/13/2018 01:51 AM, Chandler Carruth wrote: >> >> Haven't had time to dig into this, but wanted to add +Alina Sbirlea >> <asbirlea at google.com> to the thread as she has been working on promotion >> and other aspects of LICM for a long time here. >> >> Thanks! >> >> On Wed, Sep 12, 2018 at 11:41 PM Philip Reames <listmail at philipreames.com> >> wrote: >> >>> I'm thinking about making some semi radical changes to load store >>> promotion works in LICM, and figured it would be a good idea to get buy in >>> before I actually started writing code. :) >>> >>> TLDR: legality of sinking stores to exits is hard, can we separate load >>> handling into a super aggressive form of PRE, and use predicated stores to >>> avoid solving legality question? >>> >>> >>> Background >>> >>> We've been seeing an interesting class of problems recently that looks >>> roughly like this: >>> >>> for (int = 0; i < N; i++) >>> if (a[i] == 0) // some data dependent check >>> g_count++; // conditional load and store to shared location >>> >>> The critical detail in this example is that g_count is a global location >>> which may be accessed concurrently* by multiple threads. The challenge >>> here is that we don't know whether the store ever executes, and we're not >>> allowed to insert a store along any path that didn't originally contain >>> them. Said differently, if all elements in "a" are non-zero, we're not >>> allowed to store to g_count. We do know that the g_count location is >>> dereferenceable though. >>> >>> (*Please, let's avoid the memory model derailment here. I'm simplifying >>> and C++ language rules aren't real useful for my Java language frontend >>> anyways. In practice, all the access are atomic, but unordered, but we'll >>> leave that out of discussion otherwise.) >>> >>> I have two approaches I'm considering here. These are orthogonal, but I >>> suspect we'll want to implement both. >>> >>> >>> Proposal A - Use predicated stores in loop exits >>> >>> The basic idea is that we don't worry about solving the legality >>> question above, and just insert a store which is predicated on a condition >>> which is true exactly when the original store ran. In pseudo code, this >>> looks something like: >>> >>> bool StoreLegal = false; >>> int LocalCount = g_count; >>> for (int = 0; i < N; i++) >>> if (a[i] == 0) { >>> LocalCount++; >>> StoreLegal = true; >>> } >>> if (StoreLegal) g_count = LocalCount; >>> >>> There are two obvious concerns here: >>> >>> 1. The predicated store might be expensive in practice - true for >>> most current architectures. >>> 2. We''re introducing an extra boolean phi cycle around the loop. >>> >>> Talking to a couple of folks offline at the socials over the last few >>> months, the former seems to be the main objection. I think we can control >>> this by restricting this transform to when the loop is believed to have a >>> high trip count and the conditional path is taken with some frequency. >>> Getting feedback on this particular point is one of my main reasons for >>> writing this mail. >>> >>> The second objection can frequently be resolved by finding other >>> expressions which evaluate to the same boolean. (In this case, if >>> LocalCount != LocalCountOrig assuming i doesn't overflow.) We already have >>> a framework with SCEV to do these replacements. Though, from some quick >>> testing, it would definitely need strengthening. However, SCEV can't >>> remove the extra phi in all cases, so we have to be okay with the extra phi >>> cycle in the general case. This seems worthwhile to me, but thoughts? >>> >>> >>> Proposal B - Separate load and store handling into distinct transforms >>> >>> (For folks I've discussed this with before, this part is all new.) >>> >>> Thinking about the problem, it finally occurred to me that we can >>> decompose the original example into two steps: getting the loads out of the >>> loop, and sinking the stores out of the loop. If we can accomplish the >>> former, but not the later, we've still made meaningful progress. >>> >>> So, what'd we'd essentially have is a load only transformation which >>> produces this: >>> int LocalCount = g_count; >>> for (int = 0; i < N; i++) >>> if (a[i] == 0) { >>> LocalCount++; >>> g_count = LocalCount; >>> } >>> >>> At this point, we've reduced memory traffic by half, and opened up the >>> possibility that other parts of the optimizer can exploit the simplified >>> form. The last point is particular interesting since we generally try to >>> canonicalize loads out of loops, and much of the optimizer is tuned for a >>> form with as much as possible being loop invariant. As one example, >>> completely by accident, there's some work going on in the LoopVectorizer >>> right now to handle such stores to loop invariant addresses during >>> vectorization. Putting the two pieces together would let us fully >>> vectorize this loop without needing to sink the stores at all. >>> >>> In practice, the existing implementation in LICM would cleanly split >>> along these lines with little problem. >>> >>> One way of looking at this load specific transform is as an extreme form >>> of PRE (partial redundancy elimination). Our current PRE implementation >>> doesn't handle cases this complicated. >>> >> It occurred to my later that simply framing the new transform as a >> separate pass (LoopPRE) and using the same AST + SSA construction approach >> would be straight forward. So, if folks think that having an aggressive >> form of load PRE in LICM is going a bit too far, it'd be easy to represent >> as an optional separate pass. I'd still prefer having LICM contain the >> logic though. >> >> Thoughts? >>> >>> Philip >>> >> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180918/15e2c023/attachment.html>
Philip Reames via llvm-dev
2018-Sep-19 17:06 UTC
[llvm-dev] Generalizing load/store promotion in LICM
On 09/18/2018 02:50 PM, Alina Sbirlea wrote:> > On Fri, Sep 14, 2018 at 4:25 PM Philip Reames > <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote: > > This is going OT from the original thread, but, what the heck... > > Sorry, not my intention, I was just giving another reason why getting > promotion done in LICM differently would be helpful. > > Alina, can you explain the challenge with implementing promotion > over MemorySSA? On the surface, it seems like it should be fairly > straight forward to provide an alias set like abstraction over > MemorySSA. What am I missing? > > Yes, it is straight forward to have alias sets on top of MemorySSA, > but it will likely lose most of MemorySSA's benefits.Ah, understood. I think this was the core source of our confusion. I was thinking "how to we migrate to using MemorySSA". You were thinking "how do we exploit memory ssa". :)> > Here's a sketch of how I'd think to approach this: > > Visit the MemoryPhi nodes in a loop. Every interesting promotion > case (mod in loop) must be involved in at least one MemoryPhi cycle. > > For each MemoryPhi in the header, create a set of all MemoryDef > instructions involved. (I'm saying this in terms of instructions, > but sets of MemoryLocations should be equivalent.) > > For each instruction in loop, identify it's (optimized) memory > def. If that def is outside loop, and we're looking at a > load/store, then we can trivially hoist/sink (aliasing wise > only). If the def is in the loop, it must be involved in one of > our cycles, add it to related alias set. > > When I last looked, we were debating whether MemoryPhis were > optimized by default. Did we end up deciding they weren't? If > so, I can definitely see the problem there. Even then, > constructing an AST for only the instructions involved in a loop > MemoryPhi cycle should be pretty cheap. > > AFAIK there is no notion of MemoryPhis being optimized. Each block has > a *single* MemoryPhi, which has the last Def from each incoming block. > MemoryDefs and MemoryUses can be optimized through Phis. MemoryDefs > are also not optimized from the start, only MemoryUses are. > > All MemoryDefs build a single def chain, regardless of whether they > alias or not. Doing alias sets means checking alias() at least for all > pairs of Defs, since alias relation is not transitive. That is, even > if we optimize all MemoryDefs, we may still need additional alias() > calls to build those sets. But the cheaper way here is not to optimize > all Defs, but instead do alias/modref calls only for the Def we're > processing currently to determine if *that* can be moved (+ cache > those results in some new form of alias sets if a "leave this in the > loop" answer is received). > > We may also need additional alias() checks for Uses, since, again, due > to non-transitivity, a Use's clobbering access may be a Def with which > it MayAlias (let's call it D1), but that D1 may be NoAlias with some > D2, giving no info on the (Use, D2) alias relation. If NoAlias, D2 can > be promoted, if MustAlias, the set (Use, D2) can be promoted, if > MayAlias, we cannot promote anything.Putting the above in my own words to confirm understanding: We can form alias sets by starting with all memory defs in the loop (which must be part of a single MemoryDef/MemoryPhi cycle). Applying existing AS::add to these instructions gives us the set of possibly modifying alias sets (but without ref info yet). We can then visit each MemoryUse and add it to the alias set of it's (optimized) MemoryDef if that def is in the loop, or introduce a new ref only AS if not.> > The AliasSetTracker has a cap after which it gives up and merges all > sets, and so will an implementation of sets with MemorySSA, since the > cost is the same: lots of alias/modref calls. Granted, MemorySSA has > fewer, especially for Uses, but we will still need a cap for > pathological cases, while getting benefits for small loops.Agreed.> > Trying to get back to the original topic. The reason promotion is so > difficult in the current form, is that it promotes a whole AliasSet. > And that AliasSet is built based on conservative alias decisions. If > we were to split the promotion problem into a PRE for loads, then sink > for stores, the problem becomes more manageable as far as the number > of alias calls; we'd process one memory-accessing instruction at a > time, and, with the right processing order, MemorySSA should be a > better fit.Er, huh? We've already hoisted/sunk all of the loads which would form AliasSets w/o Mod set. All that's left by the point we're doing promotion are alias sets with Mods and (optionally) Refs. Aren't we pretty much stuck with doing the precise aliasing to ensure MustAlias at that point? Or is your point that we can delay the heavy weight work until after hoist/sink has been done? If so, then yes, I agree. p.s. I added store hoisting to LICM recently which does require mustalias mod information during hoisting. This could be separate into a later transform easily though.> > > Separately, can you summarize what the overall status of MemorySSA > is? I've lost track. It looks like it's enabled by default for > EarlyCSE. So, does that mean we believe the bugs have been worked > out and we "just" need to plumb it through the pipeline? > > Sure! > > MemorySSA is fairly stable right now (i.e. no known bugs). I've been > working on extending the APIs for updating it (e.g. D45299 > <https://reviews.llvm.org/D45299>, D48396 > <https://reviews.llvm.org/D48396>, D48897 > <https://reviews.llvm.org/D48897>), and added the plumbing to update > it in the Utils (D48790 <https://reviews.llvm.org/D48790>, D45300 > <https://reviews.llvm.org/D45300>) and through the loop pass pipeline > (D50906 <https://reviews.llvm.org/D50906>, D50911 > <https://reviews.llvm.org/D50911>, D45301 > <https://reviews.llvm.org/D45301>, D47022 > <https://reviews.llvm.org/D47022>, D51718 > <https://reviews.llvm.org/D51718>). > LICM is the loop pass I've picked back up now, and the first that also > uses MemorySSA vs just updating it. Patches in Phabricator are fairly > old (D40375 <https://reviews.llvm.org/D40375>, D35741 > <https://reviews.llvm.org/D35741>). I'm going to upload the rebased > and updated versions once they're ready for review. > The actual use/update is currently disabled by default (flag to > enable: EnableMSSALoopDependency). > > I also recently became aware of EarlyCSE using MemorySSA. There are no > correctness issues with that but there may be missed optimizations in > subsequent passes using MemorySSA (D51960 > <https://reviews.llvm.org/D51960> for details and to avoid going even > more off-topic) . > > Thanks, > Alina > > > > > Philip > > > On 09/13/2018 02:00 PM, Alina Sbirlea wrote: >> For context, I've been looking into replacing the use of AST >> (AliasSetTracker) with MemorySSA in LICM. This works great for >> sinking/hoisting but does not apply well for promotion, and one >> of the solutions I considered is precisely ripping out the >> promotion part and replacing its functionality with a separate >> PRE pass + possibly store sinking. FWIW I think that's the right >> way to go. >> I did not get deep enough into working on the solution but I >> would gladly have a detailed discussion to move this forward. >> >> Reading into detail now. >> >> Thanks, >> Alina >> >> On Thu, Sep 13, 2018 at 1:43 PM Philip Reames >> <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote: >> >> (minor inline additions) >> >> On 09/13/2018 01:51 AM, Chandler Carruth wrote: >>> Haven't had time to dig into this, but wanted to add +Alina >>> Sbirlea <mailto:asbirlea at google.com> to the thread as she >>> has been working on promotion and other aspects of LICM for >>> a long time here. >> Thanks! >>> On Wed, Sep 12, 2018 at 11:41 PM Philip Reames >>> <listmail at philipreames.com >>> <mailto:listmail at philipreames.com>> wrote: >>> >>> I'm thinking about making some semi radical changes to >>> load store promotion works in LICM, and figured it would >>> be a good idea to get buy in before I actually started >>> writing code. :) >>> >>> TLDR: legality of sinking stores to exits is hard, can >>> we separate load handling into a super aggressive form >>> of PRE, and use predicated stores to avoid solving >>> legality question? >>> >>> >>> Background >>> >>> We've been seeing an interesting class of problems >>> recently that looks roughly like this: >>> >>> for (int = 0; i < N; i++) >>> if (a[i] == 0) // some data dependent check >>> g_count++; // conditional load and store to shared >>> location >>> >>> The critical detail in this example is that g_count is a >>> global location which may be accessed concurrently* by >>> multiple threads. The challenge here is that we don't >>> know whether the store ever executes, and we're not >>> allowed to insert a store along any path that didn't >>> originally contain them. Said differently, if all >>> elements in "a" are non-zero, we're not allowed to store >>> to g_count. We do know that the g_count location is >>> dereferenceable though. >>> >>> (*Please, let's avoid the memory model derailment here. >>> I'm simplifying and C++ language rules aren't real >>> useful for my Java language frontend anyways. In >>> practice, all the access are atomic, but unordered, but >>> we'll leave that out of discussion otherwise.) >>> >>> I have two approaches I'm considering here. These are >>> orthogonal, but I suspect we'll want to implement both. >>> >>> >>> Proposal A - Use predicated stores in loop exits >>> >>> The basic idea is that we don't worry about solving the >>> legality question above, and just insert a store which >>> is predicated on a condition which is true exactly when >>> the original store ran. In pseudo code, this looks >>> something like: >>> >>> bool StoreLegal = false; >>> int LocalCount = g_count; >>> for (int = 0; i < N; i++) >>> if (a[i] == 0) { >>> LocalCount++; >>> StoreLegal = true; >>> } >>> if (StoreLegal) g_count = LocalCount; >>> >>> There are two obvious concerns here: >>> >>> 1. The predicated store might be expensive in practice >>> - true for most current architectures. >>> 2. We''re introducing an extra boolean phi cycle around >>> the loop. >>> >>> Talking to a couple of folks offline at the socials over >>> the last few months, the former seems to be the main >>> objection. I think we can control this by restricting >>> this transform to when the loop is believed to have a >>> high trip count and the conditional path is taken with >>> some frequency. Getting feedback on this particular >>> point is one of my main reasons for writing this mail. >>> >>> The second objection can frequently be resolved by >>> finding other expressions which evaluate to the same >>> boolean. (In this case, if LocalCount != LocalCountOrig >>> assuming i doesn't overflow.) We already have a >>> framework with SCEV to do these replacements. Though, >>> from some quick testing, it would definitely need >>> strengthening. However, SCEV can't remove the extra phi >>> in all cases, so we have to be okay with the extra phi >>> cycle in the general case. This seems worthwhile to me, >>> but thoughts? >>> >>> >>> Proposal B - Separate load and store handling into >>> distinct transforms >>> >>> (For folks I've discussed this with before, this part is >>> all new.) >>> >>> Thinking about the problem, it finally occurred to me >>> that we can decompose the original example into two >>> steps: getting the loads out of the loop, and sinking >>> the stores out of the loop. If we can accomplish the >>> former, but not the later, we've still made meaningful >>> progress. >>> >>> So, what'd we'd essentially have is a load only >>> transformation which produces this: >>> int LocalCount = g_count; >>> for (int = 0; i < N; i++) >>> if (a[i] == 0) { >>> LocalCount++; >>> g_count = LocalCount; >>> } >>> >>> At this point, we've reduced memory traffic by half, and >>> opened up the possibility that other parts of the >>> optimizer can exploit the simplified form. The last >>> point is particular interesting since we generally try >>> to canonicalize loads out of loops, and much of the >>> optimizer is tuned for a form with as much as possible >>> being loop invariant. As one example, completely by >>> accident, there's some work going on in the >>> LoopVectorizer right now to handle such stores to loop >>> invariant addresses during vectorization. Putting the >>> two pieces together would let us fully vectorize this >>> loop without needing to sink the stores at all. >>> >>> In practice, the existing implementation in LICM would >>> cleanly split along these lines with little problem. >>> >>> One way of looking at this load specific transform is as >>> an extreme form of PRE (partial redundancy >>> elimination). Our current PRE implementation doesn't >>> handle cases this complicated. >>> >> It occurred to my later that simply framing the new transform >> as a separate pass (LoopPRE) and using the same AST + SSA >> construction approach would be straight forward. So, if >> folks think that having an aggressive form of load PRE in >> LICM is going a bit too far, it'd be easy to represent as an >> optional separate pass. I'd still prefer having LICM contain >> the logic though. >>> >>> Thoughts? >>> >>> Philip >>> >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180919/97d2b11c/attachment.html>