Hi James,
I also think that refactoring the code generation part is a great idea. That
code is very complicated and difficult to maintain. I’ve wanted to rewrite that
code for a long time, but have never gotten to it. There are quite a few edge
cases to handle (at least in the current code). I’ll take a deeper look at your
patch. The abstractions that you mention, Stage and Block, are good ones. I
think you’ll need to account for the cycle, or position within a Stage, as well.
It is a complex problem with a lot of different edge cases when dealing with
Phis, though I think they can be dealt with much better than in the existing code.
Generating code for the 3 main parts, the prolog, kernel, and epilog, can be
challenging because each has a unique difference that makes it hard to generalize
the code. For example, the prologs will not have any Phis, but the code
generation needs to be aware of when the Phis occur in the sequence in order to
get the correct remapped name. In the kernel, values may come from a previous
iteration of the kernel. In the epilog, values can come from the same epilog
block, the kernel, or a prolog block.
It would be great to be able to generate the compact representation that you
describe, so that a prolog/epilog do not need to be generated. Hexagon has a
special hardware loop instruction, spNloop, specifically for this, although the
compiler does not generate that instruction currently. Perhaps it would be
easier to generate this form first, and then convert it to a prolog(s), kernel,
epilog(s) sequence.
There are a lot of subtle rules when dealing with the Phi instructions depending
on when they are scheduled relative to uses. There are two different factors to
account for – the stage and cycle when the Phi is scheduled. For example, a use
of a Phi can be scheduled at an earlier cycle, but later stage. Several
different combinations can occur, and each use of the Phi can have a different
stage/cycle. Maintaining the map between new and old virtual registers depends
on the stage, cycle, and the current block (prolog, kernel, or epilog).
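To make that bookkeeping concrete, here is a minimal sketch of such a map. This is not the actual VRMap code; the type names and the simple stage-difference lookup rule are assumptions for illustration (the real code also factors in the cycle and the prolog/kernel/epilog block):

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical simplification of a VRMap-style structure: map an
// (original virtual register, schedule stage) pair to the renamed copy
// generated for that stage.
using Reg = unsigned;

struct StageRegMap {
  std::map<std::pair<Reg, int>, Reg> Map; // (OrigReg, Stage) -> NewReg

  void record(Reg Orig, int Stage, Reg New) { Map[{Orig, Stage}] = New; }

  // A use scheduled in stage UseStage of a value defined in DefStage
  // reads the copy generated (UseStage - DefStage) stages earlier.
  // Falls back to the original name when no copy was recorded.
  Reg lookup(Reg Orig, int DefStage, int UseStage) const {
    int StageDiff = UseStage - DefStage;
    auto It = Map.find({Orig, StageDiff});
    return It == Map.end() ? Orig : It->second;
  }
};
```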
The epilogs are generated in reverse order due to the way the instructions need
to be laid out. The first epilog block after the kernel is going to have
instructions scheduled in the last stage only, then the next epilog block, if
needed, will have instructions from LastStage-1 first, then the LastStage.
Let’s say we have instructions x, y, z, that are scheduled in different
iterations/stage, x in stage0, y in stage1, and z in stage2, and x generates a
value used by y, which generates a value used by z. The generated code will look
like:
Prolog 0 (this block ends with a conditional jump to Epilog 0):
X
Prolog 1 (this block ends with a conditional jump to Epilog 1):
Y
X
Kernel:
Z, Y, X
Epilog 0:
Z
Epilog 1:
Y
Z
The instructions in the prolog and epilog are generated in original program order
within each stage, but in the last epilog stage, the instruction Y generates a
value used by the subsequent Z.
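The layout above can be sanity-checked with a small standalone simulation. This is plain C++, not LLVM code; expand() hard-codes the 3-stage x/y/z example and isValid() checks that every loop iteration sees X, then Y, then Z exactly once:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of the prolog/kernel/epilog expansion for the 3-stage
// x/y/z example. Each trace entry is an op name plus the original loop
// iteration it belongs to.
struct Op { char Name; int Iter; };

std::vector<Op> expand(int N) {        // N = trip count, N >= 3
  std::vector<Op> T;
  T.push_back({'X', 0});               // Prolog 0
  T.push_back({'Y', 0});               // Prolog 1
  T.push_back({'X', 1});
  for (int i = 0; i <= N - 3; ++i) {   // Kernel, executed N-2 times
    T.push_back({'Z', i});             // oldest in-flight iteration
    T.push_back({'Y', i + 1});
    T.push_back({'X', i + 2});         // newest in-flight iteration
  }
  T.push_back({'Z', N - 2});           // Epilog 0
  T.push_back({'Y', N - 1});           // Epilog 1
  T.push_back({'Z', N - 1});
  return T;
}

// Every iteration must pass through X, then Y, then Z, exactly once.
bool isValid(const std::vector<Op> &T, int N) {
  for (int i = 0; i < N; ++i) {
    std::string Seen;
    for (const Op &O : T)
      if (O.Iter == i)
        Seen += O.Name;
    if (Seen != "XYZ")
      return false;
  }
  return true;
}
```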
The reduceLoopCount function is needed for an architecture that has a hardware
loop instruction like Hexagon’s LOOP0. For each prolog generated, the loop
counter in the instruction needs to be decremented. This could have been done
only in the last prolog block, but it’s done every time a prolog block is
generated instead (initially, I thought that would be easier since the prolog
generation code wouldn’t need to know which block was the last one). But
later, I added code to handle the case when the loop can be removed completely.
It seems reasonable to change this so that it’s updated only once, especially
since finding the loop instruction can require searching through the CFG.
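As a sketch of the arithmetic involved (a hypothetical helper, not the actual TII hook): with S stages, S-1 iterations are peeled into prologs, so decrementing the hardware loop count once per generated prolog ends up equivalent to a single adjustment at the end:

```cpp
#include <cassert>

// Hypothetical arithmetic behind reduceLoopCount: with S stages the
// expansion peels S-1 iterations into prolog blocks, so a hardware
// loop instruction (e.g. Hexagon's LOOP0) must run the kernel
// TC - (S - 1) times. Decrementing once per prolog reaches the same
// value as one adjustment made only for the last prolog block.
int kernelTripCount(int TC, int Stages) {
  return TC - (Stages - 1);
}
```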
Loop-carried Phis are hard. Initially, I tried handling the cases for adding new
Phis and existing Phis with one function. But it turned out that adding new
Phis, when an instruction use is generated in an earlier cycle but later stage,
doesn’t cause as many special cases as handling the different edge cases for
existing Phis. There are fewer combinations when adding new Phis.
Yes, you’re correct about the use of “iteration” vs. “stage” (I even used them
together earlier). The use of the term stage comes from the paper, but I often
use iteration when talking to others about software pipelining, especially if
they have read the paper. I agree that we should be more consistent with using
the terms.
Let me think some more about the questions you’ve asked. I do think it’s
worthwhile to spend some time on creating a good abstraction and improving the
code. I appreciate that you’ve started to think/work on this problem, so I’d
like to help out to pursue this direction.
Thanks,
Brendon
From: James Molloy <james at jamesmolloy.co.uk>
Sent: Monday, July 15, 2019 11:05 AM
To: Jinsong Ji <jji at us.ibm.com>
Cc: Brendon Cahoon <bcahoon at quicinc.com>; Hal Finkel <hfinkel at
anl.gov>; LLVM Dev <llvm-dev at lists.llvm.org>
Subject: [EXT] Re: MachinePipeliner refactoring
Hi Jinsong,
Thanks for testing out the prototype! I'm not surprised there are errors in
that version; it's not 100% ready yet (mainly a demonstrator for the path
I'd like to take). I'm really trying to work out if it makes sense to
try and incrementally get from the current state of the world to that
prototype, or if it's just so far away that a full-on refactor (or two
code paths) is required. I suspect only Brendon really has the context to give
insight there!
James
On Mon, 15 Jul 2019 at 16:32, Jinsong Ji <jji at us.ibm.com> wrote:
Hi James:
Personally, I like the idea of refactoring and more abstraction,
but unfortunately, I don't know enough about the edge cases either.
BTW: the prototype is still causing quite some assertions in PowerPC - some
nodes are not generated in correct order.
Best,
Jinsong Ji (纪金松), PhD.
XL/LLVM on Power Compiler Development
E-mail: jji at us.ibm.com
From: James Molloy <james at jamesmolloy.co.uk>
To: LLVM Dev <llvm-dev at lists.llvm.org>, jji at us.ibm.com,
bcahoon at quicinc.com, Hal Finkel <hfinkel at anl.gov>
Date: 07/15/2019 06:16 AM
Subject: [EXTERNAL] MachinePipeliner refactoring
________________________________
Hi Brendon (and friends of MachinePipeliner, +llvm-dev for openness),
Over the past week or so I've been attempting to extend the MachinePipeliner
to support different idioms of code generation. To make this a bit more
concrete, there are two areas where the currently generated code could be
improved depending on architecture:
1) The epilog blocks peel off the final iterations in reverse order. This
means that the overall execution of loop iterations isn't in a perfectly
pipelined order. For architectures that have hardware constructs that insist on
a first-in-first-out order (queues), the currently generated code cannot be
used.
2) For architectures (like Hexagon) that have dedicated predicate register
files, we can generate a compact representation of the loop by predicating
stages of the loop kernel independently. In this case we can either have a
prolog, epilog, or neither (wrapping the prolog and epilog inside the kernel by
using PHIs of predicates).
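One way to model this compact form (a toy simulation under assumed semantics, not Hexagon's actual spNloop mechanics): run a single kernel block N + S - 1 times and predicate stage s on whether its in-flight iteration is in range:

```cpp
#include <cassert>
#include <vector>

// Sketch of the predicated "compact" representation: one kernel block
// executed N + S - 1 times, where at global iteration t stage s runs
// only if 0 <= t - s < N. No separate prolog/epilog blocks are needed;
// the predicates play their role.
struct StageOp { int Stage; int Iter; };

std::vector<StageOp> runCompact(int N, int S) {
  std::vector<StageOp> Trace;
  for (int t = 0; t < N + S - 1; ++t)
    for (int s = S - 1; s >= 0; --s)     // oldest stage first, as in a kernel
      if (t - s >= 0 && t - s < N)
        Trace.push_back({s, t - s});     // stage s works on iteration t - s
  return Trace;
}

// Each loop iteration must pass through stages 0, 1, ..., S-1 in order.
bool covers(const std::vector<StageOp> &Trace, int N, int S) {
  for (int i = 0; i < N; ++i) {
    int Next = 0;
    for (const StageOp &Op : Trace)
      if (Op.Iter == i && Op.Stage == Next)
        ++Next;
    if (Next != S)
      return false;
  }
  return true;
}
```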
At the moment, a lot of the code generation helper code in MachinePipeliner is
tightly fit to its current code generation strategy ("If we're in the
epilog, do this, else do this"). I'm keen to try and make some of the
complex calculations it does, such as where PHIs should come from, more abstract
so they can be reused and composed.
https://reviews.llvm.org/D64665 is my current best effort. This generates
perfect code for PowerPC, but causes a load of problems for Hexagon. It's
become apparent that I don't know enough about some of the edge cases in the
MachinePipeliner code to refactor this from scratch. I'm therefore looking
for direction on refactoring in an incremental fashion.
I think there are a couple of areas that I'd like to change, and I'd
appreciate your ideas and opinions because I clearly don't know enough about
the edge cases here.
a) TII->reduceLoopCount() is hard to understand. Understanding the intended
semantics of this hook from the documentation, I've found, is hard. Its use
appears to be strongly fit to Hexagon (there is even a comment about the removal
of LOOP0 in the MachinePipeliner target agnostic code, which really
shouldn't be there). Why it's called multiple times I don't
understand (why can't we just call it once with the total number of
iterations to peel?).
b) Understanding how loop-carried PHIs are linked together is really hard.
There are two functions dedicated to this with many edge cases, which are linked
with the prolog and epilog schedule. It'd be great to somehow factor these
such that they are independent of the code generation strategy. Note that this
is really important for some of the code gen strategies I mention at the
beginning, because loop-carried PHIs in this case may actually end up being
selects or uses of predicated instructions.
c) There is a slight conflation of "iteration" and "stage"
in the documentation that makes it hard to follow what VRMap contains and the
invariants between functions.
My intent in D64665 was to create two abstractions: "Stage" and
"Block". Instead of referring to stages by index (VRMap), each Stage
would take a prior Stage as input. Stages are contained inside Blocks, which
handle predecessors and successors. I feel that arranging the code generation
in this CFG-like way will make the flow of data much easier to analyze. Of
course, a Stage doesn't just depend on a prior Stage - its loop-carried
inputs could come from any other Stage (and while I think I understand how this
works, I clearly don't get all the edge cases).
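For illustration only, a minimal sketch of what such Stage/Block types might look like. The names come from the thread; the fields are guesses, not the D64665 patch:

```cpp
#include <cassert>
#include <vector>

// Hypothetical shape of the proposed abstraction: a Stage consumes
// values from a prior Stage; a Block owns Stages and knows its CFG
// neighbours, so prolog/kernel/epilog all become ordinary Blocks.
struct Stage {
  int Index;              // schedule stage this object models
  Stage *Prior = nullptr; // stage whose outputs feed this one
};

struct Block {
  std::vector<Stage *> Stages;
  std::vector<Block *> Preds, Succs;

  // Wire this block to a successor, keeping both edge lists in sync.
  void link(Block &Succ) {
    Succs.push_back(&Succ);
    Succ.Preds.push_back(this);
  }
};
```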
What do you think of this abstraction? Do you think it's doomed to failure
because it's too simplistic to cover all the cases?
Do you have any suggestions of areas where we can start to factor out without a
large-scale code breakage? I'm finding this hard to get my teeth into as the
proposed code structure is so different from its current form.
Thanks for any opinions or suggestions!
Cheers,
James
Hi Brendon,

Thanks for the in-depth reply!

> Loop carried Phis are hard.

I think the whole thing can be summarized like this :) The data flow between
prolog, kernel and epilog would be fairly trivial to handle without these (and
this is demonstrated by my prototype being ~300 lines of code).

To dig a little deeper, one thing I don't understand right now (and this is
inhibiting me from making effective refactoring) is what it actually means to
"schedule" a loop-carried PHI. In my naive head, things would be a lot simpler
if all phis were scheduled in stage zero. Given that they are free, is the idea
behind them being in a non-zero stage purely for register allocation? (I'm
aware the answer to this may well be "read the SMS literature more carefully"!)

The big thing I'm trying to wrap my head around is the logic in the
generateExistingPhis() function. I feel that everything else seems fairly easy
to refactor; it's the logic here that's tied closely to the concepts of prologs
and epilogs. It feels like we should be able to answer the question "Given this
loop-carried PHI, what instruction in what stage (as an offset from the current
stage) generates this value?" If we could answer that question, generating the
tree of phis to plumb the values through seems, while not trivial, an
orthogonal task.

Thanks for answering my newbie questions!

James

On Tue, 16 Jul 2019 at 01:35, Brendon Cahoon <bcahoon at quicinc.com> wrote:
> [Earlier quoted reply snipped]
Hi James, My initial attempt at generating the prolog, kernel, and epilog was quite simple as well. But, as the pipeliner attempted to work on more complex loop structures, the logic got complicated pretty quickly. This is most evident in the generateExistingPhis function, which is a mess. I wanted this function to be general, but it evolved over time into a sequence of special cases.> To dig a little deeper, one thing I don't understand right now (and this is inhibiting me from making effective refactoring) is what it actually means to "schedule" a loop-carried PHI.This is a good question. I don’t think the Phis need to be scheduled, but even if the Phis aren’t scheduled, we still need to generate new Phis in the prolog and epilog. And, we need to refer to the original Phi to find the incoming values, and to maintain the map from original names to new names. The stage that a Phi is scheduled in is related to the stage(s) where the uses are scheduled, and the stage for the loop definition. The code uses the Phi stage to determine the total number of new Phis to generate. We probably could use the stages for the definition and uses to calculate values that we use the Phi for – for example, the number of stages, or copies, needed between the definition and a use. If a Phi has the loop definition scheduled in stage 1, and uses of the Phi scheduled in stages 1 and 2, then the Phi will be scheduled in stage 1 most likely. The kernel will require two new Phis to represent the values used and generated in stages 1 and 2. We probably could ignore the stage in which the Phi is scheduled, and compute the number of new Phis based upon the schedule stages for the loop definition and uses. I don’t believe any of this is described in the SMS paper though.> The big thing I'm trying to wrap my head around is the logic in the generateExistingPhis() functionYeah, me too sometimes. 
The role of this function is to generate new Phis in the kernel and epilogs for each Phi that occurred in the original loop. This function also updates the mapping between old and new names, VRMap, and updates register names used in other instructions. The function generates the Phis for the kernel, and epilogs at the start. It needs to determine: * How many new Phis to generate based upon the schedule stages for the definition and uses. * Which stage contains the incoming values, and which block is holding the value for that stage * Rewrite previously scheduled uses, if needed Determining the number of new Phis is straightforward, most of the time. There are some special cases when a Phi use is scheduled in an earlier cycle, but later stage, than the Phi loop definition. For example, = a cycle 0, stage 1 a = phi(b, c) cycle 1, stage 0 c = cycle 2, stage 0 In this case, only 1 new Phi is needed. If the use of a is scheduled after the definition of c, but still in stage 1, then 2 Phis are needed. Also, there are some special cases when generating the Phis in the epilog. This happens because the code computes the maximum number of Phis needed, but that value decreases as the number of epilogs generated increases. Computing the block that contains the incoming values is most of the code in this function. Much of this code attempts to handle various related to Phis that refer to other Phis. For example, a = phi(b, c) c = phi(d, e) = a = c e The number of dependent phis can be more than 2. The calculation for the correct block depends on if we’re generating code for the kernel or epilog too. For example, in the kernel the second Phi operand always comes from the kernel, but in the epilog, the second Phi operand may come from the kernel, or a previous epilog block. The epilog block will never have a Phi that depends on a Phi from the same block (like the case shown above). 
In the simplest case, the stage with the incoming value is the difference between the schedule stage of the use and definition. The extra computation occurs when the Phi or loop definition is not scheduled in stage 0. Then, adjustments are made to find the stage and block with the correct value. If a Phi loop definition is scheduled in stage 1, then the first occurrence of the instruction appears in the kernel (for a 2 stage pipeline). This also means that a Phi is needed in the epilog for that value. When extended to schedules with multiple stages, it calculations get trickier. For example, in a 3 stage pipeline, a Phi loop definition is scheduled in stage 1 appears in the 2nd prolog, the kernel, and the 2nd epilog. If scheduled in stage 2, it appears in the kernel, 1st epilog and 2nd epilog. When rewriting the uses of that Phi, the calculation of the block with the correct value depends its schedule stage (and if the use appears before/after the Phi in the schedule). Let me try to enumerate some more of the special cases in the code. You’re correct in that we just need to answer a fairly straightforward question about which stage and block contains the value we’re looking for. Thanks, Brendon From: James Molloy <james at jamesmolloy.co.uk> Sent: Tuesday, July 16, 2019 9:02 AM To: Brendon Cahoon <bcahoon at quicinc.com> Cc: Jinsong Ji <jji at us.ibm.com>; Hal Finkel <hfinkel at anl.gov>; LLVM Dev <llvm-dev at lists.llvm.org> Subject: [EXT] Re: MachinePipeliner refactoring Hi Brendon, Thanks for the in-depth reply!> Loop carried Phis are hard.I think the whole thing can be summarized like this :) The data flow between prolog, kernel and epilog would be fairly trivial to handle without these (and this is demonstrated by my prototype being ~300 lines of code). To dig a little deeper, one thing I don't understand right now (and this is inhibiting me from making effective refactoring) is what it actually means to "schedule" a loop-carried PHI. 
In my naive head, things would be a lot simpler if all phis were scheduled in stage zero. Given that they are free, is the idea behind them being in a non-zero stage purely for register allocation? (I'm aware the answer to this may well be "read the SMS literature more carefully"!) The big thing I'm trying to wrap my head around is the logic in the generateExistingPhis() function. I feel that everything else seems fairly easy to refactor; it's the logic here that's tied closely to the concepts of prologs and epilogs. It feels like we should be able to answer the question "Given this loop-carried PHI, what instruction in what stage (as an offset from the current stage) generates this value?" If we could answer that question, generating the tree of phis to plumb the values through seems, while not trivial, an orthogonal task. Thanks for answering my newbie questions! James On Tue, 16 Jul 2019 at 01:35, Brendon Cahoon <bcahoon at quicinc.com<mailto:bcahoon at quicinc.com>> wrote: Hi James, I also think that refactoring the code generation part is a great idea. That code is very complicated and difficult to maintain. I’ve wanted to rewrite that code for a long time, but just have never got to it. There are quite a few edge cases to handle (at least in the current code). I’ll take a deeper look at your patch. The abstractions that you mention, Stage and Block, are good ones. I think you’ll need to account for the cycle, or position within a Stage as well. It is a complex problem with a lot different edge cases when dealing with Phis, though I think they can be dealt with much better than the existing code. Generating code for the 3 main parts, the prolog, kernel, and epilog, can be challenging because each has a unique difference that made is hard to generalize the code. For example, the prologs will not have an Phis, but the code generation needs to be aware of when the Phis occur in the sequence in order to get the correct remapped name. 
In the kernel, values may come from a previous iteration of the kernel. In the epilog, values can come from the same epilog block, the kernel, or a prolog block. It would be great to be able to generate the compact representation that you describe, so that a prolog/epilog do not need to be generated. Hexagon has a special hardware loop instruction, spNloop, specifically for this. Although, the compiler does not generate that instruction currently. Perhaps it would be easier to generate this form first, and then convert it to a prolog(s), kernel, epilog(s) sequence. There are a lot of subtle rules when dealing with the Phi instructions depending on when they are scheduled relative to uses. There are two different factors to account for – the stage and cycle when the Phi is scheduled. For example, a use of a Phi can be scheduled at an earlier cycle, but later stage. Several different combinations can occur, and each use of the Phi can have a different stage/cycle. Maintaining the map between new and old virtual registers depends on the stage, cycle, and the current block (prolog, kernel, or epilog). The epilogs are generated in reverse order due to the way the instructions need to be laid out. The first epilog block after the kernel is going to have instructions scheduled in the last stage only, then the next epilog block, if needed, will have instructions from LastStage-1 first, then the LastStage. Let’s say we have instructions x, y, z, that are scheduled in different iterations/stage, x in stage0, y in stage1, and z in stage2, and x generates a value used by y, which generates a value used by z. 
The generated code will look like:

Prolog 0 (this block ends with a conditional jump to Epilog 0):
  X
Prolog 1 (this block ends with a conditional jump to Epilog 1):
  Y X
Kernel:
  Z, Y, X
Epilog 0:
  Z
Epilog 1:
  Y Z

The instructions in the prolog and epilog are generated in original program order within each stage, but in the last epilog block, the instruction Y generates a value used by the subsequent Z.

The reduceLoopCount function is needed for an architecture that has a hardware loop instruction like Hexagon’s LOOP0. For each prolog generated, the loop counter in the instruction needs to be decremented. This could have been done only in the last prolog block, but instead it’s done every time a prolog block is generated (initially, I thought that would be easier since the prolog generation code wouldn’t need to know which block was the last one). But later I added code to handle the case when the loop can be removed completely. It seems reasonable to change this so that the count is updated only once, especially since finding the loop instruction can require searching through the CFG.

Loop-carried Phis are hard. Initially, I tried handling the cases for adding new Phis and for existing Phis with one function. But it turned out that adding new Phis, when an instruction's use is generated at an earlier cycle but a later stage, doesn’t cause as many special cases as handling the different edge cases for existing Phis. There are fewer combinations when adding a new Phi.

Yes, you’re correct about the use of “iteration” vs. “stage” (I even used them together earlier). The term stage comes from the paper, but I often use iteration when talking to others about software pipelining, especially if they have read the paper. I agree that we should be more consistent with the terms.

Let me think some more about the questions you’ve asked. I do think it’s worthwhile to spend some time on creating a good abstraction and improving the code.
I appreciate that you’ve started to think/work on this problem, so I’d like to help out to pursue this direction.

Thanks,
Brendon

From: James Molloy <james at jamesmolloy.co.uk>
Sent: Monday, July 15, 2019 11:05 AM
To: Jinsong Ji <jji at us.ibm.com>
Cc: Brendon Cahoon <bcahoon at quicinc.com>; Hal Finkel <hfinkel at anl.gov>; LLVM Dev <llvm-dev at lists.llvm.org>
Subject: [EXT] Re: MachinePipeliner refactoring

Hi Jinsong,

Thanks for testing out the prototype! I'm not surprised there are errors in that version; it's not 100% ready yet (it's mainly a demonstrator for the path I'd like to take). I'm really trying to work out whether it makes sense to try to get from the current state of the world to that prototype incrementally, or if it's just so far away that a full-on refactor (or two code paths) is required. I suspect only Brendon really has the context to give insight there!

James

On Mon, 15 Jul 2019 at 16:32, Jinsong Ji <jji at us.ibm.com> wrote:

Hi James:

Personally, I like the idea of refactoring and more abstraction, but unfortunately, I don't know enough about the edge cases either.

BTW: the prototype is still causing quite a few assertions on PowerPC - some nodes are not generated in the correct order.

Best,

Jinsong Ji (纪金松), PhD.
XL/LLVM on Power Compiler Development
E-mail: jji at us.ibm.com

From: James Molloy <james at jamesmolloy.co.uk>
To: LLVM Dev <llvm-dev at lists.llvm.org>, jji at us.ibm.com, bcahoon at quicinc.com, Hal Finkel <hfinkel at anl.gov>
Date: 07/15/2019 06:16 AM
Subject: [EXTERNAL] MachinePipeliner refactoring

Hi Brendon (and friends of MachinePipeliner, +llvm-dev for openness),

Over the past week or so I've been attempting to extend the MachinePipeliner to support different idioms of code generation. To make this a bit more concrete, there are two areas where the currently generated code could be improved, depending on the architecture:

1) The epilog blocks peel off the final iterations in reverse order. This means that the overall execution of loop iterations isn't in a perfectly pipelined order. For architectures that have hardware constructs that insist on a first-in-first-out order (queues), the currently generated code cannot be used.

2) For architectures (like Hexagon) that have dedicated predicate register files, we can generate a compact representation of the loop by predicating stages of the loop kernel independently. In this case we can have either a prolog, an epilog, or neither (wrapping the prolog and epilog inside the kernel by using PHIs of predicates).

At the moment, a lot of the code generation helper code in MachinePipeliner is tightly fit to its current code generation strategy ("if we're in the epilog, do this, else do that").
I'm keen to try to make some of the complex calculations it does, such as where PHIs should come from, more abstract so they can be reused and composed. https://reviews.llvm.org/D64665 is my current best effort. This generates perfect code for PowerPC, but causes a load of problems for Hexagon. It's become apparent that I don't know enough about some of the edge cases in the MachinePipeliner code to refactor this from scratch. I'm therefore looking for direction in refactoring in an incremental fashion. I think there are a couple of areas that I'd like to change, and I'd appreciate your ideas and opinions because I clearly don't know enough about the edge cases here.

a) TII->reduceLoopCount() is hard to understand. Understanding the intended semantics of this hook from the documentation, I've found, is hard. Its use appears to be strongly fit to Hexagon (there is even a comment about the removal of LOOP0 in the MachinePipeliner target-agnostic code, which really shouldn't be there). Why it's called multiple times I don't understand (why can't we just call it once with the total number of iterations to peel?).

b) Understanding how loop-carried PHIs are linked together is really hard. There are two functions dedicated to this with many edge cases, which are linked with the prolog and epilog schedule. It'd be great to somehow factor these such that they are independent of the code generation strategy. Note that this is really important for some of the codegen strategies I mention at the beginning, because loop-carried PHIs in those cases may actually end up being selects or uses of predicated instructions.

c) There is a slight conflation of "iteration" and "stage" in the documentation that makes it hard to follow what VRMap contains and the invariants between functions.

My intent in D64665 was to create two abstractions: "Stage" and "Block". Instead of referring to stages by index (VRMap), each Stage would take a prior Stage as input.
Stages are contained inside Blocks, which handle predecessors and successors. I feel that arranging the code generation in this CFG-like way will make the flow of data much easier to analyze. Of course, a Stage doesn't just depend on a prior Stage - its loop-carried inputs could come from any other Stage (and while I think I understand how this works, I clearly don't get all the edge cases).

What do you think of this abstraction? Do you think it's doomed to failure because it's too simplistic to cover all the cases? Do you have any suggestions of areas where we can start to factor out without large-scale code breakage? I'm finding this hard to get my teeth into as the proposed code structure is so different from its current form.

Thanks for any opinions or suggestions!

Cheers,

James