Machine Instruction Bundle in LLVM

Hi all,

There has been quite a bit of discussion about adding machine instruction bundles to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share it with the LLVM community.

Design Criteria

1. The bundle representation must be lightweight. We cannot afford to add significant memory or compile time overhead.

2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequences of instructions that must be scheduled as a unit, e.g. the ARM Thumb2 IT block, Intel compare + branch macro-fusion, or instruction sequences that are currently modeled as pseudo instructions and expanded late.

3. Minimize the amount of change required in the LLVM code generator, especially in target independent passes. It must minimize code duplication (i.e. we don't want code snippets that search for bundle start / end, like all the code in the backend that skips over DBG_VALUE).

4. The representation should make it easy for new code to be oblivious of bundles. That is, MI passes should not have to check whether something is a bundle.

Given the above, we can rule out a new class (e.g. MachineInstrBundle) right away. We don't want MachineBasicBlock to keep a list of MachineInstrBundles, since that would require a massive amount of code change. So what are the choices?

Bundle Representation

1. A nested MachineInstr: This is the most natural (meaning it looks most like the real HW bundle) representation. It has the nice property that most passes do not have to check whether an MI is a bundle. The concern here is that it can add significant memory overhead if it means adding an ilist or SmallVector field to keep the bundled MIs.

2. Add a bit to MachineInstr: The bit means the next MI in the list is part of the same bundle. This is very lightweight. However, it requires many passes to check whether an MI is part of a bundle.

The solution is a combination of both #1 and #2. Conceptually we want a representation that looks like this:

    --------------
    |   Bundle   | ------
    --------------        \
           |          ----------------
           |          |      MI      |
           |          ----------------
           |                  |
           |          ----------------
           |          |      MI      |
           |          ----------------
           |                  |
           |          ----------------
           |          |      MI      |
           |          ----------------
           |
    --------------
    |   Bundle   | ------
    --------------        \
           |          ----------------
           |          |      MI      |
           |          ----------------
           |                  |
           |          ----------------
           |          |      MI      |
           |          ----------------
           |                  |
           |                 ...
           |
    --------------
    |   Bundle   | ------
    --------------        \
    ...

This is #1, a series of nested MIs. However, we are going to store the instructions in the same way as it's done right now, i.e. a list<MachineInstr> on MachineBasicBlock. Using #2, we will add a bit to MI that indicates whether it is part of a bundle:

    ----------------
    |     MI *     |   (* bit indicates the next MI is "glued" to this MI,
    ----------------      i.e. in the same bundle)
            |
    ----------------
    |     MI *     |
    ----------------
            |
    ----------------
    |     MI       |   (no bit, this is the end of the bundle)
    ----------------
            |
    ----------------
    |     MI *     |   (* a new bundle)
    ----------------
            |
    ----------------
    |     MI       |
    ----------------
            |
           ...

We are going to hide the complexity in the MachineBasicBlock::iterator instead. That is, the iterator will be changed to visit only the *top level* instructions (i.e. the first instruction in each bundle). We will add another iterator that allows clients to visit all of the MIs, for those passes that want to look into bundles.

We can use the same representation for arbitrary sequences of instructions that cannot be broken up, e.g. Thumb2 IT blocks:

    ----------------
    |     MI       |   (just an MI)
    ----------------
            |
    ----------------
    |     MI *     |   (* start of Thumb2 IT block)
    ----------------
            |
    ----------------
    |     MI *     |
    ----------------
            |
    ----------------
    |     MI       |   (last MI in the block)
    ----------------
            |
    ----------------
    |     MI       |
    ----------------
            |
           ...
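To make the "glue bit plus skipping iterator" idea concrete, here is a minimal sketch. It uses a hypothetical, simplified MI struct and a free function rather than LLVM's real MachineInstr and MachineBasicBlock::iterator; only the skipping logic is the point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified model: each MI carries one flag bit meaning
// "the next MI in the list belongs to the same bundle".
struct MI {
  int Opcode;
  bool GluedToNext; // the proposed glue bit in the Flags field
};

// Return the indices of the top-level MIs (bundle heads and unbundled MIs),
// mimicking what the changed MachineBasicBlock::iterator would visit.
std::vector<std::size_t> topLevel(const std::vector<MI> &Block) {
  std::vector<std::size_t> Heads;
  std::size_t i = 0;
  while (i < Block.size()) {
    Heads.push_back(i);
    // Skip the rest of this bundle: as long as the current MI is glued
    // to its successor, the successor is inside the same bundle.
    while (i < Block.size() && Block[i].GluedToNext)
      ++i;
    ++i; // step past the last MI of the bundle
  }
  return Heads;
}
```

A full-detail iterator would instead walk the ilist directly, but the skip condition is the same.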
This representation can support VLIW (where the top level MIs are all starts of bundles) or non-VLIW (where there can be a mix of MIs and bundles). It is also very cheap, since the "Flags" field has plenty of free bits available.

Properties of Bundles

If MI passes can consider each bundle as a single unit, then how are they going to examine the properties (i.e. flags and operands) of an MI bundle? Conceptually, the properties of a bundle are the union of the properties of all the MIs inside the bundle. So a bundle reads all the inputs that the individual MIs read, and it defines all the outputs of the individual MIs. However, this is not correct when there are intra-bundle dependencies, e.g.

    -------------------------
    | r0 = op1 r1, r2        |
    | r3 = op2 r0<kill>, #c  |
    -------------------------

r0 should not be considered a source of the bundle, since it's defined inside the bundle and its live range does not extend beyond it. Instead, r0 is a clobber (i.e. a dead def).

    -------------------------
    | r0 = op1 r1, r2        |
    | r3 = op2 r0, #c        |
    -------------------------
    ... = op3 r0,

Here r0 is a def of the bundle, not a use. What does this mean? It means that in order for passes to operate on one bundle at a time, they must be able to visit all the defs and uses of a bundle. We have established that computing the defs and uses of a bundle is not as trivial as taking the union. This is certainly not something we want to re-compute every time! This requires a slight change to the bundle representation:

    ----------------
    |   Bundle *   |   (An MI with special opcode "Bundle")
    ----------------
            |
    ----------------
    |     MI *     |
    ----------------
            |
    ----------------
    |     MI *     |
    ----------------
            |
    ----------------
    |     MI       |   (no bit, this is the end of the bundle)
    ----------------
            |
    ----------------
    |   Bundle *   |   (a new bundle)
    ----------------
            |
    ----------------
    |     MI *     |
    ----------------
            |
    ----------------
    |     MI       |
    ----------------
            |
           ...

The pseudo bundle instructions should be used to capture the properties of the bundle.
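The def/use computation described above can be sketched as follows. This is a hedged illustration with hypothetical structs (registers as plain ints, no sub-registers or register classes), not the actual API the bundle finalizer would use:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical simplified MI: which registers it writes and reads.
struct SimpleMI {
  std::vector<int> Defs; // registers written
  std::vector<int> Uses; // registers read
};

// Compute a bundle's external defs and uses.  A register read inside the
// bundle counts as a bundle use only if it was not defined by an earlier MI
// of the same bundle; every def becomes a bundle def (defs that are never
// read outside end up as dead defs / clobbers).
void bundleDefsUses(const std::vector<SimpleMI> &Bundle,
                    std::set<int> &BDefs, std::set<int> &BUses) {
  for (const SimpleMI &MI : Bundle) {
    for (int R : MI.Uses)
      if (!BDefs.count(R)) // defined earlier in the bundle? then internal
        BUses.insert(R);
    for (int R : MI.Defs)
      BDefs.insert(R);
  }
}
```

On the r0/r3 example above (r0 = op1 r1, r2; r3 = op2 r0, #c) this yields defs {r0, r3} and uses {r1, r2}, matching the intended bundle operands.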
When a bundle is finalized, the packetizer must add source and def operands to the pseudo bundle instruction. More on this later.

Other properties, such as mayLoad and mayStore, are static properties associated with opcodes; they cannot be copied. We will add APIs to examine properties of MIs which will do the *right thing* for bundles (i.e. look into the MIs in bundles).

Packetizing

The current MI flow looks like this:

1. DAG to MI lowering (and pre-RA schedule)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Post-RA scheduling

In the hopefully not very distant future it should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-RA scheduling
   3d. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

The current proposal is for "packetization" to be done as part of the "RA super pass". Early MI optimization passes such as LICM do not benefit from operating on bundles. Furthermore, the packetizer should not have to know how to deal with copies which may later be coalesced, phi nodes, or other copy-like pseudo instructions.

Packetization should be done in two phases. The first part decides which MIs should be bundled together and adds the "bits" which glue the MIs together. This can be done either before or during pre-RA scheduling. The second part of the packetization should only be done after register allocation is completed. There are two very important reasons for this.

1. Packet finalization *must* add source and def operands to the "Bundle" pseudo MI. This allows all later passes to handle bundles transparently. However, we do not want to do this before register allocation is complete.
Otherwise it introduces new defs and uses of virtual registers, and that messes up the MachineRegisterInfo def-use chains. E.g. now vr0 has two defs!

    defs: vr0<dead>, vr3, uses: vr1, vr2
    ----------------------------
    | vr0 = op1 vr1, vr2        |
    | vr3 = op2 vr0<kill>, #c   |
    ----------------------------

2. During register allocation, more identity copies will be eliminated, while loads, stores, copies, and re-materialized instructions will be introduced. It makes sense for the second part of packetization to try to fill these new instructions into empty slots (for VLIW-like targets).

So the overall flow should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-scheduling packetization (optional)
   3d. Pre-RA scheduling (or integrated packetization)
   3e. Post-scheduling packetization (optional)
   3f. Actual register allocation
   3g. Packet finalization
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

Lowering Bundles to MCInst

There is no need to add the equivalent of an MI bundle to MCInst. An MI bundle should be concatenated into a single MCInst by storing opcodes as integer operands. E.g.

    -------------------------
    | r0 = op1 r1, r2        |
    | r3 = op2 r0, #c        |
    -------------------------

    =>  MCInst: op1, r0, r1, r2, op2, r3, r0, #c
    or  MCInst: op1, op2, r0, r1, r2, r3, r0, #c

What's Next?

I am hoping to find some time to implement the following in the near future:

1. Add a BUNDLE opcode.
2. MachineInstr class changes: the new bit, changes to methods such as eraseFromParent(), isIdenticalTo().
3. Change MachineInstr::iterator to skip over bundled MIs. Rename the old iterator.
4. Add MachineInstr APIs to check for instruction properties and switch existing code over.
5. Add an API to form a bundle. It would compute the proper defs and uses and add MachineOperands to the bundle MI.
6. Switch Thumb2 IT blocks to using MI bundles.
7. Add an interface for targets to register their own packetization passes.

I would dearly welcome help on any of these tasks, especially 4, 5, and 6. I also would not cry if someone beats me to #6 (or actually any of the tasks). :-)

In the longer term, I would like to see a target independent packetization pass (I believe one is being reviewed). I would also like to see a target independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion). Patches welcome!

Evan

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111202/89337255/attachment.html>
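The MCInst lowering described above (concatenating a bundle by storing opcodes as ordinary integer operands) can be sketched as follows, using hypothetical flat structs rather than the real MCInst/MCOperand classes:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sub-instruction: an opcode plus its operands, all as ints.
struct Inst {
  int Opcode;
  std::vector<int> Operands;
};

// First proposed encoding: op1 <operands of op1> op2 <operands of op2> ...
// The result stands in for the operand list of the single merged MCInst.
std::vector<int> lowerBundle(const std::vector<Inst> &Bundle) {
  std::vector<int> Flat;
  for (const Inst &I : Bundle) {
    Flat.push_back(I.Opcode); // opcode stored as an integer operand
    Flat.insert(Flat.end(), I.Operands.begin(), I.Operands.end());
  }
  return Flat;
}
```

The second encoding (all opcodes first, then all operands) would just split the loop into two passes; the decoder on the streamer side must agree on whichever layout is chosen.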
Evan,

I will need to comprehend it better, but one small comment right away. Did we not discuss one more option for the bundle implementation - a global cycle ID? We would add an unsigned int field to the MI definition representing the "global scheduling cycle". All MIs with the same global cycle value belong to one group/packet, and zero means an unscheduled MI. That is lightweight and position independent - meaning that if instructions are added or reordered by a pass that does not care about packets/groups, the original grouping can easily be restored. (With single-bit marking, a newly inserted instruction breaks the encoding, or a special API needs to be used to avoid that.) The only real downside I see is that iterating over the members of a single packet becomes a search, but if the scope is a BB it is not a big deal. This way we could also effectively represent NOOP bundles (no MI for a certain cycle value) - VLIW cycles with no instructions issued - or estimate scheduling depth easily, etc.

I am not voting here in any way, I just wanted to bring it up for discussion.

Sergei Larin

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.

From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Evan Cheng
Sent: Friday, December 02, 2011 2:40 PM
To: LLVM Dev
Subject: [LLVMdev] RFC: Machine Instruction Bundle
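Sergei's global-cycle-ID alternative can be sketched like this - a hedged illustration with hypothetical structs, showing how packet membership becomes a group-by over the block and how empty cycles (NOOP bundles) fall out naturally:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical simplified MI: an id plus the proposed "global scheduling
// cycle" field.  Cycle 0 means the MI is unscheduled.
struct CycleMI {
  int Id;
  unsigned Cycle;
};

// Recover the packets: all MIs sharing the same nonzero cycle form one
// group.  A cycle number with no entry in the map is an empty (NOOP) slot.
std::map<unsigned, std::vector<int>> packets(const std::vector<CycleMI> &Block) {
  std::map<unsigned, std::vector<int>> Groups;
  for (const CycleMI &MI : Block)
    if (MI.Cycle != 0) // skip unscheduled MIs
      Groups[MI.Cycle].push_back(MI.Id);
  return Groups;
}
```

This also shows the downside Sergei mentions: visiting one packet's members is a search over the block rather than a pointer walk.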
... and yes, one more thing. On some architectures it might be desirable to know the _order_ of instructions in the packet. That is a bit trickier...

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
On Dec 2, 2011, at 12:40 PM, Evan Cheng wrote:

I think you should turn the glue bit upside down, so it means "This instruction is glued to its predecessor" instead of "This instruction is glued to its successor." That way, the inner bundled instructions have the glue bit, while the outer bundle headers and unbundled instructions don't.

That should make your iterator a lot simpler when you have mixed bundles and unbundled MIs. It simply skips MIs with the glue bit set:

 |    ----------------
 |--> | Bundle       |  (A MI with special opcode "Bundle")
 |    ----------------
 |           |
 |    ----------------
 |    | MI *         |
 |    ----------------
 |           |
 |    ----------------
 |    | MI *         |  (last MI in bundle)
 |    ----------------
 |           |
 |    ----------------
 |--> | MI           |  (no bit, this is a top-level unbundled MI)
 |    ----------------
 |           |
 |    ----------------
 |--> | MI           |  (unbundled MI)
 |    ----------------
 |           |
 |    ----------------
 |--> | Bundle       |  (a new bundle)
 |    ----------------
 |           |
 |    ----------------
 |    | MI *         |
 |    ----------------
 |           |
 |    ----------------
 |    | MI *         |
 |    ----------------
 |           |

What happens when you assign a bundled MI* to a MachineBasicBlock::iterator? Does it rewind to the bundle header? There is a lot of code that treats the two as almost equivalent.

/jakob
Evan Cheng <evan.cheng at apple.com> writes:

> 2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequence of instructions that must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late.

This is really important. This kind of support could also aid scheduling to make sure some later pass doesn't screw up a carefully crafted schedule.

> So the overall flow should look like this:
>
> 1. DAG to MI lowering (no scheduling!)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
>    3a. De-ssa (2-address, phi elim)
>    3b. Coalescing
>    3c. Pre-scheduling packetization (optional)
>    3d. Pre-RA scheduling (or integrated packetization)
>    3e. Post-scheduling packetization (optional)
>    3f. Actual register allocation
>    3g. Packet finalization
> 4. Post-RA optimizations
> 5. PEI
> 6. Re-schedule restores, copies

This all looks fairly reasonable to me. I just have one question. How much machine-level information do we have at the point where bundling occurs? For example, can we figure out instruction sizes exactly? This kind of thing is important for certain kinds of scheduling.

If we don't have that information at this point, it might be advantageous to delay the final packetization until later, once we're at a point where we have all of the useful information we might want.

-Dave
On Dec 2, 2011, at 2:44 PM, Jakob Stoklund Olesen wrote:

> I think you should turn the glue bit upside down, so it means "This instruction is glued to its predecessor" instead of "This instruction is glued to its successor." That way, the inner bundled instructions have the glue bit, while the outer bundle headers and unbundled instructions don't.
>
> That should make your iterator a lot simpler when you have mixed bundles and unbundled MIs. It simply skips MIs with the glue bit set.

Ok. Makes sense.

> What happens when you assign a bundled MI* to a MachineBasicBlock::iterator? Does it rewind to the bundle header? There is a lot of code that treats the two as almost equivalent.

That's invalid. If you use the "top level MIs only" iterator, then it's not legal to assign it to a bundled instruction.
Evan

> /jakob
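The inverted glue bit Jakob proposes ("glued to its predecessor"), and the top-level iterator that exploits it, can be sketched with a toy model. This is an editorial sketch, not LLVM's actual iterator code; the types and names are invented for illustration:

```cpp
#include <vector>

// Toy model of the agreed representation: each MI carries a flag meaning
// "glued to its *predecessor*". Bundle headers and unbundled MIs have the
// flag clear; inner bundled MIs have it set.
struct GlueMI {
  int opcode;
  bool gluedToPred;
};

// The top-level iteration visits only bundle headers and standalone MIs:
// it simply skips any MI glued to its predecessor, exactly as Jakob
// describes. Returned here as a list of visited opcodes for clarity.
std::vector<int> topLevelOpcodes(const std::vector<GlueMI> &block) {
  std::vector<int> result;
  for (const GlueMI &mi : block)
    if (!mi.gluedToPred)          // headers / unbundled MIs only
      result.push_back(mi.opcode);
  return result;
}
```

With the inverted bit, a mixed block of bundles and unbundled MIs needs no special-casing at all: the iterator's skip condition is a single flag test, and an unbundled MI is just a degenerate "bundle of one".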
On Dec 2, 2011, at 2:34 PM, Sergei Larin wrote:

> Evan,
>
> I will need to comprehend it better, but one small comment right away…
>
> Did we not discuss one more option for bundle implementation – global cycle ID. We would add an unsigned int field to MI

No, I don't remember discussing this.

> definition representing "global scheduling cycle". All MIs with the same global cycle value belong to one group/packet. Zero means unscheduled MI.
>
> That is light weight and position independent (meaning that if instructions are added/reordered by a pass that does not care about packets/groups, the original grouping can easily be restored; with a single-bit marking, a newly inserted instruction breaks the encoding unless a special API is used for that). The only real downside I see is that iterating over the members of a single packet becomes a search, but if the scope is a BB it is not a big deal.

Sorry, I'm not seeing the advantages of this. It actually requires one additional field to keep the id, and all the passes which can potentially move code around, break up MBBs, etc. will have to maintain it.

Evan

> This way we could also effectively represent NOOP bundles (no bundle for a certain cycle value) – VLIW cycles with no instructions issued – or estimate scheduling depth easily, etc.
>
> I am not voting here in any way, I just wanted to bring it up for discussion.
>
> Sergei Larin
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
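For reference, the "Lowering Bundles to MCInst" scheme from the RFC upthread (concatenating a bundle into one instruction, with the member opcodes stored as integer operands) can be sketched as follows. These are toy types standing in for MCInst/MCOperand, and the opcode values are illustrative assumptions:

```cpp
#include <vector>

// Toy operand: either a register number or an integer (immediates, and
// member opcodes once the bundle is flattened).
struct Op {
  enum Kind { Reg, Imm } kind;
  int value;
};

struct ToyInst {
  int opcode;
  std::vector<Op> operands;
};

// Flatten a bundle into a single instruction, "op1 op2 r0, r1, r2, r3,
// r0, #c" variant: all member opcodes first, as integer operands, then
// every member's operands in order. The streamer/encoder can recover the
// per-slot boundaries from the opcodes' known operand counts.
ToyInst lowerBundle(const std::vector<ToyInst> &bundle, int bundleOpcode) {
  ToyInst out{bundleOpcode, {}};
  for (const ToyInst &mi : bundle)           // member opcodes as immediates
    out.operands.push_back({Op::Imm, mi.opcode});
  for (const ToyInst &mi : bundle)           // then all operands, in order
    for (const Op &o : mi.operands)
      out.operands.push_back(o);
  return out;
}
```

The "op1 r0, r1, r2, op2, r3, r0, #c" interleaved variant from the RFC would instead emit each opcode immediately before its own operands; either layout avoids adding a bundle concept to the MC layer.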
On Dec 2, 2011, at 2:41 PM, Sergei Larin wrote:

> … and yes, one more thing. On some architectures it might be desirable to know the _order_ of instructions in the packet. That is a bit trickier….

Isn't that just the order of the instructions in the list? I don't see anything that prevents getting the order of instructions. It might require iterating over the MIs in the packet, but for VLIW-like targets the number of instructions per packet should be small.

Also note, passes that require this kind of information should aggregate and maintain the data themselves. Information like the schedule cycle belongs in a "schedule unit", not in the MI.

Evan

> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
On Dec 2, 2011, at 2:44 PM, David A. Greene wrote:

> This all looks fairly reasonable to me. I just have one question. How much machine-level information do we have at the point where bundling occurs? For example, can we figure out instruction sizes exactly? This kind of thing is important for certain kinds of scheduling.

That depends on the target.

> If we don't have that information at this point, it might be advantageous to delay the final packetization until later, once we're at a point where we have all of the useful information we might want.

Targets are allowed to delay packetization until later. The only restriction is that packets cannot be finalized until register allocation is done, because we do not want to add virtual register defs and uses to bundle instructions.

Evan
Hi,

I'm glad to see some action with regard to static instruction scheduling and VLIW support in LLVM. I have some questions and remarks which might not be relevant, as I'm not totally familiar with the current code generation framework of LLVM nor with your plan.

On 12/02/2011 10:40 PM, Evan Cheng wrote:

> 2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequence of instructions that must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late.

The concept of a "VLIW bundle" is to mark a set of instructions that should/could be executed in *parallel*: a static parallel instruction schedule for a single instruction cycle, that is.

In other words, with a VLIW target a bundle might not be just "an atomic, possibly sequentially executed chunk of instructions" or "a set of instructions that can be executed in parallel but also sequentially". On some architectures, sequential execution might break the schedule due to visible function unit pipeline latencies and the absence of hardware interlocking.

Is it wise to mix the two concepts of "parallel instructions" and the looser "instructions that should be executed together"? The "parallel semantics" implies changes to how scheduling is done (the earliest/latest cycle where an instruction can be scheduled) and also to, e.g., the register allocator's live ranges (if allocating registers on "packetized", i.e. parallel, code). Moreover, the definition of a VLIW parallel bundle implies that there can be no "intra-bundle dependencies"; otherwise those instructions could not really be executed in parallel.
For example, looking at your example of a bundle with "intra-bundle dependencies":

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

In the case of a static VLIW target, the semantics of this instruction is that these two "RISC instructions" are executed in parallel, period. Thus, the first instruction cannot depend on the latter (or the other way around), but op2 reads the old value of r0, not the one written in the same bundle.

It depends on the architecture's data hazard detection support, register file bypasses, etc. whether the r0 update of the 1st instruction is available to the second instruction in the bundle, or whether the new r0 value can be read only by the succeeding instruction bundles. If it is available, the execution is sequential in reality, as op1 must produce the value before op2 can execute.

Itanium machines are an example of "parallel bundle architectures" (and of course so are other, more traditional VLIWs, like the TI C64x [2]):

"EPIC allows compilers to define independent instruction sequences, which allows hardware to ignore dependency checks between these instructions. This same hardware functionality in OOO RISC designs is very costly and complex." [1]

As an example of "not truly parallel instruction bundles", on the other hand, we have played a bit with the Cell SPU, which is a quite static architecture but still has hardware data hazard detection and hardware interlocking. It would differentiate between your case and the one where the order is different, because it follows the sequential instruction order in its hardware data dependence resolving logic and stalls the pipeline (thus does not really execute the instructions in parallel) if the sequential order has data hazards.
For how to actually represent the (parallel) instruction bundles I do not have a strong opinion, as long as this semantic difference between a "parallel bundle" and "just a chunk of instructions that should be executed together" is made clear and adhered to everywhere in code generation.

[1] http://www.dig64.org/about/Itanium2_white_paper_public.pdf
[2] http://www.ti.com/lit/ug/spru395b/spru395b.pdf

Best regards,
--
Pekka from the TCE project
http://tce.cs.tut.fi
It's important to understand that this is a proposal for a code generator IR change, not for specific register allocator or scheduler passes. Passes that want to understand more about the internals of instruction bundles are free to design their own data structures.

It is imperative that we do not add multiple constructs for various types of bundles. That would add significant overhead for the rest of the code generator.

I have discussed the proposal with the current code owners of the register allocator and instruction scheduler. This is a proposal that will work with the existing infrastructure (or modules being planned for the near future).

Evan

On Dec 3, 2011, at 7:12 AM, Pekka Jääskeläinen wrote:

> Hi,
>
> I'm glad to see some action with regard to static instruction
> scheduling and VLIW support in LLVM. I have some questions and
> remarks which might not be relevant as I'm not totally familiar
> with the current code generation framework of LLVM nor your plan.
>
> On 12/02/2011 10:40 PM, Evan Cheng wrote:
>> 2. It must be flexible enough to represent more than VLIW bundles. It should be
>> useful to represent arbitrary sequence of instructions that must be scheduled as
>> a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random
>> instruction sequences that are currently modeled as pseudo instructions that are
>> expanded late.
>
> The concept of a "VLIW bundle" is to mark a set of instructions that
> should/could be executed in *parallel*. A static parallel instruction
> schedule for a single instruction cycle, that is.
>
> In other words, with a VLIW target a bundle might not be just "an atomic,
> possibly sequentially executed chunk of instructions" or "a set of
> instructions that can be executed in parallel but also sequentially".
> In some architectures, the sequential execution might break the schedule
> due to visible function unit pipeline latencies and no hardware interlocking.
> [...]
Hi Evan,

Thanks for sending out the proposal. This pass will, of course, be very important for the Hexagon backend. I have to study the proposal in more detail but overall, it looks good to me.

> I would also like to see a target independent interface for pre-scheduling
> optimizations that form instruction sequences (e.g. macro-fusion).

My team will be happy to help with the tasks and specifically with the packet-formation pass. We already have a pass that forms packets for Hexagon and we can generalize the pass and make it machine-independent.

-Anshu

--
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

On 12/2/2011 2:40 PM, Evan Cheng wrote:

> *Machine Instruction Bundle in LLVM*
>
> Hi all,
>
> There have been quite a bit of discussions about adding machine
> instruction bundle to support VLIW targets. I have been pondering what
> the right representation should be and what kind of impact it might
> have on the LLVM code generator. I believe I have a fairly good plan
> now and would like to share with the LLVM community.
>
> *Design Criteria*
>
> 1. The bundle representation must be light weight. We cannot afford to
> add significant memory or compile time overhead.
> 2. It must be flexible enough to represent more than VLIW bundles. It
> should be useful to represent arbitrary sequence of instructions that
> must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare +
> branch macro-fusion, or random instruction sequences that are
> currently modeled as pseudo instructions that are expanded late.
> 3. Minimize the amount of changes required in the LLVM code generator,
> especially in target independent passes. It must minimize code
> duplication (i.e. we don't want code snippets that search for bundle
> start / end like all the code in the backend that skip over DBG_VALUE).
> 4. The representation should make it easy for new code to be oblivious
> of bundles. That is, MI passes should not have to check whether
> something is a bundle.
> Given the above, we can rule out a new class (e.g. MachineInstrBundle)
> right away. We don't want MachineBasicBlock to keep a list of
> MachineInstrBundles since it would require a massive amount of code
> change. So what are the choices?
>
> *Bundle Representation*
>
> 1. A nested MachineInstr: This is the most natural (meaning it looks
> most like the real HW bundle) representation. It has the nice property
> that most passes do not have to check if a MI is a bundle. The concern
> here is that this can add significant memory overhead if it means
> adding an ilist or SmallVector field to keep bundled MIs.
> 2. Add a bit to MachineInstr: The bit means the next MI in the list is
> part of the same bundle. This is very light weight. However, it
> requires many passes to check whether a MI is part of a bundle.
>
> The solution is a combination of both #1 and #2. Conceptually we want
> a representation that looks like this:
>
>  --------------
>  |   Bundle   | -------
>  --------------        \
>       |           ----------------
>       |           |      MI      |
>       |           ----------------
>       |                  |
>       |           ----------------
>       |           |      MI      |
>       |           ----------------
>       |                  |
>       |           ----------------
>       |           |      MI      |
>       |           ----------------
>       |
>  --------------
>  |   Bundle   | ------
>  --------------        \
>       |           ----------------
>       |           |      MI      |
>       |           ----------------
>       |                  |
>       |           ----------------
>       |           |      MI      |
>       |           ----------------
>       |                  |
>       |           ...
>       |
>  --------------
>  |   Bundle   | ------
>  --------------        \
>       |
>      ...
>
> This is #1, a series of nested MIs. However, we are going to store
> the instructions in the same way as it's done right now, i.e. a
> list<MachineInstr> on MachineBasicBlocks. Using #2, we will add a bit
> to MI that indicates whether it is part of a bundle:
>
>  ----------------
>  |     MI *     |   (* bit indicates the next MI is "glued" to this MI,
>  ----------------      i.e. in the same bundle)
>         |
>  ----------------
>  |     MI *     |
>  ----------------
>         |
>  ----------------
>  |     MI       |   (no bit, this is the end of the bundle)
>  ----------------
>         |
>  ----------------
>  |     MI *     |   (* a new bundle)
>  ----------------
>         |
>  ----------------
>  |     MI       |
>  ----------------
>         |
>        ...
>
> We are going to hide the complexity in the MachineBasicBlock::iterator
> instead. That is, the iterator will be changed to visit only the *top
> level* instructions (i.e. the first instruction in each bundle). We will
> add another iterator that allows clients to visit all of the MIs, for
> those passes that want to look into bundles.
>
> We can use the same representation for arbitrary sequences of
> instructions that cannot be broken up, e.g. Thumb2 IT blocks:
>
>  ----------------
>  |     MI       |   (just a MI)
>  ----------------
>         |
>  ----------------
>  |     MI *     |   (* start of Thumb2 IT block)
>  ----------------
>         |
>  ----------------
>  |     MI *     |
>  ----------------
>         |
>  ----------------
>  |     MI       |   (last MI in the block)
>  ----------------
>         |
>  ----------------
>  |     MI       |
>  ----------------
>         |
>        ...
>
> This representation can support VLIW (where top level MIs are all
> starts of bundles) or non-VLIW (where there can be a mix of MIs and
> bundles). It is also very cheap since the "Flags" field has plenty of
> free bits available.
>
> *Properties of Bundle*
>
> If MI passes can consider each bundle as a single unit, then how are
> they going to examine properties (i.e. flags and operands) of a MI
> bundle? Conceptually the properties of a bundle are the union of the
> properties of all the MIs inside the bundle. So a bundle reads all the
> inputs that the individual MIs read and it defines all the outputs of
> the individual MIs. However, this is not correct when there are
> intra-bundle dependencies. e.g.
> -------------------------
> | r0 = op1 r1, r2       |
> | r3 = op2 r0<kill>, #c |
> -------------------------
>
> r0 should not be considered as a source of the bundle since it's
> defined inside the bundle and its live range does not extend beyond
> it. Instead, r0 is a clobber (i.e. a dead def).
>
> -------------------------
> | r0 = op1 r1, r2       |
> | r3 = op2 r0, #c       |
> -------------------------
> ...
>      = op3 r0,
>
> Here r0 is a def, not a use.
>
> What does this mean? It means that in order for passes to operate on a
> bundle at a time, they must be able to visit all the defs and uses of a
> bundle. We have established that computing the defs and uses of a
> bundle is not as trivial as taking the union. This is certainly not
> something we want to re-compute every time! This requires a slight
> change to the bundle representation:
>
>  ----------------
>  |   Bundle *   |   (a MI with the special opcode "Bundle")
>  ----------------
>         |
>  ----------------
>  |     MI *     |
>  ----------------
>         |
>  ----------------
>  |     MI *     |
>  ----------------
>         |
>  ----------------
>  |     MI       |   (no bit, this is the end of the bundle)
>  ----------------
>         |
>  ----------------
>  |   Bundle *   |   (a new bundle)
>  ----------------
>         |
>  ----------------
>  |     MI *     |
>  ----------------
>         |
>  ----------------
>  |     MI       |
>  ----------------
>         |
>        ...
>
> The pseudo bundle instructions should be used to capture properties of
> the bundle. When a bundle is finalized, the packetizer must add source
> and def operands to the pseudo bundle instruction. More on this later.
>
> Other properties, such as mayLoad and mayStore, are static properties
> associated with opcodes. They cannot be copied. We will add APIs to
> examine properties of MIs which will do the *right thing* for bundles
> (i.e. look into the MIs in bundles).
>
> *Packetizing*
>
> The current MI flow looks like this:
>
> 1. DAG to MI lowering (and pre-RA schedule)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi elim)
> 3b. Coalescing
> 3c. Actual register allocation
> 4. Post-RA optimizations
> 5. PEI
> 6. Post-RA scheduling
>
> In the hopefully not very distant future it should look like this:
>
> 1. DAG to MI lowering (no scheduling!)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi elim)
> 3b. Coalescing
> 3c. *Pre-RA scheduling*
> 3d. Actual register allocation
> 4. Post-RA optimizations
> 5. PEI
> 6. *Re-schedule restores, copies*
>
> The current proposal is for "packetization" to be done as part of the
> "RA super pass". Early MI optimization passes such as LICM do not
> benefit from operating on bundles. Furthermore, the packetizer should
> not have to know how to deal with copies which may later be coalesced,
> phi nodes, or other copy-like pseudo instructions.
>
> Packetization should be done in two phases. The first part decides
> which MIs should be bundled together and adds the "bits" which glue the
> MIs together. This can be done either before or after pre-RA
> scheduling. The second part of the packetization should only be done
> after register allocation is completed. There are two very important
> reasons for this.
>
> 1. Packet finalization *must* add source and def operands to the
> "Bundle" pseudo MI. This allows all later passes to handle them
> transparently. However, we do not want to do this before register
> allocation is complete. Otherwise it introduces new defs and uses of
> virtual registers and that messes up the MachineRegisterInfo def-use
> chains. e.g. now vr0 has two defs!
>
>  defs: vr0<dead>, vr3, uses: vr1, vr2
>  ----------------------------
>  | vr0 = op1 vr1, vr2       |
>  | vr3 = op2 vr0<kill>, #c  |
>  ----------------------------
>
> 2. During register allocation, more identity copies will be eliminated
> while loads, stores, copies, and re-materialized instructions will be
> introduced. It makes sense for the second part of packetization to try
> to fill these new instructions into empty slots (for VLIW-like targets).
> So the overall flow should look like this:
>
> 1. DAG to MI lowering (no scheduling!)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi elim)
> 3b. Coalescing
> 3c. *Pre-scheduling packetization (optional)*
> 3d. Pre-RA scheduling (or *integrated packetization*)
> 3e. *Post-scheduling packetization (optional)*
> 3f. Actual register allocation
> 3g. *Packet finalization*
> 4. Post-RA optimizations
> 5. PEI
> 6. Re-schedule restores, copies
>
> *Lowering Bundles to MCInst*
>
> There is no need to add the equivalent of a MI bundle to MCInst. A MI
> bundle should be concatenated into a single MCInst by storing opcodes
> as integer operands. e.g.
>
> -------------------------
> | r0 = op1 r1, r2       |
> | r3 = op2 r0, #c       |
> -------------------------
>
> =>
>
> MCInst: op1 r0, r1, r2, op2, r3, r0, #c
> or
> MCInst: op1 op2 r0, r1, r2, r3, r0, #c
>
> *What's Next?*
>
> I am hoping to find some time to implement the following in the near
> future:
>
> 1. Add BUNDLE opcode.
> 2. MachineInstr class changes: new bit, changes to methods such as
> eraseFromParent(), isIdenticalTo().
> 3. Change MachineInstr::iterator to skip over bundled MIs. Rename the
> old iterator.
> 4. Add MachineInstr APIs to check for instruction properties and switch
> existing code over.
> 5. Add an API to form a bundle. It would compute the proper defs and
> uses and add MachineOperands to the bundle MI.
> 6. Switch Thumb2 IT blocks to using MI bundles.
> 7. Add an interface for targets to register their own packetization
> passes.
>
> I would dearly welcome help on any of these tasks, especially 4, 5, and
> 6. I also would not cry if someone beats me to #6 (or actually any of
> the tasks. :-)
>
> In the longer term, I would like to see a target independent
> packetization pass (I believe one is being reviewed). I would also
> like to see a target independent interface for pre-scheduling
> optimizations that form instruction sequences (e.g. macro-fusion).
> Patches welcome!
>
> Evan
On Dec 2, 2011, at 12:40 PM, Evan Cheng wrote:

> There have been quite a bit of discussions about adding machine instruction
> bundle to support VLIW targets. I have been pondering what the right
> representation should be and what kind of impact it might have on the LLVM
> code generator. I believe I have a fairly good plan now and would like to
> share with the LLVM community.

Let me add some information about how the register allocator handles this extension.

First of all, I agree it is important that this can be used for more than VLIW targets. Besides the uses you mention, I would like to use bundles for DBG_VALUE instructions so we can avoid silly code like this:

  MachineBasicBlock::iterator InsertPos = mi;
  while (InsertPos != MBB->begin() && llvm::prior(InsertPos)->isDebugValue())
    --InsertPos;

  MachineBasicBlock::iterator From = KillMI;
  MachineBasicBlock::iterator To = llvm::next(From);
  while (llvm::prior(From)->isDebugValue())
    --From;

I also think bundles can be used for implementing parallel copies.

Value Semantics
===============

The register allocator and probably most of the target-independent code generator won't care if instructions in a bundle are executed sequentially or in parallel. We do care about value semantics, though. That is, which value is being read by an instruction operand in a bundle.

We definitely want to support the parallel execution semantics where all instructions in a bundle read values defined outside the bundle. This is a swap implemented as a parallel copy bundle:

  { R2 = R3; R3 = R2; }

However, even VLIW targets like Hexagon can read values defined inside the same bundle:

  {
    P0 = cmp.eq(R2,#4)
    if (!P0) R5 = #5
    if (P0.new) R3 = memw(R4)
  }

This Hexagon bundle reads both the P0 predicate register defined inside the bundle (P0.new) and the value defined outside the bundle (!P0). We need to support this. I propose that we add a new MachineOperand flag, IsInternalRead, to represent this.
The flag will mean "This operand is reading a value defined within the same instruction/bundle, not a value from outside the instruction/bundle." The register allocator will treat the <internal> flag almost exactly like it currently treats <undef>, but there is a big difference to the target-specific code.

The register allocator and other target-independent passes don't care if there are multiple defs of the same register inside a bundle. Values are only tracked at the bundle granularity. The semantics of multiple defs is target-defined.

To summarize, all instructions in a bundle read values defined outside the bundle, unless explicitly marked as bundle-internal reads. Multiple defs inside a bundle are indistinguishable except to the target.

Rewriting
=========

The register allocator super-pass needs to rewrite register operands. Virtual-to-virtual rewriting happens during coalescing and live range splitting. Virtual-to-physical rewriting happens only once at the end.

When rewriting virtual registers, a minimal understanding of value semantics is required. In particular, it is possible to split a live range right down the middle of an instruction:

  %vr0 = add %vr0, 1

May be rewritten as:

  %vr2 = add %vr1, 1

This is assuming the add doesn't have two-address constraints, of course.

When rewriting bundle operands, the <internal> flag will be sufficient to determine the correct virtual register. For example:

  {
    %vr0 = cmp.eq(R2,#4)
    if (!%vr0) R5 = #5
    if (%vr0<internal>) R3 = memw(R4)
  }

could be rewritten after live range splitting as:

  {
    %vr2 = cmp.eq(R2,#4)
    if (!%vr1) R5 = #5
    if (%vr2<internal>) R3 = memw(R4)
  }

Constraining spill code insertion
=================================

It is important to note that bundling instructions doesn't constrain the register allocation problem.
For example, this bundle would be impossible with sequential value constraints:

  {
    call foo
    %vr0 = addps %vr1, %vr2
    call bar
  }

The calls clobber the xmm registers, so it is impossible to register allocate this code without breaking up the bundle and inserting spill code between the calls.

With our definition of bundle value semantics, the addps is reading %vr1 and %vr2 outside the bundle, and the call clobbers are not considered relevant. The fact that the call clobbers all registers before addps is the target's problem. This is very similar to how inline assembly is treated.

TL;DR
=====

By adding a new <internal> flag to MachineOperand, the register allocator can effectively treat a bundle as a single instruction. All MachineOperands inside a bundle are treated as if they all belong to the single instruction. This even works when rewriting operands.

/jakob
On Dec 5, 2011, at 1:50 PM, Jakob Stoklund Olesen wrote:

> I would like to use bundle DBG_VALUE instructions so we can avoid silly code like this:
>
>   MachineBasicBlock::iterator InsertPos = mi;
>   while (InsertPos != MBB->begin() && llvm::prior(InsertPos)->isDebugValue())
>     --InsertPos;
>
>   MachineBasicBlock::iterator From = KillMI;
>   MachineBasicBlock::iterator To = llvm::next(From);
>   while (llvm::prior(From)->isDebugValue())
>     --From;

Thinking about it, this probably wouldn't work for pre-RA passes that can delete code. They could inadvertently delete DBG_VALUE instructions.

/jakob
On 12/05/2011 11:50 PM, Jakob Stoklund Olesen wrote:

> By adding a new <internal> flag to MachineOperand, the register allocator can
> effectively treat a bundle as a single instruction. All MachineOperands
> inside a bundle are treated as if they all belong to the single instruction.
> This even works when rewriting operands.

This sounds like a simple and good solution to the "parallel bundle" semantic worries I had.

What about instruction scheduling? Has anyone thought how/if isched could work with parallel bundles? That is, how to create the bundles (referred to as "packing" by some) in the first place (before and/or after RA). I'm not familiar enough with the current LLVM isched work, so I do not know if the current algorithms could be used for static scheduling for VLIWs. Is one really required to write a separate "packer" that just tries to pack sequential(?) instructions into bundles without trying to reorder to improve the packing density?

How I see it, "packing" is just instruction scheduling with an additional "what can be in a single bundle and in what order" scheduling constraint, and it has fixed cycles (bundles) for the instructions as an outcome. The bundle constraint (the wide instruction template(s) supported by the machine) can be implemented with the state machine approach or just a resource table.

VLIW scheduling also requires multiplicity for the FU resource constraints. Can those be described in the current TableGen format? Clustered (or, in general, not-fully-connected) VLIWs also require connectivity information (which FUs are connected to which RFs), but that can be modeled with register classes, I've been told.

--
--Pekka
Indeed, there are strict VLIW architectures out there, and VLIW architectures that leverage some aspects of conventional architectures, sometimes taking advantage of how the microarchitecture was implemented.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

For instance, some such VLIW architectures allow the programmer to specify whether "r0" in "op2" will have the value from t-1 (before the bundle) or from t+1 (the result produced by "op1"), if anything because the forwarding of results is close enough in the pipeline to be used in the same bundle. It seems to me that such subtleties are better left near the target-dependent code, and that the target-independent code should refrain from making strict interpretations of what a bundle should look like.

When it gets to the MC layer, it seems to me that overloading the MCInst operands with opcodes is a bit unusual, but, at first glance, it might not be too hard to handle.

--
Evandro Menezes
Austin, TX
emenezes at codeaurora.org
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

On 12/03/11 09:12, Pekka Jääskeläinen wrote:

> Hi,
>
> I'm glad to see some action with regard to static instruction
> scheduling and VLIW support in LLVM. I have some questions and
> remarks which might not be relevant as I'm not totally familiar
> with the current code generation framework of LLVM nor your plan.
>
> On 12/02/2011 10:40 PM, Evan Cheng wrote:
>> 2. It must be flexible enough to represent more than VLIW bundles. It should be
>> useful to represent arbitrary sequence of instructions that must be scheduled as
>> a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random
>> instruction sequences that are currently modeled as pseudo instructions that are
>> expanded late.
>
> The concept of a "VLIW bundle" is to mark a set of instructions that
> should/could be executed in *parallel*.
> [...]
Hi Evan,

I just read your proposal and the following discussion on VLIW support and want to share my experience of writing a VLIW back-end for LLVM.

I would not integrate the packetizer into the register allocator super class, since that would reduce the back-end developer's flexibility to add optimization passes after the packetizer. Instead, I would add the packetizer as a separate pass. It is true that the packetizer must then deal with PHI and COPY nodes that are eliminated by the RA; the packetizer can simply group each PHI and COPY instruction into a single-instruction bundle.

From my experience, a simple packetizer that groups instructions into bundles (like the old IA-64 back-end did) without changing the order of the instructions produces bad code. Instead, a VLIW scheduler that directly outputs bundles produces better code. The current LLVM scheduler (at the end of the instruction selection pass) is not suitable for generating bundled instructions since it operates on scheduling units for glued instructions. However, the post-RA scheduler in combination with a VLIW-aware hazard recognizer can be used before RA to bundle and schedule instructions for VLIW architectures. Only small modifications to the post-RA scheduler classes are necessary to support virtual registers.

I also would not include packet finalization in the register allocator super class, since the following prolog and epilog code insertion (PECI) pass also adds extra instructions to the instruction list. So I would run packet finalization after prolog and epilog code insertion. Both the RA and PECI can add their instructions as single-instruction bundles that packet finalization can then merge into larger bundles. For packet finalization it also makes sense to perform a post-RA VLIW scheduling.

Timo

From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Evan Cheng
Sent: Friday, 2 December 2011 21:40
To: LLVM Dev
Subject: [LLVMdev] RFC: Machine Instruction Bundle

Machine Instruction Bundle in LLVM

Hi all,

There has been quite a bit of discussion about adding machine instruction bundles to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share it with the LLVM community.

Design Criteria

1. The bundle representation must be lightweight. We cannot afford to add significant memory or compile time overhead.
2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequences of instructions that must be scheduled as a unit, e.g. the ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late.
3. Minimize the amount of changes required in the LLVM code generator, especially in target independent passes. It must minimize code duplication (i.e. we don't want code snippets that search for bundle start / end like all the code in the backend that skips over DBG_VALUE).
4. The representation should make it easy for new code to be oblivious of bundles. That is, MI passes should not have to check whether something is a bundle.

Given the above, we can rule out a new class (e.g. MachineInstrBundle) right away. We don't want MachineBasicBlock to keep a list of MachineInstrBundles since that would require a massive amount of code change. So what are the choices?

Bundle Representation

1. A nested MachineInstr: This is the most natural (meaning it looks most like the real HW bundle) representation. It has the nice property that most passes do not have to check whether an MI is a bundle. The concern here is that it can add significant memory overhead if it means adding an ilist or SmallVector field to keep the bundled MIs.
2. Add a bit to MachineInstr: The bit means the next MI in the list is part of the same bundle. This is very lightweight. However, it requires many passes to check whether an MI is part of a bundle.

The solution is a combination of both #1 and #2. Conceptually we want a representation that looks like this:

--------------
| Bundle | -------
--------------       \
     |          ----------------
     |          |      MI      |
     |          ----------------
     |                 |
     |          ----------------
     |          |      MI      |
     |          ----------------
     |                 |
     |          ----------------
     |          |      MI      |
     |          ----------------
     |
--------------
| Bundle | ------
--------------       \
     |          ----------------
     |          |      MI      |
     |          ----------------
     |                 |
     |          ----------------
     |          |      MI      |
     |          ----------------
     |                 |
     |                ...
     |
--------------
| Bundle | ------
--------------       \
     |
    ...

This is #1, a series of nested MIs. However, we are going to store the instructions in the same way as is done right now, i.e. a list<MachineInstr> on MachineBasicBlock. Using #2, we will add a bit to MI that indicates whether it is part of a bundle:

----------------
|     MI *     |    (* bit indicates the next MI is "glued" to this MI, i.e. in the same bundle)
----------------
       |
----------------
|     MI *     |
----------------
       |
----------------
|     MI       |    (no bit, this is the end of the bundle)
----------------
       |
----------------
|     MI *     |    (* a new bundle)
----------------
       |
----------------
|     MI       |
----------------
       |
      ...

We are going to hide the complexity in the MachineBasicBlock::iterator instead. That is, the iterator will be changed to visit only the *top level* instructions (i.e. the first instruction in each bundle). We will add another iterator that allows clients to visit all of the MIs, for those passes that want to look inside bundles.

We can use the same representation for arbitrary sequences of instructions that cannot be broken up, e.g. Thumb2 IT blocks:

----------------
|     MI       |    (just a MI)
----------------
       |
----------------
|     MI *     |    (* start of Thumb2 IT block)
----------------
       |
----------------
|     MI *     |
----------------
       |
----------------
|     MI       |    (last MI in the block)
----------------
       |
----------------
|     MI       |
----------------
       |
      ...

This representation can support VLIW (where the top level MIs are all starts of bundles) or non-VLIW (where there can be a mix of MIs and bundles). It is also very cheap since the "Flags" field has plenty of free bits available.

Properties of Bundle

If MI passes can consider each bundle as a single unit, then how are they going to examine the properties (i.e. flags and operands) of an MI bundle? Conceptually the properties of a bundle are the union of the properties of all the MIs inside the bundle. So a bundle reads all the inputs that the individual MIs read and it defines all the outputs of the individual MIs. However, this is not correct when there are intra-bundle dependencies, e.g.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

r0 should not be considered a source of the bundle since it is defined inside the bundle and its live range does not extend beyond it. Instead, r0 is a clobber (i.e. a dead def).

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------
 ...
 ... = op3 r0,

Here r0 is a def, not a use.

What does this mean? It means that in order for passes to operate on one bundle at a time, they must be able to visit all the defs and uses of a bundle. We have established that computing the defs and uses of a bundle is not as trivial as taking the union. This is certainly not something we want to re-compute every time! This requires a slight change to the bundle representation:

----------------
|   Bundle *   |    (an MI with the special opcode "Bundle")
----------------
       |
----------------
|     MI *     |
----------------
       |
----------------
|     MI *     |
----------------
       |
----------------
|     MI       |    (no bit, this is the end of the bundle)
----------------
       |
----------------
|   Bundle *   |    (a new bundle)
----------------
       |
----------------
|     MI *     |
----------------
       |
----------------
|     MI       |
----------------
       |
      ...

The pseudo bundle instructions should be used to capture the properties of the bundle. When a bundle is finalized, the packetizer must add source and def operands to the pseudo bundle instruction. More on this later.

Other properties, such as mayLoad and mayStore, are static properties associated with opcodes. They cannot be copied. We will add APIs to examine properties of MIs which will do the *right thing* for bundles (i.e. look into the MIs in bundles).

Packetizing

The current MI flow looks like this:

1. DAG to MI lowering (and pre-RA schedule)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Post-RA scheduling

In the hopefully not very distant future it should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-RA scheduling
   3d. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

The current proposal is for "packetization" to be done as part of the "RA super pass". Early MI optimization passes such as LICM do not benefit from operating on bundles. Furthermore, the packetizer should not have to know how to deal with copies which may later be coalesced, phi nodes, or other copy-like pseudo instructions.

Packetization should be done in two phases. The first part decides which MIs should be bundled together and adds the "bits" which glue the MIs together. This can be done before pre-RA scheduling. The second part of the packetization should only be done after register allocation is completed. There are two very important reasons for this.

1. Packet finalization *must* add source and def operands to the "Bundle" pseudo MI. This allows all later passes to handle bundles transparently. However, we do not want to do this before register allocation is complete. Otherwise it introduces new defs and uses of virtual registers, and that messes up the MachineRegisterInfo def-use chains. e.g. now vr0 has two defs!

defs: vr0<dead>, vr3, uses: vr1, vr2
----------------------------
| vr0 = op1 vr1, vr2       |
| vr3 = op2 vr0<kill>, #c  |
----------------------------

2. During register allocation, more identity copies will be eliminated, while loads, stores, copies, and re-materialized instructions will be introduced. It makes sense for the second part of packetization to try to fill these new instructions into empty slots (for VLIW-like targets).

So the overall flow should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-scheduling packetization (optional)
   3d. Pre-RA scheduling (or integrated packetization)
   3e. Post-scheduling packetization (optional)
   3f. Actual register allocation
   3g. Packet finalization
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

Lowering Bundles to MCInst

There is no need to add the equivalent of an MI bundle to MCInst. An MI bundle should be concatenated into a single MCInst by storing the opcodes as integer operands. e.g.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------

=>

MCInst: op1 r0, r1, r2, op2, r3, r0, #c
or
MCInst: op1 op2 r0, r1, r2, r3, r0, #c

What's Next?

I am hoping to find some time to implement the following in the near future:
1. Add the BUNDLE opcode.
2. MachineInstr class changes: new bit, changes to methods such as eraseFromParent(), isIdenticalTo().
3. Change MachineInstr::iterator to skip over bundled MIs. Rename the old iterator.
4. Add MachineInstr APIs to check for instruction properties and switch existing code over.
5. Add an API to form a bundle. It would compute the proper defs and uses and add MachineOperands to the bundle MI.
6. Switch the Thumb2 IT block to using MI bundles.
7. Add an interface for targets to register their own packetization passes.

I would dearly welcome help on any of these tasks, especially on 4, 5, 6. I also would not cry if someone beats me to #6 (or actually any of the tasks). :-)

In the longer term, I would like to see a target independent packetization pass (I believe one is being reviewed). I would also like to see a target independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion). Patches welcome!

Evan
Hi Evan/Evan.

This is great! I've been experimenting on a VLIW target by adding a FinalizeBundle loop and a combination of the Scoreboard / Schedule hazard classes. The only question that is nagging me now is how to pass the scheduling information (or the flags) down to PECI, to 'choose' the correct opcode. I am currently hacking a few bits off MachineInstr's Flags, but somebody suggested avoiding this and recommended adding a dummy operand instead. I'm not sure if that is the correct way. Any comments on that?

Thanks,
Girish.

From: "Stripf, Timo (ITIV)" <timo.stripf at kit.edu>
To: Evan Cheng <evan.cheng at apple.com>; LLVM Dev <llvmdev at cs.uiuc.edu>
Sent: Wednesday, 11 January 2012 6:26 PM
Subject: Re: [LLVMdev] RFC: Machine Instruction Bundle

>Hi Evan,
>
>I just read your proposal and the following discussion for VLIW support and want to share my experience of writing a VLIW back-end for LLVM.
>[...]
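The two options Girish mentions above can be sketched on a toy MI (hypothetical types, field names, and bit layout — not LLVM's actual MachineInstr): stealing spare bits of a flags word, versus appending a dummy immediate operand that later passes must know to ignore and eventually strip.

```cpp
#include <cstdint>
#include <vector>

// Toy MI carrying a VLIW slot number to a later pass, two ways.
struct MI {
  uint32_t Flags = 0;
  std::vector<int64_t> Operands;
};

// Option 1: steal spare bits of the flags word (the "hacking bits off
// MachineInstr.Flags" approach). Bit position chosen arbitrarily here.
constexpr uint32_t SlotShift = 24;
constexpr uint32_t SlotMask = 0xFu << SlotShift;
void setSlotInFlags(MI &I, unsigned Slot) {
  I.Flags = (I.Flags & ~SlotMask) | ((Slot << SlotShift) & SlotMask);
}
unsigned getSlotFromFlags(const MI &I) {
  return (I.Flags & SlotMask) >> SlotShift;
}

// Option 2: append a dummy immediate operand carrying the slot. Cleaner
// separation from core flags, but every pass between here and emission
// must tolerate the extra operand.
void setSlotAsOperand(MI &I, unsigned Slot) { I.Operands.push_back(Slot); }
unsigned getSlotFromOperand(const MI &I) {
  return static_cast<unsigned>(I.Operands.back());
}
```

The trade-off: flag bits are invisible to operand-walking code but compete with other users of the flags word; a dummy operand survives serialization and is self-describing, at the cost of confusing any pass that inspects operands naively.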
On Jan 11, 2012, at 4:56 AM, Stripf, Timo (ITIV) wrote:> Hi Evan, > > I just read your proposal and the following discussion for VLIW support and want to share my experience of writing a VLIW back-end for LLVM. > > > I would not integrate the packetizer into the register allocator super class since it would reduce the flexibility for the back-end developer to add some optimization passes after the packetizer. Instead, I would add the packetizer as a separate pass. It is true that the packetizer must deal in that case with PHI and COPY nodes that are eliminated by the RA. The packetizer can simple group all PHI and COPY instruction into single bundles consisting of only one instruction. > > From my experience a simple packetizer that groups instruction into bundles (like the old IA-64 back-end did) without changing the order of the instructions produces bad code. Instead, a VLIW scheduler that directly outputs bundles produces better code. The current LLVM scheduler (at the end of the instruction selection pass) is not suitable to generate bundled instructions since it operates on scheduling units for glued instructions. > However, the post-RA scheduler in combination with a VLIW-aware hazard recognizer can be used before RA to bundle and schedule instructions for VLIW architectures. Only small modifications within the post-RA scheduler classes to support virtual registers are necessary.What you're describing is how we expect some packetizers to be implemented. Although it's really up the the person implementing the packetizer whether or not to integrate it with scheduling. The point is that a framework will support both scheduling and bundling before RA (and after coalescing).> I also would not include packet finalization into the register allocator super class since also the following pre- and epilog code insertion (PECI) pass adds extra instruction into the instruction list. So I would add the packet finalization after pre- and epilog code insertion. 
Both the RA and PECI can add its instruction into single bundles that can be integrated into larger bundles within packet finalization. For packet finalization it also makes sense to perform a post-ra VLIW scheduling.I think Evan referred to a regalloc "super pass". This means nothing more than a set of passes that all require and preserve liveness information. Once physical registers are assigned and we throw away vreg liveness, then passes can view bundles as individual instructions. That's what I thin of as "packet finalization". Nothing really prevents rebundling though. The late bundler just may not have as much freedom. -Andy> > Timo > > > Von: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] Im Auftrag von Evan Cheng > Gesendet: Freitag, 2. Dezember 2011 21:40 > An: LLVM Dev > Betreff: [LLVMdev] RFC: Machine Instruction Bundle > > Machine Instruction Bundle in LLVM > > Hi all, > > There have been quite a bit of discussions about adding machine instruction bundle to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share with the LLVM community. > > Design Criteria > > 1. The bundle representation must be light weight. We cannot afford to add significant memory or compile time overhead. > 2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequence of instructions that must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late. > 3. Minimize the amount of changes required in the LLVM code generator, especially in target independent passes. It must minimize code duplication (i.e. 
we don't want code snippets that search for bundle start / end like all the code in the backend that skip over DBG_VALUE). > 4. The representation should make it easy for new code to be oblivious of bundles. That is, MI passes should not have to check whether something is a bundle. > > Given the above, we can rule out a new class (e.g. MachineInstrBundle) right away. We don't want MachineBasic block to keep a list of MachineInstrBundles since it will require massive amount of code change. So what are the choices? > > Bundle Representation > > 1. A nested MachineInstr: This is the most natural (meaning it looks most like the real HW bundle) representation. It has the nice property that most passes do not have to check if a MI is a bundle.The concern here this can add significant memory overhead if this means adding a ilist or SmallVector field to keep bundled MIs. > 2. Add a bit to MachineInstr: The bit means the next MI in the list is part of the same bundle. This is very light weight. However it requires many passes to check wether a MI is part of a bundle. > > The solution is a combination of both #1 and #2. Conceptually we want a representation that looks like this: > > -------------- > | Bundle | ------- > -------------- \ > | ---------------- > | | MI | > | ---------------- > | | > | ---------------- > | | MI | > | ---------------- > | | > | ---------------- > | | MI | > | ---------------- > | > -------------- > | Bundle | ------ > -------------- \ > | ---------------- > | | MI | > | ---------------- > | | > | ---------------- > | | MI | > | ---------------- > | | > | … > | > -------------- > | Bundle | ------ > -------------- \ > | > ... > > > This is #1, a series of nested MI's. However, we are going to store the instructions in the same way as it's done right now, i.e. a list<MachineInstr> on MachineBasicBlocks. Using #2, we will add a bit to MI that indicates whether it is part of a bundle. 
> > ---------------- > | MI * | (* bit indicates next MI is "glued" to this MI, i.e. in the same bundle) > ---------------- > | > ---------------- > | MI * | > ---------------- > | > ---------------- > | MI | (no bit, this is the end of the bundle) > -------------- > | > ---------------- > | MI * | (* a new bundle) > ---------------- > | > ---------------- > | MI | > ---------------- > | > ... > > We are going to hide the complexity in the MachineBasicBlock::iterator instead. That is, the iterator will be changed to visit only the *top level* instructions (i.e. first instruction in each bundle). We will add another iterator that allows client to visit all of the MIs for those passes that want to look into bundles. > > We can use the same representation for arbitrary sequence of instructions that cannot be broken up. e.g. Thumb2 IT blocks. > > ---------------- > | MI | (just a MI) > ---------------- > | > ---------------- > | MI * | (* Start of Thumb2 IT block) > ---------------- > | > ---------------- > | MI * | > ---------------- > | > ---------------- > | MI | (last MI in the block) > ---------------- > | > ---------------- > | MI | > ---------------- > | > ... > > This representation can support VLIW (where top level MI's are all start of bundles) or non-VLIW (where there can be mix of MIs and bundles). It is also very cheap since the "Flags" field has plenty of free bits available. > > Properties of Bundle > > If MI passes can consider each bundle as a single unit, then how are they going to examine properties (i.e. flags and operands) of a MI bundle? Conceptually a the properties of a bundle is the union of the properties of all the MIs inside the bundle. So a bundle reads all the inputs that the individual MIs read and it defines all the outputs of the individual MIs. However, this is not correct when there are intra-bundle dependencies. e.g. 
-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

r0 should not be considered a source of the bundle, since it's defined inside the bundle and its live range does not extend beyond it. Instead, r0 is a clobber (i.e. a dead def).

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------
...
     = op3 r0,

r0 is a def, not a use.

What does this mean? It means that in order for passes to operate on one bundle at a time, they must be able to visit all the defs and uses of a bundle. We have established that computing the defs and uses of a bundle is not as trivial as taking the union. This is certainly not something we want to re-compute every time! This requires a slight change to the bundle representation:

 ----------------
 |   Bundle *   |   (a MI with the special opcode "Bundle")
 ----------------
        |
 ----------------
 |     MI *     |
 ----------------
        |
 ----------------
 |     MI *     |
 ----------------
        |
 ----------------
 |     MI       |   (no bit, this is the end of the bundle)
 ----------------
        |
 ----------------
 |   Bundle *   |   (a new bundle)
 ----------------
        |
 ----------------
 |     MI *     |
 ----------------
        |
 ----------------
 |     MI       |
 ----------------
        |
       ...

The pseudo bundle instructions should be used to capture the properties of the bundle. When a bundle is finalized, the packetizer must add source and def operands to the pseudo bundle instruction. More on this later.

Other properties, such as mayLoad and mayStore, are static properties associated with opcodes. They cannot be copied. We will add APIs to examine the properties of MIs which will do the *right thing* for bundles (i.e. look into the MIs in bundles).

Packetizing

The current MI flow looks like this:

1. DAG to MI lowering (and pre-RA schedule)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
3a. De-ssa (2-address, phi elim)
3b. Coalescing
3c. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Post-RA scheduling

In the hopefully not very distant future it should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
3a. De-ssa (2-address, phi elim)
3b. Coalescing
3c. Pre-RA scheduling
3d. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

The current proposal is for "packetization" to be done as part of the "RA super pass". Early MI optimization passes such as LICM do not benefit from operating on bundles. Furthermore, the packetizer should not have to know how to deal with copies which may later be coalesced, phi nodes, or other copy-like pseudo instructions.

Packetization should be done in two phases. The first part decides which MIs should be bundled together and adds the "bits" which glue MIs together. This can be done before pre-RA scheduling or integrated with it. The second part of packetization should only be done after register allocation is complete. There are two very important reasons for this.

1. Packet finalization *must* add source and def operands to the "Bundle" pseudo MI. This allows all later passes to handle bundles transparently. However, we do not want to do this before register allocation is complete. Otherwise it introduces new defs and uses of virtual registers, and that messes up the MachineRegisterInfo def-use chains.

e.g. Now vr0 has two defs!

defs: vr0<dead>, vr3, uses: vr1, vr2
----------------------------
| vr0 = op1 vr1, vr2       |
| vr3 = op2 vr0<kill>, #c  |
----------------------------

2. During register allocation, more identity copies will be eliminated, while loads, stores, copies, and re-materialized instructions will be introduced. It makes sense for the second part of packetization to try to fill these new instructions into empty slots (for VLIW-like targets).
So the overall flow should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
3a. De-ssa (2-address, phi elim)
3b. Coalescing
3c. Pre-scheduling packetization (optional)
3d. Pre-RA scheduling (or integrated packetization)
3e. Post-scheduling packetization (optional)
3f. Actual register allocation
3g. Packet finalization
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

Lowering Bundles to MCInst

There is no need to add the equivalent of a MI bundle to MCInst. A MI bundle should be concatenated into a single MCInst by storing opcodes as integer operands. e.g.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------

=>

MCInst: op1 r0, r1, r2, op2, r3, r0, #c
or
MCInst: op1 op2 r0, r1, r2, r3, r0, #c

What's Next?

I am hoping to find some time to implement the following in the near future:

1. Add a BUNDLE opcode.
2. MachineInstr class changes: the new bit, changes to methods such as eraseFromParent(), isIdenticalTo().
3. Change MachineBasicBlock::iterator to skip over bundled MIs. Rename the old iterator.
4. Add MachineInstr APIs to check for instruction properties, and switch existing code over.
5. Add an API to form a bundle. It would compute the proper defs and uses and add MachineOperands to the bundle MI.
6. Switch Thumb2 IT blocks to using MI bundles.
7. Add an interface for targets to register their own packetization passes.

I would dearly welcome help on any of these tasks, especially on 4, 5, and 6. I also would not cry if someone beats me to #6 (or actually any of the tasks). :-)

In the longer term, I would like to see a target-independent packetization pass (I believe one is being reviewed). I would also like to see a target-independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion). Patches welcome!
Evan
_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu
http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev