Machine Instruction Bundle in LLVM

Hi all,

There has been quite a bit of discussion about adding machine instruction bundles to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share it with the LLVM community.

Design Criteria

1. The bundle representation must be lightweight. We cannot afford to add significant memory or compile time overhead.

2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequences of instructions that must be scheduled as a unit, e.g. the ARM Thumb2 IT block, Intel compare + branch macro-fusion, or instruction sequences that are currently modeled as pseudo instructions and expanded late.

3. Minimize the amount of change required in the LLVM code generator, especially in target independent passes. It must minimize code duplication (i.e. we don't want code snippets that search for bundle start / end, like all the code in the backend that skips over DBG_VALUE).

4. The representation should make it easy for new code to be oblivious of bundles. That is, MI passes should not have to check whether something is a bundle.

Given the above, we can rule out a new class (e.g. MachineInstrBundle) right away. We don't want MachineBasicBlock to keep a list of MachineInstrBundles, since that would require a massive amount of code change. So what are the choices?

Bundle Representation

1. A nested MachineInstr: This is the most natural (meaning it looks most like the real HW bundle) representation. It has the nice property that most passes do not have to check whether an MI is a bundle. The concern here is that it can add significant memory overhead if it means adding an ilist or SmallVector field to keep the bundled MIs.

2. Add a bit to MachineInstr: The bit means the next MI in the list is part of the same bundle. This is very lightweight. However, it requires many passes to check whether an MI is part of a bundle.

The solution is a combination of both #1 and #2. Conceptually we want a representation that looks like this:

    --------------
    |   Bundle   | ------
    --------------        \
           |          ----------------
           |          |      MI      |
           |          ----------------
           |                  |
           |          ----------------
           |          |      MI      |
           |          ----------------
           |                  |
           |          ----------------
           |          |      MI      |
           |          ----------------
           |
    --------------
    |   Bundle   | ------
    --------------        \
           |          ----------------
           |          |      MI      |
           |          ----------------
           |                  |
           |          ----------------
           |          |      MI      |
           |          ----------------
           |                  |
           |                 ...
           |
    --------------
    |   Bundle   | ------
    --------------        \
    ...

This is #1, a series of nested MIs. However, we are going to store the instructions in the same way as it's done right now, i.e. a list<MachineInstr> on MachineBasicBlock. Using #2, we will add a bit to MI that indicates whether it is part of a bundle:

    ----------------
    |     MI *     |   (* bit indicates the next MI is "glued" to this MI,
    ----------------      i.e. in the same bundle)
            |
    ----------------
    |     MI *     |
    ----------------
            |
    ----------------
    |     MI       |   (no bit, this is the end of the bundle)
    ----------------
            |
    ----------------
    |     MI *     |   (* a new bundle)
    ----------------
            |
    ----------------
    |     MI       |
    ----------------
            |
           ...

We are going to hide the complexity in the MachineBasicBlock::iterator instead. That is, the iterator will be changed to visit only the *top level* instructions (i.e. the first instruction in each bundle). We will add another iterator that allows clients to visit all of the MIs, for those passes that want to look into bundles.

We can use the same representation for arbitrary sequences of instructions that cannot be broken up, e.g. Thumb2 IT blocks:

    ----------------
    |     MI       |   (just an MI)
    ----------------
            |
    ----------------
    |     MI *     |   (* start of Thumb2 IT block)
    ----------------
            |
    ----------------
    |     MI *     |
    ----------------
            |
    ----------------
    |     MI       |   (last MI in the block)
    ----------------
            |
    ----------------
    |     MI       |
    ----------------
            |
           ...
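To make the "glue bit plus skipping iterator" idea concrete, here is a minimal sketch. It uses a hypothetical, simplified MI struct and a free function rather than LLVM's real MachineInstr and MachineBasicBlock::iterator; only the skipping logic is the point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified model: each MI carries one flag bit meaning
// "the next MI in the list belongs to the same bundle".
struct MI {
  int Opcode;
  bool GluedToNext; // the proposed glue bit in the Flags field
};

// Return the indices of the top-level MIs (bundle heads and unbundled MIs),
// mimicking what the changed MachineBasicBlock::iterator would visit.
std::vector<std::size_t> topLevel(const std::vector<MI> &Block) {
  std::vector<std::size_t> Heads;
  std::size_t i = 0;
  while (i < Block.size()) {
    Heads.push_back(i);
    // Skip the rest of this bundle: as long as the current MI is glued
    // to its successor, the successor is inside the same bundle.
    while (i < Block.size() && Block[i].GluedToNext)
      ++i;
    ++i; // step past the last MI of the bundle
  }
  return Heads;
}
```

A full-detail iterator would instead walk the ilist directly, but the skip condition is the same.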
This representation can support VLIW (where the top level MIs are all starts of bundles) or non-VLIW (where there can be a mix of MIs and bundles). It is also very cheap, since the "Flags" field has plenty of free bits available.

Properties of Bundles

If MI passes can consider each bundle as a single unit, then how are they going to examine the properties (i.e. flags and operands) of an MI bundle? Conceptually, the properties of a bundle are the union of the properties of all the MIs inside the bundle. So a bundle reads all the inputs that the individual MIs read, and it defines all the outputs of the individual MIs. However, this is not correct when there are intra-bundle dependencies, e.g.

    -------------------------
    | r0 = op1 r1, r2        |
    | r3 = op2 r0<kill>, #c  |
    -------------------------

r0 should not be considered a source of the bundle, since it's defined inside the bundle and its live range does not extend beyond it. Instead, r0 is a clobber (i.e. a dead def).

    -------------------------
    | r0 = op1 r1, r2        |
    | r3 = op2 r0, #c        |
    -------------------------
    ... = op3 r0,

Here r0 is a def of the bundle, not a use. What does this mean? It means that in order for passes to operate on one bundle at a time, they must be able to visit all the defs and uses of a bundle. We have established that computing the defs and uses of a bundle is not as trivial as taking the union. This is certainly not something we want to re-compute every time! This requires a slight change to the bundle representation:

    ----------------
    |   Bundle *   |   (An MI with special opcode "Bundle")
    ----------------
            |
    ----------------
    |     MI *     |
    ----------------
            |
    ----------------
    |     MI *     |
    ----------------
            |
    ----------------
    |     MI       |   (no bit, this is the end of the bundle)
    ----------------
            |
    ----------------
    |   Bundle *   |   (a new bundle)
    ----------------
            |
    ----------------
    |     MI *     |
    ----------------
            |
    ----------------
    |     MI       |
    ----------------
            |
           ...

The pseudo bundle instructions should be used to capture the properties of the bundle.
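The def/use computation described above can be sketched as follows. This is a hedged illustration with hypothetical structs (registers as plain ints, no sub-registers or register classes), not the actual API the bundle finalizer would use:

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical simplified MI: which registers it writes and reads.
struct SimpleMI {
  std::vector<int> Defs; // registers written
  std::vector<int> Uses; // registers read
};

// Compute a bundle's external defs and uses.  A register read inside the
// bundle counts as a bundle use only if it was not defined by an earlier MI
// of the same bundle; every def becomes a bundle def (defs that are never
// read outside end up as dead defs / clobbers).
void bundleDefsUses(const std::vector<SimpleMI> &Bundle,
                    std::set<int> &BDefs, std::set<int> &BUses) {
  for (const SimpleMI &MI : Bundle) {
    for (int R : MI.Uses)
      if (!BDefs.count(R)) // defined earlier in the bundle? then internal
        BUses.insert(R);
    for (int R : MI.Defs)
      BDefs.insert(R);
  }
}
```

On the r0/r3 example above (r0 = op1 r1, r2; r3 = op2 r0, #c) this yields defs {r0, r3} and uses {r1, r2}, matching the intended bundle operands.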
When a bundle is finalized, the packetizer must add source and def operands to the pseudo bundle instruction. More on this later.

Other properties, such as mayLoad and mayStore, are static properties associated with opcodes; they cannot be copied. We will add APIs to examine properties of MIs which will do the *right thing* for bundles (i.e. look into the MIs in bundles).

Packetizing

The current MI flow looks like this:

1. DAG to MI lowering (and pre-RA schedule)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Post-RA scheduling

In the hopefully not very distant future it should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-RA scheduling
   3d. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

The current proposal is for "packetization" to be done as part of the "RA super pass". Early MI optimization passes such as LICM do not benefit from operating on bundles. Furthermore, the packetizer should not have to know how to deal with copies which may later be coalesced, phi nodes, or other copy-like pseudo instructions.

Packetization should be done in two phases. The first part decides which MIs should be bundled together and adds the "bits" which glue the MIs together. This can be done either before or during pre-RA scheduling. The second part of the packetization should only be done after register allocation is completed. There are two very important reasons for this.

1. Packet finalization *must* add source and def operands to the "Bundle" pseudo MI. This allows all later passes to handle bundles transparently. However, we do not want to do this before register allocation is complete.
Otherwise it introduces new defs and uses of virtual registers, and that messes up the MachineRegisterInfo def-use chains. E.g. now vr0 has two defs!

    defs: vr0<dead>, vr3, uses: vr1, vr2
    ----------------------------
    | vr0 = op1 vr1, vr2        |
    | vr3 = op2 vr0<kill>, #c   |
    ----------------------------

2. During register allocation, more identity copies will be eliminated, while loads, stores, copies, and re-materialized instructions will be introduced. It makes sense for the second part of packetization to try to fill these new instructions into empty slots (for VLIW-like targets).

So the overall flow should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-scheduling packetization (optional)
   3d. Pre-RA scheduling (or integrated packetization)
   3e. Post-scheduling packetization (optional)
   3f. Actual register allocation
   3g. Packet finalization
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

Lowering Bundles to MCInst

There is no need to add the equivalent of an MI bundle to MCInst. An MI bundle should be concatenated into a single MCInst by storing opcodes as integer operands. E.g.

    -------------------------
    | r0 = op1 r1, r2        |
    | r3 = op2 r0, #c        |
    -------------------------

    =>  MCInst: op1, r0, r1, r2, op2, r3, r0, #c
    or  MCInst: op1, op2, r0, r1, r2, r3, r0, #c

What's Next?

I am hoping to find some time to implement the following in the near future:

1. Add a BUNDLE opcode.
2. MachineInstr class changes: the new bit, changes to methods such as eraseFromParent(), isIdenticalTo().
3. Change MachineInstr::iterator to skip over bundled MIs. Rename the old iterator.
4. Add MachineInstr APIs to check for instruction properties and switch existing code over.
5. Add an API to form a bundle. It would compute the proper defs and uses and add MachineOperands to the bundle MI.
6. Switch Thumb2 IT blocks to using MI bundles.
7. Add an interface for targets to register their own packetization passes.

I would dearly welcome help on any of these tasks, especially 4, 5, and 6. I also would not cry if someone beats me to #6 (or actually any of the tasks). :-)

In the longer term, I would like to see a target independent packetization pass (I believe one is being reviewed). I would also like to see a target independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion). Patches welcome!

Evan

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111202/89337255/attachment.html>
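The MCInst lowering described above (concatenating a bundle by storing opcodes as ordinary integer operands) can be sketched as follows, using hypothetical flat structs rather than the real MCInst/MCOperand classes:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sub-instruction: an opcode plus its operands, all as ints.
struct Inst {
  int Opcode;
  std::vector<int> Operands;
};

// First proposed encoding: op1 <operands of op1> op2 <operands of op2> ...
// The result stands in for the operand list of the single merged MCInst.
std::vector<int> lowerBundle(const std::vector<Inst> &Bundle) {
  std::vector<int> Flat;
  for (const Inst &I : Bundle) {
    Flat.push_back(I.Opcode); // opcode stored as an integer operand
    Flat.insert(Flat.end(), I.Operands.begin(), I.Operands.end());
  }
  return Flat;
}
```

The second encoding (all opcodes first, then all operands) would just split the loop into two passes; the decoder on the streamer side must agree on whichever layout is chosen.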
Evan,

I will need to comprehend it better, but one small comment right away. Did we not discuss one more option for the bundle implementation - a global cycle ID? We would add an unsigned int field to the MI definition representing the "global scheduling cycle". All MIs with the same global cycle value belong to one group/packet, and zero means an unscheduled MI. That is lightweight and position independent - meaning that if instructions are added or reordered by a pass that does not care about packets/groups, the original grouping can easily be restored. (With single-bit marking, a newly inserted instruction breaks the encoding, or a special API needs to be used to avoid that.) The only real downside I see is that iterating over the members of a single packet becomes a search, but if the scope is a BB it is not a big deal. This way we could also effectively represent NOOP bundles (no MI for a certain cycle value) - VLIW cycles with no instructions issued - or estimate scheduling depth easily, etc.

I am not voting here in any way, I just wanted to bring it up for discussion.

Sergei Larin

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.

From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Evan Cheng
Sent: Friday, December 02, 2011 2:40 PM
To: LLVM Dev
Subject: [LLVMdev] RFC: Machine Instruction Bundle
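Sergei's global-cycle-ID alternative can be sketched like this - a hedged illustration with hypothetical structs, showing how packet membership becomes a group-by over the block and how empty cycles (NOOP bundles) fall out naturally:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical simplified MI: an id plus the proposed "global scheduling
// cycle" field.  Cycle 0 means the MI is unscheduled.
struct CycleMI {
  int Id;
  unsigned Cycle;
};

// Recover the packets: all MIs sharing the same nonzero cycle form one
// group.  A cycle number with no entry in the map is an empty (NOOP) slot.
std::map<unsigned, std::vector<int>> packets(const std::vector<CycleMI> &Block) {
  std::map<unsigned, std::vector<int>> Groups;
  for (const CycleMI &MI : Block)
    if (MI.Cycle != 0) // skip unscheduled MIs
      Groups[MI.Cycle].push_back(MI.Id);
  return Groups;
}
```

This also shows the downside Sergei mentions: visiting one packet's members is a search over the block rather than a pointer walk.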
... and yes, one more thing. On some architectures it might be desirable to know the _order_ of instructions in the packet. That is a bit trickier...

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
On Dec 2, 2011, at 12:40 PM, Evan Cheng wrote:

I think you should turn the glue bit upside down, so it means "This instruction is glued to its predecessor" instead of "This instruction is glued to its successor." That way, the inner bundled instructions have the glue bit, while the outer bundle headers and unbundled instructions don't.

That should make your iterator a lot simpler when you have mixed bundles and unbundled MIs. It simply skips MIs with the glue bit set:

 |    ----------------
 |--> | Bundle       |  (A MI with special opcode "Bundle")
 |    ----------------
 |           |
 |    ----------------
 |    | MI *         |
 |    ----------------
 |           |
 |    ----------------
 |    | MI *         |  (last MI in bundle)
 |    ----------------
 |           |
 |    ----------------
 |--> | MI           |  (no bit, this is a top-level unbundled MI)
 |    ----------------
 |           |
 |    ----------------
 |--> | MI           |  (unbundled MI)
 |    ----------------
 |           |
 |    ----------------
 |--> | Bundle       |  (a new bundle)
 |    ----------------
 |           |
 |    ----------------
 |    | MI *         |
 |    ----------------
 |           |
 |    ----------------
 |    | MI *         |
 |    ----------------
 |           |

What happens when you assign a bundled MI* to a MachineBasicBlock::iterator? Does it rewind to the bundle header? There is a lot of code that treats the two as almost equivalent.

/jakob
Evan Cheng <evan.cheng at apple.com> writes:

> 2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequence of instructions that must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late.

This is really important. This kind of support could also aid scheduling to make sure some later pass doesn't screw up a carefully crafted schedule.

> So the overall flow should look like this:
>
> 1. DAG to MI lowering (no scheduling!)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
>    3a. De-ssa (2-address, phi elim)
>    3b. Coalescing
>    3c. Pre-scheduling packetization (optional)
>    3d. Pre-RA scheduling (or integrated packetization)
>    3e. Post-scheduling packetization (optional)
>    3f. Actual register allocation
>    3g. Packet finalization
> 4. Post-RA optimizations
> 5. PEI
> 6. Re-schedule restores, copies

This all looks fairly reasonable to me. I just have one question. How much machine-level information do we have at the point where bundling occurs? For example, can we figure out instruction sizes exactly? This kind of thing is important for certain kinds of scheduling.

If we don't have that information at this point, it might be advantageous to delay the final packetization until later, once we're at a point where we have all of the useful information we might want.

-Dave
On Dec 2, 2011, at 2:44 PM, Jakob Stoklund Olesen wrote:

> I think you should turn the glue bit upside down, so it means "This instruction is glued to its predecessor" instead of "This instruction is glued to its successor." That way, the inner bundled instructions have the glue bit, while the outer bundle headers and unbundled instructions don't.
>
> That should make your iterator a lot simpler when you have mixed bundles and unbundled MIs. It simply skips MIs with the glue bit set.

Ok. Makes sense.

> What happens when you assign a bundled MI* to a MachineBasicBlock::iterator? Does it rewind to the bundle header? There is a lot of code that treats the two as almost equivalent.

That's invalid. If you use the "top level MIs only" iterator, then it's not legal to assign it to a bundled instruction.
Evan

> /jakob
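The inverted glue bit Jakob proposes ("glued to its predecessor"), and the top-level iterator that exploits it, can be sketched with a toy model. This is an editorial sketch, not LLVM's actual iterator code; the types and names are invented for illustration:

```cpp
#include <vector>

// Toy model of the agreed representation: each MI carries a flag meaning
// "glued to its *predecessor*". Bundle headers and unbundled MIs have the
// flag clear; inner bundled MIs have it set.
struct GlueMI {
  int opcode;
  bool gluedToPred;
};

// The top-level iteration visits only bundle headers and standalone MIs:
// it simply skips any MI glued to its predecessor, exactly as Jakob
// describes. Returned here as a list of visited opcodes for clarity.
std::vector<int> topLevelOpcodes(const std::vector<GlueMI> &block) {
  std::vector<int> result;
  for (const GlueMI &mi : block)
    if (!mi.gluedToPred)          // headers / unbundled MIs only
      result.push_back(mi.opcode);
  return result;
}
```

With the inverted bit, a mixed block of bundles and unbundled MIs needs no special-casing at all: the iterator's skip condition is a single flag test, and an unbundled MI is just a degenerate "bundle of one".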
On Dec 2, 2011, at 2:34 PM, Sergei Larin wrote:

> Evan,
>
> I will need to comprehend it better, but one small comment right away…
>
> Did we not discuss one more option for bundle implementation – global cycle ID. We would add an unsigned int field to MI

No, I don't remember discussing this.

> definition representing "global scheduling cycle". All MIs with the same global cycle value belong to one group/packet. Zero means unscheduled MI.
>
> That is light weight and position independent (meaning that if instructions are added/reordered by a pass that does not care about packets/groups, the original grouping can easily be restored; with a single-bit marking, a newly inserted instruction breaks the encoding unless a special API is used for that). The only real downside I see is that iterating over the members of a single packet becomes a search, but if the scope is a BB it is not a big deal.

Sorry, I'm not seeing the advantages of this. It actually requires one additional field to keep the id, and all the passes which can potentially move code around, break up MBBs, etc. will have to maintain it.

Evan

> This way we could also effectively represent NOOP bundles (no bundle for a certain cycle value) – VLIW cycles with no instructions issued – or estimate scheduling depth easily, etc.
>
> I am not voting here in any way, I just wanted to bring it up for discussion.
>
> Sergei Larin
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
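For reference, the "Lowering Bundles to MCInst" scheme from the RFC upthread (concatenating a bundle into one instruction, with the member opcodes stored as integer operands) can be sketched as follows. These are toy types standing in for MCInst/MCOperand, and the opcode values are illustrative assumptions:

```cpp
#include <vector>

// Toy operand: either a register number or an integer (immediates, and
// member opcodes once the bundle is flattened).
struct Op {
  enum Kind { Reg, Imm } kind;
  int value;
};

struct ToyInst {
  int opcode;
  std::vector<Op> operands;
};

// Flatten a bundle into a single instruction, "op1 op2 r0, r1, r2, r3,
// r0, #c" variant: all member opcodes first, as integer operands, then
// every member's operands in order. The streamer/encoder can recover the
// per-slot boundaries from the opcodes' known operand counts.
ToyInst lowerBundle(const std::vector<ToyInst> &bundle, int bundleOpcode) {
  ToyInst out{bundleOpcode, {}};
  for (const ToyInst &mi : bundle)           // member opcodes as immediates
    out.operands.push_back({Op::Imm, mi.opcode});
  for (const ToyInst &mi : bundle)           // then all operands, in order
    for (const Op &o : mi.operands)
      out.operands.push_back(o);
  return out;
}
```

The "op1 r0, r1, r2, op2, r3, r0, #c" interleaved variant from the RFC would instead emit each opcode immediately before its own operands; either layout avoids adding a bundle concept to the MC layer.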
On Dec 2, 2011, at 2:41 PM, Sergei Larin wrote:

> … and yes, one more thing. On some architectures it might be desirable to know the _order_ of instructions in the packet. That is a bit trickier….

Isn't that just the order of the instructions in the list? I don't see anything that prevents getting the order of instructions. It might require iterating over the MIs in the packet, but for VLIW-like targets the number of instructions per packet should be small.

Also note, passes that require this kind of information should aggregate and maintain the data themselves. Information like the schedule cycle belongs in a "schedule unit", not in the MI.

Evan

> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
On Dec 2, 2011, at 2:44 PM, David A. Greene wrote:

> This all looks fairly reasonable to me. I just have one question. How much machine-level information do we have at the point where bundling occurs? For example, can we figure out instruction sizes exactly? This kind of thing is important for certain kinds of scheduling.

That depends on the target.

> If we don't have that information at this point, it might be advantageous to delay the final packetization until later, once we're at a point where we have all of the useful information we might want.

Targets are allowed to delay packetization until later. The only restriction is that packets cannot be finalized until register allocation is done, because we do not want to add virtual register defs and uses to bundle instructions.

Evan
Hi,

I'm glad to see some action with regard to static instruction scheduling and VLIW support in LLVM. I have some questions and remarks which might not be relevant, as I'm not totally familiar with the current code generation framework of LLVM nor with your plan.

On 12/02/2011 10:40 PM, Evan Cheng wrote:

> 2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequence of instructions that must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late.

The concept of a "VLIW bundle" is to mark a set of instructions that should/could be executed in *parallel*: a static parallel instruction schedule for a single instruction cycle, that is.

In other words, with a VLIW target a bundle might not be just "an atomic, possibly sequentially executed chunk of instructions" or "a set of instructions that can be executed in parallel but also sequentially". On some architectures, sequential execution might break the schedule due to visible function unit pipeline latencies and the absence of hardware interlocking.

Is it wise to mix the two concepts of "parallel instructions" and the looser "instructions that should be executed together"? The "parallel semantics" implies changes to how scheduling is done (the earliest/latest cycle where an instruction can be scheduled) and also to, e.g., the register allocator's live ranges (if allocating registers on "packetized", i.e. parallel, code). Moreover, the definition of a VLIW parallel bundle implies that there can be no "intra-bundle dependencies"; otherwise those instructions could not really be executed in parallel.
For example, looking at your example of a bundle with "intra-bundle dependencies":

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

In the case of a static VLIW target, the semantics of this instruction is that these two "RISC instructions" are executed in parallel, period. Thus, the first instruction cannot depend on the latter (or the other way around), but op2 reads the old value of r0, not the one written in the same bundle.

It depends on the architecture's data hazard detection support, register file bypasses, etc. whether the r0 update of the 1st instruction is available to the second instruction in the bundle, or whether the new r0 value can be read only by the succeeding instruction bundles. If it is available, the execution is sequential in reality, as op1 must produce the value before op2 can execute.

Itanium machines are an example of "parallel bundle architectures" (and of course so are other, more traditional VLIWs, like the TI C64x [2]):

"EPIC allows compilers to define independent instruction sequences, which allows hardware to ignore dependency checks between these instructions. This same hardware functionality in OOO RISC designs is very costly and complex." [1]

As an example of "not truly parallel instruction bundles", on the other hand, we have played a bit with the Cell SPU, which is a quite static architecture but still has hardware data hazard detection and hardware interlocking. It would differentiate between your case and the one where the order is different, because it follows the sequential instruction order in its hardware data dependence resolving logic and stalls the pipeline (thus does not really execute the instructions in parallel) if the sequential order has data hazards.
For how to actually represent the (parallel) instruction bundles I do not have a strong opinion, as long as this semantic difference between a "parallel bundle" and "just a chunk of instructions that should be executed together" is made clear and adhered to everywhere in code generation.

[1] http://www.dig64.org/about/Itanium2_white_paper_public.pdf
[2] http://www.ti.com/lit/ug/spru395b/spru395b.pdf

Best regards,
--
Pekka from the TCE project
http://tce.cs.tut.fi
It's important to understand that this is a proposal for a code generator IR change, not for specific register allocator or scheduler passes. Passes that want to understand more about the internals of instruction bundles are free to design their own data structures.

It is imperative that we do not add multiple constructs for various types of bundles. That would add significant overhead for the rest of the code generator.

I have discussed the proposal with the current code owners of the register allocator and instruction scheduler. This is a proposal that will work with the existing infrastructure (or modules being planned for the near future).

Evan

On Dec 3, 2011, at 7:12 AM, Pekka Jääskeläinen wrote:

> Hi,
>
> I'm glad to see some action with regard to static instruction
> scheduling and VLIW support in LLVM. I have some questions and
> remarks which might not be relevant as I'm not totally familiar
> with the current code generation framework of LLVM nor your plan.
>
> On 12/02/2011 10:40 PM, Evan Cheng wrote:
>> 2. It must be flexible enough to represent more than VLIW bundles. It should be
>> useful to represent arbitrary sequence of instructions that must be scheduled as
>> a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random
>> instruction sequences that are currently modeled as pseudo instructions that are
>> expanded late.
>
> The concept of a "VLIW bundle" is to mark a set of instructions that
> should/could be executed in *parallel*. A static parallel instruction
> schedule for a single instruction cycle, that is.
>
> In other words, with a VLIW target a bundle might not be just "an atomic,
> possibly sequentially executed chunk of instructions" or "a set of
> instructions that can be executed in parallel but also sequentially".
> In some architectures, the sequential execution might break the schedule
> due to visible function unit pipeline latencies and no hardware interlocking.
> [...]
Hi Evan,

Thanks for sending out the proposal. This pass will, of course, be very important for the Hexagon backend. I have to study the proposal in more detail but overall, it looks good to me.

> I would also like to see a target independent interface for pre-scheduling
> optimizations that form instruction sequences (e.g. macro-fusion).

My team will be happy to help with the tasks and specifically with the packet-formation pass. We already have a pass that forms packets for Hexagon and we can generalize the pass and make it machine-independent.

-Anshu

--
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

On 12/2/2011 2:40 PM, Evan Cheng wrote:

> *Machine Instruction Bundle in LLVM*
>
> Hi all,
>
> There have been quite a bit of discussions about adding machine
> instruction bundle to support VLIW targets. I have been pondering what
> the right representation should be and what kind of impact it might
> have on the LLVM code generator. I believe I have a fairly good plan
> now and would like to share with the LLVM community.
>
> *Design Criteria*
>
> 1. The bundle representation must be light weight. We cannot afford to
> add significant memory or compile time overhead.
> 2. It must be flexible enough to represent more than VLIW bundles. It
> should be useful to represent arbitrary sequence of instructions that
> must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare +
> branch macro-fusion, or random instruction sequences that are
> currently modeled as pseudo instructions that are expanded late.
> 3. Minimize the amount of changes required in the LLVM code generator,
> especially in target independent passes. It must minimize code
> duplication (i.e. we don't want code snippets that search for bundle
> start / end like all the code in the backend that skip over DBG_VALUE).
> 4. The representation should make it easy for new code to be oblivious
> of bundles. That is, MI passes should not have to check whether
> something is a bundle.
> Given the above, we can rule out a new class (e.g. MachineInstrBundle)
> right away. We don't want MachineBasicBlock to keep a list of
> MachineInstrBundles since it would require a massive amount of code
> change. So what are the choices?
>
> *Bundle Representation*
>
> 1. A nested MachineInstr: This is the most natural (meaning it looks
> most like the real HW bundle) representation. It has the nice property
> that most passes do not have to check if a MI is a bundle. The concern
> here is that this can add significant memory overhead if it means
> adding an ilist or SmallVector field to keep bundled MIs.
> 2. Add a bit to MachineInstr: The bit means the next MI in the list is
> part of the same bundle. This is very light weight. However, it
> requires many passes to check whether a MI is part of a bundle.
>
> The solution is a combination of both #1 and #2. Conceptually we want
> a representation that looks like this:
>
>  --------------
>  |   Bundle   | -------
>  --------------        \
>       |           ----------------
>       |           |      MI      |
>       |           ----------------
>       |                  |
>       |           ----------------
>       |           |      MI      |
>       |           ----------------
>       |                  |
>       |           ----------------
>       |           |      MI      |
>       |           ----------------
>       |
>  --------------
>  |   Bundle   | ------
>  --------------        \
>       |           ----------------
>       |           |      MI      |
>       |           ----------------
>       |                  |
>       |           ----------------
>       |           |      MI      |
>       |           ----------------
>       |                  |
>       |           ...
>       |
>  --------------
>  |   Bundle   | ------
>  --------------        \
>       |
>      ...
>
> This is #1, a series of nested MIs. However, we are going to store
> the instructions in the same way as it's done right now, i.e. a
> list<MachineInstr> on MachineBasicBlocks. Using #2, we will add a bit
> to MI that indicates whether it is part of a bundle:
>
>  ----------------
>  |     MI *     |   (* bit indicates the next MI is "glued" to this MI,
>  ----------------      i.e. in the same bundle)
>         |
>  ----------------
>  |     MI *     |
>  ----------------
>         |
>  ----------------
>  |     MI       |   (no bit, this is the end of the bundle)
>  ----------------
>         |
>  ----------------
>  |     MI *     |   (* a new bundle)
>  ----------------
>         |
>  ----------------
>  |     MI       |
>  ----------------
>         |
>        ...
>
> We are going to hide the complexity in the MachineBasicBlock::iterator
> instead. That is, the iterator will be changed to visit only the *top
> level* instructions (i.e. the first instruction in each bundle). We will
> add another iterator that allows clients to visit all of the MIs, for
> those passes that want to look into bundles.
>
> We can use the same representation for arbitrary sequences of
> instructions that cannot be broken up, e.g. Thumb2 IT blocks:
>
>  ----------------
>  |     MI       |   (just a MI)
>  ----------------
>         |
>  ----------------
>  |     MI *     |   (* start of Thumb2 IT block)
>  ----------------
>         |
>  ----------------
>  |     MI *     |
>  ----------------
>         |
>  ----------------
>  |     MI       |   (last MI in the block)
>  ----------------
>         |
>  ----------------
>  |     MI       |
>  ----------------
>         |
>        ...
>
> This representation can support VLIW (where top level MIs are all
> starts of bundles) or non-VLIW (where there can be a mix of MIs and
> bundles). It is also very cheap since the "Flags" field has plenty of
> free bits available.
>
> *Properties of Bundle*
>
> If MI passes can consider each bundle as a single unit, then how are
> they going to examine properties (i.e. flags and operands) of a MI
> bundle? Conceptually the properties of a bundle are the union of the
> properties of all the MIs inside the bundle. So a bundle reads all the
> inputs that the individual MIs read and it defines all the outputs of
> the individual MIs. However, this is not correct when there are
> intra-bundle dependencies. e.g.
> -------------------------
> | r0 = op1 r1, r2       |
> | r3 = op2 r0<kill>, #c |
> -------------------------
>
> r0 should not be considered as a source of the bundle since it's
> defined inside the bundle and its live range does not extend beyond
> it. Instead, r0 is a clobber (i.e. a dead def).
>
> -------------------------
> | r0 = op1 r1, r2       |
> | r3 = op2 r0, #c       |
> -------------------------
> ...
>      = op3 r0,
>
> Here r0 is a def, not a use.
>
> What does this mean? It means that in order for passes to operate on a
> bundle at a time, they must be able to visit all the defs and uses of a
> bundle. We have established that computing the defs and uses of a
> bundle is not as trivial as taking the union. This is certainly not
> something we want to re-compute every time! This requires a slight
> change to the bundle representation:
>
>  ----------------
>  |   Bundle *   |   (a MI with the special opcode "Bundle")
>  ----------------
>         |
>  ----------------
>  |     MI *     |
>  ----------------
>         |
>  ----------------
>  |     MI *     |
>  ----------------
>         |
>  ----------------
>  |     MI       |   (no bit, this is the end of the bundle)
>  ----------------
>         |
>  ----------------
>  |   Bundle *   |   (a new bundle)
>  ----------------
>         |
>  ----------------
>  |     MI *     |
>  ----------------
>         |
>  ----------------
>  |     MI       |
>  ----------------
>         |
>        ...
>
> The pseudo bundle instructions should be used to capture properties of
> the bundle. When a bundle is finalized, the packetizer must add source
> and def operands to the pseudo bundle instruction. More on this later.
>
> Other properties, such as mayLoad and mayStore, are static properties
> associated with opcodes. They cannot be copied. We will add APIs to
> examine properties of MIs which will do the *right thing* for bundles
> (i.e. look into the MIs in bundles).
>
> *Packetizing*
>
> The current MI flow looks like this:
>
> 1. DAG to MI lowering (and pre-RA schedule)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi elim)
> 3b. Coalescing
> 3c. Actual register allocation
> 4. Post-RA optimizations
> 5. PEI
> 6. Post-RA scheduling
>
> In the hopefully not very distant future it should look like this:
>
> 1. DAG to MI lowering (no scheduling!)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi elim)
> 3b. Coalescing
> 3c. *Pre-RA scheduling*
> 3d. Actual register allocation
> 4. Post-RA optimizations
> 5. PEI
> 6. *Re-schedule restores, copies*
>
> The current proposal is for "packetization" to be done as part of the
> "RA super pass". Early MI optimization passes such as LICM do not
> benefit from operating on bundles. Furthermore, the packetizer should
> not have to know how to deal with copies which may later be coalesced,
> phi nodes, or other copy-like pseudo instructions.
>
> Packetization should be done in two phases. The first part decides
> which MIs should be bundled together and adds the "bits" which glue the
> MIs together. This can be done either before or after pre-RA
> scheduling. The second part of the packetization should only be done
> after register allocation is completed. There are two very important
> reasons for this.
>
> 1. Packet finalization *must* add source and def operands to the
> "Bundle" pseudo MI. This allows all later passes to handle them
> transparently. However, we do not want to do this before register
> allocation is complete. Otherwise it introduces new defs and uses of
> virtual registers and that messes up the MachineRegisterInfo def-use
> chains. e.g. now vr0 has two defs!
>
>  defs: vr0<dead>, vr3, uses: vr1, vr2
>  ----------------------------
>  | vr0 = op1 vr1, vr2       |
>  | vr3 = op2 vr0<kill>, #c  |
>  ----------------------------
>
> 2. During register allocation, more identity copies will be eliminated
> while loads, stores, copies, and re-materialized instructions will be
> introduced. It makes sense for the second part of packetization to try
> to fill these new instructions into empty slots (for VLIW-like targets).
> So the overall flow should look like this:
>
> 1. DAG to MI lowering (no scheduling!)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi elim)
> 3b. Coalescing
> 3c. *Pre-scheduling packetization (optional)*
> 3d. Pre-RA scheduling (or *integrated packetization*)
> 3e. *Post-scheduling packetization (optional)*
> 3f. Actual register allocation
> 3g. *Packet finalization*
> 4. Post-RA optimizations
> 5. PEI
> 6. Re-schedule restores, copies
>
> *Lowering Bundles to MCInst*
>
> There is no need to add the equivalent of a MI bundle to MCInst. A MI
> bundle should be concatenated into a single MCInst by storing opcodes
> as integer operands. e.g.
>
> -------------------------
> | r0 = op1 r1, r2       |
> | r3 = op2 r0, #c       |
> -------------------------
>
> =>
>
> MCInst: op1 r0, r1, r2, op2, r3, r0, #c
> or
> MCInst: op1 op2 r0, r1, r2, r3, r0, #c
>
> *What's Next?*
>
> I am hoping to find some time to implement the following in the near
> future:
>
> 1. Add BUNDLE opcode.
> 2. MachineInstr class changes: new bit, changes to methods such as
> eraseFromParent(), isIdenticalTo().
> 3. Change MachineInstr::iterator to skip over bundled MIs. Rename the
> old iterator.
> 4. Add MachineInstr APIs to check for instruction properties and switch
> existing code over.
> 5. Add an API to form a bundle. It would compute the proper defs and
> uses and add MachineOperands to the bundle MI.
> 6. Switch Thumb2 IT blocks to using MI bundles.
> 7. Add an interface for targets to register their own packetization
> passes.
>
> I would dearly welcome help on any of these tasks, especially 4, 5, and
> 6. I also would not cry if someone beats me to #6 (or actually any of
> the tasks. :-)
>
> In the longer term, I would like to see a target independent
> packetization pass (I believe one is being reviewed). I would also
> like to see a target independent interface for pre-scheduling
> optimizations that form instruction sequences (e.g. macro-fusion).
> Patches welcome!
>
> Evan
On Dec 2, 2011, at 12:40 PM, Evan Cheng wrote:

> There have been quite a bit of discussions about adding machine instruction
> bundle to support VLIW targets. I have been pondering what the right
> representation should be and what kind of impact it might have on the LLVM
> code generator. I believe I have a fairly good plan now and would like to
> share with the LLVM community.

Let me add some information about how the register allocator handles this extension.

First of all, I agree it is important that this can be used for more than VLIW targets. Besides the uses you mention, I would like to use bundles for DBG_VALUE instructions so we can avoid silly code like this:

  MachineBasicBlock::iterator InsertPos = mi;
  while (InsertPos != MBB->begin() && llvm::prior(InsertPos)->isDebugValue())
    --InsertPos;

  MachineBasicBlock::iterator From = KillMI;
  MachineBasicBlock::iterator To = llvm::next(From);
  while (llvm::prior(From)->isDebugValue())
    --From;

I also think bundles can be used for implementing parallel copies.

Value Semantics
===============

The register allocator and probably most of the target-independent code generator won't care if instructions in a bundle are executed sequentially or in parallel. We do care about value semantics, though. That is, which value is being read by an instruction operand in a bundle.

We definitely want to support the parallel execution semantics where all instructions in a bundle read values defined outside the bundle. This is a swap implemented as a parallel copy bundle:

  { R2 = R3; R3 = R2; }

However, even VLIW targets like Hexagon can read values defined inside the same bundle:

  {
    P0 = cmp.eq(R2,#4)
    if (!P0) R5 = #5
    if (P0.new) R3 = memw(R4)
  }

This Hexagon bundle reads both the P0 predicate register defined inside the bundle (P0.new) and the value defined outside the bundle (!P0). We need to support this. I propose that we add a new MachineOperand flag, IsInternalRead, to represent this.
The flag will mean "This operand is reading a value defined within the same instruction/bundle, not a value from outside the instruction/bundle." The register allocator will treat the <internal> flag almost exactly like it currently treats <undef>, but there is a big difference to the target-specific code.

The register allocator and other target-independent passes don't care if there are multiple defs of the same register inside a bundle. Values are only tracked at the bundle granularity. The semantics of multiple defs is target-defined.

To summarize, all instructions in a bundle read values defined outside the bundle, unless explicitly marked as bundle-internal reads. Multiple defs inside a bundle are indistinguishable except to the target.

Rewriting
=========

The register allocator super-pass needs to rewrite register operands. Virtual-to-virtual rewriting happens during coalescing and live range splitting. Virtual-to-physical rewriting happens only once at the end.

When rewriting virtual registers, a minimal understanding of value semantics is required. In particular, it is possible to split a live range right down the middle of an instruction:

  %vr0 = add %vr0, 1

May be rewritten as:

  %vr2 = add %vr1, 1

This is assuming the add doesn't have two-address constraints, of course.

When rewriting bundle operands, the <internal> flag will be sufficient to determine the correct virtual register. For example:

  {
    %vr0 = cmp.eq(R2,#4)
    if (!%vr0) R5 = #5
    if (%vr0<internal>) R3 = memw(R4)
  }

could be rewritten after live range splitting as:

  {
    %vr2 = cmp.eq(R2,#4)
    if (!%vr1) R5 = #5
    if (%vr2<internal>) R3 = memw(R4)
  }

Constraining spill code insertion
=================================

It is important to note that bundling instructions doesn't constrain the register allocation problem.
For example, this bundle would be impossible with sequential value constraints:

  {
    call foo
    %vr0 = addps %vr1, %vr2
    call bar
  }

The calls clobber the xmm registers, so it is impossible to register allocate this code without breaking up the bundle and inserting spill code between the calls.

With our definition of bundle value semantics, the addps is reading %vr1 and %vr2 outside the bundle, and the call clobbers are not considered relevant. The fact that the call clobbers all registers before addps is the target's problem. This is very similar to how inline assembly is treated.

TL;DR
=====

By adding a new <internal> flag to MachineOperand, the register allocator can effectively treat a bundle as a single instruction. All MachineOperands inside a bundle are treated as if they all belong to the single instruction. This even works when rewriting operands.

/jakob
On Dec 5, 2011, at 1:50 PM, Jakob Stoklund Olesen wrote:

> I would like to use bundle DBG_VALUE instructions so we can avoid silly code like this:
>
>   MachineBasicBlock::iterator InsertPos = mi;
>   while (InsertPos != MBB->begin() && llvm::prior(InsertPos)->isDebugValue())
>     --InsertPos;
>
>   MachineBasicBlock::iterator From = KillMI;
>   MachineBasicBlock::iterator To = llvm::next(From);
>   while (llvm::prior(From)->isDebugValue())
>     --From;

Thinking about it, this probably wouldn't work for pre-RA passes that can delete code. They could inadvertently delete DBG_VALUE instructions.

/jakob
On 12/05/2011 11:50 PM, Jakob Stoklund Olesen wrote:

> By adding a new <internal> flag to MachineOperand, the register allocator can
> effectively treat a bundle as a single instruction. All MachineOperands
> inside a bundle are treated as if they all belong to the single instruction.
> This even works when rewriting operands.

This sounds like a simple and good solution to the "parallel bundle" semantic worries I had.

What about instruction scheduling? Has anyone thought how/if isched could work with parallel bundles? That is, how to create the bundles (referred to as "packing" by some) in the first place (before and/or after RA). I'm not familiar enough with the current LLVM isched work, so I do not know if the current algorithms could be used for static scheduling for VLIWs. Is one really required to write a separate "packer" that just tries to pack sequential(?) instructions into bundles without trying to reorder to improve the packing density?

How I see it, "packing" is just instruction scheduling with an additional "what can be in a single bundle and in what order" scheduling constraint, and it has fixed cycles (bundles) for the instructions as an outcome. The bundle constraint (the wide instruction template(s) supported by the machine) can be implemented with the state machine approach or just a resource table.

VLIW scheduling also requires multiplicity for the FU resource constraints. Can those be described in the current TableGen format? Clustered (or, in general, not-fully-connected) VLIWs also require connectivity information (which FUs are connected to which RFs), but that can be modeled with register classes, I've been told.

--
--Pekka
Indeed, there are strict VLIW architectures out there, and VLIW architectures that leverage some aspects of conventional architectures, sometimes taking advantage of how the microarchitecture was implemented.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

For instance, some such VLIW architectures allow the programmer to specify whether "r0" in "op2" will have the value from t-1 (before the bundle) or from t+1 (the result produced by "op1"), if anything because the forwarding of results is close enough in the pipeline to be used in the same bundle. It seems to me that such subtleties are better left near the target-dependent code, and that the target-independent code should refrain from making strict interpretations of what a bundle should look like.

When it gets to the MC layer, it seems to me that overloading the MCInst operands with opcodes is a bit unusual, but, at first glance, it might not be too hard to handle.

--
Evandro Menezes
Austin, TX
emenezes at codeaurora.org
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

On 12/03/11 09:12, Pekka Jääskeläinen wrote:

> Hi,
>
> I'm glad to see some action with regard to static instruction
> scheduling and VLIW support in LLVM. I have some questions and
> remarks which might not be relevant as I'm not totally familiar
> with the current code generation framework of LLVM nor your plan.
>
> On 12/02/2011 10:40 PM, Evan Cheng wrote:
>> 2. It must be flexible enough to represent more than VLIW bundles. It should be
>> useful to represent arbitrary sequence of instructions that must be scheduled as
>> a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random
>> instruction sequences that are currently modeled as pseudo instructions that are
>> expanded late.
>
> The concept of a "VLIW bundle" is to mark a set of instructions that
> should/could be executed in *parallel*.
> [...]
Hi Evan,

I just read your proposal and the following discussion on VLIW support and want to share my experience of writing a VLIW back-end for LLVM.

I would not integrate the packetizer into the register allocator super class, since that would reduce the back-end developer's flexibility to add optimization passes after the packetizer. Instead, I would add the packetizer as a separate pass. It is true that the packetizer must then deal with PHI and COPY nodes that are eliminated by the RA; the packetizer can simply group each PHI and COPY instruction into a single-instruction bundle.

From my experience, a simple packetizer that groups instructions into bundles (like the old IA-64 back-end did) without changing the order of the instructions produces bad code. Instead, a VLIW scheduler that directly outputs bundles produces better code. The current LLVM scheduler (at the end of the instruction selection pass) is not suitable for generating bundled instructions since it operates on scheduling units for glued instructions. However, the post-RA scheduler in combination with a VLIW-aware hazard recognizer can be used before RA to bundle and schedule instructions for VLIW architectures. Only small modifications to the post-RA scheduler classes are necessary to support virtual registers.

I also would not include packet finalization in the register allocator super class, since the following prolog and epilog code insertion (PECI) pass also adds extra instructions to the instruction list. So I would run packet finalization after prolog and epilog code insertion. Both the RA and PECI can add their instructions as single-instruction bundles that packet finalization can then merge into larger bundles. For packet finalization it also makes sense to perform a post-RA VLIW scheduling.

Timo

From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Evan Cheng
Sent: Friday, 2 December 2011 21:40
To: LLVM Dev
Subject: [LLVMdev] RFC: Machine Instruction Bundle

Machine Instruction Bundle in LLVM

Hi all,

There has been quite a bit of discussion about adding machine instruction bundles to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share it with the LLVM community.

Design Criteria

1. The bundle representation must be lightweight. We cannot afford to add significant memory or compile time overhead.
2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequences of instructions that must be scheduled as a unit, e.g. the ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late.
3. Minimize the amount of changes required in the LLVM code generator, especially in target independent passes. It must minimize code duplication (i.e. we don't want code snippets that search for bundle start / end like all the code in the backend that skips over DBG_VALUE).
4. The representation should make it easy for new code to be oblivious of bundles. That is, MI passes should not have to check whether something is a bundle.

Given the above, we can rule out a new class (e.g. MachineInstrBundle) right away. We don't want MachineBasicBlock to keep a list of MachineInstrBundles since that would require a massive amount of code change. So what are the choices?

Bundle Representation

1. A nested MachineInstr: This is the most natural (meaning it looks most like the real HW bundle) representation. It has the nice property that most passes do not have to check whether an MI is a bundle. The concern here is that it can add significant memory overhead if it means adding an ilist or SmallVector field to keep the bundled MIs.
2. Add a bit to MachineInstr: The bit means the next MI in the list is part of the same bundle. This is very lightweight. However, it requires many passes to check whether an MI is part of a bundle.

The solution is a combination of both #1 and #2. Conceptually we want a representation that looks like this:

--------------
| Bundle | -------
--------------       \
     |          ----------------
     |          |      MI      |
     |          ----------------
     |                 |
     |          ----------------
     |          |      MI      |
     |          ----------------
     |                 |
     |          ----------------
     |          |      MI      |
     |          ----------------
     |
--------------
| Bundle | ------
--------------       \
     |          ----------------
     |          |      MI      |
     |          ----------------
     |                 |
     |          ----------------
     |          |      MI      |
     |          ----------------
     |                 |
     |                ...
     |
--------------
| Bundle | ------
--------------       \
     |
    ...

This is #1, a series of nested MIs. However, we are going to store the instructions in the same way as is done right now, i.e. a list<MachineInstr> on MachineBasicBlock. Using #2, we will add a bit to MI that indicates whether it is part of a bundle:

----------------
|     MI *     |    (* bit indicates the next MI is "glued" to this MI, i.e. in the same bundle)
----------------
       |
----------------
|     MI *     |
----------------
       |
----------------
|     MI       |    (no bit, this is the end of the bundle)
----------------
       |
----------------
|     MI *     |    (* a new bundle)
----------------
       |
----------------
|     MI       |
----------------
       |
      ...

We are going to hide the complexity in the MachineBasicBlock::iterator instead. That is, the iterator will be changed to visit only the *top level* instructions (i.e. the first instruction in each bundle). We will add another iterator that allows clients to visit all of the MIs, for those passes that want to look inside bundles.

We can use the same representation for arbitrary sequences of instructions that cannot be broken up, e.g. Thumb2 IT blocks:

----------------
|     MI       |    (just a MI)
----------------
       |
----------------
|     MI *     |    (* start of Thumb2 IT block)
----------------
       |
----------------
|     MI *     |
----------------
       |
----------------
|     MI       |    (last MI in the block)
----------------
       |
----------------
|     MI       |
----------------
       |
      ...

This representation can support VLIW (where the top level MIs are all starts of bundles) or non-VLIW (where there can be a mix of MIs and bundles). It is also very cheap since the "Flags" field has plenty of free bits available.

Properties of Bundle

If MI passes can consider each bundle as a single unit, then how are they going to examine the properties (i.e. flags and operands) of an MI bundle? Conceptually the properties of a bundle are the union of the properties of all the MIs inside the bundle. So a bundle reads all the inputs that the individual MIs read and it defines all the outputs of the individual MIs. However, this is not correct when there are intra-bundle dependencies, e.g.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

r0 should not be considered a source of the bundle since it is defined inside the bundle and its live range does not extend beyond it. Instead, r0 is a clobber (i.e. a dead def).

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------
 ...
 ... = op3 r0,

Here r0 is a def, not a use.

What does this mean? It means that in order for passes to operate on one bundle at a time, they must be able to visit all the defs and uses of a bundle. We have established that computing the defs and uses of a bundle is not as trivial as taking the union. This is certainly not something we want to re-compute every time! This requires a slight change to the bundle representation:

----------------
|   Bundle *   |    (an MI with the special opcode "Bundle")
----------------
       |
----------------
|     MI *     |
----------------
       |
----------------
|     MI *     |
----------------
       |
----------------
|     MI       |    (no bit, this is the end of the bundle)
----------------
       |
----------------
|   Bundle *   |    (a new bundle)
----------------
       |
----------------
|     MI *     |
----------------
       |
----------------
|     MI       |
----------------
       |
      ...

The pseudo bundle instructions should be used to capture the properties of the bundle. When a bundle is finalized, the packetizer must add source and def operands to the pseudo bundle instruction. More on this later.

Other properties, such as mayLoad and mayStore, are static properties associated with opcodes. They cannot be copied. We will add APIs to examine properties of MIs which will do the *right thing* for bundles (i.e. look into the MIs in bundles).

Packetizing

The current MI flow looks like this:

1. DAG to MI lowering (and pre-RA schedule)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Post-RA scheduling

In the hopefully not very distant future it should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-RA scheduling
   3d. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

The current proposal is for "packetization" to be done as part of the "RA super pass". Early MI optimization passes such as LICM do not benefit from operating on bundles. Furthermore, the packetizer should not have to know how to deal with copies which may later be coalesced, phi nodes, or other copy-like pseudo instructions.

Packetization should be done in two phases. The first part decides which MIs should be bundled together and adds the "bits" which glue the MIs together. This can be done before pre-RA scheduling. The second part of the packetization should only be done after register allocation is completed. There are two very important reasons for this.

1. Packet finalization *must* add source and def operands to the "Bundle" pseudo MI. This allows all later passes to handle bundles transparently. However, we do not want to do this before register allocation is complete. Otherwise it introduces new defs and uses of virtual registers, and that messes up the MachineRegisterInfo def-use chains. e.g. now vr0 has two defs!

defs: vr0<dead>, vr3, uses: vr1, vr2
----------------------------
| vr0 = op1 vr1, vr2       |
| vr3 = op2 vr0<kill>, #c  |
----------------------------

2. During register allocation, more identity copies will be eliminated, while loads, stores, copies, and re-materialized instructions will be introduced. It makes sense for the second part of packetization to try to fill these new instructions into empty slots (for VLIW-like targets).

So the overall flow should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi elim)
   3b. Coalescing
   3c. Pre-scheduling packetization (optional)
   3d. Pre-RA scheduling (or integrated packetization)
   3e. Post-scheduling packetization (optional)
   3f. Actual register allocation
   3g. Packet finalization
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

Lowering Bundles to MCInst

There is no need to add the equivalent of an MI bundle to MCInst. An MI bundle should be concatenated into a single MCInst by storing the opcodes as integer operands. e.g.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------

=>

MCInst: op1 r0, r1, r2, op2, r3, r0, #c
or
MCInst: op1 op2 r0, r1, r2, r3, r0, #c

What's Next?

I am hoping to find some time to implement the following in the near future:
1. Add the BUNDLE opcode.
2. MachineInstr class changes: new bit, changes to methods such as eraseFromParent(), isIdenticalTo().
3. Change MachineInstr::iterator to skip over bundled MIs. Rename the old iterator.
4. Add MachineInstr APIs to check for instruction properties and switch existing code over.
5. Add an API to form a bundle. It would compute the proper defs and uses and add MachineOperands to the bundle MI.
6. Switch the Thumb2 IT block to using MI bundles.
7. Add an interface for targets to register their own packetization passes.

I would dearly welcome help on any of these tasks, especially on 4, 5, 6. I also would not cry if someone beats me to #6 (or actually any of the tasks). :-)

In the longer term, I would like to see a target independent packetization pass (I believe one is being reviewed). I would also like to see a target independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion). Patches welcome!

Evan
Hi Evan/Evan.

This is great! I've been experimenting on a VLIW target by adding a FinalizeBundle loop and a combination of the Scoreboard / Schedule hazard classes. The only question that is nagging me now is how to pass the scheduling information (or the flags) down to PECI, to 'choose' the correct opcode. I am currently hacking a few bits off MachineInstr's Flags, but somebody suggested avoiding this and recommended adding a dummy operand instead. I'm not sure if that is the correct way. Any comments on that?

Thanks,
Girish.

From: "Stripf, Timo (ITIV)" <timo.stripf at kit.edu>
To: Evan Cheng <evan.cheng at apple.com>; LLVM Dev <llvmdev at cs.uiuc.edu>
Sent: Wednesday, 11 January 2012 6:26 PM
Subject: Re: [LLVMdev] RFC: Machine Instruction Bundle

>Hi Evan,
>
>I just read your proposal and the following discussion for VLIW support and want to share my experience of writing a VLIW back-end for LLVM.
>[...]
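The two options Girish mentions above can be sketched on a toy MI (hypothetical types, field names, and bit layout — not LLVM's actual MachineInstr): stealing spare bits of a flags word, versus appending a dummy immediate operand that later passes must know to ignore and eventually strip.

```cpp
#include <cstdint>
#include <vector>

// Toy MI carrying a VLIW slot number to a later pass, two ways.
struct MI {
  uint32_t Flags = 0;
  std::vector<int64_t> Operands;
};

// Option 1: steal spare bits of the flags word (the "hacking bits off
// MachineInstr.Flags" approach). Bit position chosen arbitrarily here.
constexpr uint32_t SlotShift = 24;
constexpr uint32_t SlotMask = 0xFu << SlotShift;
void setSlotInFlags(MI &I, unsigned Slot) {
  I.Flags = (I.Flags & ~SlotMask) | ((Slot << SlotShift) & SlotMask);
}
unsigned getSlotFromFlags(const MI &I) {
  return (I.Flags & SlotMask) >> SlotShift;
}

// Option 2: append a dummy immediate operand carrying the slot. Cleaner
// separation from core flags, but every pass between here and emission
// must tolerate the extra operand.
void setSlotAsOperand(MI &I, unsigned Slot) { I.Operands.push_back(Slot); }
unsigned getSlotFromOperand(const MI &I) {
  return static_cast<unsigned>(I.Operands.back());
}
```

The trade-off: flag bits are invisible to operand-walking code but compete with other users of the flags word; a dummy operand survives serialization and is self-describing, at the cost of confusing any pass that inspects operands naively.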
On Jan 11, 2012, at 4:56 AM, Stripf, Timo (ITIV) wrote:> Hi Evan, > > I just read your proposal and the following discussion for VLIW support and want to share my experience of writing a VLIW back-end for LLVM. > > > I would not integrate the packetizer into the register allocator super class since it would reduce the flexibility for the back-end developer to add some optimization passes after the packetizer. Instead, I would add the packetizer as a separate pass. It is true that the packetizer must deal in that case with PHI and COPY nodes that are eliminated by the RA. The packetizer can simple group all PHI and COPY instruction into single bundles consisting of only one instruction. > > From my experience a simple packetizer that groups instruction into bundles (like the old IA-64 back-end did) without changing the order of the instructions produces bad code. Instead, a VLIW scheduler that directly outputs bundles produces better code. The current LLVM scheduler (at the end of the instruction selection pass) is not suitable to generate bundled instructions since it operates on scheduling units for glued instructions. > However, the post-RA scheduler in combination with a VLIW-aware hazard recognizer can be used before RA to bundle and schedule instructions for VLIW architectures. Only small modifications within the post-RA scheduler classes to support virtual registers are necessary.What you're describing is how we expect some packetizers to be implemented. Although it's really up the the person implementing the packetizer whether or not to integrate it with scheduling. The point is that a framework will support both scheduling and bundling before RA (and after coalescing).> I also would not include packet finalization into the register allocator super class since also the following pre- and epilog code insertion (PECI) pass adds extra instruction into the instruction list. So I would add the packet finalization after pre- and epilog code insertion. 
Both the RA and PECI can add its instruction into single bundles that can be integrated into larger bundles within packet finalization. For packet finalization it also makes sense to perform a post-ra VLIW scheduling.I think Evan referred to a regalloc "super pass". This means nothing more than a set of passes that all require and preserve liveness information. Once physical registers are assigned and we throw away vreg liveness, then passes can view bundles as individual instructions. That's what I thin of as "packet finalization". Nothing really prevents rebundling though. The late bundler just may not have as much freedom. -Andy> > Timo > > > Von: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] Im Auftrag von Evan Cheng > Gesendet: Freitag, 2. Dezember 2011 21:40 > An: LLVM Dev > Betreff: [LLVMdev] RFC: Machine Instruction Bundle > > Machine Instruction Bundle in LLVM > > Hi all, > > There have been quite a bit of discussions about adding machine instruction bundle to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share with the LLVM community. > > Design Criteria > > 1. The bundle representation must be light weight. We cannot afford to add significant memory or compile time overhead. > 2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequence of instructions that must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late. > 3. Minimize the amount of changes required in the LLVM code generator, especially in target independent passes. It must minimize code duplication (i.e. 
we don't want code snippets that search for bundle start / end like all the code in the backend that skip over DBG_VALUE). > 4. The representation should make it easy for new code to be oblivious of bundles. That is, MI passes should not have to check whether something is a bundle. > > Given the above, we can rule out a new class (e.g. MachineInstrBundle) right away. We don't want MachineBasic block to keep a list of MachineInstrBundles since it will require massive amount of code change. So what are the choices? > > Bundle Representation > > 1. A nested MachineInstr: This is the most natural (meaning it looks most like the real HW bundle) representation. It has the nice property that most passes do not have to check if a MI is a bundle.The concern here this can add significant memory overhead if this means adding a ilist or SmallVector field to keep bundled MIs. > 2. Add a bit to MachineInstr: The bit means the next MI in the list is part of the same bundle. This is very light weight. However it requires many passes to check wether a MI is part of a bundle. > > The solution is a combination of both #1 and #2. Conceptually we want a representation that looks like this: > > -------------- > | Bundle | ------- > -------------- \ > | ---------------- > | | MI | > | ---------------- > | | > | ---------------- > | | MI | > | ---------------- > | | > | ---------------- > | | MI | > | ---------------- > | > -------------- > | Bundle | ------ > -------------- \ > | ---------------- > | | MI | > | ---------------- > | | > | ---------------- > | | MI | > | ---------------- > | | > | … > | > -------------- > | Bundle | ------ > -------------- \ > | > ... > > > This is #1, a series of nested MI's. However, we are going to store the instructions in the same way as it's done right now, i.e. a list<MachineInstr> on MachineBasicBlocks. Using #2, we will add a bit to MI that indicates whether it is part of a bundle. 
> > ---------------- > | MI * | (* bit indicates next MI is "glued" to this MI, i.e. in the same bundle) > ---------------- > | > ---------------- > | MI * | > ---------------- > | > ---------------- > | MI | (no bit, this is the end of the bundle) > -------------- > | > ---------------- > | MI * | (* a new bundle) > ---------------- > | > ---------------- > | MI | > ---------------- > | > ... > > We are going to hide the complexity in the MachineBasicBlock::iterator instead. That is, the iterator will be changed to visit only the *top level* instructions (i.e. first instruction in each bundle). We will add another iterator that allows client to visit all of the MIs for those passes that want to look into bundles. > > We can use the same representation for arbitrary sequence of instructions that cannot be broken up. e.g. Thumb2 IT blocks. > > ---------------- > | MI | (just a MI) > ---------------- > | > ---------------- > | MI * | (* Start of Thumb2 IT block) > ---------------- > | > ---------------- > | MI * | > ---------------- > | > ---------------- > | MI | (last MI in the block) > ---------------- > | > ---------------- > | MI | > ---------------- > | > ... > > This representation can support VLIW (where top level MI's are all start of bundles) or non-VLIW (where there can be mix of MIs and bundles). It is also very cheap since the "Flags" field has plenty of free bits available. > > Properties of Bundle > > If MI passes can consider each bundle as a single unit, then how are they going to examine properties (i.e. flags and operands) of a MI bundle? Conceptually a the properties of a bundle is the union of the properties of all the MIs inside the bundle. So a bundle reads all the inputs that the individual MIs read and it defines all the outputs of the individual MIs. However, this is not correct when there are intra-bundle dependencies. e.g. 
-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

r0 should not be considered a source of the bundle, since it's defined inside the bundle and its live range does not extend beyond it. Instead, r0 is a clobber (i.e. a dead def).

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------
...
     = op3 r0,

r0 is a def, not a use.

What does this mean? It means that in order for passes to operate on one bundle at a time, they must be able to visit all the defs and uses of a bundle. We have established that computing the defs and uses of a bundle is not as trivial as taking the union. This is certainly not something we want to re-compute every time! This requires a slight change to the bundle representation:

 ----------------
 |   Bundle *   |   (a MI with the special opcode "Bundle")
 ----------------
        |
 ----------------
 |     MI *     |
 ----------------
        |
 ----------------
 |     MI *     |
 ----------------
        |
 ----------------
 |     MI       |   (no bit, this is the end of the bundle)
 ----------------
        |
 ----------------
 |   Bundle *   |   (a new bundle)
 ----------------
        |
 ----------------
 |     MI *     |
 ----------------
        |
 ----------------
 |     MI       |
 ----------------
        |
       ...

The pseudo bundle instructions should be used to capture the properties of the bundle. When a bundle is finalized, the packetizer must add source and def operands to the pseudo bundle instruction. More on this later.

Other properties, such as mayLoad and mayStore, are static properties associated with opcodes. They cannot be copied. We will add APIs to examine the properties of MIs which will do the *right thing* for bundles (i.e. look into the MIs in bundles).

Packetizing

The current MI flow looks like this:

1. DAG to MI lowering (and pre-RA schedule)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
3a. De-ssa (2-address, phi elim)
3b. Coalescing
3c. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Post-RA scheduling

In the hopefully not very distant future it should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
3a. De-ssa (2-address, phi elim)
3b. Coalescing
3c. Pre-RA scheduling
3d. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

The current proposal is for "packetization" to be done as part of the "RA super pass". Early MI optimization passes such as LICM do not benefit from operating on bundles. Furthermore, the packetizer should not have to know how to deal with copies which may later be coalesced, phi nodes, or other copy-like pseudo instructions.

Packetization should be done in two phases. The first part decides which MIs should be bundled together and adds the "bits" which glue MIs together. This can be done before pre-RA scheduling or integrated with it. The second part of packetization should only be done after register allocation is complete. There are two very important reasons for this.

1. Packet finalization *must* add source and def operands to the "Bundle" pseudo MI. This allows all later passes to handle bundles transparently. However, we do not want to do this before register allocation is complete. Otherwise it introduces new defs and uses of virtual registers, and that messes up the MachineRegisterInfo def-use chains.

e.g. Now vr0 has two defs!

defs: vr0<dead>, vr3, uses: vr1, vr2
----------------------------
| vr0 = op1 vr1, vr2       |
| vr3 = op2 vr0<kill>, #c  |
----------------------------

2. During register allocation, more identity copies will be eliminated, while loads, stores, copies, and re-materialized instructions will be introduced. It makes sense for the second part of packetization to try to fill these new instructions into empty slots (for VLIW-like targets).
So the overall flow should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
3a. De-ssa (2-address, phi elim)
3b. Coalescing
3c. Pre-scheduling packetization (optional)
3d. Pre-RA scheduling (or integrated packetization)
3e. Post-scheduling packetization (optional)
3f. Actual register allocation
3g. Packet finalization
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

Lowering Bundles to MCInst

There is no need to add the equivalent of a MI bundle to MCInst. A MI bundle should be concatenated into a single MCInst by storing opcodes as integer operands. e.g.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------

=>

MCInst: op1 r0, r1, r2, op2, r3, r0, #c
or
MCInst: op1 op2 r0, r1, r2, r3, r0, #c

What's Next?

I am hoping to find some time to implement the following in the near future:

1. Add a BUNDLE opcode.
2. MachineInstr class changes: the new bit, changes to methods such as eraseFromParent(), isIdenticalTo().
3. Change MachineBasicBlock::iterator to skip over bundled MIs. Rename the old iterator.
4. Add MachineInstr APIs to check for instruction properties, and switch existing code over.
5. Add an API to form a bundle. It would compute the proper defs and uses and add MachineOperands to the bundle MI.
6. Switch Thumb2 IT blocks to using MI bundles.
7. Add an interface for targets to register their own packetization passes.

I would dearly welcome help on any of these tasks, especially on 4, 5, and 6. I also would not cry if someone beats me to #6 (or actually any of the tasks). :-)

In the longer term, I would like to see a target-independent packetization pass (I believe one is being reviewed). I would also like to see a target-independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion). Patches welcome!
Evan
_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu
http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev