Justin Holewinski
2011-May-16 13:52 UTC
[LLVMdev] TargetRegisterInfo and "infinite" register files
Currently, the TableGen register info files for all of the back-ends define concrete registers and divide them into logical register classes. I would like to get some input from the LLVM experts around here on how best to map this model to an architecture that does *not* have a concrete, pre-defined register file. The architecture is PTX, which is more of an intermediate form than a final assembly language. The format is essentially three-address code, with "virtual" registers instead of "physical" registers. After PTX code generation, the PTX assembly is compiled to a device binary with a proprietary tool (ptxas) that does final register allocation (based on device and user constraints). However, exploiting register re-use at the LLVM/PTX level has shown performance improvement over blindly using a new "physical" register for each def and letting ptxas figure out all of the register allocation details, so I would like to take advantage of the LLVM register allocation infrastructure if at all possible. Generally stated, I would like to solve the register allocation problem as "allocate the minimum number of registers from an arbitrary set without spill code" instead of the more traditional "allocate the minimum number of registers from a fixed set." The current implementation defines an arbitrary set of registers that the register allocator can use during code-gen. This works, but is not scalable. If the register allocator runs out of registers, spill code must be generated. However, the "optimal" solution in this case would be to extend the register file. A few alternatives I have come up with are: 1. Bypass register allocation completely and just emit virtual registers, 2. Remove register definitions from the TableGen files and create them at run-time using the virtual register counts as an upper bound on the number of registers needed, or 3. Keep a small set of pre-defined physical registers, and craft spill code that really just puts a new register definition in the final PTX and copies to/from this register when spilling/restoring is needed I hesitate to use (1) or (3) as they rely too heavily on the final ptxas tool to perform reasonable register allocation, which may not lead to optimal code. Option (2) seems promising, though I worry about the feasibility of the approach. Specifically, I am not yet sure if generating TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit into the existing architecture. Any thoughts from the experts out there? Specifically, I am interested in any non-trivial pros/cons for any of these approaches, or any new approaches I have not considered. Thanks! -- Thanks, Justin Holewinski -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20110516/f0434fef/attachment.html>
Villmow, Micah
2011-May-16 16:40 UTC
[LLVMdev] TargetRegisterInfo and "infinite" register files
Justin, We have the same issue with the AMDIL code generator. We tried #1, but there are passes after register allocator that don't like virtual registers. #3 could be done by having the two spill functions [load|store]Reg[From|To]StackSlot keep track of the FrameIndex to register mapping internally, but again, more of a hack than a proper solution. My solution was to just create a very large register file, 768 registers, that no sane kernel would ever reach and then do register allocation within that. A simple script that is run at compile time to generate the tables into a separate .td file and have that included in the necessary locations is all that is needed so it doesn't bloat the code. Micah From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Justin Holewinski Sent: Monday, May 16, 2011 6:52 AM To: LLVM Developers Mailing List Subject: [LLVMdev] TargetRegisterInfo and "infinite" register files Currently, the TableGen register info files for all of the back-ends define concrete registers and divide them into logical register classes. I would like to get some input from the LLVM experts around here on how best to map this model to an architecture that does *not* have a concrete, pre-defined register file. The architecture is PTX, which is more of an intermediate form than a final assembly language. The format is essentially three-address code, with "virtual" registers instead of "physical" registers. After PTX code generation, the PTX assembly is compiled to a device binary with a proprietary tool (ptxas) that does final register allocation (based on device and user constraints). However, exploiting register re-use at the LLVM/PTX level has shown performance improvement over blindly using a new "physical" register for each def and letting ptxas figure out all of the register allocation details, so I would like to take advantage of the LLVM register allocation infrastructure if at all possible. Generally stated, I would like to solve the register allocation problem as "allocate the minimum number of registers from an arbitrary set without spill code" instead of the more traditional "allocate the minimum number of registers from a fixed set." The current implementation defines an arbitrary set of registers that the register allocator can use during code-gen. This works, but is not scalable. If the register allocator runs out of registers, spill code must be generated. However, the "optimal" solution in this case would be to extend the register file. A few alternatives I have come up with are: 1. Bypass register allocation completely and just emit virtual registers, 2. Remove register definitions from the TableGen files and create them at run-time using the virtual register counts as an upper bound on the number of registers needed, or 3. Keep a small set of pre-defined physical registers, and craft spill code that really just puts a new register definition in the final PTX and copies to/from this register when spilling/restoring is needed I hesitate to use (1) or (3) as they rely too heavily on the final ptxas tool to perform reasonable register allocation, which may not lead to optimal code. Option (2) seems promising, though I worry about the feasibility of the approach. Specifically, I am not yet sure if generating TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit into the existing architecture. Any thoughts from the experts out there? Specifically, I am interested in any non-trivial pros/cons for any of these approaches, or any new approaches I have not considered. Thanks! -- Thanks, Justin Holewinski -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20110516/4b9b7717/attachment.html>
Jakob Stoklund Olesen
2011-May-16 17:00 UTC
[LLVMdev] TargetRegisterInfo and "infinite" register files
On May 16, 2011, at 6:52 AM, Justin Holewinski wrote:> Currently, the TableGen register info files for all of the back-ends define concrete registers and divide them into logical register classes. I would like to get some input from the LLVM experts around here on how best to map this model to an architecture that does *not* have a concrete, pre-defined register file. The architecture is PTX, which is more of an intermediate form than a final assembly language. The format is essentially three-address code, with "virtual" registers instead of "physical" registers. After PTX code generation, the PTX assembly is compiled to a device binary with a proprietary tool (ptxas) that does final register allocation (based on device and user constraints). However, exploiting register re-use at the LLVM/PTX level has shown performance improvement over blindly using a new "physical" register for each def and letting ptxas figure out all of the register allocation details, so I would like to take advantage of the LLVM register allocation infrastructure if at all possible.What kind of optimizations can ptxas do? Can it hoist computations out of loops? Can it split live ranges? Can it coalesce live ranges?> Generally stated, I would like to solve the register allocation problem as "allocate the minimum number of registers from an arbitrary set without spill code" instead of the more traditional "allocate the minimum number of registers from a fixed set."It's a common misconception, but that is not what LLVM's register allocators do. They try to minimize the amount of executed spill code given the fixed set of registers. I wouldn't recommend dynamically growing the register file. You are likely to get super-linear compile time, and it is not clear that register allocation would achieve anything compared to simply outputting virtual registers. Surely, ptxas' register allocator can reuse a register for non-overlapping live ranges. That is all you would get out of this.> The current implementation defines an arbitrary set of registers that the register allocator can use during code-gen. This works, but is not scalable. If the register allocator runs out of registers, spill code must be generated. However, the "optimal" solution in this case would be to extend the register file. A few alternatives I have come up with are: > • Bypass register allocation completely and just emit virtual registers,This is worth a try. It is possible you want to run LLVM's 2-addr, phi-elim, and coalescer passes first.> • Remove register definitions from the TableGen files and create them at run-time using the virtual register counts as an upper bound on the number of registers needed, orDon't do that.> • Keep a small set of pre-defined physical registers, and craft spill code that really just puts a new register definition in the final PTX and copies to/from this register when spilling/restoring is neededThis could also work. Spill slots actually do what you want. The register allocator tries to use as few as possible as long as performance doesn't suffer. Later, StackSlotColoring will merge non-overlapping stack slot ranges to save more space.> I hesitate to use (1) or (3) as they rely too heavily on the final ptxas tool to perform reasonable register allocation, which may not lead to optimal code. Option (2) seems promising, though I worry about the feasibility of the approach. Specifically, I am not yet sure if generating TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit into the existing architecture. > > Any thoughts from the experts out there? Specifically, I am interested in any non-trivial pros/cons for any of these approaches, or any new approaches I have not considered.Sorry to be backwards, but I think you should try (1) or (3). Simply outputting virtual registers seems like a reasonable thing to to if ptx is really an intermediate form. LLVM's instruction selector and phi-elim tend to emit a lot of copies, so you probably want to run the coalescer before emission. That will minimize the number of copies. This is also the fastest thing you can do. There are two reasons you may want to run the register allocator anyway: - Coalescing is very aggressive. It creates long, interfering live ranges. If ptxas doesn't have live range splitting, you may benefit from LLVM's. - Passes like LICM and CSE will increase register pressure by hoisting redundant computations. If ptxas cannot rematerialize these computations in high register pressure situations, LLVM's register allocator can help you. Note that if you always make sure there are 'enough' physical registers, the register allocator will never split live ranges or rematerialize computations. That's why (2) doesn't buy you anything over (1). Use LLVM's register allocator like this: - Provide a realistic number of physical registers. Make it similar to the target architecture, but aim low. - Map spill slots to PTX registers. That means 'spilling' is really a noop, except you get live range splitting and remat. If you implement TII::canFoldMemoryOperand() and TII::foldMemoryOperandImpl(), there will be no inserted loads and stores. The result should be code that is easy to register allocate for ptxas with some live ranges that obviously should go in registers, and some that obviously should spill. There will be a number of live ranges that can go either way, depending on the actual number of registers targeted. /jakob
Justin Holewinski
2011-May-16 18:27 UTC
[LLVMdev] TargetRegisterInfo and "infinite" register files
On Mon, May 16, 2011 at 1:00 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk>wrote:> > On May 16, 2011, at 6:52 AM, Justin Holewinski wrote: > > > Currently, the TableGen register info files for all of the back-ends > define concrete registers and divide them into logical register classes. I > would like to get some input from the LLVM experts around here on how best > to map this model to an architecture that does *not* have a concrete, > pre-defined register file. The architecture is PTX, which is more of an > intermediate form than a final assembly language. The format is essentially > three-address code, with "virtual" registers instead of "physical" > registers. After PTX code generation, the PTX assembly is compiled to a > device binary with a proprietary tool (ptxas) that does final register > allocation (based on device and user constraints). However, exploiting > register re-use at the LLVM/PTX level has shown performance improvement over > blindly using a new "physical" register for each def and letting ptxas > figure out all of the register allocation details, so I would like to take > advantage of the LLVM register allocation infrastructure if at all possible. > > What kind of optimizations can ptxas do? Can it hoist computations out of > loops? Can it split live ranges? Can it coalesce live ranges? >Part of my problem is that ptxas is proprietary software, so it's essentially a black box to me. It appears to do a reasonable job, but I've also seen cases where PTX-level register re-use led to better device register utilization.> > > Generally stated, I would like to solve the register allocation problem > as "allocate the minimum number of registers from an arbitrary set without > spill code" instead of the more traditional "allocate the minimum number of > registers from a fixed set." > > It's a common misconception, but that is not what LLVM's register > allocators do. They try to minimize the amount of executed spill code given > the fixed set of registers. > > I wouldn't recommend dynamically growing the register file. You are likely > to get super-linear compile time, and it is not clear that register > allocation would achieve anything compared to simply outputting virtual > registers. Surely, ptxas' register allocator can reuse a register for > non-overlapping live ranges. That is all you would get out of this. >That makes sense. If the LLVM register allocators do not actively try to minimize register usage, then I see how there would not be a win here.> > > The current implementation defines an arbitrary set of registers that the > register allocator can use during code-gen. This works, but is not > scalable. If the register allocator runs out of registers, spill code must > be generated. However, the "optimal" solution in this case would be to > extend the register file. A few alternatives I have come up with are: > > • Bypass register allocation completely and just emit virtual > registers, > > This is worth a try. It is possible you want to run LLVM's 2-addr, > phi-elim, and coalescer passes first. >I definitely need to look into those passes some more. I just hesitate to ignore the LLVM register allocator since I have seen it generate better final code (post-ptxas).> > > • Remove register definitions from the TableGen files and create > them at run-time using the virtual register counts as an upper bound on the > number of registers needed, or > > Don't do that. >I see now why that would be sub-optimal.> > > • Keep a small set of pre-defined physical registers, and craft > spill code that really just puts a new register definition in the final PTX > and copies to/from this register when spilling/restoring is needed > > This could also work. Spill slots actually do what you want. The register > allocator tries to use as few as possible as long as performance doesn't > suffer. Later, StackSlotColoring will merge non-overlapping stack slot > ranges to save more space. >That's good to know. I was hoping LLVM did something like that, but I have not really scanned that code too thoroughly yet.> > > I hesitate to use (1) or (3) as they rely too heavily on the final ptxas > tool to perform reasonable register allocation, which may not lead to > optimal code. Option (2) seems promising, though I worry about the > feasibility of the approach. Specifically, I am not yet sure if generating > TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit > into the existing architecture. > > > > Any thoughts from the experts out there? Specifically, I am interested > in any non-trivial pros/cons for any of these approaches, or any new > approaches I have not considered. > > Sorry to be backwards, but I think you should try (1) or (3). > > Simply outputting virtual registers seems like a reasonable thing to to if > ptx is really an intermediate form. LLVM's instruction selector and phi-elim > tend to emit a lot of copies, so you probably want to run the coalescer > before emission. That will minimize the number of copies. This is also the > fastest thing you can do. > > There are two reasons you may want to run the register allocator anyway: > > - Coalescing is very aggressive. It creates long, interfering live ranges. > If ptxas doesn't have live range splitting, you may benefit from LLVM's. > > - Passes like LICM and CSE will increase register pressure by hoisting > redundant computations. If ptxas cannot rematerialize these computations in > high register pressure situations, LLVM's register allocator can help you. > > Note that if you always make sure there are 'enough' physical registers, > the register allocator will never split live ranges or rematerialize > computations. That's why (2) doesn't buy you anything over (1). >Interesting. I was working under the assumption that the register allocators tried to minimize register use.> > Use LLVM's register allocator like this: > > - Provide a realistic number of physical registers. Make it similar to the > target architecture, but aim low. >Sounds reasonable.> > - Map spill slots to PTX registers. That means 'spilling' is really a noop, > except you get live range splitting and remat. If you implement > TII::canFoldMemoryOperand() and TII::foldMemoryOperandImpl(), there will be > no inserted loads and stores. >That's good to know.> > The result should be code that is easy to register allocate for ptxas with > some live ranges that obviously should go in registers, and some that > obviously should spill. There will be a number of live ranges that can go > either way, depending on the actual number of registers targeted. >This was definitely very informative! Thanks for the information!> > /jakob > >-- Thanks, Justin Holewinski -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20110516/168378de/attachment.html>
Justin Holewinski
2011-May-16 18:30 UTC
[LLVMdev] TargetRegisterInfo and "infinite" register files
On Mon, May 16, 2011 at 12:40 PM, Villmow, Micah <Micah.Villmow at amd.com>wrote:> Justin, > > We have the same issue with the AMDIL code generator. We tried #1, but > there are passes after register allocator that don’t like virtual registers. > #3 could be done by having the two spill functions > [load|store]Reg[From|To]StackSlot keep track of the FrameIndex to register > mapping internally, but again, more of a hack than a proper solution. >After reading Jakob's comments, I think (3) may end up being the best in the long term. I'll definitely post any results to the list!> > > My solution was to just create a very large register file, 768 registers, > that no sane kernel would ever reach and then do register allocation within > that. A simple script that is run at compile time to generate the tables > into a separate .td file and have that included in the necessary locations > is all that is needed so it doesn’t bloat the code. >That is essentially what happens now, the only difference being the register description file is generated during dev-time instead of compile-time. I just feel there should be a more "scalable" approach.> > > Micah > > *From:* llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] *On > Behalf Of *Justin Holewinski > *Sent:* Monday, May 16, 2011 6:52 AM > *To:* LLVM Developers Mailing List > *Subject:* [LLVMdev] TargetRegisterInfo and "infinite" register files > > > > Currently, the TableGen register info files for all of the back-ends define > concrete registers and divide them into logical register classes. I would > like to get some input from the LLVM experts around here on how best to map > this model to an architecture that does *not* have a concrete, pre-defined > register file. The architecture is PTX, which is more of an intermediate > form than a final assembly language. The format is essentially > three-address code, with "virtual" registers instead of "physical" > registers. After PTX code generation, the PTX assembly is compiled to a > device binary with a proprietary tool (ptxas) that does final register > allocation (based on device and user constraints). However, exploiting > register re-use at the LLVM/PTX level has shown performance improvement over > blindly using a new "physical" register for each def and letting ptxas > figure out all of the register allocation details, so I would like to take > advantage of the LLVM register allocation infrastructure if at all possible. > > > > Generally stated, I would like to solve the register allocation problem as > "allocate the minimum number of registers from an arbitrary set without > spill code" instead of the more traditional "allocate the minimum number of > registers from a fixed set." > > > > The current implementation defines an arbitrary set of registers that the > register allocator can use during code-gen. This works, but is not > scalable. If the register allocator runs out of registers, spill code must > be generated. However, the "optimal" solution in this case would be to > extend the register file. A few alternatives I have come up with are: > > 1. Bypass register allocation completely and just emit virtual > registers, > 2. Remove register definitions from the TableGen files and create them > at run-time using the virtual register counts as an upper bound on the > number of registers needed, or > 3. Keep a small set of pre-defined physical registers, and craft spill > code that really just puts a new register definition in the final PTX and > copies to/from this register when spilling/restoring is needed > > I hesitate to use (1) or (3) as they rely too heavily on the final ptxas > tool to perform reasonable register allocation, which may not lead to > optimal code. Option (2) seems promising, though I worry about the > feasibility of the approach. Specifically, I am not yet sure if generating > TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit > into the existing architecture. > > > > Any thoughts from the experts out there? Specifically, I am interested in > any non-trivial pros/cons for any of these approaches, or any new approaches > I have not considered. > > > > Thanks! > > > > > > -- > > Thanks, > > > > Justin Holewinski > > >-- Thanks, Justin Holewinski -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20110516/59c155a7/attachment.html>
Andrew Clinton
2011-May-17 16:18 UTC
[LLVMdev] TargetRegisterInfo and "infinite" register files
I have faced this same problem in my backend, and I'm working around it by providing a large physical register set. There are two problems with this: 1. There's a chance that the register allocator will run out of registers to assign, in which case the allocation will fail - making it necessary to retry with a larger register set 2. The code generator consumes storage proportional to the number of registers that could be assigned I'd be interested in an improvement to the code generator that makes it possible to specify an infinite register set without the need to store the registers explicitly. Andrew On 05/16/2011 09:52 AM, Justin Holewinski wrote:> Currently, the TableGen register info files for all of the back-ends > define concrete registers and divide them into logical register > classes. I would like to get some input from the LLVM experts around > here on how best to map this model to an architecture that does *not* > have a concrete, pre-defined register file. The architecture is PTX, > which is more of an intermediate form than a final assembly language. > The format is essentially three-address code, with "virtual" > registers instead of "physical" registers. After PTX code generation, > the PTX assembly is compiled to a device binary with a proprietary > tool (ptxas) that does final register allocation (based on device and > user constraints). However, exploiting register re-use at the > LLVM/PTX level has shown performance improvement over blindly using a > new "physical" register for each def and letting ptxas figure out all > of the register allocation details, so I would like to take advantage > of the LLVM register allocation infrastructure if at all possible. > > Generally stated, I would like to solve the register allocation > problem as "allocate the minimum number of registers from an arbitrary > set without spill code" instead of the more traditional "allocate the > minimum number of registers from a fixed set." > > The current implementation defines an arbitrary set of registers that > the register allocator can use during code-gen. This works, but is > not scalable. If the register allocator runs out of registers, spill > code must be generated. However, the "optimal" solution in this case > would be to extend the register file. A few alternatives I have come > up with are: > > 1. Bypass register allocation completely and just emit virtual > registers, > 2. Remove register definitions from the TableGen files and create > them at run-time using the virtual register counts as an upper > bound on the number of registers needed, or > 3. Keep a small set of pre-defined physical registers, and craft > spill code that really just puts a new register definition in > the final PTX and copies to/from this register when > spilling/restoring is needed > > I hesitate to use (1) or (3) as they rely too heavily on the final > ptxas tool to perform reasonable register allocation, which may not > lead to optimal code. Option (2) seems promising, though I worry > about the feasibility of the approach. Specifically, I am not yet > sure if generating TargetRegisterInfo and TargetRegisterClass > instances on-the-fly will fit into the existing architecture. > > Any thoughts from the experts out there? Specifically, I am > interested in any non-trivial pros/cons for any of these approaches, > or any new approaches I have not considered. > > Thanks! > > > > -- > > Thanks, > > Justin Holewinski > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20110517/48472940/attachment.html>
Jakob Stoklund Olesen
2011-May-17 16:54 UTC
[LLVMdev] TargetRegisterInfo and "infinite" register files
On May 17, 2011, at 9:18 AM, Andrew Clinton wrote:> I have faced this same problem in my backend, and I'm working around it by providing a large physical register set. There are two problems with this: > > 1. There's a chance that the register allocator will run out of registers to assign, in which case the allocation will fail - making it necessary to retry with a larger register set > 2. The code generator consumes storage proportional to the number of registers that could be assigned > > I'd be interested in an improvement to the code generator that makes it possible to specify an infinite register set without the need to store the registers explicitly.As I explained to Justin, allocating against an infinite register file doesn't really do anything. It's a quadratic time no-op. What you can do instead is: 1) Just use virtual registers and skip register allocation, or 2) Allocate to a small register file, implement memory operand folding, and pretend that spill slots are registers. /jakob
Maybe Matching Threads
- [LLVMdev] TargetRegisterInfo and "infinite" register files
- [LLVMdev] TargetRegisterInfo and "infinite" register files
- [LLVMdev] TargetRegisterInfo and "infinite" register files
- [LLVMdev] [ptx] Propose a register class naming convention change
- [LLVMdev] [ptx] Propose a register class naming convention change