Devadasan, Christudasan via llvm-dev
2020-May-29 14:15 UTC
[llvm-dev] Dynamically determine the CostPerUse value in the register allocator.
[AMD Official Use Only - Internal Distribution Only]

Hi All,

For the AMDGPU architecture, we would like the register allocator (RA) to associate a cost with the registers (CostPerUse) based on a target entity, for instance the calling convention of the current MachineFunction. Presently CostPerUse is a one-time static value (either zero or a positive value) generated through TableGen. The current implementation doesn't allow us to control the register cost on the fly.

The AMDGPU ABI has recently been revised by introducing more caller-saved VGPRs (the exact details are explained towards the end of this e-mail), and we found that having a dynamic register cost is important to achieve an optimal allocation. Precisely, it is important to limit the number of VGPRs allocated for a kernel/device-function to the smallest possible value, since it has a direct impact on the occupancy. Occupancy means the number of wavefronts that can be launched at runtime for a kernel program.

Some initial thoughts on how to fix it:

1. Have a target interface (a switch) to enable or disable the CostPerUse value.
2. Get the register cost in the same way we define various calling conventions (*CallingConv.td).
3. Compute the CostPerUse in the way the AllocationOrder for the registers is determined during RA.

The first one is the easiest method, and it solves the immediate problem. However, the other two options are better if we want to associate different register cost values with different calling conventions (I presume that need will arise at some point). There may be a better way to fix it than these options. Any suggestion in this regard would be helpful.

AMDGPU ABI changes and the motivation for this discussion:

Before the new ABI change:
Apart from the initial reserved 32 argument registers, all VGPRs are callee-saved registers (VGPR32 - VGPR255).

With the new ABI:
We split VGPR32 - VGPR255 into an equal number of callee-saved and caller-saved registers. For the same occupancy reason, these two sets are interleaved at a split boundary of 8.

VGPR32-VGPR39 (Caller-saved)
VGPR40-VGPR47 (Callee-saved)
VGPR48-VGPR55 (Caller-saved)
-
-
VGPR248-VGPR255 (Callee-saved)

With the new ABI, the allocator's preference for callee-saved vs caller-saved registers depends on the input program. RA may end up allocating more caller-saved registers than callee-saved ones in certain cases. The opposite is possible too (more callee-saved registers). In either case, unallocated registers are left behind in the gaps, which raises the final VGPR count considerably and has a bad impact on the occupancy.

To override the default allocation preferences of RA, we tried setting a cost for all VGPRs such that higher indices have a higher cost. This eliminated the problem by allocating all the lower registers before picking a higher one, at the expense of some spills in certain cases, which is acceptable.

But for kernels with no device-function calls, the register cost is unnecessary because there is no ABI for such kernel programs. The register cost caused a performance penalty for such kernels. That is exactly why we need a method to decide dynamically whether or not to have a register cost.

Regards,
Christudasan
Devadasan, Christudasan via llvm-dev
2020-May-30 11:23 UTC
[llvm-dev] Dynamically determine the CostPerUse value in the register allocator.
Please ignore the header "AMD Official Use Only". I forgot to remove it while posting the email to llvm-dev.

Regards,
Christudasan
Madhur Amilkanthwar via llvm-dev
2020-May-30 12:52 UTC
[llvm-dev] Dynamically determine the CostPerUse value in the register allocator.
I don't know the history behind the name CostPerUse, so I may be missing the background associated with it. It seems to be a misnomer for what it is intended to do. At first sight, the name suggests that the cost is a function of the uses of the register: the more uses, the higher the cost.

How do we want to define the value of CostPerUse? Should it be a function of the uses, or just of the target?