Hi all, We've seen several examples recently of performance opportunities on POWER if we can improve the location of save/restore code for callee-saved registers. Both Nemanja and myself have discussed this with several people, and it seems that there are two possibilities for improving this: 1. Extend shrink wrapping to make the analysis of callee-saved registers more precise. 2. Focus on enabling and (possibly) improving SplitCSR. I would like opinions from people on the preferred way to proceed. I am leaning toward improving shrink wrapping, at least as a short-term solution. However, I fully admit that this is because I am familiar with the shrink wrapping code and completely naive about SplitCSR and what work would be necessary to get this working well. My proposal would be to implement the flow sensitive analysis described by Fred Chow (PLDI '88) and make the necessary extensions in shrink wrapping to handle multiple save/restore points. At that point we can do an evaluation to understand the improvements it provides and the impact on compile time. Once we have these results, we can look at the best way to enable it (e.g., option, target opt-in, higher opts, etc.). Thoughts? Kit
Francis Visoiu Mistrih via llvm-dev
2017-May-03 17:50 UTC
[llvm-dev] RFC: Shrink wrapping vs SplitCSR
Hi Kit,> On May 2, 2017, at 7:54 PM, Kit Barton via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > > Hi all, > > We've seen several examples recently of performance opportunities on > POWER if we can improve the location of save/restore code for > callee-saved registers. Both Nemanja and myself have discussed this with > several people, and it seems that there are two possibilities for > improving this: > > 1. Extend shrink wrapping to make the analysis of callee-saved > registers more precise. > 2. Focus on enabling and (possibly) improving SplitCSR. > > I would like opinions from people on the preferred way to proceed. > > I am leaning toward improving shrink wrapping, at least as a short-term > solution. However, I fully admit that this is because I am familiar with > the shrink wrapping code and completely naive about SplitCSR and what > work would be necessary to get this working well. > > My proposal would be to implement the flow sensitive analysis described > by Fred Chow (PLDI '88) and make the necessary extensions in shrink > wrapping to handle multiple save/restore points. At that point we can do > an evaluation to understand the improvements it provides and the impact > on compile time. Once we have these results, we can look at the best way > to enable it (e.g., option, target opt-in, higher opts, etc.).Back in 2009, there was an implementation of Fred Chow’s algorithm, that has been removed in r193749, because it was unused and untested. I have been working on a new implementation of Fred Chow’s algorithm for a while now. It seems that this algorithm has been avoided because of the compile time impact that was not worth compared to the performance improvement. The main reason is that there are probably loops in the CFG. In my implementation, we decided that we never want to save / restore inside a loop, so we consider loops as a single block. We are using scc_iterators instead of MachineLoopInfo in order to handle irreducible loops as well, that are not handled by the current SW implementation. That way, we can compute the anticipation / availability attributes in linear time. In terms of correctness, there is one test on X86 that still fails, and two others on AArch64, because of the way compact unwinding encodes register saves. In terms of compile-time, there are no regressions on CTMark. In terms of code-size, we get a 0.8% increase on X86_64 and AArch64, mostly because we cannot use push / pop anymore. For now, we only worked on shrink-wrapping CSRs, but keep the stack setup in the entry block / return blocks, which can give worse results in some cases compared to the current shrink-wrapping pass. I am currently working on fixing this. For execution-time, not many improvements showed up due to the stack setup and the transformation of push/pop -> mov $reg, (mem)/mov (mem), $reg which can be partially solved. In terms of what the algorithm can do, and how it can outperform the current one, I got some stats based on where we save / restore, along with the block frequency (with PGO), and we can see a theoretical 8% improvement. I put an early review here a while ago: https://reviews.llvm.org/D30808, and I will update it soon. As you can see, all the PrologEpilogInserter and (X86|AArch64)TargetFrameLowering code look horribly hacky, because so many things assume the stack setup and callee saved saves will stick together. Fixing all those assumptions is going to be the most tricky part of shrink-wrapping, not the algorithm itself. There are some cases where the current shrink-wrapping is better, and that’s in cases like in the Fig.3 of Chow’s paper, and that can be probably solved by using something similar to what’s described in this paper: Post Register Allocation Spill Code Optimization by Christopher Lupo and Kent D. Wilken, which uses the PST to optimize saves / restores placement along with a cost model. Cheers, -- Francis Visoiu Mistrih> > Thoughts? > > Kit > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Francis Visoiu Mistrih <fvisoiumistrih at apple.com> writes: I'm resending this as I forgot to CC llvm-dev on my reply last night.... Hi Francis, Thanks for the detailed reply! I took a quick look at the patch, and I'll try to take a closer look at it tomorrow. Could you please subscribe me to future patches? Either myself or someone from my team can help with testing and/or integration on PPC if you'd like. This is something we've been struggling with for a while and are anxious to make some progress on it :) Kit> Hi Kit, > >> On May 2, 2017, at 7:54 PM, Kit Barton via llvm-dev <llvm-dev at lists.llvm.org> wrote: >> >> >> Hi all, >> >> We've seen several examples recently of performance opportunities on >> POWER if we can improve the location of save/restore code for >> callee-saved registers. Both Nemanja and myself have discussed this with >> several people, and it seems that there are two possibilities for >> improving this: >> >> 1. Extend shrink wrapping to make the analysis of callee-saved >> registers more precise. >> 2. Focus on enabling and (possibly) improving SplitCSR. >> >> I would like opinions from people on the preferred way to proceed. >> >> I am leaning toward improving shrink wrapping, at least as a short-term >> solution. However, I fully admit that this is because I am familiar with >> the shrink wrapping code and completely naive about SplitCSR and what >> work would be necessary to get this working well. >> >> My proposal would be to implement the flow sensitive analysis described >> by Fred Chow (PLDI '88) and make the necessary extensions in shrink >> wrapping to handle multiple save/restore points. At that point we can do >> an evaluation to understand the improvements it provides and the impact >> on compile time. Once we have these results, we can look at the best way >> to enable it (e.g., option, target opt-in, higher opts, etc.). > > Back in 2009, there was an implementation of Fred Chow’s algorithm, that has been removed in r193749, because it was unused and untested. > > I have been working on a new implementation of Fred Chow’s algorithm for a while now. > > It seems that this algorithm has been avoided because of the compile time impact that was not worth compared to the performance improvement. > > The main reason is that there are probably loops in the CFG. In my > implementation, we decided that we never want to save / restore inside a loop, > so we consider loops as a single block. We are using scc_iterators instead of > MachineLoopInfo in order to handle irreducible loops as well, that are not > handled by the current SW implementation. That way, we can compute the > anticipation / availability attributes in linear time. > > In terms of correctness, there is one test on X86 that still fails, and two others on AArch64, because of the way compact unwinding encodes register saves. > > In terms of compile-time, there are no regressions on CTMark. > > In terms of code-size, we get a 0.8% increase on X86_64 and AArch64, mostly because we cannot use push / pop anymore. > > For now, we only worked on shrink-wrapping CSRs, but keep the stack setup in the entry block / return blocks, which can give worse results in some cases compared to the current shrink-wrapping pass. I am currently working on fixing this. > > For execution-time, not many improvements showed up due to the stack setup and the transformation of push/pop -> mov $reg, (mem)/mov (mem), $reg which can be partially solved. > > In terms of what the algorithm can do, and how it can outperform the current one, I got some stats based on where we save / restore, along with the block frequency (with PGO), and we can see a theoretical 8% improvement. > > I put an early review here a while ago: https://reviews.llvm.org/D30808, and I > will update it soon. As you can see, all the PrologEpilogInserter and > (X86|AArch64)TargetFrameLowering code look horribly hacky, because so many > things assume the stack setup and callee saved saves will stick together. Fixing > all those assumptions is going to be the most tricky part of shrink-wrapping, > not the algorithm itself. > > There are some cases where the current shrink-wrapping is better, and that’s in > cases like in the Fig.3 of Chow’s paper, and that can be probably solved by > using something similar to what’s described in this paper: Post Register > Allocation Spill Code Optimization by Christopher Lupo and Kent D. Wilken, which > uses the PST to optimize saves / restores placement along with a cost model. > > Cheers,
Reasonably Related Threads
- [IPRA] Do we required more aggressive shrink-wrapping optimization?
- [IPRA] Do we required more aggressive shrink-wrapping optimization?
- [AArch64] bug in shrink-wrapping
- RFC] Shrink-wrapping improvement
- [LLVMdev] Shrink Wrapping - RFC and initial implementation