On 7/17/13 12:35 PM, Diego Novillo wrote:> On Fri, Jul 12, 2013 at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote: > >> 3. How to parallelize post-IPO stage >> ===================================>> >> From 5k' high, the concept is very simple, just to >> step 1).divide the merged IR into small pieces, >> step 2).and compile each of this pieces independendly. >> step 3) the objects of each piece are fed back to linker to are linked >> into an executable, or a dynamic lib. > You seem to be describing GCC's strategy for whole program > optimization (http://gcc.gnu.org/wiki/LinkTimeOptimization).Hi, Diego: Thank you for the comment. Quite honestly, I'm not very familiar with gcc's LTO implementation, and now I'm not allowed to read any GPL V3 code. What I'm describing here is somewhat similar to Open64's IPA implementation. I did port gcc before and of course read its document . Based on my little knowledge of its internal, I think gcc's "whole-program" mode is almost identical to Open64's IPA in spirit, I guess gcc community likely borrowed some idea from Open64. On the other hand, as far as I can understand of the oldish gcc's implementation before I join Apple, I guess the "whole-program mode" in gcc is misnomer. I don't want call this work as "whole-program mode", I'd like to call "big program mode" or something like that, as I don't care the binary being built see the whole-program or not while the analyses/opt do care the difference between them.> > What is a bit confusing in your description is that you seem to be > wanting to do more work after IPO is *done*.IPO is difficult to be parallelized. What I try to do is to parallelize the post-lto compilation stage, including, optionally LNO, and scalar opt and compile-time hogging CodeGen.> If the optimizations are > done, then there is no need to parallelize anything. > > In GCC, we have two parallel stages: the generation of bytecode (which > is naturally parallelized by your build system) and the final > optimization phase, which is parallelized by partitioning the > callgraph into disjoint subsets. These subsets are then parallelized > by spawning the compiler on each of the partitions.I call the 1st "parallel stage" pre-ipo, and the 2nd "parallel stage" post-IPO:-), and the IPO (which you call WPA) is sandwiched in the middle.> > The only sequential part is the actual analysis (what we call WPA or > whole program analysis).While not include in the proposal, I was proposing another level (earlier) of partition in order to build outrageously huge programs in order to parallelize the "WPA". The benefit is to parallize the IPO/"WPA", however, the cost is not seeing the "whole program". One of my coworker told me we care about the benefit reaped from seeing the "whole program" than the improved compile-time at the cost of compiling bunch of "incomplete program"s independently.> That is a single threaded phase that makes > all the optimization decisions, annotates the summaries on each > callgraph node, partitions the callgraph and then spawns all the > optimizers to work on each section of the callgraph. >This is what I called "post-IPO" :-)
On Wed, Jul 17, 2013 at 1:06 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:> > On 7/17/13 12:35 PM, Diego Novillo wrote: >> >> On Fri, Jul 12, 2013 at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> >> wrote: >> >>> 3. How to parallelize post-IPO stage >>> ===================================>>> >>> From 5k' high, the concept is very simple, just to >>> step 1).divide the merged IR into small pieces, >>> step 2).and compile each of this pieces independendly. >>> step 3) the objects of each piece are fed back to linker to are >>> linked >>> into an executable, or a dynamic lib. >> >> You seem to be describing GCC's strategy for whole program >> optimization (http://gcc.gnu.org/wiki/LinkTimeOptimization). > > > Hi, Diego: > > Thank you for the comment. Quite honestly, I'm not very familiar with > gcc's LTO implementation, and now I'm not allowed to read any GPL V3 code. > > What I'm describing here is somewhat similar to Open64's IPA implementation. > I did port gcc before and of course read its document . Based on my > little knowledge of its internal, I think gcc's "whole-program" > mode is almost identical to Open64's IPA in spirit, I guess gcc community > likely > borrowed some idea from Open64. > > On the other hand, as far as I can understand of the oldish gcc's > implementation > before I join Apple, I guess the "whole-program mode" in gcc is misnomer. > I don't want call this work as "whole-program mode", I'd like to call "big > program mode" > or something like that, as I don't care the binary being built see the > whole-program or > not while the analyses/opt > do care the difference between them. > > >> >> What is a bit confusing in your description is that you seem to be >> wanting to do more work after IPO is *done*. > > > IPO is difficult to be parallelized. What I try to do is to parallelize the > post-lto compilation > stage, including, optionally LNO, and scalar opt and compile-time hogging > CodeGen. > > >> If the optimizations are >> done, then there is no need to parallelize anything. >> >> In GCC, we have two parallel stages: the generation of bytecode (which >> is naturally parallelized by your build system) and the final >> optimization phase, which is parallelized by partitioning the >> callgraph into disjoint subsets. These subsets are then parallelized >> by spawning the compiler on each of the partitions. > > I call the 1st "parallel stage" pre-ipo, and the 2nd "parallel stage" > post-IPO:-), > and the IPO (which you call WPA) is sandwiched in the middle. > > >> >> The only sequential part is the actual analysis (what we call WPA or >> whole program analysis). > > While not include in the proposal, I was proposing another level (earlier) > of partition in order > to build outrageously huge programs in order to parallelize the "WPA". The > benefit > is to parallize the IPO/"WPA", however, the cost is not seeing the "whole > program". > > One of my coworker told me we care about the benefit reaped from seeing the > "whole program" > than the improved compile-time at the cost of compiling bunch of "incomplete > program"s independently. > > > >> That is a single threaded phase that makes >> all the optimization decisions, annotates the summaries on each >> callgraph node, partitions the callgraph and then spawns all the >> optimizers to work on each section of the callgraph. >> > This is what I called "post-IPO" :-) >OK, yeah, this was horribly confusing. IPA would probably have been a better choice of term (or WPA) with the 'A' for analysis and the 'O' for optimization where we're doing the work. It makes your entire proposal make some sense now ;) I'll need to go back and reread it with that in mind. -eric
There are so many related terminologies that are usually confused or misused. It should be normalized before discussion. Here is one version: o IPA --> interprocedural analysis and optimization o IPO/CMO --> interprocedural optimization (cross module IPA). In different context, IPO may mean slightly different things. For instance, it may mean the whole compilation pipeline, or just the serial part when all IR files are merged and analyzed. IPO/CMO can operate both whole program mode and non-whole program mode. Note that IPA is more general than IPO/CMO, as it covers single module case tool. o LTO --> one type of implementation of IPO/CMO, which involves linker/linker plugin. It means the whole IPO pipeline. The following are a list of GCC specific terms: o WHOPR --> GCC LTO in whole program mode o WPA --> the serial/merge part of the IPO pipelines o LGEN --> pre-IPO compilation o LTRANS -> post-IPO compilation. Shuxin's proposal is about parallelizing 'LTRANS' in LLVM. thanks, David On Wed, Jul 17, 2013 at 1:16 PM, Eric Christopher <echristo at gmail.com> wrote:> On Wed, Jul 17, 2013 at 1:06 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote: >> >> On 7/17/13 12:35 PM, Diego Novillo wrote: >>> >>> On Fri, Jul 12, 2013 at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> >>> wrote: >>> >>>> 3. How to parallelize post-IPO stage >>>> ===================================>>>> >>>> From 5k' high, the concept is very simple, just to >>>> step 1).divide the merged IR into small pieces, >>>> step 2).and compile each of this pieces independendly. >>>> step 3) the objects of each piece are fed back to linker to are >>>> linked >>>> into an executable, or a dynamic lib. >>> >>> You seem to be describing GCC's strategy for whole program >>> optimization (http://gcc.gnu.org/wiki/LinkTimeOptimization). >> >> >> Hi, Diego: >> >> Thank you for the comment. Quite honestly, I'm not very familiar with >> gcc's LTO implementation, and now I'm not allowed to read any GPL V3 code. >> >> What I'm describing here is somewhat similar to Open64's IPA implementation. >> I did port gcc before and of course read its document . Based on my >> little knowledge of its internal, I think gcc's "whole-program" >> mode is almost identical to Open64's IPA in spirit, I guess gcc community >> likely >> borrowed some idea from Open64. >> >> On the other hand, as far as I can understand of the oldish gcc's >> implementation >> before I join Apple, I guess the "whole-program mode" in gcc is misnomer. >> I don't want call this work as "whole-program mode", I'd like to call "big >> program mode" >> or something like that, as I don't care the binary being built see the >> whole-program or >> not while the analyses/opt >> do care the difference between them. >> >> >>> >>> What is a bit confusing in your description is that you seem to be >>> wanting to do more work after IPO is *done*. >> >> >> IPO is difficult to be parallelized. What I try to do is to parallelize the >> post-lto compilation >> stage, including, optionally LNO, and scalar opt and compile-time hogging >> CodeGen. >> >> >>> If the optimizations are >>> done, then there is no need to parallelize anything. >>> >>> In GCC, we have two parallel stages: the generation of bytecode (which >>> is naturally parallelized by your build system) and the final >>> optimization phase, which is parallelized by partitioning the >>> callgraph into disjoint subsets. These subsets are then parallelized >>> by spawning the compiler on each of the partitions. >> >> I call the 1st "parallel stage" pre-ipo, and the 2nd "parallel stage" >> post-IPO:-), >> and the IPO (which you call WPA) is sandwiched in the middle. >> >> >>> >>> The only sequential part is the actual analysis (what we call WPA or >>> whole program analysis). >> >> While not include in the proposal, I was proposing another level (earlier) >> of partition in order >> to build outrageously huge programs in order to parallelize the "WPA". The >> benefit >> is to parallize the IPO/"WPA", however, the cost is not seeing the "whole >> program". >> >> One of my coworker told me we care about the benefit reaped from seeing the >> "whole program" >> than the improved compile-time at the cost of compiling bunch of "incomplete >> program"s independently. >> >> >> >>> That is a single threaded phase that makes >>> all the optimization decisions, annotates the summaries on each >>> callgraph node, partitions the callgraph and then spawns all the >>> optimizers to work on each section of the callgraph. >>> >> This is what I called "post-IPO" :-) >> > > OK, yeah, this was horribly confusing. IPA would probably have been a > better choice of term (or WPA) with the 'A' for analysis and the 'O' > for optimization where we're doing the work. > > It makes your entire proposal make some sense now ;) > > I'll need to go back and reread it with that in mind. > > -eric > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev