Hi, there:

This is the proposal for parallelizing the post-IPO stage. See below for
details. I also attach a toy-grade rudimentary implementation, which can be
used to illustrate some of the concepts here. This patch is not meant to be
committed.

Unfortunately, this weekend I will be too busy to read email. Please do not
construe a delayed response as rudeness :-).

Thanks a lot in advance for your time and insightful comments!

Shuxin

The proposal
------------
It is organized as follows:
 1) background info; if you have heard of "/usr/bin/ls", please skip it
 2) the motivation for parallelizing the post-IPO stage
 3) how to parallelize post-IPO
 4) the linker problems
 5) the toy-grade rudimentary implementation
 6) miscellaneous

1. Some background
------------------

Interprocedural-optimization compilation, aka IPO or IPA, typically
consists of three stages:

 S1) pre-IPO
     Each function goes through some analyses and not-very-aggressive
     optimizations. Some information is collected during this stage and
     passed down to the IPO stage; this info is usually called summary
     info. The result of this stage is "fake objects": binary files that
     use some known object format to encapsulate the IR as well as the
     summary info, along with other stuff.

 S2) IPO
     The compiler works with the linker to resolve and merge symbols in
     the "fake objects". Then interprocedural analyses (IPA) are invoked
     to perform interprocedural analysis, either based on the summary info
     or directly on the IR. Interprocedural optimizations (IPO) are
     invoked based on the IPA results.

     In some compilers, IPA and IPO are separated. One reason is that many
     IPAs can be conducted directly on the concise summary info, while
     many IPOs need to load IR and bulky annotation/metadata into memory.

 S3) post-IPO
     Typically consists of loop-nest optimization, scalar optimization,
     code generation, etc. While these are intraprocedural
     analyses/optimizations, they may directly benefit from the info
     collected in the IPO stage and passed down the road.

LLVM collectively calls S2 and S3 "LTO CodeGen", which is very confusing.

2. Why parallelize the post-IPO stage
=====================================

 R1) To improve scalability. It is quite obvious that we are not able to
     hold everything about a monster program in memory at once. Even if
     you can afford an expensive computer, the address space of a single
     compiler process cannot accommodate a monster program.

 R2) To take advantage of ample HW resources to shorten compile time.

 R3) To make debugging a lot easier. One can triage problems in a much
     smaller partition rather than in the huge monster program.

This proposal cannot hit goal R1 at this moment, because during the IPO
stage the compiler currently brings everything into memory at once.

3. How to parallelize the post-IPO stage
========================================

From 5,000 feet up, the concept is very simple:
  step 1) divide the merged IR into small pieces,
  step 2) compile each of these pieces independently, and
  step 3) feed the objects of each piece back to the linker to be linked
          into an executable or a dynamic lib.

Sections 3.1 through 3.3 describe these three steps respectively.

3.1. Partitioning
-----------------

Partitioning is to cut a reasonably-sized chunk from the big merged IR. It
roughly consists of two steps: 1) determine the partition scheme, which is
the relatively easy step, and 2) physically scoop the partition out of the
merged IR, which is much more involved.

3.1.1 Figure out the partition scheme
-------------------------------------

We randomly pick some functions and put them in a partition. It would be
nice to perform some optimization at this moment. One opt in my mind is to
reorder functions in order to reduce the working set and improve locality.

Unfortunately, this opt seems to be a bit blind at this time, because
 - the CallGraph is not annotated with estimated or profiled frequency;
 - some linkers don't respect the order. It seems they just remember the
   function order of the pristine input obj/fake-obj, and enforce this
   order at the final link (link-exec/shared-lib) stage.

Anyway, I will ignore all these problems for now and try to perform the
partitioning via the following steps; maybe we will have some luck on some
platforms (a rough sketch in code follows the list):

  o. DFS the call graph, ignoring self-recursive edges; if frequency info
     is available, prioritize the edges (i.e. those corresponding to call
     sites) such that frequent edges are visited first.

  o. Cut the DFS spanning tree obtained from the previous step bottom-up,
     such that each cut/partition contains a reasonable number of
     functions, and the aggregate size of the functions in the partition
     does not exceed a predefined threshold.

  o. Repeat the previous step until the call graph's DFS spanning tree is
     empty.
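To make these steps concrete, here is a minimal sketch of such a
DFS-and-cut partitioner, written against the CallGraph API of roughly this
LLVM revision. It is my illustration, not code from the attached patch: the
size metric (a plain function count), the MaxFuncsPerPart threshold, and
the bookkeeping are all simplifying assumptions, and the
frequency-prioritized edge visiting is omitted since the call graph carries
no frequency annotation yet.

--
#include "llvm/Analysis/CallGraph.h"
#include "llvm/ADT/DenseSet.h"
#include <vector>

using namespace llvm;

typedef std::vector<Function *> Partition;

static const unsigned MaxFuncsPerPart = 100; // assumed threshold

// Post-order (i.e. bottom-up) walk over the DFS spanning tree. "Pending"
// holds functions already visited but not yet assigned to a partition; a
// cut is made whenever the pending set grows past the threshold. The
// bookkeeping is approximate: a cut flushes every pending function,
// wherever it came from in the tree.
static unsigned visit(CallGraphNode *N, DenseSet<CallGraphNode *> &Seen,
                      std::vector<Function *> &Pending,
                      std::vector<Partition> &Parts) {
  Function *F = N->getFunction();
  // "Seen" also takes care of ignoring self-recursive edges.
  if (!Seen.insert(N).second || !F || F->isDeclaration())
    return 0;

  unsigned NumPending = 1;
  for (CallGraphNode::iterator I = N->begin(), E = N->end(); I != E; ++I)
    NumPending += visit(I->second, Seen, Pending, Parts);

  Pending.push_back(F);
  if (NumPending >= MaxFuncsPerPart) { // cut the spanning tree here
    Parts.push_back(Partition(Pending.begin(), Pending.end()));
    Pending.clear();
    return 0;
  }
  return NumPending;
}

static void computePartitionScheme(CallGraph &CG,
                                   std::vector<Partition> &Parts) {
  DenseSet<CallGraphNode *> Seen;
  std::vector<Function *> Pending;
  visit(CG.getRoot(), Seen, Pending, Parts);
  if (!Pending.empty()) // leftovers form the last partition
    Parts.push_back(Partition(Pending.begin(), Pending.end()));
}
--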
3.1.2 Partition transformation
------------------------------

This is a bit involved. There are a bunch of problems we have to tackle.

1) When the uses and the def of a symbol are separated into different
   modules, its attributes, like linkage and visibility, need to be
   changed as well.

   [Example 1] If a symbol is flagged as "internal" to the module where it
   is defined, the linkage needs to be changed into "internal" to the
   executable/lib being compiled.

   [Example 2] For compile-time constants, their initialized value needs
   to be cloned into the partitions where it is referenced. The rationale
   is to let the post-IPO passes take advantage of the initialized value
   to squeeze out some performance. In order not to bloat the code size,
   the cloned constant should be marked "don't emit". [end of eg2]

   Being able to precisely update symbols' attributes is not only vital to
   correctness, it has a significant impact on performance as well.

   I have not yet made a thorough investigation of this issue. My
   rudimentary implementation simply flags a symbol "external" when its
   uses and def are separated into different modules. I believe this is
   one of the most difficult parts of this work. I guess it is going to
   take a long time to become stable.

2) In order to compile each partition in a separate thread (see Section
   3.2), we have to put the partitions in distinct LLVMContexts.

   I could be wrong, but I cannot find code which is able to perform
   function cloning across LLVMContexts.

   My workaround in the patch is to perform the function cloning within
   one LLVMContext (but into a different Module, of course), then save the
   new module to a disk file and load it back into memory using a new
   LLVMContext. It is a bit circuitous and expensive.

One random observation: currently, function-scoped static variables are
considered "global variables". When cloning a function with a static
variable, the compiler has no idea whether the static variable is used
only by the function being cloned, and hence has to separate the function
from the variable. I guess it would be nice if we organized symbols by
scope instead of by lifetime; that would be convenient for this situation.
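Coming back to the workaround in 2): the bitcode round-trip itself is only
a few lines. Below is my sketch of it, not code lifted from the patch; it
assumes the WriteBitcodeToFile/ParseIRFile entry points of this era and
does only minimal error handling.

--
#include "llvm/Bitcode/ReaderWriter.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Serialize the partition to disk in the original context, then re-parse
// it into a brand-new LLVMContext so an independent thread can own it.
// Returns 0 (and leaves NewCtx allocated) on failure.
static Module *reloadInNewContext(Module *Part, const char *Path,
                                  LLVMContext *&NewCtx) {
  std::string Err;
  raw_fd_ostream OS(Path, Err, raw_fd_ostream::F_Binary);
  if (!Err.empty())
    return 0;
  WriteBitcodeToFile(Part, OS);
  OS.close();

  NewCtx = new LLVMContext();
  SMDiagnostic Diag;
  return ParseIRFile(Path, Diag, *NewCtx);
}
--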
3.2 Compile partitions independently
------------------------------------

There are two camps: one camp advocates compiling partitions via
multi-process, the other favors multi-thread.

Inside the Apple compiler teams, I'm the only one belonging to the first
camp. I think while multi-proc sounds a bit redneck, it has its advantages
for this purpose, and while multi-thread is certainly more eye-popping, it
has its advantages as well.

The advantages of multi-proc are:
 1) It is easier to implement: each process runs in its own address space,
    so we don't need to worry about them interfering with each other.
 2) Huge, if not unlimited, address space.

The disadvantage is that it's expensive. But I guess the cost is almost
negligible compared to the overall IPO compilation.

The advantages of multi-thread I can imagine are:
 1) it sounds fancy;
 2) it is lightweight;
 3) inter-thread communication is easier than IPC.

Its disadvantages are:
 1) We will often come across race conditions, and they take an awfully
    long time to figure out. While the code is supposed to be multi-thread
    safe, we might miss some tricky cases. Troubleshooting race conditions
    is a nightmare.
 2) Small address space. This is a big problem if the compiler is built
    32-bit. In that case, the compiler is not able to bring lots of stuff
    into memory even if the HW does provide ample memory.
 3) A thread-safe run-time lib is more expensive. I once linked a compiler
    with -lpthread (I did not have to) on a UNIX platform and saw the
    compiler slow down by about 1/3.

I'm not able to convince the folks in the other camp, and neither are they
able to convince me. I have decided to implement both. Fortunately, this
part is not difficult; it seems rather easy to crank out either one within
a short period of time. It would be interesting to compare them side by
side and see which camp loses :-). On the other hand, if we run into
race-condition problems, we can choose the multi-proc version as a
fallback.

Regardless of which technique is used to compile partitions independently,
in order to judiciously and adaptively choose an appropriate parallel
factor, the compiler certainly needs a lib which is able to figure out the
load the entire system is under. I don't know if such a magic lib exists.
A degenerate sketch of the multi-thread flavor follows.
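The sketch below is mine, not part of the patch, and uses C++11
std::thread for brevity. compilePartition() is a hypothetical stand-in for
the load-bitcode-plus-codegen sequence, and
std::thread::hardware_concurrency() is used as a crude parallel factor; it
reports core count rather than current system load, which is exactly the
gap the "magic lib" above would fill.

--
#include <algorithm>
#include <string>
#include <thread>
#include <vector>

// Stand-in: load one partition's bitcode and run post-IPO codegen on it.
bool compilePartition(const std::string &BCPath, const std::string &ObjPath);

void compileAllPartitions(const std::vector<std::string> &BCFiles) {
  unsigned ParFactor = std::thread::hardware_concurrency();
  if (ParFactor == 0) // the query is allowed to return 0
    ParFactor = 2;

  // Compile in batches of ParFactor; join() is the batch barrier.
  for (size_t Begin = 0; Begin < BCFiles.size(); Begin += ParFactor) {
    size_t End = std::min(BCFiles.size(), Begin + ParFactor);
    std::vector<std::thread> Workers;
    for (size_t I = Begin; I < End; ++I) {
      const std::string &BC = BCFiles[I];
      // part1.bc -> part1.o
      std::string Obj = BC.substr(0, BC.size() - 3) + ".o";
      Workers.push_back(std::thread(compilePartition, BC, Obj));
    }
    for (size_t I = 0, N = Workers.size(); I < N; ++I)
      Workers[I].join();
  }
}
--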
4. The tale of two kinds of linkers
-----------------------------------

As far as I can tell, LLVM supports two kinds of linkers for its IPO
compilation, and the support is embodied by two sets of APIs/interfaces:

 o. interface 1: the stuff named lto_xxxx();
 o. the GNU gold interface: the compiler interacts with GNU gold via the
    adapter implemented in tools/gold/gold-plugin.cpp. This adapter calls
    interface 1 to control the IPO process. It does not have to call the
    interface-1 APIs; I think it is definitely okay for it to call
    internal functions.

The compiler used to generate a single object file from the merged IR; now
it will generate multiple of them, one for each partition. So interface 1
is *NOT* sufficient any more.

For gold linker users, it is easy to make them happy just by hacking the
adapter, informing the linker of the new input object files. This is done
transparently; the users don't need to install a new ld.

For those systems where ld interacts with libLTO.{so,dylib}, the linker
has to accept the new APIs I added to interface 1 in order to enable the
new functionality. Or maybe we can invoke '/the/path/to/ld -r *.o -o
merged.o' and feed the merged.o back to the linker (this would keep the
interface intact)? Unfortunately, that does not work at all: how can I
know the path to ld? libLTO.{so,dylib} is invoked as a plugin; it cannot
see the argv. How about hacking around it by adding a nasty flag pointing
to the right ld? Well, it works. However, I don't believe many people
would like to do it this way, which means I would lose a huge number of
"QA" folks who are working hard on this compiler.

What's wrong with interface 1? The ld side is more active than the
compiler side; however, conceptually the IPO is driven by the compiler
side. This means this interface changes over time. In contrast, the gold
interface (as I reverse-engineered it from the adapter code) is more
symbol-centric, taking little of the IPO-specific stuff into account. That
interface is simple and stable.
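To make the "hack the adapter" route concrete, here is my sketch (not part
of the patch) of what the gold side amounts to. Gold already hands the
plugin an add_input_file callback through its transfer vector at onload
time, so once the post-IPO stage has produced one object per partition,
the adapter can simply register each of them as a new linker input.

--
#include <plugin-api.h> // gold's plugin interface
#include <string>
#include <vector>

static ld_plugin_add_input_file add_input_file = 0;

// onload(): stash the add_input_file callback from the transfer vector,
// as tools/gold/gold-plugin.cpp already does for the hooks it uses.
extern "C" ld_plugin_status onload(ld_plugin_tv *tv) {
  for (; tv->tv_tag != LDPT_NULL; ++tv)
    if (tv->tv_tag == LDPT_ADD_INPUT_FILE)
      add_input_file = tv->tv_u.tv_add_input_file;
  return LDPS_OK;
}

// Called after the post-IPO stage: hand every partition's object back to
// gold instead of the single combined object.
static ld_plugin_status addPartitionObjects(
    const std::vector<std::string> &PartObjs) {
  for (size_t I = 0, N = PartObjs.size(); I < N; ++I)
    if (add_input_file(PartObjs[I].c_str()) != LDPS_OK)
      return LDPS_ERR;
  return LDPS_OK;
}
--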
5. The rudimentary implementation
---------------------------------

I made it work for bzip2 from cpu2kint yesterday. bzip2 is a "tiny"
program; I intentionally lowered the partition size to get three
partitions. There are no comments in the code, and it definitely needs a
rewrite. I only checked correctness (with the ref input), and I have not
measured how much it degrades performance (due to the problem I have not
yet had a chance to tackle; see Section 3.1.2, the symbol-attribute
stuff).

The control flow basically is:

 1) Add a module pass to the IPO pass manager and figure out the partition
    scheme.

 2) Physically partition the merged module. The IR and the object file of
    each partition are placed in a new directory, "llvmipo" by default:

    --
    ls llvmipo/
    Makefile   merged   part1.bc  part1.o  part2.bc  part2.o  part3.bc  part3.o
    --

 3) For demo purposes, I drive the post-IPO stage via a makefile, which
    encapsulates hacks and other nasty stuff.

    NOTE that the post-IPO pass in my hack contains only the CodeGen pass.
    We need to reorganize PassManagerBuilder::populateLTOPassManager(),
    which intermingles IPO passes with intra-procedural scalar passes; we
    need to separate them and move the intra-procedural scalar passes to
    the post-IPO stage.

    --
    .PHONY = all

    BC  = part1.bc part2.bc part3.bc
    OBJ = ${BC:.bc=.o}

    all : merged
    %.o : %.bc
            $(HOME)/tmp/lto.llc -filetype=obj $+ -o $@

    merged : $(OBJ)
            /usr/bin/ld $+ -r -o $@
    --

 4) As the Makefile suggests, the *.o files of the partitions are linked
    into a single object "merged", which is fed back to the linker.

6) Miscellaneous
================

Will partitioning degrade performance in theory? I think it depends on the
definition of performance. If performance means execution time, I guess it
does not. However, if performance includes code size, I think it may have
some negative impact. The following are a few scenarios:

 - constants generated by the post-IPO passes are not shared across
   partitions;
 - dead functions may be detected during the post-IPO stage, and they may
   not be deleted.

-------------- next part --------------
Index: tools/lto/LTOCodeGenerator.cpp
===================================================================
--- tools/lto/LTOCodeGenerator.cpp	(revision 186109)
+++ tools/lto/LTOCodeGenerator.cpp	(working copy)
@@ -17,8 +17,10 @@
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Analysis/Passes.h"
 #include "llvm/Analysis/Verifier.h"
+#include "llvm/Analysis/CallGraph.h"
 #include "llvm/Bitcode/ReaderWriter.h"
 #include "llvm/Config/config.h"
+#include "llvm/ADT/SetVector.h"
 #include "llvm/IR/Constants.h"
 #include "llvm/IR/DataLayout.h"
 #include "llvm/IR/DerivedTypes.h"
@@ -29,16 +31,19 @@
 #include "llvm/MC/MCContext.h"
 #include "llvm/MC/SubtargetFeature.h"
 #include "llvm/PassManager.h"
+#include "llvm/IRReader/IRReader.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/FormattedStream.h"
 #include "llvm/Support/Host.h"
 #include "llvm/Support/MemoryBuffer.h"
+#include "llvm/Support/Program.h"
 #include "llvm/Support/Signals.h"
 #include "llvm/Support/TargetRegistry.h"
 #include "llvm/Support/TargetSelect.h"
 #include "llvm/Support/ToolOutputFile.h"
 #include "llvm/Support/system_error.h"
+#include "llvm/Support/SourceMgr.h"
 #include "llvm/Target/Mangler.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Target/TargetOptions.h"
@@ -46,8 +51,11 @@
 #include "llvm/Transforms/IPO.h"
 #include "llvm/Transforms/IPO/PassManagerBuilder.h"
 #include "llvm/Transforms/ObjCARC.h"
+#include "llvm/Transforms/Utils/ValueMapper.h"
+#include "llvm/Transforms/Utils/Cloning.h"
 using namespace llvm;

+
 static cl::opt<bool>
 DisableOpt("disable-opt", cl::init(false),
   cl::desc("Do not run any optimization passes"));
@@ -68,12 +76,154 @@
 #endif
 }

+class ModPartScheme {
+public:
+  typedef SetVector<Function*> PartitionTy;
+  typedef PartitionTy::iterator iterator;
+  typedef PartitionTy::const_iterator const_iterator;
+
+  ModPartScheme() {}
+  ModPartScheme(const ModPartScheme &That) : Partition(That.Partition) {}
+
+  void AddFunction(Function *F) { Partition.insert(F); }
+
+  iterator begin() { return Partition.begin(); }
+  iterator end() { return Partition.end(); }
+  const_iterator begin() const { return Partition.begin(); }
+  const_iterator end() const { return Partition.end(); }
+  int size() { return Partition.size(); }
+  bool count(Function *F) const { return Partition.count(F); }
+
+private:
+  PartitionTy Partition;
+};
+
+class ModPartSchemeMgr {
+public:
+  typedef std::vector<ModPartScheme *> MPSchemeTy;
+  typedef MPSchemeTy::iterator iterator;
+  typedef MPSchemeTy::const_iterator const_iterator;
+
+  ~ModPartSchemeMgr();
+
+  ModPartScheme *CreateEmptyPartition(void) {
+    ModPartScheme *P = new ModPartScheme;
+    PartSchemes.push_back(P);
+    return P;
+  }
+  iterator begin() { return PartSchemes.begin(); }
+  iterator end() { return PartSchemes.end(); }
+  const_iterator begin() const { return PartSchemes.begin(); }
+  const_iterator end() const { return PartSchemes.end(); }
+  bool empty() const { return PartSchemes.empty(); }
+
+private:
+  MPSchemeTy PartSchemes;
+};
+
+class ModPartAnalysis : public ModulePass {
+public:
+  static char ID;
+  ModPartAnalysis(ModPartSchemeMgr &MPSM):
+    ModulePass(ID), ModPartMgr(MPSM), CG(0) {}
+
+  virtual bool runOnModule(Module &M);
+  virtual void getAnalysisUsage(AnalysisUsage &AU) const;
+
+private:
+  // Partition threshold, currently the metric for "size" is the number
+  // of functions in a partition.
+  enum {
+    MaxFuncInPart = 3
+  };
+
+  class SizeMetric {
+  public:
+    SizeMetric(int func_num=0) : FuncNum(func_num) {}
+    bool ExceedThreshold() const { return FuncNum > MaxFuncInPart; }
+    bool ExceedThresholdTooMuch() const
+      { return FuncNum >= MaxFuncInPart * 3 / 2; }
+    void IncFuncNum(int amt = 1) { FuncNum += amt; }
+    const SizeMetric& operator+=(const SizeMetric &That)
+      { FuncNum += That.FuncNum; return *this; }
+    void Reset() { FuncNum = 0; }
+
+  private:
+    int FuncNum;
+  };
+
+  void setVisited(CallGraphNode *N) { Visited[N] = true; }
+  bool isVisited(CallGraphNode *N) const {
+    return Visited.find(N) != Visited.end();
+  }
+
+  SizeMetric PerformPartitionHelper(CallGraphNode *Root);
+  void EmitPartition(CallGraphNode *DFSRoot, SizeMetric &SM);
+  SizeMetric EvaluateModuleSize(const Module *M) const;
+
+private:
+  ModPartSchemeMgr &ModPartMgr;
+  CallGraph *CG;
+  std::vector<CallGraphNode *> DFSStack;
+  SizeMetric RemainingModSize;
+  DenseMap<CallGraphNode *, bool> Visited;
+};
+
+char ModPartAnalysis::ID = 0;
+
+class ModPartXform {
+public:
+  ModPartXform(Module *Mod, ModPartSchemeMgr &MPSM, IPOPartMgr &PM) :
+    PartSchemeMgr(MPSM), IPOPartMgr(PM), MergedModule(Mod), NextPartId(0) {}
+
+  void getWorkDir();
+
+  void PerformTransform();
+
+private:
+  IPOPartition *PerformTransform(ModPartScheme &PartScheme);
+
+  void CollectGlobalSymbol(ModPartScheme &Part, Module *New,
+                           ValueToValueMapTy &VMap);
+  void CollectGlobalSymbol(Function *F, Module *New,
+                           ValueToValueMapTy &VMap);
+
+  Function *CreateFuncDecl(const Function *F, Module *NewMod);
+  GlobalVariable *CreateVarDecl(const GlobalVariable *GV, Module *NewMod);
+
+private:
+  ModPartSchemeMgr &PartSchemeMgr;
+  IPOPartMgr &IPOPartMgr;
+  Module *MergedModule;
+  int NextPartId;
+};
+
+class PostIPOCompile {
+public:
+  PostIPOCompile(IPOPartMgr &IPM, IPOFileMgr &IFM, bool ToMergeObjs = false) :
+    PartMgr(IPM), FileMgr(IFM), MergedObjFile(0), MergeObjs(ToMergeObjs) {}
+
+  IPOFile *getMergedObjFile() const { return MergedObjFile; }
+
+  bool Compile(std::string &ErrMsg);
+
+private:
+  bool generateMakefile(std::string &ErrMsg);
+
+private:
+  IPOPartMgr &PartMgr;
+  IPOFileMgr &FileMgr;
+  IPOFile *MergedObjFile;
+  bool MergeObjs;
+};
+
 LTOCodeGenerator::LTOCodeGenerator()
   : _context(getGlobalContext()),
     _linker(new Module("ld-temp.o", _context)), _target(NULL),
     _emitDwarfDebugInfo(false), _scopeRestrictionsDone(false),
     _codeModel(LTO_CODEGEN_PIC_MODEL_DYNAMIC),
-    _nativeObjectFile(NULL) {
+    _nativeObjectFile(NULL),
+    _IPOPartMgr(_IPOFileMgr) {
   InitializeAllTargets();
   InitializeAllTargetMCs();
   InitializeAllAsmPrinters();
@@ -161,34 +311,42 @@
 }

 bool LTOCodeGenerator::compile_to_file(const char** name, std::string& errMsg) {
-  // make unique temp .o file to put generated object file
-  SmallString<128> Filename;
-  int FD;
-  error_code EC = sys::fs::createTemporaryFile("lto-llvm", "o", FD, Filename);
-  if (EC) {
-    errMsg = EC.message();
+  if (determineTarget(errMsg))
     return true;
-  }

-  // generate object file
-  tool_output_file objFile(Filename.c_str(), FD);
+  PostIPOCompile PostIPOStage(_IPOPartMgr, _IPOFileMgr, true/*merge objects*/);
+  if (!_IPOFileMgr.CreateWorkDir(errMsg))
     return true;

-  bool genResult = generateObjectFile(objFile.os(), errMsg);
-  objFile.os().close();
-  if (objFile.os().has_error()) {
-    objFile.os().clear_error();
-    sys::fs::remove(Twine(Filename));
+  performIPO(errMsg, true);
+
+  if (!PostIPOStage.Compile(errMsg))
     return true;
-  }

-  objFile.keep();
-  if (genResult) {
-    sys::fs::remove(Twine(Filename));
+  *name = PostIPOStage.getMergedObjFile()->getPath().c_str();
+  return false;
+}
+
+bool LTOCodeGenerator::compile_to_files(const char** name, std::string& errMsg) {
+  if (determineTarget(errMsg))
+    return true;
+
+  performIPO(errMsg);
+
+  // Parallelize post-IPO
+  _nativeObjectPath.clear();
+  PostIPOCompile PostIPOStage(_IPOPartMgr, _IPOFileMgr);
+  if (!PostIPOStage.Compile(errMsg))
+    return true;
+
+  for (IPOPartMgr::iterator I = _IPOPartMgr.begin(), E = _IPOPartMgr.end();
+       I != E; I++) {
+    _nativeObjectPath.append((*I)->getObjFilePath().data());
+    _nativeObjectPath.append(1, '\0');
   }
+  _nativeObjectPath.append(1, '\0');
+  *name = _nativeObjectPath.c_str();

-  _nativeObjectPath = Filename.c_str();
-  *name = _nativeObjectPath.c_str();
   return false;
 }

@@ -357,16 +515,12 @@
   _scopeRestrictionsDone = true;
 }

-/// Optimize merged modules using various IPO passes
-bool LTOCodeGenerator::generateObjectFile(raw_ostream &out,
-                                          std::string &errMsg) {
-  if (this->determineTarget(errMsg))
-    return true;
+void LTOCodeGenerator::performIPO(std::string &errMsg, bool PerformPartition) {

   Module* mergedModule = _linker.getModule();

   // Mark which symbols can not be internalized
-  this->applyScopeRestrictions();
+  applyScopeRestrictions();

   // Instantiate the pass manager to organize the passes.
   PassManager passes;
@@ -390,13 +544,30 @@
   // Make sure everything is still good.
   passes.add(createVerifierPass());

+  ModPartSchemeMgr MPSM;
+  if (PerformPartition)
+    passes.add(new ModPartAnalysis(MPSM));
+
+  passes.run(*mergedModule);
+
+  if (!MPSM.empty()) {
+    ModPartXform MPT(mergedModule, MPSM, _IPOPartMgr);
+    MPT.PerformTransform();
+  } else {
+    IPOPartition *P = _IPOPartMgr.CreateIPOPart(mergedModule);
+    P->SaveBitCode();
+  }
+}
+
+bool LTOCodeGenerator::performPostLTO(Module *Mod, formatted_raw_ostream &Out,
+                                      std::string &errMsg) {
+  // placeholder for post-IPO scalar opt
+
   PassManager codeGenPasses;

   codeGenPasses.add(new DataLayout(*_target->getDataLayout()));
   _target->addAnalysisPasses(codeGenPasses);

-  formatted_raw_ostream Out(out);
-
   // If the bitcode files contain ARC code and were compiled with optimization,
   // the ObjCARCContractPass must be run, so do it unconditionally here.
   codeGenPasses.add(createObjCARCContractPass());
@@ -404,16 +575,31 @@
   if (_target->addPassesToEmitFile(codeGenPasses, Out,
                                    TargetMachine::CGFT_ObjectFile)) {
     errMsg = "target file type not supported";
+    return true;
+  }
+
+  // Run the code generator, and write assembly file
+  codeGenPasses.run(*Mod);
+  return false;
+}
+
+/// Optimize merged modules using various IPO passes
+bool LTOCodeGenerator::generateObjectFile(Module *Mod, const char *FN,
+                                          std::string &errMsg) {
+  std::string errFile;
+  tool_output_file OutFile(FN, errMsg, raw_fd_ostream::F_Binary);
+
+  if (!errFile.empty()) {
+    errMsg += errFile;
     return true;
   }
+  OutFile.keep();

-  // Run our queue of passes all at once now, efficiently.
-  passes.run(*mergedModule);
+  formatted_raw_ostream OS(OutFile.os());
+  bool Fail = performPostLTO(Mod, OS, errMsg);
+  OutFile.os().close();

-  // Run the code generator, and write assembly file
-  codeGenPasses.run(*mergedModule);
-
-  return false; // success
+  return Fail;
 }

 /// setCodeGenDebugOptions - Set codegen debugging options to aid in debugging
@@ -428,3 +614,495 @@
     _codegenOptions.push_back(strdup(o.first.str().c_str()));
   }
 }
+
+///////////////////////////////////////////////////////////////////////////
+//
+//      Implementation of ModPartSchemeMgr, ModPartXform
+//
+///////////////////////////////////////////////////////////////////////////
+//
+ModPartSchemeMgr::~ModPartSchemeMgr() {
+  while (!PartSchemes.empty()) {
+    delete PartSchemes.back();
+    PartSchemes.pop_back();
+  }
+}
+
+Function *ModPartXform::CreateFuncDecl(const Function *F, Module *NewModule) {
+  Function *NF = Function::Create(F->getFunctionType(),
+                                  GlobalValue::ExternalLinkage,
+                                  F->getName(), NewModule);
+  NF->copyAttributesFrom(F);
+  return NF;
+}
+
+static void PromoteGlobalVarLinkage(GlobalVariable *GV) {
+  GV->setLinkage(GlobalValue::ExternalLinkage);
+}
+
+static void PromoteGlobalFuncLinkage(Function *F) {
+  F->setLinkage(GlobalValue::ExternalLinkage);
+}
+
+GlobalVariable *ModPartXform::CreateVarDecl(const GlobalVariable *GV,
+                                            Module *NewMod) {
+  GlobalVariable *G;
+  G = new GlobalVariable(*NewMod, GV->getType()->getElementType(),
+                         GV->isConstant(),
+                         GlobalValue::ExternalLinkage,
+                         0 /* InitVal */,
+                         Twine(GV->getName()),
+                         0 /* Before */,
+                         GV->getThreadLocalMode(),
+                         GV->getType()->getPointerAddressSpace(),
+                         GV->hasInitializer() ? true : false);
+  return G;
+}
+
+void ModPartXform::CollectGlobalSymbol(Function *F,
+                                       Module *New,
+                                       ValueToValueMapTy &VMap) {
+  DenseMap<Value *, bool> Visited;
+  SmallVector<Constant *, 16> WorkList;
+
+  for (Function::iterator BI = F->begin(), BE = F->end(); BI != BE; BI++) {
+    for (BasicBlock::iterator II = BI->begin(), IE = BI->end();
+         II != IE; II++) {
+      Instruction &Inst = *II;
+      for (User::op_iterator op = Inst.op_begin(), E = Inst.op_end();
+           op != E; ++op) {
+        if (Constant *C = dyn_cast<Constant>(*op)) {
+          if (!isa<BasicBlock>(C) && Visited.find(C) == Visited.end()) {
+            Visited[C] = true;
+            WorkList.push_back(C);
+          }
+        }
+      }
+    }
+
+    while (!WorkList.empty()) {
+      Constant *C = WorkList.pop_back_val();
+      if (GlobalVariable *GV = dyn_cast<GlobalVariable>(C)) {
+        if (VMap.find(GV) == VMap.end()) {
+          VMap[GV] = CreateVarDecl(GV, New);
+          PromoteGlobalVarLinkage(GV);
+        }
+        continue;
+      } else if (Function *Func = dyn_cast<Function>(C)) {
+        if (VMap.find(Func) == VMap.end()) {
+          VMap[Func] = CreateFuncDecl(Func, New);
+          PromoteGlobalFuncLinkage(Func);
+        }
+        continue;
+      }
+
+      for (User::const_op_iterator I = C->op_begin(), E = C->op_end();
+           I != E; ++I) {
+        Constant *C2 = dyn_cast<Constant>(*I);
+        if (C2 && Visited.find(C2) == Visited.end()) {
+          Visited[C2] = true;
+          WorkList.push_back(C2);
+        }
+      }
+    }
+  }
+}
+
+void ModPartXform::CollectGlobalSymbol(ModPartScheme &Part, Module *New,
+                                       ValueToValueMapTy &VMap) {
+  for (ModPartScheme::iterator I = Part.begin(), E = Part.end();
+       I != E; I++) {
+    const Function *F = *I;
+    VMap[F] = CreateFuncDecl(F, New);
+  }
+
+  for (ModPartScheme::iterator I = Part.begin(), E = Part.end();
+       I != E; I++)
+    CollectGlobalSymbol(*I, New, VMap);
+}
+
+void __attribute__((used)) dump_module(Module *M) {
+  std::string EI;
+  tool_output_file f("module.ll", EI);
+  f.keep();
+
+  M->print(f.os(), 0);
+}
+
+void __attribute__((used)) dump_type(Type *T) {
+  T->dump();
+}
+
+// Splitting the merged module by moving the specified functions to the
+// new module
+IPOPartition *ModPartXform::PerformTransform(ModPartScheme &PartScheme) {
+  std::string FN;
+  raw_string_ostream OS(FN);
+  OS << "partition" << NextPartId++;
+
+  Module *NewMod = new Module(OS.str(), MergedModule->getContext());
+  NewMod->setDataLayout(MergedModule->getDataLayout());
+  NewMod->setTargetTriple(MergedModule->getTargetTriple());
+
+  ValueToValueMapTy VMap;
+  CollectGlobalSymbol(PartScheme, NewMod, VMap);
+
+  // Copy over functions in the partition
+  for (ModPartScheme::iterator I = PartScheme.begin(), E = PartScheme.end();
+       I != E; I++) {
+    Function *OF = *I;
+    Function *NF = cast<Function>(VMap[OF]);
+
+    // Steal some code from llvm::CloneFunction.
+    {
+      // Loop over the arguments, copying the names of the mapped arguments over...
+      Function::arg_iterator DestI = NF->arg_begin();
+      for (Function::const_arg_iterator I = OF->arg_begin(), E = OF->arg_end();
+           I != E; ++I)
+        // Is this argument preserved? WTF, how come an argument is preserved?
+        if (VMap.count(I) == 0) {
+          DestI->setName(I->getName()); // Copy the name over...
+          VMap[I] = DestI++;            // Add mapping to VMap
+        }
+    }
+    SmallVector<ReturnInst*, 8> Returns;
+    CloneFunctionInto(NF, OF, VMap, true, Returns);
+
+    OF->deleteBody();
+  }
+
+  IPOPartition *NewPart = IPOPartMgr.CreateIPOPart(NewMod);
+
+  // We have to save the module to disk such that next time it's loaded
+  // it belongs to a different context.
+  NewPart->SaveBitCode();
+
+  return NewPart;
+}
+
+void ModPartXform::PerformTransform() {
+  for (ModPartSchemeMgr::iterator I = PartSchemeMgr.begin(),
+       E = PartSchemeMgr.end(); I != E; I++)
+    (void)PerformTransform(**I);
+
+  IPOPartition *MP = IPOPartMgr.CreateIPOPart(MergedModule);
+  MP->SaveBitCode();
+}
+
+///////////////////////////////////////////////////////////////////////////
+//
+//      Implementation of ModPartAnalysis
+//
+///////////////////////////////////////////////////////////////////////////
+//
+void ModPartAnalysis::getAnalysisUsage(AnalysisUsage &AU) const {
+  AU.setPreservesAll();
+  AU.addRequired<CallGraph>();
+}
+
+void ModPartAnalysis::EmitPartition(CallGraphNode *DFSRoot, SizeMetric &SM) {
+  ModPartScheme *P = ModPartMgr.CreateEmptyPartition();
+  while (!DFSStack.empty()) {
+    CallGraphNode *N = DFSStack.back();
+    P->AddFunction(N->getFunction());
+    DFSStack.pop_back();
+    if (N == DFSRoot)
+      break;
+  }
+}
+
+ModPartAnalysis::SizeMetric
+ModPartAnalysis::PerformPartitionHelper(CallGraphNode *R) {
+  SizeMetric SM;
+  setVisited(R);
+
+  // Skip dummy call-graph node or declaration
+  {
+    Function *F = R->getFunction();
+    if (!F || F->isDeclaration())
+      return SM;
+  }
+
+  DFSStack.push_back(R);
+  SM.IncFuncNum();
+
+  for (CallGraphNode::iterator I = R->begin(), E = R->end(); I != E; I++) {
+    CallGraphNode *Callee = (*I).second;
+    if (isVisited(Callee))
+      continue;

+    setVisited(Callee);
+
+    // Skip dummy call-graph node or declaration
+    Function *F = Callee->getFunction();
+    if (!F || F->isDeclaration())
+      continue;
+
+    SizeMetric T = PerformPartitionHelper(Callee);
+    bool Emit = false;
+
+    if (T.ExceedThreshold())
+      Emit = true;
+    else {
+      SM += T;
+      Emit = SM.ExceedThreshold();
+    }
+
+    if (Emit) {
+      EmitPartition(R, SM);
+      SM.Reset();
+      if (!RemainingModSize.ExceedThresholdTooMuch())
+        break;
+    }
+  }
+  return SM;
+}
+
+// Return the "size" of given module.
+ModPartAnalysis::SizeMetric ModPartAnalysis::EvaluateModuleSize
+  (const Module *M) const {
+  SizeMetric S;
+  for (Module::const_iterator I = M->begin(), E = M->end(); I != E; I++) {
+    const Function &F = *I;
+    if (!F.isDeclaration())
+      S.IncFuncNum();
+  }
+  return S;
+}
+
+bool ModPartAnalysis::runOnModule(Module &M) {
+  SizeMetric S = EvaluateModuleSize(&M);
+  if (!S.ExceedThresholdTooMuch()) {
+    // While it may be big, it is okay.
+    return false;
+  }
+
+  if (!(CG = getAnalysisIfAvailable<CallGraph>()))
+    return false;
+
+  CallGraphNode *R = CG->getRoot();
+  (void)PerformPartitionHelper(R);
+
+  return false;
+}
+
+///////////////////////////////////////////////////////////////////////////
+//
+//      Implementation of IPOPartition and IPOPartMgr
+//
+///////////////////////////////////////////////////////////////////////////
+//
+IPOPartition::IPOPartition(Module *M, const char *NameWoExt, IPOFileMgr &FM) :
+  Mod(0), Ctx(0), IRFile(0), ObjFile(0), FileNameWoExt(NameWoExt), FileMgr(FM) {
+}
+
+IPOFile &IPOPartition::getIRFile() const {
+  if (IRFile)
+    return *IRFile;
+  else {
+    std::string FN(FileNameWoExt + ".bc");
+    return *(IRFile = FileMgr.CreateIRFile(FN.c_str()));
+  }
+}
+
+IPOFile &IPOPartition::getObjFile() const {
+  if (ObjFile)
+    return *ObjFile;
+  else {
+    std::string FN(FileNameWoExt + ".o");
+    return *(ObjFile = FileMgr.CreateObjFile(FN.c_str()));
+  }
+}
+
+bool IPOPartition::SaveBitCode() {
+  if (!Mod) {
+    // The bitcode has already been saved to disk.
+    return true;
+  }
+
+  IPOFile &F = getIRFile();
+  if (F.ErrOccur())
+    return false;
+
+  raw_fd_ostream OF(F.getPath().c_str(), F.getLastErrStr(),
+                    raw_fd_ostream::F_Binary);
+  WriteBitcodeToFile(Mod, OF);
+  OF.close();
+
+  Mod = 0;
+  delete Ctx;
+  Ctx = 0;
+
+  return !F.ErrOccur();
+}
+
+bool IPOPartition::LoadBitCode() {
+  if (Mod)
+    return true;
+
+  IPOFile &F = getIRFile();
+  if (F.ErrOccur())
+    return false;
+
+  Ctx = new LLVMContext;
+  SMDiagnostic Diag;
+  Mod = ParseIRFile(getIRFilePath(), Diag, *Ctx);
+  if (!Mod) {
+    F.getLastErrStr() = Diag.getMessage();
+    return false;
+  }
+
+  return true;
+}
+
+IPOPartition *IPOPartMgr::CreateIPOPart(Module *M) {
+  std::string PartName;
+  raw_string_ostream OS(PartName);
+  OS << "part" << NextPartId++;
+
+  IPOPartition *P = new IPOPartition(M, OS.str().c_str(), FileMgr);
+  P->Mod = M;
+  IPOParts.push_back(P);
+  return P;
+}
+
+///////////////////////////////////////////////////////////////////////////
+//
+//      Implementation of IPOFile and IPOFileMgr
+//
+///////////////////////////////////////////////////////////////////////////
+IPOFile::IPOFile(const char *DirName, const char* BaseName, bool KeepFile)
+  : Fname(BaseName), Fpath(DirName), Keep(KeepFile) {
+  Fpath = Fpath + "/" + BaseName;
+  Keep = true;
+}
+
+IPOFile::~IPOFile() {
+  if (!Keep)
+    sys::fs::remove(Fpath);
+}
+
+IPOFileMgr::IPOFileMgr(): WorkDir("llvmipo") {
+  IRFiles.reserve(20);
+  ObjFiles.reserve(20);
+  OtherFiles.reserve(8);
+  KeepFiles = true;
+  WorkDirCreated = false;
+}
+
+IPOFileMgr::~IPOFileMgr() {
+  if (!KeepFiles) {
+    uint32_t NumRm;
+    sys::fs::remove_all(Twine(WorkDir), NumRm);
+  }
+}
+
+bool IPOFileMgr::CreateWorkDir(std::string &ErrorInfo) {
+  if (WorkDirCreated)
+    return true;
+
+  bool Exist;
+  error_code EC = sys::fs::create_directory(Twine(WorkDir), Exist);
+  if (EC == error_code::success()) {
+    WorkDirCreated = true;
+    return true;
+  }
+
+  return false;
+}
+
+IPOFile *IPOFileMgr::CreateIRFile(const char *Name) {
+  IPOFile *F = CreateFile(Name);
+  IRFiles.push_back(F);
+  return F;
+}
+
+IPOFile *IPOFileMgr::CreateObjFile(const char *Name) {
+  IPOFile *F = CreateFile(Name);
+  ObjFiles.push_back(F);
+  return F;
+}
+
+IPOFile *IPOFileMgr::CreateMakefile(const char *Name) {
+  IPOFile *F = CreateFile(Name);
+  OtherFiles.push_back(F);
+  return F;
+}
+
+///////////////////////////////////////////////////////////////////////////
+//
+//      Implementation of PostIPOCompile
+//
+///////////////////////////////////////////////////////////////////////////
+
+// The makefile looks something like this:
+//
+//  .PHONY = all
+//
+//  BC  = part1.bc part2.bc part3.bc
+//  OBJ = ${BC:.bc=.o}
+//
+//  all : merged.o
+//  %.o : %.bc
+//      $(HOME)/tmp/lto.llc -filetype=obj $< -o $@
+//
+//  merged.o : $(OBJ)
+//      /usr/bin/ld $+ -r -o $@
+//
+bool PostIPOCompile::generateMakefile(std::string &ErrMsg) {
+  IPOFile *MkFile = FileMgr.CreateMakefile("Makefile");
+
+  std::string NewErrMsg;
+  raw_fd_ostream Mk(MkFile->getPath().c_str(), NewErrMsg, 0);
+
+  if (!NewErrMsg.empty()) {
+    ErrMsg += NewErrMsg;
+    return false;
+  }
+
+  std::string BCFiles;
+  for (IPOPartMgr::iterator I = PartMgr.begin(), E = PartMgr.end();
+       I != E; I++) {
+    BCFiles += (*I)->getIRFile().getName();
+    BCFiles += " ";
+  }
+
+  Mk << ".PHONY = all\n\n";
+
+  Mk << "\nBC = " << BCFiles << "\n";
+  Mk << "OBJ = ${BC:.bc=.o}\n\n";
+
+  if (MergeObjs)
+    Mk << "all : " << MergedObjFile->getName() << "\n";
+  else
+    Mk << "all : $(OBJ)\n";
+
+  // Emit rule
+  Mk << "%.o : %.bc\n\t$(HOME)/tmp/lto.llc -filetype=obj $+ -o $@\n\n";
+
+  if (MergeObjs) {
+    Mk << MergedObjFile->getName() << " : $(OBJ)\n";
+    Mk << "\t/usr/bin/ld $+ -r -o $@\n\n";
+  }
+
+  Mk.close();
+
+  return true;
+}
+
+bool PostIPOCompile::Compile(std::string &ErrMsg) {
+  if (MergeObjs)
+    MergedObjFile = FileMgr.CreateObjFile("merged");
+
+  if (!generateMakefile(ErrMsg))
+    return false;
+
+  const char *args[] = {"/usr/bin/make", "-C", 0, 0};
+  args[2] = FileMgr.getWorkDir().c_str();
+
+  bool Fail;
+  sys::ExecuteAndWait("/usr/bin/make", args, 0/*envp*/, 0/*redirect*/,
+                      0/*secondsToWait*/, 0/*memoryLimit*/, &ErrMsg, &Fail);
+  return !Fail;
+}
Index: tools/lto/LTOCodeGenerator.h
===================================================================
--- tools/lto/LTOCodeGenerator.h	(revision 186109)
+++ tools/lto/LTOCodeGenerator.h	(working copy)
@@ -18,6 +18,8 @@
 #include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/ADT/StringMap.h"
 #include "llvm/Linker.h"
+#include "llvm/Support/FormattedStream.h"
+#include "llvm/Support/system_error.h"
 #include <string>
 #include <vector>
@@ -28,6 +30,111 @@
 class MemoryBuffer;
 class TargetMachine;
 class raw_ostream;
+
+  class IPOFile;
+  class IPOFileMgr;
+  class IPOPartition {
+  public:
+    bool isInMemory() const { return Mod != 0; }
+    bool SaveBitCode();
+    bool LoadBitCode();
+    const std::string &getIRFilePath() const;
+    const std::string &getObjFilePath() const;
+    Module *getModule() const { return Mod; }
+
+    IPOFile &getIRFile() const;
+    IPOFile &getObjFile() const;
+
+  private:
+    friend class IPOPartMgr;
+    IPOPartition(Module *M, const char *FileNameWoExt, IPOFileMgr &FM);
+
+    Module *Mod;
+    LLVMContext *Ctx;
+    mutable IPOFile *IRFile;
+    mutable IPOFile *ObjFile;
+    std::string FileNameWoExt;
+    IPOFileMgr &FileMgr;
+  };
+
+  class IPOPartMgr {
+  public:
+    typedef std::vector<IPOPartition *> IPOPartsTy;
+    typedef IPOPartsTy::iterator iterator;
+    typedef IPOPartsTy::const_iterator const_iterator;
+
+    iterator begin() { return IPOParts.begin(); }
+    iterator end() { return IPOParts.end(); }
+    const_iterator begin() const { return IPOParts.begin(); }
+    const_iterator end() const { return IPOParts.end(); }
+
+    IPOPartition *CreateIPOPart(Module *);
+
+    IPOPartMgr(IPOFileMgr &IFM) : FileMgr(IFM), NextPartId(1) {}
+
+  private:
+    IPOPartsTy IPOParts;
+    IPOFileMgr &FileMgr;
+    int NextPartId;
+  };
+
+  class IPOFile {
+  public:
+    const std::string &getName() { return Fname; }
+    const std::string &getPath() { return Fpath; }
+
+    error_code &getLastErrCode() { return LastErr; }
+    std::string &getLastErrStr() { return LastErrStr; }
+
+    bool ErrOccur() const {
+      return LastErr != error_code::success() || !LastErrStr.empty();
+    }
+
+    void setKeep() { Keep = true; }
+    bool toKeep() const { return Keep; }
+
+  private:
+    friend class IPOFileMgr;
+    IPOFile(const char* DirName, const char *BaseName, bool Keep=false);
+    ~IPOFile();
+
+  private:
+    std::string Fname;
+    std::string Fpath;
+    error_code LastErr;
+    std::string LastErrStr;
+    bool Keep;
+  };
+
+  class IPOFileMgr {
+  public:
+    IPOFileMgr();
+    ~IPOFileMgr();
+
+    bool CreateWorkDir(std::string &ErrorInfo);
+    const std::string &getWorkDir() { return WorkDir; }
+
+    IPOFile *CreateIRFile(const char *Name);
+    IPOFile *CreateObjFile(const char *Name);
+    IPOFile *CreateMakefile(const char *Name);
+
+    typedef std::vector<IPOFile *> FileVect;
+    FileVect &getIRFiles() { return IRFiles; }
+    FileVect &getObjFiles() { return ObjFiles; }
+
+  private:
+    IPOFile *CreateFile(const char *Name) {
+      return new IPOFile(WorkDir.c_str(), Name);
+    }
+
+  private:
+    FileVect IRFiles;
+    FileVect ObjFiles;
+    FileVect OtherFiles;
+    std::string WorkDir;
+    bool KeepFiles;
+    bool WorkDirCreated;
+  };
 }

 //===----------------------------------------------------------------------===//
@@ -52,11 +159,16 @@
   bool writeMergedModules(const char *path, std::string &errMsg);
   bool compile_to_file(const char **name, std::string &errMsg);
+  bool compile_to_files(const char** name, std::string& errMsg);
   const void *compile(size_t *length, std::string &errMsg);
   void setCodeGenDebugOptions(const char *opts);

 private:
-  bool generateObjectFile(llvm::raw_ostream &out, std::string &errMsg);
+  void performIPO(std::string &errMsg, bool PerformPartition=false);
+  bool performPostLTO(llvm::Module *Mod, llvm::formatted_raw_ostream &Out,
+                      std::string &errMsg);
+
+  bool generateObjectFile(llvm::Module *, const char *Out, std::string &errMsg);
   void applyScopeRestrictions();
   void applyRestriction(llvm::GlobalValue &GV,
                         std::vector<const char*> &mustPreserveList,
@@ -78,6 +190,8 @@
   std::vector<char*> _codegenOptions;
   std::string _mCpu;
   std::string _nativeObjectPath;
+  llvm::IPOPartMgr _IPOPartMgr;
+  llvm::IPOFileMgr _IPOFileMgr;
 };

 #endif // LTO_CODE_GENERATOR_H
On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:

> 3.2 Compile partitions independently
> --------------------------------------
>
> There are two camps: one camp advocate compiling partitions via multi-process,
> the other one favor multi-thread.
>
> Inside Apple compiler teams, I'm the only one belong to the 1st comp. I think
> while multi-proc sounds bit red-neck, it has its advantage for this purpose, and
> while multi-thread is certainly more eye-popping, it has its advantage
> as well.
>
> The advantage of multi-proc are:
>  1) easier to implement, the process run in its own address space.
>     We don't need to worry about they can interfere with each other.
>  2) huge, or not unlimited, address space.
>
>  The disadvantage is that it's expensive. But I guess the cost is
>  almost negligible compared to the overall IPO compilation.
>
> The advantage of multi-threads I can imagine are:
>  1) sound fancy
>  2) it is light-weight
>  3) inter-thread communication is easier than IPC.
>
> Its disadvantage are:
>  1). Oftentime we will come across race-condition, and it took
>      awful long time to figure it out. While the code is supposed
>      to be mult-thread safe, we might miss some tricky case.
>      Trouble-shooting race condition is a nightmare.
>  2) Small address space. This is big problem if we the compiler
>     is built 32-bit. In that case, the compiler is not able to bring
>     lots of stuff in memory even if the HW dose provide ample mem.
>  3) The thread-safe run-time lib is more expensive.
>     I once linked a compiler using -lpthread (I dose not have to) on a
>     UNIX platform, and saw the compiler slow down by about 1/3.
>
> I'm not able to convince the folks in other camp, neither are they
> able to convince me. I decide to implement both. Fortunately, this
> part is not difficult, it seems to be rather easy to crank out one within short
> period of time. It would be interesting to compare them side-by-side,
> and see which camp lose:-). On the other hand, if we run into race-condition
> problem, we choose multi-proc version as a fall-back.

While I am a self-proclaimed multi-process red-neck, in this case I would
prefer to see a multi-threaded implementation because I want to verify that
LLVMContext can be used as advertised. I'm sure some extra care will be
needed to report failures/diagnostics, but we should start with the
assumption that this approach is not significantly harder than
multi-process because that's how we advertise the design.

If any of the multi-threaded disadvantages you point out are real, I would
like to find out about it.

1. Race Conditions: We should be able to verify that the thread-parallel
vs. sequential or multi-process compilation generate the same result. If
they diverge, we would like to know about the bug so it can be
fixed--independent of LTO.

2. Small Address Space with LTO. We don't need to design around this
hypothetical case.

3. Expensive thread-safe runtime lib. We should not speculate that
platforms that we, as the LLVM community, care about have this problem.
Let's assume that our platforms are well implemented unless we have data to
the contrary. (Personally, I would even love to use TLS in the compiler to
vastly simplify API design in the backend, but I am not going to be popular
for saying so).

We should be able to decompose each step of compilation for debugging. So
the multi-process "implementation" should just be a degenerate form of
threading with a bit of driver magic if you want to automate it.
-Andy
On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:

> 6) Miscellaneous
> ==========
> Will partitioning degrade performance in theory. I think it depends on
> the definition of performance. If performance means execution-time, I
> guess it dose not. However, if performance includes code-size, I think
> it may have some negative impact. Following is few scenario:
>
> - constants generated by the post-IPO passes are not shared across
>   partitions
> - dead func may be detected during the post-IPO stage, and they may not
>   be deleted.

I don't know if it's feasible, but stable linker output, independent of the
partitioning, is highly desirable. One of the most irritating performance
regressions to track down involves different versions of the host linker.
If partitioning decisions are thrown into the mix, this could be annoying.
Is it possible for the final link to do a better job cleaning up?

-Andy
On Sun, Jul 14, 2013 at 5:57 PM, Andrew Trick <atrick at apple.com> wrote:
>
> On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>> 6) Miscellaneous
>> ==========
>> Will partitioning degrade performance in theory. I think it depends on
>> the definition of performance. If performance means execution-time, I
>> guess it dose not. However, if performance includes code-size, I think
>> it may have some negative impact. Following is few scenario:
>>
>> - constants generated by the post-IPO passes are not shared across
>>   partitions
>> - dead func may be detected during the post-IPO stage, and they may not
>>   be deleted.
>
> I don't know if it's feasible, but stable linker output, independent of
> the partitioning, is highly desirable. One of the most irritating
> performance regressions to track down involves different versions of the
> host linker. If partitioning decisions are thrown into the mix, this
> could be annoying. Is it possible for the final link to do a better job
> cleaning up?

While I haven't yet read the rest of the proposal, I'm going to comment on
this in particular. In my view this is an absolute requirement, as the
compiler should produce the same output given the same input every time,
with no deviation.

-eric
On 7/14/13 5:56 PM, Andrew Trick wrote:
>
> On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com
> <mailto:shuxin.llvm at gmail.com>> wrote:
>
>> 3.2 Compile partitions independently
>> --------------------------------------
>>
>> [snip -- quoted in full upthread]
>
> While I am a self-proclaimed multi-process red-neck, in this case I
> would prefer to see a multi-threaded implementation because I want to
> verify that LLVMContext can be used as advertised. I'm sure some extra
> care will be needed to report failures/diagnostics, but we should
> start with the assumption that this approach is not significantly
> harder than multi-process because that's how we advertise the design.
>
> If any of the multi-threaded disadvantages you point out are real, I
> would like to find out about it.
>
> 1. Race Conditions: We should be able to verify that the
> thread-parallel vs. sequential or multi-process compilation generate
> the same result. If they diverge, we would like to know about the bug
> so it can be fixed--independent of LTO.
>
> 2. Small Address Space with LTO. We don't need to design around this
> hypothetical case.
>
> 3. Expensive thread-safe runtime lib. We should not speculate that
> platforms that we, as the LLVM community, care about have this
> problem. Let's assume that our platforms are well implemented unless
> we have data to the contrary. (Personally, I would even love to use
> TLS in the compiler to vastly simplify API design in the backend, but
> I am not going to be popular for saying so).
>
> We should be able to decompose each step of compilation for debugging.

Yes, of course. We should be able to save the IR before and after each
major step, and when trouble-shooting we should be able to focus on one
smaller partition. Once the problem in that partition is fixed, we can
manually link all the partitions and the other libs into the final
executable/dyn-lib. This is one important reason for partitioning.

> So the multi-process "implementation" should just be a degenerate form
> of threading with a bit of driver magic if you want to automate it.

Yes! That is why I'd like to implement both. There is one difference
though: with the multi-proc implementation, we need to pass the
/path/to/{llc,opt} to the libLTO.{so|dylib} such that it can invoke these
tools from the right place, while in the multi-thread implementation we
don't need this info.
On 7/14/13 5:57 PM, Andrew Trick wrote:
>
> On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com
> <mailto:shuxin.llvm at gmail.com>> wrote:
>
>> 6) Miscellaneous
>> ==========
>> Will partitioning degrade performance in theory. I think it depends on
>> the definition of performance. If performance means execution-time, I
>> guess it dose not. However, if performance includes code-size, I think
>> it may have some negative impact. Following is few scenario:
>>
>> - constants generated by the post-IPO passes are not shared across
>>   partitions
>> - dead func may be detected during the post-IPO stage, and they may
>>   not be deleted.
>
> I don't know if it's feasible, but stable linker output, independent
> of the partitioning, is highly desirable.

In theory, it is not possible. But I guess in practice it is almost stable.

> One of the most irritating performance regressions to track down
> involves different versions of the host linker. If partitioning
> decisions are thrown into the mix, this could be annoying. Is it
> possible for the final link to do a better job cleaning up?

If the partitioning's tuning parameters remain unchanged, the partitioning
should remain unchanged as well.
Thanks for the proposal. This is important work which is one step towards making LTO more applicable for large applications. Some comments inline. On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:> > > 3.1.1 Figure out Partition scheme > ---------------------------------- > we randomly pick up some function and put them in a partition. > It would be nice to perform some optimization at this moment. One opt > in my mind is to reorder functions in order to reduce working-set and > improve locality. > > Unfortunately, this opt seems to be bit blind at this time, because > - CallGraph is not annotated with estimated or profiled frequency. > - some linkers don't respect the order. It seems they just > remembers the function order of the pristine input obj/fake-obj, > and enforce this order at final link (link-exec/shared-lib) stage. > > Anyway, I try to ignore all these problems, and try to perform partition > via following steps. Maybe we have some luck on some platforms: > > o. DFS the call-graph, ignoring the self-resursive edges, if freq is > available, prioritizing the edges (i.e. corresponding to call-sites) > such that frequent edges are visited first. > > o. Cut the DFS spanning tree obtained from the previous step bottom-up, > Each cut/partition contains reasonable # of functions, and the aggregate > size of the functions of the partition should not exceeds predefined > threshold.I'd like to see more details about this step. How do you determine the "reasonable # of functions"? How do you define the threshold? It has to be the same for a given target / platform regardless of the configuration of the build machine, right?> > o. repeat the previous step until the Call-graph's DFS spanning tree > is empty. > > 3.1.2 Partition transformation > ------------------------------ > > This is bit involved. There are bunch of problems we have to tackle. > 1) When the use/def of a symbol are separated in different modules, > its attribute, like linkage, visibility, need to be changed > as well. > > [Example 1], if a symbol is flagged as "internal" to the module where > the it is defined, the linkage need to be changed into "internal" > to the executable/lib being compiled. > > [Example 2], For compile-time constants, their initialized value > needs to to cloned to the partitions where it is referenced, > The rationale is to make the post-ipo passes to take advantage > of the initlized value to squeeeze some performance. > > In order to not bloat the code size, the cloned constant should > mark "don't emit". [end of eg2] > > Being able to precisely update symbols' attribute is not only > vital to correctness, it has significant impact to the the > performance as well. > > I have not yet taken a thorough investigation of this issue. My > rudimentary implementation is simply flag symbol "external" when its > use/def are separated in different module. I believe this is one > of the most difficult part of this work. I guess it is going to > take long time to become stable. > > 2) In order to compile each partition in each separate thread (see > Section 3.2), we have to put partitions in distinct LLVMContext. > > I could be wrong, but I don't find the code which is able to > perform function cloning across LLVMContext. > > My workaround in the patch is to perform function cloning in > one LLVMContext (but in different Module, of course), then > save the module to disk file, and load it to memory using a > new LLVMContext. 
> > It is bit circuitous and expensive.Do you plan to fix this? What are the issues that prevented function cloning across multiple LLVMContexts? Evan> > One random observation: > Currently, function-scoped static variables are considered > as "global variables". When cloning a function with static variable, > compiler has no idea if the static variables are used only by > the function being cloned, and hence separate the function > and the variables. > > I guess it would be nice if we organized symbols by its scope > instead of its live-time. it would be convenient for this situation. > > 3.2 Compile partitions independently > -------------------------------------- > > There are two camps: one camp advocate compiling partitions via multi-process, > the other one favor multi-thread. > > Inside Apple compiler teams, I'm the only one belong to the 1st comp. I think > while multi-proc sounds bit red-neck, it has its advantage for this purpose, and > while multi-thread is certainly more eye-popping, it has its advantage > as well. > > The advantage of multi-proc are: > 1) easier to implement, the process run in its own address space. > We don't need to worry about they can interfere with each other. > > 2)huge, or not unlimited, address space. > > The disadvantage is that it's expensive. But I guess the cost is > almost negligible compared to the overall IPO compilation. > > The advantage of multi-threads I can imagine are: > 1) sound fancy > 2) it is light-weight > 3) inter-thread communication is easier than IPC. > > Its disadvantage are: > 1). Oftentime we will come across race-condition, and it took > awful long time to figure it out. While the code is supposed > to be mult-thread safe, we might miss some tricky case. > Trouble-shooting race condition is a nightmare. > > 2) Small address space. This is big problem if we the compiler > is built 32-bit . In that case, the compiler is not able to bring > lots of stuff in memory even if the HW dose > provide ample mem. > > 3) The thread-safe run-time lib is more expensive. > I once linked a compiler using -lpthread (I dose not have to) on a > UNIX platform, and saw the compiler slow down by about 1/3. > > I'm not able to convince the folks in other camp, neither are they > able to convince me. I decide to implement both. Fortunately, this > part is not difficult, it seems to be rather easy to crank out one within short > period of time. It would be interesting to compare them side-by-side, > and see which camp lose:-). On the other hand, if we run into race-condition > problem, we choose multi-proc version as a fall-back. > > Regardless which tech is going to use to compile partition > independently, in order to judiciously and adaptively choose appropriate > parallel-factor, the compiler certainly need a lib which is able to > figure out the load the entire system is in. I don't know if there are > such magic lib or not. > > 4. the tale of two kinds of linker > ---------------------------------- > > As far as I can tell, llvm suports two kind linker for its IPO compilation, > and the supports are embodied by two set of APIs/interfaces. > > o. Interface 1, those stuff named lto_xxxx(). > o. GNU gold interface, > The compiler interact with GNU gold via the adapter implemented > in tools/gold/gold-plugin.cpp. > > This adpater calls the interface-1 to control the IPO process. > It dose not have to call the interface APIs, I think it is definitely > ok it call internal functions. 
+1

On Jul 14, 2013, at 5:56 PM, Andrew Trick <atrick at apple.com> wrote:

> On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>> [Section 3.2, "Compile partitions independently", quoted in full -- snipped]
>
> While I am a self-proclaimed multi-process red-neck, in this case I would prefer to see a multi-threaded implementation because I want to verify that LLVMContext can be used as advertised. I'm sure some extra care will be needed to report failures/diagnostics, but we should start with the assumption that this approach is not significantly harder than multi-process because that's how we advertise the design.
>
> If any of the multi-threaded disadvantages you point out are real, I would like to find out about it.
>
> 1. Race Conditions: We should be able to verify that the thread-parallel vs. sequential or multi-process compilation generate the same result. If they diverge, we would like to know about the bug so it can be fixed--independent of LTO.
>
> 2. Small Address Space with LTO. We don't need to design around this hypothetical case.
>
> 3. Expensive thread-safe runtime lib. We should not speculate that platforms that we, as the LLVM community, care about have this problem. Let's assume that our platforms are well implemented unless we have data to the contrary. (Personally, I would even love to use TLS in the compiler to vastly simplify API design in the backend, but I am not going to be popular for saying so).
>
> We should be able to decompose each step of compilation for debugging.
> So the multi-process "implementation" should just be a degenerate form of threading with a bit of driver magic if you want to automate it.
>
> -Andy
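(Illustrating the "driver magic" above: the degenerate multi-process form can be a thin wrapper that launches one backend invocation per partition and then relinks. This is a hypothetical sketch only; the partition names, tool paths, and flags are examples echoing the Makefile in section 5, not part of any patch.)

    // Spawn one backend process per partition; each job gets its own
    // address space for free. Error handling omitted for brevity.
    #include <cstdlib>
    #include <string>
    #include <thread>
    #include <vector>

    int main() {
      const char *Parts[] = { "part1", "part2", "part3" };
      std::vector<std::thread> Pool;
      for (std::size_t i = 0; i < 3; ++i) {
        const char *P = Parts[i];
        Pool.push_back(std::thread([P] {
          // system() runs a separate process -- the "multi-proc camp".
          std::string Cmd = std::string("llc -filetype=obj ") + P +
                            ".bc -o " + P + ".o";
          std::system(Cmd.c_str());
        }));
      }
      for (std::size_t i = 0; i < Pool.size(); ++i)
        Pool[i].join();
      // Feed the partition objects back as a single relocatable object.
      return std::system("ld -r part1.o part2.o part3.o -o merged.o");
    }

The threads here only supervise child processes, so the driver stays trivial while the real work remains process-isolated.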
On 7/16/13 5:23 AM, Evan Cheng wrote:
> Thanks for the proposal. This is important work which is one step towards making LTO more applicable for large applications. Some comments inline.
>
> On Jul 12, 2013, at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>> [Section 3.1.1, "Figure out Partition scheme", quoted in full -- snipped]
>
> I'd like to see more details about this step. How do you determine the "reasonable # of functions"? How do you define the threshold? It has to be the same for a given target / platform regardless of the configuration of the build machine, right?

Say, each partition should contain:
- no more than 100 functions, and
- functions whose aggregate size does not exceed the pre-defined threshold.
These thresholds can be overridden from the command line, and do not depend on the build machine. (A rough sketch of the bottom-up cut follows below.)
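(For illustration only: a minimal sketch of such a bottom-up cut over the DFS spanning tree, under the two caps just mentioned. The Node/Partition types, the pre-computed size field, and the concrete numbers are invented here; they are not part of the patch.)

    // Walk the spanning tree children-first and close the current
    // partition whenever either cap is reached.
    #include <cstddef>
    #include <vector>

    namespace llvm { class Function; }

    struct Node {                       // one node of the DFS spanning tree
      llvm::Function *F;
      unsigned long long Size;          // pre-computed size of F's IR
      std::vector<Node*> Kids;
    };

    struct Partition {
      std::vector<llvm::Function*> Funcs;
      unsigned long long Bytes;
      Partition() : Bytes(0) {}
    };

    static const std::size_t MaxFuncs = 100;             // example cap
    static const unsigned long long MaxBytes = 1 << 20;  // example cap

    void cutBottomUp(Node *N, Partition &Cur, std::vector<Partition> &Out) {
      for (std::size_t i = 0; i < N->Kids.size(); ++i)
        cutBottomUp(N->Kids[i], Cur, Out);               // children first
      Cur.Funcs.push_back(N->F);
      Cur.Bytes += N->Size;
      if (Cur.Funcs.size() >= MaxFuncs || Cur.Bytes >= MaxBytes) {
        Out.push_back(Cur);                              // close partition
        Cur = Partition();
      }
    }

Because the caps are fixed constants (or flags), the resulting partitioning is deterministic for a given target regardless of the build machine, which is what Evan's reproducibility question requires.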
>> [Section 3.1.2, symbol-attribute discussion, quoted -- snipped]
>>
>> 2) In order to compile each partition in each separate thread (see
>> Section 3.2), we have to put partitions in distinct LLVMContext.
>>
>> I could be wrong, but I don't find the code which is able to
>> perform function cloning across LLVMContext.
>>
>> My workaround in the patch is to perform function cloning in
>> one LLVMContext (but in different Module, of course), then
>> save the module to disk file, and load it to memory using a
>> new LLVMContext.
>>
>> It is bit circuitous and expensive.
>
> Do you plan to fix this? What are the issues that prevented function cloning across multiple LLVMContexts?
>
> Evan

We may fix it; I don't know for sure if it is a big gain at this moment. The issue is that, as far as I can tell, the current code base does not have functions that support copying IR across different LLVMContexts. For example, when the cloning utility copies an instruction from src to dest, it inspects the "src", takes a look at its Type, derives the LLVMContext from that Type, and uses the same context for the dest. So we would need to change the existing code.
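(For concreteness, a minimal sketch of the round-trip workaround described above, written against roughly 2013-era LLVM APIs; exact headers and signatures may differ from the tree you have.)

    // Clone the partition into a new Module in the SAME context, then
    // serialize it to bitcode and re-parse it under a context that is
    // private to the worker thread.
    #include "llvm/Bitcode/ReaderWriter.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IRReader/IRReader.h"
    #include "llvm/Support/SourceMgr.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    Module *reloadInFreshContext(Module *Part, const char *Path,
                                 LLVMContext &WorkerCtx) {
      std::string Err;
      raw_fd_ostream OS(Path, Err);      // e.g. "llvmipo/part1.bc"
      if (!Err.empty())
        return 0;
      WriteBitcodeToFile(Part, OS);      // expensive: hits the disk
      OS.close();
      SMDiagnostic Diag;
      // The reloaded module now lives entirely in WorkerCtx.
      return ParseIRFile(Path, Diag, WorkerCtx);
    }

The cost Shuxin mentions is visible here: one serialize plus one parse per partition, purely to move IR between contexts.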
A third approach is to decouple the backend compilation and parallelism strategy from the partitioning. The partitioning can spit out partition BC files and some action records in some standard format. All of this can be fed into a driver tool that converts the compilation action file into the make/build file of the underlying build system of your choice:
1) it can simply be a compiler driver that does thread-level parallelism;
2) or a tool that generates Makefiles which are fed into parallel make to exploit single-node parallelism (see the sketch below);
3) or a tool that generates BUILD files that feed into a distributed build system (such as Google's blaze: http://google-engtools.blogspot.com/2011/08/build-in-cloud-how-build-system-works.html).

Another benefit is that it will make compiler debugging easier.

thanks,

David

On Sun, Jul 14, 2013 at 5:56 PM, Andrew Trick <atrick at apple.com> wrote:
> [Andy's message quoted in full -- snipped; see his mail above]
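(A sketch of the decoupling David describes, option 2: the partitioner writes one "<input.bc> <output.o>" record per line, and a small tool turns that into a Makefile for parallel make. The action-record format and the file names here are invented for illustration, not an existing tool.)

    // Convert partition action records into a Makefile that can be run
    // with "make -j N" -- or post-processed into BUILD files instead.
    #include <fstream>
    #include <sstream>
    #include <string>

    int main(int argc, char **argv) {
      std::ifstream Actions(argc > 1 ? argv[1] : "llvmipo/actions");
      std::ofstream MK("llvmipo/Makefile.postipo");
      std::ostringstream Rules;
      std::string BC, Obj, All;
      while (Actions >> BC >> Obj) {     // one "<bc> <obj>" pair per record
        All += " " + Obj;
        Rules << Obj << " : " << BC << "\n"
              << "\tllc -filetype=obj " << BC << " -o " << Obj << "\n";
      }
      MK << "all :" << All << "\n\n" << Rules.str();
      return 0;
    }

Running "make -j8 -f llvmipo/Makefile.postipo" would then compile the partitions in parallel; only this small generator changes if you swap in a thread-pool driver or a distributed-build backend.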
On 12 July 2013 15:49, Shuxin Yang <shuxin.llvm at gmail.com> wrote:

> [background and motivation sections quoted in full -- snipped]
>
> 3. How to parallelize post-IPO stage
> ===================================
>
> From 5k' high, the concept is very simple, just to
> step 1).divide the merged IR into small pieces,
> step 2).and compile each of this pieces independendly.
> step 3) the objects of each piece are fed back to linker to are linked
> into an executable, or a dynamic lib.
>
> Section 3.1 through 3.3 describe these three steps respectively.

Yes, this is one approach.
I think others at Google have looked into this sort of partitioning with GCC and found that the one thing which really helps you pick the partitions is to profile the program and make the partitions based on the actual paths of functions seen by the profile. I don't think they saw much improvement without that. See http://research.google.com/pubs/pub36355.html .

I do want to mention some other things we can do to parallelize. You may already know of these and have considered them and rejected them before deciding on the design you emailed out here, but I want to list them since there's already another thread with a different approach on the mailing list.

* Use what we have: LTO one partition at a time (directories, for instance), on the premise that LTO'ing them will produce something smaller than the sum of its inputs. Then when you do the final whole-program step, it will be receiving smaller inputs than it otherwise would. The change to LLVM here is to fix the inliner (or other optimizations?) to reduce the chance that LTO produces an output larger than its input.

* Parallelize the per-function code generator within a single LLVMContext. CodeGen presently operates on a per-function basis, and is structured as an analysis over llvm IR. There shouldn't be any global state, and there shouldn't be any need for locking accesses to the IR since nothing will be mutating it. This will even speed up clang -O0, and is a solid first step that gets a thread-creation API into LLVM (a rough sketch follows at the end of this mail).
  - What happened to "one LLVMContext per thread"? Okay, that rule is a little white lie. Always was. LLVMContext allows two libraries to use llvm under the hood without interfering with each other (even to the point of separate maps of types to keep one library from causing slow type lookups in the other). LLVM also doesn't have locking around accesses to the IR, and makes few guarantees about how many things a single mutating operation will need to look at or change, but with those caveats in mind it is possible to share a context across threads. Because CodeGen is structured as an analysis over the IR without mutating the IR, it should work. There's probably still global state in the code generator somewhere, but it's not a structural problem.

* Parallelize the function passes, and SCCs that are siblings in the call tree (again within a single LLVMContext). The gnarly part of this is that globals have shared use-lists which are updated as we modify each function individually. Those globals either need to have locks on their use-lists, replaced with a lockless list, or removed entirely. Probably both, as GlobalVariables have use-lists we actually use in the optimizers, but we don't actually need the use-list for "i32 0".

* Parallelize ModulePasses by splitting them into an analysis phase and an optimization phase. Make each per-TU build emit the .bc as usual plus an analysis-file (for instance, call graph, or "which functions read/modify which globals"). Merge all the analysis-files and farm them back out to be used as input to the programs optimizing each .bc individually -- but now they have total knowledge of the whole-program call graph and other AA information, etc.
  - You can combine this with an RPC layer to give each worker the ability to ask for the definition of a function from another worker. LLVM already supports "unmaterialized" functions where the definition is loaded lazily.
The analysis part should arrange to give us enough information to determine whether we want to do the inlining, then if we decide to materialize the function we get its body from another worker.

* Parallelize by splitting into different LLVMContexts. This spares us the difficulty of adding locks or otherwise changing the internals of LLVM, gets us the ability to spread the load across multiple machines, and if combined with the RPC idea above you can also get good inlining without necessarily loading the whole program into memory on a single machine.

I'm not planning to do the work any time soon so count my vote with that in mind, but if you ask me I think the first step should be to parallelize the backend within a single LLVMContext first, then to parallelize the function passes and CGSCC passes (across siblings only, of course) second. Removing the use-list from simple constants is a very interesting thing to do to decrease lock contention, but we may want to do something smarter than just remove it -- consider emitting a large constant that is only used by an inline function. It is possible to emit a table of constants in the same COMDAT group as the function, then if the inline function is discarded by the linker the constants are discarded with it. I don't have a concrete proposal for that.

Nick

> [remainder of the proposal quoted below the signature -- snipped]
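(A rough sketch of the per-function code generation idea from the second bullet above: several worker threads walking one Module in one shared LLVMContext. It assumes CodeGen really only reads the IR, which, as Nick says, still has to be verified; PerFunctionCodeGen is a stand-in name, not a real API.)

    #include "llvm/IR/Module.h"   // 2013-era header layout assumed
    #include <atomic>
    #include <thread>
    #include <vector>

    // Stand-in for running ISel/RA/emission on a single function.
    void PerFunctionCodeGen(llvm::Function &F) { /* stub */ }

    void parallelCodeGen(llvm::Module &M, unsigned NThreads) {
      std::vector<llvm::Function*> Work;
      for (llvm::Module::iterator F = M.begin(), E = M.end(); F != E; ++F)
        if (!F->isDeclaration())
          Work.push_back(&*F);

      std::atomic<std::size_t> Next(0);
      std::vector<std::thread> Pool;
      for (unsigned i = 0; i < NThreads; ++i)
        Pool.push_back(std::thread([&] {
          // Under the stated assumption the threads never mutate the
          // IR, only read it, so no locking is needed on the context.
          for (std::size_t j = Next++; j < Work.size(); j = Next++)
            PerFunctionCodeGen(*Work[j]);
        }));
      for (std::size_t i = 0; i < Pool.size(); ++i)
        Pool[i].join();
    }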
On 7/16/13 3:32 PM, Nick Lewycky wrote:
> [Shuxin's proposal and Nick's comments quoted in full -- snipped; see Nick's mail above]
Thank you for sharing your enlightening thoughts. Some of the ideas I had heard before, and some are quite new to me. I will take them into account as the project moves on.

The motivation of this project is not merely to make compilation faster. It is also to:
- significantly ease trouble-shooting -- I have been asked to fix LTO bugs several times, and it almost drove me mad to pin-point a bug in a huge merged module. It is definitely an unglamorous and painstaking undertaking:-)
- take one step toward better scalability.

For now, I don't want to parallelize CodeGen only, as post-IPO scalar opt is compile-time hogging as well. On the other hand, HPC folks may invoke Loop-Nest-Opt/autopar/etc in the post-IPO stage as well, and those are intrinsically very slow breeds. So parallelizing the entire post-IPO stage will make them happy. Finer-grained parallelism could be promising; however, it is too error-prone:-)
On Jul 16, 2013, at 3:32 PM, Nick Lewycky <nlewycky at google.com> wrote:
> * Parallelize ModulePasses by splitting them into an analysis phase and an optimization phase. Make each per-TU build emit the .bc as usual plus an analysis-file (for instance, call graph, or "which functions read/modify which globals"). Merge all the analysis-files and farm them back out to be used as input to the programs optimizing each .bc individually -- but now they have total knowledge of the whole-program call graph and other AA information, etc.

Shuxin presented the same idea to me. It offers the best opportunity to scale while sidestepping concurrency issues. It avoids the problem of loading all functions during the IPO phase and then cloning them into separate LLVMContexts later. I don't see this as part of the first milestone, though.

-Andy
On Fri, Jul 12, 2013 at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
> 3. How to parallelize post-IPO stage
> ===================================
>
> From 5k' high, the concept is very simple, just to
> step 1).divide the merged IR into small pieces,
> step 2).and compile each of this pieces independendly.
> step 3) the objects of each piece are fed back to linker to are linked
> into an executable, or a dynamic lib.

You seem to be describing GCC's strategy for whole program optimization (http://gcc.gnu.org/wiki/LinkTimeOptimization).

What is a bit confusing in your description is that you seem to be wanting to do more work after IPO is *done*. If the optimizations are done, then there is no need to parallelize anything.

In GCC, we have two parallel stages: the generation of bytecode (which is naturally parallelized by your build system) and the final optimization phase, which is parallelized by partitioning the callgraph into disjoint subsets. These subsets are then parallelized by spawning the compiler on each of the partitions.

The only sequential part is the actual analysis (what we call WPA or whole program analysis). That is a single-threaded phase that makes all the optimization decisions, annotates the summaries on each callgraph node, partitions the callgraph and then spawns all the optimizers to work on each section of the callgraph.

Diego.
On 7/17/13 12:35 PM, Diego Novillo wrote:
> On Fri, Jul 12, 2013 at 3:49 PM, Shuxin Yang <shuxin.llvm at gmail.com> wrote:
>
>> 3. How to parallelize post-IPO stage
>> ===================================
>>
>> From 5k' high, the concept is very simple, just to
>> step 1).divide the merged IR into small pieces,
>> step 2).and compile each of this pieces independendly.
>> step 3) the objects of each piece are fed back to linker to are linked
>> into an executable, or a dynamic lib.
>
> You seem to be describing GCC's strategy for whole program
> optimization (http://gcc.gnu.org/wiki/LinkTimeOptimization).

Hi, Diego:

Thank you for the comment. Quite honestly, I'm not very familiar with gcc's LTO implementation, and now I'm not allowed to read any GPL v3 code.

What I'm describing here is somewhat similar to Open64's IPA implementation. I did port gcc before and of course read its documentation. Based on my limited knowledge of its internals, I think gcc's "whole-program" mode is almost identical to Open64's IPA in spirit, and I guess the gcc community likely borrowed some ideas from Open64.

On the other hand, as far as I understood the oldish gcc implementation before I joined Apple, I think the "whole-program mode" in gcc is a misnomer. I don't want to call this work "whole-program mode"; I'd rather call it "big-program mode" or something like that, as I don't care whether the binary being built sees the whole program or not, while the analyses/opts do care about the difference.

> What is a bit confusing in your description is that you seem to be
> wanting to do more work after IPO is *done*.

IPO is difficult to parallelize. What I am trying to do is parallelize the post-IPO compilation stage, including, optionally, LNO, as well as scalar opt and the compile-time-hogging CodeGen.

> If the optimizations are
> done, then there is no need to parallelize anything.
>
> In GCC, we have two parallel stages: the generation of bytecode (which
> is naturally parallelized by your build system) and the final
> optimization phase, which is parallelized by partitioning the
> callgraph into disjoint subsets. These subsets are then parallelized
> by spawning the compiler on each of the partitions.

I call the 1st "parallel stage" pre-IPO, and the 2nd "parallel stage" post-IPO:-), and the IPO (which you call WPA) is sandwiched in the middle.

> The only sequential part is the actual analysis (what we call WPA or
> whole program analysis).

While not included in the proposal, I was envisioning another, earlier level of partitioning in order to build outrageously huge programs, which would parallelize the "WPA" as well. The benefit is to parallelize the IPO/"WPA"; the cost is not seeing the "whole program". One of my coworkers told me we care more about the benefit reaped from seeing the "whole program" than about the improved compile time that comes at the cost of compiling a bunch of "incomplete programs" independently.

> That is a single threaded phase that makes
> all the optimization decisions, annotates the summaries on each
> callgraph node, partitions the callgraph and then spawns all the
> optimizers to work on each section of the callgraph.

This is what I called "post-IPO" :-)