Michael Spencer via llvm-dev
2017-Jun-15 16:51 UTC
[llvm-dev] [RFC] Profile guided section layout
I've recently implemented profile guided section layout in llvm + lld using the Call-Chain Clustering (C³) heuristic from https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf . In the programs I've tested it on I've gotten from 0% to 5% performance improvement over standard PGO, with zero cases of slowdowns, and up to a 15% reduction in ITLB misses.

There are three parts to this implementation.

The first is a new llvm pass which uses branch frequency info to get counts for each call instruction and then adds a module flags metadata table of function -> function edges along with their counts.

The second takes the module flags metadata and writes it into a .note.llvm.callgraph section in the object file. This currently just dumps it as text, but could save space by reusing the string table.

The last part is in lld. It reads the .note.llvm.callgraph data from each object file and merges them into a single table. It then builds a call graph based on the profile data, then iteratively merges the hottest call edges using the C³ heuristic as long as doing so would not create a cluster larger than the page size. All clusters are then sorted by a density metric to further improve locality.

There are a couple of issues with the current implementation that shouldn't be too hard to fix:

The .note.llvm.callgraph sections add about 10% to the size of each object file. This can be fixed by reusing the symbol string table instead of duplicating all the strings.

The ordering algorithm is O(n²) in the number of call graph edges. This is because it uses a trivial algorithm and can be fixed by doing something more intelligent.

It doesn't currently work for LTO, as the llvm pass needs to be run after all inlining decisions have been made, and LTO codegen has to be done with -ffunction-sections.

It's only implemented for ELF.
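As a rough illustration of the clustering step described above, here is a minimal, self-contained sketch of a C³-style merge loop. All names (`Cluster`, `c3Layout`, `PageSize`) and the union-find bookkeeping are this sketch's assumptions, not lld's: the actual patch operates on input sections and a mutable edge set, as shown in the attached diff.

```cpp
// Toy sketch of Call-Chain Clustering (C3): repeatedly merge the endpoints
// of the hottest call edge unless the merged cluster would exceed the page
// size, then order the surviving clusters by weight/size density.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cluster {
  std::vector<int> Sections; // section ids, in layout order
  uint64_t Size = 0;         // total byte size of the cluster
  uint64_t Weight = 0;       // sum of incoming call counts
};

struct Edge { int From, To; uint64_t Weight; };

constexpr uint64_t PageSize = 4096;

std::vector<int> c3Layout(std::vector<Cluster> Clusters,
                          std::vector<Edge> Edges) {
  // Process edges hottest-first.
  std::sort(Edges.begin(), Edges.end(),
            [](const Edge &A, const Edge &B) { return A.Weight > B.Weight; });

  // Union-find: Rep[I] points at the cluster I was merged into.
  std::vector<int> Rep(Clusters.size());
  for (std::size_t I = 0; I < Rep.size(); ++I) Rep[I] = I;
  auto Find = [&](int X) {
    while (Rep[X] != X) X = Rep[X];
    return X;
  };

  for (const Edge &E : Edges) {
    int A = Find(E.From), B = Find(E.To);
    if (A == B || Clusters[A].Size + Clusters[B].Size > PageSize)
      continue;
    // Append the callee's sections after the caller's.
    Clusters[A].Sections.insert(Clusters[A].Sections.end(),
                                Clusters[B].Sections.begin(),
                                Clusters[B].Sections.end());
    Clusters[A].Size += Clusters[B].Size;
    Clusters[A].Weight += Clusters[B].Weight;
    Rep[B] = A;
  }

  // Densest (hottest-per-byte) clusters first.
  std::vector<int> Roots;
  for (std::size_t I = 0; I < Clusters.size(); ++I)
    if (Find(I) == (int)I)
      Roots.push_back(I);
  std::sort(Roots.begin(), Roots.end(), [&](int A, int B) {
    return (double)Clusters[A].Weight / Clusters[A].Size >
           (double)Clusters[B].Weight / Clusters[B].Size;
  });

  std::vector<int> Order;
  for (int R : Roots)
    for (int S : Clusters[R].Sections)
      Order.push_back(S);
  return Order;
}
```

Note that, like the patch, this caps cluster growth at the page size rather than merging unconditionally, which is what bounds the ITLB working set per page.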
I've attached the llvm and lld patches and would appreciate feedback on the overall design, the method of threading the profile data to the linker, and an improved ordering algorithm. No need for a detailed code review as I'll be making changes, adding tests, and posting proper patches for review.

- Michael Spencer

diff --git a/ELF/Config.h b/ELF/Config.h
index 9c73b4c..01267a9 100644
--- a/ELF/Config.h
+++ b/ELF/Config.h
@@ -10,6 +10,7 @@
 #ifndef LLD_ELF_CONFIG_H
 #define LLD_ELF_CONFIG_H

+#include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/StringRef.h"
 #include "llvm/ADT/StringSet.h"
@@ -105,6 +106,8 @@ struct Configuration {
   std::vector<SymbolVersion> VersionScriptLocals;
   std::vector<uint8_t> BuildIdVector;
   llvm::MapVector<Symbol *, RenamedSymbol> RenamedSymbols;
+  llvm::DenseMap<std::pair<llvm::StringRef, llvm::StringRef>, uint64_t>
+      CFGProfile;
   bool AllowMultipleDefinition;
   bool AsNeeded = false;
   bool Bsymbolic;
@@ -124,6 +127,7 @@ struct Configuration {
   bool GnuHash;
   bool ICF;
   bool MipsN32Abi = false;
+  bool NoCFGProfileReorder;
   bool NoGnuUnique;
   bool NoUndefinedVersion;
   bool Nostdlib;
diff --git a/ELF/Driver.cpp b/ELF/Driver.cpp
index 1534c7e..06dd045 100644
--- a/ELF/Driver.cpp
+++ b/ELF/Driver.cpp
@@ -649,6 +649,7 @@ void LinkerDriver::readConfigs(opt::InputArgList &Args) {
   Config->LTOO = getInteger(Args, OPT_lto_O, 2);
   Config->LTOPartitions = getInteger(Args, OPT_lto_partitions, 1);
   Config->MapFile = getString(Args, OPT_Map);
+  Config->NoCFGProfileReorder = Args.hasArg(OPT_no_cfg_profile_reorder);
   Config->NoGnuUnique = Args.hasArg(OPT_no_gnu_unique);
   Config->NoUndefinedVersion = Args.hasArg(OPT_no_undefined_version);
   Config->Nostdlib = Args.hasArg(OPT_nostdlib);
diff --git a/ELF/InputFiles.cpp b/ELF/InputFiles.cpp
index 902593b..d64c1da 100644
--- a/ELF/InputFiles.cpp
+++ b/ELF/InputFiles.cpp
@@ -527,6 +527,35 @@ elf::ObjectFile<ELFT>::createInputSection(const Elf_Shdr &Sec) {
   if (Name == ".eh_frame" && !Config->Relocatable)
     return make<EhInputSection>(this, &Sec, Name);

+  // Profile data.
+  if (Name == ".note.llvm.callgraph") {
+    ArrayRef<uint8_t> CallgraphBuff =
+        check(this->getObj().getSectionContents(&Sec));
+
+    StringRef Buff((const char *)CallgraphBuff.data(), CallgraphBuff.size());
+
+    auto ReadString = [&Buff]() {
+      size_t F = Buff.find_first_of(" \n");
+      StringRef Ret = Buff.substr(0, F);
+      Buff = Buff.substr(F + 1);
+      return Ret;
+    };
+
+    while (!Buff.empty()) {
+      StringRef From = ReadString();
+      StringRef To = ReadString();
+      uint64_t Count;
+      if (ReadString().getAsInteger(10, Count))
+        break;
+
+      // Merge duplicate counts by picking the largest.
+      uint64_t &C = Config->CFGProfile[std::make_pair(From, To)];
+      C = std::max(C, Count);
+    }
+
+    return &InputSection::Discarded;
+  }
+
   if (shouldMerge(Sec))
     return make<MergeInputSection>(this, &Sec, Name);
   return make<InputSection>(this, &Sec, Name);
diff --git a/ELF/Options.td b/ELF/Options.td
index 335c7ad..cea9ee1 100644
--- a/ELF/Options.td
+++ b/ELF/Options.td
@@ -144,6 +144,9 @@ def nostdlib: F<"nostdlib">,
 def no_as_needed: F<"no-as-needed">,
   HelpText<"Always DT_NEEDED for shared libraries">;

+def no_cfg_profile_reorder: F<"no-cfg-profile-reorder">,
+  HelpText<"Disable reordering of sections based on profile information">;
+
 def no_color_diagnostics: F<"no-color-diagnostics">,
   HelpText<"Do not use colors in diagnostics">;
diff --git a/ELF/Writer.cpp b/ELF/Writer.cpp
index d41baec..d9bd757 100644
--- a/ELF/Writer.cpp
+++ b/ELF/Writer.cpp
@@ -20,11 +20,13 @@
 #include "SyntheticSections.h"
 #include "Target.h"
 #include "Threads.h"
+#include "llvm/ADT/Hashing.h"
 #include "llvm/ADT/StringMap.h"
 #include "llvm/ADT/StringSwitch.h"
 #include "llvm/Support/FileOutputBuffer.h"
 #include "llvm/Support/raw_ostream.h"
 #include <climits>
+#include <unordered_set>

 using namespace llvm;
 using namespace llvm::ELF;
@@ -922,6 +924,157 @@ static void sortBySymbolsOrder(ArrayRef<OutputSection *> OutputSections) {
     Sec->sort([&](InputSectionBase *S) { return SectionOrder.lookup(S); });
 }

+// Sort sections by the profile data provided in the .note.llvm.callgraph
+// sections.
+//
+// This algorithm is based on Call-Chain Clustering from:
+// Optimizing Function Placement for Large-Scale Data-Center Applications
+// https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+//
+// This first builds a call graph based on the profile data then iteratively
+// merges the hottest call edges as long as it would not create a cluster
+// larger than the page size. All clusters are then sorted by a density metric
+// to further improve locality.
+template <class ELFT>
+static void sortByCFGProfile(ArrayRef<OutputSection *> OutputSections) {
+  if (Config->NoCFGProfileReorder)
+    return;
+
+  using NodeIndex = std::ptrdiff_t;
+
+  struct Node {
+    Node() {}
+    Node(const InputSectionBase *IS) {
+      Sections.push_back(IS);
+      Size = IS->getSize();
+    }
+    std::vector<const InputSectionBase *> Sections;
+    int64_t Size = 0;
+    uint64_t Weight = 0;
+  };
+
+  struct Edge {
+    NodeIndex From;
+    NodeIndex To;
+    mutable uint64_t Weight;
+    bool operator==(const Edge Other) const {
+      return From == Other.From && To == Other.To;
+    }
+  };
+
+  struct EdgeHash {
+    std::size_t operator()(const Edge E) const {
+      return llvm::hash_combine(E.From, E.To);
+    };
+  };
+
+  std::vector<Node> Nodes;
+  std::unordered_set<Edge, EdgeHash> Edges;
+
+  auto InsertOrIncrementEdge = [](std::unordered_set<Edge, EdgeHash> &Edges,
+                                  const Edge E) {
+    if (E.From == E.To)
+      return;
+    auto Res = Edges.insert(E);
+    if (!Res.second)
+      Res.first->Weight = SaturatingAdd(Res.first->Weight, E.Weight);
+  };
+
+  {
+    llvm::DenseMap<const InputSectionBase *, NodeIndex> SecToNode;
+
+    auto GetOrCreateNode =
+        [&Nodes, &SecToNode](const InputSectionBase *IS) -> NodeIndex {
+      auto Res = SecToNode.insert(std::make_pair(IS, Nodes.size()));
+      if (Res.second)
+        Nodes.emplace_back(IS);
+      return Res.first->second;
+    };
+
+    // Create the graph.
+    for (const auto &C : Config->CFGProfile) {
+      if (C.second == 0)
+        continue;
+      DefinedRegular *FromDR = dyn_cast_or_null<DefinedRegular>(
+          Symtab<ELFT>::X->find(C.first.first));
+      DefinedRegular *ToDR = dyn_cast_or_null<DefinedRegular>(
+          Symtab<ELFT>::X->find(C.first.second));
+      if (!FromDR || !ToDR)
+        continue;
+      auto FromSB = dyn_cast_or_null<const InputSectionBase>(FromDR->Section);
+      auto ToSB = dyn_cast_or_null<const InputSectionBase>(ToDR->Section);
+      if (!FromSB || !ToSB)
+        continue;
+      NodeIndex From = GetOrCreateNode(FromSB);
+      NodeIndex To = GetOrCreateNode(ToSB);
+      InsertOrIncrementEdge(Edges, {From, To, C.second});
+      Nodes[To].Weight = SaturatingAdd(Nodes[To].Weight, C.second);
+    }
+  }
+
+  // Collapse the graph.
+  while (!Edges.empty()) {
+    // Find the largest edge.
+    // FIXME: non deterministic order for equal edges.
+    // FIXME: n^2
+    auto Max = std::max_element(
+        Edges.begin(), Edges.end(),
+        [](const Edge A, const Edge B) { return A.Weight < B.Weight; });
+    const Edge MaxE = *Max;
+    Edges.erase(Max);
+    // Merge the Nodes.
+    Node &From = Nodes[MaxE.From];
+    Node &To = Nodes[MaxE.To];
+    if (From.Size + To.Size > Target->PageSize)
+      continue;
+    From.Sections.insert(From.Sections.end(), To.Sections.begin(),
+                         To.Sections.end());
+    To.Sections.clear();
+    From.Size += To.Size;
+    From.Weight = SaturatingAdd(From.Weight, To.Weight);
+    // Collect all edges from or to the removed node and update them for the
+    // new node.
+    std::vector<Edge> OldEdges;
+    // FIXME: n^2
+    for (auto EI = Edges.begin(), EE = Edges.end(); EI != EE;) {
+      if (EI->From == MaxE.To || EI->To == MaxE.To) {
+        OldEdges.push_back(*EI);
+        EI = Edges.erase(EI);
+      } else
+        ++EI;
+    }
+    for (const Edge E : OldEdges) {
+      InsertOrIncrementEdge(Edges,
                            {E.From == MaxE.To ? MaxE.From : E.From,
+                             E.To == MaxE.To ? MaxE.From : E.To, E.Weight});
+    }
+  }
+
+  // Sort by density.
+  std::sort(Nodes.begin(), Nodes.end(), [](const Node &A, const Node &B) {
+    return double(A.Weight) / double(A.Size) <
+           double(B.Weight) / double(B.Size);
+  });
+
+  // Generate order.
+  llvm::DenseMap<const InputSectionBase *, std::size_t> OrderMap;
+  ssize_t CurOrder = 0;
+
+  for (const Node &N : Nodes) {
+    if (N.Sections.empty())
+      continue;
+    for (const InputSectionBase *IS : N.Sections)
+      OrderMap[IS] = CurOrder++;
+  }
+
+  for (OutputSection *OS : OutputSections) {
+    if (OS->Name != ".text")
+      continue;
+    OS->sort([&](InputSectionBase *IS) { return OrderMap.lookup(IS); });
+    break;
+  }
+}
+
 template <class ELFT>
 void Writer<ELFT>::forEachRelSec(std::function<void(InputSectionBase &)> Fn) {
   for (InputSectionBase *IS : InputSections) {
@@ -949,6 +1102,7 @@ template <class ELFT> void Writer<ELFT>::createSections() {
     if (IS)
       Factory.addInputSec(IS, getOutputSectionName(IS->Name));

+  sortByCFGProfile<ELFT>(OutputSections);
   sortBySymbolsOrder<ELFT>(OutputSections);
   sortInitFini(findSection(".init_array"));
   sortInitFini(findSection(".fini_array"));
-------------- next part --------------
diff --git a/include/llvm/InitializePasses.h b/include/llvm/InitializePasses.h
index abb0aa3..93084d8 100644
--- a/include/llvm/InitializePasses.h
+++ b/include/llvm/InitializePasses.h
@@ -85,6 +85,7 @@ void initializeBreakCriticalEdgesPass(PassRegistry&);
 void initializeCFGOnlyPrinterLegacyPassPass(PassRegistry&);
 void initializeCFGOnlyViewerLegacyPassPass(PassRegistry&);
 void initializeCFGPrinterLegacyPassPass(PassRegistry&);
+void initializeCFGProfilePassPass(PassRegistry&);
 void initializeCFGSimplifyPassPass(PassRegistry&);
 void initializeCFGViewerLegacyPassPass(PassRegistry&);
 void initializeCFLAndersAAWrapperPassPass(PassRegistry&);
diff --git a/include/llvm/LinkAllPasses.h b/include/llvm/LinkAllPasses.h
index c309ddb..59b9307 100644
--- a/include/llvm/LinkAllPasses.h
+++ b/include/llvm/LinkAllPasses.h
@@ -75,6 +75,7 @@ namespace {
       (void) llvm::createCallGraphDOTPrinterPass();
       (void) llvm::createCallGraphViewerPass();
       (void) llvm::createCFGSimplificationPass();
+      (void) llvm::createCFGProfilePass();
       (void) llvm::createLateCFGSimplificationPass();
       (void) llvm::createCFLAndersAAWrapperPass();
       (void) llvm::createCFLSteensAAWrapperPass();
diff --git a/include/llvm/Transforms/Instrumentation.h b/include/llvm/Transforms/Instrumentation.h
index b6c6c09..ec8741b 100644
--- a/include/llvm/Transforms/Instrumentation.h
+++ b/include/llvm/Transforms/Instrumentation.h
@@ -199,6 +199,8 @@ inline ModulePass *createDataFlowSanitizerPassForJIT(
 // checking on loads, stores, and other memory intrinsics.
 FunctionPass *createBoundsCheckingPass();

+ModulePass *createCFGProfilePass();
+
 /// \brief Calculate what to divide by to scale counts.
 ///
 /// Given the maximum count, calculate a divisor that will scale all the
diff --git a/lib/CodeGen/TargetLoweringObjectFileImpl.cpp b/lib/CodeGen/TargetLoweringObjectFileImpl.cpp
index 6922e33..9f786b0 100644
--- a/lib/CodeGen/TargetLoweringObjectFileImpl.cpp
+++ b/lib/CodeGen/TargetLoweringObjectFileImpl.cpp
@@ -97,16 +97,48 @@ void TargetLoweringObjectFileELF::emitModuleMetadata(
   StringRef Section;
   GetObjCImageInfo(M, Version, Flags, Section);
-  if (Section.empty())
-    return;
+  if (!Section.empty()) {
+    auto &C = getContext();
+    auto *S = C.getELFSection(Section, ELF::SHT_PROGBITS, ELF::SHF_ALLOC);
+    Streamer.SwitchSection(S);
+    Streamer.EmitLabel(C.getOrCreateSymbol(StringRef("OBJC_IMAGE_INFO")));
+    Streamer.EmitIntValue(Version, 4);
+    Streamer.EmitIntValue(Flags, 4);
+    Streamer.AddBlankLine();
+  }

-  auto &C = getContext();
-  auto *S = C.getELFSection(Section, ELF::SHT_PROGBITS, ELF::SHF_ALLOC);
-  Streamer.SwitchSection(S);
-  Streamer.EmitLabel(C.getOrCreateSymbol(StringRef("OBJC_IMAGE_INFO")));
-  Streamer.EmitIntValue(Version, 4);
-  Streamer.EmitIntValue(Flags, 4);
-  Streamer.AddBlankLine();
+  SmallVector<Module::ModuleFlagEntry, 8> ModuleFlags;
+  M.getModuleFlagsMetadata(ModuleFlags);
+
+  MDNode *CFGProfile = nullptr;
+
+  for (const auto &MFE : ModuleFlags) {
+    StringRef Key = MFE.Key->getString();
+    if (Key == "CFG Profile") {
+      CFGProfile = cast<MDNode>(MFE.Val);
+      break;
+    }
+  }
+
+  if (!CFGProfile)
+    return;
+
+  MCSectionELF *Sec =
+      getContext().getELFSection(".note.llvm.callgraph", ELF::SHT_NOTE, 0);
+  Streamer.SwitchSection(Sec);
+  SmallString<256> Out;
+  for (const auto &Edge : CFGProfile->operands()) {
+    raw_svector_ostream O(Out);
+    MDNode *E = cast<MDNode>(Edge);
+    O << cast<MDString>(E->getOperand(0))->getString() << " "
+      << cast<MDString>(E->getOperand(1))->getString() << " "
+      << cast<ConstantAsMetadata>(E->getOperand(2))
+             ->getValue()
+             ->getUniqueInteger()
+             .getZExtValue()
+      << "\n";
+    Streamer.EmitBytes(O.str());
+    Out.clear();
+  }
 }

 MCSymbol *TargetLoweringObjectFileELF::getCFIPersonalitySymbol(
diff --git a/lib/Transforms/IPO/PassManagerBuilder.cpp b/lib/Transforms/IPO/PassManagerBuilder.cpp
index 4bc64ab..585962d 100644
--- a/lib/Transforms/IPO/PassManagerBuilder.cpp
+++ b/lib/Transforms/IPO/PassManagerBuilder.cpp
@@ -698,6 +698,8 @@ void PassManagerBuilder::populateModulePassManager(
     }
   }

+  MPM.add(createCFGProfilePass());
+
   if (MergeFunctions)
     MPM.add(createMergeFunctionsPass());
diff --git a/lib/Transforms/Instrumentation/CFGProfile.cpp b/lib/Transforms/Instrumentation/CFGProfile.cpp
new file mode 100644
index 0000000..6aa76d3
--- /dev/null
+++ b/lib/Transforms/Instrumentation/CFGProfile.cpp
@@ -0,0 +1,103 @@
+//===-- CFGProfile.cpp ----------------------------------------------------===//
+//
+//                     The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Analysis/BlockFrequencyInfo.h"
+#include "llvm/Analysis/BranchProbabilityInfo.h"
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/MDBuilder.h"
+#include "llvm/IR/PassManager.h"
+#include "llvm/Transforms/Instrumentation.h"
+
+#include <array>
+
+using namespace llvm;
+
+class CFGProfilePass : public ModulePass {
+public:
+  static char ID;
+
+  CFGProfilePass() : ModulePass(ID) {
+    initializeCFGProfilePassPass(*PassRegistry::getPassRegistry());
+  }
+
+  StringRef getPassName() const override { return "CFGProfilePass"; }
+
+private:
+  bool runOnModule(Module &M) override;
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<BlockFrequencyInfoWrapperPass>();
+    AU.addRequired<BranchProbabilityInfoWrapperPass>();
+  }
+};
+
+bool CFGProfilePass::runOnModule(Module &M) {
+  if (skipModule(M))
+    return false;
+
+  llvm::DenseMap<std::pair<StringRef, StringRef>, uint64_t> Counts;
+
+  for (auto &F : M) {
+    if (F.isDeclaration())
+      continue;
+    getAnalysis<BranchProbabilityInfoWrapperPass>(F).getBPI();
+    auto &BFI = getAnalysis<BlockFrequencyInfoWrapperPass>(F).getBFI();
+    for (auto &BB : F) {
+      Optional<uint64_t> BBCount = BFI.getBlockProfileCount(&BB);
+      if (!BBCount)
+        continue;
+      for (auto &I : BB) {
+        auto *CI = dyn_cast<CallInst>(&I);
+        if (!CI)
+          continue;
+        Function *CalledF = CI->getCalledFunction();
+        if (!CalledF || CalledF->isIntrinsic())
+          continue;
+
+        uint64_t &Count =
+            Counts[std::make_pair(F.getName(), CalledF->getName())];
+        Count = SaturatingAdd(Count, *BBCount);
+      }
+    }
+  }
+
+  if (Counts.empty())
+    return false;
+
+  LLVMContext &Context = M.getContext();
+  MDBuilder MDB(Context);
+  std::vector<Metadata *> Nodes;
+
+  for (auto E : Counts) {
+    SmallVector<Metadata *, 3> Vals;
+    Vals.push_back(MDB.createString(E.first.first));
+    Vals.push_back(MDB.createString(E.first.second));
+    Vals.push_back(MDB.createConstant(
+        ConstantInt::get(Type::getInt64Ty(Context), E.second)));
+    Nodes.push_back(MDNode::get(Context, Vals));
+  }
+
+  M.addModuleFlag(Module::Append, "CFG Profile", MDNode::get(Context, Nodes));
+
+  return true;
+}
+
+char CFGProfilePass::ID = 0;
+INITIALIZE_PASS_BEGIN(CFGProfilePass, "cfg-profile",
+                      "Generate profile information from the CFG.", false,
+                      false)
+INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(BranchProbabilityInfoWrapperPass)
+INITIALIZE_PASS_END(CFGProfilePass, "cfg-profile",
+                    "Generate profile information from the CFG.", false, false)
+
+ModulePass *llvm::createCFGProfilePass() { return new CFGProfilePass(); }
diff --git a/lib/Transforms/Instrumentation/CMakeLists.txt b/lib/Transforms/Instrumentation/CMakeLists.txt
index 7ff69b9..197575f 100644
--- a/lib/Transforms/Instrumentation/CMakeLists.txt
+++ b/lib/Transforms/Instrumentation/CMakeLists.txt
@@ -1,6 +1,7 @@
 add_llvm_library(LLVMInstrumentation
   AddressSanitizer.cpp
   BoundsChecking.cpp
+  CFGProfile.cpp
   DataFlowSanitizer.cpp
   GCOVProfiling.cpp
   MemorySanitizer.cpp
diff --git a/lib/Transforms/Instrumentation/Instrumentation.cpp b/lib/Transforms/Instrumentation/Instrumentation.cpp
index 7bb62d2..d147a52 100644
--- a/lib/Transforms/Instrumentation/Instrumentation.cpp
+++ b/lib/Transforms/Instrumentation/Instrumentation.cpp
@@ -60,6 +60,7 @@ void llvm::initializeInstrumentation(PassRegistry &Registry) {
   initializeAddressSanitizerModulePass(Registry);
   initializeBoundsCheckingPass(Registry);
   initializeGCOVProfilerLegacyPassPass(Registry);
+  initializeCFGProfilePassPass(Registry);
   initializePGOInstrumentationGenLegacyPassPass(Registry);
   initializePGOInstrumentationUseLegacyPassPass(Registry);
   initializePGOIndirectCallPromotionLegacyPassPass(Registry);
Tobias Edler von Koch via llvm-dev
2017-Jun-15 17:08 UTC
[llvm-dev] [RFC] Profile guided section layout
Hi Michael,

This is cool stuff, thanks for sharing!

On 06/15/2017 11:51 AM, Michael Spencer via llvm-dev wrote:
> The first is a new llvm pass which uses branch frequency info to get
> counts for each call instruction and then adds a module flags
> metadata table of function -> function edges along with their counts.
>
> The second takes the module flags metadata and writes it into a
> .note.llvm.callgraph section in the object file. This currently just
> dumps it as text, but could save space by reusing the string table.

Have you considered reading the profile in the linker and extracting that information directly from the profile? The profile should contain call sites and their sample counts, and you could match these up with relocations (calls) in the section?

> It doesn't currently work for LTO as the llvm pass needs to be run
> after all inlining decisions have been made and LTO codegen has to be
> done with -ffunction-sections.

So this is just an implementation issue, right? You can make LTO run with -ffunction-sections (by setting TargetOptions.FunctionSections=true) and insert your pass in the appropriate place in the pipeline.

Thanks,
Tobias

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
Michael Spencer via llvm-dev
2017-Jun-15 17:55 UTC
[llvm-dev] [RFC] Profile guided section layout
On Thu, Jun 15, 2017 at 10:08 AM, Tobias Edler von Koch <tobias at codeaurora.org> wrote:

> Hi Michael,
>
> This is cool stuff, thanks for sharing!
>
> On 06/15/2017 11:51 AM, Michael Spencer via llvm-dev wrote:
>
>> The first is a new llvm pass which uses branch frequency info to get
>> counts for each call instruction and then adds a module flags metadata
>> table of function -> function edges along with their counts.
>>
>> The second takes the module flags metadata and writes it into a
>> .note.llvm.callgraph section in the object file. This currently just
>> dumps it as text, but could save space by reusing the string table.
>
> Have you considered reading the profile in the linker and extracting that
> information directly from the profile? The profile should contain call
> sites and their sample counts and you could match these up with
> relocations (calls) in the section?

I did this using IR PGO instead of sample PGO, so the profile data can only be applied at the same place in the pipeline where it is generated. Even for sample-based PGO this would be complicated, as the linker would need to reconstruct machine basic blocks from sections to accurately match sample counts to relocations; there may be cold calls in hot functions.

It may be useful, however, for the linker to directly accept an externally generated call graph profile. The current approach can actually do this already by embedding the profile into an extra object file.

>> It doesn't currently work for LTO as the llvm pass needs to be run after
>> all inlining decisions have been made and LTO codegen has to be done with
>> -ffunction-sections.
>
> So this is just an implementation issue, right? You can make LTO run with
> -ffunction-sections (by setting TargetOptions.FunctionSections=true) and
> insert your pass in the appropriate place in the pipeline.

Yeah, just an implementation issue. I just need to build the pass pipeline differently for LTO and add a way to do -ffunction-sections in lld.

- Michael Spencer
Rafael Avila de Espindola via llvm-dev
2017-Jul-31 21:20 UTC
[llvm-dev] [RFC] Profile guided section layout
Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:

> I've recently implemented profile guided section layout in llvm + lld using
> the Call-Chain Clustering (C³) heuristic from
> https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
> . In the programs I've tested it on I've gotten from 0% to 5% performance
> improvement over standard PGO with zero cases of slowdowns and up to 15%
> reduction in ITLB misses.
>
> There are three parts to this implementation.
>
> The first is a new llvm pass which uses branch frequency info to get counts
> for each call instruction and then adds a module flags metadata table of
> function -> function edges along with their counts.
>
> The second takes the module flags metadata and writes it into a
> .note.llvm.callgraph section in the object file. This currently just dumps
> it as text, but could save space by reusing the string table.
>
> The last part is in lld. It reads the .note.llvm.callgraph data from each
> object file and merges them into a single table. It then builds a call
> graph based on the profile data then iteratively merges the hottest call
> edges using the C³ heuristic as long as it would not create a cluster
> larger than the page size. All clusters are then sorted by a density metric
> to further improve locality.

Since the branch frequency info is in an llvm-specific format, it makes sense for llvm to read it instead of expecting lld to do it again. Since .o files are how the compiler talks to the linker, it also makes sense for llvm to record the required information there.

In the same way, since the linker is the first place with global knowledge, it makes sense for it to be the one that implements a section ordering heuristic, instead of just being told the order by some other tool, which would complicate the build.

However, do we need to start with instrumentation? The original paper uses sampling with good results, and current Intel CPUs can record every branch in a program.

I would propose starting with just an lld patch that reads the call graph from a file. The format would be very similar to what you propose, just weight,caller,callee.

In another patch we can then look at instrumentation: why it is more convenient for some uses and what performance advantage it might have.

I have written a small tool that uses intel_bts and 'perf script' to construct the call graph. I am giving it a try with your lld patch and will hopefully post results today.

Cheers,
Rafael
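A reader for the weight,caller,callee text format Rafael proposes could look like the following sketch. The function name `readCallGraphProfile` and the policy of summing duplicate edges are this sketch's assumptions; the patch he mentions is not shown in this thread.

```cpp
// Parse lines of the form "weight,caller,callee" into an edge-weight map,
// summing the weights of duplicate (caller, callee) edges.
#include <cassert>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

using EdgeKey = std::pair<std::string, std::string>; // (caller, callee)

std::map<EdgeKey, uint64_t> readCallGraphProfile(std::istream &In) {
  std::map<EdgeKey, uint64_t> Edges;
  std::string Line;
  while (std::getline(In, Line)) {
    std::istringstream Fields(Line);
    std::string Weight, Caller, Callee;
    if (!std::getline(Fields, Weight, ',') ||
        !std::getline(Fields, Caller, ',') || !std::getline(Fields, Callee))
      continue; // skip malformed lines
    Edges[{Caller, Callee}] += std::stoull(Weight);
  }
  return Edges;
}
```

Note this differs from the attached lld patch in two small ways: the patch's .note.llvm.callgraph section is space-separated with the count last, and the patch merges duplicates by taking the maximum rather than the sum.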
Tobias Edler von Koch via llvm-dev
2017-Jul-31 21:43 UTC
[llvm-dev] [RFC] Profile guided section layout
Hi Rafael,

On 07/31/2017 04:20 PM, Rafael Avila de Espindola via llvm-dev wrote:
> However, do we need to start with instrumentation? The original paper
> uses sampling with good results and current Intel CPUs can record every
> branch in a program.
>
> I would propose starting with just an lld patch that reads the call
> graph from a file. The format would be very similar to what you propose,
> just weight,caller,callee.

The advantage of the proposed approach (a weighted call graph section) is that it's completely transparent: it works regardless of the particular profiling methodology (as long as there's !prof metadata when the pass runs). For this reason, it fits neatly into an *existing* PGO-based build flow. I only need to add one compiler flag to enable it. That's a big plus.

On the other hand, I could see how your idea (a call graph input file for the linker) would be useful in situations where I just want to do section layout but no PGO in the compiler... and of course for testing the linker's sorting algorithm.

So there are merits in both, but for my use cases Michael's original approach is the most practical.

Tobias

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
Sean Silva via llvm-dev
2017-Jul-31 23:18 UTC
[llvm-dev] [RFC] Profile guided section layout
On Jul 31, 2017 2:21 PM, "Rafael Avila de Espindola via llvm-dev" <llvm-dev at lists.llvm.org> wrote:

Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:
> I've recently implemented profile guided section layout in llvm + lld using
> the Call-Chain Clustering (C³) heuristic from
> https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
> . In the programs I've tested it on I've gotten from 0% to 5% performance
> improvement over standard PGO with zero cases of slowdowns and up to 15%
> reduction in ITLB misses.
>
> There are three parts to this implementation.
>
> The first is a new llvm pass which uses branch frequency info to get counts
> for each call instruction and then adds a module flags metadata table of
> function -> function edges along with their counts.
>
> The second takes the module flags metadata and writes it into a
> .note.llvm.callgraph section in the object file. This currently just dumps
> it as text, but could save space by reusing the string table.
>
> The last part is in lld. It reads the .note.llvm.callgraph data from each
> object file and merges them into a single table. It then builds a call
> graph based on the profile data then iteratively merges the hottest call
> edges using the C³ heuristic as long as it would not create a cluster
> larger than the page size. All clusters are then sorted by a density metric
> to further improve locality.

Since the branch frequency info is in a llvm specific format, it makes sense for llvm to read it instead of expecting lld to do it again. Since .o files is how the compiler talks to the linker, it also makes sense for llvm to record the required information there.

In the same way, since the linker is the first place with global knowledge, it makes sense for it to be the one that implements a section ordering heuristic instead of just being told by some other tool, which would complicate the build.

However, do we need to start with instrumentation? The original paper uses sampling with good results and current Intel CPUs can record every branch in a program.

Both sample and instrumentation are turned into the same profile metadata in the IR, which is what feeds BranchFrequencyInfo. So it should be naturally agnostic to which method is used.

-- Sean Silva

I would propose starting with just an lld patch that reads the call graph from a file. The format would be very similar to what you propose, just weight,caller,callee.

In another patch we can then look at instrumentation: why it is more convenient for some uses and what performance advantage it might have.

I have written a small tool that uses intel_bts and 'perf script' to construct the call graph. I am giving it a try with your lld patch and will hopefully post results today.

Cheers,
Rafael
Michael Spencer via llvm-dev
2017-Aug-01 03:32 UTC
[llvm-dev] [RFC] Profile guided section layout
On Mon, Jul 31, 2017 at 2:20 PM, Rafael Avila de Espindola < rafael.espindola at gmail.com> wrote:> Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes: > > > I've recently implemented profile guided section layout in llvm + lld > using > > the Call-Chain Clustering (C³) heuristic from > > https://research.fb.com/wp-content/uploads/2017/01/ > cgo2017-hfsort-final1.pdf > > . In the programs I've tested it on I've gotten from 0% to 5% performance > > improvement over standard PGO with zero cases of slowdowns and up to 15% > > reduction in ITLB misses. > > > > > > There are three parts to this implementation. > > > > The first is a new llvm pass which uses branch frequency info to get > counts > > for each call instruction and then adds a module flags metatdata table of > > function -> function edges along with their counts. > > > > The second takes the module flags metadata and writes it into a > > .note.llvm.callgraph section in the object file. This currently just > dumps > > it as text, but could save space by reusing the string table. > > > > The last part is in lld. It reads the .note.llvm.callgraph data from each > > object file and merges them into a single table. It then builds a call > > graph based on the profile data then iteratively merges the hottest call > > edges using the C³ heuristic as long as it would not create a cluster > > larger than the page size. All clusters are then sorted by a density > metric > > to further improve locality. > > Since the branch frequency info is in a llvm specific format, it makes > sense for llvm to read it instead of expecting lld to do it again. Since > .o files is how the compiler talks to the linker, it also makes sense > for llvm to record the required information there. 
> In the same way, since the linker is the first place with global
> knowledge, it makes sense for it to be the one that implements a section
> ordering heuristic instead of just being told by some other tool, which
> would complicate the build.
>
> However, do we need to start with instrumentation? The original paper
> uses sampling with good results and current Intel CPUs can record every
> branch in a program.
>
> I would propose starting with just an lld patch that reads the call
> graph from a file. The format would be very similar to what you propose,
> just weight,caller,callee.
>
> In another patch we can then look at instrumentation: why it is more
> convenient for some uses and what performance advantage it might have.
>
> I have written a small tool that uses intel_bts and 'perf script' to
> construct the call graph. I am giving it a try with your lld patch and
> will hopefully post results today.
>
> Cheers,
> Rafael

I'm fine with approaching this from lld first, as the input reading is the simplest part of this. I started from instrumentation-based profiling because I started with a target program that already had the infrastructure set up to do it.

- Michael Spencer
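The C³ merge loop described in the quoted RFC can be sketched as follows. This is an illustrative reduction, not the actual lld code: the page-size constant, the one-function-per-initial-cluster setup, and all names are assumptions, and real sizes and sample counts would come from the object files and the profile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t PageSize = 4096; // Illustrative cluster size cap.

struct Cluster {
  std::vector<int> Funcs;   // Function indices, in layout order.
  uint64_t Size = 0;        // Total byte size of the cluster.
  uint64_t Samples = 0;     // Total profile weight of the cluster.
  double density() const { return Size ? double(Samples) / Size : 0; }
};

struct Edge {
  int From, To;             // Caller and callee function indices.
  uint64_t Weight;          // Call count from the profile.
};

// Merge the hottest edges first, refusing merges that would grow a cluster
// past the page size, then order the surviving clusters by decreasing
// density, as the RFC describes.
std::vector<Cluster> layout(std::vector<Cluster> Clusters,
                            std::vector<Edge> Edges) {
  // Each function starts in its own cluster; Owner maps function -> cluster.
  std::vector<int> Owner(Clusters.size());
  for (size_t I = 0; I < Owner.size(); ++I)
    Owner[I] = static_cast<int>(I);

  std::sort(Edges.begin(), Edges.end(),
            [](const Edge &A, const Edge &B) { return A.Weight > B.Weight; });

  for (const Edge &E : Edges) {
    int A = Owner[E.From], B = Owner[E.To];
    if (A == B || Clusters[A].Size + Clusters[B].Size > PageSize)
      continue;
    // Append cluster B to cluster A.
    for (int F : Clusters[B].Funcs)
      Owner[F] = A;
    Clusters[A].Funcs.insert(Clusters[A].Funcs.end(),
                             Clusters[B].Funcs.begin(),
                             Clusters[B].Funcs.end());
    Clusters[A].Size += Clusters[B].Size;
    Clusters[A].Samples += Clusters[B].Samples;
    Clusters[B].Funcs.clear();
  }

  // Drop emptied clusters and sort the rest hottest-per-byte first.
  Clusters.erase(std::remove_if(Clusters.begin(), Clusters.end(),
                                [](const Cluster &C) { return C.Funcs.empty(); }),
                 Clusters.end());
  std::sort(Clusters.begin(), Clusters.end(),
            [](const Cluster &A, const Cluster &B) {
              return A.density() > B.density();
            });
  return Clusters;
}
```

The page-size cap is what keeps a hot cluster from spilling across an ITLB page; the final density sort packs the hottest code together at the front of the section.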
Justin Bogner via llvm-dev
2017-Aug-01 20:57 UTC
[llvm-dev] [RFC] Profile guided section layout
Rafael Avila de Espindola via llvm-dev <llvm-dev at lists.llvm.org> writes:

> Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>> I've recently implemented profile guided section layout in llvm + lld using
>> the Call-Chain Clustering (C³) heuristic from
>> https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf .
>> In the programs I've tested it on I've gotten from 0% to 5% performance
>> improvement over standard PGO with zero cases of slowdowns and up to 15%
>> reduction in ITLB misses.
>>
>> There are three parts to this implementation.
>>
>> The first is a new llvm pass which uses branch frequency info to get counts
>> for each call instruction and then adds a module flags metadata table of
>> function -> function edges along with their counts.
>>
>> The second takes the module flags metadata and writes it into a
>> .note.llvm.callgraph section in the object file. This currently just dumps
>> it as text, but could save space by reusing the string table.
>>
>> The last part is in lld. It reads the .note.llvm.callgraph data from each
>> object file and merges them into a single table. It then builds a call
>> graph based on the profile data, then iteratively merges the hottest call
>> edges using the C³ heuristic as long as it would not create a cluster
>> larger than the page size. All clusters are then sorted by a density metric
>> to further improve locality.
>
> Since the branch frequency info is in an llvm-specific format, it makes
> sense for llvm to read it instead of expecting lld to do it again. Since
> .o files are how the compiler talks to the linker, it also makes sense
> for llvm to record the required information there.
>
> In the same way, since the linker is the first place with global
> knowledge, it makes sense for it to be the one that implements a section
> ordering heuristic instead of just being told by some other tool, which
> would complicate the build.
> However, do we need to start with instrumentation? The original paper
> uses sampling with good results and current Intel CPUs can record every
> branch in a program.

This already works without instrumentation. You can probably try it out as is with profiles generated with linux perf using the create_llvm_prof tool from the autofdo work: https://github.com/google/autofdo

> I would propose starting with just an lld patch that reads the call
> graph from a file. The format would be very similar to what you propose,
> just weight,caller,callee.
>
> In another patch we can then look at instrumentation: why it is more
> convenient for some uses and what performance advantage it might have.
>
> I have written a small tool that uses intel_bts and 'perf script' to
> construct the call graph. I am giving it a try with your lld patch and
> will hopefully post results today.
>
> Cheers,
> Rafael
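Rafael's perf-based tool isn't shown in the thread, but the counting step such a tool would need is straightforward. The sketch below is hypothetical: it assumes branch records have already been reduced to one "caller callee" pair of symbol names per line (extracting and symbolizing those pairs from 'perf script' output is the hard part and is not shown), and it emits the proposed weight,caller,callee lines.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical aggregation step for a perf-based call graph tool: count how
// often each (caller, callee) pair occurs in the sampled branch records and
// emit one "weight,caller,callee" line per edge.
std::string aggregateBranches(std::istream &Pairs) {
  std::map<std::pair<std::string, std::string>, uint64_t> Counts;
  std::string Caller, Callee;
  while (Pairs >> Caller >> Callee)
    ++Counts[{Caller, Callee}];

  std::ostringstream Out;
  for (const auto &KV : Counts)
    Out << KV.second << ',' << KV.first.first << ',' << KV.first.second << '\n';
  return Out.str();
}
```

A sampled count is only proportional to the true call frequency, but since the C³ heuristic only compares edge weights against each other, relative counts are all the layout needs.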