Michael Spencer via llvm-dev
2017-Jun-15 16:51 UTC
[llvm-dev] [RFC] Profile guided section layout
I've recently implemented profile guided section layout in llvm + lld using the Call-Chain Clustering (C³) heuristic from https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf . In the programs I've tested it on I've gotten from 0% to 5% performance improvement over standard PGO, with zero cases of slowdowns, and up to a 15% reduction in ITLB misses.

There are three parts to this implementation.

The first is a new llvm pass which uses branch frequency info to get counts for each call instruction and then adds a module flags metadata table of function -> function edges along with their counts.

The second takes the module flags metadata and writes it into a .note.llvm.callgraph section in the object file. This currently just dumps it as text, but could save space by reusing the string table.

The last part is in lld. It reads the .note.llvm.callgraph data from each object file and merges them into a single table. It then builds a call graph based on the profile data, then iteratively merges the hottest call edges using the C³ heuristic as long as doing so would not create a cluster larger than the page size. All clusters are then sorted by a density metric to further improve locality.

There are a couple of issues with the current implementation that shouldn't be too hard to fix:

The .note.llvm.callgraph sections add about 10% to the size of each object file. This can be fixed by reusing the symbol string table instead of duplicating all the strings.

The ordering algorithm is O(n²) in the number of call graph edges. This is because it uses a trivial algorithm and can be fixed by doing something more intelligent.

It doesn't currently work for LTO, as the llvm pass needs to be run after all inlining decisions have been made, and LTO codegen has to be done with -ffunction-sections.

It's only implemented for ELF.
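As a rough illustration of the clustering step described above, here is a minimal, self-contained sketch of a C³-style merge loop. All names (`Cluster`, `c3Layout`, `PageSize`) and the union-find bookkeeping are this sketch's assumptions, not lld's: the actual patch operates on input sections and a mutable edge set, as shown in the attached diff.

```cpp
// Toy sketch of Call-Chain Clustering (C3): repeatedly merge the endpoints
// of the hottest call edge unless the merged cluster would exceed the page
// size, then order the surviving clusters by weight/size density.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cluster {
  std::vector<int> Sections; // section ids, in layout order
  uint64_t Size = 0;         // total byte size of the cluster
  uint64_t Weight = 0;       // sum of incoming call counts
};

struct Edge { int From, To; uint64_t Weight; };

constexpr uint64_t PageSize = 4096;

std::vector<int> c3Layout(std::vector<Cluster> Clusters,
                          std::vector<Edge> Edges) {
  // Process edges hottest-first.
  std::sort(Edges.begin(), Edges.end(),
            [](const Edge &A, const Edge &B) { return A.Weight > B.Weight; });

  // Union-find: Rep[I] points at the cluster I was merged into.
  std::vector<int> Rep(Clusters.size());
  for (std::size_t I = 0; I < Rep.size(); ++I) Rep[I] = I;
  auto Find = [&](int X) {
    while (Rep[X] != X) X = Rep[X];
    return X;
  };

  for (const Edge &E : Edges) {
    int A = Find(E.From), B = Find(E.To);
    if (A == B || Clusters[A].Size + Clusters[B].Size > PageSize)
      continue;
    // Append the callee's sections after the caller's.
    Clusters[A].Sections.insert(Clusters[A].Sections.end(),
                                Clusters[B].Sections.begin(),
                                Clusters[B].Sections.end());
    Clusters[A].Size += Clusters[B].Size;
    Clusters[A].Weight += Clusters[B].Weight;
    Rep[B] = A;
  }

  // Densest (hottest-per-byte) clusters first.
  std::vector<int> Roots;
  for (std::size_t I = 0; I < Clusters.size(); ++I)
    if (Find(I) == (int)I)
      Roots.push_back(I);
  std::sort(Roots.begin(), Roots.end(), [&](int A, int B) {
    return (double)Clusters[A].Weight / Clusters[A].Size >
           (double)Clusters[B].Weight / Clusters[B].Size;
  });

  std::vector<int> Order;
  for (int R : Roots)
    for (int S : Clusters[R].Sections)
      Order.push_back(S);
  return Order;
}
```

Note that, like the patch, this caps cluster growth at the page size rather than merging unconditionally, which is what bounds the ITLB working set per page.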
I've attached the llvm and lld patches and would appreciate feedback on the overall design, the method of threading the profile data to the linker, and an improved ordering algorithm. No need for a detailed code review as I'll be making changes, adding tests, and posting proper patches for review.

- Michael Spencer

diff --git a/ELF/Config.h b/ELF/Config.h
index 9c73b4c..01267a9 100644
--- a/ELF/Config.h
+++ b/ELF/Config.h
@@ -10,6 +10,7 @@
 #ifndef LLD_ELF_CONFIG_H
 #define LLD_ELF_CONFIG_H

+#include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/StringRef.h"
 #include "llvm/ADT/StringSet.h"
@@ -105,6 +106,8 @@ struct Configuration {
   std::vector<SymbolVersion> VersionScriptLocals;
   std::vector<uint8_t> BuildIdVector;
   llvm::MapVector<Symbol *, RenamedSymbol> RenamedSymbols;
+  llvm::DenseMap<std::pair<llvm::StringRef, llvm::StringRef>, uint64_t>
+      CFGProfile;
   bool AllowMultipleDefinition;
   bool AsNeeded = false;
   bool Bsymbolic;
@@ -124,6 +127,7 @@ struct Configuration {
   bool GnuHash;
   bool ICF;
   bool MipsN32Abi = false;
+  bool NoCFGProfileReorder;
   bool NoGnuUnique;
   bool NoUndefinedVersion;
   bool Nostdlib;
diff --git a/ELF/Driver.cpp b/ELF/Driver.cpp
index 1534c7e..06dd045 100644
--- a/ELF/Driver.cpp
+++ b/ELF/Driver.cpp
@@ -649,6 +649,7 @@ void LinkerDriver::readConfigs(opt::InputArgList &Args) {
   Config->LTOO = getInteger(Args, OPT_lto_O, 2);
   Config->LTOPartitions = getInteger(Args, OPT_lto_partitions, 1);
   Config->MapFile = getString(Args, OPT_Map);
+  Config->NoCFGProfileReorder = Args.hasArg(OPT_no_cfg_profile_reorder);
   Config->NoGnuUnique = Args.hasArg(OPT_no_gnu_unique);
   Config->NoUndefinedVersion = Args.hasArg(OPT_no_undefined_version);
   Config->Nostdlib = Args.hasArg(OPT_nostdlib);
diff --git a/ELF/InputFiles.cpp b/ELF/InputFiles.cpp
index 902593b..d64c1da 100644
--- a/ELF/InputFiles.cpp
+++ b/ELF/InputFiles.cpp
@@ -527,6 +527,35 @@ elf::ObjectFile<ELFT>::createInputSection(const Elf_Shdr &Sec) {
   if (Name == ".eh_frame" && !Config->Relocatable)
     return make<EhInputSection>(this, &Sec, Name);

+  // Profile data.
+  if (Name == ".note.llvm.callgraph") {
+    ArrayRef<uint8_t> CallgraphBuff =
+        check(this->getObj().getSectionContents(&Sec));
+
+    StringRef Buff((const char *)CallgraphBuff.data(), CallgraphBuff.size());
+
+    auto ReadString = [&Buff]() {
+      size_t F = Buff.find_first_of(" \n");
+      StringRef Ret = Buff.substr(0, F);
+      Buff = Buff.substr(F + 1);
+      return Ret;
+    };
+
+    while (!Buff.empty()) {
+      StringRef From = ReadString();
+      StringRef To = ReadString();
+      uint64_t Count;
+      if (ReadString().getAsInteger(10, Count))
+        break;
+
+      // Merge duplicate counts by picking the largest.
+      uint64_t &C = Config->CFGProfile[std::make_pair(From, To)];
+      C = std::max(C, Count);
+    }
+
+    return &InputSection::Discarded;
+  }
+
   if (shouldMerge(Sec))
     return make<MergeInputSection>(this, &Sec, Name);
   return make<InputSection>(this, &Sec, Name);
diff --git a/ELF/Options.td b/ELF/Options.td
index 335c7ad..cea9ee1 100644
--- a/ELF/Options.td
+++ b/ELF/Options.td
@@ -144,6 +144,9 @@ def nostdlib: F<"nostdlib">,
 def no_as_needed: F<"no-as-needed">,
   HelpText<"Always DT_NEEDED for shared libraries">;

+def no_cfg_profile_reorder: F<"no-cfg-profile-reorder">,
+  HelpText<"Disable reordering of sections based on profile information">;
+
 def no_color_diagnostics: F<"no-color-diagnostics">,
   HelpText<"Do not use colors in diagnostics">;
diff --git a/ELF/Writer.cpp b/ELF/Writer.cpp
index d41baec..d9bd757 100644
--- a/ELF/Writer.cpp
+++ b/ELF/Writer.cpp
@@ -20,11 +20,13 @@
 #include "SyntheticSections.h"
 #include "Target.h"
 #include "Threads.h"
+#include "llvm/ADT/Hashing.h"
 #include "llvm/ADT/StringMap.h"
 #include "llvm/ADT/StringSwitch.h"
 #include "llvm/Support/FileOutputBuffer.h"
 #include "llvm/Support/raw_ostream.h"
 #include <climits>
+#include <unordered_set>

 using namespace llvm;
 using namespace llvm::ELF;
@@ -922,6 +924,157 @@ static void sortBySymbolsOrder(ArrayRef<OutputSection *> OutputSections) {
     Sec->sort([&](InputSectionBase *S) { return SectionOrder.lookup(S); });
 }

+// Sort sections by the profile data provided in the .note.llvm.callgraph
+// sections.
+//
+// This algorithm is based on Call-Chain Clustering from:
+// Optimizing Function Placement for Large-Scale Data-Center Applications
+// https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+//
+// This first builds a call graph based on the profile data then iteratively
+// merges the hottest call edges as long as it would not create a cluster
+// larger than the page size. All clusters are then sorted by a density metric
+// to further improve locality.
+template <class ELFT>
+static void sortByCFGProfile(ArrayRef<OutputSection *> OutputSections) {
+  if (Config->NoCFGProfileReorder)
+    return;
+
+  using NodeIndex = std::ptrdiff_t;
+
+  struct Node {
+    Node() {}
+    Node(const InputSectionBase *IS) {
+      Sections.push_back(IS);
+      Size = IS->getSize();
+    }
+    std::vector<const InputSectionBase *> Sections;
+    int64_t Size = 0;
+    uint64_t Weight = 0;
+  };
+
+  struct Edge {
+    NodeIndex From;
+    NodeIndex To;
+    mutable uint64_t Weight;
+    bool operator==(const Edge Other) const {
+      return From == Other.From && To == Other.To;
+    }
+  };
+
+  struct EdgeHash {
+    std::size_t operator()(const Edge E) const {
+      return llvm::hash_combine(E.From, E.To);
+    };
+  };
+
+  std::vector<Node> Nodes;
+  std::unordered_set<Edge, EdgeHash> Edges;
+
+  auto InsertOrIncrementEdge = [](std::unordered_set<Edge, EdgeHash> &Edges,
+                                  const Edge E) {
+    if (E.From == E.To)
+      return;
+    auto Res = Edges.insert(E);
+    if (!Res.second)
+      Res.first->Weight = SaturatingAdd(Res.first->Weight, E.Weight);
+  };
+
+  {
+    llvm::DenseMap<const InputSectionBase *, NodeIndex> SecToNode;
+
+    auto GetOrCreateNode =
+        [&Nodes, &SecToNode](const InputSectionBase *IS) -> NodeIndex {
+      auto Res = SecToNode.insert(std::make_pair(IS, Nodes.size()));
+      if (Res.second)
+        Nodes.emplace_back(IS);
+      return Res.first->second;
+    };
+
+    // Create the graph.
+    for (const auto &C : Config->CFGProfile) {
+      if (C.second == 0)
+        continue;
+      DefinedRegular *FromDR = dyn_cast_or_null<DefinedRegular>(
+          Symtab<ELFT>::X->find(C.first.first));
+      DefinedRegular *ToDR = dyn_cast_or_null<DefinedRegular>(
+          Symtab<ELFT>::X->find(C.first.second));
+      if (!FromDR || !ToDR)
+        continue;
+      auto FromSB = dyn_cast_or_null<const InputSectionBase>(FromDR->Section);
+      auto ToSB = dyn_cast_or_null<const InputSectionBase>(ToDR->Section);
+      if (!FromSB || !ToSB)
+        continue;
+      NodeIndex From = GetOrCreateNode(FromSB);
+      NodeIndex To = GetOrCreateNode(ToSB);
+      InsertOrIncrementEdge(Edges, {From, To, C.second});
+      Nodes[To].Weight = SaturatingAdd(Nodes[To].Weight, C.second);
+    }
+  }
+
+  // Collapse the graph.
+  while (!Edges.empty()) {
+    // Find the largest edge.
+    // FIXME: non deterministic order for equal edges.
+    // FIXME: n^2
+    auto Max = std::max_element(
+        Edges.begin(), Edges.end(),
+        [](const Edge A, const Edge B) { return A.Weight < B.Weight; });
+    const Edge MaxE = *Max;
+    Edges.erase(Max);
+    // Merge the Nodes.
+    Node &From = Nodes[MaxE.From];
+    Node &To = Nodes[MaxE.To];
+    if (From.Size + To.Size > Target->PageSize)
+      continue;
+    From.Sections.insert(From.Sections.end(), To.Sections.begin(),
+                         To.Sections.end());
+    To.Sections.clear();
+    From.Size += To.Size;
+    From.Weight = SaturatingAdd(From.Weight, To.Weight);
+    // Collect all edges from or to the removed node and update them for the
+    // new node.
+    std::vector<Edge> OldEdges;
+    // FIXME: n^2
+    for (auto EI = Edges.begin(), EE = Edges.end(); EI != EE;) {
+      if (EI->From == MaxE.To || EI->To == MaxE.To) {
+        OldEdges.push_back(*EI);
+        EI = Edges.erase(EI);
+      } else
+        ++EI;
+    }
+    for (const Edge E : OldEdges) {
+      InsertOrIncrementEdge(Edges,
                            {E.From == MaxE.To ? MaxE.From : E.From,
+                             E.To == MaxE.To ? MaxE.From : E.To, E.Weight});
+    }
+  }
+
+  // Sort by density.
+  std::sort(Nodes.begin(), Nodes.end(), [](const Node &A, const Node &B) {
+    return double(A.Weight) / double(A.Size) <
+           double(B.Weight) / double(B.Size);
+  });
+
+  // Generate order.
+  llvm::DenseMap<const InputSectionBase *, std::size_t> OrderMap;
+  ssize_t CurOrder = 0;
+
+  for (const Node &N : Nodes) {
+    if (N.Sections.empty())
+      continue;
+    for (const InputSectionBase *IS : N.Sections)
+      OrderMap[IS] = CurOrder++;
+  }
+
+  for (OutputSection *OS : OutputSections) {
+    if (OS->Name != ".text")
+      continue;
+    OS->sort([&](InputSectionBase *IS) { return OrderMap.lookup(IS); });
+    break;
+  }
+}
+
 template <class ELFT>
 void Writer<ELFT>::forEachRelSec(std::function<void(InputSectionBase &)> Fn) {
   for (InputSectionBase *IS : InputSections) {
@@ -949,6 +1102,7 @@ template <class ELFT> void Writer<ELFT>::createSections() {
     if (IS)
       Factory.addInputSec(IS, getOutputSectionName(IS->Name));

+  sortByCFGProfile<ELFT>(OutputSections);
   sortBySymbolsOrder<ELFT>(OutputSections);
   sortInitFini(findSection(".init_array"));
   sortInitFini(findSection(".fini_array"));
-------------- next part --------------
diff --git a/include/llvm/InitializePasses.h b/include/llvm/InitializePasses.h
index abb0aa3..93084d8 100644
--- a/include/llvm/InitializePasses.h
+++ b/include/llvm/InitializePasses.h
@@ -85,6 +85,7 @@ void initializeBreakCriticalEdgesPass(PassRegistry&);
 void initializeCFGOnlyPrinterLegacyPassPass(PassRegistry&);
 void initializeCFGOnlyViewerLegacyPassPass(PassRegistry&);
 void initializeCFGPrinterLegacyPassPass(PassRegistry&);
+void initializeCFGProfilePassPass(PassRegistry&);
 void initializeCFGSimplifyPassPass(PassRegistry&);
 void initializeCFGViewerLegacyPassPass(PassRegistry&);
 void initializeCFLAndersAAWrapperPassPass(PassRegistry&);
diff --git a/include/llvm/LinkAllPasses.h b/include/llvm/LinkAllPasses.h
index c309ddb..59b9307 100644
--- a/include/llvm/LinkAllPasses.h
+++ b/include/llvm/LinkAllPasses.h
@@ -75,6 +75,7 @@ namespace {
       (void) llvm::createCallGraphDOTPrinterPass();
       (void) llvm::createCallGraphViewerPass();
       (void) llvm::createCFGSimplificationPass();
+      (void) llvm::createCFGProfilePass();
       (void) llvm::createLateCFGSimplificationPass();
       (void) llvm::createCFLAndersAAWrapperPass();
       (void) llvm::createCFLSteensAAWrapperPass();
diff --git a/include/llvm/Transforms/Instrumentation.h b/include/llvm/Transforms/Instrumentation.h
index b6c6c09..ec8741b 100644
--- a/include/llvm/Transforms/Instrumentation.h
+++ b/include/llvm/Transforms/Instrumentation.h
@@ -199,6 +199,8 @@ inline ModulePass *createDataFlowSanitizerPassForJIT(
 // checking on loads, stores, and other memory intrinsics.
 FunctionPass *createBoundsCheckingPass();

+ModulePass *createCFGProfilePass();
+
 /// \brief Calculate what to divide by to scale counts.
 ///
 /// Given the maximum count, calculate a divisor that will scale all the
diff --git a/lib/CodeGen/TargetLoweringObjectFileImpl.cpp b/lib/CodeGen/TargetLoweringObjectFileImpl.cpp
index 6922e33..9f786b0 100644
--- a/lib/CodeGen/TargetLoweringObjectFileImpl.cpp
+++ b/lib/CodeGen/TargetLoweringObjectFileImpl.cpp
@@ -97,16 +97,48 @@ void TargetLoweringObjectFileELF::emitModuleMetadata(
   StringRef Section;
   GetObjCImageInfo(M, Version, Flags, Section);
-  if (Section.empty())
-    return;
+  if (!Section.empty()) {
+    auto &C = getContext();
+    auto *S = C.getELFSection(Section, ELF::SHT_PROGBITS, ELF::SHF_ALLOC);
+    Streamer.SwitchSection(S);
+    Streamer.EmitLabel(C.getOrCreateSymbol(StringRef("OBJC_IMAGE_INFO")));
+    Streamer.EmitIntValue(Version, 4);
+    Streamer.EmitIntValue(Flags, 4);
+    Streamer.AddBlankLine();
+  }

-  auto &C = getContext();
-  auto *S = C.getELFSection(Section, ELF::SHT_PROGBITS, ELF::SHF_ALLOC);
-  Streamer.SwitchSection(S);
-  Streamer.EmitLabel(C.getOrCreateSymbol(StringRef("OBJC_IMAGE_INFO")));
-  Streamer.EmitIntValue(Version, 4);
-  Streamer.EmitIntValue(Flags, 4);
-  Streamer.AddBlankLine();
+  SmallVector<Module::ModuleFlagEntry, 8> ModuleFlags;
+  M.getModuleFlagsMetadata(ModuleFlags);
+
+  MDNode *CFGProfile = nullptr;
+
+  for (const auto &MFE : ModuleFlags) {
+    StringRef Key = MFE.Key->getString();
+    if (Key == "CFG Profile") {
+      CFGProfile = cast<MDNode>(MFE.Val);
+      break;
+    }
+  }
+
+  if (!CFGProfile)
+    return;
+
+  MCSectionELF *Sec =
+      getContext().getELFSection(".note.llvm.callgraph", ELF::SHT_NOTE, 0);
+  Streamer.SwitchSection(Sec);
+  SmallString<256> Out;
+  for (const auto &Edge : CFGProfile->operands()) {
+    raw_svector_ostream O(Out);
+    MDNode *E = cast<MDNode>(Edge);
+    O << cast<MDString>(E->getOperand(0))->getString() << " "
+      << cast<MDString>(E->getOperand(1))->getString() << " "
+      << cast<ConstantAsMetadata>(E->getOperand(2))
+             ->getValue()
+             ->getUniqueInteger()
+             .getZExtValue()
+      << "\n";
+    Streamer.EmitBytes(O.str());
+    Out.clear();
+  }
 }

 MCSymbol *TargetLoweringObjectFileELF::getCFIPersonalitySymbol(
diff --git a/lib/Transforms/IPO/PassManagerBuilder.cpp b/lib/Transforms/IPO/PassManagerBuilder.cpp
index 4bc64ab..585962d 100644
--- a/lib/Transforms/IPO/PassManagerBuilder.cpp
+++ b/lib/Transforms/IPO/PassManagerBuilder.cpp
@@ -698,6 +698,8 @@ void PassManagerBuilder::populateModulePassManager(
     }
   }

+  MPM.add(createCFGProfilePass());
+
   if (MergeFunctions)
     MPM.add(createMergeFunctionsPass());
diff --git a/lib/Transforms/Instrumentation/CFGProfile.cpp b/lib/Transforms/Instrumentation/CFGProfile.cpp
new file mode 100644
index 0000000..6aa76d3
--- /dev/null
+++ b/lib/Transforms/Instrumentation/CFGProfile.cpp
@@ -0,0 +1,103 @@
+//===-- CFGProfile.cpp ----------------------------------------------------===//
+//
+//                     The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Analysis/BlockFrequencyInfo.h"
+#include "llvm/Analysis/BranchProbabilityInfo.h"
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/MDBuilder.h"
+#include "llvm/IR/PassManager.h"
+#include "llvm/Transforms/Instrumentation.h"
+
+#include <array>
+
+using namespace llvm;
+
+class CFGProfilePass : public ModulePass {
+public:
+  static char ID;
+
+  CFGProfilePass() : ModulePass(ID) {
+    initializeCFGProfilePassPass(*PassRegistry::getPassRegistry());
+  }
+
+  StringRef getPassName() const override { return "CFGProfilePass"; }
+
+private:
+  bool runOnModule(Module &M) override;
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<BlockFrequencyInfoWrapperPass>();
+    AU.addRequired<BranchProbabilityInfoWrapperPass>();
+  }
+};
+
+bool CFGProfilePass::runOnModule(Module &M) {
+  if (skipModule(M))
+    return false;
+
+  llvm::DenseMap<std::pair<StringRef, StringRef>, uint64_t> Counts;
+
+  for (auto &F : M) {
+    if (F.isDeclaration())
+      continue;
+    getAnalysis<BranchProbabilityInfoWrapperPass>(F).getBPI();
+    auto &BFI = getAnalysis<BlockFrequencyInfoWrapperPass>(F).getBFI();
+    for (auto &BB : F) {
+      Optional<uint64_t> BBCount = BFI.getBlockProfileCount(&BB);
+      if (!BBCount)
+        continue;
+      for (auto &I : BB) {
+        auto *CI = dyn_cast<CallInst>(&I);
+        if (!CI)
+          continue;
+        Function *CalledF = CI->getCalledFunction();
+        if (!CalledF || CalledF->isIntrinsic())
+          continue;
+
+        uint64_t &Count =
+            Counts[std::make_pair(F.getName(), CalledF->getName())];
+        Count = SaturatingAdd(Count, *BBCount);
+      }
+    }
+  }
+
+  if (Counts.empty())
+    return false;
+
+  LLVMContext &Context = M.getContext();
+  MDBuilder MDB(Context);
+  std::vector<Metadata *> Nodes;
+
+  for (auto E : Counts) {
+    SmallVector<Metadata *, 3> Vals;
+    Vals.push_back(MDB.createString(E.first.first));
+    Vals.push_back(MDB.createString(E.first.second));
+    Vals.push_back(MDB.createConstant(
+        ConstantInt::get(Type::getInt64Ty(Context), E.second)));
+    Nodes.push_back(MDNode::get(Context, Vals));
+  }
+
+  M.addModuleFlag(Module::Append, "CFG Profile", MDNode::get(Context, Nodes));
+
+  return true;
+}
+
+char CFGProfilePass::ID = 0;
+INITIALIZE_PASS_BEGIN(CFGProfilePass, "cfg-profile",
+                      "Generate profile information from the CFG.", false,
+                      false)
+INITIALIZE_PASS_DEPENDENCY(BlockFrequencyInfoWrapperPass)
+INITIALIZE_PASS_DEPENDENCY(BranchProbabilityInfoWrapperPass)
+INITIALIZE_PASS_END(CFGProfilePass, "cfg-profile",
+                    "Generate profile information from the CFG.", false, false)
+
+ModulePass *llvm::createCFGProfilePass() { return new CFGProfilePass(); }
diff --git a/lib/Transforms/Instrumentation/CMakeLists.txt b/lib/Transforms/Instrumentation/CMakeLists.txt
index 7ff69b9..197575f 100644
--- a/lib/Transforms/Instrumentation/CMakeLists.txt
+++ b/lib/Transforms/Instrumentation/CMakeLists.txt
@@ -1,6 +1,7 @@
 add_llvm_library(LLVMInstrumentation
   AddressSanitizer.cpp
   BoundsChecking.cpp
+  CFGProfile.cpp
   DataFlowSanitizer.cpp
   GCOVProfiling.cpp
   MemorySanitizer.cpp
diff --git a/lib/Transforms/Instrumentation/Instrumentation.cpp b/lib/Transforms/Instrumentation/Instrumentation.cpp
index 7bb62d2..d147a52 100644
--- a/lib/Transforms/Instrumentation/Instrumentation.cpp
+++ b/lib/Transforms/Instrumentation/Instrumentation.cpp
@@ -60,6 +60,7 @@ void llvm::initializeInstrumentation(PassRegistry &Registry) {
   initializeAddressSanitizerModulePass(Registry);
   initializeBoundsCheckingPass(Registry);
   initializeGCOVProfilerLegacyPassPass(Registry);
+  initializeCFGProfilePassPass(Registry);
   initializePGOInstrumentationGenLegacyPassPass(Registry);
   initializePGOInstrumentationUseLegacyPassPass(Registry);
   initializePGOIndirectCallPromotionLegacyPassPass(Registry);
Tobias Edler von Koch via llvm-dev
2017-Jun-15 17:08 UTC
[llvm-dev] [RFC] Profile guided section layout
Hi Michael,

This is cool stuff, thanks for sharing!

On 06/15/2017 11:51 AM, Michael Spencer via llvm-dev wrote:
> The first is a new llvm pass which uses branch frequency info to get
> counts for each call instruction and then adds a module flags
> metadata table of function -> function edges along with their counts.
>
> The second takes the module flags metadata and writes it into a
> .note.llvm.callgraph section in the object file. This currently just
> dumps it as text, but could save space by reusing the string table.

Have you considered reading the profile in the linker and extracting that information directly from the profile? The profile should contain call sites and their sample counts, and you could match these up with relocations (calls) in the section?

> It doesn't currently work for LTO as the llvm pass needs to be run
> after all inlining decisions have been made and LTO codegen has to be
> done with -ffunction-sections.

So this is just an implementation issue, right? You can make LTO run with -ffunction-sections (by setting TargetOptions.FunctionSections=true) and insert your pass in the appropriate place in the pipeline.

Thanks,
Tobias

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
Michael Spencer via llvm-dev
2017-Jun-15 17:55 UTC
[llvm-dev] [RFC] Profile guided section layout
On Thu, Jun 15, 2017 at 10:08 AM, Tobias Edler von Koch <tobias at codeaurora.org> wrote:

> Hi Michael,
>
> This is cool stuff, thanks for sharing!
>
> On 06/15/2017 11:51 AM, Michael Spencer via llvm-dev wrote:
>
>> The first is a new llvm pass which uses branch frequency info to get
>> counts for each call instruction and then adds a module flags metadata
>> table of function -> function edges along with their counts.
>>
>> The second takes the module flags metadata and writes it into a
>> .note.llvm.callgraph section in the object file. This currently just
>> dumps it as text, but could save space by reusing the string table.
>
> Have you considered reading the profile in the linker and extracting that
> information directly from the profile? The profile should contain call
> sites and their sample counts and you could match these up with
> relocations (calls) in the section?

I did this using IR PGO instead of sample PGO, so the profile data can only be applied at the same place in the pipeline where it is generated. Even for sample-based PGO this would be complicated, as the linker would need to reconstruct machine basic blocks from sections to accurately match sample counts to relocations; there may be cold calls in hot functions.

It may be useful, however, for the linker to directly accept an externally generated call graph profile. The current approach can actually do this already by embedding the profile into an extra object file.

>> It doesn't currently work for LTO as the llvm pass needs to be run after
>> all inlining decisions have been made and LTO codegen has to be done with
>> -ffunction-sections.
>
> So this is just an implementation issue, right? You can make LTO run with
> -ffunction-sections (by setting TargetOptions.FunctionSections=true) and
> insert your pass in the appropriate place in the pipeline.

Yeah, just an implementation issue. I just need to build the pass pipeline differently for LTO and add a way to do -ffunction-sections in lld.

- Michael Spencer
Rafael Avila de Espindola via llvm-dev
2017-Jul-31 21:20 UTC
[llvm-dev] [RFC] Profile guided section layout
Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:

> I've recently implemented profile guided section layout in llvm + lld using
> the Call-Chain Clustering (C³) heuristic from
> https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
> . In the programs I've tested it on I've gotten from 0% to 5% performance
> improvement over standard PGO with zero cases of slowdowns and up to 15%
> reduction in ITLB misses.
>
> There are three parts to this implementation.
>
> The first is a new llvm pass which uses branch frequency info to get counts
> for each call instruction and then adds a module flags metadata table of
> function -> function edges along with their counts.
>
> The second takes the module flags metadata and writes it into a
> .note.llvm.callgraph section in the object file. This currently just dumps
> it as text, but could save space by reusing the string table.
>
> The last part is in lld. It reads the .note.llvm.callgraph data from each
> object file and merges them into a single table. It then builds a call
> graph based on the profile data then iteratively merges the hottest call
> edges using the C³ heuristic as long as it would not create a cluster
> larger than the page size. All clusters are then sorted by a density metric
> to further improve locality.

Since the branch frequency info is in an llvm-specific format, it makes sense for llvm to read it instead of expecting lld to do it again. Since .o files are how the compiler talks to the linker, it also makes sense for llvm to record the required information there.

In the same way, since the linker is the first place with global knowledge, it makes sense for it to be the one that implements a section ordering heuristic, instead of just being told the order by some other tool, which would complicate the build.

However, do we need to start with instrumentation? The original paper uses sampling with good results, and current Intel CPUs can record every branch in a program.

I would propose starting with just an lld patch that reads the call graph from a file. The format would be very similar to what you propose, just weight,caller,callee.

In another patch we can then look at instrumentation: why it is more convenient for some uses and what performance advantage it might have.

I have written a small tool that uses intel_bts and 'perf script' to construct the call graph. I am giving it a try with your lld patch and will hopefully post results today.

Cheers,
Rafael
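A reader for the weight,caller,callee text format Rafael proposes could look like the following sketch. The function name `readCallGraphProfile` and the policy of summing duplicate edges are this sketch's assumptions; the patch he mentions is not shown in this thread.

```cpp
// Parse lines of the form "weight,caller,callee" into an edge-weight map,
// summing the weights of duplicate (caller, callee) edges.
#include <cassert>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

using EdgeKey = std::pair<std::string, std::string>; // (caller, callee)

std::map<EdgeKey, uint64_t> readCallGraphProfile(std::istream &In) {
  std::map<EdgeKey, uint64_t> Edges;
  std::string Line;
  while (std::getline(In, Line)) {
    std::istringstream Fields(Line);
    std::string Weight, Caller, Callee;
    if (!std::getline(Fields, Weight, ',') ||
        !std::getline(Fields, Caller, ',') || !std::getline(Fields, Callee))
      continue; // skip malformed lines
    Edges[{Caller, Callee}] += std::stoull(Weight);
  }
  return Edges;
}
```

Note this differs from the attached lld patch in two small ways: the patch's .note.llvm.callgraph section is space-separated with the count last, and the patch merges duplicates by taking the maximum rather than the sum.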
Tobias Edler von Koch via llvm-dev
2017-Jul-31 21:43 UTC
[llvm-dev] [RFC] Profile guided section layout
Hi Rafael,

On 07/31/2017 04:20 PM, Rafael Avila de Espindola via llvm-dev wrote:
> However, do we need to start with instrumentation? The original paper
> uses sampling with good results and current Intel CPUs can record every
> branch in a program.
>
> I would propose starting with just an lld patch that reads the call
> graph from a file. The format would be very similar to what you propose,
> just weight,caller,callee.

The advantage of the proposed approach (a weighted call graph section) is that it's completely transparent: it works regardless of the particular profiling methodology (as long as there's !prof metadata when the pass runs). For this reason, it fits neatly into an *existing* PGO-based build flow. I only need to add one compiler flag to enable it. That's a big plus.

On the other hand, I could see how your idea (a call graph input file for the linker) would be useful in situations where I just want to do section layout but no PGO in the compiler... and of course for testing the linker's sorting algorithm.

So there are merits in both, but for my use cases Michael's original approach is the most practical.

Tobias

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
Sean Silva via llvm-dev
2017-Jul-31 23:18 UTC
[llvm-dev] [RFC] Profile guided section layout
On Jul 31, 2017 2:21 PM, "Rafael Avila de Espindola via llvm-dev" <llvm-dev at lists.llvm.org> wrote:

Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:
> I've recently implemented profile guided section layout in llvm + lld using
> the Call-Chain Clustering (C³) heuristic from
> https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
> . In the programs I've tested it on I've gotten from 0% to 5% performance
> improvement over standard PGO with zero cases of slowdowns and up to 15%
> reduction in ITLB misses.
>
> There are three parts to this implementation.
>
> The first is a new llvm pass which uses branch frequency info to get counts
> for each call instruction and then adds a module flags metadata table of
> function -> function edges along with their counts.
>
> The second takes the module flags metadata and writes it into a
> .note.llvm.callgraph section in the object file. This currently just dumps
> it as text, but could save space by reusing the string table.
>
> The last part is in lld. It reads the .note.llvm.callgraph data from each
> object file and merges them into a single table. It then builds a call
> graph based on the profile data then iteratively merges the hottest call
> edges using the C³ heuristic as long as it would not create a cluster
> larger than the page size. All clusters are then sorted by a density metric
> to further improve locality.

Since the branch frequency info is in a llvm specific format, it makes sense for llvm to read it instead of expecting lld to do it again. Since .o files is how the compiler talks to the linker, it also makes sense for llvm to record the required information there.

In the same way, since the linker is the first place with global knowledge, it makes sense for it to be the one that implements a section ordering heuristic instead of just being told by some other tool, which would complicate the build.

However, do we need to start with instrumentation? The original paper uses sampling with good results and current Intel CPUs can record every branch in a program.

Both sample and instrumentation are turned into the same profile metadata in the IR, which is what feeds BranchFrequencyInfo. So it should be naturally agnostic to which method is used.

-- Sean Silva

I would propose starting with just an lld patch that reads the call graph from a file. The format would be very similar to what you propose, just weight,caller,callee.

In another patch we can then look at instrumentation: why it is more convenient for some uses and what performance advantage it might have.

I have written a small tool that uses intel_bts and 'perf script' to construct the call graph. I am giving it a try with your lld patch and will hopefully post results today.

Cheers,
Rafael
Michael Spencer via llvm-dev
2017-Aug-01 03:32 UTC
[llvm-dev] [RFC] Profile guided section layout
On Mon, Jul 31, 2017 at 2:20 PM, Rafael Avila de Espindola < rafael.espindola at gmail.com> wrote:> Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes: > > > I've recently implemented profile guided section layout in llvm + lld > using > > the Call-Chain Clustering (C³) heuristic from > > https://research.fb.com/wp-content/uploads/2017/01/ > cgo2017-hfsort-final1.pdf > > . In the programs I've tested it on I've gotten from 0% to 5% performance > > improvement over standard PGO with zero cases of slowdowns and up to 15% > > reduction in ITLB misses. > > > > > > There are three parts to this implementation. > > > > The first is a new llvm pass which uses branch frequency info to get > counts > > for each call instruction and then adds a module flags metatdata table of > > function -> function edges along with their counts. > > > > The second takes the module flags metadata and writes it into a > > .note.llvm.callgraph section in the object file. This currently just > dumps > > it as text, but could save space by reusing the string table. > > > > The last part is in lld. It reads the .note.llvm.callgraph data from each > > object file and merges them into a single table. It then builds a call > > graph based on the profile data then iteratively merges the hottest call > > edges using the C³ heuristic as long as it would not create a cluster > > larger than the page size. All clusters are then sorted by a density > metric > > to further improve locality. > > Since the branch frequency info is in a llvm specific format, it makes > sense for llvm to read it instead of expecting lld to do it again. Since > .o files is how the compiler talks to the linker, it also makes sense > for llvm to record the required information there. 
> In the same way, since the linker is the first place with global
> knowledge, it makes sense for it to be the one that implements a section
> ordering heuristic instead of just being told by some other tool, which
> would complicate the build.
>
> However, do we need to start with instrumentation? The original paper
> uses sampling with good results and current Intel CPUs can record every
> branch in a program.
>
> I would propose starting with just an lld patch that reads the call
> graph from a file. The format would be very similar to what you propose,
> just weight,caller,callee.
>
> In another patch we can then look at instrumentation: why it is more
> convenient for some uses and what performance advantage it might have.
>
> I have written a small tool that uses intel_bts and 'perf script' to
> construct the call graph. I am giving it a try with your lld patch and
> will hopefully post results today.
>
> Cheers,
> Rafael

I'm fine with approaching this from lld first, as the input reading is the simplest part of this. I started from instrumentation-based profiling because I started with a target program that already had the infrastructure set up to do it.

- Michael Spencer
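The C³ merge loop described in the quoted RFC can be sketched as follows. This is an illustrative reduction, not the actual lld code: the page-size constant, the one-function-per-initial-cluster setup, and all names are assumptions, and real sizes and sample counts would come from the object files and the profile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t PageSize = 4096; // Illustrative cluster size cap.

struct Cluster {
  std::vector<int> Funcs;   // Function indices, in layout order.
  uint64_t Size = 0;        // Total byte size of the cluster.
  uint64_t Samples = 0;     // Total profile weight of the cluster.
  double density() const { return Size ? double(Samples) / Size : 0; }
};

struct Edge {
  int From, To;             // Caller and callee function indices.
  uint64_t Weight;          // Call count from the profile.
};

// Merge the hottest edges first, refusing merges that would grow a cluster
// past the page size, then order the surviving clusters by decreasing
// density, as the RFC describes.
std::vector<Cluster> layout(std::vector<Cluster> Clusters,
                            std::vector<Edge> Edges) {
  // Each function starts in its own cluster; Owner maps function -> cluster.
  std::vector<int> Owner(Clusters.size());
  for (size_t I = 0; I < Owner.size(); ++I)
    Owner[I] = static_cast<int>(I);

  std::sort(Edges.begin(), Edges.end(),
            [](const Edge &A, const Edge &B) { return A.Weight > B.Weight; });

  for (const Edge &E : Edges) {
    int A = Owner[E.From], B = Owner[E.To];
    if (A == B || Clusters[A].Size + Clusters[B].Size > PageSize)
      continue;
    // Append cluster B to cluster A.
    for (int F : Clusters[B].Funcs)
      Owner[F] = A;
    Clusters[A].Funcs.insert(Clusters[A].Funcs.end(),
                             Clusters[B].Funcs.begin(),
                             Clusters[B].Funcs.end());
    Clusters[A].Size += Clusters[B].Size;
    Clusters[A].Samples += Clusters[B].Samples;
    Clusters[B].Funcs.clear();
  }

  // Drop emptied clusters and sort the rest hottest-per-byte first.
  Clusters.erase(std::remove_if(Clusters.begin(), Clusters.end(),
                                [](const Cluster &C) { return C.Funcs.empty(); }),
                 Clusters.end());
  std::sort(Clusters.begin(), Clusters.end(),
            [](const Cluster &A, const Cluster &B) {
              return A.density() > B.density();
            });
  return Clusters;
}
```

The page-size cap is what keeps a hot cluster from spilling across an ITLB page; the final density sort packs the hottest code together at the front of the section.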
Justin Bogner via llvm-dev
2017-Aug-01 20:57 UTC
[llvm-dev] [RFC] Profile guided section layout
Rafael Avila de Espindola via llvm-dev <llvm-dev at lists.llvm.org> writes:

> Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>> I've recently implemented profile guided section layout in llvm + lld using
>> the Call-Chain Clustering (C³) heuristic from
>> https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf .
>> In the programs I've tested it on I've gotten from 0% to 5% performance
>> improvement over standard PGO with zero cases of slowdowns and up to 15%
>> reduction in ITLB misses.
>>
>> There are three parts to this implementation.
>>
>> The first is a new llvm pass which uses branch frequency info to get counts
>> for each call instruction and then adds a module flags metadata table of
>> function -> function edges along with their counts.
>>
>> The second takes the module flags metadata and writes it into a
>> .note.llvm.callgraph section in the object file. This currently just dumps
>> it as text, but could save space by reusing the string table.
>>
>> The last part is in lld. It reads the .note.llvm.callgraph data from each
>> object file and merges them into a single table. It then builds a call
>> graph based on the profile data, then iteratively merges the hottest call
>> edges using the C³ heuristic as long as it would not create a cluster
>> larger than the page size. All clusters are then sorted by a density metric
>> to further improve locality.
>
> Since the branch frequency info is in an llvm-specific format, it makes
> sense for llvm to read it instead of expecting lld to do it again. Since
> .o files are how the compiler talks to the linker, it also makes sense
> for llvm to record the required information there.
>
> In the same way, since the linker is the first place with global
> knowledge, it makes sense for it to be the one that implements a section
> ordering heuristic instead of just being told by some other tool, which
> would complicate the build.
> However, do we need to start with instrumentation? The original paper
> uses sampling with good results and current Intel CPUs can record every
> branch in a program.

This already works without instrumentation. You can probably try it out as is with profiles generated with linux perf using the create_llvm_prof tool from the autofdo work: https://github.com/google/autofdo

> I would propose starting with just an lld patch that reads the call
> graph from a file. The format would be very similar to what you propose,
> just weight,caller,callee.
>
> In another patch we can then look at instrumentation: why it is more
> convenient for some uses and what performance advantage it might have.
>
> I have written a small tool that uses intel_bts and 'perf script' to
> construct the call graph. I am giving it a try with your lld patch and
> will hopefully post results today.
>
> Cheers,
> Rafael
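Rafael's perf-based tool isn't shown in the thread, but the counting step such a tool would need is straightforward. The sketch below is hypothetical: it assumes branch records have already been reduced to one "caller callee" pair of symbol names per line (extracting and symbolizing those pairs from 'perf script' output is the hard part and is not shown), and it emits the proposed weight,caller,callee lines.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical aggregation step for a perf-based call graph tool: count how
// often each (caller, callee) pair occurs in the sampled branch records and
// emit one "weight,caller,callee" line per edge.
std::string aggregateBranches(std::istream &Pairs) {
  std::map<std::pair<std::string, std::string>, uint64_t> Counts;
  std::string Caller, Callee;
  while (Pairs >> Caller >> Callee)
    ++Counts[{Caller, Callee}];

  std::ostringstream Out;
  for (const auto &KV : Counts)
    Out << KV.second << ',' << KV.first.first << ',' << KV.first.second << '\n';
  return Out.str();
}
```

A sampled count is only proportional to the true call frequency, but since the C³ heuristic only compares edge weights against each other, relative counts are all the layout needs.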