thr3ads.net - llvm dev - [llvm-dev] [RFC] Profile guided section layout [Jul 2017]

Tobias Edler von Koch via llvm-dev

2017-Jul-31 21:43 UTC

[llvm-dev] [RFC] Profile guided section layout

Hi Rafael,

On 07/31/2017 04:20 PM, Rafael Avila de Espindola via llvm-dev wrote:
> However, do we need to start with instrumentation? The original paper
> uses sampling with good results and current intel cpus can record every
> branch in a program.
>
> I would propose starting with just an lld patch that reads the call
> graph from a file. The format would be very similar to what you propose,
> just weight,caller,callee.

The advantage of the proposed approach (weighted callgraph section) is
that it's completely transparent: it works regardless of the particular 
profiling methodology (as long as there's !perf metadata when the pass 
runs). For this reason, it fits neatly into an *existing* PGO-based 
build flow. I only need to add 1 compiler flag to enable it. That's a 
big plus.

On the other hand, I could see how your idea (callgraph input file for 
linker) would be useful in situations where I just want to do section 
layout but no PGO in the compiler... and of course for testing of the 
linker's sorting algorithm.

So there's merits in both, but for my use cases Michael's original 
approach is the most practical.

Tobias

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.

Rafael Avila de Espindola via llvm-dev

2017-Jul-31 22:11 UTC

[llvm-dev] [RFC] Profile guided section layout

Tobias Edler von Koch <tobias at codeaurora.org> writes:
> Hi Rafael,
>
> On 07/31/2017 04:20 PM, Rafael Avila de Espindola via llvm-dev wrote:
>> However, do we need to start with instrumentation? The original paper
>> uses sampling with good results and current intel cpus can record every
>> branch in a program.
>>
>> I would propose starting with just an lld patch that reads the call
>> graph from a file. The format would be very similar to what you propose,
>> just weight,caller,callee.
>
> The advantage of the proposed approach (weighted callgraph section) is 
> that it's completely transparent: it works regardless of the particular
> profiling methodology (as long as there's !perf metadata when the pass 
> runs). For this reason, it fits neatly into an *existing* PGO-based 
> build flow. I only need to add 1 compiler flag to enable it. That's a 
> big plus.
>
> On the other hand, I could see how your idea (callgraph input file for 
> linker) would be useful in situations where I just want to do section 
> layout but no PGO in the compiler... and of course for testing of the 
> linker's sorting algorithm.
>
> So there's merits in both, but for my use cases Michael's original 
> approach is the most practical.

Yes, I must stress that I am not proposing having the option of reading
the callgraph from another file *instead* of reading it from .o
files. Just that doing it first decouples most of the lld patch from the
llvm changes and can be useful in cases where only samples are available.

Cheers,
Rafael
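
For illustration only, a reader for the "weight,caller,callee" file format
discussed above could look roughly like the sketch below. It is not taken
from the lld patch. The names parseCallGraph, parseUnsigned and
CallGraphProfile are hypothetical, the one-record-per-line layout is an
assumption, and keeping the largest weight for a repeated edge simply
mirrors what the attached patch does for .note.llvm.callgraph sections.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical container: (caller, callee) -> call count.
using CallGraphProfile =
    std::map<std::pair<std::string, std::string>, uint64_t>;

// Parses a non-empty string of decimal digits; rejects anything else,
// including values that do not fit in 64 bits.
static bool parseUnsigned(const std::string &S, uint64_t &Out) {
  if (S.empty())
    return false;
  uint64_t V = 0;
  for (char C : S) {
    if (C < '0' || C > '9')
      return false;
    uint64_t D = static_cast<uint64_t>(C - '0');
    if (V > (UINT64_MAX - D) / 10)
      return false; // would overflow
    V = V * 10 + D;
  }
  Out = V;
  return true;
}

// Reads "weight,caller,callee" records, one per line (assumed layout).
// Malformed lines are skipped rather than treated as fatal errors.
CallGraphProfile parseCallGraph(std::istream &In) {
  CallGraphProfile Profile;
  std::string Line;
  while (std::getline(In, Line)) {
    std::size_t C1 = Line.find(',');
    std::size_t C2 = C1 == std::string::npos ? C1 : Line.find(',', C1 + 1);
    if (C2 == std::string::npos)
      continue;
    uint64_t Weight;
    if (!parseUnsigned(Line.substr(0, C1), Weight))
      continue;
    std::string Caller = Line.substr(C1 + 1, C2 - C1 - 1);
    std::string Callee = Line.substr(C2 + 1);
    if (Caller.empty() || Callee.empty())
      continue;
    // Assumption: duplicate edges keep the largest weight seen.
    uint64_t &W = Profile[{Caller, Callee}];
    W = std::max(W, Weight);
  }
  return Profile;
}
```

Skipping malformed lines keeps the sketch short; a real linker option would
more likely report them as errors.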

Rafael Espíndola via llvm-dev

2017-Jul-31 22:12 UTC

[llvm-dev] [RFC] Profile guided section layout

A rebased version of the lld patch is attached.

Cheers,
Rafael


On 31 July 2017 at 15:11, Rafael Avila de Espindola
<rafael.espindola at gmail.com> wrote:
> Tobias Edler von Koch <tobias at codeaurora.org> writes:
>
>> Hi Rafael,
>>
>> On 07/31/2017 04:20 PM, Rafael Avila de Espindola via llvm-dev wrote:
>>> However, do we need to start with instrumentation? The original paper
>>> uses sampling with good results and current intel cpus can record every
>>> branch in a program.
>>>
>>> I would propose starting with just an lld patch that reads the call
>>> graph from a file. The format would be very similar to what you propose,
>>> just weight,caller,callee.
>>
>> The advantage of the proposed approach (weighted callgraph section) is
>> that it's completely transparent: it works regardless of the particular
>> profiling methodology (as long as there's !perf metadata when the pass
>> runs). For this reason, it fits neatly into an *existing* PGO-based
>> build flow. I only need to add 1 compiler flag to enable it. That's a
>> big plus.
>>
>> On the other hand, I could see how your idea (callgraph input file for
>> linker) would be useful in situations where I just want to do section
>> layout but no PGO in the compiler... and of course for testing of the
>> linker's sorting algorithm.
>>
>> So there's merits in both, but for my use cases Michael's original
>> approach is the most practical.
>
> Yes, I must stress that I am not proposing having the option of reading
> the callgraph from another file *instead* of reading it from .o
> files. Just that doing it first decouples most of the lld patch from the
> llvm changes and can be useful in cases where only samples are available.
>
> Cheers,
> Rafael

-------------- next part --------------
diff --git a/ELF/Config.h b/ELF/Config.h
index 45c9565..6928583 100644
--- a/ELF/Config.h
+++ b/ELF/Config.h
@@ -10,6 +10,7 @@
 #ifndef LLD_ELF_CONFIG_H
 #define LLD_ELF_CONFIG_H
 
+#include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/StringRef.h"
 #include "llvm/ADT/StringSet.h"
@@ -108,6 +109,8 @@ struct Configuration {
   std::vector<SymbolVersion> VersionScriptLocals;
   std::vector<uint8_t> BuildIdVector;
   llvm::MapVector<Symbol *, RenamedSymbol> RenamedSymbols;
+  llvm::DenseMap<std::pair<llvm::StringRef, llvm::StringRef>, uint64_t>
+      CFGProfile;
   bool AllowMultipleDefinition;
   bool AsNeeded = false;
   bool Bsymbolic;
@@ -127,6 +130,7 @@ struct Configuration {
   bool GnuHash;
   bool ICF;
   bool MipsN32Abi = false;
+  bool NoCFGProfileReorder;
   bool NoGnuUnique;
   bool NoUndefinedVersion;
   bool NoinhibitExec;
diff --git a/ELF/Driver.cpp b/ELF/Driver.cpp
index 263ba7b..84d4d80 100644
--- a/ELF/Driver.cpp
+++ b/ELF/Driver.cpp
@@ -644,6 +644,7 @@ void LinkerDriver::readConfigs(opt::InputArgList &Args) {
   Config->LTOO = getInteger(Args, OPT_lto_O, 2);
   Config->LTOPartitions = getInteger(Args, OPT_lto_partitions, 1);
   Config->MapFile = Args.getLastArgValue(OPT_Map);
+  Config->NoCFGProfileReorder = Args.hasArg(OPT_no_cfg_profile_reorder);
   Config->NoGnuUnique = Args.hasArg(OPT_no_gnu_unique);
   Config->NoUndefinedVersion = Args.hasArg(OPT_no_undefined_version);
   Config->NoinhibitExec = Args.hasArg(OPT_noinhibit_exec);
diff --git a/ELF/InputFiles.cpp b/ELF/InputFiles.cpp
index a6cd1a6..848223b 100644
--- a/ELF/InputFiles.cpp
+++ b/ELF/InputFiles.cpp
@@ -511,6 +511,35 @@ InputSectionBase *ObjFile<ELFT>::createInputSection(const Elf_Shdr &Sec) {
   if (Name == ".eh_frame" && !Config->Relocatable)
     return make<EhInputSection>(this, &Sec, Name);
 
+  // Profile data.
+  if (Name == ".note.llvm.callgraph") {
+    ArrayRef<uint8_t> CallgraphBuff =
+        check(this->getObj().getSectionContents(&Sec));
+
+    StringRef Buff((const char *)CallgraphBuff.data(), CallgraphBuff.size());
+
+    auto ReadString = [&Buff]() {
+      size_t F = Buff.find_first_of(" \n");
+      StringRef Ret = Buff.substr(0, F);
+      Buff = Buff.substr(F + 1);
+      return Ret;
+    };
+
+    while (!Buff.empty()) {
+      StringRef From = ReadString();
+      StringRef To = ReadString();
+      uint64_t Count;
+      if (ReadString().getAsInteger(10, Count))
+        break;
+
+      // Merge duplicate counts by picking the largest.
+      uint64_t &C = Config->CFGProfile[std::make_pair(From, To)];
+      C = std::max(C, Count);
+    }
+
+    return &InputSection::Discarded;
+  }
+
   if (shouldMerge(Sec))
     return make<MergeInputSection>(this, &Sec, Name);
   return make<InputSection>(this, &Sec, Name);
diff --git a/ELF/Options.td b/ELF/Options.td
index 0de0d73..000e111 100644
--- a/ELF/Options.td
+++ b/ELF/Options.td
@@ -162,6 +162,9 @@ def nostdlib: F<"nostdlib">,
 def no_as_needed: F<"no-as-needed">,
   HelpText<"Always DT_NEEDED for shared libraries">;
 
+def no_cfg_profile_reorder: F<"no-cfg-profile-reorder">,
+  HelpText<"Disable reordering of sections based on profile information">;
+
 def no_color_diagnostics: F<"no-color-diagnostics">,
   HelpText<"Do not use colors in diagnostics">;
 
diff --git a/ELF/Writer.cpp b/ELF/Writer.cpp
index a9e3856..3a6e174 100644
--- a/ELF/Writer.cpp
+++ b/ELF/Writer.cpp
@@ -20,11 +20,13 @@
 #include "SyntheticSections.h"
 #include "Target.h"
 #include "Threads.h"
+#include "llvm/ADT/Hashing.h"
 #include "llvm/ADT/StringMap.h"
 #include "llvm/ADT/StringSwitch.h"
 #include "llvm/Support/FileOutputBuffer.h"
 #include "llvm/Support/raw_ostream.h"
 #include <climits>
+#include <unordered_set>
 
 using namespace llvm;
 using namespace llvm::ELF;
@@ -896,6 +898,157 @@ template <class ELFT> static void sortBySymbolsOrder() {
       Sec->sort([&](InputSectionBase *S) { return SectionOrder.lookup(S); });
 }
 
+// Sort sections by the profile data provided in the .note.llvm.callgraph
+// sections.
+//
+// This algorithm is based on Call-Chain Clustering from:
+// Optimizing Function Placement for Large-Scale Data-Center Applications
+// https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
+//
+// This first builds a call graph based on the profile data then iteratively
+// merges the hottest call edges as long as it would not create a cluster larger
+// than the page size. All clusters are then sorted by a density metric to
+// further improve locality.
+template <class ELFT>
+static void sortByCFGProfile(ArrayRef<OutputSection *> OutputSections) {
+  if (Config->NoCFGProfileReorder)
+    return;
+
+  using NodeIndex = std::ptrdiff_t;
+
+  struct Node {
+    Node() {}
+    Node(const InputSectionBase *IS) {
+      Sections.push_back(IS);
+      Size = IS->getSize();
+    }
+    std::vector<const InputSectionBase *> Sections;
+    int64_t Size = 0;
+    uint64_t Weight = 0;
+  };
+
+  struct Edge {
+    NodeIndex From;
+    NodeIndex To;
+    mutable uint64_t Weight;
+    bool operator==(const Edge Other) const {
+      return From == Other.From && To == Other.To;
+    }
+  };
+
+  struct EdgeHash {
+    std::size_t operator()(const Edge E) const {
+      return llvm::hash_combine(E.From, E.To);
+    };
+  };
+
+  std::vector<Node> Nodes;
+  std::unordered_set<Edge, EdgeHash> Edges;
+
+  auto InsertOrIncrementEdge = [](std::unordered_set<Edge, EdgeHash> &Edges,
+                                  const Edge E) {
+    if (E.From == E.To)
+      return;
+    auto Res = Edges.insert(E);
+    if (!Res.second)
+      Res.first->Weight = SaturatingAdd(Res.first->Weight, E.Weight);
+  };
+
+  {
+    llvm::DenseMap<const InputSectionBase *, NodeIndex> SecToNode;
+
+    auto GetOrCreateNode =
+        [&Nodes, &SecToNode](const InputSectionBase *IS) -> NodeIndex {
+      auto Res = SecToNode.insert(std::make_pair(IS, Nodes.size()));
+      if (Res.second)
+        Nodes.emplace_back(IS);
+      return Res.first->second;
+    };
+
+    // Create the graph.
+    for (const auto &C : Config->CFGProfile) {
+      if (C.second == 0)
+        continue;
+      DefinedRegular *FromDR =
+          dyn_cast_or_null<DefinedRegular>(Symtab->find(C.first.first));
+      DefinedRegular *ToDR =
+          dyn_cast_or_null<DefinedRegular>(Symtab->find(C.first.second));
+      if (!FromDR || !ToDR)
+        continue;
+      auto FromSB = dyn_cast_or_null<const InputSectionBase>(FromDR->Section);
+      auto ToSB = dyn_cast_or_null<const InputSectionBase>(ToDR->Section);
+      if (!FromSB || !ToSB)
+        continue;
+      NodeIndex From = GetOrCreateNode(FromSB);
+      NodeIndex To = GetOrCreateNode(ToSB);
+      InsertOrIncrementEdge(Edges, {From, To, C.second});
+      Nodes[To].Weight = SaturatingAdd(Nodes[To].Weight, C.second);
+    }
+  }
+
+  // Collapse the graph.
+  while (!Edges.empty()) {
+    // Find the largest edge
+    // FIXME: non deterministic order for equal edges.
+    // FIXME: n^2
+    auto Max = std::max_element(
+        Edges.begin(), Edges.end(),
+        [](const Edge A, const Edge B) { return A.Weight < B.Weight; });
+    const Edge MaxE = *Max;
+    Edges.erase(Max);
+    // Merge the Nodes.
+    Node &From = Nodes[MaxE.From];
+    Node &To = Nodes[MaxE.To];
+    if (From.Size + To.Size > Target->PageSize)
+      continue;
+    From.Sections.insert(From.Sections.end(), To.Sections.begin(),
+                         To.Sections.end());
+    To.Sections.clear();
+    From.Size += To.Size;
+    From.Weight = SaturatingAdd(From.Weight, To.Weight);
+    // Collect all edges from or to the removed node and update them for the new
+    // node.
+    std::vector<Edge> OldEdges;
+    // FIXME: n^2
+    for (auto EI = Edges.begin(), EE = Edges.end(); EI != EE;) {
+      if (EI->From == MaxE.To || EI->To == MaxE.To) {
+        OldEdges.push_back(*EI);
+        EI = Edges.erase(EI);
+      } else
+        ++EI;
+    }
+    for (const Edge E : OldEdges) {
+      InsertOrIncrementEdge(Edges,
+                            {E.From == MaxE.To ? MaxE.From : E.From,
+                             E.To == MaxE.To ? MaxE.From : E.To, E.Weight});
+    }
+  }
+
+  // Sort by density.
+  std::sort(Nodes.begin(), Nodes.end(), [](const Node &A, const Node &B) {
+    return double(A.Weight) / double(A.Size) <
+           double(B.Weight) / double(B.Size);
+  });
+
+  // Generate order.
+  llvm::DenseMap<const InputSectionBase *, std::size_t> OrderMap;
+  ssize_t CurOrder = 0;
+
+  for (const Node &N : Nodes) {
+    if (N.Sections.empty())
+      continue;
+    for (const InputSectionBase *IS : N.Sections)
+      OrderMap[IS] = CurOrder++;
+  }
+
+  for (OutputSection *OS : OutputSections) {
+    if (OS->Name != ".text")
+      continue;
+    OS->sort([&](InputSectionBase *IS) { return OrderMap.lookup(IS); });
+    break;
+  }
+}
+
 template <class ELFT>
 void Writer<ELFT>::forEachRelSec(std::function<void(InputSectionBase &)> Fn) {
   for (InputSectionBase *IS : InputSections) {
@@ -928,6 +1081,7 @@ template <class ELFT> void Writer<ELFT>::createSections() {
                               Old.end());
 
   Script->fabricateDefaultCommands();
+  sortByCFGProfile<ELFT>(OutputSections);
   sortBySymbolsOrder<ELFT>();
   sortInitFini(findSection(".init_array"));
   sortInitFini(findSection(".fini_array"));
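
As a reading aid for sortByCFGProfile in the patch above, the same idea
(merge along the hottest call edge while the merged cluster still fits in a
page, then order clusters by density) can be modelled without any lld types.
The sketch below is a simplified stand-in, not the patch's code: Cluster,
EdgeMap, density and layoutByCallChain are hypothetical names, sections are
plain strings, weights use ordinary addition rather than SaturatingAdd, and
clusters are ordered by decreasing density as the cited paper describes,
which is an assumption about the intended direction.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a cluster of input sections.
struct Cluster {
  std::vector<std::string> Sections; // section names in layout order
  uint64_t Size = 0;                 // total size in bytes
  uint64_t Weight = 0;               // sum of incoming call counts
};

// (from cluster index, to cluster index) -> call count.
using EdgeMap = std::map<std::pair<std::size_t, std::size_t>, uint64_t>;

static double density(const Cluster &C) {
  return C.Size == 0 ? 0.0 : double(C.Weight) / double(C.Size);
}

std::vector<std::string> layoutByCallChain(std::vector<Cluster> Clusters,
                                           EdgeMap Edges, uint64_t PageSize) {
  while (!Edges.empty()) {
    // Pick the hottest remaining edge. std::map iterates in key order, so
    // ties are broken deterministically (first key wins).
    auto Max = std::max_element(
        Edges.begin(), Edges.end(),
        [](const EdgeMap::value_type &A, const EdgeMap::value_type &B) {
          return A.second < B.second;
        });
    const std::size_t From = Max->first.first;
    const std::size_t To = Max->first.second;
    Edges.erase(Max);
    // Skip self edges and merges that would exceed one page.
    if (From == To || Clusters[From].Size + Clusters[To].Size > PageSize)
      continue;

    // Append the callee cluster to the caller cluster.
    Cluster &F = Clusters[From];
    Cluster &T = Clusters[To];
    F.Sections.insert(F.Sections.end(), T.Sections.begin(), T.Sections.end());
    F.Size += T.Size;
    F.Weight += T.Weight;
    T = Cluster();

    // Redirect every edge that touched the removed cluster.
    EdgeMap Updated;
    for (const auto &E : Edges) {
      const std::size_t A = E.first.first == To ? From : E.first.first;
      const std::size_t B = E.first.second == To ? From : E.first.second;
      if (A != B)
        Updated[{A, B}] += E.second;
    }
    Edges = std::move(Updated);
  }

  // Drop emptied clusters, then place the densest clusters first.
  Clusters.erase(
      std::remove_if(Clusters.begin(), Clusters.end(),
                     [](const Cluster &C) { return C.Sections.empty(); }),
      Clusters.end());
  std::stable_sort(Clusters.begin(), Clusters.end(),
                   [](const Cluster &A, const Cluster &B) {
                     return density(A) > density(B);
                   });

  std::vector<std::string> Order;
  for (const Cluster &C : Clusters)
    Order.insert(Order.end(), C.Sections.begin(), C.Sections.end());
  return Order;
}
```

Like the patch, the sketch rescans all edges on every iteration, so it is
quadratic in the number of edges and only meant to make the control flow
easy to follow.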

Tobias Edler von Koch via llvm-dev

2017-Jul-31 22:15 UTC

[llvm-dev] [RFC] Profile guided section layout

On 07/31/2017 05:11 PM, Rafael Avila de Espindola wrote:
>> The advantage of the proposed approach (weighted callgraph section) is
>> that it's completely transparent: it works regardless of the particular
>> profiling methodology (as long as there's !perf metadata when the pass
>> runs). For this reason, it fits neatly into an *existing* PGO-based
>> build flow. I only need to add 1 compiler flag to enable it. That's a
>> big plus.
>>
>> On the other hand, I could see how your idea (callgraph input file for
>> linker) would be useful in situations where I just want to do section
>> layout but no PGO in the compiler... and of course for testing of the
>> linker's sorting algorithm.
>>
>> So there's merits in both, but for my use cases Michael's original
>> approach is the most practical.
> Yes, I must stress that I am not proposing having the option of reading
> the callgraph from another file *instead* of reading it from .o
> files. Just that doing it first decouples most of the lld patch from the
> llvm changes and can be useful in cases where only samples are available.

Absolutely!

Michael/Rafael, would you mind uploading the patches to Phabricator so 
people can review them there? I think there is enough interest in this 
feature to move towards merging it in tree.

Thanks,
Tobias

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.
