David Blaikie via llvm-dev
2021-Mar-31 03:15 UTC
[llvm-dev] [RFC] Abstracting over SSA form IRs to implement generic analyses
(FWIW, I'd still really love Mehdi, Owen, or anyone else's opinion on the original proposal or any parts of this general problem - folks working on some profile features have been continuing the template direction to abstract over MIR+LLVM IR, and if there are benefits to be had here, it'd be great to get them...)

On Tue, Mar 30, 2021 at 9:20 AM Sameer Sahasrabuddhe <sameer at sbuddhe.net> wrote:
>
> ---- On Thu, 11 Feb 2021 06:07:17 +0530 David Blaikie via llvm-dev <llvm-dev at lists.llvm.org> wrote ----
>
> > Owen - perhaps you've got some relevant perspective here given the GPU
> > context/background, and general middle end optimization design
> > questions. Could you take a look?
> >
> > (& perhaps Johannes?)
> >
> > On Mon, Feb 1, 2021 at 11:36 AM David Blaikie <dblaikie at gmail.com> wrote:
> > >
> > > Thanks for sending this out!
> > >
> > > +Mehdi - if you have a chance to look, I'd appreciate your thoughts on this, partly as a general LLVM contributor, but also with an eye to abstractions over multiple IRs given your work on MLIR
> > > +Lang - you mentioned maybe some colleagues of yours might be able to take a look at the design here?
> > >
> > > I'd like to try writing up alternatives to a couple of the patches in the series to see how they compare, but haven't done that as yet.
> > >
> > > My brief take is that I think the opaque handles make a fair bit of sense/seem high-value/low-cost (if it turned out the handle for one of these IR types needed to become a bit heavier, I don't think it'd be super difficult to revisit this code - make the generic handle large enough to accommodate whatever the largest concrete handle happens to be, etc). Though I'm still a bit less certain about the full runtime polymorphic abstractions.
>
> For context, this is about the options laid out in the following document from Nicolai:
>
> https://docs.google.com/document/d/1sbeGw5uNGFV0ZPVk6h8Q5_dRhk4qFnKHa-uZ-O3c4UY/edit?usp=sharing
>
> The above document was first introduced in the following email, which is the starting point of the current thread:
>
> https://lists.llvm.org/pipermail/llvm-dev/2020-December/147433.html
>
> The main issue is the need for an abstraction that greatly improves the experience of writing analyses that work on LLVM IR and MIR (and potentially also on MLIR). The way I understand it, Nicolai proposed an abstraction that involves dynamic polymorphism to abstract out the details of the underlying IR. David was more in favour of first unifying the IRs to the point that the dynamic polymorphism is limited to only occasional corner cases instead.

To clarify, I think what I was leaning towards was having the APIs match syntactically, so it'd be easier to write templates over them without the need for type traits and adapters.
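For instance (purely illustrative - this accessor spelling is the premise under discussion, not the current API of either IR), if both block types spelled successor traversal the same way, generic code could be a plain template with no traits:

  // Hypothetical: assumes both IRs expose a member "successors()"
  // with this exact spelling, which is not true today.
  template <typename BlockT> unsigned countSuccessors(const BlockT *B) {
    unsigned Count = 0;
    for (const auto *S : B->successors()) {
      (void)S; // only the number of edges matters here
      ++Count;
    }
    return Count;
  }

The same template would then instantiate against BasicBlock and MachineBasicBlock with no adapter layer in between.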
> I am currently involved in the activity towards decoupling AMDGPU's dependency on the abstractions being discussed here, and the one thing that immediately jumps out is the ability to traverse the predecessors and successors of a basic block. From all the emails so far and the resulting proposal, it seems that both David and Nicolai think that having a common non-abstract base class for basic blocks is a good idea.

I don't /think/ that's what I was advocating for (but it's been a while, and I might be misremembering - if you've got pointers to particular parts of the conversation to refresh my memory/frame this discussion, that'd be handy) - I'd worry about the cost of trying to introduce a common base class between these different APIs. I think what I was advocating for was for them to have syntactically matching APIs so that templated code could abstract over them without the need for traits/adapters.

> Such a base class will provide a vector of predecessors and successors that can be used to traverse the CFG independent of the actual IR.
>
> The current situation is that the basic blocks in LLVM IR and MLIR do not explicitly track their preds and succs. Preds are determined from the uses of the block, while succs are determined from the terminator instruction. MIR has explicit pred and succ vectors, but the succ pointers are duplicated by their presence as operands to the terminator instructions. The MachineBasicBlock is not a value, so there is no notion of traversing its uses for predecessors.
>
> So if we create a common non-abstract base class for basic blocks, then MLIR and LLVM IR will incur the overhead of two vectors, one each for the preds and succs. Nicolai had reported a 3.1% to 4.1% increase in the size of LLVM IR in some typical programs:
>
> https://docs.google.com/spreadsheets/d/1cwRy2K4XjWCjwfK53MuCqws1TsQ0S6FtRjbwoUdGBfo/edit?usp=sharing

That's overhead in the bitcode file size, yes?

> Is this overhead the main issue in deciding whether we should introduce the common base class for traversing a CFG?

I expect that would be a pretty significant concern, but also memory usage and any time overhead of keeping these new data structures up to date, or slowing down compilation due to memory paging in and out, etc.

> Do note that MIR already duplicates successors. One could argue that duplicating the successor pointers in LLVM IR and MLIR merely brings them to the same level as MIR. If that argument sticks, then the "real" overhead is only half of what is reported, to account for the duplicate predecessor pointers.
>
> The duplication of predecessor pointers stems from the fact that basic blocks in LLVM IR and MLIR are also values that are (primarily) used as operands to the terminator and PHI instructions. We could redefine these instructions to use non-value basic blocks as operands. CFG edges are not the typical use-def relation that other values represent, and it's reasonable to say that PHI nodes and terminators are special in this one way. I have not managed to see how far that rabbit hole goes, but I am not really convinced that this is attractive in any way. Is this even an option?
>
> Sameer.
Sameer Sahasrabuddhe via llvm-dev
2021-Mar-31 16:01 UTC
[llvm-dev] [RFC] Abstracting over SSA form IRs to implement generic analyses
David Blaikie writes:

>> I am currently involved in the activity towards decoupling AMDGPU's
>> dependency on the abstractions being discussed here, and the one
>> thing that immediately jumps out is the ability to traverse the
>> predecessors and successors of a basic block. From all the emails so
>> far and the resulting proposal, it seems that both David and Nicolai
>> think that having a common non-abstract base class for basic blocks
>> is a good idea.
>
> I don't /think/ that's what I was advocating for (but it's been a
> while, and I might be misremembering - if you've got pointers to
> particular parts of the conversation to refresh my memory/frame this
> discussion, that'd be handy) - I'd worry about the cost of trying to
> introduce a common base class between these different APIs. I think
> what I was advocating for was for them to have syntactically matching
> APIs so that templated code could abstract over them without the need
> for traits/adapters.

You're right. It turns out that you didn't specifically advocate what I said above. The following is what was actually said, but I misunderstood something else from the thread:

https://lists.llvm.org/pipermail/llvm-dev/2020-October/146090.html

"I meant for a different direction - if we got buy-in for making LLVM IR and MIR implement the same C++ concept (& wrote down what that concept was - which APIs, how they look textually) then I think it would be relatively easy to get the small/mechanical renaming patches reviewed. Incremental from day 1, not big new abstraction, then incremental."

>> So if we create a common non-abstract base class for basic blocks,
>> then MLIR and LLVM IR will incur the overhead of two vectors, one
>> each for the preds and succs. Nicolai had reported a 3.1% to 4.1%
>> increase in the size of LLVM IR in some typical programs:
>>
>> https://docs.google.com/spreadsheets/d/1cwRy2K4XjWCjwfK53MuCqws1TsQ0S6FtRjbwoUdGBfo/edit?usp=sharing
>
> That's overhead in the bitcode file size, yes?

It's the memory overhead of storing those pointers in a vector. I don't think it needs to affect the bitcode, since it's just redundant information that can be populated after reading the whole function.

>> Is this overhead the main issue in deciding whether we should
>> introduce the common base class for traversing a CFG?
>
> I expect that would be a pretty significant concern, but also memory
> usage and any time overhead of keeping these new data structures up
> to date, or slowing down compilation due to memory paging in and out,
> etc.

Right. If the discussion progresses to those details, then we will be in a good place :)

Everyone seems to agree that we need an abstraction that makes it easier to write analyses, and Nicolai's document captures all the options. But in particular, the document makes a case against templates, and instead advocates a way to write analyses without referring to the actual types in the underlying IR. This is achieved with a combination of opaque handles and dynamic polymorphism. The document also strongly recommends minimizing the dynamic polymorphism by designing "common non-abstract base classes for SSA basic block and instructions concepts".
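To make the shape of that recommendation concrete, here is a minimal sketch of such a base class (the name BasicBlockBase and its members are illustrative only; no such class exists in-tree today):

  #include "llvm/ADT/ArrayRef.h"
  #include <vector>

  // Sketch only: a common non-abstract base class that each IR's
  // block type (BasicBlock, MachineBasicBlock, mlir::Block) would
  // derive from, and would keep up to date as the CFG is mutated.
  class BasicBlockBase {
    std::vector<BasicBlockBase *> Preds;
    std::vector<BasicBlockBase *> Succs;

  public:
    // Non-virtual accessors: generic CFG traversal compiles to
    // direct loads, with no indirect calls and no IR-specific types.
    llvm::ArrayRef<BasicBlockBase *> predecessors() const { return Preds; }
    llvm::ArrayRef<BasicBlockBase *> successors() const { return Succs; }
  };

The two vectors are exactly where the memory overhead discussed above comes from: they duplicate edge information that each IR already stores in some other form.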
My focus right now is on the base class for basic blocks. It represents a frequently used property of the CFG, the problem is concise, and it is already familiar to anyone writing an analysis. Our current experience with writing new analyses in AMDGPU provides a motivation for moving away from templates. But this is only a specific part of the bigger problem. In particular, it only represents the "CFG" part of the problem, but not the "SSA IR" part.

The actual use-case can be seen in the branch below, which implements "CycleInfo" as a template over LLVM IR and MIR. It's a work in progress.

https://github.com/ssahasra/llvm-project/tree/cycleinfo-as-template

For now, we can consider a simpler analysis "CfgInfo" that reports various properties such as outgoing edges per node.

Example A: CfgInfo over LLVM IR

  CfgInfo::compute(BasicBlock *entry) {
    worklist.enqueue(entry);
    while (!worklist.empty()) {
      auto B = worklist.dequeue();
      for (auto S : successors(B)) {
        worklist.enqueue_if_new(S);
        collectStats(B, S);
      }
    }
  }

Example B: CfgInfo over Machine IR

  CfgInfo::compute(MachineBasicBlock *entry) {
    worklist.enqueue(entry);
    while (!worklist.empty()) {
      auto B = worklist.dequeue();
      for (auto S : B->successors()) { // <-- diff
        worklist.enqueue_if_new(S);
        collectStats(B, S);
      }
    }
  }

Example C: Template over both IRs

  template <typename BlockType>
  CfgInfo::compute(BlockType *entry) {
    worklist.enqueue(entry);
    while (!worklist.empty()) {
      auto B = worklist.dequeue();
      for (auto S : successors(B)) { // <-- needs an overload
        worklist.enqueue_if_new(S);
        collectStats(B, S);
      }
    }
  }

In this example, any instance of BlockType needs to provide a global function "successors()". The following is sufficient with Machine IR in this particular case, but in general, this would be provided by traits:

  namespace llvm {
    auto successors(MachineBasicBlock *B) { return B->successors(); }
  }
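As an aside, the "traits" route mentioned above already exists in LLVM as GraphTraits (llvm/ADT/GraphTraits.h), which is specialized for both block types. A sketch of Example C written against it, keeping the hypothetical CfgInfo pieces from the examples:

  #include "llvm/ADT/GraphTraits.h"
  #include "llvm/CodeGen/MachineBasicBlock.h" // GraphTraits<MachineBasicBlock *>
  #include "llvm/IR/CFG.h"                    // GraphTraits<BasicBlock *>

  template <typename BlockType>
  void CfgInfo::compute(BlockType *entry) {
    worklist.enqueue(entry);
    while (!worklist.empty()) {
      auto B = worklist.dequeue();
      // llvm::children<> dispatches through GraphTraits<BlockType *>,
      // so no successors() overload is needed, but BlockType still
      // spreads through the whole analysis as a template parameter.
      for (auto S : llvm::children<BlockType *>(B)) {
        worklist.enqueue_if_new(S);
        collectStats(B, S);
      }
    }
  }

This avoids the ad-hoc overload, but the analysis is still a template, with all the costs described below.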
Example D: CfgInfo over a common base class

  CfgInfo::compute(BasicBlockBase *entry) {
    worklist.enqueue(entry);
    while (!worklist.empty()) {
      auto B = worklist.dequeue();
      for (auto S : B->successors()) { // <-- not virtual
        worklist.enqueue_if_new(S);
        collectStats(B, S);
      }
    }
  }

This last version is not a template. It's an implementation that gets reused for both LLVM IR and Machine IR by simply casting the respective basic block type to the base type.

There are two main problems with the template approach:

1. Just for that one function "successors()", the type of the block has to be passed around as a template parameter. Starting from the entry point of the analysis, this has to go all the way down to every leaf function that wants to call "successors()". This can especially be a problem if the template has other dependent types such as iterator adaptors.

2. The actual analysis can be quite complicated, but because it's a template, it has to be in a header file so that it can be instantiated at the point where the analysis is actually computed.

The CycleInfo patch demonstrates both these problems. There are multiple files, as follows:

GenericCycleInfo.h

  This file defines the templates GenericCycleInfo<BlockType> and GenericCycle<BlockType>. It also contains the entire implementation of the analysis.

CycleInfo.h

  - This file declares a thin wrapper CycleInfo around GenericCycleInfo<BasicBlock>, and another wrapper Cycle around GenericCycle<BasicBlock>. These are the LLVM IR instances of the templates.

  - The wrappers avoid including GenericCycleInfo.h by only needing the pointer type to the wrapped template instances.

  - In particular, notice how the child_iterator explicitly exposes the details of the wrapped type, but then returns the wrapper with the `*' operator:

      using const_child_iterator_base = typename std::vector<
          std::unique_ptr<GenericCycle<BasicBlock>>>::const_iterator;

      struct const_child_iterator
          : iterator_adaptor_base<const_child_iterator,
                                  const_child_iterator_base> {
        using Base = iterator_adaptor_base<const_child_iterator,
                                           const_child_iterator_base>;

        Cycle operator*() const;
        const_child_iterator(const_child_iterator_base I) : Base(I) {}
      };

CycleInfo.cpp

  This file is where the template is actually instantiated, along with the implementation of the wrapper classes.

A similar pair of files will be needed for MachineCycleInfo later. These will be almost identical to the CycleInfo files, except for the name of the block type and the overloads for "successors()" and "predecessors()".

All these files are just boilerplate designed to hide a template and separately expose two instances of that template. Every type introduced by the template needs to be carefully wrapped for use in other files. Instead, if there were an interface (and not just a concept) that exposed CFG edges, there would be just one CycleInfo.h file to declare the analysis classes and one CycleInfo.cpp file to implement them.

So the argument goes somewhat like this:

1. Static polymorphism through templates creates a development overhead in the form of boilerplate, and also some runtime overhead of indirecting through wrapper types.

2. Dynamic polymorphism through an abstract base class or an abstract context class removes the boilerplate, but it introduces the runtime overhead of indirect calls. It also retains the runtime overhead of indirecting through opaque wrappers.

3. A non-abstract base class eliminates all the above overheads (for this specific problem of CFG traversal), but introduces a memory overhead, because some information from the derived classes has to be duplicated in the base class.

The CycleInfo analysis exemplifies our motivation for moving away from templates and traits. Assuming virtual calls are considered strongly undesirable, that leaves us with the option of a non-abstract base class.

Sameer.