David Blaikie via llvm-dev
2021-Mar-31 03:15 UTC
[llvm-dev] [RFC] Abstracting over SSA form IRs to implement generic analyses
(FWIW, I'd still really love Mehdi, Owen, or anyone else's opinion on the original proposal or any parts of this general problem - folks working on some profile features have been continuing the template direction to abstract over MIR+LLVM IR, and if there are benefits to be had here, it'd be great to get them...)

On Tue, Mar 30, 2021 at 9:20 AM Sameer Sahasrabuddhe <sameer at sbuddhe.net> wrote:
>
> ---- On Thu, 11 Feb 2021 06:07:17 +0530 David Blaikie via llvm-dev <llvm-dev at lists.llvm.org> wrote ----
>
> > Owen - perhaps you've got some relevant perspective here given the GPU
> > context/background, and general middle end optimization design
> > questions. Could you take a look?
> >
> > (& perhaps Johannes?)
> >
> > On Mon, Feb 1, 2021 at 11:36 AM David Blaikie <dblaikie at gmail.com> wrote:
> > >
> > > Thanks for sending this out!
> > >
> > > +Mehdi - if you have a chance to look, I'd appreciate your thoughts on this, partly as a general LLVM contributor, but also with an eye to abstractions over multiple IRs given your work on MLIR
> > > +Lang - you mentioned maybe some colleagues of yours might be able to take a look at the design here?
> > >
> > > I'd like to try writing up alternatives to a couple of the patches in the series to see how they compare, but haven't done that as yet.
> > >
> > > My brief take is that I think the opaque handles make a fair bit of sense/seem high-value/low-cost (if it turned out the handle for one of these IR types needed to become a bit heavier, I don't think it'd be super difficult to revisit this code - make the generic handle large enough to accommodate whatever the largest concrete handle happens to be, etc). Though I'm still a bit less certain about the full runtime polymorphic abstractions.
>
> For context, this is about the options laid out in the following document from Nicolai:
>
> https://docs.google.com/document/d/1sbeGw5uNGFV0ZPVk6h8Q5_dRhk4qFnKHa-uZ-O3c4UY/edit?usp=sharing
>
> The above document was first introduced in the following email, which is the starting point of the current thread:
>
> https://lists.llvm.org/pipermail/llvm-dev/2020-December/147433.html
>
> The main issue is the need for an abstraction that greatly improves the experience of writing analyses that work on LLVM IR and MIR (and potentially also on MLIR). The way I understand it, Nicolai proposed an abstraction that involves dynamic polymorphism to abstract out the details of the underlying IR. David was more in favour of first unifying the IRs to the point that the dynamic polymorphism is limited to only occasional corner cases instead.

To clarify, I think what I was leaning towards was having the APIs match syntactically, so it'd be easier to write templates over them without the need for type traits and adapters.
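For instance (purely illustrative - this accessor spelling is the premise under discussion, not the current API of either IR), if both block types spelled successor traversal the same way, generic code could be a plain template with no traits:

  // Hypothetical: assumes both IRs expose a member "successors()"
  // with this exact spelling, which is not true today.
  template <typename BlockT> unsigned countSuccessors(const BlockT *B) {
    unsigned Count = 0;
    for (const auto *S : B->successors()) {
      (void)S; // only the number of edges matters here
      ++Count;
    }
    return Count;
  }

The same template would then instantiate against BasicBlock and MachineBasicBlock with no adapter layer in between.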
> I am currently involved in the activity towards decoupling AMDGPU's dependency on the abstractions being discussed here, and the one thing that immediately jumps out is the ability to traverse the predecessors and successors of a basic block. From all the emails so far and the resulting proposal, it seems that both David and Nicolai think that having a common non-abstract base class for basic blocks is a good idea.

I don't /think/ that's what I was advocating for (but it's been a while, and I might be misremembering - if you've got pointers to particular parts of the conversation to refresh my memory/frame this discussion, that'd be handy) - I'd worry about the cost of trying to introduce a common base class between these different APIs. I think what I was advocating for was for them to have syntactically matching APIs so that templated code could abstract over them without the need for traits/adapters.

> Such a base class will provide a vector of predecessors and successors that can be used to traverse the CFG independent of the actual IR.
>
> The current situation is that the basic blocks in LLVM IR and MLIR do not explicitly track their preds and succs. Preds are determined from the uses of the block, while succs are determined from the terminator instruction. MIR has explicit pred and succ vectors, but the succ pointers are duplicated by their presence as operands to the terminator instructions. The MachineBasicBlock is not a value, so there is no notion of traversing its uses for predecessors.
>
> So if we create a common non-abstract base class for basic blocks, then MLIR and LLVM IR will incur the overhead of two vectors, one each for the preds and succs. Nicolai had reported a 3.1% to 4.1% increase in the size of LLVM IR in some typical programs:
>
> https://docs.google.com/spreadsheets/d/1cwRy2K4XjWCjwfK53MuCqws1TsQ0S6FtRjbwoUdGBfo/edit?usp=sharing

That's overhead in the bitcode file size, yes?

> Is this overhead the main issue in deciding whether we should introduce the common base class for traversing a CFG?

I expect that would be a pretty significant concern, but also memory usage and any time overhead of keeping these new data structures up to date, or slowing down compilation due to memory paging in and out, etc.

> Do note that MIR already duplicates successors. One could argue that duplicating the successor pointers in LLVM IR and MLIR merely brings them to the same level as MIR. If that argument sticks, then the "real" overhead is only half of what is reported, to account for the duplicate predecessor pointers.
>
> The duplication of predecessor pointers stems from the fact that basic blocks in LLVM IR and MLIR are also values that are (primarily) used as operands to the terminator and PHI instructions. We could redefine these instructions to use non-value basic blocks as operands. CFG edges are not the typical use-def relation that other values represent, and it's reasonable to say that PHI nodes and terminators are special in this one way. I have not managed to see how far that rabbit hole goes, but I am not really convinced that this is attractive in any way. Is this even an option?
>
> Sameer.
Sameer Sahasrabuddhe via llvm-dev
2021-Mar-31 16:01 UTC
[llvm-dev] [RFC] Abstracting over SSA form IRs to implement generic analyses
David Blaikie writes:

>> I am currently involved in the activity towards decoupling AMDGPU's
>> dependency on the abstractions being discussed here, and the one
>> thing that immediately jumps out is the ability to traverse the
>> predecessors and successors of a basic block. From all the emails so
>> far and the resulting proposal, it seems that both David and Nicolai
>> think that having a common non-abstract base class for basic blocks
>> is a good idea.
>
> I don't /think/ that's what I was advocating for (but it's been a
> while, and I might be misremembering - if you've got pointers to
> particular parts of the conversation to refresh my memory/frame this
> discussion, that'd be handy) - I'd worry about the cost of trying to
> introduce a common base class between these different APIs. I think
> what I was advocating for was for them to have syntactically matching
> APIs so that templated code could abstract over them without the need
> for traits/adapters.

You're right. It turns out that you didn't specifically advocate what I said above. The following is what was actually said, but I misunderstood something else from the thread:

https://lists.llvm.org/pipermail/llvm-dev/2020-October/146090.html

"I meant for a different direction - if we got buy-in for making LLVM IR and MIR implement the same C++ concept (& wrote down what that concept was - which APIs, how they look textually) then I think it would be relatively easy to get the small/mechanical renaming patches reviewed. Incremental from day 1, not big new abstraction, then incremental."

>> So if we create a common non-abstract base class for basic blocks,
>> then MLIR and LLVM IR will incur the overhead of two vectors, one
>> each for the preds and succs. Nicolai had reported a 3.1% to 4.1%
>> increase in the size of LLVM IR in some typical programs:
>>
>> https://docs.google.com/spreadsheets/d/1cwRy2K4XjWCjwfK53MuCqws1TsQ0S6FtRjbwoUdGBfo/edit?usp=sharing
>
> That's overhead in the bitcode file size, yes?

It's the memory overhead of storing those pointers in a vector. I don't think it needs to affect the bitcode, since it's just redundant information that can be populated after reading the whole function.

>> Is this overhead the main issue in deciding whether we should
>> introduce the common base class for traversing a CFG?
>
> I expect that would be a pretty significant concern, but also memory
> usage and any time overhead of keeping these new data structures up
> to date, or slowing down compilation due to memory paging in and out,
> etc.

Right. If the discussion progresses to those details, then we will be in a good place :)

Everyone seems to agree that we need an abstraction that makes it easier to write analyses, and Nicolai's document captures all the options. But in particular, the document makes a case against templates, and instead advocates a way to write analyses without referring to the actual types in the underlying IR. This is achieved with a combination of opaque handles and dynamic polymorphism. The document also strongly recommends minimizing the dynamic polymorphism by designing "common non-abstract base classes for SSA basic block and instructions concepts".
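To make the shape of that recommendation concrete, here is a minimal sketch of such a base class (the name BasicBlockBase and its members are illustrative only; no such class exists in-tree today):

  #include "llvm/ADT/ArrayRef.h"
  #include <vector>

  // Sketch only: a common non-abstract base class that each IR's
  // block type (BasicBlock, MachineBasicBlock, mlir::Block) would
  // derive from, and would keep up to date as the CFG is mutated.
  class BasicBlockBase {
    std::vector<BasicBlockBase *> Preds;
    std::vector<BasicBlockBase *> Succs;

  public:
    // Non-virtual accessors: generic CFG traversal compiles to
    // direct loads, with no indirect calls and no IR-specific types.
    llvm::ArrayRef<BasicBlockBase *> predecessors() const { return Preds; }
    llvm::ArrayRef<BasicBlockBase *> successors() const { return Succs; }
  };

The two vectors are exactly where the memory overhead discussed above comes from: they duplicate edge information that each IR already stores in some other form.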
My focus right now is on the base class for basic blocks. It represents a frequently used property of the CFG, the problem is concise, and it is already familiar to anyone writing an analysis. Our current experience with writing new analyses in AMDGPU provides a motivation for moving away from templates. But this is only a specific part of the bigger problem. In particular, it only represents the "CFG" part of the problem, but not the "SSA IR" part.

The actual use-case can be seen in the branch below, which implements "CycleInfo" as a template over LLVM IR and MIR. It's a work in progress.

https://github.com/ssahasra/llvm-project/tree/cycleinfo-as-template

For now, we can consider a simpler analysis "CfgInfo" that reports various properties such as outgoing edges per node.

Example A: CfgInfo over LLVM IR

  CfgInfo::compute(BasicBlock *entry) {
    worklist.enqueue(entry);
    while (!worklist.empty()) {
      auto B = worklist.dequeue();
      for (auto S : successors(B)) {
        worklist.enqueue_if_new(S);
        collectStats(B, S);
      }
    }
  }

Example B: CfgInfo over Machine IR

  CfgInfo::compute(MachineBasicBlock *entry) {
    worklist.enqueue(entry);
    while (!worklist.empty()) {
      auto B = worklist.dequeue();
      for (auto S : B->successors()) { // <-- diff
        worklist.enqueue_if_new(S);
        collectStats(B, S);
      }
    }
  }

Example C: Template over both IRs

  template <typename BlockType>
  CfgInfo::compute(BlockType *entry) {
    worklist.enqueue(entry);
    while (!worklist.empty()) {
      auto B = worklist.dequeue();
      for (auto S : successors(B)) { // <-- needs an overload
        worklist.enqueue_if_new(S);
        collectStats(B, S);
      }
    }
  }

In this example, any instance of BlockType needs to provide a global function "successors()". The following is sufficient with Machine IR in this particular case, but in general, this would be provided by traits:

  namespace llvm {
    auto successors(MachineBasicBlock *B) { return B->successors(); }
  }
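As an aside, the "traits" route mentioned above already exists in LLVM as GraphTraits (llvm/ADT/GraphTraits.h), which is specialized for both block types. A sketch of Example C written against it, keeping the hypothetical CfgInfo pieces from the examples:

  #include "llvm/ADT/GraphTraits.h"
  #include "llvm/CodeGen/MachineBasicBlock.h" // GraphTraits<MachineBasicBlock *>
  #include "llvm/IR/CFG.h"                    // GraphTraits<BasicBlock *>

  template <typename BlockType>
  void CfgInfo::compute(BlockType *entry) {
    worklist.enqueue(entry);
    while (!worklist.empty()) {
      auto B = worklist.dequeue();
      // llvm::children<> dispatches through GraphTraits<BlockType *>,
      // so no successors() overload is needed, but BlockType still
      // spreads through the whole analysis as a template parameter.
      for (auto S : llvm::children<BlockType *>(B)) {
        worklist.enqueue_if_new(S);
        collectStats(B, S);
      }
    }
  }

This avoids the ad-hoc overload, but the analysis is still a template, with all the costs described below.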
Example D: CfgInfo over a common base class

  CfgInfo::compute(BasicBlockBase *entry) {
    worklist.enqueue(entry);
    while (!worklist.empty()) {
      auto B = worklist.dequeue();
      for (auto S : B->successors()) { // <-- not virtual
        worklist.enqueue_if_new(S);
        collectStats(B, S);
      }
    }
  }

This last version is not a template. It's an implementation that gets reused for both LLVM IR and Machine IR by simply casting the respective basic block type to the base type.

There are two main problems with the template approach:

1. Just for that one function "successors()", the type of the block has to be passed around as a template parameter. Starting from the entry point of the analysis, this has to go all the way down to every leaf function that wants to call "successors()". This can especially be a problem if the template has other dependent types such as iterator adaptors.

2. The actual analysis can be quite complicated, but because it's a template, it has to be in a header file so that it can be instantiated at the point where the analysis is actually computed.

The CycleInfo patch demonstrates both these problems. There are multiple files, as follows:

GenericCycleInfo.h

  This file defines the templates GenericCycleInfo<BlockType> and GenericCycle<BlockType>. It also contains the entire implementation of the analysis.

CycleInfo.h

  - This file declares a thin wrapper CycleInfo around GenericCycleInfo<BasicBlock>, and another wrapper Cycle around GenericCycle<BasicBlock>. These are the LLVM IR instances of the templates.

  - The wrappers avoid including GenericCycleInfo.h by only needing the pointer type to the wrapped template instances.

  - In particular, notice how the child_iterator explicitly exposes the details of the wrapped type, but then returns the wrapper with the `*' operator:

      using const_child_iterator_base = typename std::vector<
          std::unique_ptr<GenericCycle<BasicBlock>>>::const_iterator;

      struct const_child_iterator
          : iterator_adaptor_base<const_child_iterator,
                                  const_child_iterator_base> {
        using Base = iterator_adaptor_base<const_child_iterator,
                                           const_child_iterator_base>;

        Cycle operator*() const;
        const_child_iterator(const_child_iterator_base I) : Base(I) {}
      };

CycleInfo.cpp

  This file is where the template is actually instantiated, along with the implementation of the wrapper classes.

A similar pair of files will be needed for MachineCycleInfo later. These will be almost identical to the CycleInfo files, except for the name of the block type and the overloads for "successors()" and "predecessors()".

All these files are just boilerplate designed to hide a template and separately expose two instances of that template. Every type introduced by the template needs to be carefully wrapped for use in other files. Instead, if there were an interface (and not just a concept) that exposed CFG edges, there would be just one CycleInfo.h file to declare the analysis classes and one CycleInfo.cpp file to implement them.

So the argument goes somewhat like this:

1. Static polymorphism through templates creates a development overhead in the form of boilerplate, and also some runtime overhead of indirecting through wrapper types.

2. Dynamic polymorphism through an abstract base class or an abstract context class removes the boilerplate, but it introduces the runtime overhead of indirect calls. It also retains the runtime overhead of indirecting through opaque wrappers.

3. A non-abstract base class eliminates all the above overheads (for this specific problem of CFG traversal), but introduces a memory overhead, because some information from the derived classes has to be duplicated in the base class.

The CycleInfo analysis exemplifies our motivation for moving away from templates and traits. Assuming virtual calls are considered strongly undesirable, that leaves us with the option of a non-abstract base class.

Sameer.