>> On Jul 27, 2020, at 10:11 AM, David Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Son Tuan VU via llvm-dev <llvm-dev at lists.llvm.org> writes:
>>
>>> Currently metadata (other than debug info) can be attached to IR
>>> instructions but disappears during DAG selection.
>>>
>>> My question is why we do not keep the metadata during code lowering and
>>> then attach to MachineInstr, just as for IR instructions? Is there any
>>> technical challenge, or is it only because nobody wants to do so?
>> I have wanted codegen metadata for a very long time so I'm interested to
>> hear the history behind this choice, and more importantly, whether
>> adding such capability would be generally acceptable to the community.
> The first questions need to be “what does it mean?”, “how does it work?”, and “what is it useful for?”. It is hard to evaluate a proposal without that.

Hi everyone,

I'll try to answer each of these questions. The answers are unlikely to be
exhaustive, but I hope they can serve as a starting point for an interesting
proposal (from my point of view, and from that of Son Tuan VU and David
Greene):

- "What does it mean?": it means preserving specific information,
  represented as metadata attached to instructions, from the IR level down
  to the codegen phases.

- "How does it work?": metadata should be preserved across the back-end
  transformations. For instance, during the lowering phase, DAGCombine
  performs several optimizations on the IR, potentially combining several
  instructions. The new instruction should then be assigned metadata
  obtained as a proper combination of the original ones (e.g., a union of
  the metadata information).

  It might be possible to have a dedicated data structure for such metadata,
  with an instance of that structure assigned to each instruction.

- "What is it useful for?": I think it is quite context-specific but, in
  general, it is useful when some "higher-level" information (e.g.,
  information that can be discovered only before the back-end stage of the
  compiler) is required in the back-end to perform "semantic"-related
  optimizations.

To give a (quite generic) example where such codegen metadata may be useful:
in the field of "secure compilation", preserving security properties across
the compilation phases is essential. Such properties are specified in the
high-level specification of the program and may be expressed with IR
metadata. Keeping that IR metadata available in the codegen phases may make
it possible to preserve properties that codegen would otherwise invalidate.

Cheers,
-- Lorenzo

> Metadata isn’t free - it must be maintained or invalidated for it to be useful. The details on that dramatically shape whether it can be used for any given purpose.
>
> -Chris
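
The "union of metadata" idea and the per-instruction structure described in
the reply above can be sketched in a few lines. This is a minimal standalone
C++ illustration, not LLVM code: AnnotationSet, ToyInst and combine are
hypothetical names, and plain set union is only one possible merge policy.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical per-instruction annotation record; not an LLVM type.
struct AnnotationSet {
  std::set<std::string> tags;

  // Merge policy assumed for this sketch: plain set union of both inputs.
  static AnnotationSet merge(const AnnotationSet &a, const AnnotationSet &b) {
    AnnotationSet out = a;
    out.tags.insert(b.tags.begin(), b.tags.end());
    return out;
  }
};

// Toy stand-in for an instruction or DAG node.
struct ToyInst {
  std::string opcode;
  AnnotationSet notes;
};

// Toy "combine": fold two instructions into one new instruction that
// carries the union of the annotations of the two originals.
ToyInst combine(const std::string &newOpcode, const ToyInst &x,
                const ToyInst &y) {
  assert(!newOpcode.empty() && "combined instruction needs an opcode");
  return ToyInst{newOpcode, AnnotationSet::merge(x.notes, y.notes)};
}
```

Combining a trunc tagged "secret" with a store tagged "secret" and
"const-time" yields one node carrying both tags. Whether union is the right
policy depends on what a tag means: a property that must hold for every
input of the combined instruction would call for intersection instead.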
Thanks for keeping this going, Lorenzo.

Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:

>> The first questions need to be “what does it mean?”, “how does it
>> work?”, and “what is it useful for?”. It is hard to evaluate a
>> proposal without that.
>
> Hi everyone,
>
> - "What does it mean?": it means to preserve specific information,
> represented as metadata assigned to instructions, from the IR level,
> down to the codegen phases.

An important part of the definition is "how late?" For my particular
uses it would be right up until lowering of asm pseudo-instructions,
even after regalloc and scheduling. I don't know whether someone might
need metadata even later than that (at asm/obj emission time?) but if
metadata is supported on Machine IR then it shouldn't be an issue.

As with IR-level metadata, there should be no guarantee that metadata is
preserved; it's a best-effort thing. In other words, relying on
metadata for correctness is probably not the thing to do.

> - "How does it work?": metadata should be preserved during the several
> back-end transformations; for instance, during the lowering phase,
> DAGCombine performs several optimization to the IR, potentially
> combining several instructions. The new instruction should, then,
> assigned with metadata obtained as a proper combination of the
> original ones (e.g., a union of metadata information).

I want to make it clear that this is expensive to do, in that the number
of changes to the codegen pipeline is quite extensive and widespread. I
know because I've done it*. :) It will help if there are utilities
people can use to merge metadata during DAG transformation, and the more
we make such transfers and combinations "automatic" the easier it will
be to preserve metadata.

Once the mechanisms are there it also takes effort to keep them going.
For example, if a new DAG transformation is added, people need to think
about metadata. This is where "automatic" help makes a real difference.

* By "it" I mean communicate information down to late phases of codegen.
  I don't have a "metadata in codegen" patch as such. I simply cobbled
  something together in our downstream fork that works for some very
  specific use-cases.

> It might be possible to have a dedicated data-structure for such
> metadata info, and an instance of such structure assigned to each
> instruction.

I'm not entirely sure what you mean by this.

> - "What is it useful for?": I think it is quite context-specific; but,
> in general, it is useful when some "higher-level" information
> (e.g., that can be discovered only before the back-end stage of the
> compiler) are required in the back-end to perform "semantic"-related
> optimizations.

That's my use-case. There's semantic information codegen would like to
know but that is really much more practical to discover at the LLVM IR
level or even to pass from the frontend. Much information is lost by the
time codegen is hit and it's often impractical or impossible for codegen
to derive it from first principles.

> To give an (quite generic) example where such codegen metadata may be
> useful: in the field of "secure compilation", preservation of security
> properties during the compilation phases is essential; such properties
> are specified in the high-level specifications of the program, and may
> be expressed with IR metadata. The possibility to keep such IR
> metadata in the codegen phases may allow preservation of properties
> that may be invalidated by codegen phases.

That's a great use-case. I do wonder about your use of "essential"
though. Is it needed for correctness? If so an intrinsics-based
solution may be better.

My use-cases mostly revolve around communication with a proprietary
frontend and thus aren't useful to the community, which is why I haven't
pursued this with any great vigor before this.

I do have uses that convey information from LLVM analyses but
unfortunately I can't share them for now.

All of my use-cases are related to optimization. No "metadata" is
needed for correctness.

I have pondered whether intrinsics might work for my use-cases. My fear
with intrinsics is that they will interfere with other codegen analyses
and transformations. For example they could be a scheduling barrier.

I also have wondered about how intrinsics work within SelectionDAG. Do
they impact dagcombine and other transformations? The reason I call out
SelectionDAG specifically is that most of our downstream changes related
to conveying information are in DAG-related files (dagcombine, legalize,
etc.). Perhaps intrinsics could suffice for the purposes of getting
metadata through SelectionDAG with conversion to "first-class" metadata
at the Machine IR level. Maybe this is even an intermediate step toward
"full metadata" throughout the compilation.

-David
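
One way to read the "automatic" help mentioned above is a single hook at the
point where nodes are replaced, so that individual combines never touch
hints directly. The sketch below is a toy under stated assumptions:
HintTable, onReplace and unrollFactor are invented for illustration and do
not exist in LLVM. It also shows the best-effort contract: conflicting hints
are dropped, and the consumer falls back to a conservative default.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using NodeId = int;  // hypothetical handle for a DAG node

// Hypothetical side table of optimization hints. Dropping an entry is
// always legal, because hints are never needed for correctness.
class HintTable {
  std::map<NodeId, std::string> hints;

public:
  void set(NodeId n, std::string h) { hints[n] = std::move(h); }

  // Returns nullptr when the node carries no hint.
  const std::string *get(NodeId n) const {
    auto it = hints.find(n);
    return it == hints.end() ? nullptr : &it->second;
  }

  // Meant to be called from the one place where nodes are replaced.
  // Policy assumed here: the new node inherits a hint only if every old
  // node that carries a hint carries the same one; old nodes without a
  // hint are ignored, and any disagreement drops the hint.
  void onReplace(const std::vector<NodeId> &olds, NodeId fresh) {
    assert(!olds.empty());
    std::string agreed;
    bool seen = false, conflict = false;
    for (NodeId o : olds) {
      auto it = hints.find(o);
      if (it == hints.end()) continue;
      if (seen && agreed != it->second) conflict = true;
      agreed = it->second;
      seen = true;
    }
    for (NodeId o : olds) hints.erase(o);
    if (seen && !conflict) hints[fresh] = agreed;
  }
};

// Consumer side: a missing hint selects the conservative choice.
int unrollFactor(const HintTable &t, NodeId n) {
  const std::string *h = t.get(n);
  return (h && *h == "hot") ? 4 : 1;
}
```

With this shape, a newly written combine gets hint propagation without any
extra code, and a lost hint only costs an optimization opportunity.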
Thanks Lorenzo,

I was looking for a ‘one level deeper’ analysis of how this works. The
issue is this: either information is preserved across certain sorts of
transformations or it is not. If not, it either goes stale (problematic
for anything that looks at it later) or is invalidated/removed.

The fundamental issue in IR design is factoring the representation of
information from the code that needs to inspect and update it.
“Metadata” designs try to make it easy to add out of band information to
the IR in various ways, with a goal of reducing the impact on the rest of
the compiler.

However, I’ve never seen them work out well. Either the data becomes
stale, or you end up changing a lot of the compiler to support it. Look
at debug info metadata in LLVM for example, it has both problems :-).
This is why MLIR has moved to make source location information and
attributes a first class part of the IR.

-Chris
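
The "stale versus invalidated" distinction above can be made concrete with a
drop-unless-known policy: a transformation keeps only the annotation kinds
it can vouch for and removes everything else, so nothing is left stale. The
sketch is a standalone toy; Annotations, dropUnknown, ToyInst,
strengthReduce and the annotation kind names are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical annotation store: kind -> payload.
using Annotations = std::map<std::string, std::string>;

// Keep only the kinds the calling transformation declares still valid.
// Everything else is removed instead of being left behind to go stale.
Annotations dropUnknown(const Annotations &in,
                        const std::set<std::string> &preserved) {
  Annotations out;
  for (const auto &kv : in)
    if (preserved.count(kv.first))
      out.insert(kv);
  return out;
}

// Toy instruction with one immediate operand.
struct ToyInst {
  std::string opcode;
  long imm;
  Annotations notes;
};

// Toy transform: rewrite "mul x, 2" as "shl x, 1". The result value is
// unchanged, so this transform can vouch for "result-range"; it cannot
// vouch for any other kind, so those are invalidated.
ToyInst strengthReduce(const ToyInst &mul) {
  assert(mul.opcode == "mul" && mul.imm == 2);
  return ToyInst{"shl", 1, dropUnknown(mul.notes, {"result-range"})};
}
```

The cost that the message above points at is visible even here: every
transformation has to state which kinds it preserves, which is exactly the
"changing a lot of the compiler" side of the trade-off.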
On 31/07/20 at 22:47, David Greene wrote:

@David

> Thanks for keeping this going, Lorenzo.
>
> Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> The first questions need to be “what does it mean?”, “how does it
>>> work?”, and “what is it useful for?”. It is hard to evaluate a
>>> proposal without that.
>> Hi everyone,
>>
>> - "What does it mean?": it means to preserve specific information,
>> represented as metadata assigned to instructions, from the IR level,
>> down to the codegen phases.
> An important part of the definition is "how late?" For my particular
> uses it would be right up until lowering of asm pseudo-instructions,
> even after regalloc and scheduling. I don't know whether someone might
> need metadata even later than that (at asm/obj emission time?) but if
> metadata is supported on Machine IR then it shouldn't be an issue.

"How late" is context-specific: in my case too, I needed the information to
be preserved until pseudo-instruction expansion. Conservatively, it could be
preserved until the last pass of the codegen pipeline. As for its use in
later steps, I would not say it is not required: I worked on one specific
topic of secure compilation and do not have the whole picture in mind.
Nonetheless, we could first test how things work out within codegen and
reason about further developments later.

> As with IR-level metadata, there should be no guarantee that metadata is
> preserved and that it's a best-effort thing. In other words, relying on
> metadata for correctness is probably not the thing to do.

OK, I made a mistake in stating that metadata should be *preserved*; what I
really meant is that the *information* such metadata represents should be
preserved.

>> - "How does it work?": metadata should be preserved during the several
>> back-end transformations; for instance, during the lowering phase,
>> DAGCombine performs several optimization to the IR, potentially
>> combining several instructions. The new instruction should, then,
>> assigned with metadata obtained as a proper combination of the
>> original ones (e.g., a union of metadata information).
> I want to make it clear that this is expensive to do, in that the number
> of changes to the codegen pipeline is quite extensive and widespread. I
> know because I've done it*. :) It will help if there are utilities
> people can use to merge metadata during DAG transformation and the more
> we make such transfers and combinations "automatic" the easier it will
> be to preserve metadata.
>
> Once the mechanisms are there it also takes effort to keep them going.
> For example if a new DAG transformation is done people need to think
> about metadata. This is where "automatic" help makes a real difference.
>
> * By "it" I mean communicate information down to late phases of codegen.
> I don't have a "metadata in codegen" patch as such. I simply cobbled
> something together in our downstream fork that works for some very
> specific use-cases.

I know what you have been through, and I can only agree with you: for the
project I mentioned above, I had to make several changes to the whole IR
lowering phase in order to correctly propagate high-level information. It
wasn't cheap and required a lot of effort.

>> It might be possible to have a dedicated data-structure for such
>> metadata info, and an instance of such structure assigned to each
>> instruction.
> I'm not entirely sure what you mean by this.

I was imagining a per-instruction data structure that collects the metadata
related to that specific instruction, instead of having several pieces of
metadata directly embedded in each instruction.

>> - "What is it useful for?": I think it is quite context-specific; but,
>> in general, it is useful when some "higher-level" information
>> (e.g., that can be discovered only before the back-end stage of the
>> compiler) are required in the back-end to perform "semantic"-related
>> optimizations.
> That's my use-case. There's semantic information codegen would like to
> know but is really much more practical to discover at the LLVM IR level
> or even passed from the frontend. Much information is lost by the time
> codegen is hit and it's often impractical or impossible for codegen to
> derive it from first principles.
>
>> To give an (quite generic) example where such codegen metadata may be
>> useful: in the field of "secure compilation", preservation of security
>> properties during the compilation phases is essential; such properties
>> are specified in the high-level specifications of the program, and may
>> be expressed with IR metadata. The possibility to keep such IR
>> metadata in the codegen phases may allow preservation of properties
>> that may be invalidated by codegen phases.
> That's a great use-case. I do wonder about your use of "essential"
> though.

By *essential* I mean fundamental for satisfying a specific target security
property.

> Is it needed for correctness? If so an intrinsics-based
> solution may be better.

Uhm... it might sound like a naive question, but what do you mean by
*correctness*?

> My use-cases mostly revolve around communication with a proprietary
> frontend and thus aren't useful to the community, which is why I haven't
> pursued this with any great vigor before this.
>
> I do have uses that convey information from LLVM analyses but
> unfortunately I can't share them for now.
>
> All of my use-cases are related to optimization. No "metadata" is
> needed for correctness.
>
> I have pondered whether intrinsics might work for my use-cases. My fear
> with intrinsics is that they will interfere with other codegen analyses
> and transformations. For example they could be a scheduling barrier.
>
> I also have wondered about how intrinsics work within SelectionDAG. Do
> they impact dagcombine and other transformations? The reason I call out
> SelectionDAG specifically is that most of our downstream changes related
> to conveying information are in DAG-related files (dagcombine, legalize,
> etc.). Perhaps intrinsics could suffice for the purposes of getting
> metadata through SelectionDAG with conversion to "first-class" metadata
> at the Machine IR level. Maybe this is even an intermediate step toward
> "full metadata" throughout the compilation.

I used intrinsics as a means of carrying metadata but, in my experience, I
am not sure they are a valid alternative:

- For each LLVM IR instruction used in my project (e.g., store), a
  semantically equivalent intrinsic is declared, with dedicated parameters
  representing the metadata (i.e., first-class metadata is represented by
  specific parameters of the intrinsic).

- During lowering, each ad-hoc intrinsic must be handled explicitly, by
  manually adding the proper legalization operations, DAG combinations and
  so on.

- During the conversion of LLVM IR to MIR (i.e., when mapping intrinsics to
  pseudo-instructions), the metadata is passed on to the MIR representation
  of the program.

In particular, the second point raises a critical problem in terms of
optimizations (e.g., an intrinsic store plus an intrinsic trunc is not
automatically converted into an intrinsic truncated store). The backend
must then be taught to perform such optimizations, which it already performs
on non-intrinsic instructions (e.g., store + trunc is already converted into
a truncated store).

Instead of re-inventing the wheel, and since the backend would have to be
modified anyway in order to support optimizations on intrinsics, I would
rather add some mechanism that supports attaching metadata as first-class
elements of the IR/MIR, along with, for instance, automatic merging of
metadata.

----

@Chris

I may be wrong (in which case, please correct me) but, if I understood
correctly, source-level debugging metadata is "external" (i.e., not a
first-class element of the LLVM IR), and managing it involves a great deal
of effort. As described above, in my project I used metadata as first-class
elements of the IR/MIR; I found this approach more direct and simpler to
handle, although some passes and transformations must be modified.

So I agree with you that metadata should be a first-class element of the
IR/MIR (or, at least, "packed" into a structure that is a first-class part
of the IR/MIR).

----

In any case, I wonder whether metadata at the codegen level is something the
community as a whole would benefit from (thereby justifying a potentially
huge and/or long series of patches), or something that only a small group
would be interested in.

Cheers
-- Lorenzo
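
The missed store + trunc combine described in the message above can be
reproduced with a toy pattern matcher. This is only an illustration of the
mechanism, not SelectionDAG code: Node, make and combineTruncStore are
hypothetical, and the opcode strings "intr.store" and "intr.trunc" stand in
for the ad-hoc intrinsics.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy DAG node with at most one operand.
struct Node {
  std::string op;                 // e.g. "store", "trunc", "x"
  std::shared_ptr<Node> operand;  // null for leaves
};

using NodePtr = std::shared_ptr<Node>;

NodePtr make(std::string op, NodePtr operand = nullptr) {
  return std::make_shared<Node>(Node{std::move(op), std::move(operand)});
}

// Toy combine keyed on plain opcodes, the way an existing backend
// pattern would be: store(trunc(x)) -> truncstore(x).
NodePtr combineTruncStore(const NodePtr &n) {
  if (n->op == "store" && n->operand && n->operand->op == "trunc")
    return make("truncstore", n->operand->operand);
  return n;  // no match: the node is returned unchanged
}
```

A plain store(trunc(x)) is folded, while intr.store(intr.trunc(x)) falls
through untouched, because the pattern only knows the plain opcodes. Every
such pattern would have to be duplicated for the intrinsic forms, which is
the re-inventing-the-wheel cost the message refers to.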
Chris Lattner via llvm-dev <llvm-dev at lists.llvm.org> writes:

> The issue is this: either information is preserved across certain
> sorts of transformations or it is not. If not, it either goes stale
> (problematic for anything that looks at it later) or is
> invalidated/removed.
>
> The fundamental issue in IR design is factoring the representation of
> information from the code that needs to inspect and update it.
> “Metadata” designs try to make it easy to add out of band information
> to the IR in various ways, with a goal of reducing the impact on the
> rest of the compiler.
>
> However, I’ve never seen them work out well. Either the data becomes
> stale, or you end up changing a lot of the compiler to support it.
> Look at debug info metadata in LLVM for example, it has both problems
> :-). This is why MLIR has moved to make source location information
> and attributes a first class part of the IR.

I basically agree with your analysis. Some information is so pervasive
that it really should be a part of the IR proper. But other information
may not be. The kind of information I'm thinking of basically boils down
to optimization hints. It's fine and semantically sound to drop it,
though not ideal if it can be avoided.

I see debug info as being in a quite different class. With the -g option
we are making a promise to our users. So using a mechanism that by design
doesn't make promises seems a poor fit.

A long long time ago in the dark ages before git and Phabricator I
submitted a patch for review that would have added comment information to
machine instructions. It was basically a string member on every
MachineInstr. At the time it was deemed too expensive and rightly so.
Instead I ended up adding some flag values that the AsmPrinter uses as a
hint to generate various comments. I'm still not very happy with that
"solution" and a more general-purpose mechanism for annotating
IR/SelectionDAG/MIR objects would be quite welcome.

A generic first-class annotation construct would cover both use-cases.
If you and the wider community are open to adding first-class generic
information annotation, I'm eager to work on it!

-David
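
The trade-off described above, a string on every instruction versus a few
flag bits that the printer expands into comments, can be sketched as
follows. This is a standalone toy, not the actual MachineInstr or AsmPrinter
code: CommentFlag, CF_Reload, CF_Spill, ToyMachineInstr and emit are
hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical comment kinds, one bit each.
enum CommentFlag : std::uint8_t {
  CF_Reload = 1u << 0,
  CF_Spill = 1u << 1,
};

struct ToyMachineInstr {
  std::string opcode;
  // One byte of flags per instruction, instead of a free-form comment
  // string member that every instruction would have to pay for.
  std::uint8_t commentFlags = 0;
};

// Toy printer: the comment text lives in the printer, and the flags
// only select which fixed comments are emitted.
std::string emit(const ToyMachineInstr &mi) {
  std::string line = mi.opcode;
  if (mi.commentFlags & CF_Reload) line += " # reload";
  if (mi.commentFlags & CF_Spill) line += " # spill";
  return line;
}
```

The flags are cheap, but they can only express comments the printer already
knows about, which is the limitation a general-purpose annotation mechanism
would lift.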