On 31/07/20 at 22:47, David Greene wrote:

@David

> Thanks for keeping this going, Lorenzo.
>
> Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> The first questions need to be "what does it mean?", "how does it
>>> work?", and "what is it useful for?". It is hard to evaluate a
>>> proposal without that.
>>
>> Hi everyone,
>>
>> - "What does it mean?": it means to preserve specific information,
>>   represented as metadata assigned to instructions, from the IR level
>>   down to the codegen phases.
>
> An important part of the definition is "how late?" For my particular
> uses it would be right up until lowering of asm pseudo-instructions,
> even after regalloc and scheduling. I don't know whether someone might
> need metadata even later than that (at asm/obj emission time?) but if
> metadata is supported on Machine IR then it shouldn't be an issue.

"How late" is context-specific: even in my case, I required such information to be preserved until pseudo-instruction expansion. Conservatively, it could be preserved until the last pass of the codegen pipeline. Regarding its use in the later steps, I would not say it is not required, since I worked on a specific topic of secure compilation and I do not have the whole picture in mind; nonetheless, it would be possible to test how things work out with the codegen and later reason on future developments.

> As with IR-level metadata, there should be no guarantee that metadata is
> preserved and that it's a best-effort thing. In other words, relying on
> metadata for correctness is probably not the thing to do.

Ok, I made a mistake stating that metadata should be *preserved*; what I really meant is to preserve the *information* that such metadata represent.

>> - "How does it work?": metadata should be preserved during the several
>>   back-end transformations; for instance, during the lowering phase,
>>   DAGCombine performs several optimizations on the IR, potentially
>>   combining several instructions. The new instruction should, then, be
>>   assigned metadata obtained as a proper combination of the original
>>   ones (e.g., a union of the metadata information).
>
> I want to make it clear that this is expensive to do, in that the number
> of changes to the codegen pipeline is quite extensive and widespread. I
> know because I've done it*. :) It will help if there are utilities
> people can use to merge metadata during DAG transformation and the more
> we make such transfers and combinations "automatic" the easier it will
> be to preserve metadata.
>
> Once the mechanisms are there it also takes effort to keep them going.
> For example if a new DAG transformation is done people need to think
> about metadata. This is where "automatic" help makes a real difference.
>
> * By "it" I mean communicate information down to late phases of codegen.
>   I don't have a "metadata in codegen" patch as such. I simply cobbled
>   something together in our downstream fork that works for some very
>   specific use-cases.

I know what you have been through, and I can only agree with you: for the project I mentioned above, I had to perform several changes to the whole IR lowering phase in order to correctly propagate high-level information; it wasn't cheap and it required a lot of effort.
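For reference, the IR level already ships a merging helper of exactly the kind David mentions (combineMetadataForCSE() in llvm/lib/Transforms/Utils/Local.cpp), and its behaviour is roughly what a codegen-level mechanism would have to reproduce. A minimal sketch, with made-up !range values:

    ; Two loads of the same location, each annotated independently:
    %a = load i32, i32* %p, !range !0
    %b = load i32, i32* %p, !range !1

    !0 = !{i32 0, i32 10}
    !1 = !{i32 5, i32 20}

    ; If CSE keeps a single load, it must carry an annotation valid for
    ; both original uses: the helper either computes the most generic
    ; range covering both (here !{i32 0, i32 20}) or drops the metadata.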
>> It might be possible to have a dedicated data-structure for such
>> metadata info, and an instance of such structure assigned to each
>> instruction.
>
> I'm not entirely sure what you mean by this.

I was imagining a per-instruction data structure collecting the metadata info related to that specific instruction, instead of having several pieces of metadata info directly embedded in each instruction.

>> - "What is it useful for?": I think it is quite context-specific; but,
>>   in general, it is useful when some "higher-level" information
>>   (e.g., information that can be discovered only before the back-end
>>   stage of the compiler) is required in the back-end to perform
>>   "semantic"-related optimizations.
>
> That's my use-case. There's semantic information codegen would like to
> know but is really much more practical to discover at the LLVM IR level
> or even passed from the frontend. Much information is lost by the time
> codegen is hit and it's often impractical or impossible for codegen to
> derive it from first principles.
>
>> To give a (quite generic) example where such codegen metadata may be
>> useful: in the field of "secure compilation", preservation of security
>> properties during the compilation phases is essential; such properties
>> are specified in the high-level specifications of the program, and may
>> be expressed with IR metadata. The possibility to keep such IR
>> metadata in the codegen phases may allow preservation of properties
>> that may be invalidated by codegen phases.
>
> That's a great use-case. I do wonder about your use of "essential"
> though.

With *essential* I mean fundamental for satisfying a specific target security property.

> Is it needed for correctness? If so an intrinsics-based
> solution may be better.

Uhm... it might sound like a naive question, but what do you mean by *correctness*?

> My use-cases mostly revolve around communication with a proprietary
> frontend and thus aren't useful to the community, which is why I haven't
> pursued this with any great vigor before this.
>
> I do have uses that convey information from LLVM analyses but
> unfortunately I can't share them for now.
>
> All of my use-cases are related to optimization. No "metadata" is
> needed for correctness.
>
> I have pondered whether intrinsics might work for my use-cases. My fear
> with intrinsics is that they will interfere with other codegen analyses
> and transformations. For example they could be a scheduling barrier.
>
> I also have wondered about how intrinsics work within SelectionDAG. Do
> they impact dagcombine and other transformations? The reason I call out
> SelectionDAG specifically is that most of our downstream changes related
> to conveying information are in DAG-related files (dagcombine, legalize,
> etc.). Perhaps intrinsics could suffice for the purposes of getting
> metadata through SelectionDAG with conversion to "first-class" metadata
> at the Machine IR level. Maybe this is even an intermediate step toward
> "full metadata" throughout the compilation.

I employed intrinsics as a means for carrying metadata but, from my experience, I am not sure they can be considered a valid alternative (a sketch of the approach follows the list):

- For each llvm-ir instruction employed in my project (e.g., store), a semantically equivalent intrinsic is declared, with particular parameters representing metadata (i.e., first-class metadata are represented by specific intrinsic parameters).

- During the lowering, each ad-hoc intrinsic must be properly handled, manually adding the proper legalization operations, DAG combinations, and so on.

- During MIR conversion of the llvm-ir (i.e., mapping intrinsics to pseudo-instructions), metadata are passed to the MIR representation of the program.
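To make the mirroring concrete, here is a minimal sketch; the intrinsic name and the trailing i32 "label" operand are hypothetical, and declaring such an intrinsic would require a matching entry in a downstream Intrinsics.td:

    ; Hypothetical mirrored store: the extra operand carries the
    ; metadata (e.g., a security label) as a first-class argument.
    declare void @llvm.secured.store.i32(i32, i32*, i32)

    define void @f(i32 %v, i32* %p) {
      call void @llvm.secured.store.i32(i32 %v, i32* %p, i32 1)
      ret void
    }

    ; Compare with the native sequence, which DAGCombine already knows
    ; how to fold into a single truncating store; the opaque intrinsic
    ; form gets no such treatment:
    define void @g(i32 %v, i16* %q) {
      %t = trunc i32 %v to i16
      store i16 %t, i16* %q
      ret void
    }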
In particular, the second point raises a critical problem in terms of optimizations (e.g., an intrinsic store + an intrinsic trunc are not automatically converted into an intrinsic truncated store). The backend must then be instructed to perform such optimizations, which are already performed on non-intrinsic instructions (e.g., store + trunc is already converted into a truncated store).

Instead of re-inventing the wheel, and since the backend would nonetheless have to be modified in order to support optimizations on intrinsics, I would rather prefer to insert some sort of mechanism to support metadata attachment as first-class elements of the IR/MIR, with automatic merging of metadata, for instance.

----

@Chris

I may be wrong (in that case, please correct me), but if I understood correctly, source-level debugging metadata are "external" (i.e., not a first-class element of the llvm-ir), and their management involves a great effort.

As described above, in my project I used metadata as first-class elements of the IR/MIR; I found this approach more immediate and simpler to handle, although some passes and transformations must be modified.

I agree with you, then, that metadata info should be a first-class element of the IR/MIR (or, at least, "packed" into a structure that is a first-class part of the IR/MIR).

----

In any case, I wonder whether metadata at the codegen level is actually something the community would benefit from (thus justifying a potentially huge and/or long series of patches), or something in which only a small group would be interested.

Cheers

-- Lorenzo
Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:

>> As with IR-level metadata, there should be no guarantee that metadata is
>> preserved and that it's a best-effort thing. In other words, relying on
>> metadata for correctness is probably not the thing to do.
>
> Ok, I made a mistake stating that metadata should be *preserved*; what
> I really meant is to preserve the *information* that such metadata
> represent.

We do have one way of doing that now that's nearly foolproof in terms of accidental loss: intrinsics. Intrinsics AFAIK are never just deleted and have to be explicitly handled at some point. Intrinsics may not work well for your use-case for a variety of reasons but they are an option.

I'm mostly just writing this to get thoughts in my head organized. :)

>> * By "it" I mean communicate information down to late phases of codegen.
>>   I don't have a "metadata in codegen" patch as such. I simply cobbled
>>   something together in our downstream fork that works for some very
>>   specific use-cases.
>
> I know what you have been through, and I can only agree with you: for
> the project I mentioned above, I had to perform several changes to the
> whole IR lowering phase in order to correctly propagate high-level
> information; it wasn't cheap and it required a lot of effort.

I know your pain. :)

>>> It might be possible to have a dedicated data-structure for such
>>> metadata info, and an instance of such structure assigned to each
>>> instruction.
>>
>> I'm not entirely sure what you mean by this.
>
> I was imagining a per-instruction data structure collecting the metadata
> info related to that specific instruction, instead of having several
> pieces of metadata info directly embedded in each instruction.

Interesting. At the IR level metadata isn't necessarily unique, though it can be made so. If multiple pieces of information were amalgamated into one structure that might reduce the ability to share the in-memory representation, which has a cost. I like the ability of IR metadata to be very flexible while at the same time being relatively cheap in terms of resource utilization.

I don't always like that IR metadata is not scoped. It makes it more difficult to process the IR for a Function in isolation. But that's a relatively minor quibble for me. It's a tradeoff between convenience and resource utilization.
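To illustrate the sharing concern with a minimal sketch (!secure is a made-up attachment kind, purely for illustration): metadata nodes are uniqued by content, so every instruction carrying the same annotation points at a single shared node:

    store i32 %a, i32* %p, !secure !0
    store i32 %b, i32* %q, !secure !0

    ; One MDNode serves any number of attachments:
    !0 = !{!"sensitive"}

Amalgamating several kinds of information into one per-instruction structure would tend to make each structure unique to its instruction, which is where the extra memory cost would come from.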
>> That's a great use-case. I do wonder about your use of "essential"
>> though.
>
> With *essential* I mean fundamental for satisfying a specific target
> security property.
>
>> Is it needed for correctness? If so an intrinsics-based solution
>> may be better.
>
> Uhm... it might sound like a naive question, but what do you mean by
> *correctness*?

I mean will the compiler generate incorrect code or otherwise violate some contract. In your secure compilation example, if the compiler *promises* that the generated code will be "secure" then that's a contract that would be violated if the metadata were lost.

> I employed intrinsics as a means for carrying metadata but, from my
> experience, I am not sure they can be considered a valid alternative:
>
> - For each llvm-ir instruction employed in my project (e.g., store),
>   a semantically equivalent intrinsic is declared, with particular
>   parameters representing metadata (i.e., first-class metadata are
>   represented by specific intrinsic parameters).
>
> - During the lowering, each ad-hoc intrinsic must be properly
>   handled, manually adding the proper legalization operations, DAG
>   combinations and so on.
>
> - During MIR conversion of the llvm-ir (i.e., mapping intrinsics to
>   pseudo-instructions), metadata are passed to the MIR representation
>   of the program.
>
> In particular, the second point raises a critical problem in terms of
> optimizations (e.g., an intrinsic store + an intrinsic trunc are not
> automatically converted into an intrinsic truncated store). The
> backend must then be instructed to perform such optimizations, which
> are already performed on non-intrinsic instructions (e.g., store +
> trunc is already converted into a truncated store).

Gotcha. That certainly is a lot of burden. Do the intrinsics *have to* mirror the existing instructions exactly or could a more generic intrinsic be defined that took some data as an argument, for example a pointer to a static string? Then each intrinsic instance could reference a static string unique to its context.

I have not really thought this through, just throwing out ideas in a devil's advocate sort of way.

In my case using intrinsics would have to tie the intrinsic to the instruction it is annotating. This seems similar to your use-case. This is straightforward to do if everything is SSA but once we've gone beyond that things get a lot more complicated. The mapping of information to specific instructions really does seem like the most difficult bit.
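As a sketch of that generic form (the intrinsic name is hypothetical and would need its own Intrinsics.td entry; the existing llvm.var.annotation family is similar in spirit), which also shows why the mapping really is the hard part:

    @.str = private constant [7 x i8] c"secure\00"

    declare void @llvm.annotate.marker(i8*)

    define void @g(i32 %v, i32* %p) {
      store i32 %v, i32* %p
      ; The annotation refers to "the preceding store" only by position;
      ; any pass that moves, deletes, or duplicates one without the
      ; other silently breaks the association.
      call void @llvm.annotate.marker(i8* getelementptr inbounds ([7 x i8], [7 x i8]* @.str, i32 0, i32 0))
      ret void
    }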
> Instead of re-inventing the wheel, and since the backend would
> nonetheless have to be modified in order to support optimizations on
> intrinsics, I would rather prefer to insert some sort of mechanism to
> support metadata attachment as first-class elements of the IR/MIR,
> with automatic merging of metadata, for instance.

Can you explain a bit more what you mean by "first-class?"

> In any case, I wonder whether metadata at the codegen level is
> actually something the community would benefit from (thus justifying
> a potentially huge and/or long series of patches), or something in
> which only a small group would be interested.

I would also like to know this. Have others found the need to convey information down to codegen and if so, what approaches were considered and tried?

Maybe this is a niche requirement but I really don't think it is. I think it more likely that various hacks/modifications have been made over the years to sufficiently approximate a desired outcome and that this has led to not insignificant technical debt.

Or maybe I just think that because I've worked on a 40-year-old compiler for my entire career. :)

-David

On 07/08/20 at 22:54, David Greene wrote:

> Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> As with IR-level metadata, there should be no guarantee that metadata is
>>> preserved and that it's a best-effort thing. In other words, relying on
>>> metadata for correctness is probably not the thing to do.
>>
>> Ok, I made a mistake stating that metadata should be *preserved*; what
>> I really meant is to preserve the *information* that such metadata
>> represent.
>
> We do have one way of doing that now that's nearly foolproof in terms of
> accidental loss: intrinsics. Intrinsics AFAIK are never just deleted
> and have to be explicitly handled at some point. Intrinsics may not
> work well for your use-case for a variety of reasons but they are an
> option.
>
> I'm mostly just writing this to get thoughts in my head organized. :)

The only problem with intrinsics, for me, was the need to mirror the already existing instructions. As you pointed out, if there's a way to map intrinsics to instructions, there would be no reason to mirror the latter, and one could just use the former to carry metadata.

>>>> It might be possible to have a dedicated data-structure for such
>>>> metadata info, and an instance of such structure assigned to each
>>>> instruction.
>>>
>>> I'm not entirely sure what you mean by this.
>>
>> I was imagining a per-instruction data structure collecting the metadata
>> info related to that specific instruction, instead of having several
>> pieces of metadata info directly embedded in each instruction.
>
> Interesting. At the IR level metadata isn't necessarily unique, though
> it can be made so. If multiple pieces of information were amalgamated
> into one structure that might reduce the ability to share the in-memory
> representation, which has a cost. I like the ability of IR metadata to
> be very flexible while at the same time being relatively cheap in terms
> of resource utilization.
>
> I don't always like that IR metadata is not scoped. It makes it more
> difficult to process the IR for a Function in isolation. But that's a
> relatively minor quibble for me. It's a tradeoff between convenience
> and resource utilization.

Uhm... could I ask you to elaborate a bit more on the "limitation on in-memory representation sharing"? It is not clear to me how this would cause a problem.

>>> That's a great use-case. I do wonder about your use of "essential"
>>> though.
>>
>> With *essential* I mean fundamental for satisfying a specific target
>> security property.
>>
>>> Is it needed for correctness? If so an intrinsics-based solution
>>> may be better.
>>
>> Uhm... it might sound like a naive question, but what do you mean by
>> *correctness*?
>
> I mean will the compiler generate incorrect code or otherwise violate
> some contract. In your secure compilation example, if the compiler
> *promises* that the generated code will be "secure" then that's a
> contract that would be violated if the metadata were lost.

You got the point: if no metadata are provided, or they are lost, the codegen phase is not able to fulfill the contract (in my use case, to generate code that is "secure").

>> I employed intrinsics as a means for carrying metadata but, from my
>> experience, I am not sure they can be considered a valid alternative:
>>
>> - For each llvm-ir instruction employed in my project (e.g., store),
>>   a semantically equivalent intrinsic is declared, with particular
>>   parameters representing metadata (i.e., first-class metadata are
>>   represented by specific intrinsic parameters).
>>
>> - During the lowering, each ad-hoc intrinsic must be properly
>>   handled, manually adding the proper legalization operations, DAG
>>   combinations and so on.
>>
>> - During MIR conversion of the llvm-ir (i.e., mapping intrinsics to
>>   pseudo-instructions), metadata are passed to the MIR representation
>>   of the program.
>>
>> In particular, the second point raises a critical problem in terms of
>> optimizations (e.g., an intrinsic store + an intrinsic trunc are not
>> automatically converted into an intrinsic truncated store). The
>> backend must then be instructed to perform such optimizations, which
>> are already performed on non-intrinsic instructions (e.g., store +
>> trunc is already converted into a truncated store).
>
> Gotcha. That certainly is a lot of burden. Do the intrinsics *have to*
> mirror the existing instructions exactly or could a more generic
> intrinsic be defined that took some data as an argument, for example a
> pointer to a static string? Then each intrinsic instance could
> reference a static string unique to its context.
>
> I have not really thought this through, just throwing out ideas in a
> devil's advocate sort of way.

I like brainstorming ;)

> In my case using intrinsics would have to tie the intrinsic to the
> instruction it is annotating. This seems similar to your use-case.
> This is straightforward to do if everything is SSA but once we've gone
> beyond that things get a lot more complicated. The mapping of
> information to specific instructions really does seem like the most
> difficult bit.

No, intrinsics do not have to mirror existing instructions; yes, they can be used just to carry around specific data as arguments. Nonetheless, there we have our (implementation) problem: how to map info (e.g., intrinsics) to instructions, and vice versa? I am really curious how you would perform it in the pre-RA phase :)

>> Instead of re-inventing the wheel, and since the backend would
>> nonetheless have to be modified in order to support optimizations on
>> intrinsics, I would rather prefer to insert some sort of mechanism to
>> support metadata attachment as first-class elements of the IR/MIR,
>> with automatic merging of metadata, for instance.
>
> Can you explain a bit more what you mean by "first-class?"

Never mind, I used the wrong terminology: I just meant to directly embed metadata in the IR/MIR.

>> In any case, I wonder whether metadata at the codegen level is
>> actually something the community would benefit from (thus justifying
>> a potentially huge and/or long series of patches), or something in
>> which only a small group would be interested.
>
> I would also like to know this. Have others found the need to convey
> information down to codegen and if so, what approaches were considered
> and tried?
>
> Maybe this is a niche requirement but I really don't think it is. I
> think it more likely that various hacks/modifications have been made
> over the years to sufficiently approximate a desired outcome and that
> this has led to not insignificant technical debt.
>
> Or maybe I just think that because I've worked on a 40-year-old compiler
> for my entire career. :)
>
> -David

Best regards,

Lorenzo