On 31/07/20 at 22:47, David Greene wrote:

@David

> Thanks for keeping this going, Lorenzo.
>
> Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> The first questions need to be "what does it mean?", "how does it
>>> work?", and "what is it useful for?". It is hard to evaluate a
>>> proposal without that.
>>
>> Hi everyone,
>>
>> - "What does it mean?": it means to preserve specific information,
>>   represented as metadata assigned to instructions, from the IR level
>>   down to the codegen phases.
>
> An important part of the definition is "how late?" For my particular
> uses it would be right up until lowering of asm pseudo-instructions,
> even after regalloc and scheduling. I don't know whether someone might
> need metadata even later than that (at asm/obj emission time?) but if
> metadata is supported on Machine IR then it shouldn't be an issue.

"How late" is context-specific: even in my case, I required such information to be preserved until pseudo-instruction expansion. Conservatively, it could be preserved until the last pass of the codegen pipeline. Regarding its use in the later steps, I would not say it is not required, since I worked on a specific topic of secure compilation and I do not have the whole picture in mind; nonetheless, it would be possible to test how things work out with the codegen and later reason on future developments.

> As with IR-level metadata, there should be no guarantee that metadata is
> preserved and that it's a best-effort thing. In other words, relying on
> metadata for correctness is probably not the thing to do.

Ok, I made a mistake stating that metadata should be *preserved*; what I really meant is to preserve the *information* that such metadata represent.

>> - "How does it work?": metadata should be preserved during the several
>>   back-end transformations; for instance, during the lowering phase,
>>   DAGCombine performs several optimizations on the IR, potentially
>>   combining several instructions. The new instruction should, then, be
>>   assigned metadata obtained as a proper combination of the original
>>   ones (e.g., a union of the metadata information).
>
> I want to make it clear that this is expensive to do, in that the number
> of changes to the codegen pipeline is quite extensive and widespread. I
> know because I've done it*. :) It will help if there are utilities
> people can use to merge metadata during DAG transformation and the more
> we make such transfers and combinations "automatic" the easier it will
> be to preserve metadata.
>
> Once the mechanisms are there it also takes effort to keep them going.
> For example if a new DAG transformation is done people need to think
> about metadata. This is where "automatic" help makes a real difference.
>
> * By "it" I mean communicate information down to late phases of codegen.
>   I don't have a "metadata in codegen" patch as such. I simply cobbled
>   something together in our downstream fork that works for some very
>   specific use-cases.

I know what you have been through, and I can only agree with you: for the project I mentioned above, I had to perform several changes to the whole IR lowering phase in order to correctly propagate high-level information; it wasn't cheap and it required a lot of effort.
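For reference, the IR level already ships a merging helper of exactly the kind David mentions (combineMetadataForCSE() in llvm/lib/Transforms/Utils/Local.cpp), and its behaviour is roughly what a codegen-level mechanism would have to reproduce. A minimal sketch, with made-up !range values:

    ; Two loads of the same location, each annotated independently:
    %a = load i32, i32* %p, !range !0
    %b = load i32, i32* %p, !range !1

    !0 = !{i32 0, i32 10}
    !1 = !{i32 5, i32 20}

    ; If CSE keeps a single load, it must carry an annotation valid for
    ; both original uses: the helper either computes the most generic
    ; range covering both (here !{i32 0, i32 20}) or drops the metadata.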
>> It might be possible to have a dedicated data-structure for such
>> metadata info, and an instance of such structure assigned to each
>> instruction.
>
> I'm not entirely sure what you mean by this.

I was imagining a per-instruction data structure collecting the metadata info related to that specific instruction, instead of having several pieces of metadata info directly embedded in each instruction.

>> - "What is it useful for?": I think it is quite context-specific; but,
>>   in general, it is useful when some "higher-level" information
>>   (e.g., information that can be discovered only before the back-end
>>   stage of the compiler) is required in the back-end to perform
>>   "semantic"-related optimizations.
>
> That's my use-case. There's semantic information codegen would like to
> know but is really much more practical to discover at the LLVM IR level
> or even passed from the frontend. Much information is lost by the time
> codegen is hit and it's often impractical or impossible for codegen to
> derive it from first principles.
>
>> To give a (quite generic) example where such codegen metadata may be
>> useful: in the field of "secure compilation", preservation of security
>> properties during the compilation phases is essential; such properties
>> are specified in the high-level specifications of the program, and may
>> be expressed with IR metadata. The possibility to keep such IR
>> metadata in the codegen phases may allow preservation of properties
>> that may be invalidated by codegen phases.
>
> That's a great use-case. I do wonder about your use of "essential"
> though.

With *essential* I mean fundamental for satisfying a specific target security property.

> Is it needed for correctness? If so an intrinsics-based
> solution may be better.

Uhm... it might sound like a naive question, but what do you mean by *correctness*?

> My use-cases mostly revolve around communication with a proprietary
> frontend and thus aren't useful to the community, which is why I haven't
> pursued this with any great vigor before this.
>
> I do have uses that convey information from LLVM analyses but
> unfortunately I can't share them for now.
>
> All of my use-cases are related to optimization. No "metadata" is
> needed for correctness.
>
> I have pondered whether intrinsics might work for my use-cases. My fear
> with intrinsics is that they will interfere with other codegen analyses
> and transformations. For example they could be a scheduling barrier.
>
> I also have wondered about how intrinsics work within SelectionDAG. Do
> they impact dagcombine and other transformations? The reason I call out
> SelectionDAG specifically is that most of our downstream changes related
> to conveying information are in DAG-related files (dagcombine, legalize,
> etc.). Perhaps intrinsics could suffice for the purposes of getting
> metadata through SelectionDAG with conversion to "first-class" metadata
> at the Machine IR level. Maybe this is even an intermediate step toward
> "full metadata" throughout the compilation.

I employed intrinsics as a means for carrying metadata but, from my experience, I am not sure they can be considered a valid alternative (a sketch of the approach follows the list):

- For each llvm-ir instruction employed in my project (e.g., store), a semantically equivalent intrinsic is declared, with particular parameters representing metadata (i.e., first-class metadata are represented by specific intrinsic parameters).

- During the lowering, each ad-hoc intrinsic must be properly handled, manually adding the proper legalization operations, DAG combinations, and so on.

- During MIR conversion of the llvm-ir (i.e., mapping intrinsics to pseudo-instructions), metadata are passed to the MIR representation of the program.
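To make the mirroring concrete, here is a minimal sketch; the intrinsic name and the trailing i32 "label" operand are hypothetical, and declaring such an intrinsic would require a matching entry in a downstream Intrinsics.td:

    ; Hypothetical mirrored store: the extra operand carries the
    ; metadata (e.g., a security label) as a first-class argument.
    declare void @llvm.secured.store.i32(i32, i32*, i32)

    define void @f(i32 %v, i32* %p) {
      call void @llvm.secured.store.i32(i32 %v, i32* %p, i32 1)
      ret void
    }

    ; Compare with the native sequence, which DAGCombine already knows
    ; how to fold into a single truncating store; the opaque intrinsic
    ; form gets no such treatment:
    define void @g(i32 %v, i16* %q) {
      %t = trunc i32 %v to i16
      store i16 %t, i16* %q
      ret void
    }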
In particular, the second point raises a critical problem in terms of optimizations (e.g., an intrinsic store + an intrinsic trunc are not automatically converted into an intrinsic truncated store). The backend must then be instructed to perform such optimizations, which are already performed on non-intrinsic instructions (e.g., store + trunc is already converted into a truncated store).

Instead of re-inventing the wheel, and since the backend would nonetheless have to be modified in order to support optimizations on intrinsics, I would rather prefer to insert some sort of mechanism to support metadata attachment as first-class elements of the IR/MIR, with automatic merging of metadata, for instance.

----

@Chris

I may be wrong (in that case, please correct me), but if I understood correctly, source-level debugging metadata are "external" (i.e., not a first-class element of the llvm-ir), and their management involves a great effort.

As described above, in my project I used metadata as first-class elements of the IR/MIR; I found this approach more immediate and simpler to handle, although some passes and transformations must be modified.

I agree with you, then, that metadata info should be a first-class element of the IR/MIR (or, at least, "packed" into a structure that is a first-class part of the IR/MIR).

----

In any case, I wonder whether metadata at the codegen level is actually something the community would benefit from (thus justifying a potentially huge and/or long series of patches), or something in which only a small group would be interested.

Cheers

-- Lorenzo
Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:

>> As with IR-level metadata, there should be no guarantee that metadata is
>> preserved and that it's a best-effort thing. In other words, relying on
>> metadata for correctness is probably not the thing to do.
>
> Ok, I made a mistake stating that metadata should be *preserved*; what
> I really meant is to preserve the *information* that such metadata
> represent.

We do have one way of doing that now that's nearly foolproof in terms of accidental loss: intrinsics. Intrinsics AFAIK are never just deleted and have to be explicitly handled at some point. Intrinsics may not work well for your use-case for a variety of reasons but they are an option.

I'm mostly just writing this to get thoughts in my head organized. :)

>> * By "it" I mean communicate information down to late phases of codegen.
>>   I don't have a "metadata in codegen" patch as such. I simply cobbled
>>   something together in our downstream fork that works for some very
>>   specific use-cases.
>
> I know what you have been through, and I can only agree with you: for
> the project I mentioned above, I had to perform several changes to the
> whole IR lowering phase in order to correctly propagate high-level
> information; it wasn't cheap and it required a lot of effort.

I know your pain. :)

>>> It might be possible to have a dedicated data-structure for such
>>> metadata info, and an instance of such structure assigned to each
>>> instruction.
>>
>> I'm not entirely sure what you mean by this.
>
> I was imagining a per-instruction data structure collecting the metadata
> info related to that specific instruction, instead of having several
> pieces of metadata info directly embedded in each instruction.

Interesting. At the IR level metadata isn't necessarily unique, though it can be made so. If multiple pieces of information were amalgamated into one structure that might reduce the ability to share the in-memory representation, which has a cost. I like the ability of IR metadata to be very flexible while at the same time being relatively cheap in terms of resource utilization.

I don't always like that IR metadata is not scoped. It makes it more difficult to process the IR for a Function in isolation. But that's a relatively minor quibble for me. It's a tradeoff between convenience and resource utilization.
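To illustrate the sharing concern with a minimal sketch (!secure is a made-up attachment kind, purely for illustration): metadata nodes are uniqued by content, so every instruction carrying the same annotation points at a single shared node:

    store i32 %a, i32* %p, !secure !0
    store i32 %b, i32* %q, !secure !0

    ; One MDNode serves any number of attachments:
    !0 = !{!"sensitive"}

Amalgamating several kinds of information into one per-instruction structure would tend to make each structure unique to its instruction, which is where the extra memory cost would come from.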
>> That's a great use-case. I do wonder about your use of "essential"
>> though.
>
> With *essential* I mean fundamental for satisfying a specific target
> security property.
>
>> Is it needed for correctness? If so an intrinsics-based solution
>> may be better.
>
> Uhm... it might sound like a naive question, but what do you mean by
> *correctness*?

I mean will the compiler generate incorrect code or otherwise violate some contract. In your secure compilation example, if the compiler *promises* that the generated code will be "secure" then that's a contract that would be violated if the metadata were lost.

> I employed intrinsics as a means for carrying metadata but, from my
> experience, I am not sure they can be considered a valid alternative:
>
> - For each llvm-ir instruction employed in my project (e.g., store),
>   a semantically equivalent intrinsic is declared, with particular
>   parameters representing metadata (i.e., first-class metadata are
>   represented by specific intrinsic parameters).
>
> - During the lowering, each ad-hoc intrinsic must be properly
>   handled, manually adding the proper legalization operations, DAG
>   combinations and so on.
>
> - During MIR conversion of the llvm-ir (i.e., mapping intrinsics to
>   pseudo-instructions), metadata are passed to the MIR representation
>   of the program.
>
> In particular, the second point raises a critical problem in terms of
> optimizations (e.g., an intrinsic store + an intrinsic trunc are not
> automatically converted into an intrinsic truncated store). The
> backend must then be instructed to perform such optimizations, which
> are already performed on non-intrinsic instructions (e.g., store +
> trunc is already converted into a truncated store).

Gotcha. That certainly is a lot of burden. Do the intrinsics *have to* mirror the existing instructions exactly or could a more generic intrinsic be defined that took some data as an argument, for example a pointer to a static string? Then each intrinsic instance could reference a static string unique to its context.

I have not really thought this through, just throwing out ideas in a devil's advocate sort of way.

In my case using intrinsics would have to tie the intrinsic to the instruction it is annotating. This seems similar to your use-case. This is straightforward to do if everything is SSA but once we've gone beyond that things get a lot more complicated. The mapping of information to specific instructions really does seem like the most difficult bit.
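As a sketch of that generic form (the intrinsic name is hypothetical and would need its own Intrinsics.td entry; the existing llvm.var.annotation family is similar in spirit), which also shows why the mapping really is the hard part:

    @.str = private constant [7 x i8] c"secure\00"

    declare void @llvm.annotate.marker(i8*)

    define void @g(i32 %v, i32* %p) {
      store i32 %v, i32* %p
      ; The annotation refers to "the preceding store" only by position;
      ; any pass that moves, deletes, or duplicates one without the
      ; other silently breaks the association.
      call void @llvm.annotate.marker(i8* getelementptr inbounds ([7 x i8], [7 x i8]* @.str, i32 0, i32 0))
      ret void
    }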
> Instead of re-inventing the wheel, and since the backend would
> nonetheless have to be modified in order to support optimizations on
> intrinsics, I would rather prefer to insert some sort of mechanism to
> support metadata attachment as first-class elements of the IR/MIR,
> with automatic merging of metadata, for instance.

Can you explain a bit more what you mean by "first-class?"

> In any case, I wonder whether metadata at the codegen level is
> actually something the community would benefit from (thus justifying
> a potentially huge and/or long series of patches), or something in
> which only a small group would be interested.

I would also like to know this. Have others found the need to convey information down to codegen and if so, what approaches were considered and tried?

Maybe this is a niche requirement but I really don't think it is. I think it more likely that various hacks/modifications have been made over the years to sufficiently approximate a desired outcome and that this has led to not insignificant technical debt.

Or maybe I just think that because I've worked on a 40-year-old compiler for my entire career. :)

-David

On 07/08/20 at 22:54, David Greene wrote:

> Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> As with IR-level metadata, there should be no guarantee that metadata is
>>> preserved and that it's a best-effort thing. In other words, relying on
>>> metadata for correctness is probably not the thing to do.
>>
>> Ok, I made a mistake stating that metadata should be *preserved*; what
>> I really meant is to preserve the *information* that such metadata
>> represent.
>
> We do have one way of doing that now that's nearly foolproof in terms of
> accidental loss: intrinsics. Intrinsics AFAIK are never just deleted
> and have to be explicitly handled at some point. Intrinsics may not
> work well for your use-case for a variety of reasons but they are an
> option.
>
> I'm mostly just writing this to get thoughts in my head organized. :)

The only problem with intrinsics, for me, was the need to mirror the already existing instructions. As you pointed out, if there's a way to map intrinsics to instructions, there would be no reason to mirror the latter, and one could just use the former to carry metadata.

>>>> It might be possible to have a dedicated data-structure for such
>>>> metadata info, and an instance of such structure assigned to each
>>>> instruction.
>>>
>>> I'm not entirely sure what you mean by this.
>>
>> I was imagining a per-instruction data structure collecting the metadata
>> info related to that specific instruction, instead of having several
>> pieces of metadata info directly embedded in each instruction.
>
> Interesting. At the IR level metadata isn't necessarily unique, though
> it can be made so. If multiple pieces of information were amalgamated
> into one structure that might reduce the ability to share the in-memory
> representation, which has a cost. I like the ability of IR metadata to
> be very flexible while at the same time being relatively cheap in terms
> of resource utilization.
>
> I don't always like that IR metadata is not scoped. It makes it more
> difficult to process the IR for a Function in isolation. But that's a
> relatively minor quibble for me. It's a tradeoff between convenience
> and resource utilization.

Uhm... could I ask you to elaborate a bit more on the "limitation on in-memory representation sharing"? It is not clear to me how this would cause a problem.

>>> That's a great use-case. I do wonder about your use of "essential"
>>> though.
>>
>> With *essential* I mean fundamental for satisfying a specific target
>> security property.
>>
>>> Is it needed for correctness? If so an intrinsics-based solution
>>> may be better.
>>
>> Uhm... it might sound like a naive question, but what do you mean by
>> *correctness*?
>
> I mean will the compiler generate incorrect code or otherwise violate
> some contract. In your secure compilation example, if the compiler
> *promises* that the generated code will be "secure" then that's a
> contract that would be violated if the metadata were lost.

You got the point: if no metadata are provided, or they are lost, the codegen phase is not able to fulfill the contract (in my use case, to generate code that is "secure").

>> I employed intrinsics as a means for carrying metadata but, from my
>> experience, I am not sure they can be considered a valid alternative:
>>
>> - For each llvm-ir instruction employed in my project (e.g., store),
>>   a semantically equivalent intrinsic is declared, with particular
>>   parameters representing metadata (i.e., first-class metadata are
>>   represented by specific intrinsic parameters).
>>
>> - During the lowering, each ad-hoc intrinsic must be properly
>>   handled, manually adding the proper legalization operations, DAG
>>   combinations and so on.
>>
>> - During MIR conversion of the llvm-ir (i.e., mapping intrinsics to
>>   pseudo-instructions), metadata are passed to the MIR representation
>>   of the program.
>>
>> In particular, the second point raises a critical problem in terms of
>> optimizations (e.g., an intrinsic store + an intrinsic trunc are not
>> automatically converted into an intrinsic truncated store). The
>> backend must then be instructed to perform such optimizations, which
>> are already performed on non-intrinsic instructions (e.g., store +
>> trunc is already converted into a truncated store).
>
> Gotcha. That certainly is a lot of burden. Do the intrinsics *have to*
> mirror the existing instructions exactly or could a more generic
> intrinsic be defined that took some data as an argument, for example a
> pointer to a static string? Then each intrinsic instance could
> reference a static string unique to its context.
>
> I have not really thought this through, just throwing out ideas in a
> devil's advocate sort of way.

I like brainstorming ;)

> In my case using intrinsics would have to tie the intrinsic to the
> instruction it is annotating. This seems similar to your use-case.
> This is straightforward to do if everything is SSA but once we've gone
> beyond that things get a lot more complicated. The mapping of
> information to specific instructions really does seem like the most
> difficult bit.

No, intrinsics do not have to mirror existing instructions; yes, they can be used just to carry around specific data as arguments. Nonetheless, there we have our (implementation) problem: how to map info (e.g., intrinsics) to instructions, and vice versa? I am really curious how you would perform it in the pre-RA phase :)

>> Instead of re-inventing the wheel, and since the backend would
>> nonetheless have to be modified in order to support optimizations on
>> intrinsics, I would rather prefer to insert some sort of mechanism to
>> support metadata attachment as first-class elements of the IR/MIR,
>> with automatic merging of metadata, for instance.
>
> Can you explain a bit more what you mean by "first-class?"

Never mind, I used the wrong terminology: I just meant to directly embed metadata in the IR/MIR.

>> In any case, I wonder whether metadata at the codegen level is
>> actually something the community would benefit from (thus justifying
>> a potentially huge and/or long series of patches), or something in
>> which only a small group would be interested.
>
> I would also like to know this. Have others found the need to convey
> information down to codegen and if so, what approaches were considered
> and tried?
>
> Maybe this is a niche requirement but I really don't think it is. I
> think it more likely that various hacks/modifications have been made
> over the years to sufficiently approximate a desired outcome and that
> this has led to not insignificant technical debt.
>
> Or maybe I just think that because I've worked on a 40-year-old compiler
> for my entire career. :)
>
> -David

Best regards,

Lorenzo