thr3ads.net - llvm dev - [llvm-dev] Metadata in LLVM back-end [Aug 2020]

Lorenzo Casalino via llvm-dev

2020-Jul-29 07:33 UTC

[llvm-dev] Metadata in LLVM back-end

>> On Jul 27, 2020, at 10:11 AM, David Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Son Tuan VU via llvm-dev <llvm-dev at lists.llvm.org> writes:
>>
>>> Currently metadata (other than debug info) can be attached to IR
>>> instructions but disappears during DAG selection.
>>>
>>> My question is why we do not keep the metadata during code lowering and
>>> then attach to MachineInstr, just as for IR instructions? Is there any
>>> technical challenge, or is it only because nobody wants to do so?
>>
>> I have wanted codegen metadata for a very long time so I'm interested to
>> hear the history behind this choice, and more importantly, whether
>> adding such capability would be generally acceptable to the community.
>
> The first questions need to be “what does it mean?”, “how does it work?”,
> and “what is it useful for?”.  It is hard to evaluate a proposal without that.

Hi everyone,

I'll try to answer each of these questions; the answers are unlikely to be
exhaustive, but I hope they will serve as a starting point for an interesting
proposal (from my point of view and that of Son Tuan VU and David Greene):

- "What does it mean?": it means preserving specific information, represented
  as metadata assigned to instructions, from the IR level down to the codegen
  phases.

- "How does it work?": metadata should be preserved during the several
  back-end transformations; for instance, during the lowering phase,
  DAGCombine performs several optimizations on the IR, potentially combining
  several instructions. The new instruction should then be assigned metadata
  obtained as a proper combination of the original ones (e.g., a union of
  metadata information).

  It might be possible to have a dedicated data structure for such metadata
  info, with an instance of that structure assigned to each instruction.

- "What is it useful for?": I think it is quite context-specific; but, in
  general, it is useful when some "higher-level" information (e.g., information
  that can be discovered only before the back-end stage of the compiler) is
  required in the back-end to perform "semantic"-related optimizations.
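
To make the "union of metadata" idea concrete, here is a minimal,
self-contained C++ sketch. It is only an illustration under stated
assumptions: `ToyInstr`, `Annotations`, `mergeUnion` and `combine` are
hypothetical names invented for this example and are not part of any LLVM
API, and metadata is modelled as a plain set of strings.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Hypothetical toy types for illustration only; none of this is LLVM API.
using Annotations = std::set<std::string>;

struct ToyInstr {
  std::string opcode;
  Annotations notes;  // information carried down from the IR level
};

// Merge policy mentioned in the text: the union of both annotation sets.
Annotations mergeUnion(const Annotations &a, const Annotations &b) {
  Annotations out = a;
  out.insert(b.begin(), b.end());
  return out;
}

// Stand-in for a DAGCombine-style fold that replaces two instructions
// with one; the result inherits the merged annotations of both inputs.
ToyInstr combine(std::string newOpcode, const ToyInstr &x, const ToyInstr &y) {
  assert(!newOpcode.empty() && "the combined instruction needs an opcode");
  return ToyInstr{std::move(newOpcode), mergeUnion(x.notes, y.notes)};
}
```

Union is only one possible policy; for some kinds of information an
intersection, or dropping the annotation entirely, may be the safer choice.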

To give a (quite generic) example where such codegen metadata may be useful:
in the field of "secure compilation", preserving security properties across
the compilation phases is essential; such properties are specified in the
high-level specification of the program and may be expressed with IR metadata.
Keeping such IR metadata in the codegen phases may allow the preservation of
properties that would otherwise be invalidated by those phases.


Cheers,
-- Lorenzo

> Metadata isn’t free - it must be maintained or invalidated for it to be
> useful.  The details on that dramatically shape whether it can be used for any
> given purpose.
>
> -Chris

David Greene via llvm-dev

2020-Jul-31 20:47 UTC

Thanks for keeping this going, Lorenzo.

Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:
>> The first questions need to be “what does it mean?”, “how does it
>> work?”, and “what is it useful for?”.  It is hard to evaluate a
>> proposal without that.
>
> Hi everyone,
>
> - "What does it mean?": it means to preserve specific information,
> represented as metadata assigned to instructions, from the IR level,
> down to the codegen phases.

An important part of the definition is "how late?"  For my particular
uses it would be right up until lowering of asm pseudo-instructions,
even after regalloc and scheduling.  I don't know whether someone might
need metadata even later than that (at asm/obj emission time?) but if
metadata is supported on Machine IR then it shouldn't be an issue.

As with IR-level metadata, there should be no guarantee that metadata is
preserved and that it's a best-effort thing.  In other words, relying on
metadata for correctness is probably not the thing to do.

> - "How does it work?": metadata should be preserved during the several
> back-end transformations; for instance, during the lowering phase,
> DAGCombine performs several optimization to the IR, potentially
> combining several instructions. The new instruction should, then,
> assigned with metadata obtained as a proper combination of the
> original ones (e.g., a union of metadata information).

I want to make it clear that this is expensive to do, in that the number
of changes to the codegen pipeline is quite extensive and widespread.  I
know because I've done it*.  :)  It will help if there are utilities
people can use to merge metadata during DAG transformation and the more
we make such transfers and combinations "automatic" the easier it will
be to preserve metadata.

Once the mechanisms are there it also takes effort to keep them going.
For example if a new DAG transformation is done people need to think
about metadata.  This is where "automatic" help makes a real
difference.

* By "it" I mean communicate information down to late phases of codegen.
I don't have a "metadata in codegen" patch as such.  I simply cobbled
something together in our downstream fork that works for some very
specific use-cases.
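
One way to picture the "automatic" help mentioned above is a single
rewriting entry point that merges hints itself, so that individual
transformations never touch them. The sketch is a toy under stated
assumptions: `HintedNode`, `Rewriter`, `create` and `replace` are
hypothetical names and do not correspond to LLVM's SelectionDAG interfaces.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; this is not LLVM's SelectionDAG API.
class HintedNode {
  std::string opcode_;
  std::set<std::string> hints_;
  friend class Rewriter;  // only the rewriter is allowed to set hints
  HintedNode(std::string opcode, std::set<std::string> hints)
      : opcode_(std::move(opcode)), hints_(std::move(hints)) {}

public:
  const std::string &opcode() const { return opcode_; }
  const std::set<std::string> &hints() const { return hints_; }
};

class Rewriter {
  std::vector<std::unique_ptr<HintedNode>> owned_;  // owns every node

public:
  // Create a fresh node with explicitly supplied hints.
  const HintedNode *create(std::string opcode, std::set<std::string> hints) {
    owned_.push_back(std::unique_ptr<HintedNode>(
        new HintedNode(std::move(opcode), std::move(hints))));
    return owned_.back().get();
  }

  // The only way to replace nodes: the hints of everything consumed are
  // merged here, so a new transformation cannot forget to do it.
  const HintedNode *replace(const std::vector<const HintedNode *> &consumed,
                            std::string newOpcode) {
    std::set<std::string> merged;
    for (const HintedNode *n : consumed) {
      assert(n != nullptr && "consumed node must not be null");
      merged.insert(n->hints_.begin(), n->hints_.end());
    }
    return create(std::move(newOpcode), std::move(merged));
  }
};
```

Because the constructor of `HintedNode` is private, a transformation written
later has no way to build a node that bypasses the merge.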

> It might be possible to have a dedicated data-structure for such
> metadata info, and an instance of such structure assigned to each
> instruction.

I'm not entirely sure what you mean by this.

> - "What is it useful for?": I think it is quite context-specific; but,
> in general, it is useful when some "higher-level" information
> (e.g., that can be discovered only before the back-end stage of the
> compiler) are required in the back-end to perform "semantic"-related
> optimizations.

That's my use-case.  There's semantic information codegen would like to
know but is really much more practical to discover at the LLVM IR level
or even passed from the frontend.  Much information is lost by the time
codegen is hit and it's often impractical or impossible for codegen to
derive it from first principles.

> To give an (quite generic) example where such codegen metadata may be
> useful: in the field of "secure compilation", preservation of security
> properties during the compilation phases is essential; such properties
> are specified in the high-level specifications of the program, and may
> be expressed with IR metadata. The possibility to keep such IR
> metadata in the codegen phases may allow preservation of properties
> that may be invalidated by codegen phases.

That's a great use-case.  I do wonder about your use of "essential"
though.  Is it needed for correctness?  If so an intrinsics-based
solution may be better.

My use-cases mostly revolve around communication with a proprietary
frontend and thus aren't useful to the community, which is why I haven't
pursued this with any great vigor before this.

I do have uses that convey information from LLVM analyses but
unfortunately I can't share them for now.

All of my use-cases are related to optimization.  No "metadata" is
needed for correctness.

I have pondered whether intrinsics might work for my use-cases.  My fear
with intrinsics is that they will interfere with other codegen analyses
and transformations.  For example they could be a scheduling barrier.

I also have wondered about how intrinsics work within SelectionDAG.  Do
they impact dagcombine and other transformations?  The reason I call out
SelectionDAG specifically is that most of our downstream changes related
to conveying information are in DAG-related files (dagcombine, legalize,
etc.).  Perhaps intrinsics could suffice for the purposes of getting
metadata through SelectionDAG with conversion to "first-class" metadata
at the Machine IR level.  Maybe this is even an intermediate step toward
"full metadata" throughout the compilation.

                -David

Chris Lattner via llvm-dev

2020-Aug-02 19:37 UTC

Thanks Lorenzo,

I was looking for a ‘one level deeper’ analysis of how this works.

The issue is this: either information is preserved across certain sorts of
transformations or it is not.  If not, it either goes stale (problematic for
anything that looks at it later) or is invalidated/removed.

The fundamental issue in IR design is factoring the representation of
information from the code that needs to inspect and update it.  “Metadata”
designs try to make it easy to add out of band information to the IR in various
ways, with a goal of reducing the impact on the rest of the compiler.

However, I’ve never seen them work out well.  Either the data becomes stale, or
you end up changing a lot of the compiler to support it.  Look at debug info
metadata in LLVM for example, it has both problems :-).  This is why MLIR has
moved to make source location information and attributes a first class part of
the IR.
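
The stale-versus-invalidated trade-off can be illustrated with a small toy
model. It is a sketch under stated assumptions, not a description of how LLVM
or MLIR work: `VersionedInstr` and its member functions are hypothetical
names, and a simple version counter stands in for "the code changed".

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical toy model; not an LLVM or MLIR data structure.
class VersionedInstr {
  std::string opcode_;
  std::string note_;          // out-of-band information
  unsigned version_ = 0;      // bumped on every rewrite of the instruction
  unsigned noteVersion_ = 0;  // version the note was last known to match
  bool hasNote_ = false;

public:
  explicit VersionedInstr(std::string opcode) : opcode_(std::move(opcode)) {}

  void attach(std::string note) {
    assert(!note.empty() && "empty notes are not modelled");
    note_ = std::move(note);
    noteVersion_ = version_;
    hasNote_ = true;
  }

  // Rewrite by code that knows nothing about the note: the note is
  // implicitly invalidated because its version no longer matches.
  void rewrite(std::string newOpcode) {
    opcode_ = std::move(newOpcode);
    ++version_;
  }

  // Rewrite by code that was taught to keep the note correct.
  void rewritePreserving(std::string newOpcode) {
    rewrite(std::move(newOpcode));
    noteVersion_ = version_;
  }

  // Readers only ever see a note that matches the current instruction;
  // a mismatched note is reported as absent rather than returned stale.
  const std::string *freshNote() const {
    return (hasNote_ && noteVersion_ == version_) ? &note_ : nullptr;
  }
};
```

The note survives only through `rewritePreserving`, which is exactly the
"changing a lot of the compiler" cost: every rewrite that should keep the
information has to be updated to say so.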

-Chris

Lorenzo Casalino via llvm-dev

2020-Aug-06 14:47 UTC

On 31/07/20 at 22:47, David Greene wrote:

@David

> Thanks for keeping this going, Lorenzo.
>
> Lorenzo Casalino via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>>> The first questions need to be “what does it mean?”, “how does it
>>> work?”, and “what is it useful for?”.  It is hard to evaluate a
>>> proposal without that.
>> Hi everyone,
>>
>> - "What does it mean?": it means to preserve specific information,
>> represented as metadata assigned to instructions, from the IR level,
>> down to the codegen phases.
> An important part of the definition is "how late?"  For my particular
> uses it would be right up until lowering of asm pseudo-instructions,
> even after regalloc and scheduling.  I don't know whether someone might
> need metadata even later than that (at asm/obj emission time?) but if
> metadata is supported on Machine IR then it shouldn't be an issue.

"How late" is context-specific: in my case too, I required such information
to be preserved until pseudo-instruction expansion. Conservatively, it could
be preserved until the last pass of the codegen pipeline.

Regarding its use in the later steps, I would not say it is not required,
since I worked on a specific topic of secure compilation and do not have the
whole picture in mind; nonetheless, it would be possible to test how things
work out with codegen first and reason about future developments later.

> As with IR-level metadata, there should be no guarantee that metadata is
> preserved and that it's a best-effort thing.  In other words, relying on
> metadata for correctness is probably not the thing to do.

OK, I made a mistake in stating that metadata should be *preserved*; what
I really meant is to preserve the *information* that such metadata
represents.

>> - "How does it work?": metadata should be preserved during the several
>> back-end transformations; for instance, during the lowering phase,
>> DAGCombine performs several optimization to the IR, potentially
>> combining several instructions. The new instruction should, then,
>> assigned with metadata obtained as a proper combination of the
>> original ones (e.g., a union of metadata information).
> I want to make it clear that this is expensive to do, in that the number
> of changes to the codegen pipeline is quite extensive and widespread.  I
> know because I've done it*.  :)  It will help if there are utilities
> people can use to merge metadata during DAG transformation and the more
> we make such transfers and combinations "automatic" the easier it will
> be to preserve metadata.
>
> Once the mechanisms are there it also takes effort to keep them going.
> For example if a new DAG transformation is done people need to think
> about metadata.  This is where "automatic" help makes a real difference.
>
> * By "it" I mean communicate information down to late phases of codegen.
> I don't have a "metadata in codegen" patch as such.  I simply cobbled
> something together in our downstream fork that works for some very
> specific use-cases.

I know what you have been through, and I can only agree with you: for the
project I mentioned above, I had to make several changes to the whole IR
lowering phase in order to correctly propagate high-level information;
it wasn't cheap and required a lot of effort.

>> It might be possible to have a dedicated data-structure for such
>> metadata info, and an instance of such structure assigned to each
>> instruction.
> I'm not entirely sure what you mean by this.

I was imagining a per-instruction data-structure collecting metadata info
related to that specific instruction, instead of having several metadata info
directly embedded in each instruction.
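
A minimal sketch of that layout, under stated assumptions: `InfoBundle` and
`BundledInstr` are hypothetical names, entries are modelled as string
key/value pairs, and nothing here reflects how LLVM actually lays out its
instruction classes.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical names throughout; not LLVM's instruction representation.
struct InfoBundle {
  std::map<std::string, std::string> entries;  // kind -> value
};

struct BundledInstr {
  std::string opcode;
  // One optional bundle per instruction instead of several embedded
  // fields; it stays null for the common, unannotated case.
  std::unique_ptr<InfoBundle> info;

  explicit BundledInstr(std::string op) : opcode(std::move(op)) {}

  // Record a piece of information, allocating the bundle on first use.
  void set(const std::string &kind, const std::string &value) {
    assert(!kind.empty() && "an entry needs a kind");
    if (!info)
      info.reset(new InfoBundle());
    info->entries[kind] = value;
  }

  // Look up a piece of information; returns nullptr when it is absent.
  const std::string *get(const std::string &kind) const {
    if (!info)
      return nullptr;
    auto it = info->entries.find(kind);
    return it == info->entries.end() ? nullptr : &it->second;
  }
};
```

With this shape an unannotated instruction pays for one pointer, and new
kinds of information can be added without changing the instruction type.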

>> - "What is it useful for?": I think it is quite context-specific; but,
>> in general, it is useful when some "higher-level" information
>> (e.g., that can be discovered only before the back-end stage of the
>> compiler) are required in the back-end to perform "semantic"-related
>> optimizations.
> That's my use-case.  There's semantic information codegen would like to
> know but is really much more practical to discover at the LLVM IR level
> or even passed from the frontend.  Much information is lost by the time
> codegen is hit and it's often impractical or impossible for codegen to
> derive it from first principles.
>
>> To give an (quite generic) example where such codegen metadata may be
>> useful: in the field of "secure compilation", preservation of security
>> properties during the compilation phases is essential; such properties
>> are specified in the high-level specifications of the program, and may
>> be expressed with IR metadata. The possibility to keep such IR
>> metadata in the codegen phases may allow preservation of properties
>> that may be invalidated by codegen phases.
> That's a great use-case.  I do wonder about your use of "essential"
> though.

By *essential* I mean fundamental for satisfying a specific target
security property.

> Is it needed for correctness?  If so an intrinsics-based
> solution may be better.

Hmm... it might sound like a naive question, but what do you mean by
*correctness*?

> My use-cases mostly revolve around communication with a proprietary
> frontend and thus aren't useful to the community, which is why I haven't
> pursued this with any great vigor before this.
>
> I do have uses that convey information from LLVM analyses but
> unfortunately I can't share them for now.
>
> All of my use-cases are related to optimization.  No "metadata" is
> needed for correctness.
>
> I have pondered whether intrinsics might work for my use-cases.  My fear
> with intrinsics is that they will interfere with other codegen analyses
> and transformations.  For example they could be a scheduling barrier.
>
> I also have wondered about how intrinsics work within SelectionDAG.  Do
> they impact dagcombine and other transformations?  The reason I call out
> SelectionDAG specifically is that most of our downstream changes related
> to conveying information are in DAG-related files (dagcombine, legalize,
> etc.).  Perhaps intrinsics could suffice for the purposes of getting
> metadata through SelectionDAG with conversion to "first-class" metadata
> at the Machine IR level.  Maybe this is even an intermediate step toward
> "full metadata" throughout the compilation.

I employed intrinsics as a means of carrying metadata but, in my
experience, I am not sure they are a valid alternative:

 - For each LLVM IR instruction employed in my project (e.g., store), a
   semantically equivalent intrinsic is declared, with particular parameters
   representing metadata (i.e., first-class metadata are represented by
   specific intrinsic parameters).

 - During lowering, each ad-hoc intrinsic must be properly handled, manually
   adding the proper legalization operations, DAG combinations and so on.

 - During MIR conversion of the LLVM IR (i.e., mapping intrinsics to
   pseudo-instructions), metadata are passed to the MIR representation of
   the program.

In particular, the second point raises a critical problem in terms of
optimizations (e.g., intrinsic store + intrinsic trunc are not automatically
converted into an intrinsic truncated store). The backend must then be
instructed to perform such optimizations, which are already performed on
non-intrinsic instructions (e.g., store + trunc is already converted into a
truncated store).

Instead of reinventing the wheel, and since the backend would have to be
modified anyway in order to support optimizations on intrinsics, I would
rather add some sort of mechanism to support metadata attachment as
first-class elements of the IR/MIR, along with, for instance, automatic
merging of metadata.
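
The store + trunc problem can be reproduced in a toy combiner. This is a
sketch under stated assumptions: `Op`, `Expr` and `foldTruncStore` are
hypothetical names, the intrinsic names `my.store` and `my.trunc` used with
it are made up, and the matching logic only imitates in spirit what a real
DAG combine does.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical toy expression tree; not LLVM's SelectionDAG.
enum class Op { Const, Trunc, Store, TruncStore, IntrinsicCall };

struct Expr {
  Op op;
  std::string intrinsicName;    // meaningful only when op == IntrinsicCall
  std::shared_ptr<Expr> value;  // the single value operand, if any
};

// Fold store(trunc(x)) into truncstore(x). The pattern is keyed on the
// built-in opcodes, so anything wrapped in an intrinsic call is opaque
// to it and is returned unchanged.
std::shared_ptr<Expr> foldTruncStore(const std::shared_ptr<Expr> &e) {
  assert(e && "expression must not be null");
  if (e->op == Op::Store && e->value && e->value->op == Op::Trunc)
    return std::make_shared<Expr>(Expr{Op::TruncStore, "", e->value->value});
  return e;  // no match: hand back the original expression
}
```

A plain `Store` of a `Trunc` is folded, while the same computation expressed
as an `IntrinsicCall` named `my.store` over one named `my.trunc` is left
alone until someone writes a second, intrinsic-aware pattern.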

----

@Chris

I may be wrong (in which case, please correct me), but if I understood
correctly, source-level debugging metadata are "external" (i.e., not a
first-class element of the LLVM IR), and managing them involves a great
deal of effort.

As described above, in my project I used metadata as first-class elements of
the IR/MIR; I found this approach more direct and simpler to handle, although
some passes and transformations must be modified.

So I agree with you that metadata information should be a first-class element
of the IR/MIR (or, at least, "packed" into a structure that is a first-class
part of the IR/MIR).

----

In any case, I wonder whether metadata at the codegen level is actually
something the community would benefit from (thus justifying a potentially
huge and/or long series of patches), or something in which only a small
group would be interested.


Cheers
-- Lorenzo

David Greene via llvm-dev

2020-Aug-07 21:09 UTC

Chris Lattner via llvm-dev <llvm-dev at lists.llvm.org> writes:
> The issue is this: either information is preserved across certain
> sorts of transformations or it is not.  If not, it either goes stale
> (problematic for anything that looks at it later) or is
> invalidated/removed.
>
> The fundamental issue in IR design is factoring the representation of
> information from the code that needs to inspect and update it.
> “Metadata” designs try to make it easy to add out of band information
> to the IR in various ways, with a goal of reducing the impact on the
> rest of the compiler.
>
> However, I’ve never seen them work out well.  Either the data becomes
> stale, or you end up changing a lot of the compiler to support it.
> Look at debug info metadata in LLVM for example, it has both problems
> :-).  This is why MLIR has moved to make source location information
> and attributes a first class part of the IR.
I basically agree with your analysis.  Some information is so pervasive
that it really should be a part of the IR proper.  But other information
may not be.  The kind of information I'm thinking of basically boils
down to optimization hints.  It's fine and semantically sound to drop
it, though not ideal if it can be avoided.

I see debug info as being in a quite different class.  With the -g
option we are making a promise to our users.  So using a mechanism that
by design doesn't make promises seems a poor fit.

A long long time ago in the dark ages before git and Phabricator I
submitted a patch for review that would have added comment information
to machine instructions.  It was basically a string member on every
MachineInstr.  At the time it was deemed too expensive and rightly so.
Instead I ended up adding some flag values that the AsmPrinter uses as a
hint to generate various comments.  I'm still not very happy with that
"solution" and a more general-purpose mechanism for annotating
IR/SelectionDAG/MIR objects would be quite welcome.
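
For illustration, the flag-based compromise might look roughly like the toy
below. It is a sketch under stated assumptions: `CommentHint`,
`FlaggedInstr` and `printWithComments` are hypothetical names, and the real
MachineInstr and AsmPrinter interfaces are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical flag names; they do not mirror LLVM's actual flag values.
enum CommentHint : std::uint8_t {
  NoHint = 0,
  SpillHint = 1,
  ReloadHint = 2,
};

struct FlaggedInstr {
  std::string opcode;
  std::uint8_t hints;  // a few bits per instruction instead of a string
};

// Stand-in for a printer that turns hint bits into trailing comments.
std::string printWithComments(const FlaggedInstr &mi) {
  assert(!mi.opcode.empty() && "instruction needs an opcode");
  std::string line = mi.opcode;
  if (mi.hints & SpillHint)
    line += "  # spill";
  if (mi.hints & ReloadHint)
    line += "  # reload";
  return line;
}
```

The cost per instruction is small, but every new kind of comment needs a new
bit and a new case in the printer, which is the limitation described above.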

A generic first-class annotation construct would cover both use-cases.
If you and the wider community are open to adding first-class generic
information annotation, I'm eager to work on it!

               -David
