thr3ads.net - llvm dev - [llvm-dev] Metadata in LLVM back-end [Jun 2021]

Matt Morehouse via llvm-dev

2021-Jun-16 20:42 UTC

[llvm-dev] Metadata in LLVM back-end

Thanks for the update, Lorenzo.

I have some free time to work on an RFC, but I'm unfamiliar with how the
implementation details would work.

If I dig through this thread and try to draft something, would you and/or
Son be willing to contribute?

Thanks,
Matt

On Wed, Jun 16, 2021 at 12:02 PM Lorenzo Casalino <
lorenzo.casalino93 at gmail.com> wrote:
> Hello Matt,
>
> I think that the RFC drafting went stale some months ago due to the heavy
> workload that all the participants were subject to.
>
> As of now, I do not know when the RFC will be actually drafted and sent.
>
> Cheers,
> Lorenzo
>
> On 16 June 2021 at 1:32 AM, Matt Morehouse <mascasa at google.com> wrote:
>
> Did anyone send an RFC for this?
>
> First-class metadata would be exceptionally useful for sanitizers and other
> dynamic tools.  For example, we want to construct PC-keyed metadata tables
> in the binary (without affecting the generated code), to inform program
> behavior at runtime or to allow offline analysis.  A prerequisite is to
> actually propagate the metadata we need from the Clang frontend or LLVM
> middle-end down to the assembly printer.
>
> Our team has brainstormed many use cases:
>
> - *GWP-TSan* <https://youtu.be/2KvaKEyMVEU>:  storing PCs of accesses
>   lowered from C++ atomics, to filter them out from race detection.
>   *  List<atomic access PC>
>
> - *Stack trace compression*:  storing a conservative call graph
>   <https://lists.llvm.org/pipermail/llvm-dev/2021-June/151044.html>, for
>   use in decompressing stack traces offline.
>   *  Map[callsite PC] -> List<callee PC>
>
> - *no_sanitize attributes*:  storing a map of functions that have the
>   no_sanitize("...") attribute to the associated sanitizer, for filtering
>   out from GWP-*San.  Ideally we do not introduce new no_sanitize string
>   literals, but simply rely on existing ones (e.g. a no_sanitize("thread")
>   works for both TSan and GWP-TSan).
>   *  Map[Func] -> SanitizerKind
>
> - *Fuzzing aid/CFG reconstruction*:  marking coverage PCs as function
>   entry/exit or # of outgoing edges from BB (allows finding gaps in the
>   coverage frontier).
>
> - *Type-aware malloc and heap profiling*:  enable the allocator to get the
>   type for a given new call, to optimize for expected usage of the
>   allocation.
>   *  Map[new callsite PC] -> object type
>
> - *Other*:  potential use cases for future bug-finding tools (GWP-assert,
>   GWP-MSan, GWP-DFSan, GWP-UBSan).
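
The table shapes listed above (List<PC>, Map[PC] -> payload) can be pictured with a small stand-alone model. This is only a sketch in plain Python, not LLVM code: the PCTable class, the addresses, and the is_reportable_race helper are all hypothetical names invented for illustration, and the sorted-array-plus-binary-search layout is an assumption about how such a table might be stored, not something the thread specifies.

```python
from bisect import bisect_left


class PCTable:
    """Toy PC-keyed side table: PCs kept sorted, payloads stored alongside."""

    def __init__(self, entries):
        # entries: iterable of (pc, payload) pairs in any order.
        ordered = sorted(entries, key=lambda e: e[0])
        self.pcs = [pc for pc, _ in ordered]
        self.payloads = [payload for _, payload in ordered]

    def lookup(self, pc):
        """Binary-search for an exact PC; return its payload, or None."""
        i = bisect_left(self.pcs, pc)
        if i < len(self.pcs) and self.pcs[i] == pc:
            return self.payloads[i]
        return None


# Hypothetical addresses, made up for this example.
# List<atomic access PC>: the payload is just a marker.
atomic_access_pcs = PCTable([(0x401A10, True), (0x401000, True)])

# Map[callsite PC] -> List<callee PC>
call_graph = PCTable([(0x402000, [0x403000, 0x403800])])


def is_reportable_race(pc):
    """Filter in the spirit of the GWP-TSan use case: skip atomic accesses."""
    return atomic_access_pcs.lookup(pc) is None
```

A runtime or offline tool would then consult the table by PC, for example `is_reportable_race(0x401A10)` to decide whether an access should be ignored, without the generated code itself being changed.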
>
> First-class metadata would open the door to some really cool things.
>
> Thanks,
> Matt Morehouse
>
>
> On Wed, Jan 6, 2021 at 5:56 AM Lorenzo Casalino via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Dear Tuan,
>>
>> How are you doing? Did you manage to start the draft for the RFC?
>>
>>
>> I take this opportunity to wish you all the best for this new year :)
>>
>> Best regards,
>> Lorenzo Casalino
>> On 10/11/20 at 09:27, Lorenzo Casalino wrote:
>>
>>
>> On 09/11/20 at 00:30, Son Tuan VU wrote:
>>
>> Hi,
>>
>> Thank you all for keeping this going. Indeed I was not aware that the
>> discussion was going on, I am really sorry for this late reply.
>>
>> Nice to hear from you again! Thank you for starting this thread ;)
>>
>> I understand Chris' point about metadata design. Either the metadata
>> becomes stale or is removed (if we do not teach transformations to
>> preserve it), or we end up modifying many (if not all) transformations to
>> keep the data intact.
>> Currently in the IR, I feel like the default behavior is to ignore/remove
>> the metadata, and only a limited number of transformations know how to
>> maintain and update it, which is a best-effort approach.
>> That being said, my initial thought was to adopt this approach for the
>> MIR, so that we can at least have a minimal mechanism to communicate
>> additional information to various transformations, or even dump it to the
>> asm/object file.
>> In other words, it is the responsibility of the users who introduce/use
>> the metadata in the MIR to teach the transformations they selected how to
>> preserve their metadata. A common API to abstract this would definitely
>> help, just as combineMetadata() from lib/Transforms/Utils/Local.cpp does.
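
The best-effort policy described above (drop by default, keep only what a transformation has been taught to merge) can be sketched as follows. This is a plain-Python model, not the LLVM API and not the actual behavior of combineMetadata(); the function merge_best_effort, the metadata kinds "range", "sec.tag" and "custom", and both merge rules are hypothetical and chosen only to illustrate the idea.

```python
def merge_best_effort(kept, replaced, mergers):
    """Merge the metadata of two instructions when one replaces the other.

    kept, replaced: dicts mapping a metadata kind to its value.
    mergers: dict mapping a kind to a rule f(a, b) that returns the merged
             value, or None when the metadata cannot be kept safely.
    Kinds without a rule are dropped, which is the conservative default.
    """
    result = {}
    for kind, value in kept.items():
        rule = mergers.get(kind)
        if rule is None:
            continue  # nobody taught this transformation about `kind`
        merged = rule(value, replaced.get(kind))
        if merged is not None:
            result[kind] = merged
    return result


# Hypothetical merge rules, invented for this example.
def keep_if_equal(a, b):
    """Keep the value only when both instructions agree on it."""
    return a if a == b else None


def widen_range(a, b):
    """Keep a (lo, hi) range only if both sides have one; take the hull."""
    if b is None:
        return None
    return (min(a[0], b[0]), max(a[1], b[1]))


mergers = {"range": widen_range, "sec.tag": keep_if_equal}

merged = merge_best_effort(
    {"range": (0, 10), "sec.tag": "secret", "custom": 1},
    {"range": (5, 20), "sec.tag": "secret"},
    mergers,
)
```

Here the unknown "custom" kind silently disappears, which is exactly the staleness/removal trade-off under discussion: a user who cares about "custom" has to register a rule for it in every transformation they rely on.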
>>
>> Unfortunately, I have never worked with LLVM-IR metadata (I have focused
>> almost exclusively on the back-end and have only scratched the surface of
>> LLVM's middle-end), but I see your point.
>>
>> Clearly, applying the needed modifications to all the back-end
>> transformations/optimizations is infeasible and, probably, not worth it
>> -- different users may have different requirements/needs regarding a
>> specific pass.
>>
>> I like the idea of a common API to handle the MIR metadata, and let the
>> end user handle such data. Of course, if the community encounters common
>> cases while handling the metadata, such cases may be integrated into the
>> upstream project.
>>
>> Nonetheless, the main point of this thread is to preserve middle-end
>> metadata down to the back-end, right after the Instruction Selection
>> phase. Hence, regardless of the end user's needs, a "preserve-all" policy
>> during the lowering stage is required, which will involve a few changes,
>> in particular in the DAGCombine pass.
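
One way to picture a "preserve-all" policy during lowering: whenever several nodes are folded into one, the new node carries the union of everything its sources carried, and nothing is dropped. The sketch below is a plain-Python model under that assumption. The Node class, the combine helper, and the metadata kinds are hypothetical; they do not reflect SelectionDAG's real data structures or how DAGCombine is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a node produced during lowering."""
    opcode: str
    # Metadata kind -> set of values attached to this node.
    metadata: dict = field(default_factory=dict)


def combine(opcode, *sources):
    """Fold several nodes into one, keeping the union of their metadata."""
    merged = {}
    for node in sources:
        for kind, values in node.metadata.items():
            # Copy into a fresh set so the source nodes are not mutated.
            merged.setdefault(kind, set()).update(values)
    return Node(opcode, merged)


# Hypothetical example: a multiply and an add fused into a single node.
mul = Node("mul", {"sec.tag": {"secret"}})
add = Node("add", {"sec.tag": {"public"}, "origin": {"line 12"}})
fma = combine("fma", mul, add)
```

Unlike a best-effort merge, this never loses information, but it leaves open what a later pass should do with conflicting values (here both "secret" and "public" survive on the fused node); that decision would stay with whoever consumes the metadata.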
>>
>>
>> As for my use case, it is also security-related. However, I do not
>> consider the metadata to be a compilation "correctness" criterion:
>> metadata, by definition (from the LLVM IR), can be safely removed without
>> affecting the program's correctness.
>> If possible, I would like to have more details on Lorenzo's use case in
>> order to see how metadata would interfere with the program's correctness.
>>
>> I would really like to discuss the details here but, unfortunately, I am
>> working on a publication and, thus, I cannot disclose any details :(
>>
>> However, by "correctness" I do not mean "I/O correctness", but the
>> preservation of a security property expressed in the front-end (e.g.,
>> specified in the source code) or in the middle-end (e.g., specified in
>> the LLVM-IR, for instance by a transformation pass).
>>
>> From a security point of view, removing or altering metadata does not
>> interfere with the I/O functionality of the code (although it may affect
>> performance), but may introduce vulnerabilities.
>>
>> As for the RFC, I can definitely try to write one, but this would be my
>> first time doing so. But maybe it is better to start with Lorenzo's
>> proposal, as you have already been working on this? Please tell me if you
>> prefer me to start the RFC though.
>>
>> It is the first time for me too, do not worry!
>>
>> We could just use any other RFC as a template to get started :D
>>
>> I think that a structure like the following would be fine:
>>
>>   1. Background
>>      1.1 Motivation
>>      1.2 Use-cases
>>      1.3 Other approaches
>>   2. Goal(s)
>>   3. Requirements
>>   4. Drawbacks and main bottlenecks
>>   5. Design sketch
>>   6. Roadmap sketch
>>   7. Potential future development
>>
>> It may be a bit overkill; you are warmly invited to cut/refine these
>> points!
>>
>> And...no, I still have no sketch of the RFC; sorry, I have had quite a
>> workload these days.
>>
>> Yes, you can start the write up of the RFC.
>>
>> Quoting David:
>>
>>   "Since you first raised the topic [...] I want to give you right of
>>   first refusal."
>>
>>
>> Have a nice day!
>>
>> -- Lorenzo
>>
>> Thank you again for keeping this going.
>>
>> Sincerely,
>>
>> - Son
>>
>> On Wed, Nov 4, 2020 at 6:30 PM Lorenzo Casalino <
>> lorenzo.casalino93 at gmail.com> wrote:
>>
>>>
>>> On 04/11/20 at 17:40, David Greene wrote:
>>> > Sorry about the late reply.
>>> >
>>> > Lorenzo Casalino <lorenzo.casalino93 at gmail.com> writes:
>>> >
>>> >>>>> - Should not impact compile time excessively (what is
>>> >>>>>   "excessive?")
>>> >>>> Probably, such estimation should be performed on
>>> >>> Did something get cut off here?
>>> >> Oops. Yep, I removed a paragraph but, apparently, I forgot the first
>>> >> period. In any case, we should discuss how to quantitatively
>>> >> determine an acceptable upper bound on the compilation-time overhead
>>> >> and give a motivation for it. For instance, a maximum overhead of n%
>>> >> on the compilation time must be guaranteed, because ** list of
>>> >> reasons **.
>>> > I am not sure how we'd arrive at such a number or motivate/defend it.
>>> > Do we have any sense of the impact of the existing metadata
>>> > infrastructure?  If not I'm not sure we can do it for something
>>> > completely new.  I think we can set a goal but we'd have to revise it
>>> > as we gain experience.
>>> I think it is the best approach to employ :)
>>> >>> Since you initially raised the topic, do you want to take the lead in
>>> >>> writing up a RFC?  I can certainly do it too but I want to give you
>>> >>> right of first refusal.  :)
>>> >>>                     -David
>>> >> Uhm...actually, it wasn't me but Son Tuan, so the right of refusal
>>> >> should be granted to him :) And I noticed now that he wasn't included
>>> >> in CC of all our mails; I hope he was able to follow our discussion
>>> >> anyways. I am adding him in this mail and let us wait if he has any
>>> >> critical feature or point to discuss.
>>> > Fair enough!  I have recently taken on a lot more work so unfortunately
>>> > I can't devote a lot of time to this at the moment.  I've got to clear
>>> > out my pipeline first.  I'd be very happy to help review text, etc.
>>> Do not worry, it is ok ;) While we wait for any feedback/input from Son,
>>> I'll try to prepare a draft of the RFC and publish it here.
>>>
>>> Thank you David, and have a nice day :)
>>>
>>> -- Lorenzo
>>>
>>> >                  -David
>>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>

Son Tuan VU via llvm-dev

2021-Jun-16 23:49 UTC


[llvm-dev] Metadata in LLVM back-end

Hi all,

Thanks for resuscitating this discussion.

@Lorenzo please pardon me for dropping this for quite a while. It was
indeed a tense period for me.

@Matt yes it'd be awesome if you can sketch an RFC; we can definitely
iterate on it to come up with more polished versions. I'd be more than happy
to help in any way I can.

Son Tuan Vu


