Dan Liew via llvm-dev
2020-Oct-07 17:23 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
Hi,

On Tue, 6 Oct 2020 at 18:31, David Blaikie <dblaikie at gmail.com> wrote:
>
> My 2c would be to push back a bit more on the "let's not have a machine
> readable format, but instead parse the human readable format" - it seems
> like that's going to make the human readable format/parsing fairly
> brittle/hard to change (I mean, having the parser in tree will help, for sure).

I was operating under the assumption that the decision made in
https://github.com/google/sanitizers/issues/268 was still the status quo.
That was six years ago though, so I'll let Kostya chime in here if he now
thinks differently about this.

Even if we go down the route of having the sanitizers support
machine-readable output, I'd still like there to be an in-tree tool that
supports doing offline symbolication on the machine-readable output. So
there still might be a case for having the proposed "llvm-xsan" tool
in-tree.

> It'd be interesting to know more about what problems the valgrind XML
> format has had and how/whether different solutions would address/avoid
> those problems. Also might be good to hear about how other tools are
> parsing the output - whether or not/how they might benefit if it were
> machine readable to begin with.

Huh. I didn't know Valgrind had an XML format, so I can't really comment
on that (yet).

On my side I can say we have at least two use cases inside Apple where we
are parsing ASan reports, and each use case ended up implementing its own
parser.

> But, yeah, if that's the direction - having an in-tree tool with fairly
> narrow uses could be nice. One action to convert human readable reports to
> json, another to symbolize such a report, a simple tool to render the
> (symbolized or not) data back into human readable form - then sets it up
> for other tools to consume that json and, say, render it in a GUI, perform
> other diagnostics/analysis on the report, etc.

I hadn't thought about a tool to re-render reports in human readable form.
That's a good idea.

> On Tue, Oct 6, 2020 at 6:12 PM Dan Liew via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> # Summary
>>
>> Currently the Sanitizer family of runtime bug-finding tools (e.g.
>> Address Sanitizer) provides useful reports of problems upon detection.
>> This RFC proposes adding tools to
>>
>> 1. Parse Sanitizer reports into structured data to make interfacing
>> with other tools simpler.
>> 2. Take the Sanitizer reports and "symbolicate" them. That is, add
>> missing symbol information (function name, source file, line number)
>> to the structured data version of the report.
>>
>> The initial stubs for the proposal in this RFC are provided in this
>> patch: https://reviews.llvm.org/D88938 .
>>
>> Any thoughts on this RFC or on the patch would be appreciated.
>>
>> # Issues with the existing solutions
>>
>> * An official parser for sanitizer reports does not exist. Currently
>> we just tell our users to implement their own (e.g. [1]). This creates
>> unnecessary duplication of effort.
>> * The existing symbolizer (asan_symbolize.py) only works with ASan
>> reports and doesn't support other sanitizers like TSan.
>> * The architecture of the existing symbolizer makes it cumbersome to
>> support inline frames.
>> * The architecture of the existing symbolizer is sequential, which
>> prevents performing batched symbolication of stack frames.
>>
>> # Tools
>>
>> The proposed tools would be sub-tools of a new llvm-xsan tool, e.g.:
>>
>> llvm-xsan <subtool>
>>
>> Sub-tools will support nesting of sub-tools to allow building
>> ergonomic tools, e.g.:
>>
>> llvm-xsan asan <asan subtool>
>>
>> * The tools would be part of compiler-rt and will optionally ship with
>> this project.
>> * The tools will be considered experimental while being incrementally
>> developed on the master branch.
>> * Functionality of the tools will be maintained via tests in compiler-rt.
>>
>> llvm-xsan could also be used as a vehicle for shipping other Sanitizer
>> tools in the toolchain in the future.
>>
>> ## Parsing tool
>>
>> Sanitizer reports are primarily meant to be human readable;
>> consequently, the reports are not structured data (e.g. JSON). This
>> means that Sanitizer reports are not conveniently machine-readable.
>>
>> A request [2] was made in the past to teach the sanitizers to emit a
>> machine-readable format for reports. This request was denied, but an
>> alternative was proposed where a tool could be provided to convert the
>> human-readable Sanitizer reports into a structured data format. This
>> proposal will implement that alternative.
>>
>> My proposal is that we implement a parser for Sanitizer reports that
>> converts them into structured data. In particular:
>>
>> * The tool is tied to the Clang/compiler-rt runtime that it ships
>> with. This means the tool will parse Sanitizer reports that come from
>> binaries built using the corresponding Clang. However, the tool is not
>> required to parse Sanitizer reports that come from different versions
>> of Clang.
>> * The tool can also output a schema that describes the structured data
>> format. This schema would be versioned and would be allowed to change
>> once the tool moves out of the experimental stage.
>> * The format of the human-readable Sanitizer reports is allowed to
>> change, but the parser should be correspondingly changed when this
>> happens. This will be enforced with tests.
>>
>> The parsing tools would be subtools of the asan, tsan, and ubsan
>> subtools. This would require the user to explicitly communicate the
>> report type ahead of time. Command line invocation would look
>> something like:
>>
>> ```
>> llvm-xsan asan parse < asan_report.txt > asan_report.json
>> llvm-xsan tsan parse < tsan_report.txt > tsan_report.json
>> llvm-xsan ubsan parse < ubsan_report.txt > ubsan_report.json
>> ```
>>
>> The structured data format would be JSON. The schema details still
>> need to be worked out, but the schema will need to cover every type of
>> issue that a Sanitizer can find.
>>
>> ## Symbolication tool
>>
>> Sanitizer reports include detailed stack traces which show the program
>> counter (PC) for each frame. PCs are typically not useful to a
>> developer. Instead, they are likely more interested in the function
>> name, source file, and line number that correspond to each of the PCs.
>> The process of finding the function name, source file, and line number
>> that correspond to a PC is known as "symbolication".
>>
>> There are two approaches to symbolication: online and offline. Online
>> symbolication performs symbolication in the process where the issue
>> was found by invoking an external tool (e.g. llvm-symbolizer) to
>> "symbolize" each of the PCs. Offline symbolication performs
>> symbolication outside the process where the issue was found. The
>> Sanitizers perform online symbolication by default. This process needs
>> the debug information to be available at runtime. However, this
>> information might be missing. For example:
>>
>> * The instrumented binary might have been stripped of debug info (e.g.
>> to reduce binary size).
>> * The PC points inside a system library which has no available debug info.
>> * The instrumented binary was built on a different machine. On Apple
>> platforms debug info lives outside the binary (inside ".dSYM" bundles),
>> so these might not be copied across from the build machine.
>>
>> In these cases online symbolication fails and we are left with a
>> sanitizer report that is extremely hard for a developer to read.
>>
>> To turn the unsymbolicated Sanitizer report into something useful for
>> a developer, offline symbolication is necessary. However, the existing
>> infrastructure (asan_symbolize.py) for doing this has some
>> deficiencies:
>>
>> * Only Address Sanitizer reports are supported.
>> * The current implementation processes each stack frame sequentially.
>> This does not fit well in contexts where we would like to symbolicate
>> multiple PCs at a time.
>> * The current implementation doesn't provide a way to handle inline
>> frames (i.e. a PC maps to two or more source locations).
>>
>> These problems can be resolved by building new tools on top of the
>> structured data format. This gives a nice separation of concerns
>> because parsing the report is now separate from symbolicating the PCs
>> in it.
>>
>> The symbolication tools would be subtools of the asan, tsan, and ubsan
>> subtools. This would require the user to explicitly communicate the
>> report type ahead of time. Command line invocation would look
>> something like:
>>
>> ```
>> llvm-xsan asan symbolicate < asan_report.json > asan_report_symbolicated.json
>> llvm-xsan tsan symbolicate < tsan_report.json > tsan_report_symbolicated.json
>> llvm-xsan ubsan symbolicate < ubsan_report.json > ubsan_report_symbolicated.json
>> ```
>>
>> There are multiple ways to perform symbolication (some of which are
>> platform specific). Like asan_symbolize.py, the plan would be to
>> support multiple symbolication backends (that can also be chained
>> together), specified via command line options.
>>
>> [1] https://github.com/dobin/asanparser/blob/master/asanparser.py
>> [2] https://github.com/google/sanitizers/issues/268
>>
>> Thanks,
>> Dan.
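To make the proposal above concrete, here is a purely illustrative sketch of what `llvm-xsan asan parse` might emit for an unsymbolicated report. The RFC explicitly leaves the schema to be worked out, so every field name below is hypothetical:

```
{
  "schema_version": 0,
  "tool": "AddressSanitizer",
  "bug_type": "heap-use-after-free",
  "access": {
    "pc": "0x000104a3c0f4",
    "address": "0x603000000040",
    "kind": "READ",
    "size": 8
  },
  "stacks": {
    "access": [
      {"frame": 0, "pc": "0x000104a3c0f4", "module": "a.out",
       "function": null, "file": null, "line": null},
      {"frame": 1, "pc": "0x000104a3c1d8", "module": "a.out",
       "function": null, "file": null, "line": null}
    ],
    "free": [],
    "alloc": []
  }
}
```

The `symbolicate` subtool would then fill in the null function/file/line fields, possibly expanding a single frame into several entries where a PC corresponds to inlined calls.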
Petr Hosek via llvm-dev
2020-Oct-07 17:38 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
We ran into the same issues you described, and the solution we came up
with is the Fuchsia symbolizer markup format, see
https://fuchsia.dev/fuchsia-src/reference/kernel/symbolizer_markup.
Despite its name, nothing about the format is Fuchsia specific; the
format should be generally usable and has already been adopted by other
systems such as RTEMS.

The symbolizer markup should address many of the issues you mentioned:

* It's already available in sanitizer_common and supports all sanitizers, see https://github.com/llvm/llvm-project/blob/fccea7f372cbd33376d2c776f34a0c6925982981/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_markup.cpp
* It supports inline frames, which was the most recent change to the markup based on our experience with sanitizer rollout, see https://cs.opensource.google/fuchsia/fuchsia/+/db6e2155d125c389bfc43bafe2f140231da0b6d0
* It's designed for offline and batched symbolization.

The advantage over emitting JSON directly is that the markup format is
line delimited, which simplifies emission and parsing, it's more compact,
and it can be easily embedded in other formats (even JSON), which is
important in our use case.

Currently, the markup is consumed by our symbolizer, which is a thin
wrapper around llvm-symbolizer, but I planned on eventually proposing and
implementing support for this format directly in llvm-symbolizer. We
support emitting JSON output in our symbolizer wrapper, which would be
great to have in llvm-symbolizer as well and is in line with the plan to
support JSON output in various LLVM tools that has been repeatedly
discussed in the past.

Our hope has been that this markup could eventually be adopted by other
platforms, and I'd be interested to hear your thoughts. I understand that
it may not be a fit for your use cases, but I'd also be interested to hear
if there are ways to make it usable for your use.

Regarding offline symbolization, we use offline symbolization by default
in Fuchsia and our symbolizer wrapper fetches debug info on demand from
our symbol server. We originally used a custom scheme, but recently we
started switching to debuginfod, which is being quickly adopted by various
binary tools in the GNU ecosystem. I'd like to implement debuginfod
support directly in LLVM (see also the recent thread about HTTP
client/server libraries in LLVM) and integrate it into tools like
llvm-symbolizer, which is also important to bring llvm-symbolizer on par
with addr2line. This would address the offline symbolization use case in a
way that doesn't require new tools.
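For readers unfamiliar with what "batched" offline symbolication looks like in practice, the sketch below shows the general shape of a thin wrapper: collect all PCs first, then resolve them in a single llvm-symbolizer invocation instead of one subprocess per frame. It assumes llvm-symbolizer's plain-text interface (one "<binary> <address>" query per input line, answers separated by blank lines); flags and output details vary between LLVM versions, so treat this as illustrative only, not as how the Fuchsia wrapper is actually implemented.

```
import subprocess

def symbolize_batch(binary, pcs, symbolizer="llvm-symbolizer"):
    """Resolve all PCs for one binary with a single symbolizer invocation."""
    queries = "".join(f"{binary} {pc}\n" for pc in pcs)
    out = subprocess.run([symbolizer], input=queries, capture_output=True,
                         text=True, check=True).stdout
    # Each query's answer is one or more "function" / "file:line:col" line
    # pairs (inline frames add extra pairs), terminated by a blank line.
    results, current = [], []
    for line in out.splitlines():
        if not line.strip():
            if current:
                results.append(current)
                current = []
            continue
        current.append(line)
    if current:
        results.append(current)
    return results  # results[i] holds the frames resolved for pcs[i]

if __name__ == "__main__":
    pcs = ["0x4005d0", "0x4005f1"]
    for pc, frames in zip(pcs, symbolize_batch("./a.out", pcs)):
        print(pc, frames)
```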
Kostya Serebryany via llvm-dev
2020-Oct-07 23:24 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
On Wed, Oct 7, 2020 at 10:23 AM Dan Liew <dan at su-root.co.uk> wrote:

> I was operating under the assumption that the decision made in
> https://github.com/google/sanitizers/issues/268 was still the status
> quo. That was six years ago though so I'll let Kostya chime in here if
> he now thinks differently about this.

My opinion on the matter didn't change, nor did the motivation.
I am opposed to making the sanitizer run-time any more complex,
and I prefer the approach proposed here: a separate, adjacently
maintained parser.

On top of the previous motivation, here is some more. We are going to
have more sanitizer-like things in the near future (Arm MTE is one of
them) that are not necessarily going to be in LLVM and that will not emit
JSON (and they shouldn't: we don't want any such thing in a production
run-time). But we can support those things with a separate parser.

I have a mild preference to have the parser written as a C++ library with
a C interface, not in Python, so that it can be used programmatically
without launching a sub-process. But I don't insist (especially given the
code is written already).

--kcc
Dan Liew via llvm-dev
2020-Oct-08 01:07 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
On Wed, 7 Oct 2020 at 10:38, Petr Hosek <phosek at google.com> wrote:
>
> We ran into the same issues you described and the solution we came up
> with is the Fuchsia symbolizer markup format, see
> https://fuchsia.dev/fuchsia-src/reference/kernel/symbolizer_markup.
> Despite its name, nothing about the format is Fuchsia specific, the
> format should be generally usable and has already been adopted by other
> systems such as RTEMS.
>
> The symbolizer markup should address many of the issues you mentioned:
> * It's already available in sanitizer_common and supports all sanitizers
> * It supports inline frames, which was the most recent change to the
> markup based on our experience with sanitizer rollout
> * It's designed for offline and batched symbolization.
>
> The advantage over emitting JSON directly is that the markup format is
> line delimited, which simplifies emission and parsing, it's more compact,
> and it can be easily embedded in other formats (even JSON) which is
> important in our use case.

The approach you've outlined is a really great way to handle offline
symbolization. However, it only solves part of what I want to solve. I
also want to have a description of the ASan report that is
machine-readable. Having a machine-readable description of the ASan
report allows you to do things like:

* Perform some automated bug triage, e.g. work out which frame(s) might
be responsible based on the stack trace and the bug type.
* Create custom user interfaces to display ASan reports.
* Simplify consuming ASan reports in a database. Such a database could be
used for de-duplication of reports and for gathering statistics (a rough
sketch of this follows below).

There are probably other things too, but these are the first things that
come to mind.

> Currently, the markup is consumed by our symbolizer which is a thin
> wrapper around llvm-symbolizer, but I planned on eventually proposing
> and implementing support for this format directly in llvm-symbolizer.
> We support emitting JSON output in our symbolizer wrapper which would be
> great to have in llvm-symbolizer as well and is in line with the plan to
> support JSON output in various LLVM tools that has been repeatedly
> discussed in the past.
>
> Our hope has been that this markup could be eventually adopted by other
> platforms and I'd be interested to hear your thoughts. I understand that
> it may not be a fit for your use cases, but I'd be also interested to
> hear if there are ways to make it usable for your use.

Does this JSON output only describe the stack traces, or does it describe
other parts of the ASan report too (e.g. bug type, PC, read/write, access
size, shadow memory contents)?

> Regarding offline symbolization, we use offline symbolization by default
> in Fuchsia and our symbolizer wrapper fetches debug info on-demand from
> our symbol server. We originally used a custom scheme, but recently we
> started switching to debuginfod which is being quickly adopted by
> various binary tools in the GNU ecosystem. I'd like to implement
> debuginfod support directly in LLVM (see also the recent thread about
> HTTP client/server libraries in LLVM) and integrate it into tools like
> llvm-symbolizer which is also important to bring llvm-symbolizer on par
> with addr2line. This would address the offline symbolization use case in
> a way that doesn't require new tools.

I didn't realise that addr2line could talk to debuginfod, so that sounds
like a sensible thing to support in llvm-symbolizer.

For Apple platforms I think we mostly use `atos` instead of
llvm-symbolizer because it supports Swift demangling, but there may be
other reasons that I'm unaware of.
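The de-duplication point mentioned above could be as simple as the following sketch, written against the same hypothetical JSON layout sketched earlier in the thread (none of these field names are defined anywhere yet):

```
import hashlib
import json

def report_signature(report_path, top_n=3):
    """Collapse equivalent reports to one database key.

    Keys on the bug type plus the top few symbolicated frames of the access
    stack, so the same bug seen across many runs de-duplicates to a single
    database entry.
    """
    with open(report_path) as f:
        report = json.load(f)
    frames = report["stacks"]["access"][:top_n]
    parts = [report["bug_type"]] + [
        "{}:{}:{}".format(fr.get("function"), fr.get("file"), fr.get("line"))
        for fr in frames
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```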
Dan Liew via llvm-dev
2020-Oct-14 18:53 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
>> I was operating under the assumption that the decision made in
>> https://github.com/google/sanitizers/issues/268 was still the status
>> quo. That was six years ago though so I'll let Kostya chime in here if
>> he now thinks differently about this.
>
> My opinion on the matter didn't change, nor did the motivation.
> I am opposed to making the sanitizer run-time any more complex,
> and I prefer the approach proposed here: separate, adjacently maintained parser.

Okay. If this is your position, can we proceed to review
https://reviews.llvm.org/D88938?

> On top of the previous motivation, here is some more.
> We are going to have more sanitizer-like things in the near future
> (Arm MTE is one of them), that are not necessarily going to be in LLVM
> and that will not emit JSON. (and they shouldn't: we don't want any such
> thing in a production run-time). But we can support those things with a
> separate parser.

Just to push back on this a little: I don't think emitting JSON is any
worse than what we do today. It's still just printing strings to a
file/system log.

Out of curiosity, how would you propose a production run-time emit
"sanitizer"-like reports? Maybe a special-purpose syscall and then trap?

> I have a mild preference to have the parser written as a C++ library,
> with C interface. Not in python, so that it can be used programmatically
> w/o launching a sub-process. But I don't insist (especially given the
> code is written already)

My reasons for writing this in Python are:

* Support for extending the tool with plug-ins is planned. Python makes
writing plug-ins easy; writing plug-ins in a C++ world is fraught with
problems (see the rough sketch below).
* A functioning tool can be built very quickly due to Python's large
ecosystem (stdlib and external packages).

We could certainly rewrite parts of the tool in C++ should we actually
need it in the future. Right now, though, Python seems like the better
choice.

Thanks,
Dan.
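To illustrate the plug-in argument, a Python registry for extra symbolication backends could be as small as the sketch below. The interface is entirely hypothetical: D88938 contains only initial stubs, and plug-in support is only planned, not implemented.

```
import importlib.util

_BACKENDS = {}

def register_backend(name):
    """Decorator a plug-in uses to add a symbolication backend."""
    def wrap(cls):
        _BACKENDS[name] = cls
        return cls
    return wrap

@register_backend("null")
class NullBackend:
    """Built-in fallback that leaves frames unsymbolicated."""
    def symbolicate(self, module, pc):
        return [(None, None, None)]  # (function, file, line)

def load_plugins(paths):
    """A plug-in is just a Python file that calls register_backend() on import."""
    for i, path in enumerate(paths):
        spec = importlib.util.spec_from_file_location(f"xsan_plugin_{i}", path)
        mod = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(mod)
```

The C++ equivalent would typically mean a stable plug-in ABI and dlopen-style loading, which is part of what makes that route more involved.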