Dan Liew via llvm-dev
2020-Oct-08 01:07 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
On Wed, 7 Oct 2020 at 10:38, Petr Hosek <phosek at google.com> wrote:
>
> We ran into the same issues you described and the solution we came up with is the Fuchsia symbolizer markup format, see https://fuchsia.dev/fuchsia-src/reference/kernel/symbolizer_markup. Despite its name, nothing about the format is Fuchsia specific, the format should be generally usable and has already been adopted by other systems such as RTEMS.
>
> The symbolizer markup should address many of the issues you mentioned:
> * It's already available in sanitizer_common and supports all sanitizers, see https://github.com/llvm/llvm-project/blob/fccea7f372cbd33376d2c776f34a0c6925982981/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_markup.cpp
> * It supports inline frames which was the most recent changes to the markup based on our experience with sanitizer rollout, see https://cs.opensource.google/fuchsia/fuchsia/+/db6e2155d125c389bfc43bafe2f140231da0b6d0
> * It's designed for offline and batched symbolization.
>
> The advantage over emitting JSON directly is that the markup format is line delimited, which simplifies emission and parsing, it's more compact, and it can be easily embedded in other formats (even JSON) which is important in our use case.

The approach you've outlined is a really great way to handle offline symbolization. However, it only solves part of what I want to solve. I also want to have a description of the ASan report that is machine-readable. Having a machine-readable description of the ASan report allows you to do things like:

* Perform some automated bug-triage. E.g. work out which frame(s) might be responsible based on the stack trace and the bug-type.
* Create custom user interfaces to display ASan reports.
* Simplifies consuming ASan reports in a database. Such a database could be used for de-duplication of reports and gathering statistics.

There are probably other things too but these are the first things that come to mind.

> Currently, the markup is consumed by our symbolizer which is a thin wrapper around llvm-symbolizer, but I planned on eventually proposing and implementing support for this format directly in llvm-symbolizer. We support emitting JSON output in our symbolizer wrapper which would be great to have in llvm-symbolizer as well and is in line with the plan to support JSON output in various LLVM tools that has been repeatedly discussed in the past.
>
> Our hope has been that this markup could be eventually adopted by other platforms and I'd be interested to hear your thoughts. I understand that it may not be a fit for your use cases, but I'd be also interested to hear if there are ways to make it usable for your use.

Does this JSON output only describe the stacktraces or does it describe other parts of the ASan report too (e.g. bug type, pc, read/write, access size, shadow memory contents)?

> Regarding offline symbolization, we use offline symbolization by default in Fuchsia and our symbolizer wrapper fetches debug info on-demand from our symbol server. We originally used a custom scheme, but recently we started switching to debuginfod which is being quickly adopted by various binary tools in the GNU ecosystem. I'd like to implement debuginfod support directly in LLVM (see also the recent thread about HTTP client/server libraries in LLVM) and integrate it into tools like llvm-symbolizer which is also important to bring llvm-symbolizer on par with addr2line. This would address the offline symbolization use case in a way that doesn't require new tools.

I didn't realise that addr2line could talk to debuginfod so that sounds like a sensible thing to support in llvm-symbolizer. For Apple platforms I think we mostly use `atos` instead of llvm-symbolizer because it supports Swift demangling, but there may be other reasons that I'm unaware of.
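
To make the triage and de-duplication use cases above concrete, here is a minimal sketch in Python. It assumes a purely hypothetical JSON layout for an ASan report: the field names (`bug_type`, `access`, `pc`, `stack`) and the helpers `triage_frames` and `dedup_key` are invented for illustration and are not an existing sanitizer output format or API. The idea is only that once the report is structured, skipping runtime frames and deriving a stable bucket key becomes a few lines of code.

```python
import hashlib
import json

# Hypothetical structured ASan report. Every field name below is an
# assumption made for this sketch, not a real sanitizer schema.
REPORT_JSON = """
{
  "bug_type": "heap-use-after-free",
  "access": {"kind": "read", "size": 4, "address": "0x602000000010"},
  "pc": "0x4f1a2b",
  "stack": [
    {"function": "__asan_memcpy", "module": "libclang_rt.asan.so"},
    {"function": "Buffer::copy", "module": "app"},
    {"function": "main", "module": "app"}
  ]
}
"""

# Prefixes treated as sanitizer-runtime frames (an illustrative choice).
RUNTIME_PREFIXES = ("__asan_", "__sanitizer_", "__interceptor_")


def triage_frames(report):
    """Return the stack frames that do not belong to the sanitizer runtime."""
    return [
        frame for frame in report["stack"]
        if not frame["function"].startswith(RUNTIME_PREFIXES)
    ]


def dedup_key(report, depth=2):
    """Derive a short bucket key from the bug type and the top user frames.

    Addresses are deliberately excluded so that the same bug seen in
    different runs (with different heap layouts) maps to the same key.
    """
    frames = triage_frames(report)[:depth]
    material = "|".join([report["bug_type"]] + [f["function"] for f in frames])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


report = json.loads(REPORT_JSON)
blamed = triage_frames(report)[0]["function"]  # first non-runtime frame
key = dedup_key(report)
```

A database of reports could then group rows by `key` and count them. How many frames to include and which prefixes to skip are policy decisions that a real tool would need to make configurable.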
Aaron Ballman via llvm-dev
2020-Oct-08 12:08 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
On Wed, Oct 7, 2020 at 9:07 PM Dan Liew via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> The approach you've outlined is a really great way to handle offline symbolization. However, it only solves part of what I want to solve. I also want to have a description of the ASan report that is machine-readable. Having a machine-readable description of the ASan report allows you to do things like:
>
> * Perform some automated bug-triage. E.g. work out which frame(s) might be responsible based on the stack trace and the bug-type.
> * Create custom user interfaces to display ASan reports.
> * Simplifies consuming ASan reports in a database. Such a database could be used for de-duplication of reports and gathering statistics.
>
> There are probably other things too but these are the first things that come to mind.

There is a standardized JSON-based format used for exchanging static analysis finding reports between tools called SARIF that seems like it may be a natural fit for this work, perhaps. What's more, Clang already has some SARIF writing capabilities that could perhaps be lifted for the implementation (it's one of the formats the clang static analyzer produces for output). You can see the SARIF site for more information: https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html

~Aaron
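
As a rough illustration of what mapping a sanitizer finding onto SARIF 2.1.0 could look like, the Python sketch below builds a minimal log with one result and one stack. The property names used (`runs`, `tool.driver.name`, `results`, `ruleId`, `message.text`, `locations`, `stacks[].frames[].location`, and the free-form `properties` bag) come from the SARIF specification. The helper `asan_finding_to_sarif`, the sample frames, and the choice to put the access kind and size in `properties` are assumptions made for this example; nothing here reflects what Clang's SARIF writer or any sanitizer actually emits.

```python
import json


def asan_finding_to_sarif(bug_type, message, frames, access=None):
    """Build a minimal SARIF 2.1.0 log for one sanitizer finding.

    `frames` is a list of dicts with "function", "file" and "line" keys
    (a hypothetical input shape chosen for this sketch).
    """
    stack_frames = [
        {
            "location": {
                "physicalLocation": {
                    "artifactLocation": {"uri": frame["file"]},
                    "region": {"startLine": frame["line"]},
                },
                "logicalLocations": [{"name": frame["function"]}],
            }
        }
        for frame in frames
    ]
    result = {
        "ruleId": bug_type,
        "level": "error",
        "message": {"text": message},
        # Report the innermost frame as the primary location, if any.
        "locations": [sf["location"] for sf in stack_frames[:1]],
        "stacks": [{"frames": stack_frames}],
    }
    if access is not None:
        # SARIF has no dedicated fields for access kind/size, so this
        # sketch stores them in the generic property bag.
        result["properties"] = {"access": access}
    return {
        "version": "2.1.0",
        "runs": [
            {
                "tool": {"driver": {"name": "AddressSanitizer"}},
                "results": [result],
            }
        ],
    }


sarif_log = asan_finding_to_sarif(
    bug_type="heap-use-after-free",
    message="READ of size 4 after free",
    frames=[
        {"function": "Buffer::copy", "file": "src/buffer.cpp", "line": 42},
        {"function": "main", "file": "src/main.cpp", "line": 10},
    ],
    access={"kind": "read", "size": 4},
)
sarif_text = json.dumps(sarif_log, indent=2)
```

The stack trace fits SARIF's model fairly directly, while sanitizer-specific details such as shadow memory contents would have to live in `properties`, which is one way to judge how close the fit is.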
Dan Liew via llvm-dev
2020-Oct-14 19:04 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
> There is a standardized JSON-based format used for exchanging static analysis finding reports between tools called SARIF that seems like it may be a natural fit for this work, perhaps. What's more, Clang already has some SARIF writing capabilities that could perhaps be lifted for the implementation (it's one of the formats the clang static analyzer produces for output). You can see the SARIF site for more information:
> https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html

Thanks for bringing this up. I wasn't aware of this before. I'm struggling to grok that documentation and would probably need concrete examples to understand if it's a good fit. TBH I'm much more likely to go for a custom JSON schema though because the structured version of sanitizer reports will be very closely tied to the Sanitizers.

Thanks,
Dan.