Dan Liew via llvm-dev
2020-Oct-07 01:11 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
# Summary

Currently the Sanitizer family of runtime bug finding tools (e.g. Address Sanitizer) provide useful reports of problems upon detection. This RFC proposes adding tools to

1. Parse Sanitizer reports into structured data to make interfacing with other tools simpler.
2. Take the Sanitizer reports and “Symbolicate” them. That is, add missing symbol information (function name, source file, line number) to the structured data version of the report.

The initial stubs for the proposal in this RFC are provided in this patch: https://reviews.llvm.org/D88938 .

Any thoughts on this RFC or the patch would be appreciated.

# Issues with the existing solutions

* An official parser for sanitizer reports does not exist. Currently we just tell our users to implement their own (e.g. [1]). This creates an unnecessary duplication of effort.
* The existing symbolizer (asan_symbolize.py) only works with ASan reports and doesn’t support other sanitizers like TSan.
* The architecture of the existing symbolizer makes it cumbersome to support inline frames.
* The architecture of the existing symbolizer is sequential which prevents performing batched symbolication of stack frames.

# Tools

The proposed tools would be sub-tools of a new llvm-xsan tool.

E.g.

    llvm-xsan <subtool>

Sub-tools will support nesting of sub-tools to allow building ergonomic tools. E.g.:

    llvm-xsan asan <asan subtool>

* The tools would be part of compiler-rt and will optionally ship with this project.
* The tools will be considered experimental while being incrementally developed on the master branch.
* Functionality of the tools will be maintained via tests in compiler-rt.

llvm-xsan could also be used as a vehicle for shipping other Sanitizer tools in the toolchain in the future.

## Parsing tool

Sanitizer reports are primarily meant to be human readable, consequently the reports are not structured data (e.g. JSON). This means that Sanitizer reports are not conveniently machine-readable.
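As a rough illustration of what converting a report into structured data involves, the sketch below parses unsymbolicated stack frame lines into JSON. The frame layout assumed by the regex, the field names, and the sample paths are assumptions made for illustration only; they are not the schema or the parser from D88938.

```python
import json
import re

# Assumed shape of an unsymbolicated frame line, e.g.
#   "    #1 0x4f7a2b  (/tmp/a.out+0xf7a2b)"
# Real reports contain other line shapes that this sketch simply skips.
FRAME_RE = re.compile(
    r"^\s*#(?P<index>\d+)\s+(?P<pc>0x[0-9a-fA-F]+)\s+"
    r"\((?P<module>.+)\+(?P<offset>0x[0-9a-fA-F]+)\)\s*$"
)


def parse_frame(line):
    """Return a dict for one frame line, or None if the line does not match."""
    m = FRAME_RE.match(line)
    if m is None:
        return None
    return {
        "index": int(m.group("index")),
        "pc": int(m.group("pc"), 16),
        "module": m.group("module"),
        "module_offset": int(m.group("offset"), 16),
    }


def parse_frames(report_text):
    """Collect every matching frame line in a report, in order of appearance."""
    frames = []
    for line in report_text.splitlines():
        frame = parse_frame(line)
        if frame is not None:
            frames.append(frame)
    return frames


if __name__ == "__main__":
    sample = (
        "    #0 0x4f7a2b  (/tmp/a.out+0xf7a2b)\n"
        "    #1 0x7f00aa  (/lib/libc.so.6+0x270aa)\n"
    )
    print(json.dumps(parse_frames(sample), indent=2))
```

Even this small case shows why an ad hoc parser is fragile: any change to the human-readable layout silently stops the regex from matching.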
A request [2] was made in the past to teach the sanitizers to emit a machine-readable format for reports. This request was denied but an alternative was proposed where a tool could be provided to convert the human readable Sanitizer reports into a structured data format. This proposal will implement this alternative.

My proposal is that we implement a parser for Sanitizer reports that converts them into structured data. In particular:

* The tool is tied to the Clang/compiler-rt runtime that it ships with. This means the tool will parse Sanitizer reports that come from binaries built using the corresponding Clang. However the tool is not required to parse Sanitizer reports that come from different versions of Clang.
* The tool can also output a schema that describes the structured data format. This schema would be versioned and would be allowed to change once the tool moves out of the experimental stage.
* The format of the human readable Sanitizer reports is allowed to change but the parser should be correspondingly changed when this happens. This will be enforced with tests.

The parsing tools would be subtools of the asan, tsan, ubsan subtools. This would require the user to explicitly communicate the report type ahead of time. Command line invocation would look something like:

```
llvm-xsan asan parse < asan_report.txt > asan_report.json
llvm-xsan tsan parse < tsan_report.txt > tsan_report.json
llvm-xsan ubsan parse < ubsan_report.txt > ubsan_report.json
```

The structured data format would be JSON. The schema details still need to be worked out but the schema will need to cover every type of issue that a Sanitizer can find.

## Symbolication tool

Sanitizer reports include detailed stack traces which show the program counter (PC) for each frame. PCs are typically not useful to a developer. Instead they are likely more interested in the function name, source file and line number that correspond to each of the PCs. The process of finding the function name, source file and line number that correspond to a PC is known as “Symbolication”.

There are two approaches to symbolication, online and offline. Online symbolication performs symbolication in the process where the issue was found by invoking an external tool (e.g. llvm-symbolizer) to “symbolize” each of the PCs. Offline symbolication performs symbolication outside the process where the issue was found. The Sanitizers perform online symbolication by default. This process needs the debug information to be available at runtime. However this information might be missing. For example:

* The instrumented binary might have been stripped of debug info (e.g. to reduce binary size).
* The PC points inside a system library which has no available debug info.
* The instrumented binary was built on a different machine. On Apple platforms debug info lives outside the binary (inside “.dSYM” bundles) so these might not be copied across from the build machine.

In these cases online symbolication fails and we are left with a sanitizer report that is extremely hard for a developer to read.

To turn the unsymbolicated Sanitizer report into something useful for a developer, offline symbolication is necessary. However, the existing infrastructure (asan_symbolize.py) for doing this has some deficiencies.

* Only Address Sanitizer reports are supported.
* The current implementation processes each stack frame sequentially. This does not fit well in contexts where we would like to symbolicate multiple PCs at a time.
* The current implementation doesn’t provide a way to handle inline frames (i.e. a PC maps to two or more source locations).

These problems can be resolved by building new tools on top of the structured data format. This gives a nice separation of concerns because parsing the report is now separate from symbolicating the PCs in it.

The symbolication tools would be subtools of the asan, tsan, ubsan subtools. This would require the user to explicitly communicate the report type ahead of time. Command line invocation would look something like:

```
llvm-xsan asan symbolicate < asan_report.json > asan_report_symbolicated.json
llvm-xsan tsan symbolicate < tsan_report.json > tsan_report_symbolicated.json
llvm-xsan ubsan symbolicate < ubsan_report.json > ubsan_report_symbolicated.json
```

There are multiple ways to perform symbolication (some of which are platform specific). Like asan_symbolize.py the plan would be to support multiple symbolication backends (that can also be chained together) that are specified via command line options.

[1] https://github.com/dobin/asanparser/blob/master/asanparser.py
[2] https://github.com/google/sanitizers/issues/268

Thanks,
Dan.
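The batching, inline-frame, and chained-backend points from the RFC can be sketched as follows. This is a minimal sketch under stated assumptions, not the design in D88938: the frame dictionaries, the `locations` key, the backend call signature, and `demo_backend` with its lookup table are all hypothetical. A real backend would query something like llvm-symbolizer instead of a hard-coded table.

```python
from collections import defaultdict


def symbolicate(frames, backends):
    """Attach a 'locations' list to every frame dict and return the frames.

    Each backend is a callable taking (module, offsets) and returning a dict
    {offset: [location, ...]}. A list with more than one location models
    inlined frames. Backends are tried in order, and later backends only see
    offsets that earlier ones could not resolve.
    """
    # Group frames by module so each backend gets one batched call per module.
    by_module = defaultdict(list)
    for frame in frames:
        by_module[frame["module"]].append(frame)

    for module, module_frames in by_module.items():
        pending = {f["module_offset"] for f in module_frames}
        resolved = {}
        for backend in backends:
            if not pending:
                break
            result = backend(module, sorted(pending))
            for offset, locations in result.items():
                if locations and offset in pending:
                    resolved[offset] = locations
                    pending.discard(offset)
        for f in module_frames:
            # Unresolved frames get an empty list rather than a guess.
            f["locations"] = resolved.get(f["module_offset"], [])
    return frames


def demo_backend(module, offsets):
    """Hypothetical backend backed by a fixed table, for illustration only."""
    table = {
        ("/tmp/a.out", 0x10): [
            {"function": "inner", "file": "a.c", "line": 3},
            {"function": "main", "file": "a.c", "line": 9},
        ],
    }
    return {o: table[(module, o)] for o in offsets if (module, o) in table}
```

Grouping by module before calling a backend is what makes batching possible, and returning a list per PC leaves room for inline frames without special cases.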
David Blaikie via llvm-dev
2020-Oct-07 01:31 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
My 2c would be to push back a bit more on the "let's not have a machine readable format, but instead parse the human readable format" - it seems like that's going to make the human readable format/parsing fairly brittle/hard to change (I mean, having the parser in tree will help, for sure). It'd be interesting to know more about what problems the valgrind XML format have had and how/whether different solutions would address/avoid those problems. Also might be good to hear about how other tools are parsing the output - whether or not/how they might benefit if it were machine readable to begin with.

But, yeah, if that's the direction - having an in-tree tool with fairly narrow uses could be nice. One action to convert human readable reports to json, another to symbolize such a report, a simple tool to render the (symbolized or not) data back into human readable form - then sets it up for other tools to consume that json and, say, render it in a GUI, perform other diagnostics/analysis on the report, etc.
Vitaly Buka via llvm-dev
2020-Oct-07 11:14 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
There was a refactoring a few years after [2] which organized all asan reports into simple structs to view them in a debugger. It should be quite straightforward to serialize them into json.

If it's a part of compiler-rt and we have to maintain that, I'd prefer to maintain direct json serialization rather than a report->json converter.
Philip Reames via llvm-dev
2020-Oct-07 15:17 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
I agree with this. We should just support a machine readable format, and build a tooling ecosystem around that. Just make sure to include a version id in the format from the beginning so that we can change it. :) Philip On 10/6/20 6:31 PM, David Blaikie via llvm-dev wrote:> My 2c would be to push back a bit more on the "let's not have a > machine readable format, but instead parse the human readable format" > - it seems like that's going to make the human readable format/parsing > fairly brittle/hard to change (I mean, having the parser in tree will > help, for sure). It'd be interesting to know more about what problems > the valgrind XML format have had and how/whether different solutions > would address/avoid those problems. Also might be good to hear about > how other tools are parsing the output - whether or not/how they might > benefit if it were machine readable to begin with. > > But, yeah, if that's the direction - having an in-tree tool with > fairly narrow uses could be nice. One action to convert human readable > reports to json, another to symbolize such a report, a simple tool to > render the (symbolized or not) data back into human readable form > - then sets it up for other tools to consume that json and, say, > render it in a GUI, perform other diagnostics/analysis on the report, etc. > > On Tue, Oct 6, 2020 at 6:12 PM Dan Liew via llvm-dev > <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > > # Summary > > Currently the Sanitizer family of runtime bug finding tools (e.g. > Address Sanitizer) provide useful reports of problems upon detection. > This RFC proposes adding tools to > > 1. Parse Sanitizer reports into structured data to make interfacing > with other tools simpler. > 2. Take the Sanitizer reports and “Symbolicate” them. That is, add > missing symbol information (function name, source file, line number) > to the structured data version of the report. 
> > The initial stubs for the proposal in this RFC are provided in this > patch: https://reviews.llvm.org/D88938 . > > Any thoughts on this RFC on the patch would be appreciated. > > # Issues with the existing solutions > > * An official parser for sanitizer reports does not exist. Currently > we just tell our users to implement their own (e.g. [1]). This creates > an unnecessary duplication of effort. > * The existing symbolizer (asan_symbolize.py) only works with ASan > reports and doesn’t support other sanitizers like TSan. > * The architecture of the existing symbolizer makes it cumbersome to > support inline frames. > * The architecture of the existing symbolizer is sequential which > prevents performing batched symbolication of stack frames. > > # Tools > > The proposed tools would be a sub-tools of a new llvm-xsan tool. > > E.g. > > llvm-xsan <subtool> > > Sub-tools will support nesting of sub-tools to allow building > ergonomic tools. E.g.: > > llvm-xsan asan <asan subtool> > > * The tools would be part of compiler-rt and will optionally ship with > this project. > * The tools will be considered experimental while being incrementally > developed on the master branch. > * Functionality of the tools will be maintained via tests in the > compiler-rt. > > llvm-xsan could be also used as a vehicle for shipping other Sanitizer > tools in the toolchain in the future. > > ## Parsing tool > > Sanitizer reports are primarily meant to be human readable, > consequently the reports are not structured data (e.g. JSON). This > means that Sanitizer reports are not conveniently machine-readable. > > A request [2] was made in the past to teach the sanitizers to emit a > machine-readable format for reports. This request was denied but an > alternative was proposed where a tool could be provided to convert the > human readable Sanitizer reports into a structured data format. This > proposal will implement this alternative. 
>
> My proposal is that we implement a parser for Sanitizer reports that
> converts them into a structured data. In particular:
>
> * The tool is tied to the Clang/compiler-rt runtime that it ships
> with. This means the tool will parse Sanitizer reports that come from
> binaries built using the corresponding Clang. However the tool is not
> required to parse Sanitizer reports that come from different versions
> of Clang.
> * The tool can also output a schema that describes the structured data
> format. This schema would be versioned and would be allowed to change
> once the tool moves out of the experimental stage.
> * The format of the human readable Sanitizer reports is allowed to
> change but the parser should be correspondingly changed when this
> happens. This will be enforced with tests.
>
> The parsing tools would be subtools of the asan, tsan, ubsan subtools.
> This would require the user to explicitly communicate the report type
> ahead of time. Command line invocation would look something like:
>
> ```
> llvm-xsan asan parse < asan_report.txt > asan_report.json
> llvm-xsan tsan parse < tsan_report.txt > tsan_report.json
> llvm-xsan ubsan parse < ubsan_report.txt > ubsan_report.json
> ```
>
> The structured data format would be JSON. The schema details still
> need to be worked out but the schema will need to cover every type of
> issue that a Sanitizer can find.
>
> ## Symbolication tool
>
> Sanitizer reports include detailed stack traces which show the program
> counter (PC) for each frame. PCs are typically not useful to a
> developer. Instead they are likely more interested in the function
> name, source file and line number that correspond to each of the PCs.
> The process of finding the function name, source file and line number
> that correspond to a PC is known as “Symbolication”.
>
> There are two approaches to symbolication, online and offline. Online
> symbolication performs Symbolication in the process where the issue
> was found by invoking an external tool (e.g. llvm-symbolizer) to
> “symbolize” each of the PCs. Offline symbolication performs
> symbolication outside the process where the issue was found. The
> Sanitizers perform online symbolication by default. This process needs
> the debug information to be available at runtime. However this
> information might be missing. For example:
>
> * The instrumented binary might have been stripped of debug info (e.g.
> to reduce binary size).
> * The PC points inside a system library which has no available debug info.
> * The instrumented binary was built on a different machine. On Apple
> platforms debug info lives outside the binary (inside “.dSYM” bundles)
> so these might not be copied across from the build machine.
>
> In these cases online symbolication fails and we are left with a
> sanitizer report that is extremely hard for a developer to read.
>
> To turn the unsymbolicated Sanitizer report into something useful for
> a developer, offline symbolication is necessary. However, the existing
> infrastructure (asan_symbolize.py) for doing this has some
> deficiencies.
>
> * Only Address Sanitizer reports are supported.
> * The current implementation processes each stackframe sequentially.
> This does not fit well in contexts where we would like to symbolicate
> multiple PCs at a time.
> * The current implementation doesn’t provide a way to handle inline
> frames (i.e. a PC maps to two or more source locations).
>
> These problems can be resolved by building new tools on top of the
> structured data format. This gives a nice separation of concerns
> because parsing the report is now separate from symbolicating the PCs
> in it.
>
> The symbolication tools would be subtools of the asan, tsan, ubsan
> subtools. This would require the user to explicitly communicate the
> report type ahead of time. Command line invocation would look
> something like:
>
> ```
> llvm-xsan asan symbolicate < asan_report.json > asan_report_symbolicated.json
> llvm-xsan tsan symbolicate < tsan_report.json > tsan_report_symbolicated.json
> llvm-xsan ubsan symbolicate < ubsan_report.json > ubsan_report_symbolicated.json
> ```
>
> There are multiple ways to perform symbolication (some of which are
> platform specific). Like asan_symbolize.py the plan would be to
> support multiple symbolication backends (that can also be chained
> together) that are specified via command line options.
>
> [1] https://github.com/dobin/asanparser/blob/master/asanparser.py
> [2] https://github.com/google/sanitizers/issues/268
>
> Thanks,
> Dan.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
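To make the batching, inline-frame and chained-backend points from the quoted RFC a bit more concrete, here is a minimal Python sketch. Nothing in it comes from the patch under review: the frame fields (`module`, `offset`), the `locations` output field and the backend signature are all hypothetical, and the two backends are fakes standing in for something like a wrapper around llvm-symbolizer. It only illustrates one way the separation of parsing from symbolication could allow whole batches of PCs to be resolved per module.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

# Hypothetical shapes, not taken from the RFC: a backend receives a module
# path plus a batch of offsets and returns, per offset, a list of source
# locations (innermost inline frame first) or None if it cannot resolve it.
Location = Dict[str, object]
Backend = Callable[[str, List[int]], Dict[int, Optional[List[Location]]]]


def symbolicate(frames: List[dict], backends: List[Backend]) -> List[dict]:
    """Return copies of `frames`, each with a 'locations' list added."""
    # Batch: gather all offsets per module so a backend is queried once per
    # module instead of once per stack frame.
    pending: Dict[str, set] = defaultdict(set)
    for frame in frames:
        pending[frame["module"]].add(frame["offset"])

    resolved: Dict[tuple, List[Location]] = {}
    # Chain: each later backend only sees what earlier ones left unresolved.
    for backend in backends:
        for module, offsets in pending.items():
            if not offsets:
                continue
            answers = backend(module, sorted(offsets))
            for offset, locations in answers.items():
                if locations:
                    resolved[(module, offset)] = locations
                    offsets.discard(offset)

    return [
        {**f, "locations": resolved.get((f["module"], f["offset"]), [])}
        for f in frames
    ]


calls = []  # records each batch the first fake backend receives


def fake_debug_info_backend(module, offsets):
    # Stand-in for a backend with full debug info; 0x10 is an inlined call,
    # so one PC maps to two source locations.
    calls.append((module, list(offsets)))
    table = {("/bin/app", 0x10): [
        {"function": "inlined_helper", "file": "util.h", "line": 7},
        {"function": "main", "file": "main.c", "line": 42},
    ]}
    return {off: table.get((module, off)) for off in offsets}


def fake_symbol_table_backend(module, offsets):
    # Stand-in for a fallback backend that only knows function names.
    table = {("/lib/libfoo.so", 0x20): [
        {"function": "foo_init", "file": None, "line": None},
    ]}
    return {off: table.get((module, off)) for off in offsets}


frames = [
    {"index": 0, "module": "/bin/app", "offset": 0x10},
    {"index": 1, "module": "/lib/libfoo.so", "offset": 0x20},
    {"index": 2, "module": "/bin/app", "offset": 0x30},
]
result = symbolicate(frames, [fake_debug_info_backend, fake_symbol_table_backend])
```

In this sketch frame 0 ends up with two locations (the inline case), frame 1 is only resolved by the second backend in the chain, and frame 2 keeps an empty `locations` list because no backend knows it.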
Dan Liew via llvm-dev
2020-Oct-07 17:23 UTC
[llvm-dev] [RFC] Tooling for parsing and symbolication of Sanitizer reports
Hi,

On Tue, 6 Oct 2020 at 18:31, David Blaikie <dblaikie at gmail.com> wrote:
>
> My 2c would be to push back a bit more on the "let's not have a machine readable format, but instead parse the human readable format" - it seems like that's going to make the human readable format/parsing fairly brittle/hard to change (I mean, having the parser in tree will help, for sure).

I was operating under the assumption that the decision made in https://github.com/google/sanitizers/issues/268 was still the status quo. That was six years ago though so I'll let Kostya chime in here if he now thinks differently about this.

Even if we go down the route of having the sanitizers support machine-readable output I'd still like there to be an in-tree tool that supports doing offline symbolication on the machine readable output. So there still might be a case for having the proposed "llvm-xsan" tool in-tree.

> It'd be interesting to know more about what problems the valgrind XML format have had and how/whether different solutions would address/avoid those problems. Also might be good to hear about how other tools are parsing the output - whether or not/how they might benefit if it were machine readable to begin with.

Huh. I didn't know Valgrind had an XML format so I can't really comment on that (yet). On my side I can say we have at least two use cases inside Apple where we are parsing ASan reports and each use case ended up implementing its own parser.

>
> But, yeah, if that's the direction - having an in-tree tool with fairly narrow uses could be nice. One action to convert human readable reports to json, another to symbolize such a report, a simple tool to render the (symbolized or not) data back into human readable form - then sets it up for other tools to consume that json and, say, render it in a GUI, perform other diagnostics/analysis on the report, etc.

I hadn't thought about a tool to re-render reports in human readable form.
That's a good idea.
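As a rough illustration of the "convert to JSON" and "render back into human readable form" actions discussed in this thread, the Python sketch below parses only the stack-frame lines of an ASan-style report and renders them back. It is a sketch under assumptions, not the proposed tool: the JSON field names are invented here because the RFC leaves the schema open, the sample report is abbreviated, and the regular expression only covers two simple frame layouts (it does not handle function names containing spaces, for example).

```python
import json
import re

# Matches frame lines such as
#   "    #0 0x4011b6 in main /tmp/uaf.c:5:10"        (symbolicated)
#   "    #1 0x7f3a2c0b0082  (/lib/libc.so.6+0x24082)" (unsymbolicated)
FRAME_RE = re.compile(
    r"^\s*#(?P<index>\d+)\s+(?P<pc>0x[0-9a-fA-F]+)"
    r"(?:\s+in\s+(?P<function>\S+))?"
    r"(?:\s+\((?P<module>[^()]+)\+(?P<offset>0x[0-9a-fA-F]+)\))?"
    r"(?:\s+(?P<file>[^\s:]+):(?P<line>\d+)(?::(?P<column>\d+))?)?\s*$"
)


def parse_frames(report_text):
    """Extract stack frames as dicts; non-frame lines are ignored."""
    frames = []
    for text_line in report_text.splitlines():
        match = FRAME_RE.match(text_line)
        if not match:
            continue
        # Keep only the fields that were present on this line.
        frame = {k: v for k, v in match.groupdict().items() if v is not None}
        for key in ("index", "line", "column"):
            if key in frame:
                frame[key] = int(frame[key])
        frames.append(frame)
    return frames


def render_frame(frame):
    """Render one frame dict back into a report-style line."""
    text = f"    #{frame['index']} {frame['pc']}"
    if frame.get("function"):
        text += f" in {frame['function']}"
    if frame.get("file"):
        text += f" {frame['file']}:{frame['line']}"
        if frame.get("column") is not None:
            text += f":{frame['column']}"
    elif frame.get("module"):
        text += f"  ({frame['module']}+{frame['offset']})"
    return text


# Abbreviated, illustrative report text (not real tool output).
SAMPLE = """\
==12345==ERROR: AddressSanitizer: heap-use-after-free on address 0x602000000010
    #0 0x4011b6 in main /tmp/uaf.c:5:10
    #1 0x7f3a2c0b0082  (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
"""

frames = parse_frames(SAMPLE)
as_json = json.dumps({"frames": frames}, indent=2)
rendered = [render_frame(f) for f in frames]
```

Keeping the parsed frames as plain dictionaries means a symbolication step could fill in `function`, `file` and `line` for the second frame later, and the same `render_frame` would then print it in symbolicated form.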