Shuyang Wang via llvm-dev
2017-Jul-12 20:54 UTC
[llvm-dev] A Prototype to Track Input Read for Sparse File Fuzzing
Hi everyone,

I wrote a prototype based on the LLVM sanitizer infrastructure to improve fuzzing performance, especially over sparse file formats. I'd like to upstream it if anyone thinks it is useful.

Sparse file formats are formats in which only a small portion of the file data can affect the behavior of the program that parses it. Common examples are archive files or file system images, where only the metadata influences program behavior. When fuzzing those formats, a general-purpose fuzzer will randomly select ranges to mutate. Because of the sparse nature of the formats, random range selection has a high probability of hitting the "holes" where the data has no influence on the parser. While trimming the input can sometimes improve the effective-range hit rate, it does not always work. For instance, some programs impose a minimum file size requirement that turns out to be fairly large for fuzzing, or the effective ranges are sparsely distributed over the entire file rather than concentrated at the beginning.

The tool I wrote leverages the observation that a piece of data can influence its parser's behavior only if the parser actually reads that data, and the read regions of a sparse file are usually quite small compared to the entire file. By generating a read map for each input and feeding the map to a modified fuzzer that prioritizes mutating those ranges, we observed over a 10x improvement in path discovery at bootstrap time in our tests. The modified fuzzer was also able to find crashes in half an hour that the original version could not find in 72 hours, at which point we ended the test.

At a high level, the tool uses an instrumentation pass to record every memory read in shadow memory, while a runtime tracks buffer propagation from a user-specified buffer (the initial buffer a file is read into) and coalesces the shadow memory for these buffers.
A read map can be generated for each input file with the instrumented binary.

I hope this is interesting to some people, and I can provide more details. The prototype is not ready to upstream yet, but I would like to work on it if the community is interested.

_________________________
Shuyang Wang
Apple
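[Editorial sketch] The coalescing step described above can be illustrated with a small standalone model (this is not the prototype's actual code; the class and method names are hypothetical): each read against the tracked buffer is recorded as a byte range, and overlapping or adjacent ranges are merged into a compact read map.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Toy read-map: records [offset, offset+len) ranges read from a tracked
// input buffer and coalesces them into a minimal sorted interval list,
// mimicking what the prototype's runtime does with shadow memory.
class ReadMap {
public:
  void RecordRead(size_t Offset, size_t Len) {
    Ranges.emplace_back(Offset, Offset + Len);
  }

  // Merge overlapping ranges; the result is the "read map" a fuzzer
  // could use to restrict mutation to effective regions.
  std::vector<std::pair<size_t, size_t>> Coalesce() {
    std::sort(Ranges.begin(), Ranges.end());
    std::vector<std::pair<size_t, size_t>> Out;
    for (const auto &R : Ranges) {
      if (!Out.empty() && R.first <= Out.back().second)
        Out.back().second = std::max(Out.back().second, R.second);
      else
        Out.push_back(R);
    }
    return Out;
  }

private:
  std::vector<std::pair<size_t, size_t>> Ranges; // raw [begin, end) pairs
};
```

For a sparse format, the coalesced list would typically cover only a few small metadata regions of a large file, which is exactly what makes range-prioritized mutation pay off.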
Kostya Serebryany via llvm-dev
2017-Jul-13 21:04 UTC
[llvm-dev] A Prototype to Track Input Read for Sparse File Fuzzing
This topic pops up regularly when discussing fuzzers, and not only for sparse input formats. I hope to eventually have a reasonable solution in libFuzzer itself. One way is to couple libFuzzer with dfsan (I even had some code for this, but removed it later).

In the meantime, contributions are very welcome in various forms:
* add micro fuzzing tests (puzzles) to https://github.com/llvm-mirror/llvm/tree/master/lib/Fuzzer/test
* add real-life examples to https://github.com/google/fuzzer-test-suite/
* add a standalone "custom mutator" (see LLVMFuzzerCustomMutator) that uses your tool to apply mutations only to the relevant parts of the input.

On Wed, Jul 12, 2017 at 1:54 PM, Shuyang Wang <swang2 at apple.com> wrote:
> [original message quoted in full; trimmed]
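[Editorial sketch] The custom-mutator route Kostya suggests hooks libFuzzer's LLVMFuzzerCustomMutator entry point. A minimal sketch of the idea, restricting mutation to known effective ranges, might look like this (the ranges are hypothetical placeholders; a real setup would load them from the generated read map, and would usually delegate to LLVMFuzzerMutate for the mutation itself):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>

// Hypothetical: {offset, length} ranges the read-tracking tool reported
// as effective for this format. In practice, load these from the read map.
static const std::pair<size_t, size_t> kEffectiveRanges[] = {
    {0, 16},   // e.g. a header
    {512, 64}, // e.g. an index/metadata block
};

// libFuzzer calls this hook instead of its builtin mutator when it is
// defined. This sketch flips one random byte inside each effective range
// and leaves the "holes" untouched.
extern "C" size_t LLVMFuzzerCustomMutator(uint8_t *Data, size_t Size,
                                          size_t MaxSize, unsigned int Seed) {
  std::minstd_rand Rng(Seed);
  for (const auto &Range : kEffectiveRanges) {
    size_t Begin = Range.first;
    size_t End = std::min(Begin + Range.second, Size);
    if (Begin >= End)
      continue; // range lies beyond this input
    size_t Pos = Begin + Rng() % (End - Begin);
    Data[Pos] ^= static_cast<uint8_t>(Rng());
  }
  (void)MaxSize;
  return Size; // size unchanged in this simple sketch
}
```

Because every mutation lands inside an effective range, no fuzzing budget is spent on bytes the parser never reads.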
Shuyang Wang via llvm-dev
2017-Jul-14 22:40 UTC
[llvm-dev] A Prototype to Track Input Read for Sparse File Fuzzing
Hi Kostya,

Thanks for getting back to me. My work is somewhat similar to dfsan at a high level, but it is specifically designed only to identify the read regions of an input file. It uses the foundational sanitizer infrastructure, so if dfsan can be integrated with libFuzzer, I think my work can as well. Regarding dfsan with libFuzzer, could you remind me why it was removed?

_________________________
Shuyang Wang
Security Engineering & Architecture

> On Jul 13, 2017, at 2:04 PM, Kostya Serebryany <kcc at google.com> wrote:
>
> This topic pops up regularly when discussing fuzzers, and not only for sparse input formats. I hope to eventually have a reasonable solution in libFuzzer itself. One way is to couple libFuzzer with dfsan (I even had some code for this, but removed it later).
>
> In the meantime, contributions are very welcome in various forms:
> * add micro fuzzing tests (puzzles) to https://github.com/llvm-mirror/llvm/tree/master/lib/Fuzzer/test
> * add real-life examples to https://github.com/google/fuzzer-test-suite/
> * add a standalone "custom mutator" (see LLVMFuzzerCustomMutator) that uses your tool to apply mutations only to the relevant parts of the input.
>
> On Wed, Jul 12, 2017 at 1:54 PM, Shuyang Wang <swang2 at apple.com> wrote:
> > [original message quoted in full; trimmed]
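[Editorial sketch] The dfsan comparison in the reply above can be made concrete with a toy model. dfsan attaches taint labels to data and propagates them through computation; read tracking is a narrower variant that only asks which input bytes were ever read. This sketch is not the real dfsan API (which uses compact labels and shadow memory), just a self-contained illustration of the propagation idea:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Toy model of dfsan-style taint propagation: each value carries the set
// of input byte offsets that influenced it. Real dfsan stores compact
// labels in shadow memory; std::set is used here only for clarity.
struct Tainted {
  uint32_t Value;
  std::set<size_t> Sources; // input offsets this value depends on
};

// Reading input byte i yields a value labeled with {i}.
Tainted ReadInput(const uint8_t *Input, size_t i) {
  return {Input[i], {i}};
}

// Arithmetic unions the operands' labels, mirroring label propagation.
Tainted Add(const Tainted &A, const Tainted &B) {
  Tainted R{A.Value + B.Value, A.Sources};
  R.Sources.insert(B.Sources.begin(), B.Sources.end());
  return R;
}
```

A read-region tracker only needs the first half of this picture: it records which offsets ReadInput touched, without following labels through later computation, which is what keeps it cheaper than full dataflow tracking.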