Choongwoo Han via llvm-dev
2021-Feb-15 18:24 UTC
[llvm-dev] RFC: Support regex based path remapping for llvm-cov
Hi all, We'd like to contribute something related to -path-equivalence option of llvm-cov. We want to hear feedbacks from you guys. In summary, there are three changes. 1. Use regular expression to remap paths. 2. Do not use `sys::fs::equivalent` to find the same file, which causes a huge performance issue in our environment. 3. Use remapped paths for output directory paths of `llvm-show` instead of original paths. Here is some context why we want this feature. We're building a single binary from multiple machines, so the binary contains different source code paths for the same file. There is already `-path-equivalence` option for remapping paths, but it does not meet our requirement because we need n to 1 mapping while the current implementation only supports 1 to 1 mapping. Thus, currently, we are using a custom llvm-cov build using a regular expression. The command line looks like this $ llvm-cov show -path-equivalence='[A-Za-z0-9_/\\:]+[\\/]src[\\/][0-9]+[\\/]([A-Za-z]),:/' ... Then, /path/to/src/0/f/asdf becomes f:/asdf. We are using a capture to find a prefix (i.e. captured_string + filepath.replace(regex, to)), but we can drop this part if it looks complicated and not intuitive. Secondly, there is a huge performance degradation in our environment because of file system accesses in llvm-cov. https://github.com/llvm/llvm-project/blob/d079dbc591899159925a1fe10b081fa0f6bb61bd/llvm/tools/llvm-cov/CodeCoverage.cpp#L251-L253 ``` ErrorOr<const MemoryBuffer &> CodeCoverageTool::getSourceFile(StringRef SourceFile) { ... for (const auto &Files : LoadedSourceFiles) if (sys::fs::equivalent(SourceFile, Files.first)) return *Files.second; ``` If I rewrite the `sys::fs::equivalent` to a plain string comparison, the performance is improved a lot. When I tested it with a sample binary, the running time reduced from 1 minutes to 7 seconds. And, the overall running time of llvm-cov in our Windows environment reduced from 17 hours to 4 hours. Since we are already using the remapped paths to find the same files, we don't need this kind of strict check to find equivalent files. So, I'd like to replace this strict comparison to a simple one when this regex-based remapping is being used. Lastly, the current implementation is using the original paths to generate output files. For example, if `-path-equivalence=/tmp,/output -format=html -output-dir=./outdir` is used, the generated file paths will be `./outdir/tmp/...`, not `./outdir/output/...`. It does not make sense because there can be multiple different original paths. Thus, I'd like to use the remapped file paths to generate the output files. Since we don't want to break the existing option, I'm thinking about introducing a new flag adding `-regex` at the end (i.e. `-path-equivalence-regex`). I want to ask which approach is best for this case (introducing a new flag or updating the existing behavior). Please let me know if there is any concern from this proposal. I also uploaded a proof of concept to https://reviews.llvm.org/D96696 Thanks, Choongwoo Han -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210215/2b06bc9b/attachment.html>
Petr Hosek via llvm-dev
2021-Feb-15 22:50 UTC
[llvm-dev] RFC: Support regex based path remapping for llvm-cov
We have a similar issue. We use a distributed compilation system where different TUs may be compiled on different machines and the compilation directory would differ for each TU. Our solution is to make the compiler output deterministic regardless of the compilation directory by storing compilation directory separately from filenames, which can be manually overridden, and by using relative paths if those were passed to the compiler. This is similar to debug info handling, see https://blog.llvm.org/2019/11/deterministic-builds-with-clang-and-lld.html. This solution was implemented in https://reviews.llvm.org/D95753 and I hope to land this within the next day or two. Would this address your problem as well? On Mon, Feb 15, 2021 at 10:25 AM Choongwoo Han via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Hi all, > > We'd like to contribute something related to -path-equivalence option of > llvm-cov. We want to hear feedbacks from you guys. > > In summary, there are three changes. > 1. Use regular expression to remap paths. > 2. Do not use `sys::fs::equivalent` to find the same file, which causes a > huge performance issue in our environment. > 3. Use remapped paths for output directory paths of `llvm-show` instead of > original paths. > > Here is some context why we want this feature. We're building a single > binary from multiple machines, so the binary contains different source code > paths for the same file. There is already `-path-equivalence` option for > remapping paths, but it does not meet our requirement because we need n to > 1 mapping while the current implementation only supports 1 to 1 mapping. > Thus, currently, we are using a custom llvm-cov build using a regular > expression. The command line looks like this > $ llvm-cov show > -path-equivalence='[A-Za-z0-9_/\\:]+[\\/]src[\\/][0-9]+[\\/]([A-Za-z]),:/' ... > Then, /path/to/src/0/f/asdf becomes f:/asdf. > > We are using a capture to find a prefix (i.e. captured_string + > filepath.replace(regex, to)), but we can drop this part if it looks > complicated and not intuitive. > > Secondly, there is a huge performance degradation in our environment > because of file system accesses in llvm-cov. > > https://github.com/llvm/llvm-project/blob/d079dbc591899159925a1fe10b081fa0f6bb61bd/llvm/tools/llvm-cov/CodeCoverage.cpp#L251-L253 > ``` > ErrorOr<const MemoryBuffer &> > CodeCoverageTool::getSourceFile(StringRef SourceFile) { > ... > for (const auto &Files : LoadedSourceFiles) > if (sys::fs::equivalent(SourceFile, Files.first)) > return *Files.second; > ``` > > If I rewrite the `sys::fs::equivalent` to a plain string comparison, the > performance is improved a lot. When I tested it with a sample binary, the > running time reduced from 1 minutes to 7 seconds. And, the overall running > time of llvm-cov in our Windows environment reduced from 17 hours to 4 > hours. > Since we are already using the remapped paths to find the same files, we > don't need this kind of strict check to find equivalent files. So, I'd like > to replace this strict comparison to a simple one when this regex-based > remapping is being used. > > Lastly, the current implementation is using the original paths to generate > output files. For example, if `-path-equivalence=/tmp,/output -format=html > -output-dir=./outdir` is used, the generated file paths will be > `./outdir/tmp/...`, not `./outdir/output/...`. It does not make sense > because there can be multiple different original paths. Thus, I'd like to > use the remapped file paths to generate the output files. > > Since we don't want to break the existing option, I'm thinking about > introducing a new flag adding `-regex` at the end (i.e. > `-path-equivalence-regex`). I want to ask which approach is best for this > case (introducing a new flag or updating the existing behavior). > > Please let me know if there is any concern from this proposal. > I also uploaded a proof of concept to https://reviews.llvm.org/D96696 > > Thanks, > Choongwoo Han > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210215/f9aba81b/attachment.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 3996 bytes Desc: S/MIME Cryptographic Signature URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210215/f9aba81b/attachment.bin>