Dean Michael Berris via llvm-dev
2016-Jul-20 07:48 UTC
[llvm-dev] [XRay] Build instrumented Clang, some analysis results
Hi everyone, TL;DR: With current pending patches applied in compiler-rt and llvm, and trunk clang, you can build your application with XRay tracing enabled on Linux with tracing enabled before main starts, and logging stops when the main thread exits. Just a quick update, I have some patches under review that when applied cleanly to LLVM and compiler-rt allows for building applications with XRay tracing enabled on Linux binaries from start to end. The list of patches to apply are: * https://reviews.llvm.org/D21987 (accounting tool, lives in llvm) * https://reviews.llvm.org/D21612 then https://reviews.llvm.org/D21982 (compiler-rt XRay runtime implementation) To test this out, I built clang with XRay and ran it on a hello world application. Here's how I did it: 1) Build clang+compiler-rt with the above patches patched in. Follow the normal instructions for getting a version of clang (from trunk) built. I use CMake+Ninja to do it, and I build everything. Example session: # assume ./llvm is set up appropriately $ cd llvm-build $ cmake -DCMAKE_BUILD_TYPE=Release -G Ninja ../llvm $ ninja $ cd .. 2) Build clang again, preferably in another build directory where you can use the clang and compiler-rt built in #1, this time with the following additional flags for CMAKE_CXX_FLAGS: `-fxray-instrument` `-fxray-instruction-threshold=1`. The second flag tells clang+llvm to instrument functions that have at least one instruction in it. An example session of this would look something like: $ mkdir xray-llvm-build $ cd xray-llvm-build $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=../llvm-build/bin/clang++ -DCMAKE_C_COMPILER=../llvm-build/bin/clang -DCMAKE_CXX_FLAGS='-fxray-instrument -fxray-instruction-threshold=1 -g' -DBUILD_SHARED_LIBS=Off ../llvm $ ninja clang $ cd .. 3) Run clang on a sample program, configure XRAY_OPTIONS while doing so for verbosity control. In my case I have the following "Hello, World!" program compiled by clang: ---->8 test.cc 8<---- #include <iostream> int main(int argc, char* argv[]) { std::cout << "I'm being compiled with XRay!" << std::endl; } ---->8 test.cc 8<---- $ XRAY_OPTIONS='verbosity=1' xray-llvm-build/bin/clang -o test.bin -O3 -x c++ -stdlib=libstdc++ test.cc You should then see two lines similar to the following printed: XRay: Log file in 'xray-log.DqqGQk' 4) Once you have one of those XRay log files, we can then use the 'xray-fc-account' tool built in the llvm-build/bin directory (in step 1). Warning: there's a lot of output in this step, I suggest piping to files. Example session: $ llvm-build/bin/xray-fc-account xray-log.DqqGQk -m xray-llvm-build/bin/clang -k >sample.txt 2>errs.txt We can even output it to CSV, which allows us to load the output in a spreadsheet application where we can start sorting and analysing the data as we see fit, by adding the "-format" flag to the above command: $ llvm-build/bin/xray-fc-account xray-log.DqqGQk -m xray-llvm-build/bin/clang -k -format=csv >sample.csv 2>errs.txt Analysis Results === Given the above steps, I've done some quick analysis with my favourite spreadsheet application where I'm showing the top results for the Clang run above. Here's some interesting findings: - The top 20 functions called in clang (in the above invocation) are: 292174 clang::DeclContext::getPrimaryContext() 283681 llvm::FoldingSetNodeID::AddPointer(void const*) 214706 clang::Preprocessor::Lex(clang::Token&) 193352 llvm::FoldingSetNodeID::AddInteger(unsigned int) 161706 clang::Lexer::LexTokenInternal(clang::Token&, bool) 161055 clang::Lexer::Lex(clang::Token&) 143420 clang::DeclarationName::getNameKind() const 142071 llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul>::Allocate(unsigned long, unsigned long) 139009 clang::NamespaceDecl::getOriginalNamespace() 125706 clang::Redeclarable<clang::TagDecl>::DeclLink::getNext(clang::TagDecl const*) const 107928 clang::Type::getAsTagDecl() const 107785 clang::Type::getAsCXXRecordDecl() const 101150 llvm::FoldingSetNodeID::operator==(llvm::FoldingSetNodeID const&) const 96244 llvm::hashing::detail::hash_short(char const*, unsigned long, unsigned long) 94872 _ZN4llvm7hashing6detail23hash_combine_range_implIKjEENSt9enable_ifIXsr16is_hashable_dataIT_EE5valueENS_9hash_codeEE4typeEPS5_S9_ 91536 clang::DeclContext::lookup(clang::DeclarationName) const 84122 clang::TypeLoc::getNextTypeLocImpl(clang::TypeLoc) 81368 clang::TypeLoc::getLocalSourceRangeImpl(clang::TypeLoc) 80626 clang::FunctionDecl::getCanonicalDecl() 79489 clang::TagDecl::getDefinition() const 77347 clang::Lexer::LexIdentifier(clang::Token&, char const*) - The top few slowest (filtered by hand, because, see caveats below): 1. main 2. clang::Parser::ParseNamespace(unsigned int, clang::SourceLocation&, clang::SourceLocation) 3. clang::Parser::ParseInnerNamespace(std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> >&, std::vector<clang::IdentifierInfo*, std::allocator<clang::IdentifierInfo*> >&, std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> >&, unsigned int, clang::SourceLocation&, clang::ParsedAttributes&, clang::BalancedDelimiterTracker&) 4. clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&) 5. clang::Parser::ParseDeclaration(unsigned int, clang::SourceLocation&, clang::Parser::ParsedAttributesWithRange&) 6. clang::Parser::ParseDeclarationStartingWithTemplate(unsigned int, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*) 7. clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) 8. clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, llvm::raw_pwrite_stream*) 9. clang::Parser::ParseExternalDeclaration(clang::Parser::ParsedAttributesWithRange&, clang::ParsingDeclSpec*) 10. clang::Parser::ParseTemplateDeclarationOrSpecialization(unsigned int, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*) 11. clang::Parser::ParseSingleDeclarationAfterTemplate(unsigned int, clang::Parser::ParsedTemplateInfo const&, clang::ParsingDeclRAIIObject&, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*) - For a single-threaded application, it ran on three "cpus": CPUID min tsc max tsc duration 5 14,393,810,869,713,800.00 14,393,812,737,589,100.00 0.7181373 12 14,393,816,387,291,400.00 14,393,816,388,054,500.00 0.0002933702 13 14,393,810,074,468,400.00 14,393,810,869,659,100.00 0.305725 - There's a single thread, and we get the full range of timestamps to get the overall duration: TID min tsc max tsc duration 83637 14,393,810,074,468,400.00 14,393,816,388,054,500.00 2.427369 Caveats === * Because clang was built with -O3, we're missing some function exits because of tail call and sibling optimisations -- we try to handle those cases from the function call accounting tool instead of adding instrumentation points (yet) in LLVM. Further work needs to be done to identify those exit points as well so that we have less ambiguity in the data. * In that simple invocation of clang that took roughly around 1 second to complete, we are able to generate roughly 606MB from the naive in-memory logging implementation. That's a lot of data for the fully-instrumented clang version. Removing the '-fxray-instruction-threshold=1' flag defaults this to 200 instructions, and there's considerably less data logged and only from non-trivial functions (~34MB when I tested it). * The durations are in seconds, computed by taking the maximum CPU frequency. We assume that the CPU TSC counters don't drift/pause (i.e. assume "constant tic" and "nonstop tsc"). This may not be safe assumptions to make, but should be fine for the interim as we iterate on how to get this data more reliably at least in Linux. * If you add '-v' to the invocation of xray-fc-account, it will give you a stack reconstruction as it goes through every entry/exit event. Last I tried this it took a long time to generate the text, and I gave up when it went over the 1.7GB (of text) mark. It's pretty and hypnotic to watch the stack grow and shrink just from the entry and exits of functions, but it's not practical to do. :) ---- If you're still reading up to here, thank you for taking the time. I'm happy to iterate on the questions people might have. Have fun, and I hope this helps! Cheers PS. If you'd like to get the raw trace, I'd be happy to share it but it's 81MB bzip2'd. :) Also, if you'd like to see the csv output or the text output, I'd be happy to share it too but not via email on the mailing list. :D
C Bergström via llvm-dev
2016-Jul-20 10:02 UTC
[llvm-dev] [XRay] Build instrumented Clang, some analysis results
Some general questions about X-Ray ------------- Is there a plan to make a separate mailing list or project around this? Do you have a list of planned features? Graphics tools for analysis? AMD open sourced their CodeAnalyst - What about some integration with that? Linux + Perf support (planned/exists)? How much is this tied to something specific about Linux or it could be easily ported to another platform? What's the benefit of this vs a stable and production ready tool like Dtrace? How much overhead do you typically measure? If you're injection calls before/after every function - does it end up blocking optimizations? Without looking at the implementation, if you're injecting the calls late enough in the compilation process it won't be a "problem", but if it's too early - you're going to end up blocking a lot of optimizations and interfering with things a lot.. On Wed, Jul 20, 2016 at 3:48 PM, Dean Michael Berris via llvm-dev <llvm-dev at lists.llvm.org> wrote:> Hi everyone, > > TL;DR: With current pending patches applied in compiler-rt and llvm, and trunk clang, you can build your application with XRay tracing enabled on Linux with tracing enabled before main starts, and logging stops when the main thread exits. > > Just a quick update, I have some patches under review that when applied cleanly to LLVM and compiler-rt allows for building applications with XRay tracing enabled on Linux binaries from start to end. The list of patches to apply are: > > * https://reviews.llvm.org/D21987 (accounting tool, lives in llvm) > * https://reviews.llvm.org/D21612 then https://reviews.llvm.org/D21982 (compiler-rt XRay runtime implementation) > > To test this out, I built clang with XRay and ran it on a hello world application. Here's how I did it: > > 1) Build clang+compiler-rt with the above patches patched in. Follow the normal instructions for getting a version of clang (from trunk) built. I use CMake+Ninja to do it, and I build everything. Example session: > > # assume ./llvm is set up appropriately > $ cd llvm-build > $ cmake -DCMAKE_BUILD_TYPE=Release -G Ninja ../llvm > $ ninja > $ cd .. > > 2) Build clang again, preferably in another build directory where you can use the clang and compiler-rt built in #1, this time with the following additional flags for CMAKE_CXX_FLAGS: `-fxray-instrument` `-fxray-instruction-threshold=1`. The second flag tells clang+llvm to instrument functions that have at least one instruction in it. An example session of this would look something like: > > $ mkdir xray-llvm-build > $ cd xray-llvm-build > $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=../llvm-build/bin/clang++ -DCMAKE_C_COMPILER=../llvm-build/bin/clang -DCMAKE_CXX_FLAGS='-fxray-instrument -fxray-instruction-threshold=1 -g' -DBUILD_SHARED_LIBS=Off ../llvm > $ ninja clang > $ cd .. > > 3) Run clang on a sample program, configure XRAY_OPTIONS while doing so for verbosity control. In my case I have the following "Hello, World!" program compiled by clang: > > ---->8 test.cc 8<---- > #include <iostream> > > int main(int argc, char* argv[]) { > std::cout << "I'm being compiled with XRay!" << std::endl; > } > ---->8 test.cc 8<---- > > $ XRAY_OPTIONS='verbosity=1' xray-llvm-build/bin/clang -o test.bin -O3 -x c++ -stdlib=libstdc++ test.cc > > You should then see two lines similar to the following printed: > > XRay: Log file in 'xray-log.DqqGQk' > > 4) Once you have one of those XRay log files, we can then use the 'xray-fc-account' tool built in the llvm-build/bin directory (in step 1). Warning: there's a lot of output in this step, I suggest piping to files. Example session: > > $ llvm-build/bin/xray-fc-account xray-log.DqqGQk -m xray-llvm-build/bin/clang -k >sample.txt 2>errs.txt > > We can even output it to CSV, which allows us to load the output in a spreadsheet application where we can start sorting and analysing the data as we see fit, by adding the "-format" flag to the above command: > > $ llvm-build/bin/xray-fc-account xray-log.DqqGQk -m xray-llvm-build/bin/clang -k -format=csv >sample.csv 2>errs.txt > > Analysis Results > ===> > Given the above steps, I've done some quick analysis with my favourite spreadsheet application where I'm showing the top results for the Clang run above. Here's some interesting findings: > > - The top 20 functions called in clang (in the above invocation) are: > > 292174 clang::DeclContext::getPrimaryContext() > 283681 llvm::FoldingSetNodeID::AddPointer(void const*) > 214706 clang::Preprocessor::Lex(clang::Token&) > 193352 llvm::FoldingSetNodeID::AddInteger(unsigned int) > 161706 clang::Lexer::LexTokenInternal(clang::Token&, bool) > 161055 clang::Lexer::Lex(clang::Token&) > 143420 clang::DeclarationName::getNameKind() const > 142071 llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul>::Allocate(unsigned long, unsigned long) > 139009 clang::NamespaceDecl::getOriginalNamespace() > 125706 clang::Redeclarable<clang::TagDecl>::DeclLink::getNext(clang::TagDecl const*) const > 107928 clang::Type::getAsTagDecl() const > 107785 clang::Type::getAsCXXRecordDecl() const > 101150 llvm::FoldingSetNodeID::operator==(llvm::FoldingSetNodeID const&) const > 96244 llvm::hashing::detail::hash_short(char const*, unsigned long, unsigned long) > 94872 _ZN4llvm7hashing6detail23hash_combine_range_implIKjEENSt9enable_ifIXsr16is_hashable_dataIT_EE5valueENS_9hash_codeEE4typeEPS5_S9_ > 91536 clang::DeclContext::lookup(clang::DeclarationName) const > 84122 clang::TypeLoc::getNextTypeLocImpl(clang::TypeLoc) > 81368 clang::TypeLoc::getLocalSourceRangeImpl(clang::TypeLoc) > 80626 clang::FunctionDecl::getCanonicalDecl() > 79489 clang::TagDecl::getDefinition() const > 77347 clang::Lexer::LexIdentifier(clang::Token&, char const*) > > - The top few slowest (filtered by hand, because, see caveats below): > > 1. main > 2. clang::Parser::ParseNamespace(unsigned int, clang::SourceLocation&, clang::SourceLocation) > 3. clang::Parser::ParseInnerNamespace(std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> >&, std::vector<clang::IdentifierInfo*, std::allocator<clang::IdentifierInfo*> >&, std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> >&, unsigned int, clang::SourceLocation&, clang::ParsedAttributes&, clang::BalancedDelimiterTracker&) > 4. clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&) > 5. clang::Parser::ParseDeclaration(unsigned int, clang::SourceLocation&, clang::Parser::ParsedAttributesWithRange&) > 6. clang::Parser::ParseDeclarationStartingWithTemplate(unsigned int, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*) > 7. clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) > 8. clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, llvm::raw_pwrite_stream*) > 9. clang::Parser::ParseExternalDeclaration(clang::Parser::ParsedAttributesWithRange&, clang::ParsingDeclSpec*) > 10. clang::Parser::ParseTemplateDeclarationOrSpecialization(unsigned int, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*) > 11. clang::Parser::ParseSingleDeclarationAfterTemplate(unsigned int, clang::Parser::ParsedTemplateInfo const&, clang::ParsingDeclRAIIObject&, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*) > > - For a single-threaded application, it ran on three "cpus": > > CPUID min tsc max tsc duration > 5 14,393,810,869,713,800.00 14,393,812,737,589,100.00 0.7181373 > 12 14,393,816,387,291,400.00 14,393,816,388,054,500.00 0.0002933702 > 13 14,393,810,074,468,400.00 14,393,810,869,659,100.00 0.305725 > > - There's a single thread, and we get the full range of timestamps to get the overall duration: > > TID min tsc max tsc duration > 83637 14,393,810,074,468,400.00 14,393,816,388,054,500.00 2.427369 > > Caveats > ===> > * Because clang was built with -O3, we're missing some function exits because of tail call and sibling optimisations -- we try to handle those cases from the function call accounting tool instead of adding instrumentation points (yet) in LLVM. Further work needs to be done to identify those exit points as well so that we have less ambiguity in the data. > > * In that simple invocation of clang that took roughly around 1 second to complete, we are able to generate roughly 606MB from the naive in-memory logging implementation. That's a lot of data for the fully-instrumented clang version. Removing the '-fxray-instruction-threshold=1' flag defaults this to 200 instructions, and there's considerably less data logged and only from non-trivial functions (~34MB when I tested it). > > * The durations are in seconds, computed by taking the maximum CPU frequency. We assume that the CPU TSC counters don't drift/pause (i.e. assume "constant tic" and "nonstop tsc"). This may not be safe assumptions to make, but should be fine for the interim as we iterate on how to get this data more reliably at least in Linux. > > * If you add '-v' to the invocation of xray-fc-account, it will give you a stack reconstruction as it goes through every entry/exit event. Last I tried this it took a long time to generate the text, and I gave up when it went over the 1.7GB (of text) mark. It's pretty and hypnotic to watch the stack grow and shrink just from the entry and exits of functions, but it's not practical to do. :) > > ---- > > If you're still reading up to here, thank you for taking the time. I'm happy to iterate on the questions people might have. > > Have fun, and I hope this helps! > > Cheers > > PS. If you'd like to get the raw trace, I'd be happy to share it but it's 81MB bzip2'd. :) Also, if you'd like to see the csv output or the text output, I'd be happy to share it too but not via email on the mailing list. :D > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Dean Michael Berris via llvm-dev
2016-Jul-20 10:26 UTC
[llvm-dev] [XRay] Build instrumented Clang, some analysis results
> On 20 Jul 2016, at 20:02, C Bergström <cbergstrom at pathscale.com> wrote: > > Some general questions about X-Ray > ------------- > Is there a plan to make a separate mailing list or project around > this? Do you have a list of planned features?Interesting question -- so far we haven't decided yet whether XRay will live as another project. I'm certainly open to this possibility. No concrete plans yet. It's an open question in the original RFC too (http://lists.llvm.org/pipermail/llvm-dev/2016-April/098901.html). There's a white paper that details what we plan to implement out in the open (http://research.google.com/pubs/pub45287.html). We're still working our way to getting to a full version as described in that white paper (basically blocked by my lack of familiarity with the LLVM codebase, and other n00b-y things :D). There's not a concrete list of features, and we're certainly open to contributions from the community to add features that make sense. :)> > Graphics tools for analysis? AMD open sourced their CodeAnalyst - What > about some integration with that? >Thanks for the pointer! Yes, I'd love to have more integration with existing visualisation tools that read a particular well-documented format. Others have mentioned Jumpshot which might be a little dated, but still something that some people use for similar things.> Linux + Perf support (planned/exists)? >There are no plans to support the perf counter-lookups yet. Although I certainly think that's a nice source of data to be logging XRay-style. FWIW, the API for XRay allows us to decouple the things being logged at function entry/exit. Getting performance counters at those points is a nice idea, it should be doable.> How much is this tied to something specific about Linux or it could be > easily ported to another platform?Currently, the only Linux-specific part I can remember is getting the cpu frequency (looking at sysfs files). That can be implemented on a platform-agnostic (or at least pluggable and portable) manner. There are x86'isms and I'm working on understanding how to do this in Aarch64 or ARM.> > What's the benefit of this vs a stable and production ready tool like Dtrace? >I think I've pointed out the differences in a separate mail (some mail filters may have squashed that response, so apologies if that was missed): http://lists.llvm.org/pipermail/llvm-dev/2016-July/101922.html -- the short version is: - Dtrace requires kernel-side support. - XRay is completely in-process and controllable by the process through an API (not sure if dtrace is the same). - XRay is selective and configurable by the application developer. - XRay's cost is borne by the application only, and does not require stopping the application.> How much overhead do you typically measure? >We've seen in the "null logging case" something around O(100) cycles in X86 for the trampoline side of XRay. Of course richer logging requires more cycles, and is completely implementation-dependent. The current one under development only writes fixed-sized records, uses __rdtscp(), does aligned writes only, and flushes when the buffer is full. The buffer is 32k per thread. I haven't formally done benchmarks on the current implementation yet, but I'd be happy to do that soon.> If you're injection calls before/after every function - does it end up > blocking optimizations? Without looking at the implementation, if > you're injecting the calls late enough in the compilation process it > won't be a "problem", but if it's too early - you're going to end up > blocking a lot of optimizations and interfering with things a lot.. >It's currently implemented as a MachineFunctionPass, and as far as I can tell is already late enough in the process that we are not interfering with optimisations. Cheers
Possibly Parallel Threads
- Using Xray compilation with C programs
- XRay Trace of Clang, Loadable through Chrome Trace Viewer
- [cfe-dev] XRay Trace of Clang, Loadable through Chrome Trace Viewer
- [X-ray] How to check successful instrumentation and generate call trace?
- [llvm-xray] llvm-xray cannot log every functions