Nipun Sehrawat
2015-Feb-24 21:50 UTC
[LLVMdev] Removing contention in PassRegistry accesses to speed up compiles
Hi, We use LLVM libraries to compile C++ code and noticed slow downs when multiple threads of a process were compiling at once. *perf *indicated that most of the CPU time was spent in a spin lock, which was being locked/unlocked from llvm::PassRegistry::getPassInfo(). We read the relevant LLVM code and found out that PassRegistry is a ManagedStatic and is shared among all threads in case of a multi-threaded setup. This sharing requires locking in PassRegistry's method, which becomes source of the contention. To get rid of the contention, we made a change to make PassRegistry thread-local and got rid of the locking. This removed all the contention and we noticed a 2x speed up in single thread compiles and 7x improvement when ten threads were compiling in parallel. Please find attached the diff for this change. We are using a old version of LLVM code (svn revision 170375), so the code might be quite outdated. We have two questions: 1. Does the change look reasonable? Or are we missing something here? 2. When we run with 1000 threads compiling concurrently, we deterministically run into a segfault in PassRegistry lookup. Any insights into the segfault? Please find attached the following files: 1. *pass_registry.txt*: Git diff of our change. Note that it is against LLVM svn revision 170375. 2.* contention.txt*: Perf report with existing LLVM code - shows contention in llvm::PassRegistry::getPassInfo() 3. *no_contention.txt*: Perf report of LLVM built with our change. 4. *segfault.txt*: Segfault we are encountering after our change. 5. *clang_compile.cpp*: Snippet of code we use to compile code using LLVM. Thanks a lot, Nipun -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150224/47a0c9d5/attachment.html> -------------- next part -------------- Samples: 696K of event 'cycles', Event count (approx.): 427355036253 - 48.98% caching_compile [kernel.kallsyms] [k] __ticket_spin_lock - __ticket_spin_lock - 98.91% _raw_spin_lock - 57.49% futex_wake do_futex SyS_futex - system_call_fastpath - 100.00% __lll_unlock_wake + 99.77% llvm::PassRegistry::getPassInfo(void const*) const - 38.01% futex_wait_setup futex_wait do_futex SyS_futex - system_call_fastpath - 100.00% __lll_lock_wait + 99.86% llvm::PassRegistry::getPassInfo(void const*) const + 4.03% try_to_wake_up + 0.61% _raw_spin_lock_irq + 4.10% caching_compile libpthread.so.0 [.] pthread_mutex_lock + 1.81% caching_compile caching_compiler_test_old [.] llvm::sys::MemoryFence() + 1.63% caching_compile libpthread.so.0 [.] __pthread_mutex_unlock_usercnt -------------- next part -------------- Samples: 113K of event 'cycles', Event count (approx.): 70370381305 + 10.38% caching_compile caching_compiler_test [.] common::RecordParser::MatchData(common::RecordParser::Record*, unsigned long) + 5.45% caching_compile caching_compiler_test [.] common::RecordParser::Stream::FileStreamAdapter::good() + 3.31% caching_compile caching_compiler_test [.] common::RecordParser::Stream::FileStreamAdapter::get() + 1.17% caching_compile caching_compiler_test [.] llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int) + 1.10% caching_compile caching_compiler_test [.] operator new(unsigned long) + 1.07% caching_compile caching_compiler_test [.] llvm::SmallPtrSetImpl::insert_imp(void const*) + 1.00% caching_compile caching_compiler_test [.] llvm::PMDataManager::removeNotPreservedAnalysis(llvm::Pass*) + 0.99% caching_compile caching_compiler_test [.] operator delete(void*) + 0.69% caching_compile caching_compiler_test [.] llvm::sys::MemoryFence() + 0.64% caching_compile caching_compiler_test [.] llvm::PMDataManager::initializeAnalysisImpl(llvm::Pass*) + 0.59% caching_compile libc.so.6 [.] __memcpy_ssse3_back + 0.57% caching_compile caching_compiler_test [.] llvm::PassRegistry::getPassInfo(void const*) const + 0.54% caching_compile libc.so.6 [.] __memset_sse2 + 0.53% caching_compile caching_compiler_test [.] (anonymous namespace)::VectorLegalizer::LegalizeOp(llvm::SDValue) + 0.47% caching_compile caching_compiler_test [.] llvm::PMDataManager::findAnalysisPass(void const*, bool) -------------- next part -------------- diff --git a/llvm/llvm.svnrev.170375/include/llvm/PassSupport.h b/llvm/llvm.svnrev.170375/include/llvm/PassSupport.h index 3633f47..44b4d56 100644 --- a/llvm/llvm.svnrev.170375/include/llvm/PassSupport.h +++ b/llvm/llvm.svnrev.170375/include/llvm/PassSupport.h @@ -130,25 +130,14 @@ private: PassInfo(const PassInfo &) LLVM_DELETED_FUNCTION; }; + + #define CALL_ONCE_INITIALIZATION(function) \ - static volatile sys::cas_flag initialized = 0; \ - sys::cas_flag old_val = sys::CompareAndSwap(&initialized, 1, 0); \ - if (old_val == 0) { \ + static __thread int initialized = 0; \ + if (initialized == 0) { \ function(Registry); \ - sys::MemoryFence(); \ - TsanIgnoreWritesBegin(); \ - TsanHappensBefore(&initialized); \ - initialized = 2; \ - TsanIgnoreWritesEnd(); \ - } else { \ - sys::cas_flag tmp = initialized; \ - sys::MemoryFence(); \ - while (tmp != 2) { \ - tmp = initialized; \ - sys::MemoryFence(); \ - } \ - } \ - TsanHappensAfter(&initialized); + initialized = 1; \ + } #define INITIALIZE_PASS(passName, arg, name, cfg, analysis) \ static void* initialize##passName##PassOnce(PassRegistry &Registry) { \ diff --git a/llvm/llvm.svnrev.170375/lib/VMCore/PassRegistry.cpp b/llvm/llvm.svnrev.170375/lib/VMCore/PassRegistry.cpp index 5d22879..8cca689 100644 --- a/llvm/llvm.svnrev.170375/lib/VMCore/PassRegistry.cpp +++ b/llvm/llvm.svnrev.170375/lib/VMCore/PassRegistry.cpp @@ -30,12 +30,17 @@ using namespace llvm; // llvm_shutdown clear this map prevents successful resurrection after // llvm_shutdown is run. Ideally we should find a solution so that we don't // leak the map, AND can still resurrect after shutdown. -static ManagedStatic<PassRegistry> PassRegistryObj; + +// Each thread gets its own PassRegistry. +static __thread PassRegistry* PassRegistryObj = NULL; PassRegistry *PassRegistry::getPassRegistry() { - return &*PassRegistryObj; + // TODO(nipun): Handle deletion. + if (NULL == PassRegistryObj) { + PassRegistryObj = new PassRegistry(); + } + return PassRegistryObj; } -static ManagedStatic<sys::SmartMutex<true> > Lock; //===----------------------------------------------------------------------===// // PassRegistryImpl @@ -72,26 +77,23 @@ void *PassRegistry::getImpl() const { // PassRegistry::~PassRegistry() { - sys::SmartScopedLock<true> Guard(*Lock); PassRegistryImpl *Impl = static_cast<PassRegistryImpl*>(pImpl); - + for (std::vector<const PassInfo*>::iterator I = Impl->ToFree.begin(), E = Impl->ToFree.end(); I != E; ++I) delete *I; - + delete Impl; pImpl = 0; } const PassInfo *PassRegistry::getPassInfo(const void *TI) const { - sys::SmartScopedLock<true> Guard(*Lock); PassRegistryImpl *Impl = static_cast<PassRegistryImpl*>(getImpl()); PassRegistryImpl::MapType::const_iterator I = Impl->PassInfoMap.find(TI); return I != Impl->PassInfoMap.end() ? I->second : 0; } const PassInfo *PassRegistry::getPassInfo(StringRef Arg) const { - sys::SmartScopedLock<true> Guard(*Lock); PassRegistryImpl *Impl = static_cast<PassRegistryImpl*>(getImpl()); PassRegistryImpl::StringMapType::const_iterator I = Impl->PassInfoStringMap.find(Arg); @@ -103,7 +105,6 @@ const PassInfo *PassRegistry::getPassInfo(StringRef Arg) const { // void PassRegistry::registerPass(const PassInfo &PI, bool ShouldFree) { - sys::SmartScopedLock<true> Guard(*Lock); PassRegistryImpl *Impl = static_cast<PassRegistryImpl*>(getImpl()); bool Inserted Impl->PassInfoMap.insert(std::make_pair(PI.getTypeInfo(),&PI)).second; @@ -115,24 +116,22 @@ void PassRegistry::registerPass(const PassInfo &PI, bool ShouldFree) { for (std::vector<PassRegistrationListener*>::iterator I = Impl->Listeners.begin(), E = Impl->Listeners.end(); I != E; ++I) (*I)->passRegistered(&PI); - + if (ShouldFree) Impl->ToFree.push_back(&PI); } void PassRegistry::unregisterPass(const PassInfo &PI) { - sys::SmartScopedLock<true> Guard(*Lock); PassRegistryImpl *Impl = static_cast<PassRegistryImpl*>(getImpl()); - PassRegistryImpl::MapType::iterator I = + PassRegistryImpl::MapType::iterator I Impl->PassInfoMap.find(PI.getTypeInfo()); assert(I != Impl->PassInfoMap.end() && "Pass registered but not in map!"); - + // Remove pass from the map. Impl->PassInfoMap.erase(I); Impl->PassInfoStringMap.erase(PI.getPassArgument()); } void PassRegistry::enumerateWith(PassRegistrationListener *L) { - sys::SmartScopedLock<true> Guard(*Lock); PassRegistryImpl *Impl = static_cast<PassRegistryImpl*>(getImpl()); for (PassRegistryImpl::MapType::const_iterator I = Impl->PassInfoMap.begin(), E = Impl->PassInfoMap.end(); I != E; ++I) @@ -160,8 +159,6 @@ void PassRegistry::registerAnalysisGroup(const void *InterfaceID, assert(ImplementationInfo && "Must register pass before adding to AnalysisGroup!"); - sys::SmartScopedLock<true> Guard(*Lock); - // Make sure we keep track of the fact that the implementation implements // the interface. ImplementationInfo->addInterfaceImplemented(InterfaceInfo); @@ -180,20 +177,18 @@ void PassRegistry::registerAnalysisGroup(const void *InterfaceID, InterfaceInfo->setNormalCtor(ImplementationInfo->getNormalCtor()); } } - + PassRegistryImpl *Impl = static_cast<PassRegistryImpl*>(getImpl()); if (ShouldFree) Impl->ToFree.push_back(&Registeree); } void PassRegistry::addRegistrationListener(PassRegistrationListener *L) { - sys::SmartScopedLock<true> Guard(*Lock); PassRegistryImpl *Impl = static_cast<PassRegistryImpl*>(getImpl()); Impl->Listeners.push_back(L); } void PassRegistry::removeRegistrationListener(PassRegistrationListener *L) { - sys::SmartScopedLock<true> Guard(*Lock); - + // NOTE: This is necessary, because removeRegistrationListener() can be called // as part of the llvm_shutdown sequence. Since we have no control over the // order of that sequence, we need to gracefully handle the case where the -------------- next part -------------- *** Aborted at 1424812565 (unix time) try "date -d @1424812565" if you are using GNU date *** PC: @ 0x51c290 llvm::PMTopLevelManager::findAnalysisPass() *** SIGSEGV (@0x20) received by PID 50599 (TID 0x7f1fab00e700) from PID 32; stack trace: *** @ 0x7f20d332a0d0 (unknown) @ 0x51c290 llvm::PMTopLevelManager::findAnalysisPass() @ 0x51f508 llvm::PMDataManager::initializeAnalysisImpl() @ 0x5232fb llvm::FPPassManager::runOnFunction() @ 0x523426 llvm::FunctionPassManagerImpl::run() @ 0x52354d llvm::FunctionPassManager::run() @ 0x5407aa llvm::JIT::jitTheFunction() @ 0x540db8 llvm::JIT::runJITOnFunctionUnlocked() @ 0x540fbf llvm::JIT::getPointerToFunction() @ 0x424793 falcon::ClangCompiler::ClangModuleHandle::GetPointerToFunction() @ 0x40f92d _ZNSt6thread5_ImplISt12_Bind_simpleIFZN6falcon29CachingCompiler_PerfTest_Test8TestBodyEvEUlvE0_vEEE6_M_runEv @ 0x7f20d23ab784 execute_native_thread_routine @ 0x7f20d3322dc2 start_thread @ 0x7f20d1b413fd __clone -------------- next part -------------- A non-text attachment was scrubbed... Name: clang_compiler.cpp Type: text/x-c++src Size: 3554 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150224/47a0c9d5/attachment.cpp>
Nipun Sehrawat
2015-Feb-26 17:58 UTC
[LLVMdev] Removing contention in PassRegistry accesses to speed up compiles
We have root caused the segfault - it was due to a caching layer we have in our code, which is to avoid duplicate compilations. Basically, llvm::JIT::getPointerToFunction() looks up PassRegistry, but as our change introduces a separate PassRegistry for each thread, this means that the thread that calls llvm::JIT::getPointerToFunction() should have appropriate PassRegistry setup. In our setup, some threads were witnessing a cache hit on a code that was compiled by another thread, but when such a thread called llvm::JIT::getPointerToFunction(), it was getting a segfault as its PassRegistry was not setup. Any comments on the our change to the PassRegistry? Thanks, Nipun On Tue, Feb 24, 2015 at 1:50 PM, Nipun Sehrawat <nipun at thoughtspot.com> wrote:> Hi, > > We use LLVM libraries to compile C++ code and noticed slow downs when > multiple threads of a process were compiling at once. *perf *indicated > that most of the CPU time was spent in a spin lock, which was being > locked/unlocked from llvm::PassRegistry::getPassInfo(). > > We read the relevant LLVM code and found out that PassRegistry is a > ManagedStatic and is shared among all threads in case of a multi-threaded > setup. This sharing requires locking in PassRegistry's method, which > becomes source of the contention. To get rid of the contention, we made a > change to make PassRegistry thread-local and got rid of the locking. This > removed all the contention and we noticed a 2x speed up in single thread > compiles and 7x improvement when ten threads were compiling in parallel. > > Please find attached the diff for this change. We are using a old version > of LLVM code (svn revision 170375), so the code might be quite outdated. > > We have two questions: > 1. Does the change look reasonable? Or are we missing something here? > 2. When we run with 1000 threads compiling concurrently, we > deterministically run into a segfault in PassRegistry lookup. Any insights > into the segfault? > > > Please find attached the following files: > 1. *pass_registry.txt*: Git diff of our change. Note that it is against > LLVM svn revision 170375. > 2.* contention.txt*: Perf report with existing LLVM code - shows > contention in llvm::PassRegistry::getPassInfo() > 3. *no_contention.txt*: Perf report of LLVM built with our change. > 4. *segfault.txt*: Segfault we are encountering after our change. > 5. *clang_compile.cpp*: Snippet of code we use to compile code using LLVM. > > > Thanks a lot, > Nipun >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150226/b7aa95c6/attachment.html>
Sanjoy Das
2015-Feb-26 18:42 UTC
[LLVMdev] Removing contention in PassRegistry accesses to speed up compiles
Hi Nipun, The usual way to get feedback on patches is to upload them on phabricator ( http://llvm.org/docs/Phabricator.html ) and send them to llvm-commits. If you do the first part (upload patch to phabricator) correctly, phabricator will mail your patch to llvm-commits with an appropriate description automatically. -- Sanjoy On Thu, Feb 26, 2015 at 9:58 AM, Nipun Sehrawat <nipun at thoughtspot.com> wrote:> We have root caused the segfault - it was due to a caching layer we have in > our code, which is to avoid duplicate compilations. Basically, > llvm::JIT::getPointerToFunction() looks up PassRegistry, but as our change > introduces a separate PassRegistry for each thread, this means that the > thread that calls llvm::JIT::getPointerToFunction() should have appropriate > PassRegistry setup. In our setup, some threads were witnessing a cache hit > on a code that was compiled by another thread, but when such a thread called > llvm::JIT::getPointerToFunction(), it was getting a segfault as its > PassRegistry was not setup. > > Any comments on the our change to the PassRegistry? > > Thanks, > Nipun > > On Tue, Feb 24, 2015 at 1:50 PM, Nipun Sehrawat <nipun at thoughtspot.com> > wrote: >> >> Hi, >> >> We use LLVM libraries to compile C++ code and noticed slow downs when >> multiple threads of a process were compiling at once. perf indicated that >> most of the CPU time was spent in a spin lock, which was being >> locked/unlocked from llvm::PassRegistry::getPassInfo(). >> >> We read the relevant LLVM code and found out that PassRegistry is a >> ManagedStatic and is shared among all threads in case of a multi-threaded >> setup. This sharing requires locking in PassRegistry's method, which becomes >> source of the contention. To get rid of the contention, we made a change to >> make PassRegistry thread-local and got rid of the locking. This removed all >> the contention and we noticed a 2x speed up in single thread compiles and 7x >> improvement when ten threads were compiling in parallel. >> >> Please find attached the diff for this change. We are using a old version >> of LLVM code (svn revision 170375), so the code might be quite outdated. >> >> We have two questions: >> 1. Does the change look reasonable? Or are we missing something here? >> 2. When we run with 1000 threads compiling concurrently, we >> deterministically run into a segfault in PassRegistry lookup. Any insights >> into the segfault? >> >> >> Please find attached the following files: >> 1. pass_registry.txt: Git diff of our change. Note that it is against LLVM >> svn revision 170375. >> 2. contention.txt: Perf report with existing LLVM code - shows contention >> in llvm::PassRegistry::getPassInfo() >> 3. no_contention.txt: Perf report of LLVM built with our change. >> 4. segfault.txt: Segfault we are encountering after our change. >> 5. clang_compile.cpp: Snippet of code we use to compile code using LLVM. >> >> >> Thanks a lot, >> Nipun > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >