So I finally took the plunge and switched to MCJIT (it wasn't too bad, as long as you remember to call InitializeNativeTargetDisassembler if you want disassembly...), and once I got the functionality to a point I was happy with, I wanted to test the performance of the system. I created a simple benchmark and thought I'd share the results, both because I personally had no idea what the results would be, and because it seems like there's some low-hanging fruit to improve performance.

My JIT is currently structured as creating a new module per function it wants to JIT. I had experimented with an approach where I had an "incubator module" where all IR starts, and then extracted it on demand into "compilation modules" when I wanted to send it to MCJIT, but my experience was that this wasn't very helpful. (My goal was to enable cross-function optimizations such as inlining, but there's no easy way [and it might not even make sense] to run module-level optimizations on a single function.)

The benchmark I set up is a simple REPL loop, where the input is a pre-parsed no-op statement. I put this in a loop and measured the amount of time it took, at 1k iterations and at 10k iterations. This includes my IR generation, but profiling confirmed my expectation that this is negligible compared to the MCJIT time. The absolute numbers are from a Release build with asserts turned off (this made a big difference), and the percentages are from a Release+Profiling build.

For 1k iterations, the test took about 640ms on my desktop machine, i.e. 0.64ms per module. Looking at the profiling results, about 47% of the time is spent in PassManagerImpl::run, and another 47% is spent in addPassesToEmitMC, which feels like it could be avoided by doing that just once. Of the time spent in PassManagerImpl::run, about 35% is spent in PassManager overhead such as initializeAnalysisImpl() / removeNotPreservedAnalysis() / removeDeadPasses().
For 10k iterations, the test took about 12.6s, or 1.26ms per module, so there's definitely some slowdown happening. Looking at the profiling output, the main difference is the appearance of MCJIT::finalizeLoadedModules(), which ultimately calls RuntimeDyldImpl::resolveRelocations() and SectionMemoryManager::applyMemoryGroupPermissions(), both of which iterate over all memory sections, leading to quadratic overhead. I'm not sure how easy it would be, but it seems like there could be single-module variants of these APIs that would cut down on the overhead, since it looks like MCJIT knows which modules need to be finalized but doesn't pass this information to the dyld / memory manager.

My overall takeaway from these numbers is pretty good: they're good enough for where my JIT is right now, and it seems like there's some relatively straightforward work that can be done to make them better. I'm curious what other people think.

Kevin
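To make the "quadratic overhead" remark concrete, here is a rough cost model, not LLVM code. It assumes one section per module and that each finalize call walks every section loaded so far; the function names (visitsWithFullScan, visitsWithPerModuleFinalize, msPerModule) are invented for this illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cost model, not LLVM code. Assumption: one section per module,
// and module k is finalized while k sections are loaded.

// Each finalize call walks all sections loaded so far.
std::uint64_t visitsWithFullScan(std::uint64_t modules) {
  std::uint64_t visits = 0;
  for (std::uint64_t k = 1; k <= modules; ++k)
    visits += k; // the k-th finalize touches k sections
  return visits; // equals modules * (modules + 1) / 2
}

// Each finalize call walks only the newly added module's section.
std::uint64_t visitsWithPerModuleFinalize(std::uint64_t modules) {
  return modules;
}

// Average wall-clock time per module, in milliseconds.
double msPerModule(double totalMs, std::uint64_t modules) {
  assert(modules > 0 && "need at least one module");
  return totalMs / static_cast<double>(modules);
}
```

Under this model, going from 1k to 10k modules multiplies the full-scan section visits by roughly 100 (500,500 versus 50,005,000), while a per-module variant grows only 10x. The measured per-module time went from 0.64ms to 1.26ms; the model covers only the finalize portion, so it says nothing about the fixed per-module cost of the pass manager work.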
The pass manager is re-created in emitObject on every call. Andy, is that needed or can we create the PM in MCJIT constructor and keep it around?

Yaron
Thanks for the analysis, Kevin! This is a great jumping-off point for moving forward.

Unfortunately, I'm not sure the PM/addPassesToEmitMC issue is as straightforward as it seems. It does sound like there may be some duplicated effort that could be done just once, but I think it might involve some restructuring at the TargetMachine level.

addPassesToEmitMC actually does a couple of things. It does, as the name implies, add code generation passes to the PassManager. However, it also sets up the MCObjectStreamer, which is necessarily module-specific and gets added to the PassManager. So it seems to me that optimizing this would at least require separating the pass creation from the object streamer creation. Whether that's worth doing depends heavily on where the time is being spent inside addPassesToEmitMC and whether or not it can be better optimized as it is currently structured.

The topic of optimizing PM.run() came up at the BoF. That's a broader issue than just MCJIT, so I imagine if you raise awareness of the problem there will be a lot of interest in fixing it. Maybe send some profiling numbers to the llvmdev list with a general subject line like "PassManager::run() has a lot of overhead."

There is definitely some low-hanging fruit in RuntimeDyldImpl::resolveRelocations() and SectionMemoryManager::applyMemoryGroupPermissions().

In the case of RuntimeDyldImpl::resolveRelocations(), the vast majority of the sections being iterated over won't actually have any pending relocations, because we remove relocations from the lists as we apply them. If we just kept a separate list of sections that contain pending relocations, and removed sections from that list as appropriate, it should fix that time sink.

In the case of SectionMemoryManager::applyMemoryGroupPermissions(), we would only be hitting each section in the worst-case scenario where each module is immediately compiled after it is defined.
Otherwise, the memory manager combines sections into common memory groups whenever possible (though the implementation may need some work). However, that still leaves the glaring issue that this function is doing redundant work: it is reapplying permissions to memory groups that in most cases already have the permissions it is setting. Again, it should be trivial to manage separate data structures to track which memory groups need permissions applied and which do not.

With regard to SectionMemoryManager, however, I feel I should mention that it is only intended as a reference implementation to get people up and running, and it is my expectation that many clients will want to implement their own memory manager to fine-tune performance in accordance with their particular workload characteristics. Even so, there's no reason we shouldn't fix obvious problems in the reference implementation.

BTW, another bit of low-hanging fruit would be to turn off the module verifier. The parameter to disable it (in the call to addPassesToEmitMC) is hard-coded to 'false' in MCJIT right now.

Now, having said all this, I need to tell you that based on my current priorities I don't have time to take on any of this. However, I'd be more than happy to review patches if someone else has time to do the work.

-Andy

From: Yaron Keren [mailto:yaron.keren at gmail.com]
Sent: Tuesday, November 19, 2013 3:54 AM
To: Kevin Modzelewski; Kaylor, Andrew
Cc: <llvmdev at cs.uiuc.edu>
Subject: Re: [LLVMdev] Some MCJIT benchmark numbers

The pass manager is re-created in emitObject on every call. Andy, is that needed or can we create the PM in MCJIT constructor and keep it around?
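Andy's suggestion for RuntimeDyldImpl::resolveRelocations(), keeping a separate list of sections that still have pending relocations, could look roughly like the sketch below. This is an illustration only: RelocationTracker, Relocation, and every member name are hypothetical stand-ins and do not correspond to the actual RuntimeDyld data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins; these are not the RuntimeDyld data structures.
struct Relocation {
  std::size_t offset;
};

class RelocationTracker {
public:
  // Registers a new section and returns its ID.
  std::size_t addSection() {
    relocs_.emplace_back();
    return relocs_.size() - 1;
  }

  // Queues a relocation and marks its section as pending.
  void addRelocation(std::size_t section, Relocation r) {
    assert(section < relocs_.size() && "unknown section");
    relocs_[section].push_back(r);
    pending_.insert(section);
  }

  // Applies queued relocations, visiting only sections marked pending.
  // Returns the number of sections visited.
  std::size_t resolveRelocations() {
    std::size_t visited = 0;
    for (std::size_t section : pending_) {
      ++visited;
      applied_ += relocs_[section].size(); // stand-in for patching the code
      relocs_[section].clear();
    }
    pending_.clear(); // nothing is pending once everything is applied
    return visited;
  }

  std::size_t appliedCount() const { return applied_; }
  std::size_t sectionCount() const { return relocs_.size(); }

private:
  std::vector<std::vector<Relocation>> relocs_; // per-section queues
  std::unordered_set<std::size_t> pending_;     // sections with queued work
  std::size_t applied_ = 0;
};
```

With 1,000 already-resolved sections loaded, adding one more section with two relocations makes resolveRelocations() visit a single section instead of 1,001. The same idea, a small set of "dirty" entries, would presumably apply to tracking which memory groups still need permissions applied, though that is not shown here.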