Thanks for the analysis, Kevin! This is a great jumping-off point for moving
forward.
Unfortunately, I'm not sure the PM/addPassesToEmitMC issue is as
straightforward as it seems. There does seem to be some duplicated work that
could be done just once, but I think eliminating it might involve some
restructuring at the TargetMachine level.
addPassesToEmitMC actually does a couple of things. It does, as the name
implies, add code generation passes to the PassManager. However, it also sets
up the MCObjectStreamer, which is necessarily module-specific and gets added to
the PassManager. So it seems to me that optimizing this would at least require
separating the pass creation from the object streamer creation. Whether
that's worth doing depends heavily on where the time is being spent inside
addPassesToEmitMC and whether it can be better optimized as it is currently
structured.
The topic of optimizing PM.run() came up at the BoF. That's a broader issue
than just MCJIT, so I imagine if you raise awareness of the problem there will
be a lot of interest in fixing it. Maybe send some profiling numbers to the
llvmdev list with a general subject line like "PassManager::run() has a lot
of overhead."
There is definitely some low-hanging fruit in
RuntimeDyldImpl::resolveRelocations() and
SectionMemoryManager::applyMemoryGroupPermissions().
In the case of RuntimeDyldImpl::resolveRelocations(), the vast majority of the
Sections being iterated over won't actually have any pending relocations
because we remove relocations from the lists as we apply them. If we just kept
a separate list of sections that contain pending relocations, and removed
sections from that list as their relocations are resolved, that should fix this
time sink.
In the case of SectionMemoryManager::applyMemoryGroupPermissions(), we would
only hit every section in the worst-case scenario, where each module is
compiled immediately after it is defined. Otherwise, the memory manager
combines sections into common memory groups whenever possible (though the
implementation may need some work). However, that still leaves a glaring issue:
this function is doing redundant work. Namely, it is reapplying permissions to
memory groups that in most cases already have the permissions it is setting.
Again, it should be trivial to manage separate data structures to track which
memory groups need permissions applied and which do not.
With regard to SectionMemoryManager, however, I feel I should mention that it is
only intended as a reference implementation to get people up and running, and it
is my expectation that many clients will want to implement their own memory
manager to fine-tune performance in accordance with their particular workload
characteristics. Even so, there's no reason we shouldn't fix obvious
problems in the reference implementation.
BTW, another bit of low-hanging fruit would be to turn off the module verifier.
The parameter to disable it (in the call to addPassesToEmitMC) is hard-coded to
'false' in MCJIT right now.
Now, having said all this, I need to tell you that based on my current
priorities I don't have time to take on any of this. However, I'd be
more than happy to review patches if someone else has time to do the work.
-Andy
From: Yaron Keren [mailto:yaron.keren at gmail.com]
Sent: Tuesday, November 19, 2013 3:54 AM
To: Kevin Modzelewski; Kaylor, Andrew
Cc: <llvmdev at cs.uiuc.edu>
Subject: Re: [LLVMdev] Some MCJIT benchmark numbers
The pass manager is re-created in emitObject on every call.
Andy, is that needed or can we create the PM in MCJIT constructor and keep it
around?
Yaron
2013/11/19 Kevin Modzelewski <kmod at dropbox.com>
So I finally took the plunge and switched to MCJIT (wasn't too bad, as long
as you remember to call InitializeNativeTargetDisassembler if you want
disassembly...), and I got the functionality to a point I was happy with so I
wanted to test perf of the system. I created a simple benchmark, and I thought
I'd share the results, both because I know I personally had no idea what the
results would be, and because it seems like there's some low-hanging fruit to
improve performance.
My JIT is currently structured as creating a new module per function it wants to
jit; I had experimented with using an approach where I had an "incubator
module" where all IR starts, and then on-demand extract it to
"compilation modules" when I want to send it to MCJIT, but my
experience was that this wasn't very helpful. (My goal was to enable
cross-function optimizations such as inlining, but there's no easy way [and
might not even make sense] to run module-level optimizations on a single
function.)
The benchmark I set up is a simple REPL loop, where the input is a pre-parsed
no-op statement. I put this in a loop and measured the amount of time it took,
and tested it at 1k iterations and 10k iterations. This includes my
IR-generation, but my expectation is that that should be negligible compared to
the MCJIT time (confirmed through profiling). The absolute numbers are from a
Release build with asserts turned off (this made a big difference), and the
percentages are from a Release+Profiling build.
For 1k iterations, the test took about 640ms on my desktop machine, i.e. 0.64ms
per module. Looking at the profiling results, it looks like about 47% of the
time is spent in PassManagerImpl::run, and another 47% is spent in
addPassesToEmitMC, which feels like it could be avoided by doing that just once.
Of the time spent in PassManagerImpl::run, about 35% is spent in PassManager
overhead such as initializeAnalysisImpl() / removeNotPreservedAnalysis() /
removeDeadPasses().
For 10k iterations, the test took about 12.6s, or 1.26ms per module, so
there's definitely some slowdown happening. Looking at the profiling
output, it looks like the main difference is the appearance of
MCJIT::finalizeLoadedModules(), which ultimately calls
RuntimeDyldImpl::resolveRelocations() and
SectionMemoryManager::applyMemoryGroupPermissions(), both of which iterate over
all memory sections, leading to quadratic overhead. I'm not sure how easy it
would be, but it seems like there could be single-module variants of these APIs
that could cut down on the overhead, since it looks like MCJIT knows which
modules need to be finalized but doesn't pass this information to the dyld /
memory manager.
My overall takeaway from these numbers is pretty good: they're good enough
for where my JIT is right now, and it seems like there's some
relatively-straightforward work that can be done to make them better. I'm
curious what other people think.
Kevin
_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu
http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev