thr3ads.net - llvm dev - [LLVMdev] GSoC 2011: Fast JIT Code Generation for x86-64 [Apr 2011]

Óscar Fuentes

2011-Apr-05 19:41 UTC

[LLVMdev] GSoC 2011: Fast JIT Code Generation for x86-64

Jim Grosbach <grosbach at apple.com> writes:
>> To me, increasing coverage of the FastISel seemed more involved than
>> directly emitting opcodes to memory, with a lesser outlook on
>> reducing overhead.
>
> That seems extremely unlikely. You'd be effectively re-implementing
> both fast-isel and the MC binary emitter layers, and it sounds like a
> new register allocator as well.
>
> What Eric is suggesting is instead locating which IR constructs are
> not being handled by fast-isel and are causing problems (i.e., are
> being frequently encountered in your code-base) and implementing
> fast-isel handling for them. That will remove the selectiondag
> overhead that you've identified as the primary compile-time problem.

At some point in the past someone was kind enough to add fast-isel for
some instructions frequently emitted by my compiler, hoping that that
would speed up JITting. The results were disappointing (negligible,
IIRC). Either fast-isel does not make much of a difference or the main
inefficiency is elsewhere.

Eric Christopher

2011-Apr-05 21:04 UTC

[LLVMdev] GSoC 2011: Fast JIT Code Generation for x86-64

On Apr 5, 2011, at 12:41 PM, Óscar Fuentes wrote:
> Jim Grosbach <grosbach at apple.com> writes:
> 
>>> To me, increasing coverage of the FastISel seemed more involved than
>>> directly emitting opcodes to memory, with a lesser outlook on
>>> reducing overhead.
>> 
>> That seems extremely unlikely. You'd be effectively re-implementing
>> both fast-isel and the MC binary emitter layers, and it sounds like a
>> new register allocator as well.
>> 
>> What Eric is suggesting is instead locating which IR constructs are
>> not being handled by fast-isel and are causing problems (i.e., are
>> being frequently encountered in your code-base) and implementing
>> fast-isel handling for them. That will remove the selectiondag
>> overhead that you've identified as the primary compile-time problem.
> 
> At some point in the past someone was kind enough to add fast-isel for
> some instructions frequently emitted by my compiler, hoping that that
> would speed up JITting. The results were disappointing (negligible,
> IIRC). Either fast-isel does not make much of a difference or the main
> inefficiency is elsewhere.

Bug number?

Seriously, if you haven't looked at how fast-isel works then you need to.
It is, in theory, almost no different from what he's planning on doing.
You may still be falling back into the selection DAG. If you're not, then
some investigation may be enlightening.

-eric

Tilmann Scheller

2011-Apr-05 21:49 UTC

[LLVMdev] GSoC 2011: Fast JIT Code Generation for x86-64

Hi Viktor,

On Tue, Apr 5, 2011 at 9:41 PM, Óscar Fuentes <ofv at wanadoo.es> wrote:
> Jim Grosbach <grosbach at apple.com> writes:
>
> >> To me, increasing coverage of the FastISel seemed more involved than
> >> directly emitting opcodes to memory, with a lesser outlook on
> >> reducing overhead.
> >
> > That seems extremely unlikely. You'd be effectively re-implementing
> > both fast-isel and the MC binary emitter layers, and it sounds like a
> > new register allocator as well.
> >
> > What Eric is suggesting is instead locating which IR constructs are
> > not being handled by fast-isel and are causing problems (i.e., are
> > being frequently encountered in your code-base) and implementing
> > fast-isel handling for them. That will remove the selectiondag
> > overhead that you've identified as the primary compile-time problem.
>
> At some point in the past someone was kind enough to add fast-isel for
> some instructions frequently emitted by my compiler, hoping that that
> would speed up JITting. The results were disappointing (negligible,
> IIRC). Either fast-isel does not make much of a difference or the main
> inefficiency is elsewhere.

fast-isel discussion aside, I think the real speed killer of a dynamic
binary translator (or other users of the JIT which invoke it many times on
small pieces of code) is the constant per-invocation cost of the JIT, which
is paid for every source ISA basic block (BB), since each BB gets mapped to
an LLVM Function.

[1] cites a constant overhead of 10 ms per BB. I just did some simple
measurements with callgrind doing an lli on a simple .ll file which only
contains a main function which immediately returns. With -regalloc=fast and
-fast-isel and an -O2 compiled lli we spend about 725000 instructions in
getPointerToFunction(). Clearly, that's quite some constant overhead and I
doubt that we can get it down by two orders of magnitude, so what about
this:

The old qemu JIT used an extremely simple and fast approach which performed
surprisingly well: Having chunks of precompiled machine code (from C
sources) for the individual IR instructions which at runtime get glued
together and patched as necessary.

The idea would be to use the same approach to generate machine code from
LLVM IR, e.g. having chunks of LLVM MC instructions for the individual LLVM
IR instructions (ideally describing the mapping with TableGen), glueing them
together doing no dynamic register allocation, no scheduling.
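
To make the "glue chunks together and patch them" idea concrete, here is a
minimal sketch in C++. It is not qemu's or LLVM's code: the three-opcode IR,
the `Template` struct and the `stitch` function are hypothetical names invented
for illustration, and the templates are hand-assembled x86-64 byte sequences
instead of chunks precompiled from C sources. Each IR instruction maps to a
fixed chunk; the only per-instruction work is a copy and, where needed,
patching a 32-bit immediate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy IR, for illustration only (not LLVM IR).
enum class Op { LoadImm, AddImm, Ret };

struct Inst {
    Op op;
    std::int32_t imm;  // ignored by Ret
};

// A pre-assembled machine-code chunk plus the offset of its 32-bit
// immediate "hole" (-1 if the chunk needs no patching).
struct Template {
    std::vector<std::uint8_t> bytes;
    int holeOffset;
};

const Template& templateFor(Op op) {
    static const Template loadImm{{0xB8, 0, 0, 0, 0}, 1};  // mov eax, imm32
    static const Template addImm{{0x05, 0, 0, 0, 0}, 1};   // add eax, imm32
    static const Template ret{{0xC3}, -1};                 // ret
    switch (op) {
        case Op::LoadImm: return loadImm;
        case Op::AddImm:  return addImm;
        case Op::Ret:     return ret;
    }
    assert(false && "unknown opcode");
    return ret;
}

// Copy one chunk per IR instruction and patch immediates in place.
// There is no register allocation and no scheduling: every chunk uses eax.
std::vector<std::uint8_t> stitch(const std::vector<Inst>& code) {
    std::vector<std::uint8_t> out;
    for (const Inst& inst : code) {
        const Template& t = templateFor(inst.op);
        const std::size_t base = out.size();
        out.insert(out.end(), t.bytes.begin(), t.bytes.end());
        if (t.holeOffset >= 0) {
            // x86 immediates are little-endian.
            const std::uint32_t v = static_cast<std::uint32_t>(inst.imm);
            for (int i = 0; i < 4; ++i) {
                out[base + static_cast<std::size_t>(t.holeOffset + i)] =
                    static_cast<std::uint8_t>(v >> (8 * i));
            }
        }
    }
    return out;
}
```

Stitching `LoadImm 40; AddImm 2; Ret` yields the 11 bytes for
`mov eax, 40; add eax, 2; ret`. Because every chunk hard-codes `eax`, nothing
has to be decided about registers at run time, which is the trade-off the
proposal accepts in exchange for speed.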

I'd be willing to mentor such a project, let me know if you're interested.

Regards,

Tilmann


[1] http://www.iaeng.org/publication/IMECS2011/IMECS2011_pp212-216.pdf

Eric Christopher

2011-Apr-05 22:13 UTC

[LLVMdev] GSoC 2011: Fast JIT Code Generation for x86-64

On Apr 5, 2011, at 2:49 PM, Tilmann Scheller wrote:
> The idea would be to use the same approach to generate machine code from
> LLVM IR, e.g. having chunks of LLVM MC instructions for the individual LLVM
> IR instructions (ideally describing the mapping with TableGen), glueing them
> together doing no dynamic register allocation, no scheduling.

*nod* If we were going to do that, I'd be up for replacing the scheme in
fast-isel with that. I just don't see how the general method is going to be
any different. Either you have to cover every bit of the IR, or you don't
and you punt to a scheme like the DAG which has to cover everything.

That said, the compilation itself could be way slow, as you mentioned for
binary translation. I'd be really curious where the overhead is. I'd
like to get that down for sure.

-eric

Florian Brandner

2011-Apr-06 07:50 UTC

[LLVMdev] GSoC 2011: Fast JIT Code Generation for x86-64

On Tue April 5 2011 23:49:21 Tilmann Scheller wrote:
> Hi Viktor,
> 
> On Tue, Apr 5, 2011 at 9:41 PM, Óscar Fuentes <ofv at wanadoo.es> wrote:
> 
> > Jim Grosbach <grosbach at apple.com> writes:
> >
> > >> To me, increasing coverage of the FastISel seemed more involved than
> > >> directly emitting opcodes to memory, with a lesser outlook on
> > >> reducing overhead.
> > >
> > > That seems extremely unlikely. You'd be effectively re-implementing
> > > both fast-isel and the MC binary emitter layers, and it sounds like a
> > > new register allocator as well.

it should be possible to leverage some of the existing infrastructure.
in particular, as Tilmann said, the MC binary emitter.

> > At some point in the past someone was kind enough to add fast-isel for
> > some instructions frequently emitted by my compiler, hoping that that
> > would speed up JITting. The results were disappointing (negligible,
> > IIRC). Either fast-isel does not make much of a difference or the main
> > inefficiency is elsewhere.

going through machine-level IR is certainly one of those inefficiencies.
we should try to do code generation in two passes: one over the IR, generating
binary code; the second fixing up relocations in the binary code. instruction
selection and register allocation should be performed in one go.
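
The two-pass scheme described above (emit code in one pass over the IR, then
fix up relocations) can be sketched as follows. This is an illustrative C++
toy, not LLVM code: the `Kind`, `Inst`, `Fixup` and `emit` names and the
four-instruction IR are assumptions made up for the example. Pass 1 emits
x86-64 bytes and records a fixup for every jump, since a forward target's
offset is not yet known; pass 2 patches the rel32 fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical toy IR, for illustration only.
enum class Kind { Label, Nop, Jump, Ret };

struct Inst {
    Kind kind;
    int label;  // used by Label and Jump, ignored otherwise
};

// A pending relocation: where the rel32 field sits and which label it targets.
struct Fixup {
    std::size_t holeOffset;
    int label;
};

std::vector<std::uint8_t> emit(const std::vector<Inst>& code) {
    std::vector<std::uint8_t> out;
    std::map<int, std::size_t> labelOffset;
    std::vector<Fixup> fixups;

    // Pass 1: walk the IR once, emitting bytes and recording fixups.
    for (const Inst& inst : code) {
        switch (inst.kind) {
            case Kind::Label:
                labelOffset[inst.label] = out.size();
                break;
            case Kind::Nop:
                out.push_back(0x90);  // nop
                break;
            case Kind::Ret:
                out.push_back(0xC3);  // ret
                break;
            case Kind::Jump:
                out.push_back(0xE9);  // jmp rel32, displacement patched later
                fixups.push_back({out.size(), inst.label});
                out.insert(out.end(), std::size_t{4}, std::uint8_t{0});
                break;
        }
    }

    // Pass 2: resolve every recorded jump against the now-complete label table.
    for (const Fixup& f : fixups) {
        auto it = labelOffset.find(f.label);
        assert(it != labelOffset.end() && "jump to undefined label");
        // rel32 is relative to the end of the jmp instruction.
        const std::int64_t rel = static_cast<std::int64_t>(it->second) -
                                 static_cast<std::int64_t>(f.holeOffset + 4);
        const std::uint32_t v =
            static_cast<std::uint32_t>(static_cast<std::int32_t>(rel));
        for (int i = 0; i < 4; ++i) {
            out[f.holeOffset + static_cast<std::size_t>(i)] =
                static_cast<std::uint8_t>(v >> (8 * i));
        }
    }
    return out;
}
```

For `Jump 1; Nop; Label 1; Ret` this produces `E9 01 00 00 00 90 C3`: the jump
skips the one-byte `nop`. The sketch leaves out instruction selection and
register allocation, which the message suggests doing in the same first pass.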

everyone in need of something more sophisticated can fall back to the
regular backend flow.

> fast-isel discussion aside, I think the real speed killer of a dynamic
> binary translator (or other users of the JIT which invoke it many times on
> small pieces of code) is the constant time of the JIT which is required for
> every source ISA BB (each BB gets mapped to an LLVM Function).
> 
> [1] cites a constant overhead of 10 ms per BB. I just did some simple
> measurements with callgrind doing an lli on a simple .ll file which only
> contains a main function which immediately returns. With -regalloc=fast and
> -fast-isel and an -O2 compiled lli we spend about 725000 instructions in
> getPointerToFunction(). Clearly, that's quite some constant overhead and I
> doubt that we can get it down by two orders of magnitude, so what about
> this:

i fully agree. i did some measurements (quite a while ago) and the backend
part always dominated, even when rather heavy high-level optimizations
were enabled.

in short, the JIT is not a JIT. it is merely a regular backend emitting
instructions directly to memory.
(that's neat, but does not deliver the compile time people hope for)

> The old qemu JIT used an extremely simple and fast approach which performed
> surprisingly well: Having chunks of precompiled machine code (from C
> sources) for the individual IR instructions which at runtime get glued
> together and patched as necessary.
> 
> The idea would be to use the same approach to generate machine code from
> LLVM IR, e.g. having chunks of LLVM MC instructions for the individual LLVM
> IR instructions (ideally describing the mapping with TableGen), glueing them
> together doing no dynamic register allocation, no scheduling.

that is the way to go for me.

i just see one large obstacle: how do we handle lowering?

lowering is important to realize unsupported operations, ABI conventions, ...
most of this is now done in C++ code that is highly dependent on the DAG.

i would suggest doing lowering on the linear LLVM IR, in a way that handles
all constructs needed to generate machine code directly from the IR. the DAG
based lowering could be completely eliminated (or stripped down to a form of
lowering/ABI/pre-isel optimization).

> I'd be willing to mentor such a project, let me know if you're interested.

i think this would be an important step to make the LLVM JIT more attractive.

bye,
Florian

-- 
Florian Brandner
Compilation and Embedded Computing Systems Group (COMPSYS)
Ecole Normale Superieure de Lyon (ENS Lyon)
Laboratoire de l'Informatique du Parallelisme (LIP)
46 Allee d'Italie, F-69364 Lyon Cedex 07, France 

phone: +33 4 72 72 83 52
email : florian.brandner at ens-lyon.fr
web   : http://perso.ens-lyon.fr/florian.brandner/

Viktor Pavlu

2011-Apr-06 14:47 UTC

[LLVMdev] GSoC 2011: Fast JIT Code Generation for x86-64

Thanks for all the replies!

I wanted to closely follow what the CACAO VM[1] backend has done
successfully for a long time: for every CACAO IR instruction, there is a
sequence of x86 instructions that gets written directly to executable
memory. In CACAO, registers are used while available, then everything is
spilled. Relocations are resolved and patched in a second go.
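
As a rough illustration of "registers are used while available, then
everything is spilled", here is a minimal C++ sketch. It is one simple reading
of that sentence and not CACAO's actual implementation: `Location`,
`assignLocations` and `operandFor` are hypothetical names, values are assumed
to be numbered in definition order, and the 8-byte `rbp`-relative slots are an
assumption chosen for x86-64.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Where a value lives: a register number or a stack slot index.
struct Location {
    bool inRegister;
    std::size_t index;
};

// The first numRegs values each get a register; every later value is
// spilled to its own stack slot. No liveness analysis, no reuse.
std::vector<Location> assignLocations(std::size_t numValues,
                                      std::size_t numRegs) {
    std::vector<Location> locs;
    locs.reserve(numValues);
    for (std::size_t v = 0; v < numValues; ++v) {
        if (v < numRegs)
            locs.push_back({true, v});
        else
            locs.push_back({false, v - numRegs});
    }
    return locs;
}

// Render the operand an emitter would use for a given value.
// Register names "r0", "r1", ... are placeholders, not real x86-64 names.
std::string operandFor(const std::vector<Location>& locs, std::size_t value) {
    assert(value < locs.size() && "value has no assigned location");
    const Location& loc = locs[value];
    if (loc.inRegister)
        return "r" + std::to_string(loc.index);
    return "[rbp-" + std::to_string(8 * (loc.index + 1)) + "]";
}
```

With five values and three registers, values 0 to 2 land in `r0` to `r2` and
values 3 and 4 go to `[rbp-8]` and `[rbp-16]`. The assignment is a single
linear scan with no backtracking, which is why it costs almost nothing at
compile time, at the price of code quality.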

It seems this is similar to what Tilmann refers to in the old qemu JIT:

On Tue, Apr 5, 2011 at 11:49 PM, Tilmann Scheller
<tilmann.scheller at googlemail.com> wrote:
> The old qemu JIT used an extremely simple and fast approach which performed
> surprisingly well: Having chunks of precompiled machine code (from C
> sources) for the individual IR instructions which at runtime get glued
> together and patched as necessary.
> The idea would be to use the same approach to generate machine code from
> LLVM IR, e.g. having chunks of LLVM MC instructions for the individual LLVM
> IR instructions (ideally describing the mapping with TableGen), glueing them
> together doing no dynamic register allocation, no scheduling.
> I'd be willing to mentor such a project, let me know if you're interested.

So yes, I would be interested.

Only recently is CACAO starting to get a register allocator to improve
quality of the generated code.
I wanted to include this in my first stab at the project but leaving
register allocation for future work or even for the regular backend is
fine with me, too.

- Viktor

[1]: CACAO VM
http://www.cacaovm.org/
