thr3ads.net - llvm dev - [LLVMdev] Odd code layout requirements for MCJIT [Nov 2014]

If this information is useful, please help other people find it:
Share via:

Brett Simmers

2014-Nov-19 04:21 UTC

[LLVMdev] Odd code layout requirements for MCJIT

I'm part of a team working on adding an llvm codegen backend to HHVM 
(PHP JIT, http://hhvm.com) using MCJIT. We have a code layout problem 
and I'm looking for opinions on good ways to solve it.

The short version is that the memory we emit code into is split into a 
few different areas, and we'd like a way to control which area each 
BasicBlock ends up in during codegen. I know this probably sounds pretty 
odd, so here's a much more detailed explanation:

Our translation unit is a "tracelet" which is typically a single basic
block of PHP code. We emit code one tracelet at a time as it's needed 
for execution, so the physical layout of the generated code looks like 
the execution path taken the first time the code is run (we're actually 
a little smarter than that now, but this description is good enough for 
the problem at hand).

We're under pretty severe icache/iTLB pressure, so we do whatever we can 
to keep the hot path as compact as possible. One of the ways we do this 
is by dividing our code cache into three fixed-size areas: main, cold, 
and frozen. Our current, non-llvm codegen backend has one area tag per 
basic block, and most tracelets we compile will span all three areas. 
This means that if we emit code for tracelets A and then B, A's main 
code will be followed immediately by B's main code (and the same for 
cold/frozen). It ends up looking something like this:

good layout: | A_main B_main <unused> | A_cold B_cold <unused> |
                A_frozen... |

Main is the primary execution path. Cold contains code that we expect to 
happen rarely. Conditional branches from main to cold are generally 
taken <5% of the time. Frozen contains code that we expect to almost 
never run, but must have for correctness (mostly for handling 
exceptions). Conditional branches from main to frozen are generally 
taken <0.01% of the time. The fact that we have three areas isn't really 
relevant to the problem itself; I think any solution should allow an 
arbitrary number of areas.

We've done some experiments where we collapse all three areas into one, 
so A's main code is followed by A's cold code, which is followed by
A's
frozen code, and so on for B and other tracelets, like this:

bad layout: | A_main A_cold A_frozen B_main B_cold B_frozen... |

This causes an unacceptable performance regression, so for our 
experimental llvm backend we need to come up with a solution.

We're using a custom calling convention to model each tracelet as a 
single llvm function. I know there are intrinsics in llvm that allow us 
to give blocks relative probabilities, and if I'm understanding things 
correctly that should allow us to achieve the bad layout above, since 
the code for a given llvm function will be emitted in one contiguous 
chunk, regardless of how the blocks are organized within it. I'd like to 
figure out a way to achieve the good layout.

The best idea I have so far is to annotate each BasicBlock with an area 
tag like we do in our existing codegen backend, and then teach the llvm 
codegen backend how to place each of those blocks in the appropriate 
part of our code cache. We have a custom subclass of RTDyldMemoryManager 
to get it emitting all code in the main area right now; I assume any 
solution would impact the API of this class to get at the different areas.

If we have to split the main/cold/frozen parts of each tracelet into 
different llvm functions we might be able to make that work but I'm 
skeptical. My main concern is that cold/frozen code nearly always uses a 
bunch of values from the main code path that branched to it, and if the 
overhead of getting from main to cold/frozen is more than a single 
jmp/jcc instruction (on x86) that's going to be a dealbreaker.

Does this sound doable, and is it something you'd be ok having in llvm? 
If we can come up with a good design my team is happy to do the actual 
work, though if anyone else is interested in doing it we certainly won't 
complain :).

Thanks!
Brett

Bruce Hoult

2014-Nov-19 09:29 UTC

head link

[LLVMdev] Odd code layout requirements for MCJIT

That actually sounds like a generally useful thing to do for any kind of
code where relative frequency of execution can be predicted.

You don't need to put those cold and frozen areas very far apart. Certainly
they should be within range of a normal conditional branch instruction!
They only need to be in different memory pages, which usually these days
means 4KB block. 386 can do 64KB jcc, so that should be fine.

Depending on how big your typical tracelets are you may want to make each
area several blocks in size, to avoid too marge a percentage unused at the
end of them, but probably not very many.

On Wed, Nov 19, 2014 at 5:21 PM, Brett Simmers <bsimmers at fb.com> wrote:
> I'm part of a team working on adding an llvm codegen backend to HHVM
(PHP
> JIT, http://hhvm.com) using MCJIT. We have a code layout problem and
I'm
> looking for opinions on good ways to solve it.
>
> The short version is that the memory we emit code into is split into a few
> different areas, and we'd like a way to control which area each
BasicBlock
> ends up in during codegen. I know this probably sounds pretty odd, so
> here's a much more detailed explanation:
>
> Our translation unit is a "tracelet" which is typically a single
basic
> block of PHP code. We emit code one tracelet at a time as it's needed
for
> execution, so the physical layout of the generated code looks like the
> execution path taken the first time the code is run (we're actually a
> little smarter than that now, but this description is good enough for the
> problem at hand).
>
> We're under pretty severe icache/iTLB pressure, so we do whatever we
can
> to keep the hot path as compact as possible. One of the ways we do this is
> by dividing our code cache into three fixed-size areas: main, cold, and
> frozen. Our current, non-llvm codegen backend has one area tag per basic
> block, and most tracelets we compile will span all three areas. This means
> that if we emit code for tracelets A and then B, A's main code will be
> followed immediately by B's main code (and the same for cold/frozen).
It
> ends up looking something like this:
>
> good layout: | A_main B_main <unused> | A_cold B_cold <unused>
|
>                A_frozen... |
>
> Main is the primary execution path. Cold contains code that we expect to
> happen rarely. Conditional branches from main to cold are generally taken
> <5% of the time. Frozen contains code that we expect to almost never
run,
> but must have for correctness (mostly for handling exceptions). Conditional
> branches from main to frozen are generally taken <0.01% of the time. The
> fact that we have three areas isn't really relevant to the problem
itself;
> I think any solution should allow an arbitrary number of areas.
>
> We've done some experiments where we collapse all three areas into one,
so
> A's main code is followed by A's cold code, which is followed by
A's frozen
> code, and so on for B and other tracelets, like this:
>
> bad layout: | A_main A_cold A_frozen B_main B_cold B_frozen... |
>
> This causes an unacceptable performance regression, so for our
> experimental llvm backend we need to come up with a solution.
>
> We're using a custom calling convention to model each tracelet as a
single
> llvm function. I know there are intrinsics in llvm that allow us to give
> blocks relative probabilities, and if I'm understanding things
correctly
> that should allow us to achieve the bad layout above, since the code for a
> given llvm function will be emitted in one contiguous chunk, regardless of
> how the blocks are organized within it. I'd like to figure out a way to
> achieve the good layout.
>
> The best idea I have so far is to annotate each BasicBlock with an area
> tag like we do in our existing codegen backend, and then teach the llvm
> codegen backend how to place each of those blocks in the appropriate part
> of our code cache. We have a custom subclass of RTDyldMemoryManager to get
> it emitting all code in the main area right now; I assume any solution
> would impact the API of this class to get at the different areas.
>
> If we have to split the main/cold/frozen parts of each tracelet into
> different llvm functions we might be able to make that work but I'm
> skeptical. My main concern is that cold/frozen code nearly always uses a
> bunch of values from the main code path that branched to it, and if the
> overhead of getting from main to cold/frozen is more than a single jmp/jcc
> instruction (on x86) that's going to be a dealbreaker.
>
> Does this sound doable, and is it something you'd be ok having in llvm?
If
> we can come up with a good design my team is happy to do the actual work,
> though if anyone else is interested in doing it we certainly won't
complain
> :).
>
> Thanks!
> Brett
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20141119/7ae235d8/attachment.html>

Reid Kleckner

2014-Nov-19 18:49 UTC

head link

[LLVMdev] Odd code layout requirements for MCJIT

I'm pretty sure LLVM does not currently have support for this kind of
hot-cold BB separation, but I think it would be a welcome addition. I think
the best way to add such support would be to move cold code to a separate
object file section. For x86 this might look like:

.text
foo:
  xor eax, eax
  ...
  jz Lcold_bb
Lreturn:
  ret

.text.cold
Lcold_bb:
  .. EH
  jmp Lreturn

This might require changing the way we emit relocations to local labels,
but it would be generally useful to others. GCC already does something like
this. Most of the work to make this work at the AOT object file level
should also be necessary to make this work with MCJIT, which can then be
taught to allocate .text.cold somewhere else.

---

We've also discussed outlining regions of similar cold code at the LLVM IR
level. This will definitely work today without many backend changes. You
should be able to apply the preserve_mostcc calling convention and add a
.text.cold section annotation and that should be enough to start making
MCJIT memory allocation changes.

On Tue, Nov 18, 2014 at 8:21 PM, Brett Simmers <bsimmers at fb.com> wrote:
> I'm part of a team working on adding an llvm codegen backend to HHVM
(PHP
> JIT, http://hhvm.com) using MCJIT. We have a code layout problem and
I'm
> looking for opinions on good ways to solve it.
>
> The short version is that the memory we emit code into is split into a few
> different areas, and we'd like a way to control which area each
BasicBlock
> ends up in during codegen. I know this probably sounds pretty odd, so
> here's a much more detailed explanation:
>
> Our translation unit is a "tracelet" which is typically a single
basic
> block of PHP code. We emit code one tracelet at a time as it's needed
for
> execution, so the physical layout of the generated code looks like the
> execution path taken the first time the code is run (we're actually a
> little smarter than that now, but this description is good enough for the
> problem at hand).
>
> We're under pretty severe icache/iTLB pressure, so we do whatever we
can
> to keep the hot path as compact as possible. One of the ways we do this is
> by dividing our code cache into three fixed-size areas: main, cold, and
> frozen. Our current, non-llvm codegen backend has one area tag per basic
> block, and most tracelets we compile will span all three areas. This means
> that if we emit code for tracelets A and then B, A's main code will be
> followed immediately by B's main code (and the same for cold/frozen).
It
> ends up looking something like this:
>
> good layout: | A_main B_main <unused> | A_cold B_cold <unused>
|
>                A_frozen... |
>
> Main is the primary execution path. Cold contains code that we expect to
> happen rarely. Conditional branches from main to cold are generally taken
> <5% of the time. Frozen contains code that we expect to almost never
run,
> but must have for correctness (mostly for handling exceptions). Conditional
> branches from main to frozen are generally taken <0.01% of the time. The
> fact that we have three areas isn't really relevant to the problem
itself;
> I think any solution should allow an arbitrary number of areas.
>
> We've done some experiments where we collapse all three areas into one,
so
> A's main code is followed by A's cold code, which is followed by
A's frozen
> code, and so on for B and other tracelets, like this:
>
> bad layout: | A_main A_cold A_frozen B_main B_cold B_frozen... |
>
> This causes an unacceptable performance regression, so for our
> experimental llvm backend we need to come up with a solution.
>
> We're using a custom calling convention to model each tracelet as a
single
> llvm function. I know there are intrinsics in llvm that allow us to give
> blocks relative probabilities, and if I'm understanding things
correctly
> that should allow us to achieve the bad layout above, since the code for a
> given llvm function will be emitted in one contiguous chunk, regardless of
> how the blocks are organized within it. I'd like to figure out a way to
> achieve the good layout.
>
> The best idea I have so far is to annotate each BasicBlock with an area
> tag like we do in our existing codegen backend, and then teach the llvm
> codegen backend how to place each of those blocks in the appropriate part
> of our code cache. We have a custom subclass of RTDyldMemoryManager to get
> it emitting all code in the main area right now; I assume any solution
> would impact the API of this class to get at the different areas.
>
> If we have to split the main/cold/frozen parts of each tracelet into
> different llvm functions we might be able to make that work but I'm
> skeptical. My main concern is that cold/frozen code nearly always uses a
> bunch of values from the main code path that branched to it, and if the
> overhead of getting from main to cold/frozen is more than a single jmp/jcc
> instruction (on x86) that's going to be a dealbreaker.
>
> Does this sound doable, and is it something you'd be ok having in llvm?
If
> we can come up with a good design my team is happy to do the actual work,
> though if anyone else is interested in doing it we certainly won't
complain
> :).
>
> Thanks!
> Brett
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20141119/2f22c91d/attachment.html>

Jim Grosbach

2014-Nov-19 19:00 UTC

head link

[LLVMdev] Odd code layout requirements for MCJIT

Separate sections will be pretty tough to pull off for non-X86, where
PC-relative branches have a much more restricted range.
> On Nov 19, 2014, at 10:49 AM, Reid Kleckner <rnk at google.com>
wrote:
> 
> I'm pretty sure LLVM does not currently have support for this kind of
hot-cold BB separation, but I think it would be a welcome addition. I think the
best way to add such support would be to move cold code to a separate object
file section. For x86 this might look like:
> 
> .text
> foo:
>   xor eax, eax
>   ...
>   jz Lcold_bb
> Lreturn:
>   ret
> 
> .text.cold
> Lcold_bb:
>   .. EH
>   jmp Lreturn
> 
> This might require changing the way we emit relocations to local labels,
but it would be generally useful to others. GCC already does something like
this. Most of the work to make this work at the AOT object file level should
also be necessary to make this work with MCJIT, which can then be taught to
allocate .text.cold somewhere else.
> 
> ---
> 
> We've also discussed outlining regions of similar cold code at the LLVM
IR level. This will definitely work today without many backend changes. You
should be able to apply the preserve_mostcc calling convention and add a
.text.cold section annotation and that should be enough to start making MCJIT
memory allocation changes.
> 
> On Tue, Nov 18, 2014 at 8:21 PM, Brett Simmers <bsimmers at fb.com
<mailto:bsimmers at fb.com>> wrote:
> I'm part of a team working on adding an llvm codegen backend to HHVM
(PHP JIT, http://hhvm.com <http://hhvm.com/>) using MCJIT. We have a code
layout problem and I'm looking for opinions on good ways to solve it.
> 
> The short version is that the memory we emit code into is split into a few
different areas, and we'd like a way to control which area each BasicBlock
ends up in during codegen. I know this probably sounds pretty odd, so here's
a much more detailed explanation:
> 
> Our translation unit is a "tracelet" which is typically a single
basic block of PHP code. We emit code one tracelet at a time as it's needed
for execution, so the physical layout of the generated code looks like the
execution path taken the first time the code is run (we're actually a little
smarter than that now, but this description is good enough for the problem at
hand).
> 
> We're under pretty severe icache/iTLB pressure, so we do whatever we
can to keep the hot path as compact as possible. One of the ways we do this is
by dividing our code cache into three fixed-size areas: main, cold, and frozen.
Our current, non-llvm codegen backend has one area tag per basic block, and most
tracelets we compile will span all three areas. This means that if we emit code
for tracelets A and then B, A's main code will be followed immediately by
B's main code (and the same for cold/frozen). It ends up looking something
like this:
> 
> good layout: | A_main B_main <unused> | A_cold B_cold <unused>
|
>                A_frozen... |
> 
> Main is the primary execution path. Cold contains code that we expect to
happen rarely. Conditional branches from main to cold are generally taken <5%
of the time. Frozen contains code that we expect to almost never run, but must
have for correctness (mostly for handling exceptions). Conditional branches from
main to frozen are generally taken <0.01% of the time. The fact that we have
three areas isn't really relevant to the problem itself; I think any
solution should allow an arbitrary number of areas.
> 
> We've done some experiments where we collapse all three areas into one,
so A's main code is followed by A's cold code, which is followed by
A's frozen code, and so on for B and other tracelets, like this:
> 
> bad layout: | A_main A_cold A_frozen B_main B_cold B_frozen... |
> 
> This causes an unacceptable performance regression, so for our experimental
llvm backend we need to come up with a solution.
> 
> We're using a custom calling convention to model each tracelet as a
single llvm function. I know there are intrinsics in llvm that allow us to give
blocks relative probabilities, and if I'm understanding things correctly
that should allow us to achieve the bad layout above, since the code for a given
llvm function will be emitted in one contiguous chunk, regardless of how the
blocks are organized within it. I'd like to figure out a way to achieve the
good layout.
> 
> The best idea I have so far is to annotate each BasicBlock with an area tag
like we do in our existing codegen backend, and then teach the llvm codegen
backend how to place each of those blocks in the appropriate part of our code
cache. We have a custom subclass of RTDyldMemoryManager to get it emitting all
code in the main area right now; I assume any solution would impact the API of
this class to get at the different areas.
> 
> If we have to split the main/cold/frozen parts of each tracelet into
different llvm functions we might be able to make that work but I'm
skeptical. My main concern is that cold/frozen code nearly always uses a bunch
of values from the main code path that branched to it, and if the overhead of
getting from main to cold/frozen is more than a single jmp/jcc instruction (on
x86) that's going to be a dealbreaker.
> 
> Does this sound doable, and is it something you'd be ok having in llvm?
If we can come up with a good design my team is happy to do the actual work,
though if anyone else is interested in doing it we certainly won't complain
:).
> 
> Thanks!
> Brett
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>        
http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/>
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20141119/786582fa/attachment.html>

Philip Reames

2014-Nov-20 18:23 UTC

head link

[LLVMdev] Odd code layout requirements for MCJIT

On 11/18/2014 08:21 PM, Brett Simmers wrote:> I'm part of a team working on adding an llvm codegen backend to HHVM 
> (PHP JIT, http://hhvm.com) using MCJIT. We have a code layout problem 
> and I'm looking for opinions on good ways to solve it.
>
> The short version is that the memory we emit code into is split into a 
> few different areas, and we'd like a way to control which area each 
> BasicBlock ends up in during codegen. I know this probably sounds 
> pretty odd, so here's a much more detailed explanation:
>
> Our translation unit is a "tracelet" which is typically a single
basic
> block of PHP code. We emit code one tracelet at a time as it's needed 
> for execution, so the physical layout of the generated code looks like 
> the execution path taken the first time the code is run (we're 
> actually a little smarter than that now, but this description is good 
> enough for the problem at hand).
>
> We're under pretty severe icache/iTLB pressure, so we do whatever we 
> can to keep the hot path as compact as possible. One of the ways we do 
> this is by dividing our code cache into three fixed-size areas: main, 
> cold, and frozen. Our current, non-llvm codegen backend has one area 
> tag per basic block, and most tracelets we compile will span all three 
> areas. This means that if we emit code for tracelets A and then B, A's 
> main code will be followed immediately by B's main code (and the same 
> for cold/frozen). It ends up looking something like this:
>
> good layout: | A_main B_main <unused> | A_cold B_cold <unused>
|
>                A_frozen... |
>
> Main is the primary execution path. Cold contains code that we expect 
> to happen rarely. Conditional branches from main to cold are generally 
> taken <5% of the time. Frozen contains code that we expect to almost 
> never run, but must have for correctness (mostly for handling 
> exceptions). Conditional branches from main to frozen are generally 
> taken <0.01% of the time. The fact that we have three areas isn't 
> really relevant to the problem itself; I think any solution should 
> allow an arbitrary number of areas.
>
> We've done some experiments where we collapse all three areas into 
> one, so A's main code is followed by A's cold code, which is
followed
> by A's frozen code, and so on for B and other tracelets, like this:
>
> bad layout: | A_main A_cold A_frozen B_main B_cold B_frozen... |
>
> This causes an unacceptable performance regression, so for our 
> experimental llvm backend we need to come up with a solution.
>
> We're using a custom calling convention to model each tracelet as a 
> single llvm function. I know there are intrinsics in llvm that allow 
> us to give blocks relative probabilities, and if I'm understanding 
> things correctly that should allow us to achieve the bad layout above, 
> since the code for a given llvm function will be emitted in one 
> contiguous chunk, regardless of how the blocks are organized within 
> it. I'd like to figure out a way to achieve the good layout.
>
> The best idea I have so far is to annotate each BasicBlock with an 
> area tag like we do in our existing codegen backend, and then teach 
> the llvm codegen backend how to place each of those blocks in the 
> appropriate part of our code cache. We have a custom subclass of 
> RTDyldMemoryManager to get it emitting all code in the main area right 
> now; I assume any solution would impact the API of this class to get 
> at the different areas.
>
> If we have to split the main/cold/frozen parts of each tracelet into 
> different llvm functions we might be able to make that work but I'm 
> skeptical. My main concern is that cold/frozen code nearly always uses 
> a bunch of values from the main code path that branched to it, and if 
> the overhead of getting from main to cold/frozen is more than a single 
> jmp/jcc instruction (on x86) that's going to be a dealbreaker.I suspect that using musttail calls with function outlining would work 
fairly well actually.  LLVM already has the ability to convert tail 
calls into jmps.  Functionally, I think this would work today. 
Performance wise might be more complicated.

Currently, I expect your conditional branch to colder code would end up 
being a conditional br followed by a unconditional jmp (tail call).  
This seems like an obvious candidate for a peephole optimization (if it 
doesn't already exist.)

The other area of concern would be argument passing.  Keeping all the 
arguments in registers across the jump (rather than spilling some of 
them per a calling convention) would likely be the hard part.  This is 
effectively an internal call with an unspecified calling convention.  If 
we were willing to do register allocation in the caller before the 
outlined callee, we could potentially rewrite the calling convention of 
the call pair.  Figuring out how to do that would be a generally useful 
optimization for internal only functions.

Even without the calling convention changes, I'd expect the overhead of 
the argument passing to be in a small cold block (in the hot code 
area).  This might be good enough.
>
> Does this sound doable, and is it something you'd be ok having in 
> llvm? If we can come up with a good design my team is happy to do the 
> actual work, though if anyone else is interested in doing it we 
> certainly won't complain :).As with others, I'd be interested in seeing something like this land.  
This isn't immediately relevant for me, but it definitely falls into the 
nice to have category down the road.  I'd lean towards the outlining 
scheme before introducing the complication in the backend, but am open 
to either approach.>
> Thanks!
> Brett
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Lang Hames

2014-Nov-20 22:44 UTC

head link

[LLVMdev] Odd code layout requirements for MCJIT

> Most of the work to make this work at the AOT object file level should
also be necessary to> make this work with MCJIT, which can then be taught to allocate.text.cold somewhere else.

MCJIT memory layout is delegated to the RTDyldMemoryManager which is
supplied by the client, so if/when the AOT work is done this should  just
work in MCJIT without any further changes.

- Lang.

On Wed, Nov 19, 2014 at 10:49 AM, Reid Kleckner <rnk at google.com> wrote:
> I'm pretty sure LLVM does not currently have support for this kind of
> hot-cold BB separation, but I think it would be a welcome addition. I think
> the best way to add such support would be to move cold code to a separate
> object file section. For x86 this might look like:
>
> .text
> foo:
>   xor eax, eax
>   ...
>   jz Lcold_bb
> Lreturn:
>   ret
>
> .text.cold
> Lcold_bb:
>   .. EH
>   jmp Lreturn
>
> This might require changing the way we emit relocations to local labels,
> but it would be generally useful to others. GCC already does something like
> this. Most of the work to make this work at the AOT object file level
> should also be necessary to make this work with MCJIT, which can then be
> taught to allocate .text.cold somewhere else.
>
> ---
>
> We've also discussed outlining regions of similar cold code at the LLVM
IR
> level. This will definitely work today without many backend changes. You
> should be able to apply the preserve_mostcc calling convention and add a
> .text.cold section annotation and that should be enough to start making
> MCJIT memory allocation changes.
>
> On Tue, Nov 18, 2014 at 8:21 PM, Brett Simmers <bsimmers at fb.com>
wrote:
>
>> I'm part of a team working on adding an llvm codegen backend to
HHVM (PHP
>> JIT, http://hhvm.com) using MCJIT. We have a code layout problem and
I'm
>> looking for opinions on good ways to solve it.
>>
>> The short version is that the memory we emit code into is split into a
>> few different areas, and we'd like a way to control which area each
>> BasicBlock ends up in during codegen. I know this probably sounds
pretty
>> odd, so here's a much more detailed explanation:
>>
>> Our translation unit is a "tracelet" which is typically a
single basic
>> block of PHP code. We emit code one tracelet at a time as it's
needed for
>> execution, so the physical layout of the generated code looks like the
>> execution path taken the first time the code is run (we're actually
a
>> little smarter than that now, but this description is good enough for
the
>> problem at hand).
>>
>> We're under pretty severe icache/iTLB pressure, so we do whatever
we can
>> to keep the hot path as compact as possible. One of the ways we do this
is
>> by dividing our code cache into three fixed-size areas: main, cold, and
>> frozen. Our current, non-llvm codegen backend has one area tag per
basic
>> block, and most tracelets we compile will span all three areas. This
means
>> that if we emit code for tracelets A and then B, A's main code will
be
>> followed immediately by B's main code (and the same for
cold/frozen). It
>> ends up looking something like this:
>>
>> good layout: | A_main B_main <unused> | A_cold B_cold
<unused> |
>>                A_frozen... |
>>
>> Main is the primary execution path. Cold contains code that we expect
to
>> happen rarely. Conditional branches from main to cold are generally
taken
>> <5% of the time. Frozen contains code that we expect to almost never
run,
>> but must have for correctness (mostly for handling exceptions).
Conditional
>> branches from main to frozen are generally taken <0.01% of the time.
The
>> fact that we have three areas isn't really relevant to the problem
itself;
>> I think any solution should allow an arbitrary number of areas.
>>
>> We've done some experiments where we collapse all three areas into
one,
>> so A's main code is followed by A's cold code, which is
followed by A's
>> frozen code, and so on for B and other tracelets, like this:
>>
>> bad layout: | A_main A_cold A_frozen B_main B_cold B_frozen... |
>>
>> This causes an unacceptable performance regression, so for our
>> experimental llvm backend we need to come up with a solution.
>>
>> We're using a custom calling convention to model each tracelet as a
>> single llvm function. I know there are intrinsics in llvm that allow us
to
>> give blocks relative probabilities, and if I'm understanding things
>> correctly that should allow us to achieve the bad layout above, since
the
>> code for a given llvm function will be emitted in one contiguous chunk,
>> regardless of how the blocks are organized within it. I'd like to
figure
>> out a way to achieve the good layout.
>>
>> The best idea I have so far is to annotate each BasicBlock with an area
>> tag like we do in our existing codegen backend, and then teach the llvm
>> codegen backend how to place each of those blocks in the appropriate
part
>> of our code cache. We have a custom subclass of RTDyldMemoryManager to
get
>> it emitting all code in the main area right now; I assume any solution
>> would impact the API of this class to get at the different areas.
>>
>> If we have to split the main/cold/frozen parts of each tracelet into
>> different llvm functions we might be able to make that work but I'm
>> skeptical. My main concern is that cold/frozen code nearly always uses
a
>> bunch of values from the main code path that branched to it, and if the
>> overhead of getting from main to cold/frozen is more than a single
jmp/jcc
>> instruction (on x86) that's going to be a dealbreaker.
>>
>> Does this sound doable, and is it something you'd be ok having in
llvm?
>> If we can come up with a good design my team is happy to do the actual
>> work, though if anyone else is interested in doing it we certainly
won't
>> complain :).
>>
>> Thanks!
>> Brett
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20141120/fb6ad911/attachment.html>

llvm dev - Nov 2014 - [LLVMdev] Odd code layout requirements for MCJIT

[LLVMdev] Odd code layout requirements for MCJIT

[LLVMdev] Odd code layout requirements for MCJIT

[LLVMdev] Odd code layout requirements for MCJIT

[LLVMdev] Odd code layout requirements for MCJIT

[LLVMdev] Odd code layout requirements for MCJIT

[LLVMdev] Odd code layout requirements for MCJIT