Hello everyone,

On behalf of GHC hackers, I would like to discuss the possibility of having a proper implementation of the tables-next-to-code optimisation in LLVM.

Currently, the object code produced by all three GHC backends follows the convention that the table with the metadata of a closure is located immediately before the code of the closure. This makes it possible to get to both the code and the metadata after a single pointer resolution operation. This, obviously, requires a certain ordering of data and text in the object code. Since LLVM does not make it possible to explicitly control the placement of data and code, the necessary ordering is currently achieved by injecting GNU Assembler subsections on platforms supported by GNU Assembler. The Mac assembler, however, does not support this feature, so the resulting object code is post-processed directly.

It would be desirable for LLVM to offer some control over the ordering of text and data sections in the object code. In particular, David Terei suggests one of the following [0]:

1. Add a special variable "llvm.order" of an array type. It would list function names and global pointers in the order in which they should appear in the resulting object code. This basically corresponds to the "-fno-toplevel-reorder" option in GCC. (A hypothetical IR sketch of this option follows this message.)

2. Add support for attaching a global variable to a function. In this way one could specify that a global variable should appear before the corresponding function in the object code.

An example of a possible implementation of (2) has recently been suggested by Gabor Greif. He proposes adding a "placebefore" attribute to global variables (or, similarly, a "placeafter" attribute for functions). The corresponding example is:

  @foo_D = common global %struct.Descr zeroinitializer, align 8, placebefore @foo

  define i32 @foo() nounwind uwtable readnone {
    ret i32 undef
  }

The question is whether the LLVM developers are willing to accept such modifications and, if yes, which approach to implementing them they would recommend.

Sergiu

[0] http://hackage.haskell.org/trac/ghc/ticket/4213
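[Editorial note: for concreteness, here is a minimal sketch of what option (1) above might look like at the IR level. It is purely hypothetical: @llvm.order does not exist in LLVM, and the variable name, the use of appending linkage, and the descriptor layout are all assumptions made for illustration.]

  ; Hypothetical sketch of option (1).  Nothing below is real LLVM IR
  ; machinery today; @llvm.order and its semantics are illustrative only.

  %struct.Descr = type { i64, i64 }

  ; The closure's info table, modelled as an ordinary global ...
  @foo_D = global %struct.Descr zeroinitializer, align 8

  ; ... and the closure's entry code, modelled as an ordinary function.
  define i32 @foo() nounwind uwtable readnone {
    ret i32 undef
  }

  ; A module-level array the backend would be required to honour when laying
  ; out the object file: @foo_D emitted immediately followed by @foo.
  @llvm.order = appending global [2 x i8*]
    [i8* bitcast (%struct.Descr* @foo_D to i8*),
     i8* bitcast (i32 ()* @foo to i8*)],
    section "llvm.metadata"

Compared with the "placebefore" attribute of option (2), this keeps the pairing information in one place, at the cost of an order-sensitive global that every IR transformation would have to keep consistent.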
On Feb 13, 2012, at 6:49 AM, Sergiu Ivanov wrote:

> On behalf of GHC hackers, I would like to discuss the possibility of
> having a proper implementation of the tables-next-to-code optimisation
> in LLVM.

It would be great to have this. However, the design will be tricky. Is there anything that spells out how the TNTC optimization works at the actual machine instruction level? It seems that there should be a blog post somewhere that shows the code with and without the optimization, but I can't find it offhand.

> This, obviously, requires a certain
> ordering of data and text in the object code. Since LLVM does not
> make it possible to explicitly control the placement of data and code,
> the necessary ordering is currently achieved by injecting GNU
> Assembler subsections on platforms supported by GNU Assembler. The Mac
> assembler, however, does not support this feature, so the resulting
> object code is post-processed directly.

It's interesting that you bring this up. It turns out that the mac toolchain (unless you disable subsections-via-symbols, a gross hack) does not give you the ability to control the ordering of blobs of code separated by global labels (aka 'atoms' in the linker's terminology). This is important because it enables link-time dead code elimination, profile-based code reordering, etc. My understanding is that ELF toolchains don't have something like this, but it would be unfortunate if TNTC fundamentally prevents something like this from working.

Beyond this, the proposed model has some other issues: code ordering only makes sense within a linker section, but modeling "the table" and "the code" as two different LLVM values (a global value and a function) would mean that the optimizer will be tempted to put them into different sections, do dead code elimination, etc.

> He proposes adding a "placebefore"
> attribute to global variables (or, similarly, a "placeafter" attribute
> for functions). The corresponding example is:

This is a non-starter for a few reasons, but that doesn't mean that there aren't other reasonable options. I'd really like to see the codegen that you guys are after to try to help come up with another suggestion that isn't a complete one-off hack for GHC. :)

One random question: have you considered placing the table *inside* of the function? If the prologue for the closure were effectively:

  Closure:
          jmp     .LAfterTable
          .word   ...
          .word   ...
  .LAfterTable:
          push    %rbp
          ...

then you can avoid a lot of problems. I realize that this is not going to be absolutely as fast as your current TNTC implementation, but processors are *really really* good at predicting unconditional branches, so the cost is probably minimal, and it is likely to be much, much faster than not having TNTC at all.

Getting even this to work will not be fully straightforward, but again I'd like to understand more of what you're looking for from codegen to understand what the constraints are.

-Chris
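[Editorial note: to make the trade-off in Chris's suggestion concrete, here is a rough sketch, in x86-64 AT&T syntax, of how the runtime would reach an info-table field from an info pointer under the two layouts. The register choice and the field offsets are illustrative assumptions, not GHC's actual ABI.]

          # Assume %rbx holds the info pointer loaded from a closure;
          # the offsets below are illustrative, not GHC's real layout.

          # Current TNTC layout: the table ends exactly where the code begins,
          # so table fields sit at fixed negative offsets from the info pointer.
          movq    -8(%rbx), %rax      # read an info-table field
          jmp     *%rbx               # enter the closure: code starts at %rbx

          # Jump-over-the-table layout: the label points at a short
          # unconditional jump and the table follows it, so fields sit at
          # small positive offsets (the exact offset depends on the jmp
          # encoding).
          movq    8(%rbx), %rax       # read a table field located past the jmp
          jmp     *%rbx               # enter the closure: the jmp runs first

Both layouts still need only the single pointer resolution Sergiu describes; the difference is the unconditional jmp executed on every closure entry, which Chris expects branch prediction to make nearly free.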
Hmm, writing a blog post about TNTC is beyond the time I have right now. Here is some high-level documentation of the layout of heap objects in GHC:

http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects#InfoTables

With TNTC enabled we generate code for closures of this form:

  .text
          .align  8
          .long   Main_main1_srt-(Main_main1_info)+0
          .long   0
          .quad   4294967299
          .quad   0
          .quad   270582939663
  .globl Main_main1_info
  .type Main_main1_info, @object
  Main_main1_info:
  .Lc1Df:
          leaq    -8(%rbp),%rax
          cmpq    %r15,%rax
          jb      .Lc1Dh
          [...]

  .data
  .globl Main_main1_closure
  .type Main_main1_closure, @object
  Main_main1_closure:
          .quad   Main_main1_info
          .quad   0

Without TNTC we instead generate code of this form:

  .text
  .globl Main_main1_entry
  .type Main_main1_entry, @function
  Main_main1_entry:
  .LFB15:
          leaq    -8(%rbp),%rax
          cmpq    %r15,%rax
          jb      .Lc1Dh
          [...]

  .data
  .globl Main_main1_info
          .align  8
  .type Main_main1_info, @object
  .size Main_main1_info, 40
  Main_main1_info:
          .quad   Main_main1_entry
          .quad   0
          .quad   270582939663
          .quad   4294967299
          .quad   Main_main1_srt

  .globl Main_main1_closure
          .align  16
  .type Main_main1_closure, @object
  .size Main_main1_closure, 16
  Main_main1_closure:
          .quad   Main_main1_info
          .quad   0

Hopefully that helps clarify what we are looking for if we replicate the current scheme exactly. Below follow some extracts from an email conversation I had with Simon Marlow (principal GHC developer), musing briefly about some other schemes for TNTC:

---
Simon Marlow:

"The runtime overhead is mostly from the extra indirect memory access. For closure entry this is less of an issue now that we have pointer tagging, but the extra memory access is still required for every return to a stack frame. The code size overhead comes from the extra word in every info table (most info tables are 2 words, and increase to 3 without TNTC), and also the small increase in code due to the extra memory access."

[...]

"So there are a few alternatives, all of which entail some kind of compromise. The one you suggest, having both an info pointer and a code pointer in every closure/stack frame, entails greater memory overheads. I suspect this would be worse than the current double-indirection scheme. The info tables are accessed relatively rarely during mutation, so it makes sense to look at schemes that prioritise the code pointer. For example, you could store the info tables in a separate area of memory entirely, and keep a hash table to map code addresses to info tables (this is how the C ABI keeps track of metadata associated with stack frames for C++ exceptions, I think, but I don't know much about how it works).

A slightly more cunning scheme would be to have each code fragment start with a dummy instruction containing the address of the info table, e.g.

  movl <infoptr>, %dummy

This should be faster than the double-indirection scheme, but is very architecture-dependent and probably involves some horrible hacks in the compiler (could it even be done with LLVM? Maybe with the mangler)."
---

Cheers,
David

On 14 February 2012 02:59, Chris Lattner <clattner at apple.com> wrote:

> On Feb 13, 2012, at 6:49 AM, Sergiu Ivanov wrote:
>> On behalf of GHC hackers, I would like to discuss the possibility of
>> having a proper implementation of the tables-next-to-code optimisation
>> in LLVM.
>
> It would be great to have this. However, the design will be tricky. Is there anything that spells out how the TNTC optimization works at the actual machine instruction level?
> [...]
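[Editorial note: a rough sketch of the "dummy instruction" scheme Simon mentions, again on x86-64. The use of movabsq, the choice of %r11 as a scratch register, and the label names are assumptions for illustration, not something GHC emits today.]

          .data
  Main_main1_itable:
          .quad   4294967299                  # illustrative info-table words
          .quad   0

          .text
          .globl  Main_main1_entry
  Main_main1_entry:
          movabsq $Main_main1_itable, %r11    # dead load: the 8-byte immediate
                                              # carries the info-table address
          leaq    -8(%rbp), %rax              # the real entry code follows
          cmpq    %r15, %rax
          ret                                 # stand-in for the real continuation

Recovering the table from a code address then means decoding the immediate at a small fixed offset from the entry label, at the price of one otherwise useless instruction in every prologue, which is presumably the architecture-dependence and the "horrible hacks" Simon alludes to.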
Hi Chris,

One remaining question here is: if the GHC team tries some of these alternative schemes and finds them unsatisfactory, what is the LLVM community's feeling about extending LLVM IR to support implementing TNTC directly? How do you envision this would look at the IR level, how much work do you think it would be, and, most importantly, do you feel LLVM would be willing to accept patches for it?

We simply post-process the assembly right now to get the desired code, and it works very well, but it means we can't use the integrated assembler, which annoys me.

Cheers,
David

On 14 February 2012 02:59, Chris Lattner <clattner at apple.com> wrote:

> On Feb 13, 2012, at 6:49 AM, Sergiu Ivanov wrote:
>> On behalf of GHC hackers, I would like to discuss the possibility of
>> having a proper implementation of the tables-next-to-code optimisation
>> in LLVM.
>
> It would be great to have this. However, the design will be tricky.
> [...]