Hello all,

Recently on this list (as of last month), Renato Golin of Linaro posted a thread entitled "Named register variables, GNU-style"[1]. This thread concerned the implementation of the GNU register variables feature for LLVM. I'd like to give some input on this, as a developer of the Glasgow Haskell Compiler, as we are a user of this feature. Furthermore, our use case is atypical - it is efficiency oriented, not hardware oriented (e.g. I believe the Linux APIC x86 subsystem uses them for hardware, as well as MIPS Linux, as mentioned). Bear with me on the details.

I'll say up front that our use case alone shouldn't sway major decisions, nor am I screaming for the feature - I can sleep at night. But I found there was a surprising lack of highlighted use cases, and perhaps in the future, if things change, these points can offer some insight.

The summary is this: we use this feature in our garbage collector to steal a register that is dedicated solely to thread-local storage for our multicore runtime system. This thread-local data structure is possibly the most performance-sensitive variable in the entire multicore system, to the point where we have spent significant time optimizing every read or write, load or spill that could affect it. Furthermore, the GC is tied to the threading system in several ways and is parallel itself - a loss in performance here directly equates to a large overall performance loss for every parallel, multicore program. The lack of this feature is now causing us significant problems, particularly on Mac OS X, as it now uses Clang by default.

You would think that, considering this variable is (p)thread local, we could just use a __thread variable, or pthread_{get,set}specific, to manage it. But on OS X, both of these equate to an absolutely huge performance loss, upwards of 25%, which is unacceptable, realistically speaking, but we've had to deal with it.

On Linux, the situation isn't so bad. The ABI allows a __thread variable to be stored at a direct offset from the %fs segment, meaning that a read/write is still very fast. In fact, __thread is preferable on i386 Linux: the pathetic number of registers means stealing one is a loss, not a win.

The situation is not so good on x86_64 OS X. Generally we would steal r13 on a 64-bit platform, but that's not allowed with Clang. Furthermore, the __thread implementation on OS X is terrible compared to Linux: while internally it uses %gs for a specific set of internal, predefined keys, and those same slots also back __thread and pthread_{get,set}specific, a read or write of a __thread variable does NOT translate to a direct read/write. It translates to an indirect call through %rdi.
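For context, the kind of declaration involved is a GNU global register variable. A minimal sketch of what such a declaration looks like follows - the names here are only illustrative of GHC's registerised build, not a verbatim excerpt of our headers:

struct StgRegTable;   /* per-thread runtime state; illustrative forward declaration */

#if defined(__x86_64__)
/* Pin the pointer to the current thread's register table in r13 for the
 * whole translation unit. GCC then keeps r13 out of the allocator's hands
 * here, so every read or write of BaseReg is a plain register access -
 * no loads, no spills, no calls. */
register struct StgRegTable *BaseReg __asm__("r13");
#endif

With GCC this compiles and r13 stays reserved; with Clang the declaration is rejected, which is exactly the problem described above.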
In other words, this code:

#include <stdio.h>
#include <stdlib.h>

__thread int foo;

int main(int ac, char* av[]) {
  if (ac < 2) foo = 10;
  else foo = atoi(av[1]);
  printf("foo = %d\n", foo);
  return 0;
}

translates to this on x86_64 Linux with Clang:

(gdb) disassemble main
Dump of assembler code for function main:
   0x00000000004005b0 <+0>:  push   %rax
   0x00000000004005b1 <+1>:  mov    %rsi,%rax
   0x00000000004005b4 <+4>:  cmp    $0x2,%edi
   0x00000000004005b7 <+7>:  mov    $0xa,%esi
   0x00000000004005bc <+12>: jl     0x4005d1 <main+33>
   0x00000000004005be <+14>: mov    0x8(%rax),%rdi
   0x00000000004005c2 <+18>: xor    %esi,%esi
   0x00000000004005c4 <+20>: mov    $0xa,%edx
   0x00000000004005c9 <+25>: callq  0x4004b0 <strtol@plt>
   0x00000000004005ce <+30>: mov    %rax,%rsi
   0x00000000004005d1 <+33>: mov    %esi,%fs:0xfffffffffffffffc
   0x00000000004005d9 <+41>: mov    $0x400694,%edi
   0x00000000004005de <+46>: xor    %eax,%eax
   0x00000000004005e0 <+48>: callq  0x400480 <printf@plt>
   0x00000000004005e5 <+53>: xor    %eax,%eax
   0x00000000004005e7 <+55>: pop    %rdx
   0x00000000004005e8 <+56>: retq

and it translates to this on x86_64 OS X with Clang:

(lldb) disassemble -m -n main
a.out`main
a.out[0x100000f20]: pushq  %rbp
a.out[0x100000f21]: movq   %rsp, %rbp
a.out[0x100000f24]: pushq  %rbx
a.out[0x100000f25]: pushq  %rax
a.out[0x100000f26]: movl   $0xa, %ebx
a.out[0x100000f2b]: cmpl   $0x2, %edi
a.out[0x100000f2e]: jl     0x100000f3b      ; main + 27
a.out[0x100000f30]: movq   0x8(%rsi), %rdi
a.out[0x100000f34]: callq  0x100000f60      ; symbol stub for: atoi
a.out[0x100000f39]: movl   %eax, %ebx
a.out[0x100000f3b]: leaq   0xde(%rip), %rdi ; foo
a.out[0x100000f42]: callq  *(%rdi)
a.out[0x100000f44]: movl   %ebx, (%rax)
a.out[0x100000f46]: leaq   0x43(%rip), %rdi ; "foo = %d\n"
a.out[0x100000f4d]: xorl   %eax, %eax
a.out[0x100000f4f]: movl   %ebx, %esi
a.out[0x100000f51]: callq  0x100000f66      ; symbol stub for: printf
a.out[0x100000f56]: xorl   %eax, %eax
a.out[0x100000f58]: addq   $0x8, %rsp
a.out[0x100000f5c]: popq   %rbx
a.out[0x100000f5d]: popq   %rbp
a.out[0x100000f5e]: ret

Note the indirect call through %rdi on OS X. Again, the performance difference between these two snippets cannot be overstated. And pthread_{get,set}specific do even worse because they're not inlined at all (remember, we're talking a 25-30% loss for all programs).

There are details on a bug of ours[2], where I have tracked and examined this issue for the past year or so. We are getting desperate to fix this for OS X users - to the point of inlining XNU internals to either use 'predefined keys' (e.g. OS X has special 'fast TLS' keys for WebKit on some versions) or inline the 'fast path' of pthread_getspecific to do a direct read/write.

We've tried many combinations of compiler settings and tweaks to try to minimize these effects in the past, but still, a register variable is essentially superior to every other solution we've found, especially on x86_64. Even passing the thread-local variable around directly as an argument to every single function is slower - because the function bodies are so large, a spill will inevitably occur somewhere, causing loads (or other spills) to interfere with a read/write later. Even combined with manually lowering/lifting reads/writes, it still results in minor losses and doesn't guarantee the compiler won't optimistically undo that. Not as bad as 30%, though - more like 5-7% last I checked. But that's still significant, still slower, it's far uglier for us to implement, and it penalizes Linux unfairly unless it gets even uglier.

So, that's the long and short of it. Now we get to LLVM's implementation.
First, obviously, this need precludes Renato's proposal that only non-allocatable registers be available.[3] We absolutely have to have GPRs available, and nothing else makes sense for our use case. Chandler was strongly against this sort of idea, and likely with good reason (I don't know anything about parameterizing the LLVM register set over a set of user-reserved registers, and I don't know anything about the designs - it sounds like madness to me, too). I have no input on logistics. But we do need it; otherwise this feature is totally useless to us.

Also, in the last set of discussions, Joerg Sonnenberger proposed[4] that these registers be reserved - possibly at the global (translation unit) level or the local (function body) level. We also require this - temporarily spilling GPRs would otherwise almost certainly result in the same sort of problem as using a function argument: they will always collide in ways we cannot control or predict. We *do* actually care about every single read, write, spill and load.

Renato replied that the need for this is just a workaround for an inefficient compiler - and he's right, it is. Otherwise, we wouldn't do it. :) And based on our observations, I'm sorry to say I don't think GCC or LLVM are going to magically eliminate that 5-7% loss we saw *consistently* any time soon. It's a realistic difference to eliminate with enough work - but those wins never come easy, I know, and our code base is large and complex. That's going to be a lot of work (but I know you're all smart enough for it).

Again, to recap: GHC alone is probably not a compelling enough use case by itself to support these two points of the design, which seem somewhat radical on review of the original threads. Our needs are atypical for sure. But I hope they serve as useful input while you consider the design space. And I apologize in advance if this is considered beating a dead horse.

Thanks.

[1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/071503.html
[2] https://ghc.haskell.org/trac/ghc/ticket/7602
[3] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/071561.html
[4] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/071620.html

--
Regards,
Austin - PGP: 4096R/0x91384671
On 19 April 2014 15:41, Austin Seipp <aseipp at pobox.com> wrote:
> You would think that, considering this variable is (p)thread local, we
> could just use a __thread variable, or pthread_{get,set}specific, to
> manage it. But on OS X, both of these equate to an absolutely huge
> performance loss, upwards of 25%, which is unacceptable, realistically
> speaking, but we've had to deal with it.

In practice, pthread_getspecific() on x86-64 on Mac OS X is just a very simple assembly routine:

  movq %gs:_PTHREAD_TSD_OFFSET(,%rdi,8),%rax
  ret

For Native Client on Mac x86-64, we check that pthread_getspecific() contains the code above, and we inline the %gs access into NaCl's runtime code (reading the value of _PTHREAD_TSD_OFFSET from pthread_getspecific()'s code). You can find the code for doing that here:

https://src.chromium.org/viewvc/native_client/trunk/src/native_client/src/trusted/service_runtime/arch/x86_64/nacl_tls_64.c?revision=11149

NaCl's reason for doing this is that NaCl needs to be able to read a thread-local variable in a context where there's no stack available for calling pthread_getspecific(). (We could pre-allocate a pool of stacks, allocate a stack from this pool with an atomic operation, and then call pthread_getspecific() on that stack. But that's a lot more complicated, and slower.)

This will of course break if OS X's implementation of pthread_getspecific() changes (other than to change _PTHREAD_TSD_OFFSET). Hopefully, if that ever happens, OS X will have already started providing better thread-local variables that can be accessed without calling a function, like what Linux/ELF and Windows provide. :-)

This is hacky, but it should be completely reliable as long as pthread_getspecific() matches the expected pattern, because the code for pthread_getspecific() is not going to change underneath you at runtime. You could use the same trick, and fall back to calling pthread_getspecific() if the code it contains doesn't match the pattern you expect.

Cheers,
Mark
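P.S. To spell the fallback idea out in C, here is a rough sketch of checking pthread_getspecific()'s code and inlining the %gs access, falling back to the real call otherwise. The byte pattern and helper names are assumptions for illustration only; the real, tested version of this lives in the nacl_tls_64.c file linked above.

#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: if pthread_getspecific() matches the expected
 * "movq %gs:disp32(,%rdi,8),%rax; ret" pattern, remember the
 * displacement and inline the %gs access; otherwise fall back. */
static intptr_t tsd_offset = -1;            /* -1 => use the fallback */

static void init_fast_tls(void) {
  /* Reading a function's bytes through a data pointer is a platform
   * assumption, but it is exactly what the pattern check requires. */
  const uint8_t *code = (const uint8_t *) pthread_getspecific;
  /* Assumed encoding: 65 48 8b 04 fd <disp32> c3 */
  static const uint8_t prefix[] = { 0x65, 0x48, 0x8b, 0x04, 0xfd };
  if (memcmp(code, prefix, sizeof prefix) == 0 && code[9] == 0xc3) {
    int32_t disp;
    memcpy(&disp, code + 5, sizeof disp);   /* _PTHREAD_TSD_OFFSET */
    tsd_offset = disp;
  }
}

static void *fast_getspecific(pthread_key_t key) {
  if (tsd_offset < 0)
    return pthread_getspecific(key);        /* pattern didn't match */
  void *result;
  /* Same effective address as the original routine: %gs:(offset + key*8). */
  __asm__ ("movq %%gs:(%1,%2,8), %0"
           : "=r" (result)
           : "r" (tsd_offset), "r" ((uintptr_t) key));
  return result;
}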
On 19 April 2014 19:41, Austin Seipp <aseipp at pobox.com> wrote:
> Recently on this list (as of last month), Renato Golin of Linaro
> posted a thread entitled "Named register variables, GNU-style"[1].

Hi Austin,

FYI, this is the (now outdated) first proposal on the non-allocatable registers: http://reviews.llvm.org/D3261

I read your email to the end and I understand why this is not good enough.

> Again, the performance difference between these two snippets cannot be
> overstated. And pthread_{get,set}specific do even worse because
> they're not inlined at all (remember, we're talking a 25-30% loss for
> all programs).

It's for problems like these that the named GPRs feature exists (not the stack register trick), but there are two issues that need to be solved, and I'm solving one at a time.

> First, obviously, this need precludes Renato's proposal that
> only non-allocatable registers be available.[3] We absolutely
> have to have GPRs available, and nothing else makes sense for our use
> case.

I believe we really should have GPRs in the named register scheme in the future, but there are other problems that need to be dealt with first, as Chandler exposed.

This is not flogging a dead horse; it's a feature that I believe is important, not because it's heavily used by many people, but because it's sparsely used by critical parts of very low-level software that needs the extra edge to give *all* dependent software a big performance boost. People writing high-level software should not use it (like inline asm, etc.), or they will suffer the consequences.

We need to do the following steps, in order:

1. Create the representation in IR (the intrinsics, metadata, etc.), and lower it on some back-ends without any special reservation. (My current work.)

2. Make it possible to add GPRs to the reserved list of the allocator at module scope (from those metadata nodes), and create some tests with edge cases (especially ABI-related registers) to make sure the code generation won't go crazy or pervert the ABI, with error/warning messages for the cases where it does.

3. Merge the code in behind a flag that enables it, and let it run for a few months/releases.

4. When the dust settles, turn it on by default.

I don't think enabling this feature by default will have any impact on current code, since if you don't use it, there's no difference. The worry is rather that code that used it (and will now be compiled with Clang/LLVM) will perform badly/wrong. Since this is an experimental feature, on very specific code, I think the problems we'll see will be manageable.

I'll get to step 1 next week, and we should start thinking about the GPRs issue right afterwards. Since that's not particularly important for me right now (the kernel doesn't need it), I may slow down a bit, so your help (and that of those who need it) will be highly appreciated.

I may be wrong, but from what I've seen of the reservation mechanism, it shouldn't be too hard to do it dynamically at module level. But I only want to start thinking about it when I finish step 1.

Makes sense?

cheers,
--renato
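P.S. To make the two cases concrete, in GNU-style C the difference looks roughly like this (illustrative declarations, not code from any of the projects mentioned): the first is what the initial non-allocatable proposal covers, the second is the allocatable-GPR case that needs the reservation work in step 2.

/* Non-allocatable register: the allocator never hands out the stack
 * pointer anyway, so lowering a read of it needs no reservation. */
register unsigned long current_sp __asm__("rsp");

/* Allocatable GPR: r13 is normally free for the allocator to use, so
 * the backend must reserve it (module-wide) for this to be safe. */
register unsigned long pinned_tls_base __asm__("r13");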
Richard Sandiford
2014-Apr-24 15:00 UTC
[LLVMdev] Named register variables GNU-style, deux
Thanks for the excellent write-up. Just wanted to clarify...

Austin Seipp <aseipp at pobox.com> writes:
> Recently on this list (as of last month), Renato Golin of Linaro
> posted a thread entitled "Named register variables, GNU-style"[1].
> This thread concerned the implementation of the GNU register variables
> feature for LLVM. I'd like to give some input on this, as a developer
> of the Glasgow Haskell Compiler, as we are a user of this feature.
> Furthermore, our use case is atypical - it is efficiency oriented, not
> hardware oriented (e.g. I believe the Linux APIC x86 subsystem uses
> them for hardware, as well as MIPS Linux, as mentioned).

The MIPS case sounds pretty similar to yours: it sets aside a specific GPR to hold thread-local information. The main difference is that MIPS Linux was in the lucky position of being able to use a non-allocatable GPR, since $gp ($28) is normally reserved for ABI features that Linux doesn't need.

Thanks,
Richard
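P.S. For anyone curious, the MIPS Linux arrangement looks roughly like this (paraphrased from memory of the kernel headers, so treat it as a sketch rather than an exact quote):

/* MIPS Linux dedicates $28 ($gp) to the current thread_info pointer,
 * so looking up per-thread state is a single register read. */
struct thread_info;

register struct thread_info *__current_thread_info __asm__("$28");

#define current_thread_info() __current_thread_info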