I haven't read the all the emails in full detail, but it seems pretty clear that __x86_indirect_thunk and __llvm_retpoline_push do not do the same things. It sounds like __llvm_retpoline_push is equivalent to __x86_indirect_thunk except first it swaps the two words on the top of the stack. I arranged it this way because the x86 call instruction puts the intended return address on the top of the stack, and there's no easy way to put it anywhere else. We use this thunk when we want to make an indirect call and there are no available scratch registers, i.e. 32-bit -mregparm=3 and the call has three or more arguments, which happens in Linux. One way to avoid this would be to compile with -mregparm=2, but that would pessimize direct calls unnecessarily. It sounds like Linux is already providing its own thunks, so it might be better to for us to go back to the __llvm_retpoline_push external thunk name and then get Linux to provide that thunk ifdef __clang__. On Fri, Feb 9, 2018 at 7:37 AM, David Woodhouse via llvm-dev < llvm-dev at lists.llvm.org> wrote:> On Fri, 2018-02-09 at 12:54 +0000, David Woodhouse wrote: > > On Fri, 2018-02-09 at 10:36 +0000, David Woodhouse wrote: > > > > > > Did you get anywhere with the function attribute? Having isolated the > > > next boot failure to "it goes away if I compile io_apic.c without > > > retpoline", bisecting it per-function would help to further delay the > > > bit where I actually have to start *thinking*... > > > > It's mp_register_ioapic(), and only when io_apic_unique_id() gets > > inlined, at which point bad things start happening. > > > > [ 0.000000] mp_register_ioapic, 0 fec00000 0 c1b2fe88 > > [ 0.000000] At line 412, gsi_base is 0 > > [ 0.000000] At line 425, gsi_base is -1043715332 > > [ 0.000000] At line 427, gsi_base is -1043715332 > > > > http://git.infradead.org/users/dwmw2/linux-retpoline.git/shortlog/ref > > s/heads/clang > > http://david.woodhou.se/clang32.config > > > > > > http://david.woodhou.se/io_apic_b.i > > http://david.woodhou.se/io_apic_b.noretpoline.s > > http://david.woodhou.se/io_apic_b.retpoline.s > > > > I don't *think* I screwed up copying and pasting the retpoline thunk. > > So, looking at the retpoline version... > > gsi_base is in %edi, and gets spilled to the stack at about .Ltmp22 > which is at line 412 right after the printk call: > > .Ltmp22: > addl $12, %esp > movl %edi, 12(%esp) # 4-byte Spill > > At .Ltmp28 we then call __x86_indirect_thunk which *does* look like > it's doing the right thing (and using the LLVM-emitted thunk instead of > my own behaves the same; I don't think it's my copy-paste at fault). > > At .Ltmp29 we call bad_ioapic_register() and then when returns zero (it > does) we je to .LBB0_10 aka .Ltmp34. At which point we happily stomp on > the value of 12(%esp) by spilling something *else* over it: > > .LBB0_10: # %if.end28 > .Ltmp34: > #DEBUG_VALUE: mp_register_ioapic:address <- $esi > #DEBUG_VALUE: mp_register_ioapic:cfg <- [DW_OP_plus_uconst 8] > [$ebp+0] > .loc 2 0 1 is_stmt 0 # arch/x86/kernel/apic/io_apic_ > b.c:0:1 > movl %ebx, 12(%esp) # 4-byte Spill > movl %edi, 20(%esp) # 4-byte Spill > > Compare with the noretpoline.s version where those offsets from %esp > are different: > > .LBB0_10: # %if.end28 > .Ltmp33: > #DEBUG_VALUE: mp_register_ioapic:gsi_base <- [DW_OP_plus_uconst > 12] [$esp+0] > #DEBUG_VALUE: mp_register_ioapic:address <- $esi > #DEBUG_VALUE: mp_register_ioapic:cfg <- [DW_OP_plus_uconst 8] > [$ebp+0] > .loc 2 0 1 is_stmt 0 # arch/x86/kernel/apic/io_apic_ > b.c:0:1 > movl %ebx, 8(%esp) # 4-byte Spill > movl %edi, 16(%esp) # 4-byte Spill > > > So... obviously the main preceding difference here is that the > retpoline version did this for an indirect call: > > movl %esi, %edx > pushl pv_mmu_ops+132 > .Ltmp28: > calll __x86_indirect_thunk > > ... while the noretpoline version did this: > > movl %esi, %edx > calll *pv_mmu_ops+132 > .Ltmp28: > > Which leaves me suspecting that when we push the target onto the stack > for a retpoline call, that affects the %esp-based offsets of everything > that's been spilled, of course... and perhaps we're not correctly > adjusting those offsets *back* again by accounting for the fact that > calling the thunk actually moves the stack pointer back up again? > > Breakpoint 1, mp_register_ioapic (id=0, address=4273995776, gsi_base=0, > cfg=0xc1b2fe88 <init_thread_union+7816>) at arch/x86/kernel/apic/io_apic_ > b.c:389 > 389 bool hotplug = !!ioapic_initialized; > 1: x/i $pc > => 0xc10469c7 <mp_register_ioapic+23>: mov 0xc1d36170,%eax > 2: gsi_base = 0 > (gdb) ni > 393 pr_warn("%s, %d %x %x %px\n", __func__, id, address, > gsi_base, cfg); > 1: x/i $pc > => 0xc10469cc <mp_register_ioapic+28>: mov %eax,0x18(%esp) > 2: gsi_base = 0 > (gdb) > 0xc10469d0 393 pr_warn("%s, %d %x %x %px\n", __func__, > id, address, gsi_base, cfg); > 1: x/i $pc > => 0xc10469d0 <mp_register_ioapic+32>: pushl 0x8(%ebp) > 2: gsi_base = 0 > (gdb) > 0xc10469d3 393 pr_warn("%s, %d %x %x %px\n", __func__, > id, address, gsi_base, cfg); > 1: x/i $pc > => 0xc10469d3 <mp_register_ioapic+35>: push %ecx > 2: gsi_base = 0 > (gdb) > 0xc10469d4 393 pr_warn("%s, %d %x %x %px\n", __func__, > id, address, gsi_base, cfg); > 1: x/i $pc > => 0xc10469d4 <mp_register_ioapic+36>: push %edx > 2: gsi_base = 0 > (gdb) > 0xc10469d5 393 pr_warn("%s, %d %x %x %px\n", __func__, > id, address, gsi_base, cfg); > 1: x/i $pc > => 0xc10469d5 <mp_register_ioapic+37>: push %ebx > 2: gsi_base = 0 > (gdb) > 0xc10469d6 393 pr_warn("%s, %d %x %x %px\n", __func__, > id, address, gsi_base, cfg); > 1: x/i $pc > => 0xc10469d6 <mp_register_ioapic+38>: push $0xc1a3f457 > 2: gsi_base = 0 > (gdb) > 0xc10469db 393 pr_warn("%s, %d %x %x %px\n", __func__, > id, address, gsi_base, cfg); > 1: x/i $pc > => 0xc10469db <mp_register_ioapic+43>: push $0xc1a3f443 > 2: gsi_base = 0 > (gdb) > 0xc10469e0 393 pr_warn("%s, %d %x %x %px\n", __func__, > id, address, gsi_base, cfg); > 1: x/i $pc > => 0xc10469e0 <mp_register_ioapic+48>: call 0xc10aa840 <printk> > 2: gsi_base = 0 > (gdb) > 0xc10469e5 393 pr_warn("%s, %d %x %x %px\n", __func__, > id, address, gsi_base, cfg); > 1: x/i $pc > => 0xc10469e5 <mp_register_ioapic+53>: add $0x18,%esp > 2: gsi_base = 0 > (gdb) > 395 if (!address) { > 1: x/i $pc > => 0xc10469e8 <mp_register_ioapic+56>: test %esi,%esi > 2: gsi_base = 0 > (gdb) > 0xc10469ea 395 if (!address) { > 1: x/i $pc > => 0xc10469ea <mp_register_ioapic+58>: je 0xc1046a34 > <mp_register_ioapic+132> > 2: gsi_base = 0 > (gdb) > 399 for_each_ioapic(ioapic) > 1: x/i $pc > => 0xc10469ec <mp_register_ioapic+60>: mov 0xc1d36144,%eax > 2: gsi_base = 0 > (gdb) > 0xc10469f1 399 for_each_ioapic(ioapic) > 1: x/i $pc > => 0xc10469f1 <mp_register_ioapic+65>: test %eax,%eax > 2: gsi_base = 0 > (gdb) > 0xc10469f3 399 for_each_ioapic(ioapic) > 1: x/i $pc > => 0xc10469f3 <mp_register_ioapic+67>: jle 0xc1046a10 > <mp_register_ioapic+96> > 2: gsi_base = 0 > (gdb) > 406 idx = find_free_ioapic_entry(); > 1: x/i $pc > => 0xc1046a10 <mp_register_ioapic+96>: call 0xc1045670 > <find_free_ioapic_entry> > 2: gsi_base = 0 > (gdb) > 407 if (idx >= MAX_IO_APICS) { > 1: x/i $pc > => 0xc1046a15 <mp_register_ioapic+101>: cmp $0x40,%eax > 2: gsi_base = 0 > (gdb) > 0xc1046a18 407 if (idx >= MAX_IO_APICS) { > 1: x/i $pc > => 0xc1046a18 <mp_register_ioapic+104>: jl 0xc1046a4b > <mp_register_ioapic+155> > 2: gsi_base = 0 > (gdb) > 0xc1046a4b 396 pr_warn("Bogus (zero) I/O APIC > address found, skipping!\n"); > 1: x/i $pc > => 0xc1046a4b <mp_register_ioapic+155>: mov %ebx,0x4(%esp) > 2: gsi_base = 0 > (gdb) > 412 pr_warn("At line %d, gsi_base is %d\n", __LINE__,gsi_base); > 1: x/i $pc > => 0xc1046a4f <mp_register_ioapic+159>: push %edi > 2: gsi_base = 0 > (gdb) > 0xc1046a50 412 pr_warn("At line %d, gsi_base is %d\n", > __LINE__,gsi_base); > 1: x/i $pc > => 0xc1046a50 <mp_register_ioapic+160>: push $0x19c > 2: gsi_base = 0 > (gdb) > 0xc1046a55 412 pr_warn("At line %d, gsi_base is %d\n", > __LINE__,gsi_base); > 1: x/i $pc > => 0xc1046a55 <mp_register_ioapic+165>: push $0xc1a3f4fd > 2: gsi_base = 0 > (gdb) > 0xc1046a5a 412 pr_warn("At line %d, gsi_base is %d\n", > __LINE__,gsi_base); > 1: x/i $pc > => 0xc1046a5a <mp_register_ioapic+170>: mov %eax,%ebx > 2: gsi_base = 0 > (gdb) > 0xc1046a5c 412 pr_warn("At line %d, gsi_base is %d\n", > __LINE__,gsi_base); > 1: x/i $pc > => 0xc1046a5c <mp_register_ioapic+172>: call 0xc10aa840 <printk> > 2: gsi_base = 0 > (gdb) > 0xc1046a61 412 pr_warn("At line %d, gsi_base is %d\n", > __LINE__,gsi_base); > 1: x/i $pc > => 0xc1046a61 <mp_register_ioapic+177>: add $0xc,%esp > 2: gsi_base = 0 > (gdb) > 0xc1046a64 412 pr_warn("At line %d, gsi_base is %d\n", > __LINE__,gsi_base); > 1: x/i $pc > => 0xc1046a64 <mp_register_ioapic+180>: mov %edi,0xc(%esp) > 2: gsi_base = 0 > (gdb) p/x 0xc+$esp > $1 = 0xc1b2fe34 > (gdb) disp *(int *)0xc1b2fe34 > 3: *(int *)0xc1b2fe34 = 120 > (gdb) ni > 414 ioapics[idx].mp_config.type = MP_IOAPIC; > 1: x/i $pc > => 0xc1046a68 <mp_register_ioapic+184>: imul $0x2c,%ebx,%edi > 2: gsi_base = 0 > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 0xc1046a6b 414 ioapics[idx].mp_config.type = MP_IOAPIC; > 1: x/i $pc > => 0xc1046a6b <mp_register_ioapic+187>: movb $0x2,-0x3e2cb1bc(%edi) > 2: gsi_base = 0 > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 415 ioapics[idx].mp_config.flags = MPC_APIC_USABLE; > 1: x/i $pc > => 0xc1046a72 <mp_register_ioapic+194>: movb $0x1,-0x3e2cb1b9(%edi) > 2: gsi_base = 0 > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 416 ioapics[idx].mp_config.apicaddr = address; > 1: x/i $pc > => 0xc1046a79 <mp_register_ioapic+201>: mov %esi,-0x3e2cb1b8(%edi) > 2: gsi_base = 0 > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 418 set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address); > 1: x/i $pc > => 0xc1046a7f <mp_register_ioapic+207>: lea 0x5(%ebx),%eax > 2: gsi_base = 0 > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > cachemode2protval (pcm=_PAGE_CACHE_MODE_UC) at ./arch/x86/include/asm/ > pgtable_types.h:439 > 439 return __cachemode2pte_tbl[pcm]; > 1: x/i $pc > => 0xc1046a82 <mp_register_ioapic+210>: movzwl 0xc1b424c6,%ecx > 2: gsi_base = 0 > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > mp_register_ioapic (id=<optimized out>, address=4273995776, gsi_base=0, > cfg=0xc1b2fe88 <init_thread_union+7816>) at arch/x86/kernel/apic/io_apic_ > b.c:418 > 418 set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address); > 1: x/i $pc > => 0xc1046a89 <mp_register_ioapic+217>: or $0x163,%ecx > 2: gsi_base = 0 > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 0xc1046a8f 418 set_fixmap_nocache(FIX_IO_APIC_BASE_0 + > idx, address); > 1: x/i $pc > => 0xc1046a8f <mp_register_ioapic+223>: mov %eax,0x14(%esp) > 2: gsi_base = 0 > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > __set_fixmap (idx=5, phys=<optimized out>, flags=...) at > ./arch/x86/include/asm/paravirt.h:660 > 660 pv_mmu_ops.set_fixmap(idx, phys, flags); > 1: x/i $pc > => 0xc1046a93 <mp_register_ioapic+227>: mov %esi,%edx > 2: gsi_base = 0 > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 0xc1046a95 660 pv_mmu_ops.set_fixmap(idx, phys, flags); > 1: x/i $pc > => 0xc1046a95 <mp_register_ioapic+229>: pushl 0xc1ad6434 > 2: gsi_base = 0 > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 0xc1046a9b 660 pv_mmu_ops.set_fixmap(idx, phys, flags); > 1: x/i $pc > => 0xc1046a9b <mp_register_ioapic+235>: call 0xc18ff3a0 > <__x86_indirect_thunk> > 2: gsi_base = <optimized out> > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > mp_register_ioapic (id=<optimized out>, address=4273995776, > gsi_base=<optimized out>, cfg=0xc1b2fe88 <init_thread_union+7816>) at > arch/x86/kernel/apic/io_apic_b.c:419 > 419 if (bad_ioapic_register(idx)) { > 1: x/i $pc > => 0xc1046aa0 <mp_register_ioapic+240>: mov %ebx,%eax > 2: gsi_base = <optimized out> > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 0xc1046aa2 419 if (bad_ioapic_register(idx)) { > 1: x/i $pc > => 0xc1046aa2 <mp_register_ioapic+242>: call 0xc10455f0 > <bad_ioapic_register> > 2: gsi_base = <optimized out> > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 0xc1046aa7 419 if (bad_ioapic_register(idx)) { > 1: x/i $pc > => 0xc1046aa7 <mp_register_ioapic+247>: test %eax,%eax > 2: gsi_base = <optimized out> > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 0xc1046aa9 419 if (bad_ioapic_register(idx)) { > 1: x/i $pc > => 0xc1046aa9 <mp_register_ioapic+249>: je 0xc1046ae1 > <mp_register_ioapic+305> > 2: gsi_base = <optimized out> > 3: *(int *)0xc1b2fe34 = 0 > (gdb) > 0xc1046ae1 482 } > 1: x/i $pc > => 0xc1046ae1 <mp_register_ioapic+305>: mov %ebx,0xc(%esp) > 2: gsi_base = <optimized out> > 3: *(int *)0xc1b2fe34 = 0 > (gdb) p/x $esp+0xc > $2 = 0xc1b2fe34 > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180209/86e11261/attachment-0001.html>
David Woodhouse via llvm-dev
2018-Feb-09 19:38 UTC
[llvm-dev] retpoline mitigation and 6.0
On Fri, 2018-02-09 at 11:24 -0800, Reid Kleckner wrote:> I haven't read the all the emails in full detail, but it seems pretty > clear that __x86_indirect_thunk and __llvm_retpoline_push do not do > the same things. It sounds like __llvm_retpoline_push is equivalent > to __x86_indirect_thunk except first it swaps the two words on the > top of the stack. > > I arranged it this way because the x86 call instruction puts the > intended return address on the top of the stack, and there's no easy > way to put it anywhere else. We use this thunk when we want to make > an indirect call and there are no available scratch registers, i.e. > 32-bit -mregparm=3 and the call has three or more arguments, which > happens in Linux. One way to avoid this would be to compile with > -mregparm=2, but that would pessimize direct calls unnecessarily. > > It sounds like Linux is already providing its own thunks, so it might > be better to for us to go back to the __llvm_retpoline_push external > thunk name and then get Linux to provide that thunk ifdef __clang__.Yes, the thunks are different in this case. That one should indeed be renamed, and I'll provide the correct one in the kernel. However, even if I let LLVM emit the thunk for itself, code generation seems hosed. It looks like after a call to __llvm_retpoline_push (or perhaps more to the point after pushing the target address in preparation for the call?) we're losing track of where %esp points, and thus everything else that was spilled to the stack. Please let me know if you need any more detail on that, than I provided in my last email. You should have everything you need there to follow along with my analysis and reproduce it for yourself. Except perhaps the Grandma-suck-eggs command lines like $ qemu-system-i386 -display none -serial stdio -kernel arch/x86/boot/bzImage -append earlyprintk=ttyS0,keep -s -S $ gdb vmlinux (gdb) tar rem localhost:1234 (gdb) break mp_register_ioapic (gdb) c -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5213 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180209/e03d7c00/attachment.bin>
I think I see what's going on, and I agree it looks like a bug. It was too much to hope that later passes weren't going to mess with the PUSH instruction. :( While I was trying to reproduce your problem, I think I found another one that looks like this: $ clang -S -O2 -m32 -mregparm=3 -mretpoline spill_across_rp.cpp -o - | grep _retpoline_push -B2 ... movl %eax, 8(%esp) # 4-byte Spill ... pushl %edi movl 8(%esp), %edi # 4-byte Reload calll __llvm_retpoline_push That's obviously broken, it doesn't account for the SP adjustment in the push. It's weird, because it's basically the opposite of the problem you're having, which looks like LLVM *is* accounting for the push and is trying to adjust its offsets accordingly. Can you send along the full command line used to compile io_apic_b.i? On Fri, Feb 9, 2018 at 11:38 AM, David Woodhouse <dwmw2 at infradead.org> wrote:> On Fri, 2018-02-09 at 11:24 -0800, Reid Kleckner wrote: > > I haven't read the all the emails in full detail, but it seems pretty > > clear that __x86_indirect_thunk and __llvm_retpoline_push do not do > > the same things. It sounds like __llvm_retpoline_push is equivalent > > to __x86_indirect_thunk except first it swaps the two words on the > > top of the stack. > > > > I arranged it this way because the x86 call instruction puts the > > intended return address on the top of the stack, and there's no easy > > way to put it anywhere else. We use this thunk when we want to make > > an indirect call and there are no available scratch registers, i.e. > > 32-bit -mregparm=3 and the call has three or more arguments, which > > happens in Linux. One way to avoid this would be to compile with > > -mregparm=2, but that would pessimize direct calls unnecessarily. > > > > It sounds like Linux is already providing its own thunks, so it might > > be better to for us to go back to the __llvm_retpoline_push external > > thunk name and then get Linux to provide that thunk ifdef __clang__. > > Yes, the thunks are different in this case. That one should indeed be > renamed, and I'll provide the correct one in the kernel. > > However, even if I let LLVM emit the thunk for itself, code generation > seems hosed. It looks like after a call to __llvm_retpoline_push (or > perhaps more to the point after pushing the target address in > preparation for the call?) we're losing track of where %esp points, and > thus everything else that was spilled to the stack. > > Please let me know if you need any more detail on that, than I provided > in my last email. You should have everything you need there to follow > along with my analysis and reproduce it for yourself. Except perhaps > the Grandma-suck-eggs command lines like > > $ qemu-system-i386 -display none -serial stdio -kernel > arch/x86/boot/bzImage -append earlyprintk=ttyS0,keep -s -S > > $ gdb vmlinux > (gdb) tar rem localhost:1234 > (gdb) break mp_register_ioapic > (gdb) c >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180209/0ffda508/attachment.html>
David Woodhouse via llvm-dev
2018-Feb-10 10:37 UTC
[llvm-dev] retpoline mitigation and 6.0
> I arranged it this way because the x86 call instruction puts the intended > return address on the top of the stack, and there's no easy way to put it > anywhere else. We use this thunk when we want to make an indirect call and > there are no available scratch registers, i.e. 32-bit -mregparm=3 and the > call has three or more arguments, which happens in Linux. One way to avoid > this would be to compile with -mregparm=2, but that would pessimize direct > calls unnecessarily.I appreciate that x86 is horribly register-starved but there *are* more than three. For many of the cases where LLVM will emit its _push retpoline, GCC would have just used another register trampoline like %ebx or %esi that *isn't* one of the parameters. -- dwmw2