Andrew Cooper
2017-Oct-12 19:27 UTC
[Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 12/10/17 20:11, Boris Ostrovsky wrote:> On 10/06/2017 10:32 AM, Josh Poimboeuf wrote: >> On Thu, Oct 05, 2017 at 04:35:03PM -0400, Boris Ostrovsky wrote: >>>> #ifdef CONFIG_PARAVIRT >>>> +/* >>>> + * Paravirt alternatives are applied much earlier than normal alternatives. >>>> + * They are only applied when running on a hypervisor. They replace some >>>> + * native instructions with calls to pv ops. >>>> + */ >>>> +void __init apply_pv_alternatives(void) >>>> +{ >>>> + setup_force_cpu_cap(X86_FEATURE_PV_OPS); >>> Not for Xen HVM guests. >> From what I can tell, HVM guests still use pv_time_ops and >> pv_mmu_ops.exit_mmap, right? >> >>>> + apply_alternatives(__pv_alt_instructions, __pv_alt_instructions_end); >>>> +} >>> This is a problem (at least for Xen PV guests): >>> apply_alternatives()->text_poke_early()->local_irq_save()->...'cli'->death. >> Ah, right. >> >>> It might be possible not to turn off/on the interrupts in this >>> particular case since the guest probably won't be able to handle an >>> interrupt at this point anyway. >> Yeah, that should work. For Xen and for the other hypervisors, this is >> called well before irq init, so interrupts can't be handled yet anyway. > There is also another problem: > > [ 1.312425] general protection fault: 0000 [#1] SMP > [ 1.312901] Modules linked in: > [ 1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6 > [ 1.313878] task: ffff88003e2c0000 task.stack: ffffc9000038c000 > [ 1.314360] RIP: 10000e030:entry_SYSCALL_64_fastpath+0x1/0xa5 > [ 1.314854] RSP: e02b:ffffc9000038ff50 EFLAGS: 00010046 > [ 1.315336] RAX: 000000000000000c RBX: 000055f550168040 RCX: > 00007fcfc959f59a > [ 1.315827] RDX: 0000000000000000 RSI: 0000000000000000 RDI: > 0000000000000000 > [ 1.316315] RBP: 000000000000000a R08: 000000000000037f R09: > 0000000000000064 > [ 1.316805] R10: 000000001f89cbf5 R11: ffff88003e2c0000 R12: > 00007fcfc958ad60 > [ 1.317300] R13: 0000000000000000 R14: 000055f550185954 R15: > 0000000000001000 > [ 1.317801] FS: 0000000000000000(0000) GS:ffff88003f800000(0000) > knlGS:0000000000000000 > [ 1.318267] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 1.318750] CR2: 00007fcfc97ab218 CR3: 000000003c88e000 CR4: > 0000000000042660 > [ 1.319235] Call Trace: > [ 1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48 > 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00 > 00 50 <ff> 15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5 > [ 1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: ffffc9000038ff50 > [ 1.344255] ---[ end trace d7cb8cd6cd7c294c ]--- > [ 1.345009] Kernel panic - not syncing: Attempted to kill init! > exitcode=0x0000000b > > > All code > =======> 0: 51 push %rcx > 1: 50 push %rax > 2: 57 push %rdi > 3: 56 push %rsi > 4: 52 push %rdx > 5: 51 push %rcx > 6: 6a da pushq $0xffffffffffffffda > 8: 41 50 push %r8 > a: 41 51 push %r9 > c: 41 52 push %r10 > e: 41 53 push %r11 > 10: 48 83 ec 30 sub $0x30,%rsp > 14: 65 4c 8b 1c 25 c0 d2 mov %gs:0xd2c0,%r11 > 1b: 00 00 > 1d: 41 f7 03 df 39 08 90 testl $0x900839df,(%r11) > 24: 0f 85 a5 00 00 00 jne 0xcf > 2a: 50 push %rax > 2b:* ff 15 9c 95 d0 ff callq *-0x2f6a64(%rip) # > 0xffffffffffd095cd <-- trapping instruction > 31: 58 pop %rax > 32: 48 3d 4c 01 00 00 cmp $0x14c,%rax > 38: 77 0f ja 0x49 > 3a: 4c 89 d1 mov %r10,%rcx > 3d: ff .byte 0xff > 3e: 14 c5 adc $0xc5,%al > > > so the original 'cli' was replaced with the pv call but to me the offset > looks a bit off, no? Shouldn't it always be positive?callq takes a 32bit signed displacement, so jumping back by up to 2G is perfectly legitimate. The #GP[0] however means that whatever 8 byte value was found at -0x2f6a64(%rip) was a non-canonical address. One option is that the pvops structure hasn't been initialised properly, but an alternative is that the relocation wasn't processed correctly, and the code is trying to reference something which isn't a function pointer. ~Andrew
Boris Ostrovsky
2017-Oct-12 19:53 UTC
[Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/12/2017 03:27 PM, Andrew Cooper wrote:> On 12/10/17 20:11, Boris Ostrovsky wrote: >> On 10/06/2017 10:32 AM, Josh Poimboeuf wrote: >>> On Thu, Oct 05, 2017 at 04:35:03PM -0400, Boris Ostrovsky wrote: >>>>> #ifdef CONFIG_PARAVIRT >>>>> +/* >>>>> + * Paravirt alternatives are applied much earlier than normal alternatives. >>>>> + * They are only applied when running on a hypervisor. They replace some >>>>> + * native instructions with calls to pv ops. >>>>> + */ >>>>> +void __init apply_pv_alternatives(void) >>>>> +{ >>>>> + setup_force_cpu_cap(X86_FEATURE_PV_OPS); >>>> Not for Xen HVM guests. >>> From what I can tell, HVM guests still use pv_time_ops and >>> pv_mmu_ops.exit_mmap, right? >>> >>>>> + apply_alternatives(__pv_alt_instructions, __pv_alt_instructions_end); >>>>> +} >>>> This is a problem (at least for Xen PV guests): >>>> apply_alternatives()->text_poke_early()->local_irq_save()->...'cli'->death. >>> Ah, right. >>> >>>> It might be possible not to turn off/on the interrupts in this >>>> particular case since the guest probably won't be able to handle an >>>> interrupt at this point anyway. >>> Yeah, that should work. For Xen and for the other hypervisors, this is >>> called well before irq init, so interrupts can't be handled yet anyway. >> There is also another problem: >> >> [ 1.312425] general protection fault: 0000 [#1] SMP >> [ 1.312901] Modules linked in: >> [ 1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6 >> [ 1.313878] task: ffff88003e2c0000 task.stack: ffffc9000038c000 >> [ 1.314360] RIP: 10000e030:entry_SYSCALL_64_fastpath+0x1/0xa5 >> [ 1.314854] RSP: e02b:ffffc9000038ff50 EFLAGS: 00010046 >> [ 1.315336] RAX: 000000000000000c RBX: 000055f550168040 RCX: >> 00007fcfc959f59a >> [ 1.315827] RDX: 0000000000000000 RSI: 0000000000000000 RDI: >> 0000000000000000 >> [ 1.316315] RBP: 000000000000000a R08: 000000000000037f R09: >> 0000000000000064 >> [ 1.316805] R10: 000000001f89cbf5 R11: ffff88003e2c0000 R12: >> 00007fcfc958ad60 >> [ 1.317300] R13: 0000000000000000 R14: 000055f550185954 R15: >> 0000000000001000 >> [ 1.317801] FS: 0000000000000000(0000) GS:ffff88003f800000(0000) >> knlGS:0000000000000000 >> [ 1.318267] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 1.318750] CR2: 00007fcfc97ab218 CR3: 000000003c88e000 CR4: >> 0000000000042660 >> [ 1.319235] Call Trace: >> [ 1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48 >> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00 >> 00 50 <ff> 15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5 >> [ 1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: ffffc9000038ff50 >> [ 1.344255] ---[ end trace d7cb8cd6cd7c294c ]--- >> [ 1.345009] Kernel panic - not syncing: Attempted to kill init! >> exitcode=0x0000000b >> >> >> All code >> =======>> 0: 51 push %rcx >> 1: 50 push %rax >> 2: 57 push %rdi >> 3: 56 push %rsi >> 4: 52 push %rdx >> 5: 51 push %rcx >> 6: 6a da pushq $0xffffffffffffffda >> 8: 41 50 push %r8 >> a: 41 51 push %r9 >> c: 41 52 push %r10 >> e: 41 53 push %r11 >> 10: 48 83 ec 30 sub $0x30,%rsp >> 14: 65 4c 8b 1c 25 c0 d2 mov %gs:0xd2c0,%r11 >> 1b: 00 00 >> 1d: 41 f7 03 df 39 08 90 testl $0x900839df,(%r11) >> 24: 0f 85 a5 00 00 00 jne 0xcf >> 2a: 50 push %rax >> 2b:* ff 15 9c 95 d0 ff callq *-0x2f6a64(%rip) # >> 0xffffffffffd095cd <-- trapping instruction >> 31: 58 pop %rax >> 32: 48 3d 4c 01 00 00 cmp $0x14c,%rax >> 38: 77 0f ja 0x49 >> 3a: 4c 89 d1 mov %r10,%rcx >> 3d: ff .byte 0xff >> 3e: 14 c5 adc $0xc5,%al >> >> >> so the original 'cli' was replaced with the pv call but to me the offset >> looks a bit off, no? Shouldn't it always be positive? > callq takes a 32bit signed displacement, so jumping back by up to 2G is > perfectly legitimate.Yes, but ostr at workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath ffffffff817365dd t entry_SYSCALL_64_fastpath ostr at workbase> nm vmlinux | grep " pv_irq_ops" ffffffff81c2dbc0 D pv_irq_ops ostr at workbase> so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I didn't mean that x86 instruction set doesn't allow negative displacement, I was trying to say that pv_irq_ops always live further down)> > The #GP[0] however means that whatever 8 byte value was found at > -0x2f6a64(%rip) was a non-canonical address. > > One option is that the pvops structure hasn't been initialised properly,It was, I did check that. And just to make sure I re-initialized it before alt instructions were rewritten.> but an alternative is that the relocation wasn't processed correctly, > and the code is trying to reference something which isn't a function > pointer.Let me see if I can poke at what's there. -boris
Boris Ostrovsky
2017-Oct-16 18:18 UTC
[Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:> On 10/12/2017 03:27 PM, Andrew Cooper wrote: >> On 12/10/17 20:11, Boris Ostrovsky wrote: >>> There is also another problem: >>> >>> [ 1.312425] general protection fault: 0000 [#1] SMP >>> [ 1.312901] Modules linked in: >>> [ 1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6 >>> [ 1.313878] task: ffff88003e2c0000 task.stack: ffffc9000038c000 >>> [ 1.314360] RIP: 10000e030:entry_SYSCALL_64_fastpath+0x1/0xa5 >>> [ 1.314854] RSP: e02b:ffffc9000038ff50 EFLAGS: 00010046 >>> [ 1.315336] RAX: 000000000000000c RBX: 000055f550168040 RCX: >>> 00007fcfc959f59a >>> [ 1.315827] RDX: 0000000000000000 RSI: 0000000000000000 RDI: >>> 0000000000000000 >>> [ 1.316315] RBP: 000000000000000a R08: 000000000000037f R09: >>> 0000000000000064 >>> [ 1.316805] R10: 000000001f89cbf5 R11: ffff88003e2c0000 R12: >>> 00007fcfc958ad60 >>> [ 1.317300] R13: 0000000000000000 R14: 000055f550185954 R15: >>> 0000000000001000 >>> [ 1.317801] FS: 0000000000000000(0000) GS:ffff88003f800000(0000) >>> knlGS:0000000000000000 >>> [ 1.318267] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 1.318750] CR2: 00007fcfc97ab218 CR3: 000000003c88e000 CR4: >>> 0000000000042660 >>> [ 1.319235] Call Trace: >>> [ 1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48 >>> 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00 >>> 00 50 <ff> 15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5 >>> [ 1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: ffffc9000038ff50 >>> [ 1.344255] ---[ end trace d7cb8cd6cd7c294c ]--- >>> [ 1.345009] Kernel panic - not syncing: Attempted to kill init! >>> exitcode=0x0000000b >>> >>> >>> All code >>> =======>>> 0: 51 push %rcx >>> 1: 50 push %rax >>> 2: 57 push %rdi >>> 3: 56 push %rsi >>> 4: 52 push %rdx >>> 5: 51 push %rcx >>> 6: 6a da pushq $0xffffffffffffffda >>> 8: 41 50 push %r8 >>> a: 41 51 push %r9 >>> c: 41 52 push %r10 >>> e: 41 53 push %r11 >>> 10: 48 83 ec 30 sub $0x30,%rsp >>> 14: 65 4c 8b 1c 25 c0 d2 mov %gs:0xd2c0,%r11 >>> 1b: 00 00 >>> 1d: 41 f7 03 df 39 08 90 testl $0x900839df,(%r11) >>> 24: 0f 85 a5 00 00 00 jne 0xcf >>> 2a: 50 push %rax >>> 2b:* ff 15 9c 95 d0 ff callq *-0x2f6a64(%rip) # >>> 0xffffffffffd095cd <-- trapping instruction >>> 31: 58 pop %rax >>> 32: 48 3d 4c 01 00 00 cmp $0x14c,%rax >>> 38: 77 0f ja 0x49 >>> 3a: 4c 89 d1 mov %r10,%rcx >>> 3d: ff .byte 0xff >>> 3e: 14 c5 adc $0xc5,%al >>> >>> >>> so the original 'cli' was replaced with the pv call but to me the offset >>> looks a bit off, no? Shouldn't it always be positive? >> callq takes a 32bit signed displacement, so jumping back by up to 2G is >> perfectly legitimate. > Yes, but > > ostr at workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath > ffffffff817365dd t entry_SYSCALL_64_fastpath > ostr at workbase> nm vmlinux | grep " pv_irq_ops" > ffffffff81c2dbc0 D pv_irq_ops > ostr at workbase> > > so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I > didn't mean that x86 instruction set doesn't allow negative > displacement, I was trying to say that pv_irq_ops always live further down)I believe the problem is this: #define PV_INDIRECT(addr) *addr(%rip) The displacement that the linker computes will be relative to the where this instruction is placed at the time of linking, which is in .pv_altinstructions (and not .text). So when we copy it into .text the displacement becomes bogus. Replacing the macro with #define PV_INDIRECT(addr) *addr // well, it's not so much indirect anymore makes things work. Or maybe it can be adjusted top be kept truly indirect. -boris
Possibly Parallel Threads
- [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
- [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
- [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
- [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
- [Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure