Boris Ostrovsky
2017-Oct-16 18:18 UTC
[Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/12/2017 03:53 PM, Boris Ostrovsky wrote:
> On 10/12/2017 03:27 PM, Andrew Cooper wrote:
>> On 12/10/17 20:11, Boris Ostrovsky wrote:
>>> There is also another problem:
>>>
>>> [ 1.312425] general protection fault: 0000 [#1] SMP
>>> [ 1.312901] Modules linked in:
>>> [ 1.313389] CPU: 0 PID: 1 Comm: init Not tainted 4.14.0-rc4+ #6
>>> [ 1.313878] task: ffff88003e2c0000 task.stack: ffffc9000038c000
>>> [ 1.314360] RIP: 10000e030:entry_SYSCALL_64_fastpath+0x1/0xa5
>>> [ 1.314854] RSP: e02b:ffffc9000038ff50 EFLAGS: 00010046
>>> [ 1.315336] RAX: 000000000000000c RBX: 000055f550168040 RCX: 00007fcfc959f59a
>>> [ 1.315827] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
>>> [ 1.316315] RBP: 000000000000000a R08: 000000000000037f R09: 0000000000000064
>>> [ 1.316805] R10: 000000001f89cbf5 R11: ffff88003e2c0000 R12: 00007fcfc958ad60
>>> [ 1.317300] R13: 0000000000000000 R14: 000055f550185954 R15: 0000000000001000
>>> [ 1.317801] FS: 0000000000000000(0000) GS:ffff88003f800000(0000) knlGS:0000000000000000
>>> [ 1.318267] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 1.318750] CR2: 00007fcfc97ab218 CR3: 000000003c88e000 CR4: 0000000000042660
>>> [ 1.319235] Call Trace:
>>> [ 1.319700] Code: 51 50 57 56 52 51 6a da 41 50 41 51 41 52 41 53 48 83 ec 30 65 4c 8b 1c 25 c0 d2 00 00 41 f7 03 df 39 08 90 0f 85 a5 00 00 00 50 <ff> 15 9c 95 d0 ff 58 48 3d 4c 01 00 00 77 0f 4c 89 d1 ff 14 c5
>>> [ 1.321161] RIP: entry_SYSCALL_64_fastpath+0x1/0xa5 RSP: ffffc9000038ff50
>>> [ 1.344255] ---[ end trace d7cb8cd6cd7c294c ]---
>>> [ 1.345009] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
>>>
>>>
>>> All code
>>> ========
>>>    0:   51                      push   %rcx
>>>    1:   50                      push   %rax
>>>    2:   57                      push   %rdi
>>>    3:   56                      push   %rsi
>>>    4:   52                      push   %rdx
>>>    5:   51                      push   %rcx
>>>    6:   6a da                   pushq  $0xffffffffffffffda
>>>    8:   41 50                   push   %r8
>>>    a:   41 51                   push   %r9
>>>    c:   41 52                   push   %r10
>>>    e:   41 53                   push   %r11
>>>   10:   48 83 ec 30             sub    $0x30,%rsp
>>>   14:   65 4c 8b 1c 25 c0 d2    mov    %gs:0xd2c0,%r11
>>>   1b:   00 00
>>>   1d:   41 f7 03 df 39 08 90    testl  $0x900839df,(%r11)
>>>   24:   0f 85 a5 00 00 00       jne    0xcf
>>>   2a:   50                      push   %rax
>>>   2b:*  ff 15 9c 95 d0 ff       callq  *-0x2f6a64(%rip)   # 0xffffffffffd095cd   <-- trapping instruction
>>>   31:   58                      pop    %rax
>>>   32:   48 3d 4c 01 00 00       cmp    $0x14c,%rax
>>>   38:   77 0f                   ja     0x49
>>>   3a:   4c 89 d1                mov    %r10,%rcx
>>>   3d:   ff                      .byte 0xff
>>>   3e:   14 c5                   adc    $0xc5,%al
>>>
>>>
>>> so the original 'cli' was replaced with the pv call but to me the offset
>>> looks a bit off, no? Shouldn't it always be positive?
>>
>> callq takes a 32bit signed displacement, so jumping back by up to 2G is
>> perfectly legitimate.
>
> Yes, but
>
> ostr at workbase> nm vmlinux | grep entry_SYSCALL_64_fastpath
> ffffffff817365dd t entry_SYSCALL_64_fastpath
> ostr at workbase> nm vmlinux | grep " pv_irq_ops"
> ffffffff81c2dbc0 D pv_irq_ops
> ostr at workbase>
>
> so pv_irq_ops.irq_disable is about 5MB ahead of where we are now. (I
> didn't mean that x86 instruction set doesn't allow negative
> displacement, I was trying to say that pv_irq_ops always live further down)

I believe the problem is this:

    #define PV_INDIRECT(addr) *addr(%rip)

The displacement that the linker computes will be relative to where
this instruction is placed at the time of linking, which is in
.pv_altinstructions (and not .text). So when we copy it into .text the
displacement becomes bogus.

Replacing the macro with

    #define PV_INDIRECT(addr) *addr // well, it's not so much indirect anymore

makes things work. Or maybe it can be adjusted to be kept truly indirect.

-boris
Josh Poimboeuf
2017-Oct-17 05:24 UTC
[Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On Mon, Oct 16, 2017 at 02:18:48PM -0400, Boris Ostrovsky wrote:
> [...]
> I believe the problem is this:
>
>     #define PV_INDIRECT(addr) *addr(%rip)
>
> The displacement that the linker computes will be relative to where
> this instruction is placed at the time of linking, which is in
> .pv_altinstructions (and not .text). So when we copy it into .text the
> displacement becomes bogus.

apply_alternatives() is supposed to adjust that displacement based on
the new IP, though it could be messing that up somehow. (See patch
10/13.)

-- 
Josh
Brian Gerst
2017-Oct-17 13:10 UTC
[Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On Mon, Oct 16, 2017 at 2:18 PM, Boris Ostrovsky
<boris.ostrovsky at oracle.com> wrote:
> [...]
> I believe the problem is this:
>
>     #define PV_INDIRECT(addr) *addr(%rip)
>
> The displacement that the linker computes will be relative to where
> this instruction is placed at the time of linking, which is in
> .pv_altinstructions (and not .text). So when we copy it into .text the
> displacement becomes bogus.
>
> Replacing the macro with
>
>     #define PV_INDIRECT(addr) *addr // well, it's not so much indirect anymore
>
> makes things work. Or maybe it can be adjusted to be kept truly indirect.

That is still an indirect call, just using absolute addressing for the
pointer instead of RIP-relative. Alternatives has very limited
relocation capabilities. It will only handle a single call or jmp
replacement. Using absolute addressing is slightly less efficient
(takes one extra byte to encode, and needs a relocation for KASLR),
but it works just as well. You could also relocate the instruction
manually by adding the delta between the original and replacement code
to the displacement.

-- 
Brian Gerst
Boris Ostrovsky
2017-Oct-17 13:58 UTC
[Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/17/2017 01:24 AM, Josh Poimboeuf wrote:
> On Mon, Oct 16, 2017 at 02:18:48PM -0400, Boris Ostrovsky wrote:
>> [...]
>> I believe the problem is this:
>>
>>     #define PV_INDIRECT(addr) *addr(%rip)
>>
>> The displacement that the linker computes will be relative to where
>> this instruction is placed at the time of linking, which is in
>> .pv_altinstructions (and not .text). So when we copy it into .text the
>> displacement becomes bogus.
>
> apply_alternatives() is supposed to adjust that displacement based on
> the new IP, though it could be messing that up somehow. (See patch
> 10/13.)
>

That patch doesn't take into account the fact that replacement
instructions may have to save/restore registers. So, for example,

-	if (a->replacementlen && is_jmp(replacement[0]))
+	} else if (a->replacementlen == 6 && *insnbuf == 0xff &&
+		   *(insnbuf+1) == 0x15) {
+		/* indirect call */
+		*(s32 *)(insnbuf + 2) += replacement - instr;
+		DPRINTK("Fix indirect CALL offset: 0x%x, CALL *0x%lx",
+			*(s32 *)(insnbuf + 2),
+			(unsigned long)instr + *(s32 *)(insnbuf + 2) + 6);
+

doesn't do the adjustment of

  2a:   50                      push   %rax
  2b:*  ff 15 9c 95 d0 ff       callq  *-0x2f6a64(%rip)
  31:   58                      pop    %rax

because insnbuf points to 'push' and not to 'call'.

-boris
Boris Ostrovsky
2017-Oct-17 14:05 UTC
[Xen-devel] [PATCH 11/13] x86/paravirt: Add paravirt alternatives infrastructure
On 10/17/2017 09:10 AM, Brian Gerst wrote:
> On Mon, Oct 16, 2017 at 2:18 PM, Boris Ostrovsky
> <boris.ostrovsky at oracle.com> wrote:
>>
>> Replacing the macro with
>>
>>     #define PV_INDIRECT(addr) *addr // well, it's not so much indirect anymore
>>
>> makes things work. Or maybe it can be adjusted to be kept truly indirect.
>
> That is still an indirect call, just using absolute addressing for the
> pointer instead of RIP-relative.

Oh, right, I've got my terminology all wrong.

-boris

> Alternatives has very limited
> relocation capabilities. It will only handle a single call or jmp
> replacement. Using absolute addressing is slightly less efficient
> (takes one extra byte to encode, and needs a relocation for KASLR),
> but it works just as well. You could also relocate the instruction
> manually by adding the delta between the original and replacement code
> to the displacement.