Rusty Russell
2007-Apr-18 13:02 UTC
[PATCH] (with benchmarks) binary patching of paravirt_ops call sites
Hi all,

	Sorry for the delay.  This implements binary patching of call sites
for interrupt-related paravirt ops, since no doubt Andi wasn't the only one
to believe this approach is slow.

The benchmarks were done on a UP 3GHz Pentium 4 with 512MB of RAM:
2.6.17-rc4, vs 2.6.17-rc4 with CONFIG_PARAVIRT=y, vs 2.6.17-rc4 with
CONFIG_PARAVIRT=y plus this patch.

Summary: with binary patching, the difference from CONFIG_PARAVIRT=n is in
the noise (with the possible exception of lmbench's exec).  Full results
can be found in subdirs of
http://kernel.org/pub/linux/kernel/people/rusty/Paravirt/stats

The patch sites are 10-12 bytes long; we can shave more bytes off by
telling GCC we clobber regs.  The Xen SMP cli() code might actually need
more bytes than this, but I haven't measured it yet.

Kernel compile system time:
	normal		40.73
	paravirt	41.0633	[0.818%]
	paravirt-patch	39.99	[-1.816%]
Kernel compile wall time:
	normal		84.1506
	paravirt	84.7539	[0.716%]
	paravirt-patch	84.4994	[0.414%]
Tbench:
	normal		73.473
	paravirt	73.7777	[0.414%]
	paravirt-patch	73.8841	[0.559%]
Dbench:
	normal		264.418
	paravirt	262.445	[-0.746%]
	paravirt-patch	262.26	[-0.816%]
Ubench:
	normal		2.95643
	paravirt	2.89111	[-2.209%]
	paravirt-patch	2.89052	[-2.229%]
Lmbench null syscall:
	normal		0.37
	paravirt	0.38	[2.702%]
	paravirt-patch	0.37	[0%]
Lmbench null I/O:
	normal		0.452
	paravirt	0.46	[1.769%]
	paravirt-patch	0.452	[0%]
Lmbench fork:
	normal		73.14
	paravirt	73.725	[0.799%]
	paravirt-patch	72.86	[-0.382%]
Lmbench exec:
	normal		348.8
	paravirt	365.75	[4.859%]
	paravirt-patch	361.4	[3.612%]
Lmbench sh:
	normal		6000
	paravirt	6072.25	[1.204%]
	paravirt-patch	6027.6	[0.460%]
Lmbench 2p/0k context switch:
	normal		1.29
	paravirt	1.395	[8.139%]
	paravirt-patch	1.288	[-0.155%]
Lmbench pipe latency:
	normal		5.026
	paravirt	5.20875	[3.636%]
	paravirt-patch	4.9856	[-0.803%]
Lmbench UNIX socket latency:
	normal		7.86
	paravirt	8.5375	[8.619%]
	paravirt-patch	7.864	[0.050%]
Lmbench UDP latency:
	normal		10.8
	paravirt	12.275	[13.657%]
	paravirt-patch	11.06	[2.407%]
Lmbench TCP latency:
	normal		13.28
	paravirt	14.45	[8.810%]
	paravirt-patch	13.34	[0.451%]
Lmbench TCP connection latency:
	normal		45.48
	paravirt	51.1	[12.357%]
	paravirt-patch	45.52	[0.087%]
Lmbench pipe bandwidth:
	normal		2522.2
	paravirt	2335.5	[-7.402%]
	paravirt-patch	2401	[-4.805%]
Lmbench UNIX socket bandwidth:
	normal		2935
	paravirt	2617	[-10.834%]
	paravirt-patch	2788.2	[-5.001%]
Lmbench TCP socket bandwidth:
	normal		627.8
	paravirt	670	[6.721%]
	paravirt-patch	635.2	[1.178%]

==
Name: Binary patch over critical fastpaths
Depends: Paravirt/function-call-abstraction.patch.gz
Status: Tested on 2.6.17-rc4
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

It turns out that the most-called ops, by several orders of magnitude, are
the interrupt manipulation ops.  These are obvious candidates for patching,
so mark them up and create infrastructure for it.

The method used is that the ops structure has a patch function, which is
called for each place that needs to be patched: it returns the number of
bytes of replacement instructions it wrote, and the rest of the site is
NOP-padded.
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.6.17-rc4-function-call-abstraction/arch/i386/kernel/alternative.c working-2.6.17-rc4-bench/arch/i386/kernel/alternative.c
--- working-2.6.17-rc4-function-call-abstraction/arch/i386/kernel/alternative.c	2006-05-16 10:50:48.000000000 +1000
+++ working-2.6.17-rc4-bench/arch/i386/kernel/alternative.c	2006-05-23 13:10:26.000000000 +1000
@@ -3,6 +3,7 @@
 #include <linux/list.h>
 #include <asm/alternative.h>
 #include <asm/sections.h>
+#include <asm/paravirt.h>
 
 #define DEBUG 0
 #if DEBUG
@@ -283,6 +284,35 @@ void alternatives_smp_switch(int smp)
 	spin_unlock_irqrestore(&smp_alt, flags);
 }
 
+#ifdef CONFIG_PARAVIRT
+void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
+{
+	unsigned char **noptable = find_nop_table();
+	struct paravirt_patch *p;
+	int diff, i, k;
+
+	for (p = start; p < end; p++) {
+		unsigned int used;
+		used = paravirt_ops.patch(p->instrtype, p->instr, p->len);
+		/* Pad the rest with nops */
+		diff = p->len - used;
+		for (i = used; diff > 0; diff -= k, i += k) {
+			k = diff;
+			if (k > ASM_NOP_MAX)
+				k = ASM_NOP_MAX;
+			memcpy(p->instr + i, noptable[k], k);
+		}
+	}
+}
+extern struct paravirt_patch __start_parainstructions[],
+	__stop_parainstructions[];
+#else
+void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
+{
+}
+extern char __start_parainstructions[], __stop_parainstructions[];
+#endif	/* CONFIG_PARAVIRT */
+
 void __init alternative_instructions(void)
 {
 	apply_alternatives(__alt_instructions, __alt_instructions_end);
@@ -318,4 +354,6 @@ void __init alternative_instructions(voi
 			   _text, _etext);
 		alternatives_smp_switch(0);
 	}
+
+	apply_paravirt(__start_parainstructions, __stop_parainstructions);
 }
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.6.17-rc4-function-call-abstraction/arch/i386/kernel/module.c working-2.6.17-rc4-bench/arch/i386/kernel/module.c
--- working-2.6.17-rc4-function-call-abstraction/arch/i386/kernel/module.c	2006-05-16 10:50:48.000000000 +1000
+++ working-2.6.17-rc4-bench/arch/i386/kernel/module.c	2006-05-22 15:21:55.000000000 +1000
@@ -108,7 +108,8 @@ int module_finalize(const Elf_Ehdr *hdr,
 		    const Elf_Shdr *sechdrs,
 		    struct module *me)
 {
-	const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL;
+	const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
+		*para = NULL;
 	char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
 
 	for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
@@ -118,6 +119,8 @@ int module_finalize(const Elf_Ehdr *hdr,
 			alt = s;
 		if (!strcmp(".smp_locks", secstrings + s->sh_name))
 			locks= s;
+		if (!strcmp(".parainstructions", secstrings + s->sh_name))
+			para = s;
 	}
 
 	if (alt) {
@@ -132,6 +135,10 @@ int module_finalize(const Elf_Ehdr *hdr,
 				    lseg, lseg + locks->sh_size,
 				    tseg, tseg + text->sh_size);
 	}
+	if (para) {
+		void *aseg = (void *)para->sh_addr;
+		apply_paravirt(aseg, aseg + para->sh_size);
+	}
 
 	return 0;
 }
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.6.17-rc4-function-call-abstraction/arch/i386/kernel/paravirt.c working-2.6.17-rc4-bench/arch/i386/kernel/paravirt.c
--- working-2.6.17-rc4-function-call-abstraction/arch/i386/kernel/paravirt.c	2006-05-23 14:01:46.000000000 +1000
+++ working-2.6.17-rc4-bench/arch/i386/kernel/paravirt.c	2006-05-22 15:21:55.000000000 +1000
@@ -336,8 +336,35 @@ static void nopara_set_iopl_mask(unsigne
 extern void nopara_iret(void);
 extern void nopara_irq_enable_sysexit(void);
 
+/* Simple instruction patching code. */
+static struct native_insns
+{
+	unsigned int len;
+	const char *insns;
+} native_insns[] = {
+	[PARAVIRT_IRQ_DISABLE] = { 1, "\xFA" /* cli */ },
+	[PARAVIRT_IRQ_ENABLE] = { 1, "\xFB" /* sti */ },
+	[PARAVIRT_RESTORE_FLAGS] = { 2, "\x50\x9D" /* push %eax; popf */ },
+	[PARAVIRT_SAVE_FLAGS] = { 2, "\x9c\x58" /* pushf; pop %eax */ },
+};
+
+static unsigned nopara_patch(unsigned int type, void *firstinsn, unsigned len)
+{
+	/* Don't touch it if we don't have a replacement */
+	if (type >= ARRAY_SIZE(native_insns) || !native_insns[type].insns)
+		return len;
+
+	/* Similarly if we can't fit replacement. */
+	if (len < native_insns[type].len)
+		return len;
+
+	memcpy(firstinsn, native_insns[type].insns, native_insns[type].len);
+	return native_insns[type].len;
+}
+
 struct paravirt_ops paravirt_ops = {
 	.kernel_rpl = 0,
+	.patch = nopara_patch,
 	.cpuid = nopara_cpuid,
 	.get_debugreg = nopara_get_debugreg,
 	.set_debugreg = nopara_set_debugreg,
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.6.17-rc4-function-call-abstraction/arch/i386/kernel/vmlinux.lds.S working-2.6.17-rc4-bench/arch/i386/kernel/vmlinux.lds.S
--- working-2.6.17-rc4-function-call-abstraction/arch/i386/kernel/vmlinux.lds.S	2006-05-16 10:50:48.000000000 +1000
+++ working-2.6.17-rc4-bench/arch/i386/kernel/vmlinux.lds.S	2006-05-23 12:55:11.000000000 +1000
@@ -128,6 +128,12 @@ SECTIONS
   .altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) {
 	*(.altinstr_replacement)
   }
+  . = ALIGN(4);
+  __start_parainstructions = .;
+  .parainstructions : AT(ADDR(.parainstructions) - LOAD_OFFSET) {
+	*(.parainstructions)
+  }
+  __stop_parainstructions = .;
   /* .exit.text is discard at runtime, not link time, to deal with references
      from .altinstructions and .eh_frame */
   .exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) { *(.exit.text) }
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.6.17-rc4-function-call-abstraction/include/asm-i386/alternative.h working-2.6.17-rc4-bench/include/asm-i386/alternative.h
--- working-2.6.17-rc4-function-call-abstraction/include/asm-i386/alternative.h	2006-05-16 10:51:38.000000000 +1000
+++ working-2.6.17-rc4-bench/include/asm-i386/alternative.h	2006-05-22 15:21:55.000000000 +1000
@@ -20,7 +20,8 @@ extern void alternatives_smp_module_add(
 					void *text, void *text_end);
 extern void alternatives_smp_module_del(struct module *mod);
 extern void alternatives_smp_switch(int smp);
-
+struct paravirt_patch;
+void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end);
 #endif
 
 /*
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.6.17-rc4-function-call-abstraction/include/asm-i386/paravirt.h working-2.6.17-rc4-bench/include/asm-i386/paravirt.h
--- working-2.6.17-rc4-function-call-abstraction/include/asm-i386/paravirt.h	2006-05-23 14:01:46.000000000 +1000
+++ working-2.6.17-rc4-bench/include/asm-i386/paravirt.h	2006-05-23 12:59:48.000000000 +1000
@@ -3,11 +3,18 @@
 /* Various instructions on x86 need to be replaced for
  * para-virtualization: those hooks are defined here. */
 #include <linux/config.h>
+#include <linux/stringify.h>
 
 #ifndef CONFIG_PARAVIRT
 #include <asm/no_paravirt.h>
 #else
 
+/* These are the most common ops, so we want to be able to patch callers. */
+#define PARAVIRT_IRQ_DISABLE 0
+#define PARAVIRT_IRQ_ENABLE 1
+#define PARAVIRT_RESTORE_FLAGS 2
+#define PARAVIRT_SAVE_FLAGS 3
+
 #ifndef __ASSEMBLY__
 struct thread_struct;
 struct Xgt_desc_struct;
@@ -15,6 +22,8 @@ struct paravirt_ops
 {
 	unsigned int kernel_rpl;
 
+	unsigned (*patch)(unsigned int type, void *firstinsn, unsigned len);
+
 	void (*cpuid)(unsigned int *eax, unsigned int *ebx,
 		      unsigned int *ecx, unsigned int *edx);
 
@@ -108,10 +117,6 @@ static inline void sync_core(void)
 #define read_cr4_safe(x) paravirt_ops.read_cr4_safe()
 #define write_cr4(x) paravirt_ops.write_cr4(x)
 
-#define __local_save_flags() paravirt_ops.save_fl()
-#define __local_irq_restore(f) paravirt_ops.restore_fl(f)
-#define local_irq_disable() paravirt_ops.irq_disable()
-#define local_irq_enable() paravirt_ops.irq_enable()
 #define safe_halt() paravirt_ops.safe_halt()
 #define halt() paravirt_ops.halt()
 #define wbinvd() paravirt_ops.wbinvd()
@@ -181,11 +186,85 @@ static inline void sync_core(void)
 #define write_gdt_entry(dt, entry, a, b) (paravirt_ops.write_gdt_entry((dt), (entry), ((u64)a) << 32 | b))
 #define write_idt_entry(dt, entry, a, b) (paravirt_ops.write_idt_entry((dt), (entry), ((u64)a) << 32 | b))
 #define set_iopl_mask(mask) (paravirt_ops.set_iopl_mask(mask))
+
+/* These all sit in the .parainstructions section to tell us what to patch. */
+struct paravirt_patch {
+	u8 *instr;		/* original instructions */
+	u8 instrtype;		/* type of this instruction */
+	u8 len;			/* length of original instruction */
+	u16 pad;
+};
+
+#define paravirt_alt(insn_string, typenum)		\
+	"771:\n\t" insn_string "\n" "772:\n"		\
+	".section .parainstructions,\"a\"\n"		\
+	"  .align 4\n"					\
+	"  .long 771b\n"				\
+	"  .byte " __stringify(typenum) "\n"		\
+	"  .byte 772b-771b\n"				\
+	".previous"
+
+static inline unsigned long __local_save_flags(void)
+{
+	register unsigned long f asm("eax");
+	/* Asm is to place value in eax. */
+	__asm__ __volatile__(paravirt_alt("pushl %%ecx; pushl %%edx\n\t"
+					  "call *%1\n\t"
+					  "popl %%edx; popl %%ecx",
+					  PARAVIRT_SAVE_FLAGS)
+			     : "=r"(f): "m"(paravirt_ops.save_fl) : "memory");
+	return f;
+}
+
+static inline void __local_irq_restore(unsigned long f)
+{
+	/* We guarantee that f is in eax on entry to asm. */
+	register unsigned long eax asm("eax") = f;
+	__asm__ __volatile__(paravirt_alt("pushl %%ecx; pushl %%edx\n\t"
+					  "pushl %1; call *%0\n\t"
+					  "popl %1; popl %%edx; popl %%ecx",
+					  PARAVIRT_RESTORE_FLAGS)
+			     : : "m" (paravirt_ops.restore_fl), "r"(eax)
+			     : "memory");
+}
+
+static inline void local_irq_disable(void)
+{
+	__asm__ __volatile__(paravirt_alt("pushl %%ecx; pushl %%edx\n\t"
+					  "pushl %%eax; call *%0\n\t"
+					  "popl %%eax; popl %%edx; popl %%ecx",
+					  PARAVIRT_IRQ_DISABLE)
+			     : : "m" (paravirt_ops.irq_disable) : "memory");
+}
+
+static inline void local_irq_enable(void)
+{
+	__asm__ __volatile__(paravirt_alt("pushl %%ecx; pushl %%edx\n\t"
+					  "pushl %%eax; call *%0\n\t"
+					  "popl %%eax; popl %%edx; popl %%ecx",
+					  PARAVIRT_IRQ_ENABLE)
+			     : : "m" (paravirt_ops.irq_enable) : "memory");
+}
+
+#define CLI_STRING paravirt_alt("pushl %eax; pushl %ecx; pushl %edx; call *paravirt_ops+PARAVIRT_irq_disable; popl %edx; popl %ecx; popl %eax", PARAVIRT_IRQ_DISABLE)
+#define STI_STRING paravirt_alt("pushl %eax; pushl %ecx; pushl %edx; call *paravirt_ops+PARAVIRT_irq_enable; popl %edx; popl %ecx; popl %eax", PARAVIRT_IRQ_ENABLE)
+
 #else  /* ... __ASSEMBLY__ */
 
+#define PARA_PATCH(ptype, ops)			\
+771:;						\
+	ops;					\
+772:;						\
+	.section .parainstructions,"a";		\
+	 .align 4;				\
+	 .long 771b;				\
+	 .byte ptype;				\
+	 .byte 772b-771b;			\
+	.previous
+
 #define IRET jmp *paravirt_ops+PARAVIRT_iret
-#define CLI pushl %eax; pushl %ecx; pushl %edx; call *paravirt_ops+PARAVIRT_irq_disable; popl %edx; popl %ecx; popl %eax
-#define STI pushl %eax; pushl %ecx; pushl %edx; call *paravirt_ops+PARAVIRT_irq_enable; popl %edx; popl %ecx; popl %eax
+#define CLI PARA_PATCH(PARAVIRT_IRQ_DISABLE,pushl %eax; pushl %ecx; pushl %edx; call *paravirt_ops+PARAVIRT_irq_disable; popl %edx; popl %ecx; popl %eax)
+#define STI PARA_PATCH(PARAVIRT_IRQ_ENABLE,pushl %eax; pushl %ecx; pushl %edx; call *paravirt_ops+PARAVIRT_irq_enable; popl %edx; popl %ecx; popl %eax)
 #define STI_SYSEXIT jmp *paravirt_ops+PARAVIRT_irq_enable_sysexit
 
 #define GET_CR0 pushl %ecx; pushl %edx; call *paravirt_ops+PARAVIRT_read_cr0; popl %edx; popl %ecx
@@ -204,9 +283,6 @@ static inline void sync_core(void)
 	popl %eax
 #endif /* __ASSEMBLY__ */
 
-#define CLI_STRING "pushl %eax; pushl %ecx; pushl %edx; call *paravirt_ops+PARAVIRT_irq_disable; popl %edx; popl %ecx; popl %eax"
-#define STI_STRING "pushl %eax; pushl %ecx; pushl %edx; call *paravirt_ops+PARAVIRT_irq_enable; popl %edx; popl %ecx; popl %eax"
-
 #endif	/* PARAVIRT */
 #endif	/* __ASM_PARAVIRT_H */

-- 
ccontrol: http://ccontrol.ozlabs.org
Zachary Amsden
2007-Apr-18 17:49 UTC
[PATCH] (with benchmarks) binary patching of paravirt_ops call sites
Rusty Russell wrote:
> Hi all,
>
> Sorry for the delay.  This implements binary patching of call sites for
> interrupt-related paravirt ops, since no doubt Andi wasn't the only one
> to believe this approach is slow.

Sorry to take so long to look over this.  I believe this is another good
step.  But you do need more - I believe the following are extremely
sensitive to context switch latency:

> Lmbench pipe bandwidth:
> 	normal		2522.2
> 	paravirt	2335.5	[-7.402%]
> 	paravirt-patch	2401	[-4.805%]
> Lmbench UNIX socket bandwidth:
> 	normal		2935
> 	paravirt	2617	[-10.834%]
> 	paravirt-patch	2788.2	[-5.001%]

This means you'll probably need to inline / patch everything on the common
path in switch_to, which includes GDT updates and a reload of CR3.  You'll
probably also want to inline / patch the read of CR2 in the page fault
path.

So if you have to do inlining for both read and write CR accessors,
doesn't it seem easier to just do them all and be done with the stub
implementations?  Having a common approach is what led us down the path of
full-blown patching, as it was easier to maintain than an ad-hoc set of
interfaces selected simply by virtue of being on the critical path.

The critical paths are quite a bit different on 64-bit as well, which
means things like CR8 and WRMSR become important to inline.

In either case, letting the kernel decide which interfaces to complicate
with inline patching could be the best solution - but we'd have to be
careful to require non-virtualizable interfaces (or interfaces which
require memory trapping) to always provide patchable alternatives.

Zach