Jeremy Fitzhardinge
2008-May-31 00:04 UTC
[PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction (take 2)
Hi all, [ Change since last post: change name to ptep_modify_prot_, on the grounds that it isn't really a general pte-modification interface. ] This little series adds a new transaction-like abstraction for doing RMW updates to a pte, hooks it into paravirt_ops, and then makes use of it in Xen. The basic problem is that mprotect is very slow under Xen (up to 50x slower than native), primarily because of the ptent = ptep_get_and_clear(mm, addr, pte); ptent = pte_modify(ptent, newprot); /* ... */ set_pte_at(mm, addr, pte, ptent); sequence in mm/mprotect.c:change_pte_range(). This is bad for Xen for two reasons: 1: ptep_get_and_clear() ends up being a xchg on the pte. Since the pte page is read-only (as it must be, because Xen needs to control all pte updates), this traps into Xen, which then emulates the instruction. Trapping into the instruction emulator is inherently expensive. And, 2: because ptep_get_and_clear has atomic-fetch-and-update semantics, it's impossible to implement in a way which can be batched to amortize the cost of trapping into the hypervisor. This series adds the ptep_modify_prot_start() and ptep_modify_prot_commit() operations, which change this sequence to: ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); /* ... */ ptep_modify_prot_commit(mm, addr, pte, ptent); Which looks very familiar. And, indeed, when compiled without CONFIG_PARAVIRT (or on a non-x86 architecture), it will end up doing precisely the same thing as before. However, the effective semantics are a bit different. ptep_modify_prot_start() means "I'm reading this pte with the intention of updating it; please don't lose any hardware pte changes in the meantime". And ptep_modify_prot_commit() means "Here's a new value for the pte, but make sure you don't lose any hardware changes". The default implementation achieves these semantics by making ptep_modify_prot_start() set the pte to non-present, which prevents any async hardware changes to the pte. The ptep_modify_prot_commit() can then just write the new value into place without having to worry about preserving any changes, because it knows there are none. Xen implements ptep_modify_prot_start() as a simple read of the pte. This leaves the pte unchanged in memory, and the hardware may make asynchronous changes to it. It implements ptep_modify_prot_commit() using a batched hypercall which preserves the state of the Access/Dirty bits when updating the pte. This allows the whole change_pte_range() loop to be run without any synchronous unbatched traps into the hypervisor. With this change in place, an mprotect microbenchmark goes from being 50x worse than native to around 7x, which is acceptible. I believe that other virtualization systems, whether they use direct paging like Xen, or a shadow pagetable scheme (vmi, kvm, lguest), can make use of this interface to improve the performance. Unfortunately (or fortunately) there aren't very many other areas of the kernel which can really take advantage of this. There's only a couple of other instances of ptep_get_and_clear() in mm/, and they're being used in a similar way; but I don't think they're very performance critical (though zap_pte_range might be interesting). In general, mprotect is rarely a performance bottleneck. But some debugging libraries (such as electric fence) and garbage collectors can be very heavy users of mprotect, and this change could materially benefit them. Thanks, J
Jeremy Fitzhardinge
2008-May-31 00:04 UTC
[PATCH 1 of 4] mm: add a ptep_modify_prot transaction abstraction
This patch adds an API for doing read-modify-write updates to a pte's protection bits which may race against hardware updates to the pte. After reading the pte, the hardware may asynchonously set the accessed or dirty bits on a pte, which would be lost when writing back the modified pte value. The existing technique to handle this race is to use ptep_get_and_clear() atomically fetch the old pte value and clear it in memory. This has the effect of marking the pte as non-present, which will prevent the hardware from updating its state. When the new value is written back, the pte will be present again, and the hardware can resume updating the access/dirty flags. When running in a virtualized environment, pagetable updates are relatively expensive, since they generally involve some trap into the hypervisor. To mitigate the cost of these updates, we tend to batch them. However, because of the atomic nature of ptep_get_and_clear(), it is inherently non-batchable. This new interface allows batching by giving the underlying implementation enough information to open a transaction between the read and write phases: ptep_modify_prot_start() returns the current pte value, and puts the pte entry into a state where either the hardware will not update the pte, or if it does, the updates will be preserved on commit. ptep_modify_prot_commit() writes back the updated pte, makes sure that any hardware updates made since ptep_modify_prot_start() are preserved. ptep_modify_prot_start() and _commit() must be exactly paired, and used while holding the appropriate pte lock. They do not protect against other software updates of the pte in any way. The current implementations of ptep_modify_prot_start and _commit are functionally unchanged from before: _start() uses ptep_get_and_clear() fetch the pte and zero the entry, preventing any hardware updates. _commit() simply writes the new pte value back knowing that the hardware has not updated the pte in the meantime. The only current user of this interface is mprotect Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge at citrix.com> --- include/asm-generic/pgtable.h | 53 +++++++++++++++++++++++++++++++++++++++++ mm/mprotect.c | 10 +++---- 2 files changed, 57 insertions(+), 6 deletions(-) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -197,6 +197,59 @@ } #endif /* CONFIG_MMU */ +static inline pte_t __ptep_modify_prot_start(struct mm_struct *mm, + unsigned long addr, + pte_t *ptep) +{ + /* Get the current pte state, but zero it out to make it + non-present, preventing the hardware from asynchronously + updating it. */ + return ptep_get_and_clear(mm, addr, ptep); +} + +static inline void __ptep_modify_prot_commit(struct mm_struct *mm, + unsigned long addr, + pte_t *ptep, pte_t pte) +{ + /* The pte is non-present, so there's no hardware state to + preserve. */ + set_pte_at(mm, addr, ptep, pte); +} + +#ifndef __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION +/* + * Start a pte protection read-modify-write transaction, which + * protects against asynchronous hardware modifications to the pte. + * The intention is not to prevent the hardware from making pte + * updates, but to prevent any updates it may make from being lost. + * + * This does not protect against other software modifications of the + * pte; the appropriate pte lock must be held over the transation. + * + * Note that this interface is intended to be batchable, meaning that + * ptep_modify_prot_commit may not actually update the pte, but merely + * queue the update to be done at some later time. The update must be + * actually committed before the pte lock is released, however. + */ +static inline pte_t ptep_modify_prot_start(struct mm_struct *mm, + unsigned long addr, + pte_t *ptep) +{ + return __ptep_modify_prot_start(mm, addr, ptep); +} + +/* + * Commit an update to a pte, leaving any hardware-controlled bits in + * the PTE unmodified. + */ +static inline void ptep_modify_prot_commit(struct mm_struct *mm, + unsigned long addr, + pte_t *ptep, pte_t pte) +{ + __ptep_modify_prot_commit(mm, addr, ptep, pte); +} +#endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */ + /* * A facility to provide lazy MMU batching. This allows PTE updates and * page invalidations to be delayed until a call to leave lazy MMU mode diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -47,19 +47,17 @@ if (pte_present(oldpte)) { pte_t ptent; - /* Avoid an SMP race with hardware updated dirty/clean - * bits by wiping the pte and then setting the new pte - * into place. - */ - ptent = ptep_get_and_clear(mm, addr, pte); + ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); + /* * Avoid taking write faults for pages we know to be * dirty. */ if (dirty_accountable && pte_dirty(ptent)) ptent = pte_mkwrite(ptent); - set_pte_at(mm, addr, pte, ptent); + + ptep_modify_prot_commit(mm, addr, pte, ptent); #ifdef CONFIG_MIGRATION } else if (!pte_file(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte);
Jeremy Fitzhardinge
2008-May-31 00:04 UTC
[PATCH 2 of 4] paravirt: add hooks for ptep_modify_prot_start/commit
This patch adds paravirt-ops hooks in pv_mmu_ops for ptep_modify_prot_start and ptep_modify_prot_commit. This allows the hypervisor-specific backends to implement these in some more efficient way. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge at citrix.com> --- arch/x86/kernel/paravirt.c | 3 +++ arch/x86/xen/enlighten.c | 3 +++ include/asm-x86/paravirt.h | 28 ++++++++++++++++++++++++++++ 3 files changed, 34 insertions(+) diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -383,6 +383,9 @@ .pte_update = paravirt_nop, .pte_update_defer = paravirt_nop, + .ptep_modify_prot_start = __ptep_modify_prot_start, + .ptep_modify_prot_commit = __ptep_modify_prot_commit, + #ifdef CONFIG_HIGHPTE .kmap_atomic_pte = kmap_atomic, #endif diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -1103,6 +1103,9 @@ .set_pte_at = xen_set_pte_at, .set_pmd = xen_set_pmd, + .ptep_modify_prot_start = __ptep_modify_prot_start, + .ptep_modify_prot_commit = __ptep_modify_prot_commit, + .pte_val = xen_pte_val, .pte_flags = native_pte_val, .pgd_val = xen_pgd_val, diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h --- a/include/asm-x86/paravirt.h +++ b/include/asm-x86/paravirt.h @@ -238,6 +238,11 @@ pte_t *ptep); void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep); + + pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long addr, + pte_t *ptep); + void (*ptep_modify_prot_commit)(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte); pteval_t (*pte_val)(pte_t); pteval_t (*pte_flags)(pte_t); @@ -1040,6 +1045,29 @@ return ret; } +#define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION +static inline pte_t ptep_modify_prot_start(struct mm_struct *mm, unsigned long addr, + pte_t *ptep) +{ + pteval_t ret; + + ret = PVOP_CALL3(pteval_t, pv_mmu_ops.ptep_modify_prot_start, + mm, addr, ptep); + + return (pte_t) { .pte = ret }; +} + +static inline void ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) +{ + if (sizeof(pteval_t) > sizeof(long)) + /* 5 arg words */ + pv_mmu_ops.ptep_modify_prot_commit(mm, addr, ptep, pte); + else + PVOP_VCALL4(pv_mmu_ops.ptep_modify_prot_commit, + mm, addr, ptep, pte.pte); +} + static inline void set_pte(pte_t *ptep, pte_t pte) { if (sizeof(pteval_t) > sizeof(long))
Jeremy Fitzhardinge
2008-May-31 00:04 UTC
[PATCH 3 of 4] xen: implement ptep_modify_prot_start/commit
Xen has a pte update function which will update a pte while preserving its accessed and dirty bits. This means that ptep_modify_prot_start() can be implemented as a simple read of the pte value. The hardware may update the pte in the meantime, but ptep_modify_prot_commit() updates it while preserving any changes that may have happened in the meantime. The updates in ptep_modify_prot_commit() are batched if we're currently in lazy mmu mode. The mmu_update hypercall can take a batch of updates to perform, but this code doesn't make particular use of that feature, in favour of using generic multicall batching to get them all into the hypervisor. The net effect of this is that each mprotect pte update turns from two expensive trap-and-emulate faults into they hypervisor into a single hypercall whose cost is amortized in a batched multicall. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge at citrix.com> --- arch/x86/xen/enlighten.c | 13 ++++++++++--- arch/x86/xen/mmu.c | 21 +++++++++++++++++++++ arch/x86/xen/mmu.h | 4 ++++ include/xen/interface/features.h | 3 +++ include/xen/interface/xen.h | 9 +++++++-- 5 files changed, 45 insertions(+), 5 deletions(-) diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -140,7 +140,9 @@ { printk(KERN_INFO "Booting paravirtualized kernel on %s\n", pv_info.name); - printk(KERN_INFO "Hypervisor signature: %s\n", xen_start_info->magic); + printk(KERN_INFO "Hypervisor signature: %s%s\n", + xen_start_info->magic, + xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : ""); } static void xen_cpuid(unsigned int *ax, unsigned int *bx, @@ -1208,6 +1210,8 @@ BUG_ON(memcmp(xen_start_info->magic, "xen-3", 5) != 0); + xen_setup_features(); + /* Install Xen paravirt ops */ pv_info = xen_info; pv_init_ops = xen_init_ops; @@ -1217,13 +1221,16 @@ pv_apic_ops = xen_apic_ops; pv_mmu_ops = xen_mmu_ops; + if (xen_feature(XENFEAT_mmu_pt_update_preserve_ad)) { + pv_mmu_ops.ptep_modify_prot_start = xen_ptep_modify_prot_start; + pv_mmu_ops.ptep_modify_prot_commit = xen_ptep_modify_prot_commit; + } + machine_ops = xen_machine_ops; #ifdef CONFIG_SMP smp_ops = xen_smp_ops; #endif - - xen_setup_features(); /* Get mfn list */ if (!xen_feature(XENFEAT_auto_translated_physmap)) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -298,6 +298,27 @@ out: if (mm == &init_mm) preempt_enable(); +} + +pte_t xen_ptep_modify_prot_start(struct mm_struct *mm, unsigned long addr, pte_t *ptep) +{ + /* Just return the pte as-is. We preserve the bits on commit */ + return *ptep; +} + +void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) +{ + struct multicall_space mcs; + struct mmu_update *u; + + mcs = xen_mc_entry(sizeof(*u)); + u = mcs.args; + u->ptr = virt_to_machine(ptep).maddr | MMU_PT_UPDATE_PRESERVE_AD; + u->val = pte_val_ma(pte); + MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF); + + xen_mc_issue(PARAVIRT_LAZY_MMU); } pteval_t xen_pte_val(pte_t pte) diff --git a/arch/x86/xen/mmu.h b/arch/x86/xen/mmu.h --- a/arch/x86/xen/mmu.h +++ b/arch/x86/xen/mmu.h @@ -52,4 +52,8 @@ void xen_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep); void xen_pmd_clear(pmd_t *pmdp); +pte_t xen_ptep_modify_prot_start(struct mm_struct *mm, unsigned long addr, pte_t *ptep); +void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte); + #endif /* _XEN_MMU_H */ diff --git a/include/xen/interface/features.h b/include/xen/interface/features.h --- a/include/xen/interface/features.h +++ b/include/xen/interface/features.h @@ -38,6 +38,9 @@ */ #define XENFEAT_pae_pgdir_above_4gb 4 +/* x86: Does this Xen host support the MMU_PT_UPDATE_PRESERVE_AD hypercall? */ +#define XENFEAT_mmu_pt_update_preserve_ad 5 + #define XENFEAT_NR_SUBMAPS 1 #endif /* __XEN_PUBLIC_FEATURES_H__ */ diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h --- a/include/xen/interface/xen.h +++ b/include/xen/interface/xen.h @@ -113,9 +113,14 @@ * ptr[:2] -- Machine address within the frame whose mapping to modify. * The frame must belong to the FD, if one is specified. * val -- Value to write into the mapping entry. + * + * ptr[1:0] == MMU_PT_UPDATE_PRESERVE_AD: + * As MMU_NORMAL_PT_UPDATE above, but A/D bits currently in the PTE are ORed + * with those in @val. */ -#define MMU_NORMAL_PT_UPDATE 0 /* checked '*ptr = val'. ptr is MA. */ -#define MMU_MACHPHYS_UPDATE 1 /* ptr = MA of frame to modify entry for */ +#define MMU_NORMAL_PT_UPDATE 0 /* checked '*ptr = val'. ptr is MA. */ +#define MMU_MACHPHYS_UPDATE 1 /* ptr = MA of frame to modify entry for */ +#define MMU_PT_UPDATE_PRESERVE_AD 2 /* atomically: *ptr = val | (*ptr&(A|D)) */ /* * MMU EXTENDED OPERATIONS
Jeremy Fitzhardinge
2008-May-31 00:04 UTC
[Xen-devel] [PATCH 4 of 4] xen: add mechanism to extend existing multicalls
Some Xen hypercalls accept an array of operations to work on. In general this is because its more efficient for the hypercall to the work all at once rather than as separate hypercalls (even batched as a multicall). This patch adds a mechanism (xen_mc_extend_args()) to allocate more argument space to the last-issued multicall, in order to extend its argument list. The user of this mechanism is xen/mmu.c, which uses it to extend the args array of mmu_update. This is particularly valuable when doing the update for a large mprotect, which goes via ptep_modify_prot_commit(), but it also manages to batch updates to pgd/pmds as well. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> --- arch/x86/xen/mmu.c | 56 ++++++++++++++++++++++++++++----------------- arch/x86/xen/multicalls.c | 40 +++++++++++++++++++++++++++----- arch/x86/xen/multicalls.h | 12 +++++++++ 3 files changed, 81 insertions(+), 27 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -218,19 +218,35 @@ BUG(); } - -void xen_set_pmd(pmd_t *ptr, pmd_t val) +static void extend_mmu_update(const struct mmu_update *update) { struct multicall_space mcs; struct mmu_update *u; + mcs = xen_mc_extend_args(__HYPERVISOR_mmu_update, sizeof(*u)); + + if (mcs.mc != NULL) + mcs.mc->args[1]++; + else { + mcs = __xen_mc_entry(sizeof(*u)); + MULTI_mmu_update(mcs.mc, mcs.args, 1, NULL, DOMID_SELF); + } + + u = mcs.args; + *u = *update; +} + +void xen_set_pmd(pmd_t *ptr, pmd_t val) +{ + struct mmu_update u; + preempt_disable(); - mcs = xen_mc_entry(sizeof(*u)); - u = mcs.args; - u->ptr = virt_to_machine(ptr).maddr; - u->val = pmd_val_ma(val); - MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF); + xen_mc_batch(); + + u.ptr = virt_to_machine(ptr).maddr; + u.val = pmd_val_ma(val); + extend_mmu_update(&u); xen_mc_issue(PARAVIRT_LAZY_MMU); @@ -309,14 +325,13 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte) { - struct multicall_space mcs; - struct mmu_update *u; + struct mmu_update u; - mcs = xen_mc_entry(sizeof(*u)); - u = mcs.args; - u->ptr = virt_to_machine(ptep).maddr | MMU_PT_UPDATE_PRESERVE_AD; - u->val = pte_val_ma(pte); - MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF); + xen_mc_batch(); + + u.ptr = virt_to_machine(ptep).maddr | MMU_PT_UPDATE_PRESERVE_AD; + u.val = pte_val_ma(pte); + extend_mmu_update(&u); xen_mc_issue(PARAVIRT_LAZY_MMU); } @@ -367,16 +382,15 @@ void xen_set_pud(pud_t *ptr, pud_t val) { - struct multicall_space mcs; - struct mmu_update *u; + struct mmu_update u; preempt_disable(); - mcs = xen_mc_entry(sizeof(*u)); - u = mcs.args; - u->ptr = virt_to_machine(ptr).maddr; - u->val = pud_val_ma(val); - MULTI_mmu_update(mcs.mc, u, 1, NULL, DOMID_SELF); + xen_mc_batch(); + + u.ptr = virt_to_machine(ptr).maddr; + u.val = pud_val_ma(val); + extend_mmu_update(&u); xen_mc_issue(PARAVIRT_LAZY_MMU); diff --git a/arch/x86/xen/multicalls.c b/arch/x86/xen/multicalls.c --- a/arch/x86/xen/multicalls.c +++ b/arch/x86/xen/multicalls.c @@ -29,14 +29,14 @@ #define MC_DEBUG 1 #define MC_BATCH 32 -#define MC_ARGS (MC_BATCH * 16 / sizeof(u64)) +#define MC_ARGS (MC_BATCH * 16) struct mc_buffer { struct multicall_entry entries[MC_BATCH]; #if MC_DEBUG struct multicall_entry debug[MC_BATCH]; #endif - u64 args[MC_ARGS]; + unsigned char args[MC_ARGS]; struct callback { void (*fn)(void *); void *data; @@ -107,20 +107,48 @@ { struct mc_buffer *b = &__get_cpu_var(mc_buffer); struct multicall_space ret; - unsigned argspace = (args + sizeof(u64) - 1) / sizeof(u64); + unsigned argidx = roundup(b->argidx, sizeof(u64)); BUG_ON(preemptible()); - BUG_ON(argspace > MC_ARGS); + BUG_ON(b->argidx > MC_ARGS); if (b->mcidx == MC_BATCH || - (b->argidx + argspace) > MC_ARGS) + (argidx + args) > MC_ARGS) { xen_mc_flush(); + argidx = roundup(b->argidx, sizeof(u64)); + } ret.mc = &b->entries[b->mcidx]; b->mcidx++; + ret.args = &b->args[argidx]; + b->argidx = argidx + args; + + BUG_ON(b->argidx > MC_ARGS); + return ret; +} + +struct multicall_space xen_mc_extend_args(unsigned long op, size_t size) +{ + struct mc_buffer *b = &__get_cpu_var(mc_buffer); + struct multicall_space ret = { NULL, NULL }; + + BUG_ON(preemptible()); + BUG_ON(b->argidx > MC_ARGS); + + if (b->mcidx == 0) + return ret; + + if (b->entries[b->mcidx - 1].op != op) + return ret; + + if ((b->argidx + size) > MC_ARGS) + return ret; + + ret.mc = &b->entries[b->mcidx - 1]; ret.args = &b->args[b->argidx]; - b->argidx += argspace; + b->argidx += size; + BUG_ON(b->argidx > MC_ARGS); return ret; } diff --git a/arch/x86/xen/multicalls.h b/arch/x86/xen/multicalls.h --- a/arch/x86/xen/multicalls.h +++ b/arch/x86/xen/multicalls.h @@ -45,4 +45,16 @@ /* Set up a callback to be called when the current batch is flushed */ void xen_mc_callback(void (*fn)(void *), void *data); +/* + * Try to extend the arguments of the previous multicall command. The + * previous command''s op must match. If it does, then it attempts to + * extend the argument space allocated to the multicall entry by + * arg_size bytes. + * + * The returned multicall_space will return with mc pointing to the + * command on success, or NULL on failure, and args pointing to the + * newly allocated space. + */ +struct multicall_space xen_mc_extend_args(unsigned long op, size_t arg_size); + #endif /* _XEN_MULTICALLS_H */ _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Ingo Molnar
2008-Jun-02 11:12 UTC
[PATCH 2 of 4] paravirt: add hooks for ptep_modify_prot_start/commit
* Jeremy Fitzhardinge <jeremy at goop.org> wrote:> +#define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTIONplease turn this into CONFIG_ARCH_HAS_PTEP_MODIFY_PROT_TRANSACTION instead. Ingo
Ingo Molnar
2008-Jun-02 11:13 UTC
[PATCH 1 of 4] mm: add a ptep_modify_prot transaction abstraction
* Jeremy Fitzhardinge <jeremy at goop.org> wrote:> + /* Get the current pte state, but zero it out to make it > + non-present, preventing the hardware from asynchronously > + updating it. */please use standard and consistent comment style, similar to:> +/* > + * Start a pte protection read-modify-write transaction, which > + * protects against asynchronous hardware modifications to the pte. > + * The intention is not to prevent the hardware from making pte > + * updates, but to prevent any updates it may make from being lost.Ingo
Jeremy Fitzhardinge
2008-Jun-02 11:57 UTC
[PATCH 1 of 4] mm: add a ptep_modify_prot transaction abstraction
Ingo Molnar wrote:> * Jeremy Fitzhardinge <jeremy at goop.org> wrote: > > >> + /* Get the current pte state, but zero it out to make it >> + non-present, preventing the hardware from asynchronously >> + updating it. */ >> > > please use standard and consistent comment style, similar to: >OK. J
Jeremy Fitzhardinge
2008-Jun-16 11:29 UTC
[PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction (take 2)
Hi all, [ Change since last post: change name to ptep_modify_prot_, on the grounds that it isn't really a general pte-modification interface. ] This little series adds a new transaction-like abstraction for doing RMW updates to a pte, hooks it into paravirt_ops, and then makes use of it in Xen. The basic problem is that mprotect is very slow under Xen (up to 50x slower than native), primarily because of the ptent = ptep_get_and_clear(mm, addr, pte); ptent = pte_modify(ptent, newprot); /* ... */ set_pte_at(mm, addr, pte, ptent); sequence in mm/mprotect.c:change_pte_range(). This is bad for Xen for two reasons: 1: ptep_get_and_clear() ends up being a xchg on the pte. Since the pte page is read-only (as it must be, because Xen needs to control all pte updates), this traps into Xen, which then emulates the instruction. Trapping into the instruction emulator is inherently expensive. And, 2: because ptep_get_and_clear has atomic-fetch-and-update semantics, it's impossible to implement in a way which can be batched to amortize the cost of trapping into the hypervisor. This series adds the ptep_modify_prot_start() and ptep_modify_prot_commit() operations, which change this sequence to: ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); /* ... */ ptep_modify_prot_commit(mm, addr, pte, ptent); Which looks very familiar. And, indeed, when compiled without CONFIG_PARAVIRT (or on a non-x86 architecture), it will end up doing precisely the same thing as before. However, the effective semantics are a bit different. ptep_modify_prot_start() means "I'm reading this pte with the intention of updating it; please don't lose any hardware pte changes in the meantime". And ptep_modify_prot_commit() means "Here's a new value for the pte, but make sure you don't lose any hardware changes". The default implementation achieves these semantics by making ptep_modify_prot_start() set the pte to non-present, which prevents any async hardware changes to the pte. The ptep_modify_prot_commit() can then just write the new value into place without having to worry about preserving any changes, because it knows there are none. Xen implements ptep_modify_prot_start() as a simple read of the pte. This leaves the pte unchanged in memory, and the hardware may make asynchronous changes to it. It implements ptep_modify_prot_commit() using a batched hypercall which preserves the state of the Access/Dirty bits when updating the pte. This allows the whole change_pte_range() loop to be run without any synchronous unbatched traps into the hypervisor. With this change in place, an mprotect microbenchmark goes from being 50x worse than native to around 7x, which is acceptible. I believe that other virtualization systems, whether they use direct paging like Xen, or a shadow pagetable scheme (vmi, kvm, lguest), can make use of this interface to improve the performance. Unfortunately (or fortunately) there aren't very many other areas of the kernel which can really take advantage of this. There's only a couple of other instances of ptep_get_and_clear() in mm/, and they're being used in a similar way; but I don't think they're very performance critical (though zap_pte_range might be interesting). In general, mprotect is rarely a performance bottleneck. But some debugging libraries (such as electric fence) and garbage collectors can be very heavy users of mprotect, and this change could materially benefit them. Thanks, J
Jeremy Fitzhardinge
2008-Jun-16 11:30 UTC
[PATCH 2 of 4] paravirt: add hooks for ptep_modify_prot_start/commit
This patch adds paravirt-ops hooks in pv_mmu_ops for ptep_modify_prot_start and ptep_modify_prot_commit. This allows the hypervisor-specific backends to implement these in some more efficient way. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge at citrix.com> --- arch/x86/kernel/paravirt.c | 3 +++ arch/x86/xen/enlighten.c | 3 +++ include/asm-x86/paravirt.h | 28 ++++++++++++++++++++++++++++ 3 files changed, 34 insertions(+) diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -383,6 +383,9 @@ .pte_update = paravirt_nop, .pte_update_defer = paravirt_nop, + .ptep_modify_prot_start = __ptep_modify_prot_start, + .ptep_modify_prot_commit = __ptep_modify_prot_commit, + #ifdef CONFIG_HIGHPTE .kmap_atomic_pte = kmap_atomic, #endif diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -1139,6 +1139,9 @@ .set_pte_at = xen_set_pte_at, .set_pmd = xen_set_pmd_hyper, + .ptep_modify_prot_start = __ptep_modify_prot_start, + .ptep_modify_prot_commit = __ptep_modify_prot_commit, + .pte_val = xen_pte_val, .pte_flags = native_pte_val, .pgd_val = xen_pgd_val, diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h --- a/include/asm-x86/paravirt.h +++ b/include/asm-x86/paravirt.h @@ -238,6 +238,11 @@ pte_t *ptep); void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, pte_t *ptep); + + pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long addr, + pte_t *ptep); + void (*ptep_modify_prot_commit)(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte); pteval_t (*pte_val)(pte_t); pteval_t (*pte_flags)(pte_t); @@ -1040,6 +1045,29 @@ return ret; } +#define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION +static inline pte_t ptep_modify_prot_start(struct mm_struct *mm, unsigned long addr, + pte_t *ptep) +{ + pteval_t ret; + + ret = PVOP_CALL3(pteval_t, pv_mmu_ops.ptep_modify_prot_start, + mm, addr, ptep); + + return (pte_t) { .pte = ret }; +} + +static inline void ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) +{ + if (sizeof(pteval_t) > sizeof(long)) + /* 5 arg words */ + pv_mmu_ops.ptep_modify_prot_commit(mm, addr, ptep, pte); + else + PVOP_VCALL4(pv_mmu_ops.ptep_modify_prot_commit, + mm, addr, ptep, pte.pte); +} + static inline void set_pte(pte_t *ptep, pte_t pte) { if (sizeof(pteval_t) > sizeof(long))
Possibly Parallel Threads
- [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction (take 2)
- [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction (take 2)
- [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
- [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction
- [PATCH 0 of 4] mm+paravirt+xen: add pte read-modify-write abstraction