David Vrabel
2012-Jun-13 10:20 UTC
[PATCH 0/2] x86/mm: remove arch-specific PTE/PMD get-and-clear functions
This series removes the x86-specific implementation of
ptep_get_and_clear() and pmdp_get_and_clear().

The principal reason for this is that it allows Xen paravirtualized
guests to batch the PTE clears, which is a significant performance
optimization for munmap() and mremap() -- the number of entries into
the hypervisor is reduced by a factor of about 30 (60 in 32-bit
guests) for munmap().

There may be minimal gains on native and KVM guests due to the removal
of the locked xchg.

Removal of arch-specific functions where generic ones are suitable
seems to be a generally useful thing to me.

The full reasoning for why this is safe is included in the commit
message of patch 1, but to summarize: the atomic get-and-clear does
not guarantee that the latest dirty/accessed bits are returned, as
there is still a window after the get-and-clear and before the TLB
flush in which the bits may be updated by other processors.  So, user
space applications accessing pages that are being unmapped or remapped
already have unpredictable behaviour.

David
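P.S. To illustrate where the batching opportunity comes from: the
unmap loops in mm/memory.c already run inside a lazy MMU section, so
once the clear is an ordinary pte_clear() a paravirt backend can queue
it instead of trapping on it.  The sketch below is simplified and
teardown_range() is a made-up name for illustration; it is not the
actual zap_pte_range() code, just its shape.

	/* Hypothetical sketch of the unmap inner loop, simplified
	 * from the shape of mm/memory.c:zap_pte_range(). */
	static void teardown_range(struct mm_struct *mm, pte_t *pte,
				   unsigned long addr, unsigned long end)
	{
		arch_enter_lazy_mmu_mode();
		do {
			/* with this series: a read plus a batchable pte_clear() */
			pte_t ptent = ptep_get_and_clear(mm, addr, pte);

			if (pte_dirty(ptent))
				set_page_dirty(pte_page(ptent));
		} while (pte++, addr += PAGE_SIZE, addr != end);
		arch_leave_lazy_mmu_mode();	/* Xen: flush the queued updates */
	}

Under Xen, the updates queued between enter/leave are flushed as
(roughly) one hypercall per 31 PTEs, which is where the factor of ~30
above comes from.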
David Vrabel
2012-Jun-13 10:20 UTC
[PATCH 1/2] x86/mm: remove arch-specific ptep_get_and_clear() function
From: David Vrabel <david.vrabel@citrix.com>

x86 provides a ptep_get_and_clear() function instead of using the
generic implementation so it can atomically get and clear the PTE.

It is not clear why x86 has this requirement.  PTE updates are done
while the appropriate PTE lock is held, so presumably this is an
attempt to ensure that the accessed and dirty bits of the PTE are
saved even if they have been updated by the hardware due to a
concurrent access on another CPU.

However, the atomic exchange is not sufficient for saving the dirty
bit: if there is a TLB entry for this page on another CPU, the dirty
bit may be written by that processor after the PTE is cleared but
before the TLB entry is flushed.  It is also not clear from the Intel
SDM[1] whether the processor's read of the PTE and the write to set
the accessed bit are atomic or not.

With this patch the get/clear becomes a read of the PTE and a call to
pte_clear(), which allows the clears to be batched by some
paravirtualized guests (such as Xen guests).  This can lead to
significant performance improvements in munmap().  E.g., for Xen, a
trap-and-emulate[2] for every page becomes a hypercall for every 31
pages.

There should be no change in performance on native or for KVM guests.
There may be a performance regression with lguest guests, as an
optimization for avoiding calling pte_update() when doing a full
teardown of an mm is removed.

As a consequence, there is a slight increase in the window during
which a set (by the processor) of the accessed or dirty bit may be
lost.  This shouldn't change the behaviour of user space in any
meaningful way.  Any user space application accessing an mmap()'d
region that is being munmap()'d or mremap()'d is already going to have
unpredictable behaviour -- the access may succeed, it may fault, or
the write to the mapped file may be lost.

[1] Intel Software Developer's Manual Vol 3A section 4.8 says nothing
    on this.

[2] For 32-bit guests, two traps are required for every page to update
    both halves of the PTE.
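For reference, the generic implementation that x86 falls back to has
roughly the following shape (paraphrased from include/asm-generic/pgtable.h
of this era; treat it as a sketch rather than an exact quote):

	#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
	static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
					       unsigned long address,
					       pte_t *ptep)
	{
		pte_t pte = *ptep;		/* non-atomic read of the old PTE */
		pte_clear(mm, address, ptep);	/* batchable on paravirt */
		return pte;
	}
	#endif

The read and the clear are no longer one atomic operation, but
pte_clear() goes through the paravirt pte ops, so a hypervisor backend
is free to defer and batch it.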
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 arch/x86/include/asm/pgtable-2level.h |    9 --------
 arch/x86/include/asm/pgtable-3level.h |   16 --------------
 arch/x86/include/asm/pgtable.h        |   37 ---------------------------------
 arch/x86/include/asm/pgtable_64.h     |   13 -----------
 4 files changed, 0 insertions(+), 75 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 98391db..be7c20e 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -38,15 +38,6 @@ static inline void native_pte_clear(struct mm_struct *mm,
 }
 
 #ifdef CONFIG_SMP
-static inline pte_t native_ptep_get_and_clear(pte_t *xp)
-{
-	return __pte(xchg(&xp->pte_low, 0));
-}
-#else
-#define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
-#endif
-
-#ifdef CONFIG_SMP
 static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
 {
 	return __pmd(xchg((pmdval_t *)xp, 0));
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 43876f1..42101e9 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -134,22 +134,6 @@ static inline void pud_clear(pud_t *pudp)
 }
 
 #ifdef CONFIG_SMP
-static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
-{
-	pte_t res;
-
-	/* xchg acts as a barrier before the setting of the high bits */
-	res.pte_low = xchg(&ptep->pte_low, 0);
-	res.pte_high = ptep->pte_high;
-	ptep->pte_high = 0;
-
-	return res;
-}
-#else
-#define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
-#endif
-
-#ifdef CONFIG_SMP
 union split_pmd {
 	struct {
 		u32 pmd_low;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 49afb3f..4413bed 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -598,16 +598,6 @@ static inline int pgd_none(pgd_t pgd)
 
 extern int direct_gbpages;
 
-/* local pte updates need not use xchg for locking */
-static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
-{
-	pte_t res = *ptep;
-
-	/* Pure native function needs no input for mm, addr */
-	native_pte_clear(NULL, 0, ptep);
-	return res;
-}
-
 static inline pmd_t native_local_pmdp_get_and_clear(pmd_t *pmdp)
 {
 	pmd_t res = *pmdp;
@@ -668,33 +658,6 @@ extern int ptep_test_and_clear_young(struct vm_area_struct *vma,
 extern int ptep_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pte_t *ptep);
 
-#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
-static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
-				       pte_t *ptep)
-{
-	pte_t pte = native_ptep_get_and_clear(ptep);
-	pte_update(mm, addr, ptep);
-	return pte;
-}
-
-#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
-static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
-					    unsigned long addr, pte_t *ptep,
-					    int full)
-{
-	pte_t pte;
-	if (full) {
-		/*
-		 * Full address destruction in progress; paravirt does not
-		 * care about updates and native needs no locking
-		 */
-		pte = native_local_ptep_get_and_clear(ptep);
-	} else {
-		pte = ptep_get_and_clear(mm, addr, ptep);
-	}
-	return pte;
-}
-
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 975f709..cc12d27 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -69,19 +69,6 @@ static inline void native_pmd_clear(pmd_t *pmd)
 	native_set_pmd(pmd, native_make_pmd(0));
 }
 
-static inline pte_t native_ptep_get_and_clear(pte_t *xp)
-{
-#ifdef CONFIG_SMP
-	return native_make_pte(xchg(&xp->pte, 0));
-#else
-	/* native_local_ptep_get_and_clear,
-	   but duplicated because of cyclic dependency */
-	pte_t ret = *xp;
-	native_pte_clear(NULL, 0, xp);
-	return ret;
-#endif
-}
-
 static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
 {
 #ifdef CONFIG_SMP
-- 
1.7.2.5
David Vrabel
2012-Jun-13 10:20 UTC
[PATCH 2/2] x86/mm: remove arch-specific pmdp_get_and_clear() function
From: David Vrabel <david.vrabel@citrix.com>

Similar to the removal of the x86-specific ptep_get_and_clear(), do
the same for pmdp_get_and_clear().  The reasoning for this is the
same.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 arch/x86/include/asm/pgtable-2level.h |    9 ---------
 arch/x86/include/asm/pgtable-3level.h |   23 -----------------------
 arch/x86/include/asm/pgtable.h        |   17 -----------------
 arch/x86/include/asm/pgtable_64.h     |   13 -------------
 4 files changed, 0 insertions(+), 62 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index be7c20e..9b08339 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -37,15 +37,6 @@ static inline void native_pte_clear(struct mm_struct *mm,
 	*xp = native_make_pte(0);
 }
 
-#ifdef CONFIG_SMP
-static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
-{
-	return __pmd(xchg((pmdval_t *)xp, 0));
-}
-#else
-#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
-#endif
-
 /*
  * Bits _PAGE_BIT_PRESENT, _PAGE_BIT_FILE and _PAGE_BIT_PROTNONE are taken,
  * split up the 29 bits of offset into this range:
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 42101e9..f081699 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -133,29 +133,6 @@ static inline void pud_clear(pud_t *pudp)
 	 */
 }
 
-#ifdef CONFIG_SMP
-union split_pmd {
-	struct {
-		u32 pmd_low;
-		u32 pmd_high;
-	};
-	pmd_t pmd;
-};
-static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
-{
-	union split_pmd res, *orig = (union split_pmd *)pmdp;
-
-	/* xchg acts as a barrier before setting of the high bits */
-	res.pmd_low = xchg(&orig->pmd_low, 0);
-	res.pmd_high = orig->pmd_high;
-	orig->pmd_high = 0;
-
-	return res.pmd;
-}
-#else
-#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
-#endif
-
 /*
  * Bits 0, 6 and 7 are taken in the low part of the pte,
  * put the 32 bits of offset into the high part.
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 4413bed..34f576c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -598,14 +598,6 @@ static inline int pgd_none(pgd_t pgd)
 
 extern int direct_gbpages;
 
-static inline pmd_t native_local_pmdp_get_and_clear(pmd_t *pmdp)
-{
-	pmd_t res = *pmdp;
-
-	native_pmd_clear(pmdp);
-	return res;
-}
-
 static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
 				     pte_t *ptep , pte_t pte)
 {
@@ -694,15 +686,6 @@ static inline int pmd_write(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_RW;
 }
 
-#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
-static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmdp)
-{
-	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
-	pmd_update(mm, addr, pmdp);
-	return pmd;
-}
-
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index cc12d27..92eb013 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -69,19 +69,6 @@ static inline void native_pmd_clear(pmd_t *pmd)
 	native_set_pmd(pmd, native_make_pmd(0));
 }
 
-static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
-{
-#ifdef CONFIG_SMP
-	return native_make_pmd(xchg(&xp->pmd, 0));
-#else
-	/* native_local_pmdp_get_and_clear,
-	   but duplicated because of cyclic dependency */
-	pmd_t ret = *xp;
-	native_pmd_clear(xp);
-	return ret;
-#endif
-}
-
 static inline void native_set_pud(pud_t *pudp, pud_t pud)
 {
 	*pudp = pud;
-- 
1.7.2.5
Konrad Rzeszutek Wilk
2012-Jun-13 14:04 UTC
Re: [PATCH 0/2] x86/mm: remove arch-specific PTE/PMD get-and-clear functions
On Wed, Jun 13, 2012 at 11:20:43AM +0100, David Vrabel wrote:
> This series removes the x86-specific implementation of
> ptep_get_and_clear() and pmdp_get_and_clear().
>
> The principal reason for this is that it allows Xen paravirtualized
> guests to batch the PTE clears, which is a significant performance
> optimization for munmap() and mremap() -- the number of entries into
> the hypervisor is reduced by a factor of about 30 (60 in 32-bit
> guests) for munmap().
>
> There may be minimal gains on native and KVM guests due to the removal
> of the locked xchg.

What about lguest?

> Removal of arch-specific functions where generic ones are suitable
> seems to be a generally useful thing to me.
>
> The full reasoning for why this is safe is included in the commit
> message of patch 1, but to summarize: the atomic get-and-clear does
> not guarantee that the latest dirty/accessed bits are returned, as
> there is still a window after the get-and-clear and before the TLB
> flush in which the bits may be updated by other processors.  So, user
> space applications accessing pages that are being unmapped or remapped
> already have unpredictable behaviour.
>
> David
David Vrabel
2012-Jun-13 15:00 UTC
Re: [PATCH 0/2] x86/mm: remove arch-specific PTE/PMD get-and-clear functions
On 13/06/12 15:04, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 13, 2012 at 11:20:43AM +0100, David Vrabel wrote:
>> This series removes the x86-specific implementation of
>> ptep_get_and_clear() and pmdp_get_and_clear().
>>
>> The principal reason for this is that it allows Xen paravirtualized
>> guests to batch the PTE clears, which is a significant performance
>> optimization for munmap() and mremap() -- the number of entries into
>> the hypervisor is reduced by a factor of about 30 (60 in 32-bit
>> guests) for munmap().
>>
>> There may be minimal gains on native and KVM guests due to the removal
>> of the locked xchg.
>
> What about lguest?

As I note in the description of patch 1:

"There may be a performance regression with lguest guests, as an
optimization for avoiding calling pte_update() when doing a full
teardown of an mm is removed."

I don't know how much this performance regression would be or if the
performance of lguest guests is something people care about.

We could have an x86-specific ptep_get_and_clear_full() which looks
like:

	pte_t ptep_get_and_clear_full(
		struct mm_struct *mm, unsigned long addr, pte_t *ptep,
		int full)
	{
		pte_t pte = *ptep;

		pte_clear(mm, addr, ptep);
		if (!full)
			pte_update(mm, addr, ptep);

		return pte;
	}

This would have all the performance benefits of the proposed patch
without the performance regression with lguest.

David

>> Removal of arch-specific functions where generic ones are suitable
>> seems to be a generally useful thing to me.
>>
>> The full reasoning for why this is safe is included in the commit
>> message of patch 1, but to summarize: the atomic get-and-clear does
>> not guarantee that the latest dirty/accessed bits are returned, as
>> there is still a window after the get-and-clear and before the TLB
>> flush in which the bits may be updated by other processors.  So, user
>> space applications accessing pages that are being unmapped or remapped
>> already have unpredictable behaviour.
>>
>> David
Konrad Rzeszutek Wilk
2012-Jun-14 18:29 UTC
Re: [PATCH 0/2] x86/mm: remove arch-specific PTE/PMD get-and-clear functions
On Wed, Jun 13, 2012 at 04:00:23PM +0100, David Vrabel wrote:
> On 13/06/12 15:04, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jun 13, 2012 at 11:20:43AM +0100, David Vrabel wrote:
> >> This series removes the x86-specific implementation of
> >> ptep_get_and_clear() and pmdp_get_and_clear().
> >>
> >> The principal reason for this is that it allows Xen paravirtualized
> >> guests to batch the PTE clears, which is a significant performance
> >> optimization for munmap() and mremap() -- the number of entries into
> >> the hypervisor is reduced by a factor of about 30 (60 in 32-bit
> >> guests) for munmap().
> >>
> >> There may be minimal gains on native and KVM guests due to the removal
> >> of the locked xchg.
> >
> > What about lguest?
>
> As I note in the description of patch 1:
>
> "There may be a performance regression with lguest guests, as an
> optimization for avoiding calling pte_update() when doing a full
> teardown of an mm is removed."
>
> I don't know how much this performance regression would be or if the
> performance of lguest guests is something people care about.
>
> We could have an x86-specific ptep_get_and_clear_full() which looks
> like:
>
>	pte_t ptep_get_and_clear_full(
>		struct mm_struct *mm, unsigned long addr, pte_t *ptep,
>		int full)
>	{
>		pte_t pte = *ptep;
>
>		pte_clear(mm, addr, ptep);
>		if (!full)
>			pte_update(mm, addr, ptep);
>
>		return pte;
>	}
>
> This would have all the performance benefits of the proposed patch
> without the performance regression with lguest.

Let's rope Rusty into this since he is the maintainer of lguest.

> David
>
> >> Removal of arch-specific functions where generic ones are suitable
> >> seems to be a generally useful thing to me.
> >>
> >> The full reasoning for why this is safe is included in the commit
> >> message of patch 1, but to summarize: the atomic get-and-clear does
> >> not guarantee that the latest dirty/accessed bits are returned, as
> >> there is still a window after the get-and-clear and before the TLB
> >> flush in which the bits may be updated by other processors.  So, user
> >> space applications accessing pages that are being unmapped or remapped
> >> already have unpredictable behaviour.
> >>
> >> David
H. Peter Anvin
2012-Jun-14 18:41 UTC
Re: [PATCH 0/2] x86/mm: remove arch-specific PTE/PMD get-and-clear functions
On 06/14/2012 11:29 AM, Konrad Rzeszutek Wilk wrote:
>
> Let's rope Rusty into this since he is the maintainer of lguest.
>

Also, let's have realistic expectations for lguest.

I don't want to break lguest just for sh*ts and giggles, but if lguest
is in the way of making forward progress for other things, it would have
to have a very strong case to not be sacrificed.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
David Vrabel
2012-Jun-15 09:41 UTC
Re: [PATCH 1/2] x86/mm: remove arch-specific ptep_get_and_clear() function
On 13/06/12 11:20, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> x86 provides a ptep_get_and_clear() function instead of using the
> generic implementation so it can atomically get and clear the PTE.
>
> It is not clear why x86 has this requirement.  PTE updates are done
> while the appropriate PTE lock is held, so presumably this is an
> attempt to ensure that the accessed and dirty bits of the PTE are
> saved even if they have been updated by the hardware due to a
> concurrent access on another CPU.
>
> However, the atomic exchange is not sufficient for saving the dirty
> bit: if there is a TLB entry for this page on another CPU, the dirty
> bit may be written by that processor after the PTE is cleared but
> before the TLB entry is flushed.  It is also not clear from the Intel
> SDM[1] whether the processor's read of the PTE and the write to set
> the accessed bit are atomic or not.

This reasoning is probably not correct.  When a dirty bit must be
updated in a PTE, the processor does a pagetable walk (possibly using
any cached page table structures).  The AMD APM section 5.4.2 states:

"The processor never sets the Accessed bit or the Dirty bit for a not
present page (P = 0)."

and

"If PTE[D] is cleared to 0, software can rely on the fact that the page
has not been written."

Thus this patch would /introduce/ a race where a dirty bit set would be
lost (rather than extending the window where this would happen).

However (and this is a weaker argument), no sensible userspace
application should be accessing pages that are being unmapped or
remapped (since it is unpredictable whether they will fault), so perhaps
this additional unpredictable behaviour is acceptable?

> With this patch the get/clear becomes a read of the PTE and a call to
> pte_clear(), which allows the clears to be batched by some
> paravirtualized guests (such as Xen guests).  This can lead to
> significant performance improvements in munmap().  E.g., for Xen, a
> trap-and-emulate[2] for every page becomes a hypercall for every 31
> pages.
>
> There should be no change in performance on native or for KVM guests.
> There may be a performance regression with lguest guests, as an
> optimization for avoiding calling pte_update() when doing a full
> teardown of an mm is removed.

Having looked further into how lguest works, the performance penalty is
likely to not be as bad as I first thought.  pte_clear() does call
pte_update(), but these are batched and the batch size seems to be up
to 64 calls, so the performance penalty should be pretty minimal.

> [1] Intel Software Developer's Manual Vol 3A section 4.8 says nothing
>     on this.
>
> [2] For 32-bit guests, two traps are required for every page to update
>     both halves of the PTE.
Keir Fraser
2012-Jun-15 10:49 UTC
Re: [Xen-devel] [PATCH 1/2] x86/mm: remove arch-specific ptep_get_and_clear() function
On 15/06/2012 10:41, "David Vrabel" <david.vrabel@citrix.com> wrote:

> This reasoning is probably not correct.  When a dirty bit must be
> updated in a PTE, the processor does a pagetable walk (possibly using
> any cached page table structures).  The AMD APM section 5.4.2 states:
>
> "The processor never sets the Accessed bit or the Dirty bit for a not
> present page (P = 0)."
>
> and
>
> "If PTE[D] is cleared to 0, software can rely on the fact that the page
> has not been written."

Writing of dirty and accessed bits is done as part of the page-table
walk on TLB fill.  A/D bits never have writeback caching semantics.  It
wouldn't be safe: e.g., on unmap, TLB flushes happen after ptes have
been cleared (to avoid TLB-fill races), but that would mean that A/D
updates could be lost even on non-explicit unmaps (e.g., page out),
which is obviously bad.

> Thus this patch would /introduce/ a race where a dirty bit set would be
> lost (rather than extending the window where this would happen).
>
> However (and this is a weaker argument), no sensible userspace
> application should be accessing pages that are being unmapped or
> remapped (since it is unpredictable whether they will fault), so perhaps
> this additional unpredictable behaviour is acceptable?

If there's a big win to be had through batching, we're better off
devising a hypercall method for capturing the atomic rmw operation as it
stands, rather than subtly messing with semantics.

 -- Keir
Rusty Russell
2012-Jun-18 09:13 UTC
Re: [PATCH 0/2] x86/mm: remove arch-specific PTE/PMD get-and-clear functions
On Thu, 14 Jun 2012 11:41:16 -0700, "H. Peter Anvin" <hpa@zytor.com> wrote:
> On 06/14/2012 11:29 AM, Konrad Rzeszutek Wilk wrote:
>>
>> Let's rope Rusty into this since he is the maintainer of lguest.
>>
>
> Also, let's have realistic expectations for lguest.
>
> I don't want to break lguest just for sh*ts and giggles, but if lguest
> is in the way of making forward progress for other things, it would have
> to have a very strong case to not be sacrificed.

Exactly!  Please cc me if you need me to unbreak lguest, but for
performance?  Not really...

Cheers,
Rusty.