Boris Ostrovsky
2013-Feb-26 22:56 UTC
[PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
When CONFIG_DEBUG_PAGEALLOC is set page table updates made by kernel_map_pages() are not made visible (via TLB flush) immediately if lazy MMU is on. In environments that support lazy MMU (e.g. Xen) this may lead to fatal page faults, for example, when zap_pte_range() needs to allocate pages in __tlb_remove_page() -> tlb_next_batch(). Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> --- arch/x86/mm/pageattr.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index ca1f1c2..7b3216e 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1369,6 +1369,8 @@ void kernel_map_pages(struct page *page, int numpages, int enable) * but that can deadlock->flush only current cpu: */ __flush_tlb_all(); + + arch_flush_lazy_mmu_mode(); } #ifdef CONFIG_HIBERNATION -- 1.8.1.2
H. Peter Anvin
2013-Feb-26 23:38 UTC
Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On 02/26/2013 02:56 PM, Boris Ostrovsky wrote:> When CONFIG_DEBUG_PAGEALLOC is set page table updates made by > kernel_map_pages() are not made visible (via TLB flush) immediately if lazy > MMU is on. In environments that support lazy MMU (e.g. Xen) this may lead to > fatal page faults, for example, when zap_pte_range() needs to allocate pages > in __tlb_remove_page() -> tlb_next_batch(). > > Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> > --- > arch/x86/mm/pageattr.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c > index ca1f1c2..7b3216e 100644 > --- a/arch/x86/mm/pageattr.c > +++ b/arch/x86/mm/pageattr.c > @@ -1369,6 +1369,8 @@ void kernel_map_pages(struct page *page, int numpages, int enable) > * but that can deadlock->flush only current cpu: > */ > __flush_tlb_all(); > + > + arch_flush_lazy_mmu_mode(); > } > > #ifdef CONFIG_HIBERNATION >This sounds like a critical fix, i.e. a -stable candidate. Am I correct? -hpa
Boris Ostrovsky
2013-Feb-26 23:57 UTC
Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
----- hpa@zytor.com wrote:> On 02/26/2013 02:56 PM, Boris Ostrovsky wrote: > > When CONFIG_DEBUG_PAGEALLOC is set page table updates made by > > kernel_map_pages() are not made visible (via TLB flush) immediately > if lazy > > MMU is on. In environments that support lazy MMU (e.g. Xen) this may > lead to > > fatal page faults, for example, when zap_pte_range() needs to > allocate pages > > in __tlb_remove_page() -> tlb_next_batch(). > > > > Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> > > --- > > arch/x86/mm/pageattr.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c > > index ca1f1c2..7b3216e 100644 > > --- a/arch/x86/mm/pageattr.c > > +++ b/arch/x86/mm/pageattr.c > > @@ -1369,6 +1369,8 @@ void kernel_map_pages(struct page *page, int > numpages, int enable) > > * but that can deadlock->flush only current cpu: > > */ > > __flush_tlb_all(); > > + > > + arch_flush_lazy_mmu_mode(); > > } > > > > #ifdef CONFIG_HIBERNATION > > > > This sounds like a critical fix, i.e. a -stable candidate. Am I > correct?I considered copying stable but then I decided that this is a debugging feature --- kernel_map_pages() is only defined if CONFIG_DEBUG_PAGEALLOC is set and my thinking was that stable kernels usually don''t do this. -boris
H. Peter Anvin
2013-Feb-27 22:40 UTC
Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
Greg, policy opinion? -hpa On 02/26/2013 03:57 PM, Boris Ostrovsky wrote:> > ----- hpa@zytor.com wrote: > >> On 02/26/2013 02:56 PM, Boris Ostrovsky wrote: >>> When CONFIG_DEBUG_PAGEALLOC is set page table updates made by >>> kernel_map_pages() are not made visible (via TLB flush) immediately >> if lazy >>> MMU is on. In environments that support lazy MMU (e.g. Xen) this may >> lead to >>> fatal page faults, for example, when zap_pte_range() needs to >> allocate pages >>> in __tlb_remove_page() -> tlb_next_batch(). >>> >>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> >>> --- >>> arch/x86/mm/pageattr.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c >>> index ca1f1c2..7b3216e 100644 >>> --- a/arch/x86/mm/pageattr.c >>> +++ b/arch/x86/mm/pageattr.c >>> @@ -1369,6 +1369,8 @@ void kernel_map_pages(struct page *page, int >> numpages, int enable) >>> * but that can deadlock->flush only current cpu: >>> */ >>> __flush_tlb_all(); >>> + >>> + arch_flush_lazy_mmu_mode(); >>> } >>> >>> #ifdef CONFIG_HIBERNATION >>> >> >> This sounds like a critical fix, i.e. a -stable candidate. Am I >> correct? > > I considered copying stable but then I decided that this is a debugging feature > --- kernel_map_pages() is only defined if CONFIG_DEBUG_PAGEALLOC is set and my > thinking was that stable kernels usually don''t do this. > > > -boris >
Greg KH
2013-Feb-27 23:00 UTC
Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On Wed, Feb 27, 2013 at 02:40:01PM -0800, H. Peter Anvin wrote:> Greg, policy opinion? > > -hpa > > On 02/26/2013 03:57 PM, Boris Ostrovsky wrote: > > > > ----- hpa@zytor.com wrote: > > > >> On 02/26/2013 02:56 PM, Boris Ostrovsky wrote: > >>> When CONFIG_DEBUG_PAGEALLOC is set page table updates made by > >>> kernel_map_pages() are not made visible (via TLB flush) immediately > >> if lazy > >>> MMU is on. In environments that support lazy MMU (e.g. Xen) this may > >> lead to > >>> fatal page faults, for example, when zap_pte_range() needs to > >> allocate pages > >>> in __tlb_remove_page() -> tlb_next_batch(). > >>> > >>> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> > >>> --- > >>> arch/x86/mm/pageattr.c | 2 ++ > >>> 1 file changed, 2 insertions(+) > >>> > >>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c > >>> index ca1f1c2..7b3216e 100644 > >>> --- a/arch/x86/mm/pageattr.c > >>> +++ b/arch/x86/mm/pageattr.c > >>> @@ -1369,6 +1369,8 @@ void kernel_map_pages(struct page *page, int > >> numpages, int enable) > >>> * but that can deadlock->flush only current cpu: > >>> */ > >>> __flush_tlb_all(); > >>> + > >>> + arch_flush_lazy_mmu_mode(); > >>> } > >>> > >>> #ifdef CONFIG_HIBERNATION > >>> > >> > >> This sounds like a critical fix, i.e. a -stable candidate. Am I > >> correct? > > > > I considered copying stable but then I decided that this is a debugging feature > > --- kernel_map_pages() is only defined if CONFIG_DEBUG_PAGEALLOC is set and my > > thinking was that stable kernels usually don''t do this."Stable" kernels are used all over the place, like in distros, which might enable this. I have no objection to taking this patch in a stable release, as it does fix a real problem. thanks, greg k-h
H. Peter Anvin
2013-Feb-27 23:07 UTC
Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On 02/27/2013 03:00 PM, Greg KH wrote:> > "Stable" kernels are used all over the place, like in distros, which > might enable this. > > I have no objection to taking this patch in a stable release, as it does > fix a real problem. >OK. I will queue it up in the next fixes (tip:x86/urgent) batch to Linus. -hpa
Konrad Rzeszutek Wilk
2013-Feb-28 14:29 UTC
Is: x86: mm: Fix vmalloc_fault oops during lazy MMU updates Was: Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On Wed, Feb 27, 2013 at 03:07:35PM -0800, H. Peter Anvin wrote:> On 02/27/2013 03:00 PM, Greg KH wrote: > > > > "Stable" kernels are used all over the place, like in distros, which > > might enable this. > > > > I have no objection to taking this patch in a stable release, as it does > > fix a real problem. > > > > OK. I will queue it up in the next fixes (tip:x86/urgent) batch to Linus.Thank you. Could you also consider this one (I CC-ed Ingo on it but never got any response): From a6ed4a88eff4f6329bb4acae3372cccc8a8367d5 Mon Sep 17 00:00:00 2001 From: Samu Kallio <samu.kallio@aberdeencloud.com> Date: Sun, 17 Feb 2013 02:35:52 +0000 Subject: [PATCH] x86: mm: Fix vmalloc_fault oops during lazy MMU updates. In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops when lazy MMU updates are enabled, because set_pgd effects are being deferred. One instance of this problem is during process mm cleanup with memory cgroups enabled. The chain of events is as follows: - zap_pte_range enables lazy MMU updates - zap_pte_range eventually calls mem_cgroup_charge_statistics, which accesses the vmalloc''d mem_cgroup per-cpu stat area - vmalloc_fault is triggered which tries to sync the corresponding PGD entry with set_pgd, but the update is deferred - vmalloc_fault oopses due to a mismatch in the PUD entries The OOPs usually looks as so: ------------[ cut here ]------------ kernel BUG at arch/x86/mm/fault.c:396! invalid opcode: 0000 [#1] SMP .. snip .. CPU 1 Pid: 10866, comm: httpd Not tainted 3.6.10-4.fc18.x86_64 #1 RIP: e030:[<ffffffff816271bf>] [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208 .. snip .. Call Trace: [<ffffffff81627759>] do_page_fault+0x399/0x4b0 [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110 [<ffffffff81624065>] page_fault+0x25/0x30 [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50 [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350 [<ffffffff8118aac7>] mem_cgroup_uncharge_page+0x57/0x60 [<ffffffff8115fbc0>] page_remove_rmap+0xe0/0x150 [<ffffffff8115311a>] ? vm_normal_page+0x1a/0x80 [<ffffffff81153e61>] unmap_single_vma+0x531/0x870 [<ffffffff81154962>] unmap_vmas+0x52/0xa0 [<ffffffff81007442>] ? pte_mfn_to_pfn+0x72/0x100 [<ffffffff8115c8f8>] exit_mmap+0x98/0x170 [<ffffffff810050d9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e [<ffffffff81059ce3>] mmput+0x83/0xf0 [<ffffffff810624c4>] exit_mm+0x104/0x130 [<ffffffff8106264a>] do_exit+0x15a/0x8c0 [<ffffffff810630ff>] do_group_exit+0x3f/0xa0 [<ffffffff81063177>] sys_exit_group+0x17/0x20 [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the changes visible to the consistency checks. RedHat-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=914737 Reported-and-Tested-by: Krishna Raman <kraman@redhat.com> CC: stable@vger.kernel.org Signed-off-by: Samu Kallio <samu.kallio@aberdeencloud.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- arch/x86/mm/fault.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index fb674fd..4f7d793 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -378,10 +378,12 @@ static noinline __kprobes int vmalloc_fault(unsigned long address) if (pgd_none(*pgd_ref)) return -1; - if (pgd_none(*pgd)) + if (pgd_none(*pgd)) { set_pgd(pgd, *pgd_ref); - else + arch_flush_lazy_mmu_mode(); + } else { BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref)); + } /* * Below here mismatches are bugs because these lower tables -- 1.8.0.2> > -hpa > >
Borislav Petkov
2013-Feb-28 15:38 UTC
Re: Is: x86: mm: Fix vmalloc_fault oops during lazy MMU updates Was: Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On Thu, Feb 28, 2013 at 09:29:10AM -0500, Konrad Rzeszutek Wilk wrote:> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index fb674fd..4f7d793 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -378,10 +378,12 @@ static noinline __kprobes int vmalloc_fault(unsigned long address) > if (pgd_none(*pgd_ref)) > return -1; > > - if (pgd_none(*pgd)) > + if (pgd_none(*pgd)) { > set_pgd(pgd, *pgd_ref); > - else > + arch_flush_lazy_mmu_mode();Do I understand it correctly that this would cost us a "preempt_disable(); preempt_enable()" needlessly on baremetal when running with CONFIG_PARAVIRT enabled? Thanks. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. --
H. Peter Anvin
2013-Feb-28 15:53 UTC
Re: Is: x86: mm: Fix vmalloc_fault oops during lazy MMU updates Was: Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On 02/28/2013 07:38 AM, Borislav Petkov wrote:> On Thu, Feb 28, 2013 at 09:29:10AM -0500, Konrad Rzeszutek Wilk wrote: >> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c >> index fb674fd..4f7d793 100644 >> --- a/arch/x86/mm/fault.c >> +++ b/arch/x86/mm/fault.c >> @@ -378,10 +378,12 @@ static noinline __kprobes int vmalloc_fault(unsigned long address) >> if (pgd_none(*pgd_ref)) >> return -1; >> >> - if (pgd_none(*pgd)) >> + if (pgd_none(*pgd)) { >> set_pgd(pgd, *pgd_ref); >> - else >> + arch_flush_lazy_mmu_mode(); > > Do I understand it correctly that this would cost us a > "preempt_disable(); preempt_enable()" needlessly on baremetal when > running with CONFIG_PARAVIRT enabled? >OK, this is mind-boggingly crazy and these things that just drives me batty about PV. arch_flush_lazy_mmu_mode() involves a function call and preemption off *before it even tests for being a noop*. On bare metal this function doesn''t do anything, and cannot do anything. At the very least we should have an early filter for the **COMMON!** case that we are not on a PV platform. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don''t speak on their behalf.
Borislav Petkov
2013-Feb-28 16:10 UTC
Re: Is: x86: mm: Fix vmalloc_fault oops during lazy MMU updates Was: Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On Thu, Feb 28, 2013 at 07:53:44AM -0800, H. Peter Anvin wrote:> At the very least we should have an early filter for the **COMMON!** > case that we are not on a PV platform.... or, patch it out with the alternatives on baremetal, as Steven suggested. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. --
Boris Ostrovsky
2013-Feb-28 16:20 UTC
Re: Is: x86: mm: Fix vmalloc_fault oops during lazy MMU updates Was: Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On 02/28/2013 11:10 AM, Borislav Petkov wrote:> On Thu, Feb 28, 2013 at 07:53:44AM -0800, H. Peter Anvin wrote: >> At the very least we should have an early filter for the **COMMON!** >> case that we are not on a PV platform. > ... or, patch it out with the alternatives on baremetal, as Steven > suggested. >I think making a check for paravirt_enabled() is safe enough. I''ll send a patch shortly. -boris
Borislav Petkov
2013-Feb-28 16:22 UTC
Re: Is: x86: mm: Fix vmalloc_fault oops during lazy MMU updates Was: Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On Thu, Feb 28, 2013 at 11:20:20AM -0500, Boris Ostrovsky wrote:> On 02/28/2013 11:10 AM, Borislav Petkov wrote: > >On Thu, Feb 28, 2013 at 07:53:44AM -0800, H. Peter Anvin wrote: > >>At the very least we should have an early filter for the **COMMON!** > >>case that we are not on a PV platform. > >... or, patch it out with the alternatives on baremetal, as Steven > >suggested. > > > I think making a check for paravirt_enabled() is safe enough. I''ll > send a patch shortly.Why not make it absolutely for free? -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. --
H. Peter Anvin
2013-Feb-28 16:24 UTC
Re: Is: x86: mm: Fix vmalloc_fault oops during lazy MMU updates Was: Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On 02/28/2013 08:22 AM, Borislav Petkov wrote:> On Thu, Feb 28, 2013 at 11:20:20AM -0500, Boris Ostrovsky wrote: >> On 02/28/2013 11:10 AM, Borislav Petkov wrote: >>> On Thu, Feb 28, 2013 at 07:53:44AM -0800, H. Peter Anvin wrote: >>>> At the very least we should have an early filter for the **COMMON!** >>>> case that we are not on a PV platform. >>> ... or, patch it out with the alternatives on baremetal, as Steven >>> suggested. >>> >> I think making a check for paravirt_enabled() is safe enough. I''ll >> send a patch shortly. > > Why not make it absolutely for free?That makes more sense, I think... although of course, there is a cost in complexity. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don''t speak on their behalf.
Boris Ostrovsky
2013-Feb-28 16:27 UTC
Re: Is: x86: mm: Fix vmalloc_fault oops during lazy MMU updates Was: Re: [PATCH] mm/x86: Flush lazy MMU when DEBUG_PAGEALLOC is set
On 02/28/2013 11:22 AM, Borislav Petkov wrote:> On Thu, Feb 28, 2013 at 11:20:20AM -0500, Boris Ostrovsky wrote: >> On 02/28/2013 11:10 AM, Borislav Petkov wrote: >>> On Thu, Feb 28, 2013 at 07:53:44AM -0800, H. Peter Anvin wrote: >>>> At the very least we should have an early filter for the **COMMON!** >>>> case that we are not on a PV platform. >>> ... or, patch it out with the alternatives on baremetal, as Steven >>> suggested.What was the suggestion exactly? I don''t remember seeing that message. -boris>>> >> I think making a check for paravirt_enabled() is safe enough. I''ll >> send a patch shortly. > Why not make it absolutely for free? >