Tian, Kevin
2011-Apr-29 04:10 UTC
[Xen-devel] [PATCH] xen mmu: fix a race window causing leave_mm BUG()
xen mmu: fix a race window causing leave_mm BUG()

There's a race window in xen_drop_mm_ref, where a remote cpu may exit the
dirty bitmap between the check on this cpu and the point where the remote
cpu handles the drop request. So in drop_other_mm_ref we need to check
whether the TLB state is still lazy before calling into leave_mm. This
bug is rarely observed in earlier kernels, but is exacerbated by
commit 831d52bc153971b70e64eccfbed2b232394f22f8, which clears the bitmap
after changing the TLB state.

Thanks to Maxiaoyun <tinnycloud@hotmail.com> for verifying it.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 4e5a611..74c6e4a 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1260,7 +1260,7 @@ static void drop_other_mm_ref(void *info)
 
 	active_mm = percpu_read(cpu_tlbstate.active_mm);
 
-	if (active_mm == mm)
+	if (active_mm == mm && percpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
 		leave_mm(smp_processor_id());
 
 	/* If this cpu still has a stale cr3 reference, then make sure

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
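The check added by the patch can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: the `cpumask` field, the `cpu` parameter, the array standing in for per-cpu data, and the return value are hypothetical simplifications, and `assert()` stands in for the BUG() that leave_mm raises when the TLB state is TLBSTATE_OK.

```c
#include <assert.h>

/* Simplified user-space model of the patched check; NOT kernel code.
 * Only the distinction between the two TLB states matters here. */
enum { TLBSTATE_OK = 1, TLBSTATE_LAZY = 2 };

#define NR_CPUS 4

struct mm_struct {
    unsigned long cpumask;          /* hypothetical: bit n set => cpu n uses this mm */
};

struct tlb_state {
    struct mm_struct *active_mm;
    int state;
};

/* Stands in for the kernel's per-cpu cpu_tlbstate variable. */
struct tlb_state cpu_tlbstate[NR_CPUS];

/* Model of leave_mm(): only legal while the cpu is in lazy TLB mode. */
void leave_mm(int cpu)
{
    assert(cpu_tlbstate[cpu].state != TLBSTATE_OK);   /* models BUG() */
    cpu_tlbstate[cpu].active_mm->cpumask &= ~(1UL << cpu);
}

/* Model of the patched drop_other_mm_ref() as run on 'cpu'.
 * Returns 1 if the cpu dropped its lazy reference to mm, 0 otherwise. */
int drop_other_mm_ref(int cpu, struct mm_struct *mm)
{
    struct tlb_state *ts = &cpu_tlbstate[cpu];

    /* The second condition is what the patch adds: if the cpu has left
     * lazy mode since the sender checked, leave_mm() must not be called. */
    if (ts->active_mm == mm && ts->state != TLBSTATE_OK) {
        leave_mm(cpu);
        return 1;
    }
    return 0;
}
```

In this model, calling leave_mm on a cpu whose state is TLBSTATE_OK trips the assertion, which corresponds to the leave_mm BUG() named in the subject line; with the extra state test that call is skipped.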
Konrad Rzeszutek Wilk
2011-May-10 20:27 UTC
Re: [Xen-devel] [PATCH] xen mmu: fix a race window causing leave_mm BUG()
On Fri, Apr 29, 2011 at 12:10:57PM +0800, Tian, Kevin wrote:
> xen mmu: fix a race window causing leave_mm BUG()

I've this in mailbox and I am wondering whether this is still an issue with the
2.6.39 type kernels? How do you reproduce the failure? When using LVM?
Tian, Kevin
2011-May-11 01:20 UTC
RE: [Xen-devel] [PATCH] xen mmu: fix a race window causing leave_mm BUG()
> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Wednesday, May 11, 2011 4:27 AM
>
> On Fri, Apr 29, 2011 at 12:10:57PM +0800, Tian, Kevin wrote:
> > xen mmu: fix a race window causing leave_mm BUG()
>
> I've this in mailbox and I am wondering whether this still an issue with the
> 2.6.39 type kernels?
> How do you reproduce the failure? When using LVM?

This issue was reported by Xiaoyun during extensive testing; it happened
occasionally after dozens of hours of running. From the symptoms and info
provided by Xiaoyun, I found this potential race window, and Xiaoyun has
verified that this patch solves his stability issue.

The original thread is at:
http://lists.xensource.com/archives/html/xen-devel/2011-04/msg01186.html

His kernel is based on 2.6.38, and I checked the latest 2.6.39 from your
maintained repo, and the same issue still exists.

Btw, I didn't reproduce it myself, and I'm not sure whether Xiaoyun uses LVM.
But I think it has nothing to do with storage type; it is a pure mmu design
issue.

Thanks
Kevin
Ian Campbell
2011-May-11 09:44 UTC
RE: [Xen-devel] [PATCH] xen mmu: fix a race window causing leave_mm BUG()
On Wed, 2011-05-11 at 02:20 +0100, Tian, Kevin wrote:
> > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > Sent: Wednesday, May 11, 2011 4:27 AM
> >
> > On Fri, Apr 29, 2011 at 12:10:57PM +0800, Tian, Kevin wrote:
> > > xen mmu: fix a race window causing leave_mm BUG()
> >
> > I've this in mailbox and I am wondering whether this still an issue with the
> > 2.6.39 type kernels?
> > How do you reproduce the failure? When using LVM?
>
> this issue is reported by Xiaoyun when he did extensive test which happened
> occasionally after dozen of hours running. From the phenomenon and info
> provided by Xiaoyun, I found this potential race window and Xiaoyun has
> verified this patch solving his stability issue.
>
> the original thread is at:
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg01186.html
>
> his kernel is based on 2.6.38, and I checked latest 2.6.39 from your maintained
> repo, and same issue still exists.
>
> btw, I didn't reproduce it myself, and not sure whether Xiaoyun uses LVM. But
> I think it has nothing to do with storage type, and a pure mmu design issue.

Is there a specific stack trace (or two) which is associated with this bug? I'm
wondering if http://bugs.debian.org/613073 might be the same thing...

Ian.
Tian, Kevin
2011-May-11 12:34 UTC
RE: [Xen-devel] [PATCH] xen mmu: fix a race window causing leave_mm BUG()
> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Sent: Wednesday, May 11, 2011 5:44 PM
>
> On Wed, 2011-05-11 at 02:20 +0100, Tian, Kevin wrote:
> > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > > Sent: Wednesday, May 11, 2011 4:27 AM
> > >
> > > On Fri, Apr 29, 2011 at 12:10:57PM +0800, Tian, Kevin wrote:
> > > > xen mmu: fix a race window causing leave_mm BUG()
> > >
> > > I've this in mailbox and I am wondering whether this still an issue
> > > with the
> > > 2.6.39 type kernels?
> > > How do you reproduce the failure? When using LVM?
> >
> > this issue is reported by Xiaoyun when he did extensive test which
> > happened occasionally after dozen of hours running. From the
> > phenomenon and info provided by Xiaoyun, I found this potential race
> > window and Xiaoyun has verified this patch solving his stability issue.
> >
> > the original thread is at:
> > http://lists.xensource.com/archives/html/xen-devel/2011-04/msg01186.html
> >
> > his kernel is based on 2.6.38, and I checked latest 2.6.39 from your
> > maintained repo, and same issue still exists.
> >
> > btw, I didn't reproduce it myself, and not sure whether Xiaoyun uses
> > LVM. But I think it has nothing to do with storage type, and a pure mmu
> > design issue.
>
> Is there a specific stack trace (or two) which is associated with this bug? I'm
> wondering if http://bugs.debian.org/613073 might be the same thing...
>

If you look into above thread:

http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00657.html

[<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
[<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
[<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
[<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
[<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
[<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
[<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
[<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
...

Thanks
Kevin
Konrad Rzeszutek Wilk
2011-May-11 15:44 UTC
Re: [Xen-devel] [PATCH] xen mmu: fix a race window causing leave_mm BUG()
On Wed, May 11, 2011 at 08:34:46PM +0800, Tian, Kevin wrote:
> > From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> > Sent: Wednesday, May 11, 2011 5:44 PM
> >
> > On Wed, 2011-05-11 at 02:20 +0100, Tian, Kevin wrote:
> > > > From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> > > > Sent: Wednesday, May 11, 2011 4:27 AM
> > > >
> > > > On Fri, Apr 29, 2011 at 12:10:57PM +0800, Tian, Kevin wrote:
> > > > > xen mmu: fix a race window causing leave_mm BUG()
> > > >
> > > > I've this in mailbox and I am wondering whether this still an issue
> > > > with the
> > > > 2.6.39 type kernels?
> > > > How do you reproduce the failure? When using LVM?
> > >
> > > this issue is reported by Xiaoyun when he did extensive test which
> > > happened occasionally after dozen of hours running. From the
> > > phenomenon and info provided by Xiaoyun, I found this potential race
> > > window and Xiaoyun has verified this patch solving his stability issue.
> > >
> > > the original thread is at:
> > > http://lists.xensource.com/archives/html/xen-devel/2011-04/msg01186.html
> > >
> > > his kernel is based on 2.6.38, and I checked latest 2.6.39 from your
> > > maintained repo, and same issue still exists.
> > >
> > > btw, I didn't reproduce it myself, and not sure whether Xiaoyun uses
> > > LVM. But I think it has nothing to do with storage type, and a pure mmu
> > > design issue.
> >
> > Is there a specific stack trace (or two) which is associated with this bug? I'm
> > wondering if http://bugs.debian.org/613073 might be the same thing...
> >
>
> If you look into above thread:
>
> http://lists.xensource.com/archives/html/xen-devel/2011-04/msg00657.html
>
> [<ffffffff8100e4a4>] drop_other_mm_ref+0x2a/0x53
> [<ffffffff81087224>] generic_smp_call_function_single_interrupt+0xd8/0xfc
> [<ffffffff810100e8>] xen_call_function_single_interrupt+0x13/0x28
> [<ffffffff810a936a>] handle_IRQ_event+0x66/0x120
> [<ffffffff810aac5b>] handle_percpu_irq+0x41/0x6e
> [<ffffffff8128c1a8>] __xen_evtchn_do_upcall+0x1ab/0x27d
> [<ffffffff8128dcf9>] xen_evtchn_do_upcall+0x33/0x46
> [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30

Can you resend the patch to me, based on top of v2.6.39-rc7, with the above
stack dump? And please resend it as an attachment. Your mailer mangles the
patch.